AI did not make software delivery faster. It just moved the bottleneck.

For well-defined tasks, where success can be verified deterministically through tests, CI, and other gates, the act of producing correct code is largely solved as a bottleneck. Linters catch bugs, compilers verify syntax, CI gates bad code, security scanners block vulnerabilities. An AI with access to these tools can iterate efficiently once the task is precisely specified.

But getting to that well-defined task? That's where projects die.

In practice, "well-defined" is doing an enormous amount of work here.

Most real business initiatives begin far away from executable specificity, the point at which a task is described precisely enough that an engineer or AI can implement it without discovering new constraints mid-flight.

In long-lived, production systems, executable specificity is rare. Critical assumptions live outside tickets and PRDs: undocumented invariants, historical compromises, partial data guarantees, and behavior that only exists because "breaking it would be risky." These constraints surface only when code is written, integrated, and exercised.

This is why AI-assisted coding shines on toy problems and greenfield services, not because production systems are harder to code, but because they are harder to specify.

This does not mean understanding systems, designing architectures, or making judgment calls has become easier, only that once intent is fixed and constraints are explicit, execution against deterministic feedback loops is no longer the dominant cost.

In large systems, the difficulty compounds because no single context, human or machine, contains the whole system.

There is no prompt or document that captures the full dependency graph, operational constraints, production edge cases, and historical rationale behind existing behavior. Engineers carry this context implicitly across years of experience and incidents. AI systems do not unless that context is made explicit.

I've been leading multi-team initiatives for the better part of my career. The last year and a half has been interesting as AI coding tools became standard practice.

At the individual level, the acceleration is real. Developers finish implementation tasks faster. Code reviews move quicker when the first-pass quality is higher. Small bug fixes that used to drag out now close quickly.

But when you zoom out to the initiative level (the time from the business saying "we need fraud detection" to actually having it running in production), I'm not seeing the same acceleration. The big projects still take roughly the same amount of time they always did.

The speedup at the leaf nodes isn't translating to speedup in the overall timeline. That gap is what I want to examine.

Table of Contents

  1. What Changed and What Didn't
  2. A Real Multi-Team Initiative
  3. Where the Time Actually Goes
  4. The Deterministic Gap
  5. What Machine-Checkable Contracts Look Like
  6. Why This Gets Worse at Scale
  7. The Missing Infrastructure Layer

What Changed and What Didn't

Between 2022 and 2025:

Individual coding velocity: 5-10x improvement with AI assistance. GitHub's research on Copilot showed developers completing tasks 55% faster. Cursor and similar tools pushed this further.

Code quality gates: Faster iteration against deterministic feedback. An AI can run tests, fix linting errors, adjust type mismatches, and iterate until CI passes in minutes rather than hours.

What didn't change:

Time from business intent to production. Integration issues discovered late. Rework cycles.

AI reduced the cost of writing code, which means teams write more code faster, including code that faithfully implements the wrong assumptions. AI is an amplifier. It makes execution cheaper but doesn't solve coordination.

The pattern: We optimized the leaf nodes (individual coding tasks) but left the tree structure (how work decomposes from business intent to executable units) completely manual.

This gap is easy to miss in small or greenfield systems.

When building new services, constraints are few, integration surfaces are shallow, and most relevant context fits in a single engineer's head or a single prompt. In that environment, AI acceleration feels transformative.

As systems grow, the dominant work shifts from writing code to discovering constraints. AI compresses execution time, which means teams hit those constraints sooner, not less often. The faster you can write code that implements wrong assumptions, the more expensive your rework becomes.


A Real Multi-Team Initiative

Let me walk through a scenario synthesized from patterns I've observed across multiple organizations.

The Initiative: Real-Time Fraud Detection at Checkout

Business requirement: "Implement real-time fraud scoring at checkout to block suspicious transactions before payment processing. We're losing $2M annually to fraud."

Teams involved:

  • Checkout (web/mobile UI and orchestration)
  • Payment (gateway integrations)
  • Risk (fraud detection models and scoring)
  • Data platform (event streaming, pipelines)
  • Customer support (dispute resolution)

Week 1-2: Decomposition and Planning

Architecture review happens. Everyone agrees on the approach:

Checkout Flow (After):
User → Cart → Checkout → Fraud Check → Payment → Confirmation
                              ↓
                         Block if risky

The plan:

  1. Checkout calls Risk API before payment authorization
  2. Risk returns fraud score (0-100)
  3. High risk (>80): Block immediately
  4. Medium risk (50-80): Require verification
  5. Low risk (<50): Proceed to payment
  6. Log all decisions for audit
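
For concreteness, here is a minimal sketch of steps 2-5, assuming only the thresholds above; the function and action names are illustrative, not anyone's actual API:

def route_checkout(fraud_score: int) -> str:
    """Map a 0-100 fraud score to the checkout action agreed in planning."""
    if fraud_score > 80:
        return "block"                  # high risk: block immediately
    if fraud_score >= 50:
        return "require_verification"   # medium risk: step-up verification
    return "proceed_to_payment"         # low risk: continue to payment
    # Step 6 (audit logging) would wrap whichever branch fires.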

PRD gets written. Tickets created. Dependencies mapped. Clear ownership.

This seems fine. No obvious process failures. Good architecture session.

Week 3-6: The Coding Phase (Remarkably Fast)

AI tools accelerate execution:

  • Checkout: Integrates Risk API, implements routing (3 days vs 2 weeks pre-AI)
  • Risk: Builds ML inference service (5 days vs 3 weeks pre-AI)
  • Data: Sets up event streaming (2 days vs 1 week pre-AI)
  • Payment: Adds fraud metadata (2 days vs 1 week pre-AI)
  • Support: Builds dashboard (4 days vs 2 weeks pre-AI)

Everything passes unit tests. Code reviews are clean. CI/CD green. Each team ships to staging on schedule.

Week 7-8: Integration Testing (The Collapse)

Discovery 1: The Performance Cascade

Load testing reveals:

Checkout Request Latency:
┌────────────────────────────────────┐
│ Before (no fraud check): 450ms P95 │
│ After (with fraud check): 3.8s P95 │
└────────────────────────────────────┘

Risk service does its job correctly. To calculate fraud score, it queries user purchase history, fetches device fingerprint, analyzes cart composition, cross-references fraud patterns, runs ML inference. Each dependency adds latency. P95 hits 3.2 seconds.

Checkout's assumption: Risk API returns in <200ms. They allocated 5 seconds total (UI, fraud, payment, confirmation). Needed headroom for payment gateway.

Risk's assumption: Real-time fraud detection is expensive. 3 seconds for high-accuracy scoring seemed reasonable.

Both assumptions are defensible in isolation. Together they break the user experience.

The gap: “Call the Risk API” doesn’t encode “within 200ms.” The architecture diagram showed the integration. It didn’t show the performance contract.

Discovery 2: Data Consistency and Fraud Detection

Risk needs recent purchase history to detect velocity fraud (e.g., 10 purchases in 1 hour).

Implementation: Risk queries Orders database replica.

Problem: Standard read replica with 30-second replication lag.

Attack scenario:

Time 0:00 - User makes purchase #1 (fraud)
Time 0:10 - User makes purchase #2 (fraud)  
Time 0:20 - User makes purchase #3 (fraud)
Time 0:25 - User attempts purchase #4

Risk queries replica:
  - Sees only purchase #1 (2-3 not replicated)
  - Velocity: 1 purchase/hour (normal)
  - Returns low score
  - Purchase #4 proceeds → fraud succeeds
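
A tiny simulation makes the failure mode concrete. It assumes the replica happens to be about 20 seconds behind at this moment (lag varies, up to the 30-second figure above) and uses timestamps in seconds; everything here is illustrative:

REPLICA_LAG_S = 20  # effective lag at this moment (illustrative)

purchases = [(0, "user1"), (10, "user1"), (20, "user1")]  # the three fraudulent purchases

def replica_view(now_s):
    """The replica only shows rows that are at least REPLICA_LAG_S old."""
    return [(ts, user) for ts, user in purchases if now_s - ts >= REPLICA_LAG_S]

def purchases_last_hour(now_s, user):
    return sum(1 for ts, u in replica_view(now_s) if u == user and now_s - ts <= 3600)

# Purchase #4 is attempted at t=25s: the replica shows only purchase #1,
# so velocity looks like 1 purchase/hour and the score comes back low.
print(purchases_last_hour(25, "user1"))  # 1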

Risk's assumption: Database replicas are standard for read queries. They knew about eventual consistency.

Data's assumption: 30-second replication lag is acceptable for most read patterns.

Both choices are reasonable. The combination breaks velocity detection.

The gap: “Check purchase velocity” doesn’t specify “requires data freshness <10s.” The implementation constraint was “30s replica lag.” These conflict, but only surface when testing actual fraud scenarios.

Discovery 3: Retry Semantics and Idempotency

Chaos testing (simulating service degradation):

Checkout calls Risk. Risk experiences a P99 spike (8 seconds, database slow query). Checkout's 5-second timeout fires. Following standard retry patterns, Checkout retries.

Checkout's retry logic:

async function checkFraud(cart, userId) {
  const txId = generateId();
  try {
    return await riskApi.score(txId, cart, userId);
  } catch (err) {
    // On timeout, retry once with a fresh ID so each attempt
    // shows up separately in the logs.
    const newTxId = generateId();
    return await riskApi.score(newTxId, cart, userId);
  }
}

Risk's idempotency:

def score_transaction(tx_id, cart, user_id):
    # Idempotency is keyed on tx_id: a repeated tx_id returns the cached score.
    if cache.exists(tx_id):
        return cache.get(tx_id)

    score = calculate_fraud_score(cart, user_id)
    cache.set(tx_id, score)
    fraud_api.record_check(tx_id)  # Bill external API
    return score

What happens: First request (ID: abc123) takes 8s, times out client-side but completes server-side. Retry with new ID (xyz789) treated as new request. Same transaction scored twice, billed twice, duplicate audit events.

Checkout's assumption: Fresh transaction ID on retry for clean logging/debugging.

Risk's assumption: Clients retry with the same ID for idempotency.

Both patterns are defensible. Together they create duplicate processing.

The gap: “The service should be idempotent” doesn’t specify whether idempotency keys are stable across retries.

Discovery 4: Observability vs Security

First customer dispute. User claims: "My purchase was blocked unfairly."

Support pulls up dashboard:

Transaction: tx_891xj2
Status: BLOCKED
Fraud Score: 87

Support needs to explain why. Which signals contributed? Device fingerprint? Velocity? Cart composition? Data quality issues?

Risk's implementation: Logs detailed breakdown internally. External API returns aggregate score only.

Security decision: Minimize PII in support tools. Detailed fraud signals contain device fingerprints, behavioral patterns. Risk team access only.

Support can't resolve the dispute without escalation.

The trade-off (discovered late):

  • Support workflow needs: Signal breakdown for disputes
  • Security requires: Minimize PII in support tier
  • Solution: New API with PII-sanitized summary, privacy review, new access controls

The gap: Both requirements are legitimate. The conflict between "support needs visibility" and "minimize PII" is a real trade-off with no obvious answer. It surfaces only when the actual dispute workflow is tested.

Week 9-12: The Rework Phase

Not because code quality is poor. Code works. Tests pass.

Rework because implicit contracts were wrong:

Performance: Risk redesigns for <200ms P95. Requires a caching layer, approximate algorithms, architecture changes. Coding with AI: 4 days. Coordination (design review, capacity planning, cache strategy, monitoring): 2 weeks.

Data consistency: Can't use the replica for velocity checks. Switch to an event stream. Coding: 3 days. Coordination (schema design, capacity, backfill, monitoring): 2 weeks.

Idempotency: Checkout preserves the transaction ID across retries. Risk handles duplicate in-flight requests. Coding: 2 days. Coordination (test scenarios, validate billing, update runbooks): 1 week.

Observability: Build a PII-safe signal summary. Coding: 3 days. Coordination (privacy review, access control, training): 2 weeks.

This is not a process failure. This is not "we should have talked more." This is structural. The space of possible conflicts is too large to enumerate in planning.


Where the Time Actually Goes

Tracking multiple initiatives like this:

Time Distribution (Multi-Team Initiative):

Coding & Implementation        ████░░░░░░░░░░░░░░░░ 20%
Integration Testing            ███░░░░░░░░░░░░░░░░░ 15%
Rework (Coding)                ███░░░░░░░░░░░░░░░░░ 15%
────────────────────────────────────────────────────
Discovery & Decomposition      █████████░░░░░░░░░░░ 45%
Rework (Coordination)          █████░░░░░░░░░░░░░░░ 25%
────────────────────────────────────────────────────
Planning & Operational         ████░░░░░░░░░░░░░░░░ 20%

Coding parts (35% total): AI helps tremendously.

Coordination parts (70% total): AI barely helps.

This is not about "better communication." Humans are bad at exhaustively enumerating cross-service constraints. The problem isn't missing information; it's the absence of machine-checkable representations.

The Pattern Across Team Sizes

Single team, single service (5 people): Coding dominates. AI provides significant acceleration.

Multi-team, related services (15-30 people, 3-5 teams): Coordination starts dominating. AI accelerates you into integration problems faster.

Enterprise, many teams (100+ people, 10+ teams): Coordination overwhelms everything. Each team interface is a potential contract mismatch.


The Deterministic Gap

What Works: Single-Service Verification

Within a service boundary, we have excellent deterministic tools:

Code → Deterministic Gates → Verified Artifact
  ↓
Compiler      (type safety)
Linter        (code patterns)
Unit Tests    (behavior contracts)
Integration   (API contracts)
Performance   (latency/throughput)
Security      (vulnerability scan)
CI/CD         (orchestrates all gates)

An AI coding agent iterates against these gates efficiently:

  1. Write code
  2. Run against gates
  3. Get deterministic feedback
  4. Fix and retry
  5. Repeat until green

This loop is fast, well-defined, automatable.
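
For concreteness, a minimal sketch of that loop, assuming the gates are ordinary shell commands and the agent exposes some revise() hook; the specific commands and the agent interface are illustrative, not any particular tool's API:

import subprocess

GATES = [
    ["ruff", "check", "."],   # linter
    ["mypy", "."],            # type checker
    ["pytest", "-q"],         # unit tests
]

def run_gates() -> list[str]:
    """Run each deterministic gate and collect the output of any that fail."""
    failures = []
    for cmd in GATES:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            failures.append(f"$ {' '.join(cmd)}\n{result.stdout}{result.stderr}")
    return failures

def iterate_until_green(agent, max_attempts: int = 5) -> bool:
    """Feed gate failures back to the agent and retry until everything passes."""
    for _ in range(max_attempts):
        failures = run_gates()
        if not failures:
            return True                      # all gates green
        agent.revise("\n\n".join(failures))  # hypothetical: agent edits code from feedback
    return False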

What's Missing: Cross-Service Verification

Between services, we have almost no deterministic gates:

Business Intent → ??? → Coordinated Tasks → ??? → Integration

The gaps:

No machine-checkable latency contracts: Checkout assumes <200ms, Risk implements 3s. Discovery: Integration testing.

No machine-checkable data freshness contracts: Risk needs <10s, gets 30s replica lag. Discovery: Fraud scenario testing.

No machine-checkable retry semantics: Checkout generates new ID, Risk expects stable ID. Discovery: Chaos testing.

No machine-checkable observability contracts: Support needs breakdown, Security limits PII. Discovery: First customer dispute.

These aren't edge cases. These are core contracts. But they're implicit, discovered through failure, not specified upfront in machine-verifiable ways.


What Machine-Checkable Contracts Look Like

I'm not prescribing a specific implementation here, only exploring what characteristics would help.

1. Performance Contracts

Current (implicit):

“Checkout will call Risk API for fraud scoring”

What we need (explicit and verifiable):

# checkout-service/contracts/dependencies.yaml
service: checkout
depends_on:
  - service: risk
    operation: scoreTransaction
    latency_budget:
      target: p50 < 100ms
      acceptable: p95 < 200ms
      maximum: p99 < 500ms
    timeout: 500ms
    fallback: allow_with_monitoring
    retry: none
# risk-service/contracts/sla.yaml
service: risk
provides:
  - operation: scoreTransaction
    latency:
      current: p50=2800ms, p95=3200ms
      target: p50=80ms, p95=150ms (requires caching)

Machine-checkable:

ERROR: Latency contract violation
  checkout expects: p95 < 200ms
  risk provides: p95 = 3200ms (current)
  
Status: BLOCKS_INTEGRATION
Action: Risk must implement caching before integration

Surfaces during decomposition, not integration testing.
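
As a sketch of what that check could look like, assuming the two contract files above and the "p95 < Nms" / "p95=Nms" notation they use (the file layout and helper names are illustrative):

import re
import yaml  # pyyaml

def p95_ms(expr: str) -> float:
    """Pull the p95 figure out of strings like 'p95 < 200ms' or 'p50=2800ms, p95=3200ms'."""
    match = re.search(r"p95\s*[<=]+\s*(\d+)\s*ms", expr)
    if match is None:
        raise ValueError(f"no p95 figure in {expr!r}")
    return float(match.group(1))

def check_latency_contracts(consumer_file: str, provider_file: str) -> list[str]:
    consumer = yaml.safe_load(open(consumer_file))
    provider = yaml.safe_load(open(provider_file))
    provided = {op["operation"]: op for op in provider["provides"]}
    errors = []
    for dep in consumer["depends_on"]:
        op = provided.get(dep["operation"])
        if op is None:
            continue
        budget = p95_ms(dep["latency_budget"]["acceptable"])
        actual = p95_ms(op["latency"]["current"])
        if actual > budget:
            errors.append(
                f"{consumer['service']} expects p95 < {budget:.0f}ms but "
                f"{provider['service']} currently provides p95 = {actual:.0f}ms: BLOCKS_INTEGRATION"
            )
    return errors

# e.g. check_latency_contracts("checkout-service/contracts/dependencies.yaml",
#                              "risk-service/contracts/sla.yaml")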

2. Data Freshness Contracts

Current (implicit):

“Risk will check purchase history for velocity”

What we need:

# risk-service/contracts/data.yaml
service: risk
data_dependencies:
  - source: orders-db
    dataset: user_purchases
    freshness_required: <10s
    reason: velocity_fraud_detection
    impact_if_stale: false_negatives
# orders-service/contracts/data.yaml
service: orders
provides:
  - dataset: user_purchases
    via: database_replica
    freshness: 15-30s typical, 60s max

Machine-checkable:

ERROR: Data freshness violation
  risk requires: <10s for velocity detection
  orders provides: 15-30s (replica lag)
  
Impact: Velocity fraud detection unreliable
Resolutions:
  1. orders: Provide event stream (infrastructure)
  2. risk: Relax requirement (accuracy loss)
  3. risk: Alternative algorithm (redesign)

3. Retry/Idempotency Contracts

Current (implicit):

“Services should handle retries gracefully”

What we need:

# checkout-service/contracts/retry.yaml
service: checkout
retry_policies:
  - target: risk.scoreTransaction
    on_timeout:
      generates_new_request_id: true
      reason: separate_attempts_in_logs
# risk-service/contracts/idempotency.yaml
service: risk
operations:
  - name: scoreTransaction
    idempotency_key: transaction_id
    expects: stable_across_retries
    window: 5m

Machine-checkable:

ERROR: Idempotency contract violation
  checkout: generates new transaction_id on retry
  risk: expects stable transaction_id
  
Impact: Duplicate processing, billing, audit logs
Resolution: Preserve transaction_id OR use separate key
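
The same idea works for retry semantics. A sketch, assuming the two YAML files above (field names as shown there, everything else illustrative):

import yaml  # pyyaml

def check_idempotency(retry_file: str, idempotency_file: str) -> list[str]:
    client = yaml.safe_load(open(retry_file))        # e.g. checkout's retry.yaml
    server = yaml.safe_load(open(idempotency_file))  # e.g. risk's idempotency.yaml
    expectations = {op["name"]: op for op in server["operations"]}
    errors = []
    for policy in client["retry_policies"]:
        # "risk.scoreTransaction" -> "scoreTransaction"
        op = expectations.get(policy["target"].split(".", 1)[-1])
        if op is None:
            continue
        if (policy["on_timeout"]["generates_new_request_id"]
                and op["expects"] == "stable_across_retries"):
            errors.append(
                f"{client['service']} generates a new {op['idempotency_key']} on retry, "
                f"but {server['service']} expects it to be stable: duplicate processing likely"
            )
    return errors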

4. Observability/Access Contracts

Current (implicit):

“Support needs fraud decision visibility”

What we need:

# support-service/contracts/data-needs.yaml
service: support
requires:
  - from: risk
    data: fraud_decision_breakdown
    access_level: support_tier_2
    use_case: dispute_resolution
# risk-service/contracts/data-exposure.yaml
service: risk
provides:
  - endpoint: /fraud-score
    returns: aggregate_only
    pii: minimal
  - endpoint: /fraud-signals-detailed
    returns: full_breakdown
    access: risk_team_only
    pii: high

Machine-checkable:

ERROR: Data access policy conflict
  support needs: breakdown at support_tier_2
  risk provides: aggregate (insufficient) OR
                detailed (requires risk_team_only)

Trade-off: Support workflow vs PII minimization
Decision required: Product + Security + Support

Surfaces trade-off during planning, not after first dispute.


Why This Gets Worse at Scale

The Combinatorial Problem

Fraud detection: 5 teams, reasonable choices each.

Checkout: Timeout (500ms vs 1s vs 5s) × Retry (new ID vs stable vs none) × Fallback (allow vs block vs manual) = 27 combinations

Risk: Algorithm (accurate+slow vs fast+approximate) × Data (replica vs stream vs cache) × Idempotency (tx_id vs separate vs stateless) = 18 combinations

Data: Replication (async vs sync vs stream) × Caching (none vs short vs long) = 9 combinations

Each decision is reasonable in isolation. Most combinations work. Finding the ones that don't is a matter of manual discovery through testing.

This is why architecture sessions can't prevent all conflicts. The space is too large.
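
A back-of-the-envelope way to see the size of that space, using only the options listed above (labels abbreviated, nothing added):

from itertools import product

checkout = list(product(["500ms", "1s", "5s"],                # timeout
                        ["new_id", "stable_id", "no_retry"],  # retry
                        ["allow", "block", "manual"]))        # fallback
risk = list(product(["accurate_slow", "fast_approx"],         # algorithm
                    ["replica", "stream", "cache"],           # data source
                    ["tx_id", "separate_key", "stateless"]))  # idempotency key
data = list(product(["async", "sync", "stream"],              # replication
                    ["none", "short", "long"]))               # caching

system_choices = list(product(checkout, risk, data))
print(len(checkout), len(risk), len(data), len(system_choices))  # 27 18 9 4374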

Research on Coordination Costs

Brooks' Law (1975): Adding people to late projects makes them later. Communication paths grow O(n²): with n people there are n(n-1)/2 pairwise channels.

Herbsleb & Grinter (1999), studying distributed teams: 2.5x more coordination time than co-located teams, with most of the overhead in "architectural mismatches discovered during integration."

Cataldo et al. (2008) on large projects: "Mismatch between required and actual coordination was the strongest predictor of integration failures."

The pattern:

  • 10 people: Coordination informal
  • 50 people: Coordination structured, starts slowing
  • 200 people: Coordination overhead dominates
  • 1000+ people: Coordination costs explode without systematic solutions

The AI Acceleration Paradox

Before AI:

  • Week 1-3: Planning
  • Week 4-9: Coding (slow, natural coordination time)
  • Week 10: Integration
  • Week 11-12: Fix integration issues

Questions arise organically during coding. "What latency should I target?"

With AI:

  • Week 1-3: Planning
  • Week 4-5: Coding (fast, less coordination)
  • Week 6: Integration (earlier)
  • Week 7-12: Fix integration (code is “done,” changes feel expensive)

AI accelerates you into the wall, and you hit it after everyone has shipped. The fast coding phase leaves less natural coordination time, so issues surface when the code already feels finished.


The Missing Infrastructure Layer

We have deterministic tools at the code level. Type systems prevent type errors. Linters enforce patterns. Tests verify behavior. CI orchestrates gates.

These work because they operate on explicit, machine-readable contracts.

We don't have deterministic tools at the system level. Performance assumptions are implicit. Data freshness requirements are implicit. Retry semantics are implicit. Observability needs are implicit.

These remain implicit because we lack machine-readable representations.

What's Needed (Pieces of the Puzzle)

1. Explicit contract specifications

Write:

  • “I need <200ms P95”
  • “I need <10s stale data”
  • “I retry with new ID”
  • “I need these fields for debugging”

Make these verifiable, not just documentation.

2. Cross-service contract validation

Check:

  • Does A’s latency budget match B’s SLA?
  • Does A’s freshness requirement match B’s guarantee?
  • Are A’s retry semantics compatible with B’s idempotency?
  • Does A’s observability need conflict with B’s security policy?

At decomposition time, not integration time.

3. Contract evolution and versioning

When contracts change:

  • Which services are affected?
  • Is this backward compatible?
  • What’s the migration path?

Partially solved for API schemas (OpenAPI, protobuf). Not solved for performance, freshness, retry semantics, observability.

4. Trade-off surfacing

Some conflicts have no clear answer (security vs debuggability, accuracy vs latency). The system should:

  • Detect the trade-off exists
  • Surface to decision-makers
  • Document the decision
  • Make trade-off explicit in code

Why Existing Tools Don't Solve This

OpenAPI/Protobuf: Data schemas, not performance/freshness/retry contracts.

Service mesh: Runtime traffic management, not design-time validation.

Distributed tracing: Debug what happened, not prevent incompatible assumptions.

SLO monitoring: Detect violations in production, not during planning.

All valuable. Wrong layer. They catch issues in production or testing. We need to catch them during decomposition.

A Feasibility Note: Service Contract Discovery

What if you could extract implicit contracts from code automatically?

Pattern detection that's feasible today:

# Detectable:
requests.get(url, timeout=0.2)  → expects <200ms
db.replica.query(...)  → uses replica (potential staleness)
@retry(max=3, backoff=exp)  → retry policy
transaction_id = uuid4()  → generates new ID (in retry loop?)

Scan code for timeout configs, database connections, retry decorators, ID generation patterns. Extract to machine-readable spec. Validate cross-service compatibility.
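
A toy version of the extraction step, using only the standard library and limited to literal timeout= arguments on requests.get/requests.post calls; everything beyond that is out of scope here, and the sample URL is made up:

import ast

def extract_timeout_expectations(source: str) -> list[tuple[int, float]]:
    """Return (line_number, timeout_seconds) for calls like requests.get(url, timeout=0.2)."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and isinstance(node.func.value, ast.Name)
                and node.func.value.id == "requests"
                and node.func.attr in {"get", "post"}):
            for kw in node.keywords:
                if kw.arg == "timeout" and isinstance(kw.value, ast.Constant):
                    found.append((node.lineno, float(kw.value.value)))
    return found

sample = 'import requests\nresp = requests.get("https://risk.internal/score", timeout=0.2)\n'
print(extract_timeout_expectations(sample))  # [(2, 0.2)] -> this call site expects <200ms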

Hard parts:

  • Inferring intent (why 200ms? hard requirement or guess?)
  • Implicit assumptions not in code (velocity detection needs fresh data, but code doesn’t say “<10s or breaks”)
  • Dynamic behavior (timeout from config file)

But static analysis could extract explicit configurations. Runtime analysis could validate actual behavior. Cross-service checking could detect conflicts.

This is not science fiction. It's feasible with current static analysis techniques, pattern matching, and cross-repository coordination.

The infrastructure gap is real. But it's solvable.


Closing: The Next Bottleneck

We've 10x'd coding speed. The next bottleneck isn't writing code. It's defining what to write in a way that coordinates across teams.

The individual pieces exist. Type systems catch errors before runtime. Contract testing verifies API compatibility. Performance testing measures latency. Security scanning enforces policies.

We're missing the layer that connects these pieces across service boundaries, at decomposition time.

Until we build machine-checkable representations of cross-service contracts (performance, data freshness, retry semantics, observability needs), we're stuck with human discovery and integration-time failures.

That's not a process problem you solve with better meetings. That's an infrastructure problem.

The gap is specific:

  • Business intent β†’ coordinated technical tasks (currently manual and lossy)
  • Implicit assumptions β†’ explicit, verifiable contracts (currently discovered through failure)
  • Integration-time discovery β†’ decomposition-time validation (currently backwards)

Until coordination is machine-verifiable, AI will continue to optimize the cheapest part of the system and ignore the most expensive one.

Execution under deterministic constraints is largely solved. Coordination, and the work required to make intent executable, is not. Until that work becomes systematic rather than tribal, AI will accelerate us into walls faster, not help us avoid them.
