Predictable Software Delivery: The Batch-Size Fix

Half of outsourcing relationships fail within five years, and the most common reason clients cite is a communication breakdown — they didn't see the wrong direction until it was finished [R15]. The fix isn't a better vendor. It's smaller batches and a gate the client controls — the same lever DORA, Kent Beck, and Allen Holub have been pointing at for a decade [R6, R10, R11].

Key takeaways

Predictable software delivery is a batch-size problem, not a vendor-quality problem. Between 20% and 25% of outsourcing relationships fail within two years, and 50% fail within five [R15].
Task-gated development breaks work into 1–3 day micro-deliverables, each with written acceptance criteria the client approves before the next task starts [R10, R14].
PR size is the single largest driver of velocity across the PR lifecycle, and code review consumes the majority of cycle time [R8].
AI raises throughput but reduces delivery stability unless feedback loops shrink to match [R4]. Smaller batches close the gap.
The on-demand development process pairs naturally with task gating because the cadence is continuous: the engineer keeps context from task to task instead of rebuilding it every two-week sprint.

The black-box failure mode

The pattern repeats across every engineering org that has outsourced work, whether to a freelancer, an agency, or an internal team behaving like one. A sprint kicks off. Two to six weeks later, something gets demoed. It's the wrong thing, or it's the right thing built on the wrong assumption, or it's 60% of the thing with a tail of "minor cleanup" that turns into another sprint.

This is not a vendor problem. Per Dun & Bradstreet's Barometer of Global Outsourcing, 20–25% of outsourcing relationships fail within two years and 50% fail within five [R15]. Per PMI Pulse of the Profession reporting, 73% of projects exceed their original budget, with scope creep the primary culprit in 52% of cases [R2]. A 10% scope increase typically drives a 15–25% budget overrun [R3]. The numbers describe a structural failure mode, not a few bad shops.

The mechanism is batch size. When the unit of work between client decisions is two weeks of effort, the client is reviewing a finished artifact built on a hypothesis that's two weeks stale. Corrections compound. By the time the wrong assumption surfaces, the team has built three more things on top of it.

Why batch size is the lever for predictable software delivery in 2026

DORA's 2025 report, published September 23, 2025, made a finding that should have changed more conversations than it did: AI adoption has a positive relationship with throughput but a negative relationship with delivery stability [R4]. The report's framing is direct: "AI doesn't fix a team; it amplifies what's already there. Strong teams use AI to become even better and more efficient. Struggling teams will find that AI only highlights and intensifies their existing problems" [R4].

If your engineers are shipping more code per day because of AI assistance, and the feedback loop between "ship" and "client approves direction" is still two weeks long, you are now wrong faster. The 2025 report also moved away from the old Elite/High/Medium/Low buckets toward seven team archetypes including "Foundational Challenges" and "Harmonious High Achievers," reflecting that team behaviors — not tooling — explain outcomes [R19].

DORA's own prescription is unchanged: "A common approach to improving the five key metrics discussed in this guide is reducing the batch size of changes for an application. Smaller changes are easier to rationalize and to move through the delivery process" [R6]. LinearB's benchmark data, spanning 6.1 million PRs and 3,000-plus organizations, points to where smaller batches pay off [R7]. The dominant cost in a typical cycle is code review, not coding. PR size is "the most significant driver of velocity across the PR lifecycle" [R8].

The bottleneck is review. The lever is batch size. The buyer controls both.

What is task-gated development?

Task-gated development is a delivery process in which work is broken into micro-deliverables of one to three days each, every task has explicit written acceptance criteria, and the engineer cannot start the next task until the client approves the previous one. The gate is the client's. The cadence is the engineer's.

It is not sprint-based: there is no two-week container. It is not fixed-bid: scope is per-task, not per-project. It is not pure Kanban: Kanban controls WIP but does not require client approval at each handoff. The defining feature is the per-task approval gate held by the buyer.
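The gate can be pictured as a small state machine. The sketch below is illustrative only: the names GatedQueue, Task, and Status are hypothetical and not taken from any tool. It assumes a single engineer working one task at a time, which the definition above implies but does not strictly require.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    QUEUED = "queued"
    IN_PROGRESS = "in_progress"
    AWAITING_APPROVAL = "awaiting_approval"
    APPROVED = "approved"


@dataclass
class Task:
    title: str
    status: Status = Status.QUEUED


@dataclass
class GatedQueue:
    """Hypothetical queue: at most one task is open, and only the client closes it."""

    tasks: list[Task] = field(default_factory=list)

    def start_next(self) -> Task:
        # The gate: refuse to start while any earlier task is still unapproved.
        for task in self.tasks:
            if task.status in (Status.IN_PROGRESS, Status.AWAITING_APPROVAL):
                raise RuntimeError(f"gate closed: '{task.title}' is not approved yet")
        # Otherwise pick up the first task still waiting in the queue.
        for task in self.tasks:
            if task.status is Status.QUEUED:
                task.status = Status.IN_PROGRESS
                return task
        raise LookupError("no queued tasks")

    def submit(self, task: Task) -> None:
        # The engineer hands the finished task to the client for review.
        task.status = Status.AWAITING_APPROVAL

    def approve(self, task: Task) -> None:
        # Only the client calls this; it is what opens the gate.
        task.status = Status.APPROVED
```

Calling start_next() while a submitted task is still awaiting approval raises an error, which is the whole point: the engineer sets the pace inside a task, and the client decides when the next one may begin.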

How to break a roadmap into 1–3 day tasks

A task is a vertical slice: one user-visible behavior change, end to end, small enough to describe what "done" looks like in a single sentence. "Add password reset" is not a task. "User receives password reset email when they submit the form on /forgot" is.

Allen Holub argues stories should fit into roughly a couple of days, and recommends dropping point-based estimation entirely [R10]. Kent Beck's framing is gentler but points the same direction: "You don't always have to take tiny steps, but they are always an option" [R11]. The limit case is single-piece flow — WIP capped at one — which "has the clear advantage of reducing lead time, depreciation of stock-on-hand, and the cost of delay on each item to the absolute minimum" [R13]. Few teams operate at WIP=1, but the principle holds. The smaller the in-flight unit, the cheaper a wrong turn.

The break-down rule is mechanical. Take the roadmap item. Ask what one sentence describes done. If the sentence needs an "and," split it. Repeat until each sentence is true on its own.
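Because the rule is mechanical, a first pass can even be automated. The following is a rough sketch, not a real parser: needs_split and split_candidates are hypothetical helper names, and the naive split on "and" only proposes fragments that a human still has to rewrite into full sentences.

```python
import re


def needs_split(done_sentence: str) -> bool:
    """Return True if the sentence joins outcomes with a standalone 'and'."""
    return re.search(r"\band\b", done_sentence, flags=re.IGNORECASE) is not None


def split_candidates(done_sentence: str) -> list[str]:
    """Naively split on 'and'; each piece is a candidate task, not a finished one."""
    parts = re.split(r"\s*(?:,\s*)?\band\b\s*", done_sentence, flags=re.IGNORECASE)
    return [p.strip(" .") for p in parts if p.strip(" .")]


roadmap_item = "User can request a reset email and set a new password"
if needs_split(roadmap_item):
    for piece in split_candidates(roadmap_item):
        print("candidate task:", piece)
```

The word-boundary match means words that merely contain "and", such as "sandbox", do not trigger a split.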

The per-task approval gate — what acceptance criteria look like

Definition of Done is a team-wide quality gate for the product increment: tests pass, code is reviewed, the change is deployed. Acceptance Criteria are per-story behavior gates. Both are gates; they operate at different granularities [R14]. Task-gated development leans hard on AC because that is what the client signs off on.

A good AC block is three to five bullets, each describing observable behavior in language a non-engineer can verify. No jargon. No implementation detail. For the password reset task above: user sees a confirmation message on submit; an email arrives within sixty seconds containing a reset link; the link expires after one hour; clicking an expired link shows a clear error. That is a gate.
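The password reset example can be written down as data and checked against the two rules above: three to five bullets, and no implementation detail. This is a minimal sketch under stated assumptions. The AcceptanceCriteria class and the IMPLEMENTATION_TERMS word list are hypothetical, and a keyword check is only a crude proxy for "language a non-engineer can verify".

```python
from dataclasses import dataclass

# Hypothetical word list; tune it to what counts as implementation detail on your team.
IMPLEMENTATION_TERMS = {"function", "class", "endpoint", "table", "refactor", "schema"}


@dataclass(frozen=True)
class AcceptanceCriteria:
    task: str
    bullets: tuple[str, ...]

    def problems(self) -> list[str]:
        """Return reasons this block is not yet a usable gate (empty list means OK)."""
        found = []
        if not 3 <= len(self.bullets) <= 5:
            found.append(f"expected 3-5 bullets, got {len(self.bullets)}")
        for bullet in self.bullets:
            words = {w.strip(".,;:").lower() for w in bullet.split()}
            leaked = sorted(words & IMPLEMENTATION_TERMS)
            if leaked:
                found.append(f"implementation detail in '{bullet}': {', '.join(leaked)}")
        return found


password_reset = AcceptanceCriteria(
    task="User receives password reset email when they submit the form on /forgot",
    bullets=(
        "User sees a confirmation message on submit",
        "An email arrives within sixty seconds containing a reset link",
        "The link expires after one hour",
        "Clicking an expired link shows a clear error",
    ),
)

print(password_reset.problems())
```

A block such as a single bullet reading "Add a sendReset function" would be flagged twice: too few bullets, and a bullet that dictates implementation.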

When a task fails the gate, the rule is rollback, not patch-on-top. The engineer reverts, the AC gets rewritten to capture what was missed, and the task is re-queued. This sounds wasteful at the task level and is the opposite at the project level. It prevents the compounding-error pattern that turns sprints into surprise sprints.
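The rollback rule can be sketched as a pure function. Everything here is illustrative: Task and reject_and_requeue are hypothetical names, appending the missed behavior is just one way to "rewrite" the AC, and placing the retry at the front of the queue is an assumption the text does not specify.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Task:
    title: str
    criteria: tuple[str, ...]
    attempt: int = 1


def reject_and_requeue(task: Task, missed: str, queue: list[Task]) -> Task:
    """Model a failed gate: discard the attempt, extend the AC, re-queue the task."""
    # The rejected work is reverted, so nothing from this attempt carries over;
    # only the acceptance criteria grow to capture what was missed.
    retry = replace(task, criteria=task.criteria + (missed,), attempt=task.attempt + 1)
    queue.insert(0, retry)  # assumption: a re-queued task goes to the front
    return retry


queue: list[Task] = [Task("Expired link shows a clear error", ("Error page is shown",))]
failed = Task("Reset email is sent", ("Email arrives within sixty seconds",))
retry = reject_and_requeue(failed, "The link expires after one hour", queue)
print(retry.attempt, retry.criteria)
```

Keeping the task immutable and creating a fresh retry mirrors the rule itself: the second attempt starts from the reverted state plus better criteria, not from a patched version of the rejected work.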

The review cadence — what this actually costs the client

The objection arrives immediately: "I don't have time to review work every two days." The LinearB data is the response. Review consumes the bulk of cycle time on most engineering teams [R7, R8]. The question is whether that review time is concentrated in one painful end-of-sprint session — where corrections are expensive because work has compounded — or spread across small, cheap, mid-flight gates.

In practice, a per-task gate costs roughly fifteen minutes of client time: read the AC, click through the change, approve or reject with one specific reason. A single engineer at a one-to-three-day cadence produces roughly two to five gates in a five-day week, so the total weekly client commitment stays under two hours.
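The weekly figure follows from simple arithmetic, sketched here. The function name weekly_review_minutes and the five-day working week are assumptions; the fifteen-minute gate and the one-to-three-day task size come from the text.

```python
def weekly_review_minutes(
    task_days: float, minutes_per_gate: float = 15, workdays: int = 5
) -> float:
    """Client review time per week for one engineer at a given task size."""
    tasks_per_week = workdays / task_days
    return tasks_per_week * minutes_per_gate


for days in (1, 2, 3):
    print(f"{days}-day tasks: about {weekly_review_minutes(days):.0f} minutes per week")
```

Even at the most demanding end, one-day tasks, the client spends about 75 minutes a week; at three-day tasks the figure drops to about 25 minutes. Both sit inside the two-hour ceiling.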

The hidden cost on the engineer side cuts the same way. Dr. Gloria Mark's research at UC Irvine puts the time to fully return to a task after interruption at 23 minutes 15 seconds [R9]. Separate cognitive research finds multitasking can reduce efficiency by up to 40% [R9]. Smaller batches reduce in-flight WIP, which reduces context switches, which preserves the engineer's deep-work time. The 15-minute client gate buys back hours of engineer focus.

When task-gated development falls down

The honest counter-argument is Basecamp's Shape Up. Shape Up runs six-week cycles with a "circuit breaker": unfinished work at cycle end does not roll over, and the team must re-pitch it [R17]. The case for the long cycle is that some problems require sustained exploration, and that a per-task gate turns engineers into ticket-takers. The risk Shape Up accepts is throwing away five weeks of work if the initial scoping was wrong.

Both critiques have weight. Task-gated development can drift into micromanagement if the client treats AC as design-by-committee rather than behavior verification. The fix is discipline about what AC describes: outcomes, not implementation.

Large architectural changes and research spikes are the other genuine failure mode for fine-grained tasks. The pattern that works is to treat the spike itself as a task with its own AC — typically a written Architecture Decision Record as the deliverable. The engineer is gated on producing a defensible recommendation, not on shipping code. Once the ADR is approved, implementation tasks flow from it at the normal one-to-three-day cadence.

How to manage on-demand developers without losing predictability

Task-gated development is a workflow, not a vendor type. It works with employees, contractors, and agencies. It happens to align particularly well with the on-demand development process because the cadence is continuous. The client owns the priority queue, and the engineer stays loaded with context across tasks rather than rebuilding it every two-week sprint.

YS Dev On Demand operates on this model: $3,495 per month flat, a dedicated engineer matched same-day, first ship within five days, cancel anytime, no lock-in. The fit is structural rather than promotional. The subscription removes the procurement overhead that makes per-task gating impractical with traditional vendors. The broader context is a fractional executive market that has expanded sharply since 2022, with fractional CTO services increasingly listed alongside traditional vendor categories [R16]. Buyers are already moving toward smaller, continuous engagements. The workflow needs to match.

FAQ

What is task-gated development?

A delivery process where work is broken into one-to-three-day micro-deliverables, each with written acceptance criteria, and the client approves each task before the next begins.

How big is a "task"?

Small enough that "done" fits in one sentence with no "and." Allen Holub's guidance — stories sized to roughly a couple of days [R10] — is a useful upper bound. Three days is the outer limit before feedback loops degrade.

Doesn't this micromanage the engineer?

No, if AC describes observable behavior rather than implementation. The engineer owns the how; the client owns the what. Misuse looks like AC dictating function names; correct use looks like AC dictating what a user can do.

When shouldn't I use it?

Pure research spikes and large architectural decisions. The workaround is to make the spike itself a task with an ADR as the deliverable, not to abandon the gate.

Sources

[R2] PMI scope creep and budget overrun data — https://www.firebreak.ai/blog/the-hidden-cost-of-project-scope-creep
[R3] Scope increase to budget overrun ratio — https://www.firebreak.ai/blog/the-hidden-cost-of-project-scope-creep
[R4] DORA 2025 report on AI throughput vs. stability — https://cloud.google.com/blog/products/ai-machine-learning/announcing-the-2025-dora-report
[R6] DORA metrics guide on reducing batch size — https://dora.dev/guides/dora-metrics/
[R7] LinearB benchmarks: sample size across 6.1M PRs and 3,000+ orgs — https://linearb.io/blog/engineering-metrics-benchmarks-what-makes-elite-teams
[R8] LinearB: PR size as primary velocity driver — https://linearb.io/blog/engineering-metrics-benchmarks-what-makes-elite-teams
[R9] Gloria Mark research on context switching cost — https://www.crownest.dev/blog/hidden-cost-context-switching-developers
[R10] Allen Holub on story sizing and #NoEstimates — https://holub.com/noestimates-an-introduction/
[R11] Kent Beck on tiny steps — https://tidyfirst.substack.com/p/first-one-then-many
[R13] Single-piece flow pattern — https://dzone.com/articles/pattern-of-the-month-single-piece-flow
[R14] Acceptance Criteria vs. Definition of Done — https://www.altexsoft.com/blog/acceptance-criteria-definition-of-done/
[R15] Outsourcing failure rates — https://winatalent.com/blog/why-software-development-outsourcing-fails/
[R16] Aiken House — fractional CTO services landscape — https://www.aikenhouse.com/post/2025s-best-fractional-cto-services-companies-the-tech-leaders-powering-innovation
[R17] Basecamp Shape Up six-week cycles — https://productmanagementresources.com/shape-up-method/
[R19] DORA 2025 team archetypes — https://cloud.google.com/blog/products/ai-machine-learning/announcing-the-2025-dora-report
