<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://arefiva.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://arefiva.github.io/" rel="alternate" type="text/html" /><updated>2026-04-18T08:53:30+00:00</updated><id>https://arefiva.github.io/feed.xml</id><title type="html">arefiva</title><subtitle>Thoughts on agentic development, projects, and more.</subtitle><entry><title type="html">Acceptance Criteria That Actually Work</title><link href="https://arefiva.github.io/agentic-development/2026/04/18/acceptance-criteria-that-actually-work.html" rel="alternate" type="text/html" title="Acceptance Criteria That Actually Work" /><published>2026-04-18T08:00:00+00:00</published><updated>2026-04-18T08:00:00+00:00</updated><id>https://arefiva.github.io/agentic-development/2026/04/18/acceptance-criteria-that-actually-work</id><content type="html" xml:base="https://arefiva.github.io/agentic-development/2026/04/18/acceptance-criteria-that-actually-work.html"><![CDATA[<p>Every post in this series so far has circled back to the same idea: the quality of specification determines the quality of execution. The <a href="/agentic-development/2026/04/15/red-green-refactor-with-an-ai-in-the-loop.html">Red, Green, Refactor</a> post framed this as the red phase of TDD applied to agentic development. The <a href="/agentic-development/2026/04/16/specification-first-development-the-mental-shift.html">Specification-First Development</a> post explored why that shift is harder than it sounds and what it takes to sustain it. And <a href="/agentic-development/2026/04/17/arc-the-agentic-release-cycle.html">ARC</a> showed what happens when a well-formed specification reaches an execution loop that can verify each story against its criteria before moving to the next.</p>

<p>What none of those posts did is examine the criteria themselves. What makes an acceptance criterion actually verifiable? Where is the line between a criterion that drives correct implementation and one that looks specific enough but still leaves the hard questions unanswered? That is what this post is about.</p>

<h2 id="assertions-not-descriptions">Assertions, Not Descriptions</h2>

<p>The most common failure in acceptance criteria is that they describe behavior rather than assert it. A description tells you what the feature should do in general terms. An assertion tells you what a specific input produces as a specific output, under specific conditions, with a specific observable result.</p>

<p>Consider the difference:</p>

<p><strong>Description:</strong> “The endpoint should return paginated results efficiently.”</p>

<p><strong>Assertion:</strong> “GET /api/trades?page_size=20&amp;cursor=abc123 returns at most 20 results, a next_cursor value when more results exist, and no next_cursor when the final page is reached.”</p>

<p>The description appears complete at first glance. “Efficiently” names a broad concept without resolving the decisions underneath it. But it defers every decision that matters: what pagination scheme, what happens at the boundary, what the response shape looks like. Those decisions will still need to be made. The question is whether they get made during specification, when the cost of a wrong answer is a revised document, or during implementation, when the cost is rework.</p>

<p>The assertion leaves less room for interpretation. It specifies cursor-based pagination, a concrete page size, the presence and absence of the next_cursor field, and the boundary behavior at the last page. An agent implementing against this criterion has enough information to produce the right implementation on the first pass, not because the agent is more capable, but because the criterion removed the ambiguity the agent would otherwise have to guess about.</p>
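<p>A criterion written at this level of specificity is nearly executable. The sketch below is illustrative, not the real endpoint: it uses an in-memory list and a plain integer offset in place of whatever opaque cursor the actual API encodes, but the three assertions map one-to-one onto the clauses of the criterion.</p>

```python
# Illustrative in-memory sketch of the pagination criterion. A plain
# integer offset stands in for whatever opaque cursor the real API uses.
def list_trades(trades, page_size=20, cursor=None):
    start = int(cursor) if cursor is not None else 0
    page = {"results": trades[start:start + page_size]}   # at most page_size results
    if start + page_size < len(trades):
        page["next_cursor"] = str(start + page_size)      # present only when more exist
    return page

trades = [{"id": i} for i in range(45)]
first = list_trades(trades)
assert len(first["results"]) <= 20 and first["next_cursor"] == "20"
assert "next_cursor" not in list_trades(trades, cursor="40")  # final page: field absent
```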

<h2 id="the-specificity-spectrum">The Specificity Spectrum</h2>

<p>Not all criteria need the same level of detail. The appropriate specificity depends on how much interpretation the criterion leaves open and how costly a wrong interpretation would be.</p>

<h3 id="too-vague-to-verify">Too vague to verify</h3>

<p>“The system should handle errors gracefully.”</p>

<p>This is not a criterion so much as an aspiration. “Gracefully” means different things to different people, and there is no way to verify whether the implementation satisfies it without first defining what gracefully means in this specific context. A criterion at this level does not constrain the implementation at all.</p>

<h3 id="directionally-useful-but-still-ambiguous">Directionally useful but still ambiguous</h3>

<p>“Invalid requests should return a 400 status code with an error message.”</p>

<p>This is closer. The status code is specific and verifiable. But what constitutes an invalid request? What structure does the error message take? Is it a plain string, a structured JSON body, or an error code from a predefined set? Each of those choices produces a different implementation, and the criterion does not specify which one.</p>

<h3 id="verifiable-and-complete">Verifiable and complete</h3>

<p>“A POST to /api/trades with a missing required field (instrument_id, quantity, or side) returns HTTP 400 with a JSON body containing an errors array, where each entry has a field name and a message string. The response includes one entry per missing field.”</p>

<p>This criterion can be implemented and verified without asking a follow-up question. It specifies the trigger condition (missing required fields), the response shape (JSON with errors array), the structure of each error entry (field and message), and the multiplicity behavior (one entry per missing field). An implementation that satisfies this criterion either matches or it does not.</p>

<h3 id="the-practical-question">The practical question</h3>

<p>In practice, the useful level of specificity is usually the point where the person or agent implementing the criterion does not need to make a design decision that was not explicitly made in the specification. If the criterion requires the implementer to choose between two reasonable approaches, it is underspecified. This does not mean every criterion needs to be a paragraph. It means every criterion needs to close the decisions that matter for that particular behavior.</p>

<h2 id="edge-cases-surface-at-specification-time">Edge Cases Surface at Specification Time</h2>

<p>One of the less obvious benefits of writing specific acceptance criteria is that they force edge cases into view before any code exists. A vague criterion hides its edge cases because the language is broad enough to encompass them without addressing them. A specific criterion either covers the edge case explicitly or makes the gap visible by its absence.</p>

<p>Consider a date range filter. A vague criterion might say “supports filtering by date range.” Writing specific criteria for this feature surfaces questions that would otherwise appear during implementation or, worse, in production:</p>

<ul>
  <li>What happens when the start date is after the end date? Is that a 400, or does the system swap them silently?</li>
  <li>Are the boundaries inclusive or exclusive? Does a trade executed at exactly the start timestamp appear in the results?</li>
  <li>What date format is accepted? What happens when the format is wrong?</li>
  <li>What happens when only one boundary is provided? Is that a valid partial range or a missing parameter error?</li>
</ul>

<p>Each of these questions represents a design decision. Writing the criterion forces the decision to be made explicitly, by whoever understands the domain well enough to make it. When that decision is deferred to implementation, whoever is writing the code, human or agent, makes the choice based on whatever seems reasonable, and “reasonable” may not match what the rest of the system does.</p>
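<p>One way to make those decisions tangible is to notice that each one becomes an explicit branch in code. The sketch below commits to one possible set of answers (ISO 8601 dates only, a 400 on an inverted range, partial ranges allowed); every commented line marks a decision the specification should make deliberately rather than leave to the implementer.</p>

```python
from datetime import date

# One possible set of answers to the date-range questions, written as
# explicit decisions. Every commented branch is a choice the
# specification should make, not a given.
def parse_date_range(start=None, end=None):
    """Return (start, end) as dates; raise ValueError, which maps to HTTP 400."""
    def parse(value):
        try:
            return date.fromisoformat(value)              # decision: ISO 8601 only
        except ValueError:
            raise ValueError(f"invalid date: {value!r}")  # decision: bad format is a 400
    start = parse(start) if start is not None else None   # decision: partial ranges are valid
    end = parse(end) if end is not None else None
    if start is not None and end is not None and start > end:
        raise ValueError("start must not be after end")   # decision: 400, not a silent swap
    return start, end  # inclusivity of the boundaries is decided in the query, not here

assert parse_date_range("2026-04-01", "2026-04-18") == (date(2026, 4, 1), date(2026, 4, 18))
assert parse_date_range(start="2026-04-01") == (date(2026, 4, 1), None)
```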

<h2 id="writing-criteria-for-agents">Writing Criteria for Agents</h2>

<p>Human developers reading acceptance criteria can interpolate. They recognize patterns from the codebase, infer unstated constraints from convention, and ask a colleague when something is ambiguous. An agent, at least in its current form, does not do those things reliably. It reads the criterion literally and implements what it says, which means the criterion needs to say what it means with less margin for interpretation.</p>

<p>This does not require a fundamentally different style of writing. It requires closing gaps that a human reader would fill automatically.</p>

<h3 id="what-agents-need-explicitly-stated">What agents need explicitly stated</h3>

<p><strong>Existing patterns to follow.</strong> A human developer working in a codebase notices that all endpoints use a specific error response format and follows the same pattern. An agent might or might not discover that pattern depending on which files it reads. Stating “follow the error response pattern established in src/api/errors.py” removes the guesswork.</p>

<p><strong>Boundary behavior.</strong> Humans often have reasonable intuitions about what should happen at boundaries. Agents do not have intuitions. If the criterion says “returns paginated results,” the agent needs to know what happens when there are zero results, what happens at the last page, and whether an empty page is a valid response.</p>

<p><strong>Negative cases.</strong> What the system should not do is as important as what it should do. “The endpoint must not return soft-deleted records in the result set” is the kind of constraint a human developer might infer from domain context but an agent will not apply unless told.</p>

<h3 id="what-stays-the-same">What stays the same</h3>

<p>The fundamental skill is the same regardless of who reads the criteria. A well-written criterion for an agent is also a well-written criterion for a human. The difference is not in kind but in tolerance: an agent has zero tolerance for ambiguity where a human has some, so criteria written for agents tend to be more precise, and that precision benefits everyone.</p>

<h2 id="from-criteria-to-tests">From Criteria to Tests</h2>

<p>Acceptance criteria that are specific enough to verify are, in a meaningful sense, already tests. The translation from a criterion to a test case tends to be mechanical rather than creative, which is part of what makes the approach work.</p>

<p>Take the error response criterion from earlier:</p>

<blockquote>
  <p>A POST to /api/trades with a missing required field (instrument_id, quantity, or side) returns HTTP 400 with a JSON body containing an errors array, where each entry has a field name and a message string. The response includes one entry per missing field.</p>
</blockquote>

<p>The test cases emerge directly:</p>

<ul>
  <li>POST with all required fields missing: expect 400, errors array with three entries</li>
  <li>POST with one field missing: expect 400, errors array with one entry matching the missing field</li>
  <li>POST with all fields present: expect success, no errors array</li>
  <li>POST with an extra unknown field: expect success (the criterion says nothing about rejecting unknown fields, so the test verifies the system tolerates them)</li>
</ul>
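<p>Written out, the four cases look like this. The <code class="language-plaintext highlighter-rouge">post_trade</code> stub is a stand-in for the real endpoint; only the assertions come from the criterion.</p>

```python
# The four test cases above, run against a stand-in handler.
# post_trade is a stub for illustration, not the real endpoint.
REQUIRED = ("instrument_id", "quantity", "side")

def post_trade(payload):
    errors = [{"field": f, "message": f"{f} is required"}
              for f in REQUIRED if f not in payload]
    return (400, {"errors": errors}) if errors else (201, {})

# All required fields missing: 400 with three entries.
status, body = post_trade({})
assert status == 400 and len(body["errors"]) == 3

# One field missing: one entry naming the missing field.
status, body = post_trade({"instrument_id": "X", "quantity": 5})
assert status == 400 and [e["field"] for e in body["errors"]] == ["side"]

# All fields present: success, no errors array.
status, body = post_trade({"instrument_id": "X", "quantity": 5, "side": "buy"})
assert status == 201 and "errors" not in body

# Unknown extra field: tolerated, per the gap the criterion leaves open.
status, _ = post_trade({"instrument_id": "X", "quantity": 5, "side": "buy", "note": "?"})
assert status == 201
```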

<p>The fourth test reveals something interesting: a gap in the criterion. Should unknown fields be silently ignored or rejected? If the answer matters, the criterion should be updated. If it does not matter, the test documents the behavior that was chosen.</p>

<p>This is the feedback loop that makes acceptance criteria powerful. Writing criteria surfaces design decisions. Translating criteria to tests surfaces gaps in those decisions. Both happen before implementation, when the cost of a revision is a line in a document rather than a refactored function.</p>

<h2 id="anti-patterns">Anti-Patterns</h2>

<p>There are a few patterns that consistently produce criteria that look specific but fail during implementation.</p>

<h3 id="the-implementation-prescription">The implementation prescription</h3>

<p>“Use a recursive CTE to aggregate positions by book hierarchy.”</p>

<p>This is not an acceptance criterion. It is an implementation instruction. It tells the implementer how to build something without specifying what the correct behavior is. If the recursive CTE produces the wrong result, this criterion still passes. Acceptance criteria describe observable outcomes, not internal mechanisms.</p>

<p>Implementation guidance belongs in the implementation notes field of the specification, not in the acceptance criteria. The distinction matters because acceptance criteria are what get verified. If the criterion is “use a CTE,” the verification checks whether a CTE was used, not whether the aggregation is correct.</p>

<h3 id="the-untestable-quality">The untestable quality</h3>

<p>“The UI should feel responsive.”</p>

<p>There is no test for “feel,” which makes this not a criterion but a goal that belongs in the feature description rather than the acceptance criteria. A testable version might be “page load completes within 200ms at the 95th percentile” or “the search input debounces at 300ms and shows a loading indicator within 100ms of the query firing.” Both can be measured and verified.</p>
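<p>The percentile version is checkable in a few lines. The latency samples below are illustrative, and the nearest-rank percentile is one of several reasonable definitions; which one counts is itself a decision the criterion can pin down.</p>

```python
import math

# Sketch of verifying "page load completes within 200ms at the 95th
# percentile", using the nearest-rank percentile definition.
def p95(samples):
    ordered = sorted(samples)
    idx = math.ceil(0.95 * len(ordered)) - 1   # nearest-rank index
    return ordered[idx]

latencies_ms = [95, 120, 140, 160, 180] * 20   # illustrative measurements
assert p95(latencies_ms) <= 200, "p95 latency budget exceeded"
```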

<h3 id="the-hidden-dependency">The hidden dependency</h3>

<p>“The report generates correctly.”</p>

<p>“Correctly” according to what? This criterion assumes the reviewer already knows what correct means, which may be true today but tends not to hold when someone new joins the team or when the feature changes six months from now. Criteria that work well over time are self-contained: they carry enough context that someone unfamiliar with the feature can determine whether the implementation satisfies them.</p>

<h3 id="the-scope-creep-criterion">The scope creep criterion</h3>

<p>“All edge cases are handled.”</p>

<p>This is unfalsifiable. You cannot verify that all edge cases are handled, because you cannot enumerate all edge cases. Good criteria enumerate the specific edge cases that are in scope, which makes the boundary of the feature explicit and reviewable.</p>

<h2 id="where-this-lands">Where This Lands</h2>

<p>The practice of writing acceptance criteria well is not separate from the broader discipline of specification-first development. It is the most concrete expression of that discipline, the place where abstract intent becomes verifiable behavior. When the criteria are specific enough to implement without interpretation and complete enough to test without guessing, the execution phase, whether performed by a human or an agent, tends to become a matter of translation rather than invention. The thinking that mattered most already happened, and it was captured in a form that persists after the implementation is done and remains useful when the feature eventually needs to change.</p>]]></content><author><name></name></author><category term="agentic-development" /><category term="specification" /><category term="acceptance-criteria" /><category term="agentic-development" /><category term="workflow" /><summary type="html"><![CDATA[Every post in this series so far has circled back to the same idea: the quality of specification determines the quality of execution. The Red, Green, Refactor post framed this as the red phase of TDD applied to agentic development. The Specification-First Development post explored why that shift is harder than it sounds and what it takes to sustain it. 
And ARC showed what happens when a well-formed specification reaches an execution loop that can verify each story against its criteria before moving to the next.]]></summary></entry><entry><title type="html">ARC: The Agentic Release Cycle</title><link href="https://arefiva.github.io/agentic-development/2026/04/17/arc-the-agentic-release-cycle.html" rel="alternate" type="text/html" title="ARC: The Agentic Release Cycle" /><published>2026-04-17T07:00:00+00:00</published><updated>2026-04-17T07:00:00+00:00</updated><id>https://arefiva.github.io/agentic-development/2026/04/17/arc-the-agentic-release-cycle</id><content type="html" xml:base="https://arefiva.github.io/agentic-development/2026/04/17/arc-the-agentic-release-cycle.html"><![CDATA[<p>The <a href="/agentic-development/2026/04/15/red-green-refactor-with-an-ai-in-the-loop.html">Red, Green, Refactor With an AI in the Loop</a> post described a development cycle where a human writes an AES (Autonomous Execution Specification), an agent implements against it, and a human reviews the result. What that post did not describe in detail is what “autonomous execution” actually looks like in practice, and specifically how you get an AI agent to work through a structured backlog of stories without losing track of intent, accumulating context noise, or silently failing on a story and then building the next one on top of a broken foundation.</p>

<p>That is what ARC is for.</p>

<h2 id="the-problems-with-unstructured-agent-usage">The Problems With Unstructured Agent Usage</h2>

<p>Using an AI coding agent without structure introduces a set of failure modes that compound quietly rather than loudly. Context rot is one of the more insidious ones: as a conversation with an agent grows longer, earlier parts of the context get compressed or pushed beyond the effective attention window, and the agent begins to forget decisions it made earlier in the session, repeat work, or contradict architectural choices it already committed to. This is not a flaw in the agent so much as a fundamental characteristic of how large language models process long contexts, and it is worth designing around deliberately rather than hoping it will not matter.</p>

<h3 id="beyond-context-rot">Beyond context rot</h3>

<p>An agent working through a multi-story feature without orchestration has no persistent memory of intent across sessions. It knows what files it can see and what you told it in the current prompt, but not what the overall feature was supposed to achieve, which stories were already done, or what architectural constraints were established in earlier sessions. The result is an agent that is technically capable of implementing each story in isolation but has no way to ensure the stories cohere.</p>

<h3 id="quality-enforcement">Quality enforcement</h3>
<p>Quality enforcement is a third gap. Without something checking the result of each story before the next begins, a failed or partial implementation becomes the foundation for subsequent work, and the errors compound until the feature fails at integration time rather than at the story level where they are cheap to fix.</p>

<p>ARC addresses all of these systematically.</p>

<h2 id="the-loop">The Loop</h2>

<p>The core of ARC is a structured execution loop that runs from an AES. Each iteration of the loop follows the same sequence: read the AES, identify the highest-priority story whose dependencies have all passed, select the appropriate AI model for that story’s complexity and risk level, construct a prompt with full project context injected directly, invoke the Copilot CLI agent, check the result against the story’s acceptance criteria, and either mark the story as passed or retry. Once all stories pass, ARC runs an automated code review against the feature branch, then plans and executes a second-pass fix loop to address whatever the review surfaced.</p>

<p>The key word in that sequence is “inject.” Rather than letting the agent read files from disk to orient itself, ARC injects the full project operating manual, a <code class="language-plaintext highlighter-rouge">COPILOT.md</code> file containing the stack, conventions, build instructions, and constraints, and the entire AES directly into every session prompt. The agent always has the complete picture of what the feature is supposed to do, which stories are already done, and what the current story requires, without needing to infer any of it from file reads that might be stale or incomplete.</p>
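<p>Reduced to its skeleton, the loop looks something like the sketch below. Everything here is a hypothetical stand-in: the real loop drives the Copilot CLI with the injected <code class="language-plaintext highlighter-rouge">COPILOT.md</code> and AES, not a Python callable. What the sketch shows is the two structural guarantees: dependency-ordered story selection, and a verification gate that stops the run rather than building on a failed story.</p>

```python
from dataclasses import dataclass, field

# Self-contained sketch of one ARC pass: pick the next story whose
# dependencies passed, implement, verify, then advance or stop.
@dataclass
class Story:
    id: str
    depends_on: list = field(default_factory=list)
    status: str = "pending"

def next_ready_story(stories):
    passed = {s.id for s in stories if s.status == "passed"}
    for s in stories:  # stories are listed in priority order
        if s.status == "pending" and all(d in passed for d in s.depends_on):
            return s
    return None

def run_loop(stories, implement_and_verify):
    while (story := next_ready_story(stories)) is not None:
        if implement_and_verify(story):   # agent call + acceptance-criteria check
            story.status = "passed"
        else:
            story.status = "failed"       # gate: never build on a broken story
            break
    return {s.id: s.status for s in stories}

backlog = [Story("S1"), Story("S2", depends_on=["S1"])]
assert run_loop(backlog, lambda story: True) == {"S1": "passed", "S2": "passed"}
```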

<h2 id="context-as-a-first-class-problem">Context as a First-Class Problem</h2>

<p>The context injection is only part of the solution. Before the first story runs, ARC auto-generates two additional files that persist across sessions. The first is <code class="language-plaintext highlighter-rouge">FEATURE_CONTEXT.md</code>, an at-a-glance map of the entire feature that clusters related stories by keyword, identifies which data model fields are introduced, flags which modules are touched by multiple stories, and lists any explicit constraints from the AES. This gives the agent a structural overview of the feature without reading through every story description in sequence.</p>

<p>The second is <code class="language-plaintext highlighter-rouge">DECISIONS.md</code>, a decision log with one row per story. As each story is implemented, the agent fills in the architectural decision and its rationale. ARC merges new rows with existing ones on each run, so the decision log accumulates across sessions. An agent in session eight has access to the decisions made in sessions one through seven, not because the full conversation history is still in context, but because the decisions were maintained as a persistent structured record.</p>
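<p>The merge behavior is simple to sketch. The row shape below is an assumption about how the log is structured, but it captures the property that matters: one row per story, with new rows merged into the existing record so decisions survive across sessions.</p>

```python
# Sketch of merging DECISIONS.md rows across runs: one row per story,
# keyed by story id. The dict row shape is an assumption for illustration.
def merge_decision_log(existing_rows, new_rows):
    """A new row for a story replaces the old; unknown stories are appended."""
    merged = {row["story"]: row for row in existing_rows}
    for row in new_rows:
        merged[row["story"]] = row
    return [merged[k] for k in sorted(merged)]

log = merge_decision_log(
    [{"story": "S1", "decision": "use cursor pagination"}],
    [{"story": "S2", "decision": "reuse the errors.py response shape"}],
)
assert [r["story"] for r in log] == ["S1", "S2"]
```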

<h3 id="context-window-management">Context Window Management</h3>

<p>Context window utilization is also tracked explicitly. When cumulative token usage across a run exceeds eighty percent of the model’s context limit, ARC starts a fresh session. The fresh session receives the same injected context as every prior session, so no intent is lost, but the conversation history that has accumulated up to that point is dropped, preventing the gradual degradation that happens when an agent’s context fills with its own prior work. The boundary event is logged to a structured progress file, so it is visible in the metrics rather than something that happens invisibly.</p>
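<p>The boundary check itself is a one-liner; what matters is that the threshold is explicit and the reset is deliberate. A minimal sketch, with the token accounting left to the caller:</p>

```python
# Sketch of the eighty-percent session boundary. Real ARC reads token
# counts from the session and logs the boundary event to its progress file.
CONTEXT_LIMIT_FRACTION = 0.8

def should_start_fresh_session(cumulative_tokens, model_context_limit):
    return cumulative_tokens >= CONTEXT_LIMIT_FRACTION * model_context_limit

assert not should_start_fresh_session(50_000, 128_000)
assert should_start_fresh_session(110_000, 128_000)  # boundary crossed: reset and re-inject
```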

<h2 id="model-selection-as-a-quality-signal">Model Selection as a Quality Signal</h2>

<p>One detail I find practically significant is how ARC routes stories to models. The AES carries two fields per story: <code class="language-plaintext highlighter-rouge">complexity</code> (low, medium, high) and <code class="language-plaintext highlighter-rouge">riskLevel</code> (low, medium, high, critical). ARC uses these to automatically select a model, routing high-risk or high-complexity stories to a more capable model and trivial stories to a faster and cheaper one. This is not only a cost optimization, though the cost difference is real and accumulates over a large feature. It is also a signal embedded in the specification: when you mark a story as high complexity or critical risk during the AES authoring phase, you are making a deliberate judgment that this story requires more careful reasoning, and ARC enforces that judgment at execution time without any additional configuration.</p>
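<p>The routing rule as stated reduces to a small function. The model names below are placeholders; the branching is the part taken from the text, with medium complexity routed to the cheaper model as an assumption about the unstated middle case.</p>

```python
# Sketch of complexity/risk routing. Model names are placeholders; the
# rule (capable model for high complexity or high/critical risk) is from
# the AES fields described above.
def select_model(complexity, risk_level):
    if complexity == "high" or risk_level in ("high", "critical"):
        return "capable-model"
    return "fast-model"   # assumption: medium complexity takes the cheaper path

assert select_model("low", "low") == "fast-model"
assert select_model("high", "low") == "capable-model"
assert select_model("medium", "critical") == "capable-model"
```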

<h2 id="completing-the-cycle">Completing the Cycle</h2>

<p>The connection back to the TDD cycle is explicit in how ARC frames its own operation. The Red phase is writing the AES: defining stories, acceptance criteria, implementation notes, complexity, and risk before a line of code exists. The Green phase is ARC running the loop, implementing each story against the specification. The Refactor phase is the automated code review followed by the fix loop, and then human review of the final diff. All three phases run in sequence without manual handoff once the AES is written.</p>

<p>What this means in practice is that the discipline invested in the Red phase, the same discipline that the <a href="/agentic-development/2026/04/16/specification-first-development-the-mental-shift.html">Specification-First Development</a> post explores in detail and that the <a href="/agentic-development/2026/04/15/red-green-refactor-with-an-ai-in-the-loop.html">Red, Green, Refactor With an AI in the Loop</a> post argues is the central intellectual work of agentic development, is what determines whether ARC can execute well. A vague AES produces a vague feature because ARC has nothing precise to enforce against. A precise AES with specific acceptance criteria, clear implementation guidance, and thoughtful risk and complexity annotations gives ARC the structure it needs to route correctly, inject meaningfully, and verify cleanly.</p>

<p>The automated Refactor phase adds something that pure TDD does not have by default: the code review runs after all stories pass but before the human reviews the branch, surfacing logic errors, missing edge cases, and architectural mismatches as new stories rather than as comments for the developer to act on manually. The review findings become the next wave of Red-phase specification, and ARC executes them the same way it executed the original feature.</p>

<p>Where this becomes tangible: the combination of structured context injection, persistent decision logging, context window management, and quality gates between stories means the final diff the human reviews was produced by an agent that always knew what it was supposed to build, always had the same project context in every session, and always had its output verified before the next story began. The quality of that verification depends entirely on the acceptance criteria in the AES, which is why the discipline of <a href="/agentic-development/2026/04/18/acceptance-criteria-that-actually-work.html">writing criteria that are specific enough to verify without interpretation</a> matters as much as the execution machinery itself.</p>]]></content><author><name></name></author><category term="agentic-development" /><category term="agentic-development" /><category term="arc" /><category term="automation" /><category term="workflow" /><category term="TDD" /><summary type="html"><![CDATA[The Red, Green, Refactor With an AI in the Loop post described a development cycle where a human writes an AES (Autonomous Execution Specification), an agent implements against it, and a human reviews the result. 
What that post did not describe in detail is what “autonomous execution” actually looks like in practice, and specifically how you get an AI agent to work through a structured backlog of stories without losing track of intent, accumulating context noise, or silently failing on a story and then building the next one on top of a broken foundation.]]></summary></entry><entry><title type="html">Specification-First Development: The Mental Shift</title><link href="https://arefiva.github.io/agentic-development/2026/04/16/specification-first-development-the-mental-shift.html" rel="alternate" type="text/html" title="Specification-First Development: The Mental Shift" /><published>2026-04-16T08:00:00+00:00</published><updated>2026-04-16T08:00:00+00:00</updated><id>https://arefiva.github.io/agentic-development/2026/04/16/specification-first-development-the-mental-shift</id><content type="html" xml:base="https://arefiva.github.io/agentic-development/2026/04/16/specification-first-development-the-mental-shift.html"><![CDATA[<p>There is a specific pull toward starting with code. Requirements arrive with enough shape to begin, the architecture is familiar, and writing code gives you something concrete to hold, something that compiles or fails and tells you where you are. Specification work has none of that feedback. You are reasoning about behavior that does not exist yet, writing assertions that no system can evaluate, and the result is a document rather than a function. The cost of writing the document is visible; the cost of not writing it is not, at least not immediately.</p>

<p>The shift from code-first to spec-first thinking is harder than adopting a new framework or changing a testing strategy, because it requires changing what you treat as productive at the beginning of a cycle. And that change is not just procedural.</p>

<h2 id="the-code-first-reflex">The Code-First Reflex</h2>

<p>When a requirement lands, the mind immediately begins organizing the implementation: the data model, the service boundary, the function signature. These are concrete and manipulable in a way that specification questions are not. Code provides feedback through tests and build output. A specification provides feedback only through the quality of the thinking that produced it, which is harder to evaluate and harder to feel progress against.</p>

<p>This creates a pattern where the specification phase is treated as a formality rather than as the actual work of understanding the requirement. The questions that would surface during specification, about what the edge case behavior should be, which existing patterns to follow, what the precise scope of the feature is, still have to be answered. They surface later, in the middle of implementation, where answering them requires switching out of implementation mode and often discarding work that was built on the wrong assumption.</p>

<p>The reflex toward code is not irrational. In small, well-understood, low-stakes features, it is often the right call. The problem is that the features where it fails are not reliably distinguishable in advance from the features where it succeeds, which shifts the expected value calculation toward specification even for features that feel straightforward.</p>

<h2 id="specification-as-thinking">Specification as Thinking</h2>

<p>Writing a requirement down with enough specificity to be verified changes what you know about it. A requirement held as a mental model can contain contradictions and ambiguities that are invisible because the mind fills in the gaps automatically. Writing the same requirement as acceptance criteria forces those gaps into view. You discover that two criteria conflict, or that a behavior you assumed was obvious turns out to be a design decision that was never made, or that the feature as imagined touches a domain boundary that was not part of the original discussion.</p>

<p>None of these discoveries feel productive at the moment they surface. They represent work that did not seem to exist until the specification forced it into view. But they are progress, because finding them at specification time costs far less than finding them after the implementation is already built around an assumption that turns out to be wrong.</p>

<h3 id="the-constraint-based-requirement">The Constraint-Based Requirement</h3>

<p>A requirement stated as “should handle large inputs gracefully” can sit in the mind as a complete idea, because “graceful” is a concept everyone understands. Writing down what gracefully means in this context, whether it is a timeout with a 400 response, a queued job, or a streaming response with early flush, requires making a decision that felt as though it had already been made. The specification does not just record the decision. It reveals that the decision was never made, and forces it to happen before any code depends on it.</p>

<p>The <a href="/agentic-development/2026/04/15/red-green-refactor-with-an-ai-in-the-loop.html">Red, Green, Refactor With an AI in the Loop</a> post describes this in terms of the red phase in TDD: writing a failing test is itself a form of specification, and the discipline it enforces is not about tests but about the thinking that tests require before implementation begins. Specification-first development extends that discipline upstream, into the requirement itself, before the tests exist.</p>

<h2 id="the-tempting-shortcut">The Tempting Shortcut</h2>

<p>The failure mode in agentic development has a recognizable shape. You describe a feature in broad strokes, confident you know what you mean, and the agent produces something that satisfies the description without quite fitting the codebase: the wrong abstraction, a pattern that diverges from what the rest of the system does, or edge cases that were obvious in retrospect but were never specified and therefore never handled. The output is not wrong by any objective measure, but it requires correction cycles that would have been specification cycles if the thinking had happened earlier.</p>

<p>The parallel in TDD is the same one experienced practitioners recognize: writing code first and retrofitting tests to match produces a test suite that is green but is measuring the implementation rather than verifying the requirement. The gap between those two things is where bugs live, and where shared understanding starts to erode.</p>

<p>What makes the shortcut attractive is that it sometimes works well, particularly in well-bounded features in familiar domains. The problem is that it tends to work reliably in exactly the situations where the full specification cycle would have been quick anyway, and fails most visibly in the situations where that specification work would have been most valuable.</p>

<h2 id="the-same-feature-two-paths">The Same Feature, Two Paths</h2>

<p>Consider a straightforward feature: a search endpoint that returns paginated results filtered by status and date range. Both paths begin with the same description.</p>

<p>On the shortcut path, the agent reads the codebase, generates an endpoint, writes tests against the implementation, and delivers a diff that appears complete. On review, you notice the endpoint uses offset-based pagination while the rest of the API uses cursor-based pagination, because the requirement did not specify. The date range filter does not cover boundary conditions because nothing written placed them in scope. The status filter accepts values the domain model does not support because no constraint was specified. Each issue is individually correctable. Together, they require a correction cycle that would not have been necessary if the specification had answered these questions before any code existed.</p>

<p>On the spec-first path, writing the acceptance criteria surfaces all of those questions before implementation begins. What pagination scheme should this endpoint follow? What is the behavior when the date range start falls after the end? Which status values are valid, and what response should an invalid value produce? By the time the specification is complete, every question that would otherwise surface in a correction cycle has been answered, and the implementation that follows fits cleanly not because the agent is more capable, but because it was built against a complete requirement rather than an underspecified one.</p>

<p>The total time is not necessarily shorter. The difference is in when the work happens and what the output looks like when the cycle ends.</p>

<h2 id="sustaining-the-discipline-in-a-team">Sustaining the Discipline in a Team</h2>

<p>The mental shift to spec-first thinking is harder to maintain in a team than in isolation, because the pressures that generate the code-first reflex are amplified by collaboration. Schedule pressure, visible progress requirements, and the tendency to treat specification work as less tangible than implementation all compress the phases where no code is being written.</p>

<p>A few structural patterns tend to hold the discipline in place. Treating the specification as the actual deliverable of the planning phase, rather than a prerequisite to the deliverable, changes how it is reviewed and how seriously ambiguities in it are taken. Requiring acceptance criteria to be written as verifiable assertions rather than descriptive statements (the distinction between “should handle errors gracefully” and “returns a 400 with a structured error body when the status value is not in the allowed set”) makes vagueness visible in a way that descriptive criteria do not. Reviewing the implementation against the specification during code review, rather than reviewing the implementation against itself, keeps the connection between what was intended and what was built from eroding over time.</p>

<p>These are not rules that produce spec-first culture automatically. They are supports that make the discipline easier to maintain than to abandon.</p>

<h2 id="where-the-investment-shows">Where the Investment Shows</h2>

<p>The benefits of specification-first development do not appear at the moment the specification is written. They appear in the implementation phase, which tends to be shorter because the hard questions were answered before any code depended on the answers. They appear in code review, which surfaces real issues rather than resolving ambiguities that were never clarified. And they appear when a feature needs to change, because the team can read the specification and understand not just what the code does but why it was built that way.</p>

<p>Cognitive debt, which the <a href="/agentic-development/2026/04/15/red-green-refactor-with-an-ai-in-the-loop.html">Red, Green, Refactor</a> post describes as the erosion of shared understanding that accumulates when output is accepted without genuine comprehension, does not accumulate in the same way in codebases where specifications were clear. The reasoning behind each feature is preserved in the document that existed before any code was written, and that document is what makes the subsequent implementation readable rather than opaque.</p>

<p>Where this becomes concrete: the features that required the most correction after implementation are almost always the ones where the thinking happened during coding rather than before it, and the teams that can modify their own code with confidence are almost always the ones where clarity about what was built, and why, was established before a line of it existed. Once the discipline of spec-first thinking is in place, the next question becomes execution: how an agent works through a structured backlog of stories without losing track of intent or silently building on broken foundations. The <a href="/agentic-development/2026/04/17/arc-the-agentic-release-cycle.html">ARC: The Agentic Release Cycle</a> post explores that machinery in detail.</p>]]></content><author><name></name></author><category term="agentic-development" /><category term="specification" /><category term="agentic-development" /><category term="workflow" /><category term="discipline" /><summary type="html"><![CDATA[There is a specific pull toward starting with code. Requirements arrive with enough shape to begin, the architecture is familiar, and writing code gives you something concrete to hold, something that compiles or fails and tells you where you are. Specification work has none of that feedback. You are reasoning about behavior that does not exist yet, writing assertions that no system can evaluate, and the result is a document rather than a function. 
The cost of writing the document is visible; the cost of not writing it is not, at least not immediately.]]></summary></entry><entry><title type="html">Red, Green, Refactor With an AI in the Loop</title><link href="https://arefiva.github.io/agentic-development/2026/04/15/red-green-refactor-with-an-ai-in-the-loop.html" rel="alternate" type="text/html" title="Red, Green, Refactor With an AI in the Loop" /><published>2026-04-15T09:00:00+00:00</published><updated>2026-04-15T09:00:00+00:00</updated><id>https://arefiva.github.io/agentic-development/2026/04/15/red-green-refactor-with-an-ai-in-the-loop</id><content type="html" xml:base="https://arefiva.github.io/agentic-development/2026/04/15/red-green-refactor-with-an-ai-in-the-loop.html"><![CDATA[<p>Test-Driven Development rests on a deceptively simple mantra: Red goes to Green goes to Refactor. Write a failing test, make it pass, then clean it up. Repeat. What makes TDD powerful is not the tests themselves, but the discipline the cycle enforces. The red phase forces you to understand what you are building before you build it, because you cannot write a failing test for something you have not thought through. That clarity shapes how quickly and smoothly the green and refactor phases unfold. The green phase follows: write the minimum code to make the test pass, because the test is the specification. Then refactor, using a passing suite as a safety net to clean up the implementation while locking behavior in place. Do this across a career, and you end up with a codebase that is both well-tested and well-understood, because every piece was thought through before it existed.</p>

<p>I have been building software with an agentic workflow for a while now, and I notice the workflow follows the same cycle as TDD, just with different tools and a different executor.</p>

<h2 id="the-agentic-cycle">The Agentic Cycle</h2>

<p>The core of this approach is a specification that takes the familiar structure of a Product Requirements Document and adapts it for a fundamentally different audience: an autonomous agent rather than a human reader. That shift in audience shapes everything about how the specification is written, because a human reader can interpret ambiguous requirements, ask follow-up questions, and fill in gaps from context, while an agent cannot. The rigor it demands is borrowed from TDD: before a single line of code is written, the specification must answer the same questions a practitioner asks when writing a test, such as what specific behavior must happen, what edge cases matter, what state changes are observable, and what verification would prove this is complete.</p>

<h3 id="writing-the-specification">Writing the Specification</h3>

<p>When starting a feature, you move through a structured conversation in plan mode, describing the feature, the user stories that compose it, and the acceptance criteria that would verify each story is done. This conversation forces you to articulate not just what to build, but what done looks like. This is the red phase: defining the specification before any code exists, with the same discipline as writing a failing test. Once complete, the specification is captured as an AES, or Autonomous Execution Specification, a structured machine-consumable document that descends from the conventional PRD but is designed for a different audience and a different executor. Each story is decomposed to touch a small, coherent slice of the codebase, carrying specific implementation guidance about which files and patterns to follow, and acceptance criteria expressed as verifiable assertions rather than vague descriptions. A single story might look like this:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"id"</span><span class="p">:</span><span class="w"> </span><span class="s2">"US-012"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"title"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Position Aggregation by Book"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"description"</span><span class="p">:</span><span class="w"> </span><span class="s2">"As the risk engine, I need to aggregate open positions by book so that downstream P&amp;L calculations receive pre-netted exposures."</span><span class="p">,</span><span class="w">
  </span><span class="nl">"priority"</span><span class="p">:</span><span class="w"> </span><span class="mi">2</span><span class="p">,</span><span class="w">
  </span><span class="nl">"complexity"</span><span class="p">:</span><span class="w"> </span><span class="s2">"standard"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"riskLevel"</span><span class="p">:</span><span class="w"> </span><span class="s2">"low"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"preferredModel"</span><span class="p">:</span><span class="w"> </span><span class="s2">"gpt-5.3-codex"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"dependsOn"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"US-010"</span><span class="p">],</span><span class="w">
  </span><span class="nl">"implementationNotes"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Add an aggregate_by_book method to PositionService in src/risk/position_service.py. Group positions by book_id and sum net_quantity. Follow the existing pattern in aggregate_by_counterparty."</span><span class="p">,</span><span class="w">
  </span><span class="nl">"acceptanceCriteria"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
    </span><span class="s2">"aggregate_by_book returns a dict keyed by book_id with summed net_quantity"</span><span class="p">,</span><span class="w">
    </span><span class="s2">"Positions with zero net_quantity are excluded from the result"</span><span class="p">,</span><span class="w">
    </span><span class="s2">"Unit tests cover empty input, single book, and multiple books with offsetting legs"</span><span class="p">,</span><span class="w">
    </span><span class="s2">"ruff check passes with no errors"</span><span class="w">
  </span><span class="p">],</span><span class="w">
  </span><span class="nl">"passes"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
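<p>For a sense of what the green phase produces against a story like this, here is a sketch of an implementation that would satisfy those criteria. The <code>Position</code> shape is an assumption, since the real <code>src/risk/position_service.py</code> is not shown, and the sketch reads the zero-exclusion criterion as applying to the netted result per book.</p>

```python
from dataclasses import dataclass

# Sketch of an implementation satisfying US-012. The Position shape is an
# assumption; the real service and its existing patterns are not shown here.
@dataclass
class Position:
    book_id: str
    net_quantity: float

class PositionService:
    def aggregate_by_book(self, positions: list[Position]) -> dict[str, float]:
        """Sum net_quantity per book_id, dropping books that net to zero."""
        totals: dict[str, float] = {}
        for p in positions:
            totals[p.book_id] = totals.get(p.book_id, 0.0) + p.net_quantity
        return {book: qty for book, qty in totals.items() if qty != 0.0}
```

<p>Notice how directly the acceptance criteria map onto test cases: empty input, a single book, and offsetting legs are all checkable without asking a follow-up question.</p>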

<p>The rigor came first, during the red phase, which is what makes the downstream execution reliable.</p>

<h3 id="autonomous-execution">Autonomous Execution</h3>

<p>An autonomous agent then executes the green phase automatically, story by story. For each one, it reads the codebase to understand conventions, generates code that fits the established architecture, runs the build and test suite, and marks the story as passed only when acceptance criteria are satisfied. Once all stories are implemented, an automated code review runs across the entire branch, identifying logic errors, missing edge cases, and architectural mismatches. Issues discovered here feed back into the cycle as new stories, which the agent executes until both acceptance criteria and review findings are addressed.</p>
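<p>The loop just described can be sketched as a scheduler over the AES stories. <code>implement</code> and <code>review</code> stand in for the agent and the automated reviewer; the story dicts reuse the <code>dependsOn</code> and <code>passes</code> fields from the AES, but the harness itself is a hypothetical simplification.</p>

```python
# Schematic of the execution loop: run each story once its dependencies have
# passed, then fold review findings back into the queue as new stories.
# implement() and review() are hypothetical stand-ins for the agent.
def run_stories(stories, implement, review):
    done: set[str] = set()
    queue = list(stories)
    while queue:
        ready = [s for s in queue if set(s.get("dependsOn", [])) <= done]
        if not ready:
            raise RuntimeError("unresolvable story dependencies")
        for story in ready:
            implement(story)            # green phase: code until criteria pass
            story["passes"] = True
            done.add(story["id"])
            queue.remove(story)
        if not queue:
            queue.extend(review())      # review findings re-enter as stories
    return done
```

<p>The shape matters more than the details: nothing is marked done out of dependency order, and review findings travel through the same gate as the original stories.</p>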

<h3 id="human-review">Human Review</h3>

<p>The refactor phase remains entirely human. You open the diff, read the commits in the context of the specification you wrote, and look for implementation choices that are technically correct but architecturally wrong for your codebase. You refactor what needs it and stay genuinely in touch with the codebase, not as a rubber stamp but as an informed collaborator who understands the requirement and can evaluate whether the execution matches it.</p>

<h2 id="the-red-phase-remains-central">The Red Phase Remains Central</h2>

<p>In traditional TDD, what distinguishes a productive red phase from a rushed one is the time spent interrogating the requirement before any code exists. Writing an AES is the same exercise: answering hard questions about what the user story actually is, what acceptance criteria would verify it is done, and which patterns the codebase already has that should be followed. If you skip that thinking and hand a vague brief to an agent, you get vague code, just as a vague test produces vague software. The difference from a conventional PRD is not in the questions asked but in the precision required to answer them: a specification read by humans can imply intent and trust interpretation, but one executed by an agent cannot, because there is no one on the other side to fill in the gaps. The thinking that goes into a well-formed AES is where the real intellectual work of the cycle lives, and it is deliberately kept with the human.</p>

<h2 id="connection-through-understanding">Connection Through Understanding</h2>

<p>The obvious benefit of agentic development is throughput. As I explored in the <a href="/agentic-development/2026/04/15/welcome-to-agentic-development.html">Welcome to Agentic Development</a> post, an agent can reason about a goal and execute a sequence of actions without needing explicit procedural instructions for each step. Less obvious, and what took me longer to appreciate, is that the cycle keeps you genuinely in the codebase. When you review the agent’s output, you are not approving a pull request you half-skimmed, but reading code that implements a specification you worked through in detail before a single function existed. You know what each story was supposed to do, which edge cases were in scope, and what acceptance criteria the implementation was measured against, so when something looks off, you recognize it immediately. The review is not a formality but an informed evaluation, and that is only possible because you still carry the reasoning behind the spec. This is the same dynamic that gives TDD its staying power: the act of writing tests forces understanding before implementation, and that understanding is what makes every subsequent review meaningful rather than cursory.</p>

<h2 id="combining-the-approaches">Combining the Approaches</h2>

<p>The two approaches strengthen each other precisely because they enforce the same discipline at specification time. When writing an AES, you apply the same rigor a TDD practitioner brings to writing tests, framing acceptance criteria as specific, testable behaviors. What does the function return when given empty input? What happens when an edge case occurs? How should the system behave when constraints conflict? These questions distinguish a thorough AES from a vague one, and they are exactly the questions that make tests meaningful rather than ornamental.</p>

<p>You get the throughput benefits of agentic development and the agent’s ability to handle scaffolding and boilerplate quickly, while maintaining the upfront thinking that prevents vague code. The human writes the specification in structured form, the agent implements against it, and the human reviews the result with the spec in hand, knowing exactly what was supposed to happen.</p>

<h2 id="avoiding-cognitive-debt">Avoiding Cognitive Debt</h2>

<h3 id="the-risk">The Risk</h3>

<p>Agentic development carries a specific risk worth naming directly: cognitive debt. The failure mode is subtle, because it does not show up in the build or the tests. An agent generates code, the acceptance criteria pass, and the temptation is to skim the diff and move on, treating the output as a black box rather than as a system you genuinely understand. Over time, technical understanding quietly erodes. The team knows the code runs, but not why it was designed that way, and not what breaks when requirements shift. What makes this distinct from technical debt is where it lives: technical debt sits in the code, where it is at least visible and can be measured, but cognitive debt sits in people, in the team’s shared understanding of the system, and once it accumulates, the team gradually loses the ability to modify their own codebase with confidence.</p>

<h3 id="the-prevention">The Prevention</h3>

<p>The agentic cycle is explicitly designed to prevent this. The AES phase is where the genuine intellectual work happens: articulating what to build, why, what the constraints are, and what done looks like. The agent handles the mechanical repetition of implementation, but the thinking about how a feature fits into the existing architecture, what the edge cases are, and which patterns belong remains entirely with you, recorded in the specification before a line of code is written. When the AES is clear and thorough, the subsequent implementation becomes comprehensible because you already understand what it was supposed to achieve. The mandatory human review reinforces this: you do not glance at code and approve it, but read the commits in the context of acceptance criteria you defined. When something looks off, you recognize it because the mismatch between what you specified and what was implemented is now visible. This is how shared understanding persists and how cognitive debt is kept at bay.</p>

<h2 id="the-shape-of-the-work">The Shape of the Work</h2>

<h3 id="the-shortcut">The Shortcut</h3>

<p>There is a tempting shortcut in agentic development: tell an autonomous agent what you want in broad strokes and let it figure out the details. Sometimes this works. More often, you get code that technically compiles but does not quite fit, with the wrong abstraction or a pattern that clashes with the rest of the codebase. And often, you lose track of why certain decisions were made in the first place. The same failure mode exists in TDD when you skip the red phase and write code first, then retrofit tests to match. The tests pass, but they are testing the wrong thing.</p>

<h3 id="the-discipline">The Discipline</h3>

<p>The discipline is the same in both cases. Slow down at the specification phase, because the more clearly you can express what you need in a test or in acceptance criteria, the better the execution phase goes, whether the executor is you, an automated test runner, or an autonomous agent. Treat understanding as something you deliberately build, not something that just happens, and write down the why when decisions are made so that shared understanding persists. The agentic cycle has not replaced TDD’s core insight but given it a new set of tools to run on, and when combined deliberately, especially in how the red phase is structured, the two approaches reinforce each other in ways that neither achieves alone.</p>

<p>Where this becomes tangible: the specification is where you articulate what done looks like, not in broad strokes but with enough specificity that someone, or something, could verify it without asking a follow-up question. Frame acceptance criteria as assertions, not descriptions. When the agent delivers, review the result against what you specified, not just against whether it compiles and the tests pass. The difference between getting throughput without cognitive debt and getting technically correct code that misses the mark often comes down to that one phase: how clearly the requirement was thought through before a line existed.</p>

<p>The <a href="/agentic-development/2026/04/16/specification-first-development-the-mental-shift.html">Specification-First Development: The Mental Shift</a> post explores this transition in deeper detail, examining both the psychological resistance to shifting from code-first thinking and the structural patterns that help teams maintain specification discipline at scale.</p>]]></content><author><name></name></author><category term="agentic-development" /><category term="TDD" /><category term="agentic-development" /><category term="workflow" /><category term="specification" /><summary type="html"><![CDATA[Test-Driven Development rests on a deceptively simple mantra: Red goes to Green goes to Refactor. Write a failing test, make it pass, then clean it up. Repeat. What makes TDD powerful is not the tests themselves, but the discipline the cycle enforces. The red phase forces you to understand what you are building before you build it, because you cannot write a failing test for something you have not thought through. That clarity shapes how quickly and smoothly the green and refactor phases unfold. The green phase follows: write the minimum code to make the test pass, because the test is the specification. Then refactor, using a passing suite as a safety net to clean up the implementation while locking behavior in place. 
Do this across a career, and you end up with a codebase that is both well-tested and well-understood, because every piece was thought through before it existed.]]></summary></entry><entry><title type="html">Welcome to Agentic Development</title><link href="https://arefiva.github.io/agentic-development/2026/04/15/welcome-to-agentic-development.html" rel="alternate" type="text/html" title="Welcome to Agentic Development" /><published>2026-04-15T08:00:00+00:00</published><updated>2026-04-15T08:00:00+00:00</updated><id>https://arefiva.github.io/agentic-development/2026/04/15/welcome-to-agentic-development</id><content type="html" xml:base="https://arefiva.github.io/agentic-development/2026/04/15/welcome-to-agentic-development.html"><![CDATA[<p>Welcome to the <strong>Agentic Development</strong> section of this blog. This is where I explore building software systems that use AI agents to reason about problems, make decisions, and take action with significant autonomy. These systems are not merely orchestration layers or workflow runners that execute predetermined sequences. Rather, they represent a meaningful shift in how we think about delegating work and structuring software itself.</p>

<h2 id="what-is-agentic-development">What is Agentic Development?</h2>

<p>Agentic development is the practice of designing and building software where AI agents operate as active participants in accomplishing tasks, not as passive executors of fixed scripts. What distinguishes this from traditional automation is the degree of reasoning involved. Traditional automation follows a predefined sequence: if condition A, then execute step B, then execute step C. Agentic systems, by contrast, receive a goal or a problem description and reason about which actions to take, in what order, based on what they observe and the feedback they receive along the way. An agent operates by perceiving its environment, reasoning about possible actions, executing them, observing the results, and adjusting its approach based on what it learns. This means the agent can:</p>

<ul>
  <li><strong>Reason about goals</strong> and decompose them into a sequence of actions rather than following a rigid predetermined path</li>
  <li><strong>Use tools</strong> such as search, code execution, APIs, and file systems to gather information, take action, and observe the results</li>
  <li><strong>Adapt its approach</strong> based on what it discovers at each step, adjusting its strategy or trying alternative paths when an approach does not yield the expected results</li>
  <li><strong>Collaborate</strong> with other agents or with humans, asking for clarification when needed and incorporating feedback into its reasoning</li>
</ul>
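<p>Stripped to its skeleton, that perceive-reason-act cycle looks something like the loop below. The <code>reason</code> callable stands in for the model and <code>tools</code> is a mapping of callables; the names and the action format are illustrative, not any real framework’s API.</p>

```python
# Minimal perceive-reason-act loop. reason() picks the next action from the
# goal and the observations so far; "done" ends the loop. All names are
# illustrative assumptions, not a real agent framework.
def run_agent(goal, tools, reason, max_steps=10):
    observations = []
    for _ in range(max_steps):
        action = reason(goal, observations)
        if action["tool"] == "done":
            return action.get("result")
        result = tools[action["tool"]](**action.get("args", {}))
        observations.append({"action": action, "result": result})
    raise TimeoutError("step budget exhausted before the goal was met")
```

<p>The sequence of tool calls is not written anywhere in this harness; it emerges from what <code>reason</code> decides to do with each new observation.</p>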

<h2 id="how-an-agent-works-in-practice">How an Agent Works in Practice</h2>

<p>Consider a code review task. You provide the agent with a high-level instruction such as “Review src/auth.py for potential bugs and architectural issues” along with tools it can use: the ability to read files, run tests, execute commands, and query the codebase. The agent then reasons about how to accomplish this. It might start by reading the file to understand what the code does, then run the existing test suite to see what behaviors are currently verified, then run static analysis tools to check for obvious issues, and finally synthesize a review based on all of this information. At no point did you specify the exact sequence of steps, the order in which to run the tools, or what decision to make when one tool’s output contradicts another. The agent made those decisions by reasoning about the goal and the results of each action. This is fundamentally different from a script that says “always read the file first, then always run tests, then always check style.” The agent’s sequence emerges from its reasoning about what information is needed to accomplish the goal.</p>

<h2 id="the-shift-in-how-we-specify-work">The Shift in How We Specify Work</h2>

<p>There is a fundamental difference between building traditional software and building agentic systems, and it lies in how we specify what needs to be done. Traditional software development requires developers to write imperative code: this happens first, then that happens, then check this condition, and if true then do that. Every decision point, every fallback path, every sequence must be written out explicitly. The software is a series of instructions for a computer to follow. Agentic systems change this dynamic. Instead of specifying the how, you specify the what and the why. You describe the goal, the constraints, the tools available, and what success looks like. The agent then reasons about how to achieve it.</p>

<p>This shift is not merely a convenience; it reflects a different way of thinking about problem-solving. When you write imperative code, you must anticipate all possible paths through a system. When you specify a goal for an agent, you articulate what you want to happen and trust the agent’s reasoning to find a path. This requires a different discipline in how we think about software design:</p>

<ul>
  <li><strong>Goals and constraints</strong> must be clear enough that an autonomous reasoner can understand what success looks like without needing to ask a follow-up question</li>
  <li><strong>The tools and capabilities</strong> available to the agent must be well-designed so that the agent can effectively gather information and take meaningful action</li>
  <li><strong>Feedback mechanisms</strong> must be explicit so that the agent can recognize when it has gone off track and correct itself</li>
</ul>

<p>These three elements form the foundation of agentic systems. When they are well-designed, an agent can accomplish complex tasks. When they are poorly specified, the agent produces technically correct output that misses the point.</p>
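<p>What such a specification might look like as data rather than as imperative steps: the field names and the success check below are illustrative assumptions, but the shape shows all three elements, the goal and constraints, the tool surface, and an explicit check the harness can evaluate.</p>

```python
# A goal handed to an agent, sketched as data. The field names and the
# success predicate are illustrative, not a real framework's schema.
task = {
    "goal": "Review src/auth.py for potential bugs and architectural issues",
    "constraints": ["do not modify files", "finish within 20 tool calls"],
    "tools": ["read_file", "run_tests", "run_linter"],
    # Success expressed as something evaluable, not a vibe.
    "success": lambda report: (
        "findings" in report
        and all("file" in f and "severity" in f for f in report["findings"])
    ),
}
```

<p>The success predicate is the part most specifications omit, and it is the part that determines whether "done" means anything.</p>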

<h2 id="whats-next">What’s Next</h2>

<p>In the posts that follow, I plan to explore specific patterns and disciplines for building agentic systems effectively. Some posts will examine agent architectures and the tradeoffs between single-agent and multi-agent systems. Others will focus on tool design: how to create tools that agents can use reliably to gather information and take action. I will also cover evaluation and testing, since traditional testing approaches do not translate directly to systems where the agent’s path is not predetermined. And throughout, I will share case studies drawn from real work that show how these patterns play out in practice and where the discipline matters most.</p>

<p>I am still learning this space myself. The patterns are emerging, and there is much I do not yet understand about what works reliably and where the boundaries of the approach lie. I remain genuinely open to different approaches and to being wrong about what I think I know.</p>

<hr />

<p><em>This blog is built with <a href="https://jekyllrb.com">Jekyll</a> and hosted on GitHub Pages.</em></p>]]></content><author><name></name></author><category term="agentic-development" /><category term="intro" /><category term="agents" /><category term="AI" /><summary type="html"><![CDATA[Welcome to the Agentic Development section of this blog. This is where I explore building software systems that use AI agents to reason about problems, make decisions, and take action with significant autonomy. These systems are not merely orchestration layers or workflow runners that execute predetermined sequences. Rather, they represent a meaningful shift in how we think about delegating work and structuring software itself.]]></summary></entry></feed>