Acceptance Criteria Agents Can Actually Execute

Most advice on acceptance criteria is still aimed at humans. That's the bug.

If you're handing work to Cursor, Claude Code, Codex, or Gemini, human-readable isn't enough. The agent needs criteria it can execute. That means machine-verifiable. In practice, the useful test is simple. A criterion should be observable, atomic, and bounded. If it isn't, you're not writing acceptance criteria. You're writing a wishlist and asking a model to guess.

That's why the backlash against spec-driven development isn't wrong. “Waterfall with markdown,” “productivity trap,” “spec drift,” “token-burning ceremony.” Fair. If your spec is long, vague, and full of fuzzy intent, it will absolutely waste time. Simon Willison has pushed hard on the failure modes of sloppy AI-assisted workflows, and Birgitta Böckeler's spec-first, spec-anchored, and spec-as-source framing matters here because not every “spec workflow” deserves the same label. Addy Osmani's spec-writing guide for AI agents also lands on the same pressure point: more text doesn't fix bad instructions.

The fix isn't “write bigger specs.” It's to write criteria the agent can check without asking you what you meant. If the criterion can't become a test, the agent will improvise. That's where drift starts. If you've felt that pain already, the critique in this piece on the vibe coding wall probably sounds familiar.

The Backlash Is Right: Vague Specs Are a Trap
Why Your Current Acceptance Criteria Fail AI Agents
The Anatomy of an Executable Acceptance Criterion
Actionable Examples for Common Scenarios
Tailoring Criteria for Different AI Agents
- What changes by agent
- How to adapt without rewriting everything
Putting It All Together in Your Workflow
- A simple workflow that holds up

The Backlash Is Right: Vague Specs Are a Trap

The backlash against specs is mostly backlash against fake precision.

Teams write a polished feature brief, add four bullets under “acceptance criteria,” and act like the work is now controlled. Then an agent reads this:

user can update profile
error handling should be graceful
page should load quickly
UI should feel modern

That list does not constrain implementation. It does not define success. It gives the agent room to guess, and guesses are expensive in code.

I've seen this play out the same way in greenfield prompts and messy brownfield repos. The document looks organized. The output looks plausible. Then the agent changes the wrong layer, misses a business rule, or passes a happy path while breaking the state that mattered. That pattern is a big reason the prompting wall showed up in the first place, and that wall is what drove the spec-driven development (SDD) wave. More prose did not produce more control.

The underlying problem isn't the existence of specs. It's criteria written for human nodding instead of machine execution.

Practical rule: If two competent developers could read the same criterion and ship different behavior, the criterion is not ready for an agent.

That standard is stricter than “clear writing.” It asks a different question. Can a reviewer, a test runner, or the agent itself determine pass or fail from observable evidence? If not, the criterion is still a note, not an executable requirement.

This matters even more with autonomous agents because they optimize for completion. Give a human “make error handling graceful,” and they might ask follow-up questions. Give that to an agent, and it will often pick a pattern from prior training, wire in generic toast messages, and move on. Zephony's AI agent guide is useful here because it frames the agent as a system that needs explicit operating boundaries, not inspirational product language.

There's an older lesson behind all this. Acceptance criteria exist to define limits that can be checked. In validation-heavy fields, acceptance limits are tied to explicit parameters and observed outcomes, not broad intent statements (industry article on statistical tools for in-process acceptance criteria). Software teams do not need that terminology, but they do need the discipline. If nobody can say what evidence proves the feature passed, the spec is unfinished.

Shorter usually helps.

Agents work better with criteria that are observable, atomic, and bounded. One behavior. One boundary. One clear way to verify it. Long specs still have a place for context, constraints, and business background. Acceptance criteria do a different job. They tell the agent what must be true when the work is done.

Why Your Current Acceptance Criteria Fail AI Agents

Detailed user stories do not protect you from bad output. Vague acceptance criteria invite it.


The usual spec still assumes a human reader will fill in the blanks. An AI agent will not do that the way your senior engineer would. It has to turn text into actions, file edits, and checks. If the criteria say “support profile editing,” the agent has too many valid interpretations. It might change the schema, add a form, update validation, touch API contracts, or all four. That is not a failure of intelligence. It is a failure of the handoff.

Human-friendly is not executable

A criterion can sound reasonable to a product team and still be unusable for an agent. The gap is simple. Humans tolerate intent. Agents need pass or fail conditions.

These are common examples of criteria that read fine in grooming and fail in execution:

“The dashboard should load quickly.” No threshold, no page state, no measurement point.
“Show a helpful error.” No trigger, no placement, no copy rule, no expected status code.
“Support CSV import.” No column contract, no row limit, no malformed-row behavior.
“Keep the current architecture.” No boundary on what can change and what must stay untouched.

That is why generic advice to “write clearly” is not enough. The key shift is from human-readable acceptance criteria to machine-executable acceptance criteria. Agents need criteria with three properties: observable, atomic, and bounded. If a reviewer cannot tell what evidence proves success, the agent cannot reliably finish the task.
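A rough way to catch criteria like the ones above before they reach an agent is a small lint pass. This is a minimal sketch, not a real tool: the word list and the function name `find_vague_terms` are assumptions for illustration, and a clean result only means no known fuzzy word was found, not that the criterion is executable.

```python
import re

# Hypothetical word list; tune it to your own backlog vocabulary.
VAGUE_TERMS = ("quickly", "fast", "helpful", "graceful",
               "modern", "intuitive", "smooth", "clean")


def find_vague_terms(criterion: str) -> list[str]:
    """Return the vague terms that appear as whole words in a criterion."""
    words = set(re.findall(r"[a-z]+", criterion.lower()))
    return [term for term in VAGUE_TERMS if term in words]


print(find_vague_terms("The dashboard should load quickly."))
print(find_vague_terms("API returns 404"))
```

The first call flags `quickly`. The second returns an empty list because the criterion names an observable result instead of a feeling.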

A lot of agent tooling advice reaches the same conclusion from a different direction: an agent is a system operating inside explicit states and constraints, not a teammate who will infer missing rules.

Vague criteria usually fail before testing starts. They fail when the agent has to guess.

Brownfield repos punish guesswork

This problem gets sharper in an existing codebase.

Greenfield demos make weak criteria look serviceable because there is less history to break. Brownfield work is where loose language gets expensive. Existing repos have hidden coupling, naming conventions, side effects, partial abstractions, and migration baggage. A criterion like “add filtering to orders page” can send an agent into the wrong layer fast.

Typical failure patterns look like this:

Shared query logic gets moved into a page component because the criterion never named the correct boundary.
A new dependency gets added even though the repo already has an approved pattern for the same job.
Behavior gets duplicated because the criterion never pointed to the existing module the agent should extend.
The happy path passes while refunds, exports, or audit logs break without anyone noticing.

That is why “be specific” is still too soft. In brownfield work, acceptance criteria should also fence off blast radius. Name the route. Name the file or module. Name what must not change. If the task touches a risky area, say so directly.
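Fencing off blast radius can itself be a check. The sketch below assumes you can get the list of changed files from your version control tool; the paths, the patterns, and the function name `out_of_bounds` are hypothetical. Note that `fnmatch` lets `*` match across `/`, so `src/lib/*` also covers nested folders.

```python
from fnmatch import fnmatch


def out_of_bounds(changed_files, allowed, forbidden):
    """Return changed files that fall outside the fence for this task."""
    violations = []
    for path in changed_files:
        is_forbidden = any(fnmatch(path, pattern) for pattern in forbidden)
        is_allowed = any(fnmatch(path, pattern) for pattern in allowed)
        if is_forbidden or not is_allowed:
            violations.append(path)
    return violations


# Hypothetical diff for an "add filtering to orders page" task.
changed = [
    "src/pages/OrdersPage.tsx",
    "src/lib/orderQuery.ts",
    "src/billing/refunds.ts",
]
allowed = ["src/pages/OrdersPage.tsx", "src/lib/*"]
forbidden = ["src/billing/*"]

print(out_of_bounds(changed, allowed, forbidden))
```

Here the only violation is the refunds module, which is exactly the kind of edit a loose criterion would let through.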

Guidance on real codebases makes this point well. Acceptance criteria are harder in legacy systems because the risk is often architectural, not just functional (Mambo on acceptance criteria in real codebases).

Drift starts inside the criteria list

Teams often blame the model after a bad run. I usually blame the criteria first.

If one bullet says “allow admins to edit users” and another says “preserve current permissions behavior,” the agent now has to reconcile two incomplete instructions without a testable boundary. It will pick the most likely interpretation from training patterns and local code context. Sometimes that works. Sometimes you get a clean PR that makes a subtle change to authorization logic.

The practical standard is stricter than what many teams use today. Each criterion should describe one behavior, under one condition, with one observable result. If a single bullet can fail in three different ways, it is doing too much. If nobody can point to the exact output, DOM change, response code, log entry, or file diff that proves success, the criterion is still a note.

Treat each acceptance criterion like a tiny eval. The agent should be able to satisfy it, fail it, or ask for a missing constraint. Anything fuzzier burns tokens and raises the odds of a build that technically completes but should never have shipped.
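One way to picture the "tiny eval" idea is a function with exactly three outcomes. This is a sketch under assumptions: the `Outcome` names and the dictionary-of-evidence shape are mine, not a standard format.

```python
from enum import Enum


class Outcome(Enum):
    PASS = "pass"
    FAIL = "fail"
    NEEDS_CONSTRAINT = "needs_constraint"


def evaluate(expected, observed):
    """Compare expected evidence with observed evidence, key by key."""
    if not expected:
        # Nothing to check against: the criterion is still a note.
        return Outcome.NEEDS_CONSTRAINT
    for key, want in expected.items():
        if key not in observed or observed[key] != want:
            return Outcome.FAIL
    return Outcome.PASS


print(evaluate({"status": 403}, {"status": 403}))
print(evaluate({"status": 403}, {"status": 200}))
print(evaluate({}, {"status": 200}))
```

The third call is the interesting one. A criterion with no expected evidence cannot pass or fail, so the only honest result is a request for the missing constraint.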

The Anatomy of an Executable Acceptance Criterion

Human-readable acceptance criteria are not enough for an agent. The bar is higher. An agent needs criteria it can verify against the repo, the runtime, or the output without guessing what “done” means.

Figure: the anatomy of an executable acceptance criterion, using the Given, When, Then framework.

Given-When-Then is still the cleanest starting point because it forces a criterion into a form that can map to a test. The pattern came out of BDD and Gherkin for a reason. It turns vague intent into a condition, an action, and a result.

Start with Given When Then

Use each part for a specific job:

Given defines the starting state
When defines the trigger
Then defines the observable result

For AI work, that baseline usually needs two extra anchors:

Repository anchor. File path, route, table, command, fixture, or component name
Assertion anchor. Exact output, status code, DOM change, queued job, log line, or diff

Those anchors cut down the agent's search space. They also make failure obvious.
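If you keep criteria as data, the five parts can be checked for completeness before handoff. A minimal sketch, assuming a plain dataclass; the field names and the example criterion are hypothetical.

```python
from dataclasses import dataclass, asdict


@dataclass
class Criterion:
    given: str
    when: str
    then: str
    repo_anchor: str = ""       # file path, route, table, command, fixture
    assertion_anchor: str = ""  # exact output, status code, DOM change, log line

    def missing_parts(self):
        """Names of the fields that are still empty."""
        return [name for name, value in asdict(self).items() if not value.strip()]


draft = Criterion(
    given="an unauthenticated user",
    when="GET /settings/profile is requested",
    then="the response redirects to /login",
    assertion_anchor="status code 302",
)

print(draft.missing_parts())
```

The draft reads fine to a human, yet the check still reports `repo_anchor` as missing. That is the gap the agent would otherwise fill by searching the repo on its own.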

A browser task is a good example. “Test the checkout flow” leaves too much room for interpretation. “Run the guest checkout path on /checkout, click the coupon toggle, apply SAVE10, and confirm the total updates before payment submission” is something an agent can execute. If you need repeatable UI interactions, Browser Actions is a practical reference for making those steps explicit.

Here is a weak criterion:

Users can reset their password from the login page.

Here is one an agent can work with:

Given an existing user at tests/fixtures/users/password-reset-user.json
When the user submits their email in src/routes/login.tsx through the “Forgot password” form
Then the API returns 202, the UI does not reveal whether the account exists, and a job using template password_reset is queued for that address

That version does three useful things. It names the fixture. It names the surface area. It defines success in outputs the agent can inspect.
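To show how directly that criterion maps to a check, here is a sketch with an in-memory stand-in for the real handler. Everything in it is an assumption: the email address, the `submit_forgot_password` function, and the job shape are hypothetical, and comparing the two responses is only a proxy for "the UI does not reveal whether the account exists."

```python
# Stands in for the user loaded from the fixture file.
KNOWN_EMAILS = {"reset-user@example.com"}
job_queue = []


def submit_forgot_password(email):
    """Hypothetical handler: same response whether or not the account exists."""
    if email in KNOWN_EMAILS:
        job_queue.append({"template": "password_reset", "to": email})
    return {"status": 202, "body": "If the account exists, a reset email is on its way."}


def check_password_reset():
    """One named pass/fail result per claim in the criterion."""
    job_queue.clear()
    known = submit_forgot_password("reset-user@example.com")
    unknown = submit_forgot_password("nobody@example.com")
    expected_job = {"template": "password_reset", "to": "reset-user@example.com"}
    return {
        "returns_202": known["status"] == 202,
        "does_not_reveal_account": known == unknown,
        "job_queued": job_queue == [expected_job],
    }


print(check_password_reset())
```

Each key in the result corresponds to one clause of the Then line, so a failing run points at the clause that broke.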


Observable atomic bounded

These three properties matter more than whether the sentence sounds polished. If a criterion is missing one of them, the agent starts improvising.

Observable

The result has to be checkable by a machine.

Good:

error banner is visible
API returns 404
exported CSV includes header email
retry button becomes enabled after network recovery

Bad:

flow feels smooth
page is intuitive
code is clean
result looks right

“Observable” is the line between a spec and a wish. If success depends on taste, the agent has no stable target.

Atomic

One criterion should make one claim.

This is where teams frequently create avoidable mess. A single bullet like “user can upload avatar, crop it, preview it, and save changes without reloading” makes four claims and can fail in four different places. The agent may satisfy two of them, miss one, and still produce a PR that looks complete at a glance.

Split it instead:

Good:

user can upload a JPG avatar
crop selection persists before save
preview updates after crop confirmation
saved avatar is visible after refresh

The rule I use is simple. If a reviewer would need to ask “which part failed?”, the criterion is overloaded.
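The payoff of splitting shows up when something fails. In this sketch each of the four avatar criteria is its own named check, so the report answers "which part failed?" without a reviewer asking. The `observed` snapshot is hypothetical; in practice it would come from your test harness.

```python
def run_checks(checks, state):
    """Run each named check on its own so a failure points at one claim."""
    return {name: bool(check(state)) for name, check in checks.items()}


# Hypothetical snapshot of what was observed after the agent's change.
observed = {
    "uploaded_type": "image/jpeg",
    "crop_selected": (10, 10, 200, 200),
    "crop_before_save": (10, 10, 200, 200),
    "preview_updated": True,
    "avatar_after_refresh": None,  # the save did not persist
}

checks = {
    "user can upload a JPG avatar": lambda s: s["uploaded_type"] == "image/jpeg",
    "crop selection persists before save": lambda s: s["crop_before_save"] == s["crop_selected"],
    "preview updates after crop confirmation": lambda s: s["preview_updated"],
    "saved avatar is visible after refresh": lambda s: s["avatar_after_refresh"] is not None,
}

results = run_checks(checks, observed)
failed = [name for name, ok in results.items() if not ok]
print(failed)
```

With the combined bullet, this run would just be "failed." With atomic criteria, it names the persistence step and leaves the other three alone.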

Bounded

A criterion needs edges. Inputs, scope, environment, and expected outputs all count.

Without bounds, agents fill gaps with common patterns from training data or nearby code. That is how you get a “reasonable” implementation that inadvertently touches the wrong route, updates the wrong component, or handles the wrong status code.

Useful boundaries include:

Boundary type	Useful example
Input example	request body for `/api/invoices/import`
File reference	`src/components/ProfileCard.tsx`
Environment condition	unauthenticated user, seeded database
Expected output	returns `422` with field error on `email`
Negative case	malformed CSV row is skipped and logged

How far to constrain the agent

The trade-off is real. Too little detail and the agent freewheels. Too much detail and you turn acceptance criteria into implementation instructions that block better solutions.

The practical line is outcome plus boundaries.

Over-constrained:

create a React hook named useInvoiceFilter, call it from InvoicePage.tsx, use useMemo, and store state in local component state

Better:

filtering in src/pages/InvoicePage.tsx must preserve current URL query parameter behavior, support status and date-range filters, and keep existing route-level tests passing

That still gives the agent room to choose the implementation. It does not give it room to break behavior you care about.

Executable acceptance criteria are stricter than standard backlog notes. That is the point. Human-readable criteria help a team discuss the work. Machine-executable criteria help an agent complete the work, fail clearly, or ask for the missing constraint before it burns tokens and ships the wrong thing.

Actionable Examples for Common Scenarios

Theory is cheap. What matters is whether the criterion survives contact with the repo.

Frontend component example

You're updating a profile avatar component.

Vague acceptance criterion

“Users should be able to upload a profile picture and it should look good on the profile page.”

That asks the agent to infer file types, dimensions, validation, display behavior, and error states.

Executable acceptance criterion

Given a signed-in user on /settings/profile
When the user uploads a PNG file through src/components/settings/AvatarUpload.tsx
Then the client accepts the file, shows a preview before save, and stores the image reference after save so the avatar remains visible after page refresh

Add separate atomic criteria if you care about edge cases:

Given the same form
When the user uploads a non-image file
Then form submission is blocked and an inline validation message is shown

A lot of frontend drift comes from mixing multiple expectations into one line. Split display, validation, persistence, and error behavior.

Here's the transformation in table form.

Vague (Human-Centric) AC	Executable (Agent-Centric) AC
Profile image should look good	Given a user on `/settings/profile`, when a PNG is uploaded through `src/components/settings/AvatarUpload.tsx`, then a preview is shown before save and the saved avatar remains visible after refresh
Validation should be helpful	Given the avatar form, when a non-image file is selected, then submission is blocked and an inline validation message is shown

Backend API example

You need a new endpoint for creating project invites.

Vague acceptance criterion

“Add an endpoint to invite users to a project. Handle duplicates and permissions.”

That's three criteria pretending to be one.

Executable acceptance criteria

Given an authenticated project admin
When POST /api/projects/:id/invitations receives a valid email payload
Then the API returns 201 and creates a pending invitation linked to that project

Given a project member without admin permission
When the same endpoint is called
Then the API returns 403

Given an email with an existing pending invitation for the same project
When the endpoint is called again
Then the API does not create a second invitation and returns the existing pending state

Notice what changed. Each criterion is pass/fail. No buried assumptions. No “handle duplicates” hand waving.
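The three invitation criteria translate almost line for line into checks. The sketch below uses an in-memory stand-in for the endpoint, so the class, the user ids, and the emails are hypothetical. It also simplifies admin permission to a global set, and it assumes a `200` for the duplicate case because the criterion does not name a status code.

```python
class InviteService:
    """In-memory stand-in for POST /api/projects/:id/invitations."""

    def __init__(self, admins):
        self.admins = set(admins)  # user ids with admin permission
        self.invitations = {}      # (project_id, email) -> invitation

    def create(self, user_id, project_id, email):
        if user_id not in self.admins:
            return 403, None
        key = (project_id, email)
        if key in self.invitations:
            # Assumed status for duplicates; return the existing pending state.
            return 200, self.invitations[key]
        invitation = {"project_id": project_id, "email": email, "state": "pending"}
        self.invitations[key] = invitation
        return 201, invitation


service = InviteService(admins=["admin-1"])
first = service.create("admin-1", "p1", "new@example.com")
denied = service.create("member-1", "p1", "other@example.com")
again = service.create("admin-1", "p1", "new@example.com")

print(first[0], denied[0], again[0], len(service.invitations))
```

Writing the check exposed a gap in the third criterion: it says what must not happen but not which status comes back. That is the kind of missing constraint worth settling before the agent picks one for you.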

Data pipeline example

This one catches people because data tasks often hide edge cases behind “just process the file.”

Say you have a script that imports creator video metadata. If you need concrete fixtures, a practical guide for scraping YouTube videos is useful for thinking about source fields and repeatable extraction inputs.

Vague acceptance criterion

“The import script should process channel video data and skip bad rows.”

Still too loose.

Executable acceptance criteria

Given a CSV file in data/input/channel_videos.csv with headers video_id,title,published_at
When scripts/import_channel_videos.py runs
Then rows with all required fields are inserted into the target table

Given the same file contains a row with an empty video_id
When the script runs
Then that row is skipped and the script logs the skipped-row reason

Given a duplicate video_id already exists in storage
When the script runs
Then the existing record is not duplicated

That's the difference between “process the file” and “ship a script I can trust.”
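Here is a sketch of an import function that would satisfy those three criteria, with storage simplified to a dictionary keyed by `video_id`. The function name `import_rows`, the counter names, and the sample rows are assumptions; the real script and target table are not shown in this article.

```python
import csv
import io
import logging

logger = logging.getLogger("import_channel_videos")
REQUIRED = ("video_id", "title", "published_at")


def import_rows(csv_text, storage):
    """Insert valid rows into storage (a dict keyed by video_id); return counts."""
    inserted = skipped = duplicates = 0
    reader = csv.DictReader(io.StringIO(csv_text))
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        missing = [field for field in REQUIRED if not (row.get(field) or "").strip()]
        if missing:
            logger.warning("skipped line %d: missing %s", line_no, ",".join(missing))
            skipped += 1
        elif row["video_id"] in storage:
            duplicates += 1  # existing record is left untouched
        else:
            storage[row["video_id"]] = row
            inserted += 1
    return {"inserted": inserted, "skipped": skipped, "duplicates": duplicates}


SAMPLE = """video_id,title,published_at
a1,Intro,2024-01-05
,No id,2024-01-06
b2,Already stored,2024-01-07
"""

storage = {"b2": {"video_id": "b2", "title": "Already stored", "published_at": "2024-01-07"}}
print(import_rows(SAMPLE, storage))
```

Returning counts gives the agent and the reviewer the same evidence: one inserted, one skipped with a logged reason, one duplicate left alone.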

Performance criteria example

Performance criteria are where vague language really falls apart.

Meegle's guidance is the most practical version of the rule. For performance, specify response time, throughput, and load under named test conditions instead of qualitative language, and define the pass condition in one sentence (Meegle on performance acceptance criteria).

Bad:

the API should be fast
search should scale
page should handle traffic spikes

Better:

Given the search endpoint under the named load test scenario search-standard-load, when requests are executed at 50 concurrent users, then p95 latency remains under 200ms

That works because it gives the agent something to implement and something to verify. It also gives you a clean line between aspiration and acceptance gate.
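The gate itself is a few lines once the load test has produced latency samples. This sketch uses the nearest-rank method for p95, which is one common definition among several, and the sample numbers are made up. Running `search-standard-load` at 50 concurrent users is assumed to happen elsewhere.

```python
def p95(latencies_ms):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    ordered = sorted(latencies_ms)
    # Integer ceiling of 0.95 * n, as a 1-based rank.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def passes_gate(latencies_ms, limit_ms=200):
    """Pass only when p95 stays strictly under the limit."""
    return p95(latencies_ms) < limit_ms


# Made-up samples: 100 requests, two of them slow outliers.
samples = [120] * 94 + [180] * 4 + [450] * 2
print(p95(samples), passes_gate(samples))
```

Two outliers out of a hundred do not fail the gate, because p95 lands on a 180ms sample. Six outliers would, which is the behavior "under 200ms at p95" is supposed to encode.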

Tailoring Criteria for Different AI Agents

Not all coding agents fail in the same way. The format should stay stable. The emphasis changes.

Figure: comparison of basic and advanced AI agents executing acceptance criteria.

If you've worked with multiple tools, you've seen this already. One agent handles broad repo context well but gets sloppy with local constraints. Another follows the exact step but misses the architectural implication. Another is strong on implementation but weaker on self-checking.

What changes by agent

Here's the practical breakdown.

Agent	Usually helps	Usually needs help
Cursor	Multi-file changes, IDE-local code navigation, file-aware instructions	Explicit boundaries when tasks span architecture concerns
Claude Code	Large-context reasoning, planning, reading long specs	Short repeated constraints near the task so it doesn't drift inside a long plan
Codex	Clear implementation tasks with direct acceptance checks	More explicit repository anchors and examples
Gemini	Broad synthesis and planning across docs and code	Tighter pass/fail wording for execution details

This isn't a ranking. It's a reminder that instruction-following is not a single thing.

If your spec references many files, shared abstractions, or older modules, repo awareness matters more than prompt cleverness. That's where codebase-aware AI planning becomes the useful layer, because criteria tied to real files, routes, and existing patterns survive handoff better than generic prompts.

How to adapt without rewriting everything

You don't need a custom methodology per tool. Keep the criterion shape stable and adjust the wrapper.

For example:

For Cursor, put file paths directly in the criterion or adjacent task note.
For Claude Code, restate strict constraints at the end of the prompt, especially if the surrounding spec is long.
For Codex, include concrete inputs and expected outputs early.
For Gemini, pair high-level intent with one worked example so it doesn't generalize too broadly.

If an agent tends to infer too much, reduce ambiguity. If it tends to miss context, add anchors.

There's also a hard truth here. Better agents do not remove the need for executable criteria. They just hide the cost longer. A strong model can appear correct while still violating one unstated edge case in a brownfield repo.

That's why acceptance criteria that agents can execute should stay tool-agnostic at the core:

observable result
atomic assertion
bounded input and output

Everything else is adaptation.

Putting It All Together in Your Workflow

Teams do not lose time because they skipped Given-When-Then. They lose time because their acceptance criteria never became something a machine could check.


Once you treat acceptance criteria as executable constraints, the workflow changes. The agent is no longer reading a polite description of success. It is working against a set of observable, atomic, and bounded checks. If those checks drift after the first prompt, the build drifts with them.

Hamel Husain and Shreya Shankar's AI Evals for Engineers & PMs is useful for the same reason. It frames quality as something you can test, not something you hope the model inferred from context.

A workable handoff for a solo builder usually includes six parts:

Problem statement with explicit scope boundaries
Repository context with likely files, modules, routes, or tables
Subtasks small enough to validate one by one
Acceptance criteria written so an agent or script can verify pass or fail
Validation scenarios covering the happy path, failure path, and one ugly edge case
Assumptions tagged by risk so guesses are visible before they hit production

That structure matters because human-readable specs still fail if the checks are fuzzy. “Users can log in smoothly” reads fine in a ticket. An agent cannot execute it. “Given a valid email and password, POST /api/login returns 200 and sets an httpOnly session cookie” gives the model something concrete to build and gives you something concrete to test.
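The login criterion above can be checked from the response alone. This is a sketch that assumes you already have the status code and the raw `Set-Cookie` header in hand; the cookie name `session` and the function name are assumptions, since the criterion does not name the cookie.

```python
from http.cookies import SimpleCookie


def check_login_response(status, set_cookie_header, cookie_name="session"):
    """Return pass/fail for the two observable claims in the criterion."""
    cookie = SimpleCookie()
    cookie.load(set_cookie_header)
    morsel = cookie.get(cookie_name)
    return {
        "returns_200": status == 200,
        "sets_httponly_cookie": morsel is not None and bool(morsel["httponly"]),
    }


print(check_login_response(200, "session=abc123; Path=/; HttpOnly"))
print(check_login_response(200, "session=abc123; Path=/"))
```

The second response would look fine in a quick skim of the UI. The check fails it because the `HttpOnly` flag is missing.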

I learned to separate planning from coding for the same reason. Tekk.coach connects to a GitHub repo, asks codebase-grounded clarifying questions, and produces a structured spec with acceptance criteria, scope boundaries, and validation scenarios that you then hand to Cursor, Claude Code, Codex, or Gemini yourself. It does not run the external agent and it does not open the PR. That separation keeps the spec stable enough to audit.

A simple workflow that holds up

Use a loop that keeps criteria executable from the first draft to the final check.

Start with the smallest shippable change
Skip broad goals like “build onboarding.” Write the first unit of value, such as “collect company name on step one and reject blank input.”
Draft criteria against observable outputs
Name the route, UI state, file, command, payload, log line, or test result that proves completion.
Split every combined sentence
If one line covers behavior, validation, and performance, break it apart. Agents miss hidden conjunctions. Reviewers miss them too.
Attach bounded examples
Give one valid input and one invalid input. Define expected output for both. Bounded examples stop the model from generalizing past the task.
Run one agent against one spec
Multi-agent setups add overhead fast. Get the single-agent path stable first. If you are deciding whether orchestration is worth the complexity, this guide to AI agent orchestration patterns is a good reality check.
Accept only what was verified
Review against the criteria, not memory, intent, or a quick skim of the UI. If nobody checked it, it is still a guess.
Revise the spec when implementation changes
Living specs beat frozen docs. When a constraint changes, update the source of truth before the next prompt.
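The first and fourth steps combine naturally. Taking the "collect company name on step one and reject blank input" example, a bounded pair of one valid and one invalid input might look like the sketch below. The function name and the error message are hypothetical.

```python
def validate_company_name(value):
    """Return (ok, error) for the step-one company name field."""
    if not value or not value.strip():
        return False, "Company name is required."
    return True, None


# One valid input and one invalid input, each with its expected output.
EXAMPLES = [
    ("Acme Ltd", (True, None)),
    ("   ", (False, "Company name is required.")),
]

for given, expected in EXAMPLES:
    assert validate_company_name(given) == expected
```

Two examples are enough to pin the edges of this unit. Anything the agent adds beyond them, such as length limits or character rules, is a visible guess and not a silent one.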

A lot of failed “spec-driven” work is just polished ambiguity. The document looks organized, the prompts look thoughtful, and the agent still ships the wrong thing because the acceptance criteria were never machine-executable in the first place.

The workflow that survives is tighter than people expect. Small scope. Criteria that can be observed. Assertions that test one thing at a time. Inputs and outputs with hard edges.

Connect your GitHub repo. Describe the problem. Get a structured spec. Ship with Tekk.coach.

Part of the Spec-Driven Development pillar — a 52-page honest playbook on shipping with AI coding agents.
