Reading the SlopCodeBench Paper

A paper called SlopCodeBench has been circulating as evidence that AI agents fundamentally lack design discipline. The headline findings are striking: no agent completes any of the 20 problems end-to-end, code quality degrades monotonically across checkpoints, agent code is 2.2 times more verbose than code from 20 maintained open-source repositories, and structural erosion increases in 80% of trajectories. The study covers 11 models, including Claude Code and Codex operating through their native CLI harnesses. The findings are real. The question is what they actually measure.

What the benchmark tests

SlopCodeBench evaluates agents on iterative software development tasks: build an initial implementation, then extend it across multiple checkpoints as specifications evolve, each time building on the agent’s own prior code. Passing tests at each step is not the same as producing extensible code, and most benchmarks never test the difference. An agent that hardcodes language-specific logic at the first checkpoint will face cascading rewrites as the specification expands, while one that builds an extensible architecture does not. The benchmark captures that distinction, and the results show agents consistently choosing the path that makes their own lives harder downstream.

But the evaluation protocol contains a constraint that shapes the entire result: “The agent must reason about changes solely from the code’s current structure as we do not provide the prior conversation’s context.” Between checkpoints, Docker containers reset. Shell history, session data, and any accumulated reasoning are wiped. The agent sees the updated specification and the code as it currently stands. Nothing more.

The prompt is also deliberately minimal. Only two requirements are given: keep a requirements.txt updated and write a named entrypoint script. The authors frame this as placing “the burden of good coding strategy on the agent and its harness.” No patterns to follow. No guidance about which architectural decisions already exist and why. No acceptance criteria beyond what the specification prose implies.

The load-bearing constraint

Removing context between checkpoints encodes a specific model of how AI development works, one that does not match how people who are making AI development work actually operate.

In the approach I described in Red, Green, Refactor With an AI in the Loop, each story in what I call an Autonomous Execution Specification (AES) carries explicit implementation notes: which files to touch, which patterns to follow, which existing abstractions to reuse, and which edge cases to handle. Acceptance criteria are written as verifiable assertions before a line of code exists. The agent is not asked to infer architecture from the code’s current structure, but is given the architecture as context, because the human worked through that thinking during the specification phase.

The SlopCodeBench agent is given none of this. It inherits code it produced earlier but has no memory of why it made the choices it did. When the specification expands, it patches rather than refactors, because it cannot tell the difference between a structural decision that was intentional and one that was incidental. The result is exactly what the paper observes: logic piles into existing functions, complexity concentrates, duplicated code grows. This is predictable, and it tells us something real about what happens when you strip context from the interaction. What it does not tell us is that iterative AI-assisted development produces degrading code.

What the prompt intervention study shows

The researchers also tested quality-aware prompts, instructions that specifically asked agents to write clean, extensible, non-verbose code from the start. Initial verbosity and erosion improved. The degradation continued anyway.

This is the most interesting result in the paper, and it points in the same direction as the rest of the evidence. Prompting for quality at invocation time is not the same as providing the structural context that would allow the agent to make better architectural decisions. Telling an agent to “write clean code” without telling it what the codebase’s existing patterns are, what the acceptance criteria for this specific story are, and what architectural constraints already exist is asking it to invent good structure from nothing. The initial improvement shows models do respond to quality signals. The continued degradation shows that surface-level prompting does not substitute for the reasoning done before execution.

What the findings do tell us

Stripped of prior context and given only a specification and the current state of a codebase, AI agents produce code that degrades in recognizable ways across iterations. Complexity concentrates in functions that were never designed to carry it. Duplicated code accumulates because there is no persistent understanding of what already exists and why. Pass rates hold in the near term while the underlying architecture quietly erodes.

This is a real failure mode, and it is useful to have it documented at this level of detail. The paper shows clearly that it does happen. The question is under what conditions it is triggered. The conditions are specific: no prior context, minimal structural guidance, no human review between iterations, no specification-level reasoning about which patterns to follow. Those conditions describe what happens when that layer of human thinking is removed from the loop.

The METR productivity study measured something structurally similar in a different domain: developers who were highly experienced on their codebases, using AI tools with limited prior familiarity, in a setting where the AI lacked the deep contextual understanding the developers themselves carried. That study found slowdowns. This one finds quality degradation. Both results make sense given their settings. Neither is a verdict on the approach.