Research · CLAUDE.md eval
The arc: 1. Cell tests · 2. Planned build · 3. Free-form build · 4. Takeaway
Report 4 of 4 · The takeaway

Your CLAUDE.md isn't doing what you think it's doing

126 builds, 9 popular CLAUDE.mds, one finding nobody is talking about: the plan does the work — and the md catches what the plan misses.
Headline
The plan is the pilot. CLAUDE.md is the co-pilot. Most people have it backwards.
When the prompt is tight, all 9 mds converge — your rules barely matter. When the prompt is loose, the md takes over and shapes the entire codebase. The skill isn't picking the right md — it's knowing when the md is flying the plane.

Four takeaways

1 The plan flies the plane. CLAUDE.md is the co-pilot.
Tight prompt → 9 different mds converge to roughly the same code. Loose prompt → the md becomes the structural decision-maker. Your md isn't a style guide. It's a fallback that activates when your prompt is weak.
2 Fewer rules → more stability.
The empty CLAUDE.md had the lowest run-to-run stdev on both real experiments — 0.12 on scraper, 0.00 on shop. Every rule you add is a new place for variance to enter.
3 All 9 mds produce good code.
26 of 27 free-form builds shipped. The differences between mds are noticeable, but every single one is good enough to use. Pick the one whose shape you like.
4 The best md combines both styles — kept minimal.
Declarative mds (Karpathy, Codex) won on stability. Flow-aware mds (mine) had real advantages too — v2 scored 2.81 on the runs that built, ahead of Karpathy's 2.70. The sweet spot: pair declarative conventions (docstrings, type hints, naming) with a few high-value flow rules (write tests for regressions, plan first) — and stop there. v3, v4, v5 lost because they kept adding.
📖 How to read this report
The thesis
The looser the prompt, the more your CLAUDE.md fills the gap.
Same 9 mds, same agent. Only the prompt's specificity changes.
Total builds: 126 (9 mds × 14 task-runs across 3 experiments)
LLM judgements: 378 (each build judged by Opus, Sonnet, Haiku on 6 rubrics)
Sample sizes: 1 → 3 → 3 (Report 1 matches the paper's N=1; Reports 2 & 3 raise to N=3)
Refusals: 1 of 126 (v2 on shop r2: a rule conflict overrode "build it")

The curve, in one chart

Each bar is the score gap between the best and worst CLAUDE.md on that experiment. Same 9 mds, same agent, only the prompt specificity changes.

Variant spread — score gap between best and worst CLAUDE.md (a larger number means the mds differ more; a smaller number means they converge):
Cell-sized tasks (8 tasks): 0.14
Planned project (237-line spec): 0.43
Free-form prompt (1 sentence): 0.87

The three experiments

Each step kept the 9 CLAUDE.mds the same and changed only how much plan was in the prompt: 1. cell-sized tasks, 2. a planned scraper built from a 237-line spec, 3. a free-form shop built from a one-sentence prompt.

What makes a good CLAUDE.md

The winning recipe combines declarative conventions with a few high-value flow rules — kept minimal. A sketch of what that looks like follows the list below.

Declare conventions, not procedures
Say what to follow: docstrings, type hints, naming, file layout. Don't tell the agent how to think. This is where Karpathy and Codex won.
Add a few flow rules — but stop early
A short flow nudge helps: plan first, write tests for regressions and new functionality. v2 (mine) scored 2.81 on builds that ran. v3/v4/v5 lost because they kept adding rules. Pick 2–3, stop.
Write tighter prompts
The 237-line spec in Report 2 cut variant spread from 0.87 → 0.43. A clear prompt beats any md.
State the need, not the method
Tell the agent what the user needs. Let it choose how. The mds that prescribed how to think (v3, v4, v5) ranked lowest.
Read first, change narrowly
Tell it to change only what's needed — after reading enough of the codebase to know what "needed" means. Surgery without anatomy is butchery.
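
To make the recipe concrete, here is a minimal sketch of a CLAUDE.md in this style. It is illustrative only; the wording is mine and is not copied from any of the 9 tested variants.

```markdown
# CLAUDE.md (illustrative sketch, not one of the 9 tested variants)

## Conventions (declarative)
- Write docstrings on public modules, classes, and functions.
- Use type hints on function signatures.
- Follow the existing naming style and file layout; don't reorganise unprompted.

## Flow rules (keep to 2–3)
- For multi-file changes, write a short plan before editing.
- When fixing a bug or adding functionality, add a test that covers it.
- Read the surrounding code first, then change only what the task needs.
```

Everything in the first block is declarative; the second block stops at three flow rules, which is roughly where the data says to stop adding.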

What changes when the plan disappears

Quality spread: 0.14 → 0.43 → 0.87. The score gap roughly triples from cell tests to the planned build and doubles again when the plan disappears.
Code volume spread (max ÷ min): — (mixed sizes) → 1.4× → 3.3×. Without a plan, the same 9 mds wrote between 580 and 1,910 lines.
Refusals (md blocked the user): 0 → 0 → 1. CLAUDE.md rules only override the user when the prompt is loose.

Same prompt, two shapes

Two CLAUDE.mds. Same one-sentence prompt: "Build me an online shop with a web interface for selling shoes." Same agent, same model. The structural choice is entirely the md.

v1 — Karpathy rules only
single-file monolith · 10 files · 580 lines (lines per file below)

README.md · 30
app.py · 232
requirements.txt · 1
static/style.css · 141
templates/base.html · 37
templates/cart.html · 38
templates/checkout.html · 28
templates/index.html · 33
templates/order.html · 17
templates/product.html · 23

v3 — Dory's AGENTS_medium
nested package · 27 files · 1,842 lines

.gitignore · 5
pyproject.toml · 16
run.py · 20
shoe_shop/__init__.py · 3
shoe_shop/cart.py · 144
shoe_shop/constants.py · 52
shoe_shop/database.py · 157
shoe_shop/images.py · 138
shoe_shop/logging_config.py · 34
shoe_shop/main.py · 354
shoe_shop/schemas.py · 73
shoe_shop/seed.py · 187
shoe_shop/static/css/style.css · 308
shoe_shop/static/img/shoe-1.svg · 12
shoe_shop/static/img/shoe-2.svg · 12
shoe_shop/static/img/shoe-3.svg · 9
shoe_shop/static/img/shoe-4.svg · 10
shoe_shop/static/img/shoe-5.svg · 10
shoe_shop/static/img/shoe-6.svg · 11
shoe_shop/static/img/shoe-7.svg · 8
shoe_shop/static/img/shoe-8.svg · 9
shoe_shop/templates/base.html · 61
shoe_shop/templates/cart.html · 58
shoe_shop/templates/catalog.html · 30
shoe_shop/templates/checkout.html · 47
shoe_shop/templates/order.html · 36
shoe_shop/templates/product.html · 38

v1 (Karpathy) chose a single 232-line app.py. v3 (Dory's AGENTS_medium) chose a nested package with separate modules for cart, database, images, schemas, and 8 SVG illustrations. Same prompt — 3× the code.

How each CLAUDE.md actually performed

9 mds, 3 experiments, 378 LLM judgements. Per-variant verdicts.

Works well · v7 — OpenAI Codex AGENTS.md
Most consistent across all three regimes
Evidence: Top-3 quality on every experiment (2.83 / 2.37 / 2.74). Best stability on shop (stdev 0.03). 3rd best on scraper (stdev 0.17).
Why: Narrow, specific rules (mostly Rust/codex-rs conventions) that don't try to control conversational flow — so they can't conflict with user intent.
Recommended default. If you must pick one md to copy, this is it.
Works well · v1 — Karpathy rules
Highest quality + lowest code volume
Evidence: Best score on cell tests (2.85) and scraper (2.46), top-3 on shop (2.70). Produced 580 lines on shop — half of what most variants wrote.
Why: "Behavioral guidelines to reduce common LLM mistakes" — declarative and short. Doesn't tell the agent how to think, just what to avoid.
Pick this if you value simplicity and you trust Claude to make sensible defaults. Run-to-run stdev on shop (0.22) is a touch higher than v0 or v7.
Works well · v0 — empty (no md)
Most stable across runs, every time
Evidence: Lowest run-to-run stdev on both scraper (0.12) and shop (perfect 0.00 — same 2.72 three times). #2 on shop quality.
Why: No rules = no rule conflicts = no surprising behavior. The agent falls back to its training defaults, which are reproducible.
The reproducibility champion. Use this when you care more about "same output every time" than "best possible output." Loses on planned tasks where v1/v7 have a small edge.
Mixed · v4 — Dory's AGENTS_full (1027 lines)
Length didn't help
Evidence: Middling quality everywhere (2.81 / 2.31 / 2.61). 1027 lines of rules, and the score sits below the 50-line Karpathy md in every regime.
Why: The agent isn't reading a constitution — long mds create more places for instructions to conflict, more rules to half-apply.
Trim aggressively. There's no quality reward for verbosity at this length.
Mixed · v8 — shanraisshan
Competitive but unstable on planned tasks
Evidence: Top-tier on cell (2.83) and shop (2.59) quality. But scraper stdev = 0.60 — same md, same prompt, three runs swinging by a full point.
Why: Worth digging into separately — the rules look reasonable, but something about them amplifies run-to-run noise on multi-file builds.
Useful as a starting point, but run it at N=3 in any eval before trusting a single comparison.
Works well · v2 — mine (AGENTS_light, 57 lines)
Highest quality when it builds — and the only md that knew when to pause
Evidence: Near-best on cell tests (2.84, just behind v1's 2.85). Top score on shop when it builds (2.83 on r1, 2.78 on r3). On 1 of 3 shop runs it paused before writing code, citing its own "plan-first" rule.
Why: The pause is a feature. The md was asked to build a vague shop with no plan, recognised the ambiguity, and asked for one. Re-prompting the same conversation broke the gate and produced its highest score yet. That's the md doing its job — not failing.
If you adopt this approach, build in a clear escape hatch — a phrase or signal the user can use to confirm "yes, I really mean go". The pause is valuable; the friction of recovering shouldn't be.
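
One way to phrase that escape hatch inside a plan-first rule (hypothetical wording, not v2's actual text):

```markdown
## Planning gate
- For vague or large requests, propose a short plan and wait for approval.
- Escape hatch: if the user says "just build it" or otherwise confirms they
  want code now, skip the plan and proceed with sensible defaults.
```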
Caveats · v6 — HumanLayer CLAUDE.md
Mode-collapsed once
Evidence: On shop r1, picked Node + committed node_modules/ (596 dep files, 65k lines vendored). r2 and r3 didn't repeat the mistake. Highest stdev among non-refusing variants (0.80 on shop, 0.58 on scraper).
Why: Best guess: a rule somewhere encourages "go fast, ship runnable code" without a counter-rule about diff hygiene. Small prompt change, big swing.
Add explicit .gitignore guidance and dependency-handling rules if you adopt this.
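
For instance, two added lines in this spirit (suggested wording, not HumanLayer's):

```markdown
## Diff hygiene
- Never commit dependency or build-output directories (node_modules/, dist/,
  __pycache__/); add them to .gitignore before the first commit.
- Don't vendor generated files into the repo; keep diffs reviewable.
```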
Caveats · v3 — Dory's AGENTS_medium (147 lines) · v5 — medium+karpathy
More rules, more code, lower quality (on planned tasks)
Evidence: v3 scraper: 2.11. v5 scraper: 2.04. Both at the bottom. Both also produce the most code on the scraper (2,563 / 2,517 lines vs Karpathy's 2,134).
Why: Verbose mds tend to push agents toward verbose code: more files, more abstractions, more places for the rubric to dock points.
If you're tempted to add rules to fix a single bad output, resist. A tighter prompt beats a longer md.

Patterns across the 9 variants

Length doesn't predict quality. 0 lines (v0) → 50 lines (v1 Karpathy) → 1027 lines (v4 full) all produced similar mean quality. The longest md cluster (v3, v4, v5) actually scored worse than the shortest on the planned scraper. More rules ≠ better code.
Hard procedural rules backfire under loose prompts. v2's combination of "always plan first" + "rules win conflicts with the user" caused the only refusal in the experiment. If your md installs gates ("never X without approval"), expect them to fire when prompts get vague.
Declarative beats procedural. The two top-3-on-everything mds (Karpathy, Codex) describe what to avoid and what conventions to follow — not how to think or when to ask. Behavioral guardrails don't conflict; flow-control rules do.
No md is best across all regimes. Out of 9 variants, only v0 (empty) and v7 (Codex) appeared in top-3 on multiple stability and quality dimensions across both planned and free-form. Every other md wins on one regime and loses on another. If you're picking an md, pick for your specific task profile, not from a leaderboard.

The full picture

126 total builds: 9 variants × 14 task-runs across the 3 experiments
378 LLM judgements: each build judged by Opus 4.7, Sonnet 4.6, Haiku 4.5 on 6 rubrics
1 refusal: v2 on shop r2, where a rule conflict overrode the user's "build it" prompt

📊 Full receipts: signal vs noise, lines-of-code variance, the comparison table



Signal vs noise — can you actually tell the variants apart?

Signal = spread between best and worst CLAUDE.md (cross-variant). Noise = median quality wobble within a single CLAUDE.md across its 3 runs. If noise > signal, you can't trust any single comparison: the same md run twice can differ by more than two different mds do.

Step 2 · Planned scraper: signal 0.43, noise 0.34 → 1.3× (lost in noise)
Step 3 · Free-form shop: signal 0.87, noise 0.11 → 7.8× (clear signal)

The headline: on the planned scraper, signal/noise is just 1.26× — the same CLAUDE.md run twice can produce more variation than two different CLAUDE.mds. On the free-form shop, signal/noise jumps to 7.83× — the variants actually differ in measurable, reproducible ways. This is why the recent paper saw flat results: they were in the planned regime, where the signal barely clears the noise floor.
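
For reference, here is how those two quantities appear to be defined, based on the description above (my reconstruction of the stated definitions, not the report's own code), with $s_{v,r}$ the quality score of variant $v$ on run $r$ and $\bar{s}_v$ its mean over the 3 runs:

$$
\text{signal} = \max_v \bar{s}_v - \min_v \bar{s}_v,
\qquad
\text{noise} = \operatorname{median}_v \sigma_v,
\qquad
\sigma_v = \operatorname{stdev}\!\left(s_{v,1}, s_{v,2}, s_{v,3}\right).
$$

Plugging in the planned-scraper values gives $0.43 / 0.34 \approx 1.26$, matching the ratio quoted above.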

Lines-of-code variance — coefficient of variation (%)

Cell tests are intentionally apples-to-oranges (different tasks, different sizes); scraper and shop are same-task across 3 runs — those bars are the real signal.

Coefficient of variation in lines added
How much code volume bounced around — the same setup, different runs.
1. Cell tests (n=72, mixed sizes): 120.1%
2. Planned scraper (same task): 11.0%
3. Free-form shop (same task): 24.6%
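
The coefficient of variation here follows the standard definition (my reading of the metric, not a quote from the report):

$$
\mathrm{CV} = \frac{\sigma}{\mu} \times 100\%,
$$

where $\sigma$ and $\mu$ are the standard deviation and mean of lines added across the builds being compared. Under this reading, the shop's 24.6% corresponds to a lines-added standard deviation of roughly 0.246 × 1,196 ≈ 294 lines.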

The full numbers

Every metric, every experiment, side by side.

Metric · 1. Cell tests · 2. Planned scraper · 3. Free-form shop
Cells in experiment · 72 · 27 · 27
Refusals (0-line builds) · 0 · 0 · 1
— Code volume —
Mean lines added per build · 51 · 2,287 · 1,196
Lines added — min · 1 · 1,891 · 580
Lines added — max · 225 · 2,727 · 1,910
Lines coefficient of variation · 120.1% · 11.0% · 24.6%
Mean files touched per build · 1.1 · 16.3 · 17.2
— Quality scores (0–3) —
Mean cell score · 2.81 · 2.24 · 2.50
Best variant score · 2.85 · 2.46 · 2.74
Worst variant score · 2.71 · 2.04 · 1.87
Variant spread (best − worst) · 0.14 · 0.43 · 0.87
Score coefficient of variation · 6.0% · 16.1% · 23.5%

Why this matters

For the recent paper claiming CLAUDE.md doesn't matter: they tested planned tasks. They were right — for that regime. This experiment shows the regime where it does matter: loose prompts.

When your prompt carries the plan, your CLAUDE.md is a small style nudge. When your prompt is one sentence, the md fills the gap — sometimes well, sometimes by stopping the conversation.

The fix isn't a better CLAUDE.md. It's a clearer prompt.