Research · CLAUDE.md eval
The arc: 1. Cell tests · 2. Planned build · 3. Free-form build · 4. Takeaway
Report 4 of 4 · The takeaway

Your CLAUDE.md isn't doing what you think it's doing

126 builds, 9 popular CLAUDE.mds, one finding nobody is talking about: the plan does the work — and the md catches what the plan misses.
Headline
The plan is the pilot. CLAUDE.md is the co-pilot. Most people have it backwards.
When the prompt is tight, all 9 mds converge — your rules barely matter. When the prompt is loose, the md takes over and shapes the entire codebase. The skill isn't picking the right md — it's knowing when the md is flying the plane.

Four takeaways

1 The plan flies the plane. CLAUDE.md is the co-pilot.
Tight prompt → 9 different mds converge to roughly the same code. Loose prompt → the md becomes the structural decision-maker. Your md isn't a style guide. It's a fallback that activates when your prompt is weak.
2 Fewer rules → more stability.
The empty CLAUDE.md had the lowest run-to-run stdev on both real experiments — 0.12 on scraper, 0.00 on shop. Every rule you add is a new place for variance to enter.
3 All 9 mds produce good code.
26 of 27 free-form builds shipped. The differences between mds are noticeable, but every single one is good enough to use. Pick the one whose shape you like.
4 The best md combines both styles — kept minimal.
Declarative mds (Karpathy, Codex) won on stability. Flow-aware mds (mine) had real advantages too — v2 scored 2.81 on the runs that built, ahead of Karpathy's 2.70. The sweet spot: pair declarative conventions (docstrings, type hints, naming) with a few high-value flow rules (write tests for regressions, plan first) — and stop there. v3, v4, v5 lost because they kept adding.
📖 How to read this report
The thesis
The looser the prompt, the more your CLAUDE.md fills the gap.
Same 9 mds, same agent. Only the prompt's specificity changes.
Total builds: 126 (9 mds × 14 task-runs across 3 experiments)
LLM judgements: 378 (each build judged by Opus, Sonnet, Haiku on 6 rubrics)
Sample sizes: 1 → 3 → 3 (Report 1 matches the paper's N=1; Reports 2 & 3 raise to N=3)
Refusals: 1 of 126 (v2 on shop r2: a rule conflict overrode "build it")

The curve, in one chart

Each bar is the score gap between the best and worst CLAUDE.md on that experiment. Same 9 mds, same agent, only the prompt specificity changes.

Variant spread — score gap between best and worst CLAUDE.md (a larger number means the mds differ more; a smaller number means they converge):
Cell-sized tasks (8 tasks): 0.14
Planned project (237-line spec): 0.43
Free-form prompt (1 sentence): 0.87

The three experiments

Each step kept the 9 CLAUDE.mds the same and changed only how much plan was in the prompt: 1. cell-sized tasks, 2. a planned scraper built from a 237-line spec, 3. a free-form shop built from a one-sentence prompt.

What makes a good CLAUDE.md

The winning recipe combines declarative conventions with a few high-value flow rules — kept minimal. A sketch of what that looks like follows the list below.

Declare conventions, not procedures
Say what to follow: docstrings, type hints, naming, file layout. Don't tell the agent how to think. This is where Karpathy and Codex won.
Add a few flow rules — but stop early
A short flow nudge helps: plan first, write tests for regressions and new functionality. v2 (mine) scored 2.81 on builds that ran. v3/v4/v5 lost because they kept adding rules. Pick 2–3, stop.
Write tighter prompts
The 237-line spec in Report 2 cut variant spread from 0.87 → 0.43. A clear prompt beats any md.
State the need, not the method
Tell the agent what the user needs. Let it choose how. The mds that prescribed how to think (v3, v4, v5) ranked lowest.
Read first, change narrowly
Tell it to change only what's needed — after reading enough of the codebase to know what "needed" means. Surgery without anatomy is butchery.
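
To make the recipe concrete, here is a minimal sketch of a CLAUDE.md in this style. It is illustrative only; the wording is mine and is not copied from any of the 9 tested variants.

```markdown
# CLAUDE.md (illustrative sketch, not one of the 9 tested variants)

## Conventions (declarative)
- Write docstrings on public modules, classes, and functions.
- Use type hints on function signatures.
- Follow the existing naming style and file layout; don't reorganise unprompted.

## Flow rules (keep to 2–3)
- For multi-file changes, write a short plan before editing.
- When fixing a bug or adding functionality, add a test that covers it.
- Read the surrounding code first, then change only what the task needs.
```

Everything in the first block is declarative; the second block stops at three flow rules, which is roughly where the data says to stop adding.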

What changes when the plan disappears

Quality spread: 0.14 → 0.43 → 0.87. The score gap roughly triples from cell tests to the planned build and doubles again when the plan disappears.
Code volume spread (max ÷ min): — (mixed sizes) → 1.4× → 3.3×. Without a plan, the same 9 mds wrote between 580 and 1,910 lines.
Refusals (md blocked the user): 0 → 0 → 1. CLAUDE.md rules only override the user when the prompt is loose.

Same prompt, two shapes

Two CLAUDE.mds. Same one-sentence prompt: "Build me an online shop with a web interface for selling shoes." Same agent, same model. The structural choice is entirely the md.

v1 — Karpathy rules only
single-file monolith · 10 files · 580 lines (lines per file below)

README.md · 30
app.py · 232
requirements.txt · 1
static/style.css · 141
templates/base.html · 37
templates/cart.html · 38
templates/checkout.html · 28
templates/index.html · 33
templates/order.html · 17
templates/product.html · 23

v3 — Dory's AGENTS_medium
nested package · 27 files · 1,842 lines

.gitignore · 5
pyproject.toml · 16
run.py · 20
shoe_shop/__init__.py · 3
shoe_shop/cart.py · 144
shoe_shop/constants.py · 52
shoe_shop/database.py · 157
shoe_shop/images.py · 138
shoe_shop/logging_config.py · 34
shoe_shop/main.py · 354
shoe_shop/schemas.py · 73
shoe_shop/seed.py · 187
shoe_shop/static/css/style.css · 308
shoe_shop/static/img/shoe-1.svg · 12
shoe_shop/static/img/shoe-2.svg · 12
shoe_shop/static/img/shoe-3.svg · 9
shoe_shop/static/img/shoe-4.svg · 10
shoe_shop/static/img/shoe-5.svg · 10
shoe_shop/static/img/shoe-6.svg · 11
shoe_shop/static/img/shoe-7.svg · 8
shoe_shop/static/img/shoe-8.svg · 9
shoe_shop/templates/base.html · 61
shoe_shop/templates/cart.html · 58
shoe_shop/templates/catalog.html · 30
shoe_shop/templates/checkout.html · 47
shoe_shop/templates/order.html · 36
shoe_shop/templates/product.html · 38

v1 (Karpathy) chose a single 232-line app.py. v3 (Dory's AGENTS_medium) chose a nested package with separate modules for cart, database, images, schemas, and 8 SVG illustrations. Same prompt — 3× the code.

How each CLAUDE.md actually performed

9 mds, 3 experiments, 378 LLM judgements. Per-variant verdicts.

Works well · v7 — OpenAI Codex AGENTS.md
Most consistent across all three regimes
Evidence: Top-3 quality on every experiment (2.83 / 2.37 / 2.74). Best stability on shop (stdev 0.03). 3rd best on scraper (stdev 0.17).
Why: Narrow, specific rules (mostly Rust/codex-rs conventions) that don't try to control conversational flow — so they can't conflict with user intent.
Recommended default. If you must pick one md to copy, this is it.
Works well · v1 — Karpathy rules
Highest quality + lowest code volume
Evidence: Best score on cell tests (2.85) and scraper (2.46), top-3 on shop (2.70). Produced 580 lines on shop — half of what most variants wrote.
Why: "Behavioral guidelines to reduce common LLM mistakes" — declarative and short. Doesn't tell the agent how to think, just what to avoid.
Pick this if you value simplicity and you trust Claude to make sensible defaults. Run-to-run stdev on shop (0.22) is a touch higher than v0 or v7.
Works well · v0 — empty (no md)
Most stable across runs, every time
Evidence: Lowest run-to-run stdev on both scraper (0.12) and shop (perfect 0.00 — same 2.72 three times). #2 on shop quality.
Why: No rules = no rule conflicts = no surprising behavior. The agent falls back to its training defaults, which are reproducible.
The reproducibility champion. Use this when you care more about "same output every time" than "best possible output." Loses on planned tasks where v1/v7 have a small edge.
Mixed · v4 — Dory's AGENTS_full (1027 lines)
Length didn't help
Evidence: Middling quality everywhere (2.81 / 2.31 / 2.61). 1027 lines of rules, and the score sits below the 50-line Karpathy md in every regime.
Why: The agent isn't reading a constitution — long mds create more places for instructions to conflict, more rules to half-apply.
Trim aggressively. There's no quality reward for verbosity at this length.
Mixed · v8 — shanraisshan
Competitive but unstable on planned tasks
Evidence: Top-tier on cell (2.83) and shop (2.59) quality. But scraper stdev = 0.60 — same md, same prompt, three runs swinging by a full point.
Why: Worth digging into separately — the rules look reasonable, but something about them amplifies run-to-run noise on multi-file builds.
Useful as a starting point, but run it at N=3 in any eval before trusting a single comparison.
Works well · v2 — mine (AGENTS_light, 57 lines)
Highest quality when it builds — and the only md that knew when to pause
Evidence: Near-best on cell tests (2.84, just behind v1's 2.85). Top score on shop when it builds (2.83 on r1, 2.78 on r3). On 1 of 3 shop runs it paused before writing code, citing its own "plan-first" rule.
Why: The pause is a feature. The md was asked to build a vague shop with no plan, recognised the ambiguity, and asked for one. Re-prompting the same conversation broke the gate and produced its highest score yet. That's the md doing its job — not failing.
If you adopt this approach, build in a clear escape hatch — a phrase or signal the user can use to confirm "yes, I really mean go". The pause is valuable; the friction of recovering shouldn't be.
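
One way to phrase that escape hatch inside a plan-first rule (hypothetical wording, not v2's actual text):

```markdown
## Planning gate
- For vague or large requests, propose a short plan and wait for approval.
- Escape hatch: if the user says "just build it" or otherwise confirms they
  want code now, skip the plan and proceed with sensible defaults.
```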
Caveats · v6 — HumanLayer CLAUDE.md
Mode-collapsed once
Evidence: On shop r1, picked Node + committed node_modules/ (596 dep files, 65k lines vendored). r2 and r3 didn't repeat the mistake. Highest stdev among non-refusing variants (0.80 on shop, 0.58 on scraper).
Why: Best guess: a rule somewhere encourages "go fast, ship runnable code" without a counter-rule about diff hygiene. Small prompt change, big swing.
Add explicit .gitignore guidance and dependency-handling rules if you adopt this.
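
For instance, two added lines in this spirit (suggested wording, not HumanLayer's):

```markdown
## Diff hygiene
- Never commit dependency or build-output directories (node_modules/, dist/,
  __pycache__/); add them to .gitignore before the first commit.
- Don't vendor generated files into the repo; keep diffs reviewable.
```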
Caveats · v3 — Dory's AGENTS_medium (147 lines) · v5 — medium+karpathy
More rules, more code, lower quality (on planned tasks)
Evidence: v3 scraper: 2.11. v5 scraper: 2.04. Both at the bottom. Both also produce the most code on the scraper (2,563 / 2,517 lines vs Karpathy's 2,134).
Why: Verbose mds tend to push agents toward verbose code: more files, more abstractions, more places for the rubric to dock points.
If you're tempted to add rules to fix a single bad output, resist. A tighter prompt beats a longer md.

Patterns across the 9 variants

Length doesn't predict quality. 0 lines (v0) → 50 lines (v1 Karpathy) → 1027 lines (v4 full) all produced similar mean quality. The longest md cluster (v3, v4, v5) actually scored worse than the shortest on the planned scraper. More rules ≠ better code.
Hard procedural rules backfire under loose prompts. v2's combination of "always plan first" + "rules win conflicts with the user" caused the only refusal in the experiment. If your md installs gates ("never X without approval"), expect them to fire when prompts get vague.
Declarative beats procedural. The two top-3-on-everything mds (Karpathy, Codex) describe what to avoid and what conventions to follow — not how to think or when to ask. Behavioral guardrails don't conflict; flow-control rules do.
No md is best across all regimes. Out of 9 variants, only v0 (empty) and v7 (Codex) appeared in top-3 on multiple stability and quality dimensions across both planned and free-form. Every other md wins on one regime and loses on another. If you're picking an md, pick for your specific task profile, not from a leaderboard.

The full picture

126 total builds: 9 variants × 14 task-runs across the 3 experiments
378 LLM judgements: each build judged by Opus 4.7, Sonnet 4.6, Haiku 4.5 on 6 rubrics
1 refusal: v2 on shop r2, where a rule conflict overrode the user's "build it" prompt

📊 Full receipts: signal vs noise, lines-of-code variance, the comparison table



Signal vs noise — can you actually tell the variants apart?

Signal = spread between best and worst CLAUDE.md (cross-variant). Noise = median quality wobble within a single CLAUDE.md across its 3 runs. If noise > signal, you can't trust any single comparison: the same md run twice can differ by more than two different mds do.

Step 2 · Planned scraper: signal 0.43, noise 0.34 → 1.3× (lost in noise)
Step 3 · Free-form shop: signal 0.87, noise 0.11 → 7.8× (clear signal)

The headline: on the planned scraper, signal/noise is just 1.26× — the same CLAUDE.md run twice can produce more variation than two different CLAUDE.mds. On the free-form shop, signal/noise jumps to 7.83× — the variants actually differ in measurable, reproducible ways. This is why the recent paper saw flat results: they were in the planned regime, where the signal barely clears the noise floor.
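
For reference, here is how those two quantities appear to be defined, based on the description above (my reconstruction of the stated definitions, not the report's own code), with $s_{v,r}$ the quality score of variant $v$ on run $r$ and $\bar{s}_v$ its mean over the 3 runs:

$$
\text{signal} = \max_v \bar{s}_v - \min_v \bar{s}_v,
\qquad
\text{noise} = \operatorname{median}_v \sigma_v,
\qquad
\sigma_v = \operatorname{stdev}\!\left(s_{v,1}, s_{v,2}, s_{v,3}\right).
$$

Plugging in the planned-scraper values gives $0.43 / 0.34 \approx 1.26$, matching the ratio quoted above.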

Lines-of-code variance — coefficient of variation (%)

Cell tests are intentionally apples-to-oranges (different tasks, different sizes); scraper and shop are same-task across 3 runs — those bars are the real signal.

Coefficient of variation in lines added
How much code volume bounced around — the same setup, different runs.
1. Cell tests (n=72, mixed sizes): 120.1%
2. Planned scraper (same task): 11.0%
3. Free-form shop (same task): 24.6%
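
The coefficient of variation here follows the standard definition (my reading of the metric, not a quote from the report):

$$
\mathrm{CV} = \frac{\sigma}{\mu} \times 100\%,
$$

where $\sigma$ and $\mu$ are the standard deviation and mean of lines added across the builds being compared. Under this reading, the shop's 24.6% corresponds to a lines-added standard deviation of roughly 0.246 × 1,196 ≈ 294 lines.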

The full numbers

Every metric, every experiment, side by side.

Metric · 1. Cell tests · 2. Planned scraper · 3. Free-form shop
Cells in experiment · 72 · 27 · 27
Refusals (0-line builds) · 0 · 0 · 1
— Code volume —
Mean lines added per build · 51 · 2,287 · 1,196
Lines added — min · 1 · 1,891 · 580
Lines added — max · 225 · 2,727 · 1,910
Lines coefficient of variation · 120.1% · 11.0% · 24.6%
Mean files touched per build · 1.1 · 16.3 · 17.2
— Quality scores (0–3) —
Mean cell score · 2.81 · 2.24 · 2.50
Best variant score · 2.85 · 2.46 · 2.74
Worst variant score · 2.71 · 2.04 · 1.87
Variant spread (best − worst) · 0.14 · 0.43 · 0.87
Score coefficient of variation · 6.0% · 16.1% · 23.5%

Why this matters

For the recent paper claiming CLAUDE.md doesn't matter: they tested planned tasks. They were right — for that regime. This experiment shows the regime where it does matter: loose prompts.

When your prompt carries the plan, your CLAUDE.md is a small style nudge. When your prompt is one sentence, the md fills the gap — sometimes well, sometimes by stopping the conversation.

The fix isn't a better CLAUDE.md. It's a clearer prompt.