Research · AI Will Replace You · 2026-05-07

Does your CLAUDE.md actually matter?

A recent paper said no. So I tested 9 popular CLAUDE.md / AGENTS.md files across three increasingly loose tasks.
The answer: it depends on how much of the plan your prompt carries.

9 CLAUDE.md variants tested: empty · Karpathy · Codex · HumanLayer · light · medium · full · merged · shanraisshan
126 code builds: 9 variants × 14 task-runs (8 cell + 3 scraper + 3 shop)
378 LLM judgements: 126 builds × 3 judges (Opus 4.7 + Sonnet 4.6 + Haiku 4.5)
27 live smoke tests (9 variants × 3 shop runs): install → boot → curl → grep for shoes
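The smoke-test pipeline can be sketched roughly as below. This is a hedged reconstruction: the package manager, start command, port, and wait logic are all assumptions, since the article only specifies the install → boot → curl → grep-for-shoes sequence.

```python
import re
import subprocess
import time
import urllib.request

def mentions_shoes(html: str) -> bool:
    """The 'grep for shoes' step: case-insensitive substring check."""
    return re.search(r"shoes", html, re.IGNORECASE) is not None

def smoke_test(app_dir: str, url: str = "http://localhost:3000") -> bool:
    """install -> boot -> curl -> grep; True if the live page mentions shoes.

    npm, the port, and the fixed sleep are hypothetical harness details.
    """
    subprocess.run(["npm", "install"], cwd=app_dir, check=True)
    server = subprocess.Popen(["npm", "start"], cwd=app_dir)
    try:
        time.sleep(5)  # crude wait for boot; a real harness would poll
        html = urllib.request.urlopen(url, timeout=10).read().decode()
        return mentions_shoes(html)
    finally:
        server.terminate()
```

A pass means the generated shop not only builds but actually serves a page with shoes on it.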

The arc — three experiments, one curve

Each experiment uses the same 9 mds, the same agent, and the same judging rubric. The only thing that changes between experiments is how much planning the prompt carries.
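The run counts stated earlier (9 variants; 8 cell + 3 scraper + 3 shop task-runs; 3 judge models; smoke tests on shop runs only) fix the headline numbers by simple arithmetic:

```python
# Arithmetic behind the headline numbers, taken from the stats above.
VARIANTS = 9
TASK_RUNS = {"cell": 8, "scraper": 3, "shop": 3}
JUDGES = 3  # Opus 4.7, Sonnet 4.6, Haiku 4.5

builds = VARIANTS * sum(TASK_RUNS.values())       # 9 * 14
judgements = builds * JUDGES                       # every build judged 3x
smoke_tests = VARIANTS * TASK_RUNS["shop"]         # only shop builds go live

print(builds, judgements, smoke_tests)  # -> 126 378 27
```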

The reveal

Your CLAUDE.md is co-authoring every reply.

When your prompts carry the plan, the md is a small style nudge. When your prompts are vague, the md fills the gap — sometimes well, sometimes by stopping the conversation.

v0 (empty CLAUDE.md) ranked 7th of 9 on the planned task. It ranked 2nd of 9 on the free-form one. That single rank-flip refutes "always use Karpathy's md" or any other universal-best advice.

One md refused to write code on 1 of 3 runs. Same prompt, same agent: its rule to "always plan first" overrode the user's "do not ask clarifying questions; build it." Re-sending the prompt verbatim got past the gate.

The fix isn't a better md. It's a clearer prompt.

Variant spread vs prompt specificity: Cell tests 0.14 · Planned project 0.43 · Free-form 0.87 (← tighter prompt · looser prompt →)
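A minimal sketch of what "variant spread" could mean, assuming it is the gap between the best and worst variant's mean judge score on a 0–1 scale. The per-variant scores below are invented for illustration; only the 0.14 / 0.43 / 0.87 figures in the text are real.

```python
def spread(mean_scores: dict[str, float]) -> float:
    """Max minus min mean judge score across CLAUDE.md variants."""
    vals = list(mean_scores.values())
    return round(max(vals) - min(vals), 2)

# Hypothetical per-variant means for the cell-tests experiment,
# chosen only so the spread matches the reported 0.14.
cell = {"empty": 0.80, "karpathy": 0.86, "full": 0.72}
print(spread(cell))  # -> 0.14
```

The point of the curve: as the prompt loosens, the choice of CLAUDE.md explains more and more of the variance in outcomes.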
