Research · AI Will Replace You · 2026-05-07

Does your CLAUDE.md actually matter?

A recent paper said no. So I tested 9 popular CLAUDE.md / AGENTS.md files across three increasingly loose tasks.
The answer: it depends on how much of the plan your prompt carries.

9 CLAUDE.md variants tested: empty · Karpathy · Codex · HumanLayer · light · medium · full · merged · shanraisshan
126 code builds: 9 variants × 14 task-runs (8 cell + 3 scraper + 3 shop)
378 LLM judgements: 126 builds × 3 judges (Opus 4.7 + Sonnet 4.6 + Haiku 4.5)
27 live smoke tests (9 variants × 3 shop runs): install → boot → curl → grep for shoes
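The smoke-test pipeline can be sketched roughly as below. This is a hedged reconstruction: the package manager, start command, port, and wait logic are all assumptions, since the article only specifies the install → boot → curl → grep-for-shoes sequence.

```python
import re
import subprocess
import time
import urllib.request

def mentions_shoes(html: str) -> bool:
    """The 'grep for shoes' step: case-insensitive substring check."""
    return re.search(r"shoes", html, re.IGNORECASE) is not None

def smoke_test(app_dir: str, url: str = "http://localhost:3000") -> bool:
    """install -> boot -> curl -> grep; True if the live page mentions shoes.

    npm, the port, and the fixed sleep are hypothetical harness details.
    """
    subprocess.run(["npm", "install"], cwd=app_dir, check=True)
    server = subprocess.Popen(["npm", "start"], cwd=app_dir)
    try:
        time.sleep(5)  # crude wait for boot; a real harness would poll
        html = urllib.request.urlopen(url, timeout=10).read().decode()
        return mentions_shoes(html)
    finally:
        server.terminate()
```

A pass means the generated shop not only builds but actually serves a page with shoes on it.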

The arc — three experiments, one curve

Each experiment uses the same 9 mds, the same agent, and the same judging rubric. The only thing that changes between experiments is how much planning the prompt carries.
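The run counts stated earlier (9 variants; 8 cell + 3 scraper + 3 shop task-runs; 3 judge models; smoke tests on shop runs only) fix the headline numbers by simple arithmetic:

```python
# Arithmetic behind the headline numbers, taken from the stats above.
VARIANTS = 9
TASK_RUNS = {"cell": 8, "scraper": 3, "shop": 3}
JUDGES = 3  # Opus 4.7, Sonnet 4.6, Haiku 4.5

builds = VARIANTS * sum(TASK_RUNS.values())       # 9 * 14
judgements = builds * JUDGES                       # every build judged 3x
smoke_tests = VARIANTS * TASK_RUNS["shop"]         # only shop builds go live

print(builds, judgements, smoke_tests)  # -> 126 378 27
```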

The reveal

Your CLAUDE.md is co-authoring every reply.

When your prompts carry the plan, the md is a small style nudge. When your prompts are vague, the md fills the gap — sometimes well, sometimes by stopping the conversation.

v0 (empty CLAUDE.md) ranked 7th of 9 on the planned task. It ranked 2nd of 9 on the free-form one. That single rank-flip refutes "always use Karpathy's md" or any other universal-best advice.

One md refused to write code on 1 of 3 runs. Same prompt, same agent: its rule to "always plan first" overrode the user's "do not ask clarifying questions; build it." Re-sending the prompt verbatim got past the gate.

The fix isn't a better md. It's a clearer prompt.

Variant spread vs prompt specificity: Cell tests 0.14 · Planned project 0.43 · Free-form 0.87 (← tighter prompt · looser prompt →)
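A minimal sketch of what "variant spread" could mean, assuming it is the gap between the best and worst variant's mean judge score on a 0–1 scale. The per-variant scores below are invented for illustration; only the 0.14 / 0.43 / 0.87 figures in the text are real.

```python
def spread(mean_scores: dict[str, float]) -> float:
    """Max minus min mean judge score across CLAUDE.md variants."""
    vals = list(mean_scores.values())
    return round(max(vals) - min(vals), 2)

# Hypothetical per-variant means for the cell-tests experiment,
# chosen only so the spread matches the reported 0.14.
cell = {"empty": 0.80, "karpathy": 0.86, "full": 0.72}
print(spread(cell))  # -> 0.14
```

The point of the curve: as the prompt loosens, the choice of CLAUDE.md explains more and more of the variance in outcomes.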
