Research · CLAUDE.md eval
The arc: 1. Cell tests · 2. Planned build · 3. Free-form build · 4. Takeaway
Report 1 of 4 · Cell-sized tests

Reproduction commits + synthetic rule-tests

8 tightly-scoped tasks: 4 replays of real bug-fixes from Dory's repo + 4 synthetic rule-tests. The smallest unit of CLAUDE.md influence.
Headline
I wanted to compare CLAUDE.mds in depth: a range of handpicked mds from the internet vs mine.
8 tasks: 4 recent bug-fix commits I'd shipped to my own repo + 4 synthetic rule-tests. 9 CLAUDE.md variants, one run each. I asked Claude to design the iterations; it called them "cells." Claude thinks in dataframes. Cute. Variant spread: 0.14 on a 0–3 scale, barely separable. Cool finding: one md wrote tests nobody asked for.
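To make the unit concrete, here is a minimal sketch of the cell grid. The task and variant labels below are illustrative shorthand for the 8 tasks and 9 mds described in this report, not the exact identifiers from the eval.

```python
from itertools import product

# Illustrative labels only; the real prompts and CLAUDE.md files live in the eval setup.
tasks = ["linkedin_start_date_fix", "thumbnail_batch_processor", "webflow_dedup",
         "webflow_stdout_polish",                        # 4 real bug-fix replays
         "simplicity_word_count", "goal_driven_order_total",
         "surgical_edit", "think_before_code"]           # 4 synthetic rule-tests
variants = ["v0_empty", "v1_karpathy", "v2_agents_light", "v3_agents_medium",
            "v4_agents_full", "v5_medium_plus_karpathy", "v6_humanlayer",
            "v7_codex", "v8_shanraisshan"]

# One cell = one (task, variant) pair, run exactly once (N = 1), then judged.
cells = list(product(tasks, variants))
assert len(cells) == 72   # 8 tasks x 9 variants
```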
📖 How to read this report
Tasks: 8 (4 real bug-fixes from my repo · 4 synthetic rule-tests)
Variants: 9 CLAUDE.mds (Karpathy · Codex · HumanLayer · shanraisshan · 4 of mine · empty)
Sample size: N = 1 (same as the recent paper; terrible for variance, but it matches the baseline)
Score: 0 → 3 (3 LLM judges · 6 rubric dimensions · 2.0 = merge with comments · 3.0 = ready to merge)
⚠ Important context: These tasks aren't greenfield builds. Each real task is a reproduction: I took a real bug-fix I'd shipped, reset the repo, then re-ran the agent with the original prompt. Each synthetic task comes from a curated rule-test suite where the agent exercises judgement on a single small file. In both cases the change is scope-bounded by the task.
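A minimal sketch of that reproduction loop, assuming plain `git` plumbing and a placeholder `run_agent` function standing in for the actual Claude Code invocation; the function names and file layout here are hypothetical, not the author's harness.

```python
import subprocess

def reset_to_before_fix(repo_dir: str, fix_commit: str) -> None:
    """Rewind the working tree to the parent of the original bug-fix commit."""
    subprocess.run(["git", "-C", repo_dir, "checkout", f"{fix_commit}~1"], check=True)

def run_agent(repo_dir: str, prompt: str, claude_md: str) -> None:
    """Placeholder: write the variant's CLAUDE.md into the repo, then run the agent
    on the original task prompt (e.g. via the Claude Code CLI)."""
    ...

def reproduce(repo_dir: str, fix_commit: str, prompt: str, claude_md: str) -> str:
    """Replay one real task for one variant and return the resulting diff for judging."""
    reset_to_before_fix(repo_dir, fix_commit)
    run_agent(repo_dir, prompt, claude_md)
    diff = subprocess.run(["git", "-C", repo_dir, "diff"],
                          capture_output=True, text=True, check=True)
    return diff.stdout  # this diff is what the 3 judges score on the 0-3 rubric
```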
Cells judged: 72
Average quality score: 2.81 / 3.00 (higher = better)
Variant spread: 0.14 pts
What this number is: the mean code-quality score across 72 cells (8 tasks × 9 variants). Each cell is judged on 6 rubric dimensions (correctness, simplicity, modularity, DRY, review-acceptance, diff-discipline) by 3 LLM judges, and the results are averaged. 2.0 = "would merge with comments", 3.0 = "ready to merge", 0.0 = refused or broken.
N = 1: only one run per task-variant cell, so we cannot measure how much an individual cell would vary if re-run. Between-run variance is unmeasured here.
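A sketch of how the headline numbers fall out of the per-judge scores, assuming a long-format table with one row per (task, variant, judge) and one 0-3 column per rubric dimension; the file name and column names are hypothetical.

```python
import pandas as pd

RUBRICS = ["correctness", "simplicity", "modularity", "dry",
           "review_acceptance", "diff_discipline"]

# One row per (task, variant, judge), with a 0-3 score for each rubric dimension.
judgments = pd.read_csv("judgments.csv")  # columns: task, variant, judge, *RUBRICS

# A judge's score for a cell is the mean over the 6 rubric dimensions;
# the cell's score is then the mean over its 3 judges.
judgments["judge_score"] = judgments[RUBRICS].mean(axis=1)
cell_scores = judgments.groupby(["task", "variant"])["judge_score"].mean()

overall_mean = cell_scores.mean()                            # reported: 2.81 / 3.00
variant_means = cell_scores.groupby(level="variant").mean()  # one number per md
spread = variant_means.max() - variant_means.min()           # reported: 0.14 pts
```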

Variant ranking

Each row averages across 8 tasks. The differences are small: the best mds edge out the empty control by a few hundredths of a point, and no md dominates.

| Rank | Variant | Score | Signal |
|---|---|---|---|
| #1 | v1 — Karpathy rules only (110k stars) | 2.85 | 41 lines avg · ±0.10 across 8 tasks |
| #2 | v2 — Dory's AGENTS_light (57 lines) | 2.84 | 60 lines avg · ±0.11 across 8 tasks · only md that added a regression test |
| #3 | v8 — shanraisshan claude-code-best-practice (51k stars) | 2.83 | 46 lines avg · ±0.12 across 8 tasks |
| #4 | v7 — OpenAI Codex AGENTS.md (80k stars) | 2.83 | 53 lines avg · ±0.14 across 8 tasks |
| #5 | v0 — empty (control) | 2.82 | 44 lines avg · ±0.12 across 8 tasks |
| #6 | v6 — HumanLayer CLAUDE.md (10.7k stars) | 2.82 | 46 lines avg · ±0.13 across 8 tasks |
| #7 | v4 — Dory's AGENTS_full1027 (1353 lines) | 2.81 | 48 lines avg · ±0.15 across 8 tasks |
| #8 | v3 — Dory's AGENTS_medium_autonomous (147 lines) | 2.77 | 72 lines avg · ±0.27 across 8 tasks |
| #9 | v5 — medium + Karpathy merged (~196 lines) | 2.71 | 54 lines avg · ±0.30 across 8 tasks |
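The ranking rows can be reproduced from the same cell scores; the ± figure in the Signal column is read here as the per-variant standard deviation across the 8 task scores. A sketch under that assumption, using a hypothetical `cell_scores.csv` with one row per cell:

```python
import pandas as pd

# One row per (task, variant) cell with its judged 0-3 score (hypothetical file).
cells = pd.read_csv("cell_scores.csv")  # columns: task, variant, score

ranking = (
    cells.groupby("variant")["score"]
    .agg(score="mean", across_tasks="std")  # std read as the "±x.xx across 8 tasks"
    .sort_values("score", ascending=False)
    .round(2)
)
print(ranking)  # v1 at ~2.85 on top, v5 at ~2.71 at the bottom
```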

Three things to take away

1. The spread is tiny: 0.14 out of 3. Best vs worst are barely separable. Small task, small md effect.
2. My agents_light (v2) was the only md that added a regression test. On the LinkedIn bug-fix, v2 wrote a test to lock in the fix, because its rule book requires one for every bug. The other 8 variants shipped the fix without it. Even on tiny tasks, the md leaks through, for better or worse.
3. Real or synthetic, same story. Repo bug-fixes and curated rule-tests gave the same verdict. The interesting stuff starts in Report 2.

📊 Full data: heatmap, per-cell diffs, judge rationales


Score grid (8 tasks × 9 variants)

Each cell is the mean of 3 judges (Opus 4.7, Sonnet 4.6, Haiku 4.5) on a 0–3 quality rubric.

| Variant | real · linkedin start-date fix | real · thumbnail batch processor | real · webflow dedup | real · webflow stdout polish | synth · simplicity (word_count) | synth · goal-driven (order_total) | synth · surgical edit | synth · think before code | variant mean |
|---|---|---|---|---|---|---|---|---|---|
| v0 — empty (control) | 2.89 | 2.67 | 2.78 | 2.67 | 2.78 | 2.89 | 2.92 | 3.00 | 2.82 |
| v1 — Karpathy rules only (110k stars) | 2.72 | 2.83 | 2.83 | 2.72 | 2.83 | 2.94 | 2.92 | 3.00 | 2.85 |
| v2 — Dory's AGENTS_light (57 lines) | 2.89 | 2.89 | 2.72 | 2.72 | 2.78 | 2.94 | 3.00 | 2.75 | 2.84 |
| v3 — Dory's AGENTS_medium_autonomous (147 lines) | 2.83 | 2.83 | 2.83 | 2.11 | 2.83 | 2.83 | 2.92 | 3.00 | 2.77 |
| v4 — Dory's AGENTS_full1027 (1353 lines) | 2.72 | 2.83 | 2.78 | 2.83 | 2.50 | 2.89 | 2.92 | 3.00 | 2.81 |
| v5 — medium + Karpathy merged (~196 lines) | 2.78 | 2.83 | 2.72 | 2.72 | 2.58 | 2.06 | 3.00 | 3.00 | 2.71 |
| v6 — HumanLayer CLAUDE.md (10.7k stars) | 2.72 | 2.83 | 2.72 | 2.72 | 2.67 | 2.94 | 3.00 | 2.94 | 2.82 |
| v7 — OpenAI Codex AGENTS.md (80k stars) | 2.89 | 2.89 | 2.83 | 2.83 | 2.50 | 2.89 | 2.92 | 2.89 | 2.83 |
| v8 — shanraisshan claude-code-best-practice (51k stars) | 2.78 | 2.61 | 2.72 | 2.89 | 2.92 | 2.94 | 2.92 | 2.89 | 2.83 |
| task mean | 2.80 | 2.80 | 2.77 | 2.69 | 2.71 | 2.81 | 2.94 | 2.94 | |

real tasks (from Dory's repo) · synthetic tasks (from eugenexz/agent-prompt-rules)
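For completeness, the grid and its margins can be rebuilt with a single pivot. A sketch assuming the same hypothetical per-cell file, where `margins=True` adds the variant-mean column and task-mean row shown above:

```python
import pandas as pd

cells = pd.read_csv("cell_scores.csv")  # columns: task, variant, score (one row per cell)

grid = cells.pivot_table(
    index="variant", columns="task", values="score",
    aggfunc="mean", margins=True, margins_name="mean",  # adds variant-mean / task-mean
).round(2)
print(grid)  # 9 variant rows x 8 task columns, plus the mean margins
```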