Hands-on tests of how AI tools really behave when you push them. Real builds, real data, full source. Each experiment is reproducible: pick one and look at the receipts.
Click into any experiment for the full interactive report.
Karpathy's CLAUDE.md vs the internet's most-liked CLAUDE.md / AGENTS.md files, plus a few of my own. The results aren't what you'd expect.
A prompt-specificity curve (3+ levels of detail), tool-use reliability across model versions, and agent-vs-orchestrator latency. New experiments published as runs complete.