Hands-on tests of how AI tools really behave when you push them. Real builds, real data, full source. Each experiment is reproducible — pick one and look at the receipts.
Click into any experiment for the full interactive report.
Nine popular CLAUDE.md / AGENTS.md variants tested across three increasingly loose tasks. A recent paper claimed system-prompt choice barely moves the needle. The data says: it depends entirely on how detailed a plan you give the agent.
Prompt-specificity curves (3+ levels of detail), tool-use reliability across model versions, agent-vs-orchestrator latency. New experiments are published as runs complete.