Results first. To see whether Karpathy Skills (community-curated CLAUDE.md behavior rules) actually help, I ran 10 real tickets with Claude Opus 4.8 locked—the only variable was whether Karpathy's four principles were loaded. After one week, the numbers that stuck:
First-pass xcodebuild test green rose from 5/10 to 8/10 (+30pp). Median human CR time dropped from about 38 min to 22 min.
There was also a disappointment—a one-line constant change where the rules barely helped and added an extra round of assumption-checking. Success and failure both get documented below, plus the final CLAUDE.md we keep in the repo (what people search for as Claude Code Rules / Claude Code Prompt).
Why Claude Code edits code it shouldn't
It's less "the model is dumb" and more a default optimization target misaligned with engineers: ship a patch that looks complete fast—add abstractions, touch neighbor files, skip ambiguity. Andrej Karpathy summarized four chronic symptoms on X: coding before asking, over-engineering simple asks, drive-by edits on files that weren't in scope, and replacing verifiable goals with "done."
In spring 2026 the community compressed that into a few dozen lines of CLAUDE.md (forrestchang/andrej-karpathy-skills, aka Karpathy Skills). No new runtime, just four behavioral contracts Claude Code reads every session. What I wanted was a measurement: I tested it, so what's the number?
What Karpathy Skills are
The name says Skills, but the artifact is a project instruction file—not OpenAI GPT Skills, not OpenHuman's SKILL.md plugin pack. Claude Code reads repo-root CLAUDE.md (plus equivalent text from global plugins) at session start and constrains how the agent thinks and cuts diffs before you type the first line.
The four principles map to the upstream CLAUDE.md:
- Think Before Coding: State assumptions, surface ambiguity, propose simpler approaches—don't silently pick one interpretation and start typing.
- Simplicity First: Minimum code that solves the problem. Reject unrequested abstractions, config, or "future-proof" layers.
- Surgical Changes: Touch only task-necessary lines. No drive-by "cleanup" or formatting. Remove dead code you introduced; mention legacy dead code, don't delete unless asked.
- Goal-Driven Execution: Turn bug fixes and validation into verifiable goals (failing test first when applicable). Multi-step work: list a verify checkpoint per step.
Install two ways: Claude Code plugin /plugin install andrej-karpathy-skills@karpathy-skills, or merge upstream CLAUDE.md at repo root—the final version below is copy-paste ready.
How I ran the test
Variable control (ran on a Mac mini M4 Cloud Mac for tmux long-runs and uninterrupted xcodebuild; numbers reproduce on a local Mac too):
- 10 real tickets: 4 small Swift features, 2 cross-file renames, 2 test additions, 2 CI script tweaks.
- Model fixed: claude-opus-4-8, Effort high, no minor bump within the same week.
- Single variable: control used the old CLAUDE.md (no Karpathy). Experiment merged the four principles + two team rules below.
- Each ticket twice (A→B or alternate days) to avoid "good day" bias.
Five metrics aligned with what actually burns review time:
| Metric | Meaning | How captured |
|---|---|---|
| Scope creep | Share of unrelated files / hunks | git diff --stat vs declared task paths |
| Diff volume | Whether +/- lines exceed task necessity | Line counts + manual "could be smaller" |
| First-pass green | First xcodebuild test / CI before PR | Same-host build logs on Cloud Mac |
| Revert / redo | Revert or full Agent rewrite within 48h of merge | Ticket + git log |
| Pre-implementation clarity | Hypotheses / options listed before coding | Session export scored 0/1 by hand |
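The scope-creep row can be approximated mechanically before the manual pass. The following is a minimal sketch, not the pilot's actual tooling: the function name `unrelated_share` is hypothetical, it counts whole files rather than hunks, and it assumes the declared task paths are given as directory prefixes.

```shell
#!/usr/bin/env bash
# Sketch: percentage of changed files that fall outside the declared task paths.
# Reads one changed path per line on stdin; arguments are allowed path prefixes.
# Hypothetical helper; counts files, not hunks.
unrelated_share() {
  local total=0 unrelated=0 path prefix matched
  while IFS= read -r path; do
    [ -z "$path" ] && continue
    total=$((total + 1))
    matched=0
    for prefix in "$@"; do
      case "$path" in
        "$prefix"*) matched=1; break ;;
      esac
    done
    if [ "$matched" -eq 0 ]; then
      unrelated=$((unrelated + 1))
    fi
  done
  if [ "$total" -eq 0 ]; then
    echo "0"
  else
    # Integer percentage, rounded down.
    echo $((unrelated * 100 / total))
  fi
}
```

A typical call would be `git diff --name-only main...HEAD | unrelated_share Sources/Checkout/ Tests/Checkout/`, with the prefixes taken from the ticket's declared scope.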
Results from 10 tickets
Table shows medians (n=10, not an official benchmark). If you remember one figure: unrelated diff −78%.
| Metric | No Karpathy (median) | With Karpathy (median) | Relative change (pilot) |
|---|---|---|---|
| Unrelated-path diff share | 18% | 4% | ~−78% |
| Task-related diff lines (+/− total) | 412 lines | 286 lines | ~−31% (less over-implementation) |
| First-pass CI green | 5/10 | 8/10 | +30pp |
| Revert / full rewrite within 48h | 3/10 | 1/10 | −67% (small n, order-of-magnitude only) |
| Pre-implementation clarity (manual 0/1) | 2/10 | 7/10 | Think Before Coding stands out |
Unrelated diff 18% → 4%: the Agent still touched many files, but drive-by "format the Payment module" hunks vanished from the diff, which is Surgical Changes doing its job. First-pass CI green 5/10 → 8/10 mostly tracks Goal-Driven Execution: failing tests were added before implementation more often.
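For anyone recomputing the last column, the arithmetic is plain relative change for the median rows and a percentage-point difference for the pass counts. A small sketch with hypothetical helper names (`rel_change`, `pp_change`), fed with the figures from the table:

```shell
#!/usr/bin/env bash
# Relative change in percent, rounded: (new - old) / old * 100.
rel_change() {
  awk -v old="$1" -v new="$2" 'BEGIN { printf "%.0f\n", (new - old) / old * 100 }'
}

# Percentage-point change between two pass counts out of n tickets.
pp_change() {
  awk -v before="$1" -v after="$2" -v n="$3" 'BEGIN { printf "%.0f\n", (after - before) / n * 100 }'
}

rel_change 18 4      # unrelated-path diff share   -> -78
rel_change 412 286   # task-related diff lines     -> -31
rel_change 3 1       # reverts within 48h          -> -67
pp_change 5 8 10     # first-pass CI green         -> 30
```

The two measures are not interchangeable: 5/10 to 8/10 is +30 percentage points but a 60% relative increase, which is why the table labels that row in pp.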
Where the lift is largest
- Cross-file rename / signature change: unrelated diff −78% felt largest in review—CR stops feeling like a minefield.
- Small feature with slightly fuzzy requirements: Think Before Coding forced hypothesis lists first; rework dropped from 3 rounds to 1 (tickets PAY-1842, AUTH-901).
- Test addition + implementation: first-pass green rate up; writing red→green into CLAUDE.md stabilized execution.
Orthogonal to a code knowledge graph: the graph answers "where to edit"; CLAUDE.md answers "don't touch the rest".
Where it barely helps (with a failure)
One disappointment: a single constant. Ticket CFG-77: change maxRetryCount from 3→5, path already specified. With Karpathy enabled the Agent still added a round: "Should I also change backoff strategy and unit-test defaults?"—+1 clarification step, diff as clean as without rules, +4 minutes total. On hyper-specified one-line edits the caution in the four principles delivered near-zero benefit.
Other near-no-op cases:
- Team CLAUDE.md already long and overlapping Karpathy: diminishing returns.
- No tests / no xcodebuild: Goal-Driven can't close; Agent still "looks done."
- Rules only in chat, not written to the repo: CLAUDE.md never loads.
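The last case is cheap to rule out before a pilot starts. A minimal sketch, assuming the rules file uses the four principle names as written in this article; `check_rules` is a hypothetical helper, not part of the upstream repo:

```shell
#!/usr/bin/env bash
# Sketch: confirm a rules file exists and names all four principles.
# Prints each problem found; returns non-zero if anything is missing.
check_rules() {
  local file="${1:-CLAUDE.md}" missing=0 heading
  if [ ! -f "$file" ]; then
    echo "missing file: $file"
    return 1
  fi
  for heading in "Think Before Coding" "Simplicity First" \
                 "Surgical Changes" "Goal-Driven Execution"; do
    if ! grep -qF -- "$heading" "$file"; then
      echo "missing principle: $heading"
      missing=1
    fi
  done
  return "$missing"
}
```

Run from the repo root as `check_rules` (it defaults to `CLAUDE.md`), or pass another path. This only checks that the text is present in the file; it says nothing about whether a given session actually loaded it.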
My final CLAUDE.md (copy-paste)
People searching CLAUDE.md, Claude Code Rules, Claude Code Prompt want a repo-ready artifact. Below: Karpathy four principles + two iOS team rules merged (body stays English for model adherence):
    # CLAUDE.md: Karpathy rules + Vuncloud iOS team

    ## Think Before Coding
    Don't assume. Don't hide confusion. Surface tradeoffs.
    Before implementing: state assumptions; if unclear, ask; if a simpler approach exists, say so.

    ## Simplicity First
    Minimum code that solves the problem. Nothing speculative.
    No extra abstractions, config, or error handling for impossible cases.

    ## Surgical Changes
    Touch only what you must. Don't "improve" adjacent code or formatting.
    Match existing style. Mention unrelated dead code; don't delete unless asked.

    ## Goal-Driven Execution
    Transform tasks into verifiable goals (tests first when applicable).
    Multi-step: list plan with verify check per step.

    ## Vuncloud: Build Verification (custom)
    After code changes to Swift/ObjC targets, run before claiming done:

        xcodebuild test -scheme YourApp -destination 'platform=iOS Simulator,name=iPhone 16'

    Report exit code. If tests fail, fix or stop; do not claim success.

    ## Vuncloud: Path Allowlist (custom)
    Unless the user explicitly expands scope, only edit paths they named
    or standard paired test paths (e.g. Sources/Foo/ ↔ Tests/Foo/).
Upstream install:
    /plugin marketplace add forrestchang/andrej-karpathy-skills
    /plugin install andrej-karpathy-skills@karpathy-skills

    # Or per-project:
    curl -fsSL -o /tmp/karpathy-claude.md \
      https://raw.githubusercontent.com/forrestchang/andrej-karpathy-skills/main/CLAUDE.md
    # Do not overwrite an existing CLAUDE.md; merge paragraphs
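One way to merge rather than overwrite is to put the downloaded principles in front of whatever the team file already says. This is a rough sketch under stated assumptions: `merge_rules` is a hypothetical helper, and it treats the heading text "Think Before Coding" as the marker that the principles are already present. A real merge still needs a human to deduplicate overlapping paragraphs.

```shell
#!/usr/bin/env bash
# Sketch: prepend downloaded rules to an existing CLAUDE.md without losing it.
# Skips the merge when the marker heading is already present, so reruns are no-ops.
merge_rules() {
  local upstream="$1" target="${2:-CLAUDE.md}" marker="Think Before Coding" tmp
  [ -f "$upstream" ] || { echo "upstream file not found: $upstream"; return 1; }
  if [ -f "$target" ] && grep -qF -- "$marker" "$target"; then
    echo "already merged"
    return 0
  fi
  tmp=$(mktemp "${TMPDIR:-/tmp}/claude-rules.XXXXXX") || return 1
  cat "$upstream" > "$tmp"
  if [ -s "$target" ]; then
    # Principles first, then a blank line, then the existing team rules.
    { echo; cat "$target"; } >> "$tmp"
  fi
  # Write through the original path so the file keeps its permissions.
  cat "$tmp" > "$target"
  rm -f "$tmp"
  echo "merged"
}
```

With the download above, the call would be `merge_rules /tmp/karpathy-claude.md CLAUDE.md`, followed by a normal `git diff` review before committing.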
Cursor users: sync .cursor/rules/karpathy-guidelines.mdc to the same semantics as Claude Code. When measuring Karpathy, pin claude-opus-4-8—don't mix with a model upgrade; see the Opus 4.8 long-run guide.
Cloud Mac long-run setup (Claude Code → tmux → Mac mini)
This fits Cloud Mac better than OpenHuman because the chain closes: Claude Code long session → tmux survives disconnect → same-host xcodebuild → dedicated Mac mini. A/B pilots, overnight agents, and cross-timezone CR all fear laptop sleep; I used Cloud Mac for environment stability only—the numbers reproduce locally.
- tmux + Claude Code: SSH drop doesn't kill the session (examples in the Opus 4.8 article).
- Persistent-volume CLAUDE.md: rules and monorepo travel together; config doesn't vanish.
- Goal-Driven hard verification: Build Verification in CLAUDE.md forces Scheme runs before "done."
- Same machine as Mac cloud CI—edit and verify immediately.
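The tmux part of that chain can be sketched as a small wrapper. Assumptions, stated loudly: the session name `claude-run` and the function `start_agent_session` are placeholders, and the Claude Code CLI is assumed to be on PATH as `claude`; adjust to however it is installed on your host.

```shell
#!/usr/bin/env bash
# Sketch: start a detached tmux session for the agent, or reuse an existing one.
# Session name and repo path are placeholders.
start_agent_session() {
  local session="${1:-claude-run}" repo="${2:-$PWD}"
  if tmux has-session -t "$session" 2>/dev/null; then
    echo "reusing $session"
  else
    # -d: detached, -s: session name, -c: starting directory.
    tmux new-session -d -s "$session" -c "$repo"
    # Launch Claude Code inside the session (assumes the CLI is named `claude`).
    tmux send-keys -t "$session" "claude" Enter
    echo "started $session"
  fi
}
```

After an SSH drop, `tmux attach -t claude-run` reconnects to the same session with the agent still running.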
Pilot metrics script snippet
    set -o pipefail  # without this, $? below reports tee's status, not xcodebuild's
    TASK_ID="PAY-1842"
    git diff --stat main...HEAD > "/tmp/${TASK_ID}-stat.txt"
    git diff --name-only main...HEAD | wc -l | awk '{print "files_changed=" $1}'
    # Unrelated paths: manual review or compare against allowlist
    xcodebuild test -scheme YourApp -destination 'platform=iOS Simulator,name=iPhone 16' \
      | tee "/tmp/${TASK_ID}-xcodebuild.log"
    echo "exit=$?" >> "/tmp/${TASK_ID}-xcodebuild.log"
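Once every ticket has a log, the first-pass green count can be tallied from the `exit=` lines the snippet appends. A minimal sketch, assuming one `*-xcodebuild.log` per ticket in a single directory; `count_green` is a hypothetical helper:

```shell
#!/usr/bin/env bash
# Sketch: count logs whose last "exit=" line reports 0.
# Assumes one <TASK_ID>-xcodebuild.log per ticket in the given directory.
count_green() {
  local dir="${1:-/tmp}" total=0 green=0 log
  for log in "$dir"/*-xcodebuild.log; do
    # Skip the unexpanded glob when no logs exist.
    [ -e "$log" ] || continue
    total=$((total + 1))
    if [ "$(grep '^exit=' "$log" | tail -n 1)" = "exit=0" ]; then
      green=$((green + 1))
    fi
  done
  echo "$green/$total"
}
```

`count_green /tmp` prints a fraction such as `8/10`. Keep the control and experiment logs in separate directories so the two arms are not mixed.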
When delegating in Claude Code, use a goal sentence that pins the fourth principle:
    Goal: Add non-positive amount validation to CheckoutViewModel.validateAmount.
    Verify: 1) Add unit tests covering 0, negative, and NaN; 2) xcodebuild test -scheme YourApp passes.
    Scope: Sources/Checkout/ and paired Tests/ only; do not touch UI theme or other modules.
FAQ
Did Karpathy write this? The principles trace to his X observations; the CLAUDE.md is community-maintained (Forrest Chang et al.). Karpathy has RT'd related repos—treat the README as source of truth.
Conflict with Claude Code memory / project instructions? Merge, don't replace. Put the four principles up front; team naming, branches, and test commands after.
Replace code review? No. Noise diffs drop, but merge gates, human CR, and security scans stay mandatory.
Cursor users? Copy into .cursor/rules/. Same git repo as Claude Code—keep semantics aligned.
Can I cite the percentages? Cite your own A/B. This table is a Vuncloud pilot (n=10, not rigorous double-blind).
Conclusion
I measured Karpathy Skills—not vague "does it work." Ten tickets, Opus 4.8 fixed: unrelated diff ~78% down, first-pass CI green +30pp; constant edits barely helped. What's valuable is a CLAUDE.md checked into the repo—Karpathy four principles plus your own build and path rules. For 24/7 Claude Code, Mac mini Cloud Mac + tmux is more natural than OpenHuman: agent, terminal, and Xcode belong on the same macOS.
Rent a Mac mini M4 for Claude Code long runs and A/B pilots
Vuncloud dedicated Mac mini M4 Cloud Mac: persistent CLAUDE.md, tmux, same-host xcodebuild—same environment as Opus 4.8 overnight agents.