Vuncloud Blog
← Back to Dev Diary

Do You Need a CLAUDE.md for Claude Code? I Tested Karpathy Skills on 10 Real Tickets

Unrelated diff −78%, first-pass CI +30pp — one week measured · Field notes · 2026.06.01 ·~16 min read

Code editor open on a MacBook — a metaphor for A/B testing Karpathy CLAUDE.md behavior rules in Claude Code

Results first. To see whether Karpathy Skills (community-curated CLAUDE.md behavior rules) actually help, I ran 10 real tickets with Claude Opus 4.8 locked—the only variable was whether Karpathy's four principles were loaded. After one week, the numbers that stuck:

Biggest shift: unrelated code changes down ~78%
Median unrelated-path diff share went from 18% to 4%. On the same task set, first-pass xcodebuild test green rose from 5/10 to 8/10 (+30pp). Median human CR time landed around 38 min → 22 min.

There was also a disappointment—a one-line constant change where the rules barely helped and added an extra round of assumption-checking. Success and failure both get documented below, plus the final CLAUDE.md we keep in the repo (what people search for as Claude Code Rules / Claude Code Prompt).

−78%
Unrelated-path diff (median, 10-ticket pilot)
+30pp
First-pass CI green (5/10 → 8/10)
10
Real tickets · claude-opus-4-8 fixed

Why Claude Code edits code it shouldn't

It's less "the model is dumb" and more a default optimization target misaligned with engineers: ship a patch that looks complete fast—add abstractions, touch neighbor files, skip ambiguity. Andrej Karpathy summarized four chronic symptoms on X: coding before asking, over-engineering simple asks, drive-by edits on files that weren't in scope, and replacing verifiable goals with "done."

In spring 2026 the community compressed that into a few dozen lines of CLAUDE.md (forrestchang/andrej-karpathy-skills, aka Karpathy Skills). No new runtime—just four behavioral contracts Claude Code reads every session. What I wanted was I tested it — what's the number.

What Karpathy Skills are

The name says Skills, but the artifact is a project instruction file—not OpenAI GPT Skills, not OpenHuman's SKILL.md plugin pack. Claude Code reads repo-root CLAUDE.md (plus equivalent text from global plugins) at session start and constrains how the agent thinks and cuts diffs before you type the first line.

The four principles map to the upstream CLAUDE.md:

  • Think Before Coding: State assumptions, surface ambiguity, propose simpler approaches—don't silently pick one interpretation and start typing.
  • Simplicity First: Minimum code that solves the problem. Reject unrequested abstractions, config, or "future-proof" layers.
  • Surgical Changes: Touch only task-necessary lines. No drive-by "cleanup" or formatting. Remove dead code you introduced; mention legacy dead code, don't delete unless asked.
  • Goal-Driven Execution: Turn bug fixes and validation into verifiable goals (failing test first when applicable). Multi-step work: list a verify checkpoint per step.

Install two ways: Claude Code plugin /plugin install andrej-karpathy-skills@karpathy-skills, or merge upstream CLAUDE.md at repo root—the final version below is copy-paste ready.

How I ran the test

Variable control (ran on a Mac mini M4 Cloud Mac for tmux long-runs and uninterrupted xcodebuild; numbers reproduce on a local Mac too):

  • 10 real tickets: 4 small Swift features, 2 cross-file renames, 2 test additions, 2 CI script tweaks.
  • Model fixed: claude-opus-4-8, Effort high, no minor bump within the same week.
  • Single variable: control used old CLAUDE.md (no Karpathy). Experiment merged four principles + two team rules below.
  • Each ticket twice (A→B or alternate days) to avoid "good day" bias.

Five metrics aligned with what actually burns review time:

Metric Meaning How captured
Scope creep Share of unrelated files / hunks git diff --stat vs declared task paths
Diff volume Whether +/- lines exceed task necessity Line counts + manual "could be smaller"
First-pass green First xcodebuild test / CI before PR Same-host build logs on Cloud Mac
Revert / redo Revert or full Agent rewrite within 48h of merge Ticket + git log
Pre-implementation clarity Hypotheses / options listed before coding Session export scored 0/1 by hand

Results from 10 tickets

Table shows medians (n=10, not an official benchmark). If you remember one figure: unrelated diff −78%.

Metric No Karpathy (median) With Karpathy (median) Relative change (pilot)
Unrelated-path diff share 18% 4% ~−78%
Task-related diff lines (+/− total) 412 lines 286 lines ~−31% (less over-implementation)
First-pass CI green 5/10 8/10 +30pp
Revert / full rewrite within 48h 3/10 1/10 −67% (small n, order-of-magnitude only)
Pre-implementation clarity (manual 0/1) 2/10 7/10 Think Before Coding stands out

Unrelated diff 18% → 4%: the Agent still touched many files, but drive-by "format the Payment module" hunks vanished from the diff—Surgical Changes on that curve. First-pass CI green 5/10 → 8/10 mostly tracks Goal-Driven Execution: failing tests added before implementation more often.

Team discussing requirements and acceptance criteria on a whiteboard — Goal-Driven Execution and verifiable Claude Code prompts

Where the lift is largest

  • Cross-file rename / signature change: unrelated diff −78% felt largest in review—CR stops feeling like a minefield.
  • Small feature with slightly fuzzy requirements: Think Before Coding forced hypothesis lists first; rework dropped from 3 rounds to 1 (tickets PAY-1842, AUTH-901).
  • Test addition + implementation: first-pass green rate up; writing red→green into CLAUDE.md stabilized execution.

Orthogonal to a code knowledge graph: the graph answers "where to edit"; CLAUDE.md answers "don't touch the rest".

Where it barely helps (with a failure)

One disappointment: a single constant. Ticket CFG-77: change maxRetryCount from 3→5, path already specified. With Karpathy enabled the Agent still added a round: "Should I also change backoff strategy and unit-test defaults?"—+1 clarification step, diff as clean as without rules, +4 minutes total. On hyper-specified one-line edits the caution in the four principles delivered near-zero benefit.

Other near-no-op cases:

  • Team CLAUDE.md already long and overlapping Karpathy—diminishing returns.
  • No tests / no xcodebuild—Goal-Driven can't close; Agent still "looks done."
  • Rules only in chat, not written to the repo—CLAUDE.md never loads.

My final CLAUDE.md (copy-paste)

People searching CLAUDE.md, Claude Code Rules, Claude Code Prompt want a repo-ready artifact. Below: Karpathy four principles + two iOS team rules merged (body stays English for model adherence):

Repo root · CLAUDE.md
# CLAUDE.md — Karpathy rules + Vuncloud iOS team

## Think Before Coding
Don't assume. Don't hide confusion. Surface tradeoffs.
Before implementing: state assumptions; if unclear, ask; if a simpler approach exists, say so.

## Simplicity First
Minimum code that solves the problem. Nothing speculative.
No extra abstractions, config, or error handling for impossible cases.

## Surgical Changes
Touch only what you must. Don't "improve" adjacent code or formatting.
Match existing style. Mention unrelated dead code; don't delete unless asked.

## Goal-Driven Execution
Transform tasks into verifiable goals (tests first when applicable).
Multi-step: list plan with verify check per step.

## Vuncloud: Build Verification (custom)
After code changes to Swift/ObjC targets, run before claiming done:
  xcodebuild test -scheme YourApp -destination 'platform=iOS Simulator,name=iPhone 16'
Report exit code. If tests fail, fix or stop — do not claim success.

## Vuncloud: Path Allowlist (custom)
Unless the user explicitly expands scope, only edit paths they named
or standard paired test paths (e.g. Sources/Foo/ ↔ Tests/Foo/).

Upstream install:

Claude Code · plugin or curl
/plugin marketplace add forrestchang/andrej-karpathy-skills
/plugin install andrej-karpathy-skills@karpathy-skills

# Or per-project:
curl -fsSL -o /tmp/karpathy-claude.md \
  https://raw.githubusercontent.com/forrestchang/andrej-karpathy-skills/main/CLAUDE.md
# Do not overwrite existing CLAUDE.md — merge paragraphs

Cursor users: sync .cursor/rules/karpathy-guidelines.mdc to the same semantics as Claude Code. When measuring Karpathy, pin claude-opus-4-8—don't mix with a model upgrade; see the Opus 4.8 long-run guide.

Cloud Mac long-run setup (Claude Code → tmux → Mac mini)

This fits Cloud Mac better than OpenHuman because the chain closes: Claude Code long session → tmux survives disconnect → same-host xcodebuild → dedicated Mac mini. A/B pilots, overnight agents, and cross-timezone CR all fear laptop sleep; I used Cloud Mac for environment stability only—the numbers reproduce locally.

  • tmux + Claude Code: SSH drop doesn't kill the session (examples in the Opus 4.8 article).
  • Persistent-volume CLAUDE.md: rules and monorepo travel together; config doesn't vanish.
  • Goal-Driven hard verification: Build Verification in CLAUDE.md forces Scheme runs before "done."
  • Same machine as Mac cloud CI—edit and verify immediately.

Pilot metrics script snippet

Capture metrics after each task
TASK_ID="PAY-1842"
git diff --stat main...HEAD > "/tmp/${TASK_ID}-stat.txt"
git diff --name-only main...HEAD | wc -l | awk '{print "files_changed=" $1}'
# Unrelated paths: manual review or compare against allowlist
xcodebuild test -scheme YourApp -destination 'platform=iOS Simulator,name=iPhone 16' \
  | tee "/tmp/${TASK_ID}-xcodebuild.log"
echo "exit=$?" >> "/tmp/${TASK_ID}-xcodebuild.log"

When delegating in Claude Code, use a goal sentence that pins the fourth principle:

Claude Code · prompt example
Goal: Add non-positive amount validation to CheckoutViewModel.validateAmount.
Verify: 1) Add unit tests covering 0, negative, and NaN; 2) xcodebuild test -scheme YourApp passes;
Scope: Sources/Checkout/ and paired Tests/ only — do not touch UI theme or other modules.

FAQ

Did Karpathy write this? The principles trace to his X observations; the CLAUDE.md is community-maintained (Forrest Chang et al.). Karpathy has RT'd related repos—treat the README as source of truth.

Conflict with Claude Code memory / project instructions? Merge, don't replace. Put the four principles up front; team naming, branches, and test commands after.

Replace code review? No. Noise diffs drop, but merge gates, human CR, and security scans stay mandatory.

Cursor users? Copy into .cursor/rules/. Same git repo as Claude Code—keep semantics aligned.

Can I cite the percentages? Cite your own A/B. This table is a Vuncloud pilot (n=10, not rigorous double-blind).

Conclusion

I measured Karpathy Skills—not vague "does it work." Ten tickets, Opus 4.8 fixed: unrelated diff ~78% down, first-pass CI green +30pp; constant edits barely helped. What's valuable is a CLAUDE.md checked into the repo—Karpathy four principles plus your own build and path rules. For 24/7 Claude Code, Mac mini Cloud Mac + tmux is more natural than OpenHuman: agent, terminal, and Xcode belong on the same macOS.

Rent a Mac mini M4 for Claude Code long runs and A/B pilots

Vuncloud dedicated Mac mini M4 Cloud Mac: persistent CLAUDE.md, tmux, same-host xcodebuild—same environment as Opus 4.8 overnight agents.

Shortcuts: Mac Mini M4 Plans, Help Center, More from the blog.

AI developers

−78% unrelated diff starts with CLAUDE.md

Karpathy Skills · Claude Code Rules · Cloud Mac

Back to home
Limited offer View plans