What are Karpathy Skills?

A community artifact that compresses Andrej Karpathy's observations about common LLM coding failures into four principles written into a repo-root CLAUDE.md (forrestchang/andrej-karpathy-skills and similar). Claude Code reads it at session start and treats it as a persistent project-level system prompt.

How much does unrelated diff drop?

In our 10-ticket Swift monorepo pilot, median unrelated-path diff share fell from 18% to 4% (roughly 78% relative reduction). First-pass CI green went from 5/10 to 8/10 (+30pp). Single-line constant edits saw almost no improvement—see the CFG-77 failure in the article.

How should I write CLAUDE.md?

Merge the four Karpathy principles (Think Before Coding, Simplicity First, Surgical Changes, Goal-Driven Execution) with team-specific rules such as xcodebuild verification commands and path allowlists. The article includes a copy-paste final version.

How do I install it in Claude Code?

Globally: /plugin marketplace add forrestchang/andrej-karpathy-skills then /plugin install andrej-karpathy-skills@karpathy-skills. Per project: curl the upstream CLAUDE.md and merge paragraphs into your existing file—do not overwrite.

Karpathy Skills vs Opus 4.8 upgrade—which matters more?

The model raises the ceiling; CLAUDE.md sets the floor. They are orthogonal: fix the model version for A/B and toggle only Karpathy on/off so model gains are not mistaken for Skills gains.

Why run this on a Cloud Mac?

Long sessions, tmux, xcodebuild verification, and a persistent CLAUDE.md need stable macOS. Laptop sleep kills agents. A dedicated Mac mini M4 can run Claude Code and iOS CI on the same host 24/7.

Do You Need a CLAUDE.md for Claude Code? Karpathy Skills on 10 Real Tickets (2026)

Results first. To see whether Karpathy Skills (community-curated CLAUDE.md behavior rules) actually help, I ran 10 real tickets with Claude Opus 4.8 locked—the only variable was whether Karpathy's four principles were loaded. After one week, the numbers that stuck:

Biggest shift: unrelated code changes down ~78%

Median unrelated-path diff share went from 18% to 4%. On the same task set, first-pass xcodebuild test green rose from 5/10 to 8/10 (+30pp). Median human CR time landed around 38 min → 22 min.

There was also a disappointment—a one-line constant change where the rules barely helped and added an extra round of assumption-checking. Success and failure both get documented below, plus the final CLAUDE.md we keep in the repo (what people search for as Claude Code Rules / Claude Code Prompt).

−78%

Unrelated-path diff (median, 10-ticket pilot)

+30pp

First-pass CI green (5/10 → 8/10)

Real tickets · claude-opus-4-8 fixed

Why Claude Code edits code it shouldn't

It's less "the model is dumb" and more a default optimization target misaligned with engineers: ship a patch that looks complete fast—add abstractions, touch neighbor files, skip ambiguity. Andrej Karpathy summarized four chronic symptoms on X: coding before asking, over-engineering simple asks, drive-by edits on files that weren't in scope, and replacing verifiable goals with "done."

In spring 2026 the community compressed that into a few dozen lines of CLAUDE.md (forrestchang/andrej-karpathy-skills, aka Karpathy Skills). No new runtime—just four behavioral contracts Claude Code reads every session. What I wanted was I tested it — what's the number.

What Karpathy Skills are

The name says Skills, but the artifact is a project instruction file—not OpenAI GPT Skills, not OpenHuman's SKILL.md plugin pack. Claude Code reads repo-root CLAUDE.md (plus equivalent text from global plugins) at session start and constrains how the agent thinks and cuts diffs before you type the first line.

The four principles map to the upstream CLAUDE.md:

Think Before Coding: State assumptions, surface ambiguity, propose simpler approaches—don't silently pick one interpretation and start typing.
Simplicity First: Minimum code that solves the problem. Reject unrequested abstractions, config, or "future-proof" layers.
Surgical Changes: Touch only task-necessary lines. No drive-by "cleanup" or formatting. Remove dead code you introduced; mention legacy dead code, don't delete unless asked.
Goal-Driven Execution: Turn bug fixes and validation into verifiable goals (failing test first when applicable). Multi-step work: list a verify checkpoint per step.

Install two ways: Claude Code plugin /plugin install andrej-karpathy-skills@karpathy-skills, or merge upstream CLAUDE.md at repo root—the final version below is copy-paste ready.

How I ran the test

Variable control (ran on a Mac mini M4 Cloud Mac for tmux long-runs and uninterrupted xcodebuild; numbers reproduce on a local Mac too):

10 real tickets: 4 small Swift features, 2 cross-file renames, 2 test additions, 2 CI script tweaks.
Model fixed: claude-opus-4-8, Effort high, no minor bump within the same week.
Single variable: control used old CLAUDE.md (no Karpathy). Experiment merged four principles + two team rules below.
Each ticket twice (A→B or alternate days) to avoid "good day" bias.

Five metrics aligned with what actually burns review time:

Metric	Meaning	How captured
Scope creep	Share of unrelated files / hunks	`git diff --stat` vs declared task paths
Diff volume	Whether +/- lines exceed task necessity	Line counts + manual "could be smaller"
First-pass green	First `xcodebuild test` / CI before PR	Same-host build logs on Cloud Mac
Revert / redo	Revert or full Agent rewrite within 48h of merge	Ticket + git log
Pre-implementation clarity	Hypotheses / options listed before coding	Session export scored 0/1 by hand

Results from 10 tickets

Table shows medians (n=10, not an official benchmark). If you remember one figure: unrelated diff −78%.

Metric	No Karpathy (median)	With Karpathy (median)	Relative change (pilot)
Unrelated-path diff share	18%	4%	~−78%
Task-related diff lines (+/− total)	412 lines	286 lines	~−31% (less over-implementation)
First-pass CI green	5/10	8/10	+30pp
Revert / full rewrite within 48h	3/10	1/10	−67% (small n, order-of-magnitude only)
Pre-implementation clarity (manual 0/1)	2/10	7/10	Think Before Coding stands out

Unrelated diff 18% → 4%: the Agent still touched many files, but drive-by "format the Payment module" hunks vanished from the diff—Surgical Changes on that curve. First-pass CI green 5/10 → 8/10 mostly tracks Goal-Driven Execution: failing tests added before implementation more often.

Team discussing requirements and acceptance criteria on a whiteboard — Goal-Driven Execution and verifiable Claude Code prompts

Where the lift is largest

Cross-file rename / signature change: unrelated diff −78% felt largest in review—CR stops feeling like a minefield.
Small feature with slightly fuzzy requirements: Think Before Coding forced hypothesis lists first; rework dropped from 3 rounds to 1 (tickets PAY-1842, AUTH-901).
Test addition + implementation: first-pass green rate up; writing red→green into CLAUDE.md stabilized execution.

Orthogonal to a code knowledge graph: the graph answers "where to edit"; CLAUDE.md answers "don't touch the rest".

Where it barely helps (with a failure)

One disappointment: a single constant. Ticket CFG-77: change maxRetryCount from 3→5, path already specified. With Karpathy enabled the Agent still added a round: "Should I also change backoff strategy and unit-test defaults?"—+1 clarification step, diff as clean as without rules, +4 minutes total. On hyper-specified one-line edits the caution in the four principles delivered near-zero benefit.

Other near-no-op cases:

Team CLAUDE.md already long and overlapping Karpathy—diminishing returns.
No tests / no xcodebuild—Goal-Driven can't close; Agent still "looks done."
Rules only in chat, not written to the repo—CLAUDE.md never loads.

My final CLAUDE.md (copy-paste)

People searching CLAUDE.md, Claude Code Rules, Claude Code Prompt want a repo-ready artifact. Below: Karpathy four principles + two iOS team rules merged (body stays English for model adherence):

Repo root · CLAUDE.md

# CLAUDE.md — Karpathy rules + Vuncloud iOS team

## Think Before Coding
Don't assume. Don't hide confusion. Surface tradeoffs.
Before implementing: state assumptions; if unclear, ask; if a simpler approach exists, say so.

## Simplicity First
Minimum code that solves the problem. Nothing speculative.
No extra abstractions, config, or error handling for impossible cases.

## Surgical Changes
Touch only what you must. Don't "improve" adjacent code or formatting.
Match existing style. Mention unrelated dead code; don't delete unless asked.

## Goal-Driven Execution
Transform tasks into verifiable goals (tests first when applicable).
Multi-step: list plan with verify check per step.

## Vuncloud: Build Verification (custom)
After code changes to Swift/ObjC targets, run before claiming done:
  xcodebuild test -scheme YourApp -destination 'platform=iOS Simulator,name=iPhone 16'
Report exit code. If tests fail, fix or stop — do not claim success.

## Vuncloud: Path Allowlist (custom)
Unless the user explicitly expands scope, only edit paths they named
or standard paired test paths (e.g. Sources/Foo/ ↔ Tests/Foo/).

Upstream install:

Claude Code · plugin or curl

/plugin marketplace add forrestchang/andrej-karpathy-skills
/plugin install andrej-karpathy-skills@karpathy-skills

# Or per-project:
curl -fsSL -o /tmp/karpathy-claude.md \
  https://raw.githubusercontent.com/forrestchang/andrej-karpathy-skills/main/CLAUDE.md
# Do not overwrite existing CLAUDE.md — merge paragraphs

Cursor users: sync .cursor/rules/karpathy-guidelines.mdc to the same semantics as Claude Code. When measuring Karpathy, pin claude-opus-4-8—don't mix with a model upgrade; see the Opus 4.8 long-run guide.

Cloud Mac long-run setup (Claude Code → tmux → Mac mini)

This fits Cloud Mac better than OpenHuman because the chain closes: Claude Code long session → tmux survives disconnect → same-host xcodebuild → dedicated Mac mini. A/B pilots, overnight agents, and cross-timezone CR all fear laptop sleep; I used Cloud Mac for environment stability only—the numbers reproduce locally.

tmux + Claude Code: SSH drop doesn't kill the session (examples in the Opus 4.8 article).
Persistent-volume CLAUDE.md: rules and monorepo travel together; config doesn't vanish.
Goal-Driven hard verification: Build Verification in CLAUDE.md forces Scheme runs before "done."
Same machine as Mac cloud CI—edit and verify immediately.

Pilot metrics script snippet

Capture metrics after each task

TASK_ID="PAY-1842"
git diff --stat main...HEAD > "/tmp/${TASK_ID}-stat.txt"
git diff --name-only main...HEAD | wc -l | awk '{print "files_changed=" $1}'
# Unrelated paths: manual review or compare against allowlist
xcodebuild test -scheme YourApp -destination 'platform=iOS Simulator,name=iPhone 16' \
  | tee "/tmp/${TASK_ID}-xcodebuild.log"
echo "exit=$?" >> "/tmp/${TASK_ID}-xcodebuild.log"

When delegating in Claude Code, use a goal sentence that pins the fourth principle:

Claude Code · prompt example

Goal: Add non-positive amount validation to CheckoutViewModel.validateAmount.
Verify: 1) Add unit tests covering 0, negative, and NaN; 2) xcodebuild test -scheme YourApp passes;
Scope: Sources/Checkout/ and paired Tests/ only — do not touch UI theme or other modules.

FAQ

Did Karpathy write this? The principles trace to his X observations; the CLAUDE.md is community-maintained (Forrest Chang et al.). Karpathy has RT'd related repos—treat the README as source of truth.

Conflict with Claude Code memory / project instructions? Merge, don't replace. Put the four principles up front; team naming, branches, and test commands after.

Replace code review? No. Noise diffs drop, but merge gates, human CR, and security scans stay mandatory.

Cursor users? Copy into .cursor/rules/. Same git repo as Claude Code—keep semantics aligned.

Can I cite the percentages? Cite your own A/B. This table is a Vuncloud pilot (n=10, not rigorous double-blind).

Conclusion

I measured Karpathy Skills—not vague "does it work." Ten tickets, Opus 4.8 fixed: unrelated diff ~78% down, first-pass CI green +30pp; constant edits barely helped. What's valuable is a CLAUDE.md checked into the repo—Karpathy four principles plus your own build and path rules. For 24/7 Claude Code, Mac mini Cloud Mac + tmux is more natural than OpenHuman: agent, terminal, and Xcode belong on the same macOS.