Six months of essays on agents and HTML; the same patterns recur under different names. This is what's hiding in the cracks between the pieces — connections the individual essays don't quite make, with the receipts.
The library reads, at a glance, like four separate piles: the HTML-as-format thesis, the CLAUDE.md essays, the autonomous-loop essays, and the practitioner case studies. Read in sequence, that clustering holds. Read across, it dissolves. The same small set of ideas keeps reappearing in different costumes, and the writers — at least the ones in this collection — almost never name the move.
What follows is seven of those crossings. Each one is a pattern repeated in three or more essays under different vocabularies. I am calling them crossings, not theses, because they are observations about the corpus, not arguments about the world.
CLAUDE.md, and a public Slack channel are the same primitiveRead Karpathy's autoresearch and Hayduk's Codex goal mode back to back and the parallel is so clean that the codex-goals essay marks it explicitly: PLAN.md is program.md, EXPERIMENTS.md is results.tsv. What that essay doesn't do — and what no piece in the library does — is notice that the same shape is also Mnimiy's CLAUDE.md spec, the project instructions for "Format as html" itself, and Tobi Lütke's account of River-in-Slack. Five very different artifacts, one architecture.
The architecture has four roles. Instruction is what the human writes to specify intent; it changes slowly and is reviewed deliberately. Work is the artifact the agent actually modifies; it changes fast and is reviewed by diff. Ledger is the durable record of what was tried and what happened, kept separate from work so failed attempts aren't lost when the branch resets. Substrate is the part nobody is allowed to edit — the fixed evaluation, the test suite, the codebase the PR opens against, the company's actual production data. It is what reality looks like from inside the loop.
The point is not that the roles are profound. The point is that they are stable across very different problems, and that the systems that work tend to keep all four separated. The ones that fail tend to collapse two or more into a single file.
| System | Instruction | Work | Ledger | Substrate |
|---|---|---|---|---|
| Autoresearch Karpathy | program.md marching orders, ~114 lines, human-edited | train.py ~630 lines, agent edits freely | results.tsv + git branch every attempt logged; commits = ratified moves | prepare.py + eval set immutable; defines what "better" means |
| Codex goal mode Hayduk | PLAN.md human-seeded; agent revises as it learns | EXPERIMENT_NOTES.md live scratchpad as the loop runs | EXPERIMENTS.md curated ledger, "the file that survives" | the measurable goal implicit; "set up a clear, measurable goal" |
| Claude Code Karpathy 4 · Mnimiy 12 | CLAUDE.md behavioral contract; failure modes & rules | the codebase files the agent reads & edits | git history commits + PR descriptions; what shipped | tests + production "tests verify intent" — Mnimiy rule 9 |
| "Format as html" these project instructions | project instructions ~430 words; the behavior contract | the current chat where new artifacts are produced | project knowledge the curated library; the only cross-chat state | the visual register implicit; "match the essays already there" |
| River @ Shopify Lütke | channel asks @-mention prompts in public Slack | pull requests 1,870/week, in the open | channel history searchable; "the Lehrwerkstatt" | the monorepo + prod the company's actual systems |
results.tsv) put a star on it.
Three corollaries fall out of the table that none of the source essays state.
First, the ledger is more durable than the work. A coding agent that ships a PR and discards its scratchpad has thrown away nine-tenths of what it learned. The systems in the corpus that survive long runs are the ones where the ledger is a separate, append-only file — Hayduk literally calls EXPERIMENTS.md "the file that survives long after a session's context window is gone." When CLAUDE.md essays argue you should "write more atomic commits," they are arguing for the same thing in different vocabulary: make git the ledger.
Second, the substrate is the one place agents fail by editing. Mnimiy's rule 16 ("touch only what was asked") and the autoresearch decision to ship prepare.py as read-only are the same move: the world the agent acts in cannot be modifiable by the agent. The pattern fails the moment an agent edits its own evaluation, its own tests, or — in River's case — its own merge criteria.
Third, the instruction surface is small. program.md is ~114 lines. The "Format as html" project instructions are ~430 words. The four Karpathy rules fit on a postcard. Mnimiy's evidence-backed claim is that past 200 lines, CLAUDE.md compliance falls off a cliff. The behavioral contract is small by construction; this is not aesthetic preference but a load-bearing architectural choice.
The human iterates on the prompt; the agent iterates on the training code.— Karpathy · X · 9 March 2026 — the line that names two of the four surfaces but not the other two
Mnimiy's twelve-rule CLAUDE.md reaches its hardest rule at the end: Fail loud. "The most expensive Claude failures are the ones that look like success. A function 'works' but returns wrong data. A migration 'completes' but skipped 30 records." It's framed as a coding-agent concern. But once you've seen it phrased that way, it starts appearing everywhere — and not just in the CLAUDE.md essays.
The pattern below is striking enough to make the case that this is the corpus's single most repeated thesis, even though no piece argues for it as such. The evidence:
results.tsv. The file is append-only and deliberately untracked by git so silence cannot creep in via reset.What makes this striking is not that the rule appears in eight essays. It's that the rule appears under eight different vocabularies, in eight different domains, framed as eight different failure modes — and no essay reaches across to the others to notice the pattern. Coding-agent writers think they're solving a coding-agent problem. RAG-prompt writers think they're solving a hallucination problem. Lütke thinks he's solving an organizational-learning problem. They are all solving the same problem.
The deeper claim under all of this — and this is the part I think is worth holding — is that the visible artifact is the safety mechanism. HTML is loud markdown. CLAUDE.md is loud preferences. A public Slack channel is a loud private DM. results.tsv is a loud git history. The library is, at root, a sustained argument that the way you make AI work reliably is to leave it nowhere to hide.
Karpathy's autoresearch loop is explicit about being pure greedy hill-climbing — no simulated annealing, no basin-hopping, no rollback to explore around a plateau. program.md tells the agent to rewind from kept commits "very very sparingly (if ever)." The bet, in his framing, is that the search surface around a decent baseline has enough small, additive gains that greedy ascent finds useful directions before it stalls.
What you don't notice until you read across is that the same shape is the recommended workflow in essays that have nothing to do with ML research.
EXPERIMENTS.md verdicts · Garry Tan's
prompt-reversal · Lütke's 36% → 77% merge-rate climb
Compare the systems:
val_bpb. Kept commits advance the branch; everything else is reset away.The shared structure is: greedy local search, a strict ledger of attempts, and a single measurable metric that decides each step. The metric varies — validation loss, response quality, merge rate, prompt effectiveness — but the loop is identical. Nothing in the corpus reaches for a planner. Nothing reaches for tree search. Nothing reaches for a critic-actor split. The dominant unit of progress is the smallest possible loop, run as fast as the feedback signal allows.
The bet under this is interesting and not, to my eye, examined anywhere in the corpus: the search surface near a decent baseline is dense in small, additive improvements. If that's true, you don't need exploration mechanisms; you just need a fast loop with an honest ledger. If it's false — if the gains require leaving a local optimum — every system in the library is undertooled. Karpathy's autoresearch essay flags this as Limitation F-02. Nobody else in the corpus addresses it.
Karpathy's Sequoia talk gives the cleanest metaphor in the library: weights as CPU, context window as RAM, the classical computer reduced to a peripheral. It's a useful picture. It's also wrong in the specific way that matters most.
If the context window were really RAM, you could trust what's in it. You can't. The corpus, read across, is candid about this without ever saying it directly:
program.md: don't pipe with tee. "Verbose log output flooding the agent's context window is one of the dominant failure modes of long autonomous runs."The unstated empirical claim under all of this: the effective context window is much smaller than the advertised context window, and degrades non-uniformly as it fills. Rules buried in the middle get lost. Documents in the wrong position bias attention. Logs flood out instructions. Recursive summarization is necessary because a single-pass summary of a long document does not actually have the whole document available.
Two consequences fall out of this that I haven't seen named in the corpus.
First, every long-running agent system in the library aggressively externalizes state to disk. Hayduk says it most directly: "put the thread on disk." But autoresearch does it (TSV + git branch + scratchpad as three substrates), Claude Code does it (CLAUDE.md + memory files), and Lütke's River does it (Slack channel history as durable, searchable memory). The architectural move is the recovery pattern — and the four-surface architecture from §01 is partly an explanation of why: each surface is a different externalization of state that the context window can't reliably hold.
Second, the value of HTML-as-artifact may be a context-window phenomenon as much as a usability phenomenon. Thariq's essay argues HTML wins on spatial layout, color, real diagrams, and round-trip editing. All true. But there's a subtler win the essay doesn't claim: an HTML artifact is a state externalization. The information lives on disk, addressable, inspectable, and doesn't have to be rebuilt from context next time. A markdown chat reply lives in the chat; an HTML artifact lives in the filesystem. The longer the work, the bigger the difference.
CLAUDE.md, and a public Slack channel are the same primitive.This is the crossing I'm least sure about and most enjoy. Three things the corpus treats as different categories — an HTML artifact, a CLAUDE.md file, and Tobi Lütke's insistence that River works only in public — turn out, when you list their properties, to be the same primitive playing different roles.
CLAUDE.md / spec.The same five properties — persistent, addressable, inspectable, compounding, loud — show up in all three. That's not coincidence. Every essay in the corpus that argues for one of these three is, structurally, making the same argument: the artifact that survives the session is the thing that lets the work compound. Thariq doesn't quite say this; he says HTML is a richer canvas. Mnimiy doesn't quite say this; he says CLAUDE.md is a behavioral contract. Lütke doesn't quite say this; he says public is faster than private. They're describing one property of the same primitive from three different windows.
The implication, if you take it seriously: the unit of leverage in this generation of agent work is not the prompt and not the model — it is the durable artifact the agent leaves behind. Garry Tan's "fat skills, fat code, thin harness" gets close. But Tan is talking about the engineering scaffolding around the model. The point goes further: the output is also part of the scaffolding, because every output is a candidate to become an input next time. An HTML artifact is a CLAUDE.md for a one-off interaction. A CLAUDE.md is a public Slack channel for a private project. River-in-Slack is HTML-as-format scaled to an organization.
This one is small and a little odd. The "Format as html" project instructions — the artifact that produced these very pages — are written in exactly the shape Mnimiy describes for CLAUDE.md: named failure modes, contracts that close them, XML tags as section markers, positive framing of rules, a budget under 500 words. The instructions explicitly cite claude-md.html as their template, and explicitly note which rules don't transfer (no codebase, no tests, no commit boundary).
What's worth noticing is the loop. The library contains an essay about CLAUDE.md (Mnimiy). That essay shapes the project's own behavioral contract. The behavioral contract produces new essays in the library. The new essays cite the old ones, including Mnimiy. The library is its own substrate.
This is not unique to "Format as html" — it's the same self-referential turn Lütke describes when he says River gets better at being Shopify because the channel history accumulates Shopify's taste. It's the same turn Garry Tan describes when his book-mirror skill v3 "cites brain pages from my meeting notes" — the system learning from its own past output. Compounding, when you actually look at what's compounding, turns out to be the artifact-stream eating itself in a useful direction.
The interesting question this raises and the corpus does not answer: how do you tell when the loop is compounding usefully versus calcifying? Mnimiy's analogy in claude-md.html is excavators: you learn the machine's quirks and route around them. But a CLAUDE.md that fixes January 2026 failure modes can become a liability against May 2026 failure modes, and the essay itself flags this as the reason the original Karpathy 4 needed eight more rules. A self-referential library is not automatically a self-improving one. The corpus shows the structure; what it doesn't show is the maintenance schedule.
The first six crossings are about what the essays share. This one is about what they don't say — gaps consistent enough across sixteen essays to be a finding in their own right. If you only read the library, here is what you'd never quite encounter.
None of these are individual failings of the essays. Most are reasonable scope decisions for the piece in question. They are interesting only at the corpus level, as a shared blind spot of a tribe.
The library is a tribe — useful and worth knowing. It has converged on a real and coherent set of practices for 2026 agent work, and the convergence is itself the signal: when sixteen writers in different domains arrive at the same primitives under different names, they are tracking something real. But it is a tribe, with the corresponding blind spots. A reading list that included one good dissenting voice would be sturdier.