Reading across the library

Seven
crossings.

Six months of essays on agents and HTML; the same patterns recur under different names. This is what's hiding in the cracks between the pieces: connections the individual essays don't quite make, with the receipts.

Corpus

16 essays · 2026

Reading

~18 minutes

Method

Cross-search, then synthesize

Caveat

See § Colophon

Before the crossings

The library reads, at a glance, like four separate piles: the HTML-as-format thesis, the CLAUDE.md essays, the autonomous-loop essays, and the practitioner case studies. Read in sequence, that clustering holds. Read across, it dissolves. The same small set of ideas keeps reappearing in different costumes, and the writers, at least the ones in this collection, almost never name the move.

What follows is seven of those crossings. Each one is a pattern repeated in three or more essays under different vocabularies. I am calling them crossings, not theses, because they are observations about the corpus, not arguments about the world.

§ 01

The four-surface architecture, named in none of the essays that build it

§ 02

The hidden master rule: engineering against silent failure

§ 03

Greedy ratchets, not search: the dominant unit of progress

§ 04

Context windows are advertised as RAM and behave like fragile working memory

§ 05

HTML, CLAUDE.md, and a public Slack channel are the same primitive

§ 06

The library is a CLAUDE.md for itself

§ 07

What the corpus is conspicuously quiet about

§ 01 · the unnamed pattern

Every long-running agent system in the corpus is a four-surface machine.

Read Karpathy's autoresearch and Hayduk's Codex goal mode back to back and the parallel is so clean that the codex-goals essay marks it explicitly: PLAN.md is program.md, EXPERIMENTS.md is results.tsv. What that essay doesn't do, and what no piece in the library does, is notice that the same shape is also Mnimiy's CLAUDE.md spec, the project instructions for "Format as html" itself, and Tobi Lütke's account of River-in-Slack. Five very different artifacts, one architecture.

The architecture has four roles. Instruction is what the human writes to specify intent; it changes slowly and is reviewed deliberately. Work is the artifact the agent actually modifies; it changes fast and is reviewed by diff. Ledger is the durable record of what was tried and what happened, kept separate from work so failed attempts aren't lost when the branch resets. Substrate is the part nobody is allowed to edit: the fixed evaluation, the test suite, the codebase the PR opens against, the company's actual production data. It is what reality looks like from inside the loop.

The point is not that the roles are profound. The point is that they are stable across very different problems, and that the systems that work tend to keep all four separated. The ones that fail tend to collapse two or more into a single file.

The four surfaces · mapped across the corpus 5 systems · 4 roles

System	Instruction	Work	Ledger	Substrate
Autoresearch Karpathy	program.md marching orders, ~114 lines, human-edited	train.py ~630 lines, agent edits freely	results.tsv + git branch every attempt logged; commits = ratified moves	prepare.py + eval set immutable; defines what "better" means
Codex goal mode Hayduk	PLAN.md human-seeded; agent revises as it learns	EXPERIMENT_NOTES.md live scratchpad as the loop runs	EXPERIMENTS.md curated ledger, "the file that survives"	the measurable goal implicit; "set up a clear, measurable goal"
Claude Code Karpathy 4 · Mnimiy 12	CLAUDE.md behavioral contract; failure modes & rules	the codebase files the agent reads & edits	git history commits + PR descriptions; what shipped	tests + production "tests verify intent": Mnimiy rule 9
"Format as html" these project instructions	project instructions ~430 words; the behavior contract	the current chat where new artifacts are produced	project knowledge the curated library; the only cross-chat state	the visual register implicit; "match the essays already there"
River @ Shopify Lütke	channel asks @-mention prompts in public Slack	pull requests 1,870/week, in the open	channel history searchable; "the Lehrwerkstatt"	the monorepo + prod the company's actual systems

Five systems, four roles, one architecture. The most leveraged single file in each row is the ledger, not the instruction: the essays that flag this explicitly (Hayduk, Karpathy on results.tsv) put a star on it.

Three corollaries fall out of the table that none of the source essays state.

First, the ledger is more durable than the work. A coding agent that ships a PR and discards its scratchpad has thrown away nine-tenths of what it learned. The systems in the corpus that survive long runs are the ones where the ledger is a separate, append-only file, Hayduk literally calls EXPERIMENTS.md "the file that survives long after a session's context window is gone." When CLAUDE.md essays argue you should "write more atomic commits," they are arguing for the same thing in different vocabulary: make git the ledger.

Second, the substrate is the one place agents fail by editing. Mnimiy's rule 16 ("touch only what was asked") and the autoresearch decision to ship prepare.py as read-only are the same move: the world the agent acts in cannot be modifiable by the agent. The pattern fails the moment an agent edits its own evaluation, its own tests, or, in River's case, its own merge criteria.

Third, the instruction surface is small. program.md is ~114 lines. The "Format as html" project instructions are ~430 words. The four Karpathy rules fit on a postcard. Mnimiy's evidence-backed claim is that past 200 lines, CLAUDE.md compliance falls off a cliff. The behavioral contract is small by construction; this is not aesthetic preference but a load-bearing architectural choice.

The human iterates on the prompt; the agent iterates on the training code. Karpathy · X · 9 March 2026: the line that names two of the four surfaces but not the other two

§ 02 · convergence under different names

Almost every essay in the corpus is, at root, about engineering against silent failure.

Mnimiy's twelve-rule CLAUDE.md reaches its hardest rule at the end: Fail loud. "The most expensive Claude failures are the ones that look like success. A function 'works' but returns wrong data. A migration 'completes' but skipped 30 records." It's framed as a coding-agent concern. But once you've seen it phrased that way, it starts appearing everywhere, and not just in the CLAUDE.md essays.

The pattern below is striking enough to make the case that this is the corpus's single most repeated thesis, even though no piece argues for it as such. The evidence:

Convergent evidence across the corpus

Mnimiy
claude-md.html

Rule 12: fail loud. "Completed is wrong if anything was skipped silently. Default to surfacing uncertainty, not hiding it."
Karpathy 4
karpathy-skills.html

Rule 1, on Karpathy's reading: agents "don't manage their confusion, don't seek clarifications, don't surface inconsistencies." The closing contract is "state what you're assuming."
Kopadze 21
21-claude-md-rules.html

Rule 3: never fill knowledge gaps with plausible-sounding information. "I'm not certain about this is always better than presenting a guess as a fact."
Karpathy
autoresearch.html

Every attempt, including crashes: gets a row in results.tsv. The file is append-only and deliberately untracked by git so silence cannot creep in via reset.
Hayduk
codex-goals.html

The ledger "never lies about the past." Every experiment gets a verdict, kept, reverted, promoted. The schema is "forgiving" except on the question of whether a row exists.
Moonshot AI
kimi-prompting-best-practices.html

The grounded-answer prompt's most important word is the fallback: "If the answer is not found in the article, write 'I can't find the answer.'" Explicit fail-loud-on-missing-source.
Lütke
learning-on-the-shop-floor.html

River refuses to respond to DMs. Every interaction is structurally public. Private failure modes are made architecturally impossible.
these instructions
format-as-html-instructions.html

One of the four explicit rules: "Fail loud on missing context. If you would need to invent project conventions to answer, say so and ask, rather than confidently making them up."

What makes this striking is not that the rule appears in eight essays. It's that the rule appears under eight different vocabularies, in eight different domains, framed as eight different failure modes, and no essay reaches across to the others to notice the pattern. Coding-agent writers think they're solving a coding-agent problem. RAG-prompt writers think they're solving a hallucination problem. Lütke thinks he's solving an organizational-learning problem. They are all solving the same problem.

The deeper claim under all of this, and this is the part I think is worth holding, is that the visible artifact is the safety mechanism. HTML is loud markdown. CLAUDE.md is loud preferences. A public Slack channel is a loud private DM. results.tsv is a loud git history. The library is, at root, a sustained argument that the way you make AI work reliably is to leave it nowhere to hide.

§ 03 · the dominant unit of progress

The corpus's preferred workflow is a greedy ratchet
not planning, not search.

Karpathy's autoresearch loop is explicit about being pure greedy hill-climbing, no simulated annealing, no basin-hopping, no rollback to explore around a plateau. program.md tells the agent to rewind from kept commits "very very sparingly (if ever)." The bet, in his framing, is that the search surface around a decent baseline has enough small, additive gains that greedy ascent finds useful directions before it stalls.

What you don't notice until you read across is that the same shape is the recommended workflow in essays that have nothing to do with ML research.

Fig. 03 · The shape that repeats kept commits / discarded forks

The same shape: autoresearch's staircase · Hayduk's EXPERIMENTS.md verdicts · Garry Tan's prompt-reversal · Lütke's 36% → 77% merge-rate climb

Compare the systems:

The ratchet, in five domains

Autoresearch
ML training

Edit, train, grep, decide. Pure greedy ascent on val_bpb. Kept commits advance the branch; everything else is reset away.
Codex /goal
arbitrary goals

"Clear measurable goal + tight feedback loop + markdown files to think in." The ledger records the verdict on every attempt.
Garry Tan
prompt engineering

Prompt-reversal: iterate until the output is right, then ask the model to write the single prompt that would have produced the final turn in one shot. The artifact you keep is the prompt.
Kimi K2.6
code generation

"15 tok/s becomes 193 tok/s. Not in one shot. In 14 loops.: " Adversarial pressure each round; only verified improvements survive.
River @ Shopify
org behaviour

36% → 77% merge rate over two months with no model swap, no retrain. What changed was the prompting pattern, ratcheted in public.

The shared structure is: greedy local search, a strict ledger of attempts, and a single measurable metric that decides each step. The metric varies, validation loss, response quality, merge rate, prompt effectiveness, but the loop is identical. Nothing in the corpus reaches for a planner. Nothing reaches for tree search. Nothing reaches for a critic-actor split. The dominant unit of progress is the smallest possible loop, run as fast as the feedback signal allows.

The bet under this is interesting and not, to my eye, examined anywhere in the corpus: the search surface near a decent baseline is dense in small, additive improvements. If that's true, you don't need exploration mechanisms; you just need a fast loop with an honest ledger. If it's false, if the gains require leaving a local optimum, every system in the library is undertooled. Karpathy's autoresearch essay flags this as Limitation F-02. Nobody else in the corpus addresses it.

§ 04 · the unstated empirical claim

Context windows are advertised as RAM and behave like fragile working memory. All the practical advice is recovery patterns for this fact.

Karpathy's Sequoia talk gives the cleanest metaphor in the library: weights as CPU, context window as RAM, the classical computer reduced to a peripheral. It's a useful picture. It's also wrong in the specific way that matters most.

If the context window were really RAM, you could trust what's in it. You can't. The corpus, read across, is candid about this without ever saying it directly:

Five places the corpus admits context is unreliable

Mnimiy
claude-md.html

"Past 200 lines, compliance drops sharply because important rules get buried in the noise." Compliance is ~80% even on a clean CLAUDE.md.
Karpathy
autoresearch.html

Parenthetical warning, all-caps in program.md: don't pipe with tee. "Verbose log output flooding the agent's context window is one of the dominant failure modes of long autonomous runs."
Hayduk
codex-goals.html

The entire premise of the three-markdown-files architecture: "Even with good compaction, no context window holds a coherent thread that long. The fix is to put the thread on disk."
Anthropic
claude_prompting-best-practices.html

Long-context layout: long docs at the top, query at the end (~30% improvement). XML wrapping. Quote-grounding as a pre-step. None of this would matter if the window were uniform.
Moonshot AI
kimi-prompting-best-practices.html

Recursive summarization is presented as the only sane technique for long documents. The implicit admission: a single pass over the whole window loses the thread.

The unstated empirical claim under all of this: the effective context window is much smaller than the advertised context window, and degrades non-uniformly as it fills. Rules buried in the middle get lost. Documents in the wrong position bias attention. Logs flood out instructions. Recursive summarization is necessary because a single-pass summary of a long document does not actually have the whole document available.

Two consequences fall out of this that I haven't seen named in the corpus.

First, every long-running agent system in the library aggressively externalizes state to disk. Hayduk says it most directly: "put the thread on disk." But autoresearch does it (TSV + git branch + scratchpad as three substrates), Claude Code does it (CLAUDE.md + memory files), and Lütke's River does it (Slack channel history as durable, searchable memory). The architectural move is the recovery pattern, and the four-surface architecture from §01 is partly an explanation of why: each surface is a different externalization of state that the context window can't reliably hold.

Second, the value of HTML-as-artifact may be a context-window phenomenon as much as a usability phenomenon. Thariq's essay argues HTML wins on spatial layout, color, real diagrams, and round-trip editing. All true. But there's a subtler win the essay doesn't claim: an HTML artifact is a state externalization. The information lives on disk, addressable, inspectable, and doesn't have to be rebuilt from context next time. A markdown chat reply lives in the chat; an HTML artifact lives in the filesystem. The longer the work, the bigger the difference.

§ 05 · the costume change

HTML, `CLAUDE.md`, and a public Slack channel are the same primitive.

This is the crossing I'm least sure about and most enjoy. Three things the corpus treats as different categories: an HTML artifact, a CLAUDE.md file, and Tobi Lütke's insistence that River works only in public, turn out, when you list their properties, to be the same primitive playing different roles.

Costume 01

HTML artifact.

Persistent: lives on disk after the chat ends.
Addressable: has a filename, can be linked to.
Inspectable: view source; the structure is the artifact.
Compounds: opens correctly tomorrow without re-running the agent.
Loud: broken diagrams, broken layout, broken logic are visible.

Costume 02

`CLAUDE.md` / spec.

Persistent: committed to the repo; outlives any session.
Addressable: same path every time; the agent finds it on its own.
Inspectable: diff in PR review; every rule on the page.
Compounds: every team's accumulated taste flows in.
Loud: a missing rule is the failure mode that closed it.

Costume 03

Public channel.

Persistent: channel history is durable, searchable.
Addressable: link to message; permalink survives.
Inspectable: anyone in the company can read the thread.
Compounds: "the next person with the same question doesn't have to ask."
Loud: DMs are explicitly refused; failure has nowhere to hide.

The same five properties: persistent, addressable, inspectable, compounding, loud: show up in all three. That's not coincidence. Every essay in the corpus that argues for one of these three is, structurally, making the same argument: the artifact that survives the session is the thing that lets the work compound. Thariq doesn't quite say this; he says HTML is a richer canvas. Mnimiy doesn't quite say this; he says CLAUDE.md is a behavioral contract. Lütke doesn't quite say this; he says public is faster than private. They're describing one property of the same primitive from three different windows.

The implication, if you take it seriously: the unit of leverage in this generation of agent work is not the prompt and not the model, it is the durable artifact the agent leaves behind. Garry Tan's "fat skills, fat code, thin harness" gets close. But Tan is talking about the engineering scaffolding around the model. The point goes further: the output is also part of the scaffolding, because every output is a candidate to become an input next time. An HTML artifact is a CLAUDE.md for a one-off interaction. A CLAUDE.md is a public Slack channel for a private project. River-in-Slack is HTML-as-format scaled to an organization.

§ 06 · the self-referential turn

The library is, structurally, a CLAUDE.md for itself.

This one is small and a little odd. The "Format as html" project instructions: the artifact that produced these very pages, are written in exactly the shape Mnimiy describes for CLAUDE.md: named failure modes, contracts that close them, XML tags as section markers, positive framing of rules, a budget under 500 words. The instructions explicitly cite claude-md.html as their template, and explicitly note which rules don't transfer (no codebase, no tests, no commit boundary).

What's worth noticing is the loop. The library contains an essay about CLAUDE.md (Mnimiy). That essay shapes the project's own behavioral contract. The behavioral contract produces new essays in the library. The new essays cite the old ones, including Mnimiy. The library is its own substrate.

This is not unique to "Format as html", it's the same self-referential turn Lütke describes when he says River gets better at being Shopify because the channel history accumulates Shopify's taste. It's the same turn Garry Tan describes when his book-mirror skill v3 "cites brain pages from my meeting notes": the system learning from its own past output. Compounding, when you actually look at what's compounding, turns out to be the artifact-stream eating itself in a useful direction.

The interesting question this raises and the corpus does not answer: how do you tell when the loop is compounding usefully versus calcifying? Mnimiy's analogy in claude-md.html is excavators: you learn the machine's quirks and route around them. But a CLAUDE.md that fixes January 2026 failure modes can become a liability against May 2026 failure modes, and the essay itself flags this as the reason the original Karpathy 4 needed eight more rules. A self-referential library is not automatically a self-improving one. The corpus shows the structure; what it doesn't show is the maintenance schedule.

§ 07 · the negative space

What the corpus is conspicuously quiet about.

The first six crossings are about what the essays share. This one is about what they don't say, gaps consistent enough across sixteen essays to be a finding in their own right. If you only read the library, here is what you'd never quite encounter.

Negative-space findings

Four things the corpus, taken as a whole, will not tell you.

None of these are individual failings of the essays. Most are reasonable scope decisions for the piece in question. They are interesting only at the corpus level, as a shared blind spot of a tribe.

Cost. Autoresearch reports "~100 experiments overnight." Hayduk reports goal mode running for "hours or even days." Nobody in the library prices an H100-hour, an API call, or a Cursor seat. The library is unpriced; the workflows are not.
Quantitative eval of the fixes. Mnimiy's "41% → 3% mistake rate" is the only sustained numerical claim about a CLAUDE.md change, and it's self-reported across 30 codebases. Karpathy 4 is endorsed by 120,000 GitHub stars, which is popularity, not efficacy. The corpus argues from intuition and from the moment ("the moment Claude introduced React hooks into a class-component codebase"), not from controlled measurement.
Disagreement. Sixteen essays, almost zero argument between them. Mnimiy "extends" Karpathy. Kopadze "ports" the pattern. Hayduk "parallels" autoresearch. The codex-goals essay even has a section called "Sits directly next to autoresearch in this library." There is no contrarian piece, no essay arguing markdown is fine, no piece arguing CLAUDE.md files are a dead end, no piece pushing back on greedy ratchets. The library is a coherent tribe; that's worth knowing about it.
Failure cases of the corpus's own techniques. The CLAUDE.md essays have lots of failure modes for code without CLAUDE.md. They have very few failure modes for CLAUDE.md itself. When does a four-surface architecture break down? When is a public-only Slack bot the wrong call? When does the ratchet stall, and what do you do then? Karpathy's autoresearch essay is the only one that lists its own limitations honestly (F-01 through F-04). Most of the rest treat the technique they're describing as load-bearing.

The library is a tribe, useful and worth knowing. It has converged on a real and coherent set of practices for 2026 agent work, and the convergence is itself the signal: when sixteen writers in different domains arrive at the same primitives under different names, they are tracking something real. But it is a tribe, with the corresponding blind spots. A reading list that included one good dissenting voice would be sturdier.

Seven crossings.