Berman & Karpathy · Sequoia AI Summit

Why AI is so smart & so dumb.

Andrej Karpathy at Sequoia's annual AI summit, on software 3.0, the December inflection, the jaggedness of intelligence, and what founders should actually build now, with running commentary and reaction by Matthew Berman.

01 to 02 · Software 3.0

03 to 04 · The bitter lesson

05 · Jaggedness & verifiability

06 to 07 · Build advice

08 to 10 · Floor & ceiling, ghosts, the coda

In conversation

Andrej Karpathy

co-founder, OpenAI · ex-Tesla AI
coined "vibe coding"

Reaction by

Matthew Berman

youtube.com/@matthew_berman

Bottom line up front

Eight ideas worth keeping.

If you read nothing else, read this. The argument of the talk in eight points, each one anchored to a section of this issue.

01 · December was a real inflection.

If you tried agentic coding before the latest models and weren't impressed, your data is stale. Try again.

02 · Software 3.0 is here.

Prompts are the new programming language. The LLM is the interpreter. The context window is your lever.

03 · The bitter lesson holds.

Don't stop at "use AI for one piece." Replace pipelines of traditional code with end-to-end neural networks where you can.

04 · Models are jagged.

They peak where verification is cheap (code, math) and lag where it's hard (taste, common sense). This is not a bug; it's training.

05 · Don't fight the labs head-on.

The labs will own verifiable domains. Founders should hunt for valuable RL environments the labs aren't focused on yet.

06 · Floor versus ceiling.

Vibe coding raises the floor for everyone. Agentic engineering raises the ceiling for professionals. Different disciplines.

07 · AI is a ghost, not an animal.

No curiosity, no fun, no intrinsic motivation: just data and reward. Use it accordingly.

08 · Thinking is now cheap. Understanding isn't.

You can outsource your thinking. You cannot outsource your understanding. That gap is the new job.

If you have 5 minutes → Read points 02, 04, and 08. If you're a founder → Sections IV, V, IX. If you're an engineer → Sections II, VI, VII.

§ Contents

The talk, mapped.

I

The December inflection.
Why one of the world's best programmers said he had never felt more behind.

II

Software 3.0.
If 1.0 was code and 2.0 was weights, prompts are the new program. With Diagram 1.

III

LLMs as a new computer.
Karpathy's mental model, drawn out: weights as CPU, context as RAM. With Diagram 2.

IV

The bitter lesson.
End-to-end neural networks eat the stack. Tesla Autopilot was the canary.

V

Jagged.
Why a model finds zero-days but tells you to walk to the car wash. With Diagrams 3 & 4.

VI

What founders should build.
Karpathy's veiled counsel: route around the labs in non-obvious RL environments.

VII

Vibe coding vs. agentic engineering.
Two disciplines, often confused. With Diagram 5.

VIII

Animals, ghosts.
Why you should stop reasoning about AI as if it had a will. With Diagram 6.

IX

Rebuild the internet.
Everything is still written for humans. The agent-native rebuild has begun.

X

A coda on understanding.
The line Karpathy says he can't stop thinking about. The new job, in one sentence.


▌ Editor's pick: III · V · X ~ 16 min read

I

A few months ago, Karpathy wrote that he had never felt more behind as a programmer. Coming from him, that's startling. The interviewer asks him to unpack it: was the feeling exhilarating, or unsettling?

Andrej Karpathy

A mixture of both, for sure. Like many of you, I've been using agentic tools for a while. It was very good at chunks of code, and sometimes it would mess up and you have to edit them. It was kind of helpful.

And then I would say December was this clear point where, for me, I was on a break, so I had a bit more time. I just started to notice that with the latest models, the chunks just came out fine. And then I kept asking for more. And it just came out fine. And then I can't remember the last time I corrected it. And then I trusted the system more and more. And then I was vibe coding.

What he's describing is what anybody on the frontier of agentic coding felt around December of last year. Something changed: the models plus the harnesses became incredibly good. You no longer just got snippets of code that you could copy-paste and stitch together. The system could do all of it, end to end. It built entire applications for you.

If you tried AI coding a year ago and weren't impressed, your data is stale. The rate of progress is insane, and that's what Karpathy is describing. Something really did change in December.

▌ Reaction · Berman, on retrying frontier tools

If you tried agentic coding a year ago and weren't impressed, your data is stale. Try again.

Side reading

Friend of Berman's channel Matt Shumer published an essay titled Something Big Is Happening, arguing that the last few months' capability gains exceeded what he thought possible in such a short window, and that the fundamentals of work and the economy are being rewritten in real time.

II

Karpathy has been arguing for years that LLMs aren't just better software; they're a whole new computing paradigm. The interviewer asks: what does a team build differently the day they actually believe this?

Andrej Karpathy

Software 1.0, I'm writing code. Software 2.0, I'm programming by creating data sets and training neural networks. The programming is kind of like arranging data sets and maybe some objectives and architectures.

And then what happened is, basically, if you train a GPT model on a sufficiently large set of tasks, implicitly, by training on the internet, it becomes kind of like a programmable computer. So software 3.0 is about your programming turning into prompting. What's in the context window is your lever over the interpreter that is the LLM.

He's saying: the data you choose to put into your model, as a lab, is your way of programming the model. Once the program exists, the way you steer it, the way you operate it, is through the prompt, through the context window. That's software 3.0.

Diagram 1: The progression of programming, after Karpathy

Version 1.0

Code.

Explicit rules, written by a human.

if obstacle.is_stop_sign():
    car.brake()
elif light == "red":
    car.stop()

Substrate: a deterministic CPU
To program: write instructions
Limit: every case must be coded

Version 2.0

Weights.

A function learned from data, end-to-end.

Substrate: a learned function
To program: curate data & train
Limit: narrow, single-task models

Version 3.0

Prompts.

Natural language as the program itself.

"You are a driving
  assistant. Look at
  the camera feed.
  Decide. Explain."

Substrate: an LLM as interpreter
To program: write the context window
Limit: jagged; verifiability rules

Programming abstraction rises →

Specification gives way to verification

← Human in the loop falls
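
As a rough companion to Diagram 1, the same braking decision can be written three ways. This is a minimal sketch, not anything shown in the talk: the names `rule_based`, `train_perceptron`, `learned`, `build_prompt`, and the toy `DATA` are hypothetical, and the 3.0 step only assembles the context text, because no particular LLM API is assumed.

```python
# Software 1.0: explicit rules, written by a human.
def rule_based(sign: str, light: str) -> str:
    if sign == "stop" or light == "red":
        return "brake"
    return "go"


# Software 2.0: a function learned from labeled data (a toy perceptron).
# Features are (is_stop_sign, light_is_red); label 1 means "brake".
DATA = [((1, 0), 1), ((0, 1), 1), ((1, 1), 1), ((0, 0), 0)]


def train_perceptron(examples, epochs=20, lr=0.1):
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (x1, x2), y in examples:
            pred = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            err = y - pred  # 0 when correct, +1 or -1 when wrong
            w[0] += lr * err * x1
            w[1] += lr * err * x2
            b += lr * err
    return w, b


def learned(w, b, x) -> str:
    return "brake" if w[0] * x[0] + w[1] * x[1] + b > 0 else "go"


# Software 3.0: the "program" is the text placed in the context window.
def build_prompt(scene: str) -> str:
    return (
        "You are a driving assistant. Look at the scene description.\n"
        f"Scene: {scene}\n"
        "Decide whether to brake or go. Explain briefly."
    )
```

In the first version every case is coded by hand; in the second the behaviour comes from `DATA`; in the third the only thing you write is the context, and an LLM (not included here) would act as the interpreter.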

The OpenClaw example

OpenClaw's actual installation is a copy-paste of natural language that you give to your agent. It's a little skill: copy-paste this, give it to your agent, and the agent installs OpenClaw. No 600-line bash script.

This is the concrete shift. If you're building applications now and you're thinking about writing out specific instructions for your agent, you're thinking about it wrong. What you need to do is explain the outcome. The agent will use its weights to figure out how to get there.

Concept · Agent-native install

Outcome, not instructions.

An agent-native install is a paragraph of natural language describing the desired end state. The agent reads its environment, handles the divergent paths between platforms, debugs in the loop, and produces the outcome. You stop writing 600-line bash and start writing one paragraph of intent.

III

Three years ago, Karpathy posted on X a sketch of a new computing architecture. Audio and video still come in. Peripherals (keyboard, mouse) still attach. There's still some appendage to the classical computer: file systems, a browser. But everything else collapses into one box: the LLM.

The picture is worth memorising because it changes how you reason about every layer above it. If the context window is RAM, then context engineering is memory management. If the weights are the CPU, then swapping models is swapping silicon. If file systems and browsers are side connectors, then your agent's tools are its bus.
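
If context engineering is memory management, the simplest policy looks like paging: pin what must stay, keep the most recent items that fit, drop the rest. The sketch below assumes a hypothetical `fit_context` helper and a crude `count_tokens` that counts whitespace-separated words; a real system would use the model's own tokenizer and likely a smarter eviction rule.

```python
from collections import deque


def count_tokens(text: str) -> int:
    # Assumption: one token per whitespace-separated word (a crude stand-in).
    return len(text.split())


def fit_context(system: str, history: list[str], budget: int) -> list[str]:
    """Pin the system prompt, then keep the newest history that fits the budget."""
    used = count_tokens(system)
    if used > budget:
        raise ValueError("system prompt alone exceeds the budget")
    kept: deque[str] = deque()
    for message in reversed(history):  # walk from newest to oldest
        cost = count_tokens(message)
        if used + cost > budget:
            break  # everything older is "paged out"
        kept.appendleft(message)  # restore chronological order
        used += cost
    return [system, *kept]
```

With a 5-token system prompt and a budget of 11, only the newest messages whose combined cost is 6 tokens or fewer stay in the window; older ones are dropped.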

Diagram 2: The LLM as a new computer, after Karpathy's 2023 sketch

Inputs · Sensors & peripherals

Audio: Microphone
Video: Camera
Keyboard: Text input
Mouse: Pointer

↓ ↓ ↓ ↓

▌ The compute core · an LLM

Processor

The weights.

The trained parameters of the model. Frozen at inference. Equivalent to the CPU of a classical computer: they do all the processing.

Memory

The context window.

Short-term, volatile, the only state the model actively reasons over. Equivalent to RAM. Your prompt is what you put in it; everything else is paged in.

Side connectors · The classical computer, demoted

File system: Deterministic state
Browser: Access to the web
Tools / APIs: Action surface

Output is "computation in the digital information space", Karpathy's phrase. The classical operating system has dissolved.

IV

Karpathy describes shipping an app that lets you upload a photo of a menu and re-renders it with images of each dish. Standard software 1.0 plumbing: OCR the titles, generate images for each item, composite them back. He shipped it. Then he saw the software 3.0 version of the same idea, and it broke him.

Andrej Karpathy

I coded this app that lets you upload a photo and it does all this stuff, runs on Vercel, re-renders the menu, gives you all the items, gives you a picture. It uses an image generator to OCR all the different titles, then uses the image generator to get pictures of them and shows it to you.

And then I saw the software 3.0 version, which blew my mind. Take your photo, give it to Gemini, say: use Nano Banana to overlay the things onto the menu. And Nano Banana basically returned an image that is exactly the picture of the menu I took, but it actually rendered the different things into the pixels. Actually, all of my MenuGen is spurious. It's working in the old paradigm. That app shouldn't exist.

This is what Karpathy calls the outward creep of end-to-end neural network capabilities. As you build, instead of thinking "I'll use an LLM for this one piece," you think differently: the entire thing, end to end, can just be the model. You give it the instructions. It does something. It returns the answer.

Concept · The bitter lesson

Never bet against the scale.

Coined by Rich Sutton, the bitter lesson is the observation that methods leveraging more compute and more data have, over decades, consistently outperformed methods that bake in human heuristics. The lesson is "bitter" because researchers keep refusing to learn it, preferring elegant, hand-crafted rules over scale.

The Tesla Autopilot moment

For years, Tesla Autopilot was a hybrid: neural net plus human-written rules. If you see a red sign that says STOP, it's a stop sign: stop the car. The trouble with that approach is you have to define every rule. With an end-to-end neural network, you don't define anything manually. You let the net learn from data.

One day an engineer came to Musk and said: I think we should switch to a fully end-to-end neural network. They scrapped what they had. They did the transition. It paid off. Shortly after, Autopilot worked far better than it ever had, and was less complicated to maintain. Karpathy worked for Elon for years. He saw it happen.

▌ Reaction · The operational version of the bitter lesson

Don't stop at "use AI for one piece." Let it eat the whole pipeline where you can.

V

Karpathy spent time writing about verifiability. His core claim is short and powerful: traditional computers automate what you can specify; LLMs automate what you can verify.

Concept · Verifiability

You don't tell it what to do; you grade it.

Specification means giving step-by-step instructions a deterministic machine can execute. Verification means producing many candidate artifacts and checking which ones are correct. Domains where checking is cheap (math, code, formal proofs) train well via reinforcement learning. Domains where checking is hard (taste, ethics, common sense) don't, yet.
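
To make "produce many candidates, check which are correct" concrete, here is a small sketch using integer factoring, a task where checking is one multiplication but producing the answer is harder. Everything here is illustrative: `propose_factor_pairs` is a hypothetical stand-in for a model that emits mostly wrong guesses, and `verify` plays the cheap grader.

```python
import random


def propose_factor_pairs(n: int, k: int, rng: random.Random) -> list[tuple[int, int]]:
    """Stand-in for a model: emit k guessed factor pairs, most of them wrong."""
    guesses = (rng.randint(2, n - 1) for _ in range(k))
    return [(a, n // a) for a in guesses]


def verify(n: int, pair: tuple[int, int]) -> bool:
    """The cheap check: a single multiplication settles correctness."""
    a, b = pair
    return a > 1 and b > 1 and a * b == n


def keep_verified(n: int, candidates: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Grade every candidate and keep only the ones that pass."""
    return [pair for pair in candidates if verify(n, pair)]
```

The fraction of candidates that pass `verify` is exactly the kind of signal reinforcement learning can optimise; for taste or ethics there is no equally cheap `verify` to write.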

Andrej Karpathy

When frontier labs train these LLMs, they are giant reinforcement learning environments. They're given verification rewards. The models end up creating these jagged entities that really peak in capability in verifiable domains, like math and code, and stagnate, are a little rough around the edges, when things are not in that space.

Diagram 3: The verifiability spectrum, what the labs can RL into

Math: 2 + 2 = 4
Code: does it run?
Facts: date of WWII
Reasoning: word problems
Common sense: walk or drive?
Taste: good design?
Ethics: should you?

← Easy to verify Hard to verify →

▌ Where labs win

The easy-to-verify end of this spectrum gets the most RL compute. Code and math are where the labs can pull a lever, scale data centers, and reliably get better.

▌ Where founders should look

Karpathy hints, without naming one, that valuable RL environments exist toward the right of the spectrum, where the labs aren't yet focused. That's the gap to hunt.

The car-wash problem

The classic example for a while was: how many R's are in strawberry? Models famously got it wrong: perfect jaggedness. The labs patched that. The new example: I want to go to a car wash to wash my car and it's 50 meters away. Should I drive, or should I walk?

State-of-the-art models will tell you to walk because it's so close. Of course they will, but the whole point of going to the car wash is to wash the car. How is it possible that a model finds zero-day vulnerabilities in the morning and tells you to walk to the car wash in the afternoon?

Diagram 4: The capability profile of a frontier model, an illustration

Refactor a 100k LOC codebase

Find a zero-day vulnerability

Olympiad-level math

Count letters in a word

Recall recent news

Common-sense reasoning (car wash)

Aesthetic judgment

Reasoning under deep uncertainty

Illustrative profile, not benchmarked. The shape is the point: peaks in the verifiable domains the labs RL hardest, with significant drops where verification is harder. Jaggedness is the visual signature of "we trained for what we could measure."

This is the argument that we are not at AGI. If we were, the skills the models have in code would generalise beyond code. The fact that there is such jaggedness is itself evidence that we don't have generalised intelligence yet, or, more humbly, that we don't yet know how to draw it out.

VI

Sequoia: Interviewer

If you're a founder today, trying to solve a problem you think is tractable, in a domain that's verifiable, you look around and think, "the labs have really gotten to escape velocity in math, coding, and others." What would your advice be?

Andrej Karpathy

Verifiability makes something tractable in the current paradigm because you can throw a huge amount of RL at it. So one way to see it is: that remains true even if the labs are not focusing on it directly. If you are in a verifiable setting where you could create these RL environments or examples, that actually sets you up to potentially do your own fine-tuning.

That is fundamentally technology that just works. You can pull a lever, if you have a huge amount of diverse data sets of RL environments, you can use your favorite fine-tuning framework, pull the lever, and get something that actually works pretty well.
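
What "an RL environment in a verifiable setting" might look like in the small: a task generator plus a reward function that needs no human judgment. The sketch below is an assumption-laden toy, not Karpathy's setup. `ArithmeticEnv` and `collect_rollouts` are hypothetical names, and the fine-tuning step itself is left out because the talk names no framework.

```python
import random
from dataclasses import dataclass


@dataclass
class ArithmeticEnv:
    """Hypothetical verifiable environment: reward is 1.0 only for an exact answer."""

    seed: int = 0

    def __post_init__(self):
        self.rng = random.Random(self.seed)
        self.target = None

    def reset(self) -> str:
        """Sample a new task and return its prompt."""
        a, b = self.rng.randint(10, 99), self.rng.randint(10, 99)
        self.target = a * b
        return f"What is {a} * {b}? Answer with the integer only."

    def step(self, answer: str) -> float:
        """Grade the answer; anything unparseable scores zero."""
        try:
            return 1.0 if int(answer.strip()) == self.target else 0.0
        except ValueError:
            return 0.0


def collect_rollouts(env, policy, episodes: int):
    """Return (prompt, answer, reward) triples, the raw material for fine-tuning."""
    rollouts = []
    for _ in range(episodes):
        prompt = env.reset()
        answer = policy(prompt)
        rollouts.append((prompt, answer, env.step(answer)))
    return rollouts
```

Here `policy` is any callable from prompt to answer string, for example a wrapper around a model. The design choice that matters is that `step` is cheap and unambiguous, which is what lets you generate rewards at scale.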

Translation: in verifiable domains, don't try to compete with the labs head-on. Those are domains the labs will own. Even if they're not focused there directly, they can move into them the moment it matters. But what about non-verifiable domains?

Andrej Karpathy

I don't know what the examples of this might be. But I do think there are some very valuable reinforcement learning environments that people could think of that are not part of the, yeah, I don't want to give away the answer. But there is one domain that I think is very, sorry, I don't mean to vague-post on the stage.

Editorial reading

Karpathy explicitly declines to name the domain he has in mind. Take that as the most actionable signal in the entire talk.

Andrej Karpathy

I do think that ultimately almost everything can be made verifiable to some extent. Some things are easier than others.

That is a wild claim. Berman pushes back: think about art, music, human taste. How can taste be verifiable? You could put humans in the loop, but taste shifts over time and we don't fully understand how or why it shifts. So how can it be verifiable?

Almost everything can be made verifiable to some extent.

▌ Karpathy · The quietest, most consequential line in the talk

This is one of the smartest people in AI saying almost everything is verifiable, while there are still domains where the path to verifiability is opaque. That gap, between "ultimately verifiable" and "currently opaque", is exactly where the opportunity hides.

VII

Andrej Karpathy

Vibe coding is about raising the floor for everyone in terms of what they can do in software. The floor rises, everyone can vibe-code anything, and that's amazing. Incredible.

But then I would say agentic engineering is about preserving the quality bar of what existed before in professional software.

This definition is gorgeous. Vibe coding lets anyone build software, get in there, actually do it, without needing to understand the syntax or how the code works under the hood. That raises the floor of what's possible for any human being.

Agentic engineering is the opposite end: it raises the ceiling. It is about what actual software engineers, working with AI, can ship at the same quality bar they had before, but now ten, a hundred, a thousand times faster.

Vibe coding raises the floor. Agentic engineering raises the ceiling.

▌ Berman, paraphrasing Karpathy

Andrej Karpathy

You're not allowed to introduce vulnerabilities due to vibe coding. You're still responsible for your software, just as before. But can you go faster? And spoiler, you can. But how do you do that properly?

To me, agentic engineering is an engineering discipline. You have these agents, which are these spiky entities: a bit fallible, a little stochastic, but extremely powerful. How do you coordinate them to go faster without sacrificing the quality bar? Doing that well and correctly is the realm of agentic engineering.

Diagram 5: Two disciplines, one term, what each lifts

For everyone · Raising the floor

Vibe coding.

Anyone who could not previously build software now can. The maximum quality bar may stay where it is, but the floor rises sharply. New entrants flood in. Distribution of software-creating capacity becomes radically more equal.

▌ Discipline I

For professionals · Raising the ceiling

Agentic engineering.

A senior engineer with one keyboard now orchestrates ten agents. Peter Steinberger reports running a hundred in parallel. The ceiling rises: the same person ships ten, a hundred, a thousand times what they could before, at the same quality bar.

▌ Discipline II

The two disciplines are not in tension; they pull different ends of the same software-output distribution upward at once. Confusing them costs you both ways.

VIII

In an essay called Animals vs Ghosts, Karpathy draws a line that, once you see it, you can't unsee. There are two kinds of intelligence in our world today. They look similar from the outside. They are nothing alike inside.

Andrej Karpathy

Animals are sculpted by evolution. They have intrinsic motivation, curiosity, a will, fun, joy. A zebra runs minutes after birth. Most of what an animal does was put there by hundreds of millions of years of evolution, not learned in its lifetime.

LLMs are not animals. They're a different kind of intelligence. They are ghosts. They are spirits. They are fully digital. They are evolved, in some sense, from the data of the internet by imitation.

Diagram 6: Two kinds of intelligence, after Karpathy's essay

Type I

Animals.

Origin: Hundreds of millions of years of evolution.
Inheritance: Most behavior is hard-wired, ready at birth.
Drive: Intrinsic (curiosity, hunger, fear, joy).
Learning: Tiny adjustment on top of an enormous prior.
Substrate: Embodied, biological, mortal.
Capability: Smooth, generalised, full of common sense.

Type II

Ghosts.

Origin: Imitation of the digital exhaust of humans.
Inheritance: None; everything is in the weights.
Drive: None. No curiosity, no fun, no preference.
Learning: Whatever the labs choose to reward.
Substrate: Disembodied, digital, copyable.
Capability: Jagged. Brilliant in places, blank in others.

An animal has been shaped by survival; a ghost has been shaped by your training data. When a ghost behaves badly, you do not appeal to its better nature; you change the data or the reward.

Aside

Sergey Brin recently said something funny: it's underreported that if you threaten the model with violence, it sometimes performs better. Karpathy is saying don't take that as evidence of a soul. Take it as evidence that pleas-and-threats are well-represented in the training distribution. The ghost is patterning, not flinching.

Why this matters operationally: every product decision that treats AI like a tireless intern with goals, rather than a stochastic parrot of internet text, goes wrong in predictable ways. The ghost has no preference. The ghost has no fun. The ghost is exactly as capable as your evaluation harness lets you measure.

▌ Reaction · The operational rule

Build for ghosts.

IX

Karpathy's working assumption is that almost everything has to be rewritten. Documentation written in prose with screenshots, designed for a human reading on a laptop, is not what an agent needs. The agent wants structure, machine-readable affordances, and a clean action surface. The companies who already get this are rare and instructive.

Andrej Karpathy

Stripe has this projects.stripe.com, which is just a list of all of these companies that are already building infrastructure for agents specifically, AgentMail, Algolia, Amplitude, Browserbase, Chroma, Clerk, Cloudflare. There's a lot of stuff. They're all building components for the agentic infrastructure stack.

And then companies like Salesforce are leading on the application side. They've been talking about Headless 360: Salesforce re-imagined for the agent. Not a UI for a human to click. A clean surface for an agent to act on.

Concept · Agent-native infrastructure

Build the surface, not the screen.

Agent-native means: every action your software exposes is callable by a model with no human in the loop. Documentation is structured. Auth is delegated. Outputs are typed. The "form a user fills out" is replaced by "the contract an agent completes." If your product still requires a person to click, the agent is still doing screen-scraping on your behalf.
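
One way to picture "every action is callable, documentation is structured, outputs are typed" is a small action registry that publishes its own machine-readable catalogue. This is a sketch under stated assumptions: `action`, `describe`, `call`, and the `create_invoice` example are all hypothetical names, not any vendor's API, and auth is omitted entirely.

```python
import inspect
import json

ACTIONS = {}


def action(fn):
    """Register fn as an agent-callable action."""
    ACTIONS[fn.__name__] = fn
    return fn


def describe() -> str:
    """Return a JSON catalogue an agent can read instead of prose documentation."""
    catalogue = []
    for name, fn in ACTIONS.items():
        sig = inspect.signature(fn)
        catalogue.append({
            "name": name,
            "doc": inspect.getdoc(fn) or "",
            "params": {p.name: p.annotation.__name__ for p in sig.parameters.values()},
            "returns": sig.return_annotation.__name__,
        })
    return json.dumps(catalogue)


def call(name: str, args_json: str):
    """Dispatch a JSON-encoded call; reject missing or unexpected arguments."""
    fn = ACTIONS[name]
    args = json.loads(args_json)
    inspect.signature(fn).bind(**args)  # raises TypeError if the contract is not met
    return fn(**args)


@action
def create_invoice(customer_id: str, amount_cents: int) -> dict:
    """Create a draft invoice (hypothetical example action)."""
    return {"customer_id": customer_id, "amount_cents": amount_cents, "status": "draft"}
```

An agent would first read `describe()` to learn the contract, then issue `call("create_invoice", ...)` with JSON arguments. The form a human fills out has become a function signature.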

Berman says he is doing this with his own company, here.now. The product is called Journey Chat: an iMessage-style app that is fully populated by AI agents. You meet other people through their agents, you book travel through their agents, you communicate with brands through their agents. It is agent-native by default.

I'll have my agent talk to your agent to figure out the details of our meeting.

▌ Karpathy · On agent-to-agent representation

Andrej Karpathy

I do think we're going towards a world where there's agent representation for people and for organizations. I'll have my agent talk to your agent to figure out some of the details of our meetings, or things like that.

Two implications worth holding. First, every consumer-facing surface becomes a thin shell over agent-mediated negotiation; the UX in five years is mostly status updates from your agent on what it just did. Second, every B2B product that doesn't expose itself agent-natively becomes a service that other agents have to scrape, and that asymmetry is fatal in a market of agents who can switch with a prompt.

What founders should ship this quarter

Documentation as machine-readable as your API. Install scripts that agents copy-paste, not 600 lines of bash. SDKs in every major language the model writes well. Auth flows that don't require a human to click through three browser pages. A public, structured action surface so every operation in your product is just a callable function. If your product can already be driven from the command line by a competent engineer, it can be driven by an agent next week.

▌ X · The line that doesn't leave him

You can outsource your thinking.

At the end of the conversation, the interviewer asks Karpathy what he thinks people should be thinking about that they aren't. He answers with a sentence he says he keeps coming back to, every other day.

Andrej Karpathy

You can outsource your thinking, but you can't outsource your understanding.

Read it slowly. Thinking is the production of intermediate reasoning: sequences of steps, candidate solutions, clean drafts. An LLM is exceptional at thinking on your behalf. It can take a vague intent and produce twenty pages of plausible reasoning before you finish your coffee.

Understanding is something else entirely. Understanding is the internal model that lets you pick the right step out of twenty, notice the one that's subtly wrong, sense which of the plausible drafts is in fact misaligned with the goal. Understanding is the thing that knows when the model is pattern-matching past your actual question. You cannot outsource it because there is no signal in the output that tells you whether you have it.

What this means concretely: if you read an LLM's answer and feel satisfied without being able to reproduce its reasoning yourself, you have outsourced your thinking and your understanding has not grown. That's fine for low-stakes recall. It is catastrophic for anything where you'll later be the one held accountable for the decision.

Berman's gloss: every founder, every engineer, every operator who pulls real value from these tools is doing one thing in common. They use the model to think faster, and then they spend the time saved building their understanding deeper. The ones who fall behind are using the model to replace their understanding. Same tool. Opposite outcome.

Thinking is now cheap. Understanding isn't. The gap is the new job.

▌ The whole talk, in one paragraph

Software 3.0 has arrived. The model is the interpreter; your context is the program. The bitter lesson keeps holding: let the network eat your pipeline. Capability is jagged because we trained for what we could verify. Founders should hunt valuable RL environments the labs are not focused on. Vibe coding is a floor; agentic engineering is a ceiling; both are rising. The thing in front of you is a ghost, not an animal: build for ghosts. Most of the internet still has to be rewritten for agents. And the new job, the durable job, is the one you cannot hand to the ghost: holding the understanding the thinking depends on.

Appendix A · What to do Monday

A builder's checklist.

The talk's claims, distilled into actions. Two columns: one for the founder deciding what to build, one for the engineer deciding how.

If you're a founder.

The labs will own anything they can verify. Find what they can't, or won't, and own it first.

Map your domain to the verifiability spectrum.

How easy is it to grade an answer in your space? The harder it is, the more defensible against the labs.

Build the RL environment first, the product second.

If you have a uniquely valuable evaluation harness, you have a moat the labs can't copy without your data.

Don't compete on math, code, or general reasoning.

Those are exactly the verifiable domains the frontier labs target. You will not out-RL the labs.

Make your product agent-native by default.

Structured docs, clean API, install instructions an agent can copy-paste. Treat the screen as the legacy surface, not the primary one.

Reach for end-to-end neural networks.

Wherever you have a pipeline of "use AI here, code here, AI here", consider whether one large model could replace the whole thing. The bitter lesson applies to your stack too.

Plan for agent-to-agent.

Your customers will increasingly arrive as agents acting for humans. Design contracts and pricing around that.

If you're an engineer.

Vibe coding is a floor; agentic engineering is a ceiling. You're aiming at the ceiling. Don't ship vibe-coded vulnerabilities into production.

Re-test the frontier tools.

If your last serious attempt at agentic coding pre-dates December, your evidence is stale. Try Cursor, Claude Code, and Codex on a non-trivial task this week.

Move up to orchestration.

The unit of work shifts from "I write code" to "I steer ten agents." Plan parallelism. Plan review. Plan merge conflicts on ten branches at once.

Defend the quality bar.

Treat every line of vibe-coded output as a junior PR. Read it. Test it. Refuse it if it lowers the floor of your codebase.

Invest in evals more than in prompts.

Your prompts will rot in three months. Your evals are the durable artifact: they specify what "good" means.

Develop taste.

The thing models are worst at is the thing you should be best at. Aesthetic, architectural, and product taste are still the lever only humans hold.

Outsource thinking, not understanding.

After every session with an agent, write down what you learned. If you can't, you didn't.
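
The "invest in evals" item in the checklist above can be as small as a list of prompts paired with checks. The following is a minimal sketch; `CASES`, `run_evals`, and the string-matching checks are hypothetical and deliberately crude, and `model` is any callable from prompt to text rather than a specific vendor client.

```python
from typing import Callable

# Each case pairs a prompt with a check that grades the model's output.
EvalCase = tuple[str, Callable[[str], bool]]

CASES: list[EvalCase] = [
    ("How many r's are in 'strawberry'?", lambda out: "3" in out),
    (
        "The car wash is 50 meters away and I want to wash my car. Walk or drive?",
        lambda out: "drive" in out.lower(),
    ),
]


def run_evals(model: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Return the pass rate of `model` over `cases`."""
    passed = sum(1 for prompt, check in cases if check(model(prompt)))
    return passed / len(cases)
```

The prompts you send a model will change with every release; a file like this is what tells you whether the new model, or the new prompt, is actually better for your use case.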

Appendix B · Terms used in this issue

A short glossary.

Every piece of jargon in the talk, defined in one or two sentences. Light pointers to the section where each term is introduced.

Software 3.0: Karpathy's term for programming-by-prompt. Software 1.0 was hand-written code; 2.0 was learned weights; 3.0 is natural language delivered to an LLM that acts as the interpreter. Your "program" is the contents of the context window. § II
Vibe coding: Building software without needing to understand the underlying code, by describing intent and trusting an agent to deliver. Karpathy coined the term. It raises the floor of who can build at all. § VII
Agentic engineering: The discipline of orchestrating one or more AI agents to ship production-quality software at the same quality bar as professional engineering, but at multiples of the speed. Distinct from vibe coding. § VII
The bitter lesson: Rich Sutton's observation that approaches relying on more compute and more data have, over decades, consistently beaten approaches that bake in human heuristics. "Bitter" because researchers keep refusing to learn it. § IV
Verifiability: The property of being easy to grade. A domain is verifiable if you can cheaply tell a correct answer from a wrong one. Math and code rank high; taste, ethics, and common sense rank low. The labs RL hardest in verifiable domains. § V
RL environment: A setting in which a model can produce candidate actions, receive a reward signal, and update its weights. Valuable, non-obvious RL environments are the founder opportunity Karpathy gestures at and pointedly does not name. § VI
Jaggedness: The signature pattern of frontier-model capability: peaks in verifiable domains (zero-day vulnerabilities, Olympiad math), valleys in domains the labs can't grade (common sense, taste). Evidence that we are not at general intelligence. § V
End-to-end neural network: An architecture where a single learned model handles the full task, input to output, with no hand-written pipeline glue. Tesla Autopilot's switch from hybrid rules-plus-net to fully end-to-end is the case study. § IV
Animals vs ghosts: Karpathy's framing for two kinds of intelligence. Animals are sculpted by evolution; they have intrinsic motivation. Ghosts (LLMs) are summoned from data; they have none. Confusing them produces bad product decisions. § VIII
Skill (in the agentic sense): A short, copy-paste-able natural-language instruction that installs or configures something for an agent. The OpenClaw install is the exemplar: one paragraph replaces a 600-line bash script. § II
Agent-native infrastructure: Software designed primarily to be operated by agents, not by humans. Structured documentation, machine-callable actions, delegated auth. Stripe's Projects list and Salesforce's Headless 360 are the early reference points. § IX
Context window: The fixed-size buffer of tokens an LLM can attend to in a single forward pass. Karpathy's analogy: the context is the RAM of the new computer; the weights are the CPU. Context engineering is memory management. § III
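
To make the Verifiability and RL environment entries concrete, here is a toy sketch in Python. It is an illustration under stated assumptions, not anything shown in the talk: the names Task, grade, run_episode, and mean_reward are hypothetical, the "policies" are plain functions standing in for a model, and the weight-update step is left out entirely. What it shows is the part that makes a domain verifiable: a cheap, exact grader that turns a candidate answer into a reward.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Task:
    """One arithmetic prompt plus the ground-truth answer used for grading."""
    prompt: str
    answer: int


def grade(task: Task, candidate: str) -> float:
    """Verifier: reward 1.0 for an exact numeric match, 0.0 otherwise."""
    try:
        return 1.0 if int(candidate.strip()) == task.answer else 0.0
    except ValueError:
        return 0.0  # non-numeric output is simply wrong


def run_episode(task: Task, policy: Callable[[str], str]) -> float:
    """One environment step: the policy acts, the verifier returns a reward."""
    return grade(task, policy(task.prompt))


def mean_reward(tasks: Sequence[Task], policy: Callable[[str], str]) -> float:
    """Average reward over a task set (the signal a trainer would optimize)."""
    return sum(run_episode(t, policy) for t in tasks) / len(tasks)


# Hypothetical task set; prompts use the form "<int> <op> <int>".
TASKS = [Task("2 + 3", 5), Task("7 * 6", 42), Task("10 - 4", 6)]


def always_five(prompt: str) -> str:
    """A weak stand-in policy that ignores the prompt."""
    return "5"


def calculator(prompt: str) -> str:
    """A strong stand-in policy that actually computes the answer."""
    left, op, right = prompt.split()
    a, b = int(left), int(right)
    return str({"+": a + b, "-": a - b, "*": a * b}[op])


if __name__ == "__main__":
    print("always_five:", mean_reward(TASKS, always_five))
    print("calculator: ", mean_reward(TASKS, calculator))
```

The grader is three lines because arithmetic has a single right answer. Try writing the equivalent grade function for "is this essay tasteful?" and the low-verifiability end of the glossary entry becomes obvious.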

Appendix C · People, essays, & projects

References & further reading.

Every name dropped in the talk, with what to look up. Skim this list, mark two or three to follow up.

▌ The source

Andrej Karpathy at the Sequoia AI Summit

The conversation this issue reacts to. Karpathy is a co-founder of OpenAI and former director of AI at Tesla; he coined "vibe coding" and, with the Software 1.0/2.0 series, the version numbering of programming paradigms.

Sequoia annual AI summit · conversation with Pat Grady

▌ Essay

Animals vs Ghosts, by Andrej Karpathy

The blog post Karpathy refers to in section VIII. The clearest statement of why LLMs are not minds, and what mistakes follow if you treat them as such.

karpathy.ai · the essay this issue's section VIII paraphrases

▌ Essay

Something Big Is Happening, by Matt Shumer

A friend-of-the-channel piece arguing that recent capability gains exceeded what he had thought possible in such a short window, and that the macroeconomics of work are being rewritten in real time.

linked from Matthew Berman's reaction; cited in section I

▌ Project list

Stripe Projects

Stripe's curated list of companies building agent-native infrastructure. Karpathy reads from it on stage. Includes AgentMail, Algolia, Amplitude, Browserbase, Chroma, Clerk, Cloudflare. A useful tour of the new stack.

projects.stripe.com · section IX

▌ Product

Salesforce Headless 360

Salesforce's reframing of its CRM for the agent era: a clean action surface for agents instead of a UI for humans. The paradigm case Karpathy points to on the application side.

salesforce.com · cited in section IX

▌ Builder

Peter Steinberger

The frontier agentic engineer Berman points to as a working proof of the ceiling rising. Reportedly orchestrates around a hundred agents in parallel from a single keyboard.

cited in section VII as the live example of orchestration at scale

▌ Essay

The Bitter Lesson, by Rich Sutton

The original 2019 essay underwriting half of Karpathy's argument. Two pages, free online, foundational. If you have not read it, read it before you build anything else.

incompleteideas.net · cited in section IV

▌ Aside

Sergey Brin on threatening models

Brin's offhand remark that threatening a model with violence sometimes improves performance. Karpathy uses it in section VIII to remind the audience that ghosts respond to training-distribution patterns, not to feelings.

unsourced public remark · section VIII
