Day: April 29, 2026

How GPT-5, Claude, and Gemini Are Actually Trained and Served: The Real Math Behind Frontier AI Infrastructure
Reiner Pope, CEO of MatX and former TPU architect at Google, sat down with Dwarkesh Patel for a different kind of episode: a chalk-and-blackboard lecture on how frontier LLMs like GPT-5, Claude, and Gemini are actually trained and served. With nothing but a handful of equations and public API prices, Reiner reverse engineers an astonishing amount of what the labs are doing. If you have ever wondered why Fast Mode costs more, why context length stalls around 200k tokens, why models seem 100x over-trained, or why hyperscalers are pouring half a trillion dollars into memory, this is the most lucid explanation on the internet.

TLDW

Frontier LLM economics come down to two simple budgets: compute time and memory time. Once you write the rooflines on a blackboard, almost everything else falls out of them. Optimal batch size is roughly 300 times your sparsity ratio (around 2,000 to 3,000 tokens for a DeepSeek-style model). A new batch “train” departs every 20 milliseconds because that is how long it takes to read HBM end to end. Mixture of experts strongly favors staying inside a single rack, which is why scale-up domains went from 8 GPUs (Hopper) to 72 (Blackwell) to 500-plus (Rubin). Pipeline parallelism solves weight capacity but does nothing for KV cache, and adds painful per-hop latency, which is why Ilya famously said pipelining is not wise. Because of reinforcement learning and inference economics, frontier models are roughly 100x over-trained versus Chinchilla optimal, and a well-tuned model should output roughly as many tokens during deployment as went into its pre-training corpus. API prices leak the rest: Gemini’s 50% premium above 200k tokens reveals where KV memory time crosses compute time, prefill being 5x cheaper than decode confirms decode is memory bandwidth bound, and cache hit pricing tiers map directly to HBM, DDR, flash, and (yes) spinning disk. The lecture closes on a beautiful detour about the convergent evolution of neural nets and cryptographic ciphers.

Key Takeaways
- Two equations explain almost everything. A roofline analysis comparing compute time to memory fetch time predicts cost, latency, and architectural choices with shocking accuracy.
- Optimal batch size is about 300 times sparsity. For a DeepSeek model that activates 32 of 256 experts, that lands around 2,000 to 3,000 tokens per batch. Real deployments go a bit higher to leave headroom.
- The 20 millisecond train. A new batch departs every 20ms because that is how long it takes to read all of HBM once. Worst-case queue latency is roughly 40ms.
- Fast Mode is just smaller batches. Pay 6x more, get 2.5x faster decode by amortizing weights over fewer users. There is a hard latency floor at the HBM read time.
- Slow Mode would not save much. Once you are past the optimal batch size, the cost-per-token plateau is dominated by compute, not weight fetches. You cannot meaningfully amortize KV cache because it is unique per sequence.
- One rack is the natural MoE unit. Expert parallelism wants all-to-all communication, which strongly favors the scale-up network (NVLink) over the scale-out network (roughly 8x slower).
- Bigger scale-up domains drove model scaling. The jump from 8 (Hopper) to 72 (Blackwell) to 500-plus (Rubin) GPUs per scale-up domain increased aggregate memory bandwidth by 8x, which is why trillion-plus parameter models only became viable recently.
- Pipeline parallelism is overrated for inference. It saves on weight memory capacity but does nothing for KV cache memory. It also adds milliseconds of latency per hop in decode.
- Why Ilya said pipelining is not wise. Architectural constraints (cross-layer residuals like in Kimi) and the inability to amortize weight loads across micro-batches make pipelining a hassle in training too.
- The memory wall is real and paradoxical. Hyperscalers reportedly spend 50% of CapEx on memory, yet racks have far more HBM than a trillion-parameter model needs. The capacity is there for KV cache and batch size, not for weights.
- Frontier models are roughly 100x over-trained vs Chinchilla. When you minimize total cost across pre-training plus RL plus inference, smaller models trained on more data win.
- Each model should output roughly all human knowledge. If you equalize pre-training and inference compute, the total tokens served by a model during its lifetime should approximate its training corpus. Roughly 150 trillion in, 150 trillion out.
- API pricing reveals architecture. Gemini’s 50% premium above 200k context, the 5x decode-vs-prefill ratio, and cache duration tiers all leak detailed information about KV size, memory bottlenecks, and storage hierarchy.
- KV cache is roughly 2KB per token. Solving Gemini’s pricing equation gives a plausible 1.6 to 2 kilobytes per token at 100B active parameters and 200k context.
- Decode is memory bandwidth bound, prefill is compute bound. The 5x price gap is direct evidence.
- Cache pricing maps to memory tiers. The 5-minute and 1-hour cache durations probably correspond to flash and spinning disk drain times respectively. LLM serving uses spinning disk.
- Context length is stuck near 200k. Memory bandwidth, not compute, is the binding constraint. Sparse attention gives a square-root improvement, but the gain is not unlimited.
- Cryptography and neural nets are mathematical cousins. Both rely on jumbling information across inputs. Feistel ciphers led directly to RevNets (reversible neural networks). Adversarial attacks mirror the cipher avalanche property.
Detailed Summary

The Roofline: Compute Time vs Memory Time

Reiner starts with the simplest possible model of LLM inference. The time to do a forward pass is bounded below by the maximum of compute time and memory fetch time. Compute time is the batch size times active parameters divided by FLOPs. Memory time is total parameters divided by memory bandwidth, plus a KV cache term that scales with batch size and context length. From these two equations, almost every economic and architectural fact about modern LLMs can be derived.

Plotting cost per token against batch size gives a clean picture: at low batch you pay enormous overhead because you cannot amortize the weight fetches, and at high batch you hit a compute floor. There is a sweet spot where memory bandwidth time equals compute time. That sweet spot is what Fast Mode and Slow Mode are tuning around.
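The two equations and the cost curve can be sketched in a few lines. This is a minimal illustration, not Reiner's exact formulation: it follows the text in ignoring constant factors, and the replica figures in `HW` (parameter counts, FLOPs, bandwidth) are hypothetical values chosen only so that the hardware ratio is 300 and the sparsity is 8.

```python
def step_time_s(batch, active_params, total_params, flops_per_s, mem_bw,
                context_len=0, kv_bytes_per_token=0.0, bytes_per_param=1.0):
    """Lower bound on one forward pass: max(compute time, memory time)."""
    # Compute time as the text states it: batch * active params / FLOPs
    # (constant factors such as FLOPs per parameter are ignored).
    compute = batch * active_params / flops_per_s
    # Weights are read once per pass, however many sequences share the pass.
    weights = total_params * bytes_per_param / mem_bw
    # KV cache is unique per sequence, so it grows with batch and context.
    kv = batch * context_len * kv_bytes_per_token / mem_bw
    return max(compute, weights + kv)


def time_per_token_s(batch, **hw):
    """Hardware seconds attributable to each token in the batch."""
    return step_time_s(batch, **hw) / batch


# Hypothetical replica (assumed numbers): FLOPs / bandwidth = 300, sparsity = 8.
HW = dict(active_params=100e9, total_params=800e9,
          flops_per_s=1.2e16, mem_bw=4e13)

for b in (1, 64, 512, 2400, 8192):
    print(b, step_time_s(b, **HW), time_per_token_s(b, **HW))
```

With these assumed numbers the per-token time falls steeply as the batch grows, then flattens once the batch reaches 300 times the sparsity, which is the plateau the text describes.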

Why Fast Mode Costs More: The Batch Trade-Off

When Claude Code or Codex offers Fast Mode at 6x the price for 2.5x the speed, what is really happening is that they are running you at a smaller batch size. Smaller batch means weight loads are amortized over fewer users, so cost per token goes up. But latency goes down because each forward pass touches less data. There is a hard floor on latency because you have to read every byte of HBM at least once per token, and that takes about 20 milliseconds on Blackwell-class hardware. There is also a soft ceiling on Slow Mode savings because the unamortizable parts (KV cache fetches, compute) eventually dominate.

The 20 Millisecond Train

HBM capacity divided by HBM bandwidth lands consistently around 20 milliseconds across generations of Nvidia hardware. That is the natural cadence at which a frontier model can run a forward pass over all its weights. Reiner uses a memorable analogy: a train departs every 20 milliseconds. Any users whose requests are ready board the train. If the train is full, they wait. If it is empty, it leaves anyway. This is why you do not need millions of concurrent users to saturate a model’s batch. You only need enough to fill a 2,000-token train every 20ms.
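The cadence is just capacity divided by bandwidth. The sketch below uses approximate public per-GPU figures as assumptions (they are not quoted in the episode summary), and the worst-case wait is one reading of the train analogy: a request that just misses a departure waits a full interval and then rides the next train.

```python
# Approximate public per-GPU figures, treated here as assumptions:
# (HBM capacity in bytes, HBM bandwidth in bytes per second).
GPUS = {
    "H100 SXM": (80e9, 3.35e12),
    "B200": (192e9, 8.0e12),
}


def hbm_sweep_ms(capacity_bytes, bandwidth_bytes_per_s):
    """Milliseconds needed to read all of HBM once."""
    return 1e3 * capacity_bytes / bandwidth_bytes_per_s


def worst_case_wait_ms(train_interval_ms):
    """Assumed worst case: just miss one train, then ride the next one."""
    return 2 * train_interval_ms


for name, (cap, bw) in GPUS.items():
    print(name, round(hbm_sweep_ms(cap, bw), 1), "ms")
print("worst case:", worst_case_wait_ms(20), "ms")
```

Both assumed parts land in the low twenties of milliseconds, which is consistent with the roughly 20 ms figure in the text.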

Why Optimal Batch Size Is About 300 Times Sparsity

Setting compute time equal to weight fetch time and rearranging gives a beautiful result: batch size needs to be greater than (FLOPs / memory bandwidth) times (total params / active params). The hardware ratio is a dimensionless 300 on most GPUs and has stayed remarkably stable from A100 through Hopper, Blackwell, and Rubin. The model term is just the sparsity ratio. For DeepSeek with 32 of 256 experts active, that is 8. So optimal batch is around 2,400 tokens. Real deployments push this as high as 3x that figure to leave headroom for non-ideal efficiency. At 64 trains per second and roughly 2,000 tokens per train, that is roughly 128,000 tokens per second per replica, or about 1/1000 of Gemini’s reported global throughput.
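The arithmetic in this section is short enough to check directly. Starting from batch × active / FLOPs ≥ total / bandwidth and rearranging gives the bound below. The function names are illustrative, and the inputs are the figures quoted in the text.

```python
def min_batch(hw_flops_per_byte, total_experts, active_experts):
    """Batch size at which compute time catches up with weight fetch time."""
    sparsity = total_experts / active_experts  # total params / active params
    return hw_flops_per_byte * sparsity


def replica_tokens_per_s(trains_per_s, tokens_per_train):
    """Throughput of one replica if every train leaves full."""
    return trains_per_s * tokens_per_train


print(min_batch(300, 256, 32))          # 2400.0
print(replica_tokens_per_s(64, 2000))   # 128000
```

Note that 64 trains per second corresponds to a departure roughly every 16 ms, slightly faster than the round 20 ms figure used elsewhere in the summary.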

Mixture of Experts Wants to Live Inside a Rack

MoE all-to-all routing means every token can be sent to any expert on any GPU. The communication pattern strongly prefers the fast scale-up network (NVLink) inside a rack to the slower scale-out network between racks. Scale-out is roughly 8x slower in bandwidth. This is why one rack ends up being the natural unit for an expert layer, and why Nvidia’s progression from 8 GPUs per scale-up domain (Hopper) to 72 (Blackwell) to 500-plus (Rubin) has been such a big deal for model size scaling.

Reiner walks through the physical constraints: cable density, bend radius, weight, power, cooling. Modern racks are pushing every dimension to the limit. Stuffing more GPUs into the scale-up domain is genuinely a hardware engineering problem.

Pipeline Parallelism: Why Ilya Said It Is Not Wise

Pipelining splits model layers across racks. It is the natural way to scale beyond the scale-up domain for very large models. But it has problems. In inference, pipelining does not save runtime; it only saves memory capacity per rack, which is not the binding constraint anyway, because a trillion-parameter model needs only about a terabyte and racks have 10x that. In training, pipelining creates the famous bubble (idle GPU time at the start and end of each pipeline pass) and forces micro-batching, which kills your ability to amortize weight loads across the global batch.
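The size of the training bubble can be estimated with the standard idealized formula for a simple fill-and-drain pipeline schedule. This formula is not from the episode; it is a common textbook approximation that assumes equal-cost stages and no communication overhead.

```python
def bubble_fraction(stages, micro_batches):
    """Idle fraction of a fill-and-drain pipeline with equal-cost stages."""
    # Each pass takes (micro_batches + stages - 1) time slots per stage,
    # of which (stages - 1) slots are spent filling or draining.
    return (stages - 1) / (micro_batches + stages - 1)


for m in (4, 16, 64):
    print(m, round(bubble_fraction(stages=8, micro_batches=m), 3))
```

Under these assumptions, shrinking the bubble requires many micro-batches per pass, which is the same micro-batching that the text says hurts weight-load amortization.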

There is also an architectural cost. Models like Kimi use cross-layer residual connections where attention attends to layers a few back, and pipelining makes those patterns very hard to implement cleanly. Ilya’s quip “as we now know, pipelining is not wise” captures all of this.

The Memory Wall Paradox

Industry analysts report that hyperscalers are spending 50% of CapEx on memory this year, while smartphones and laptops are seeing 30% volume drops because there is not enough HBM and DDR to go around. Yet a Blackwell rack already has tens of terabytes of HBM, far more than a trillion-parameter model needs. The reason is that all that extra capacity goes to KV cache, batch size, and longer context. The bandwidth, not the capacity, is what matters most for weight loading. This also implies that hardware could be designed with less HBM per GPU if you commit to pipelining the weights, which is a real architectural option for a chip startup like MatX.

Reinforcement Learning and the 100x Over-Training of Frontier Models

Chinchilla scaling laws say a model with N active parameters should be trained on roughly 20N tokens for compute-optimal training. But frontier labs do not just minimize training cost. They minimize training plus inference cost across the model’s deployment lifetime. With reinforcement learning added to the mix, the cost equation has three terms: pre-training (6 times active params times tokens), RL (somewhere between 2 and 6 times active params times RL tokens, with a 30% efficiency penalty for decode-heavy rollouts), and inference (2 times active params times inference tokens).

If you assume those three roughly equalize at the optimum (a heuristic that holds for many cost curves), you get a clean conclusion: the data going into pre-training should be roughly equal to the data going into RL, which should be roughly equal to the tokens served at inference. With 100 billion active parameters and roughly 150 trillion training tokens, that is about 75x past Chinchilla optimal. Reiner rounds it to 100x. This is the most concrete first-principles argument for why frontier models are so deeply over-trained, and it implies that as inference traffic grows, models should keep getting smaller and longer-trained.
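The three cost terms and the over-training ratio can be written out as a small sketch. The coefficients come from the text. How the 30% RL penalty enters the formula is an assumption made here (useful throughput reduced to 70%), since the summary does not say.

```python
def chinchilla_tokens(active_params):
    """Compute-optimal token count per the 20N rule of thumb in the text."""
    return 20 * active_params


def overtraining_ratio(active_params, train_tokens):
    return train_tokens / chinchilla_tokens(active_params)


def pretrain_flops(n, tokens):
    return 6 * n * tokens


def inference_flops(n, tokens):
    return 2 * n * tokens


def rl_flops(n, tokens, flops_per_param_token, efficiency_penalty=0.30):
    # flops_per_param_token is between 2 and 6 per the text.
    # Assumption: the penalty is modeled as dividing by (1 - penalty).
    return flops_per_param_token * n * tokens / (1.0 - efficiency_penalty)


N, D = 100e9, 150e12   # active params and training tokens from the text
print(overtraining_ratio(N, D))                   # 75.0
print(pretrain_flops(N, D) / inference_flops(N, D))
```

With these coefficients, strict compute equality would put inference tokens at about 3x the pre-training tokens, so the "roughly equal" claim is best read as an order-of-magnitude statement.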

Each Model Should Output All of Human Knowledge

The most jaw-dropping consequence: if you equalize pre-training and inference compute, then the total tokens generated by a model across its deployment lifetime should approximate the size of its training corpus. GPT-5, served to hundreds of millions of users for two months, will collectively output something on the order of 150 trillion tokens. That is roughly the sum of human knowledge in textual form. Each frontier model is, in this sense, a one-shot universal author of a corpus the size of its source material.

API Prices Leak Architecture

This is where the lecture gets really fun. Gemini 3.1 charges 50% more for context above 200k tokens. Setting KV cache memory time equal to compute time at exactly 200k context and solving for KV cache size gives roughly 1.6 to 2 kilobytes per token, which is plausible for a model with 8 KV heads, dense attention, and head dimension of 128.
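One way to reproduce the estimate is to equate KV fetch time (batch × context × KV bytes / bandwidth) with compute time (batch × active params / FLOPs) and solve for KV bytes. The sketch below assumes the hardware ratio of 300 and the 100B active parameters quoted elsewhere in the summary; it is a reconstruction, not necessarily the exact derivation from the lecture.

```python
def kv_bytes_per_token(active_params, crossover_context, hw_flops_per_byte=300):
    """KV bytes per token at which KV fetch time equals compute time."""
    # batch * ctx * kv / BW == batch * active / FLOPs
    # => kv = active / ((FLOPs / BW) * ctx); batch cancels out.
    return active_params / (hw_flops_per_byte * crossover_context)


print(round(kv_bytes_per_token(100e9, 200_000)))  # 1667
```

The result sits at the low end of the 1.6 to 2 kilobyte range given in the text.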

The 5x premium for output (decode) tokens versus input (prefill) tokens is direct evidence that decode is severely memory bandwidth bound and prefill is compute bound. Prefill processes many tokens per weight load, so it amortizes memory cost over the whole sequence. Decode processes one token per weight load, so it pays full memory cost every time.

Cache hits priced at one tenth of cache misses tell you that storing the KV cache in HBM (or DDR or flash) is much cheaper than recomputing it from scratch. The two cache duration tiers (5 minutes and 1 hour) probably correspond to memory tiers whose drain times match those durations: flash for the 5-minute tier, spinning disk for the 1-hour tier. Yes, spinning disk is in the modern LLM serving stack, despite being decades-old technology.

Why Context Length Has Plateaued at 200k

Context lengths shot up from 8k to roughly 200k during the GPT-3 to GPT-4 era and have stayed roughly flat for the past two years. Reiner argues this is the natural balance point where memory bandwidth cost crosses compute cost. Going to a million tokens is expensive. Going to 100 million tokens (which Dario has hinted is needed for true continual learning via in-context learning) is essentially impossible without either a memory technology breakthrough or a much more aggressive sparse attention scheme. Sparse attention helps with a square-root improvement, but it is not unlimited. Going too sparse trades off too much quality.

Cryptography Meets Neural Nets

The episode ends with a lovely intellectual detour. Cryptographic protocols and transformer architectures both rely on jumbling information across all inputs. They are doing inverse versions of the same operation: ciphers take structured input and produce randomness, while neural nets take noisy input and extract structure. Both fields use differentiation as their primary attack vector (differential cryptanalysis on ciphers, gradient descent on neural nets). Adversarial attacks on image classifiers exploit exactly the avalanche property that good ciphers are designed for.

The most concrete crossover: Feistel ciphers, which let you build invertible functions out of non-invertible ones, were ported into deep learning as RevNets (reversible networks) in 2017. RevNets let you run the entire network backwards during the backward pass, eliminating the need to store activations and dramatically reducing training memory footprint. It is the opposite trade-off of KV caching: spending compute to save memory rather than spending memory to save compute.
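The Feistel-style coupling behind reversible networks fits in a few lines. In this minimal sketch, `f` and `g` are deliberately non-invertible toy functions standing in for learned sub-networks, and integer inputs are used so the round trip is exact.

```python
def f(xs):
    return [v * v for v in xs]      # squaring is not invertible


def g(xs):
    return [abs(v) for v in xs]     # neither is absolute value


def forward(x1, x2):
    """Additive coupling: each half is updated using only the other half."""
    y1 = [a + b for a, b in zip(x1, f(x2))]
    y2 = [a + b for a, b in zip(x2, g(y1))]
    return y1, y2


def inverse(y1, y2):
    """Undo the coupling by recomputing f and g, never inverting them."""
    x2 = [a - b for a, b in zip(y2, g(y1))]
    x1 = [a - b for a, b in zip(y1, f(x2))]
    return x1, x2


x1, x2 = [1, -2, 3], [4, 0, -5]
y1, y2 = forward(x1, x2)
print(inverse(y1, y2) == (x1, x2))  # True
```

Because the inputs can be recovered from the outputs, a training loop built this way can recompute activations during the backward pass instead of storing them, which is the memory saving the text describes.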

Thoughts

The most striking thing about this episode is how much can be deduced from a few equations and the public API price sheets of the major labs. The labs treat their architectures as trade secrets, but the moment they price tokens to be close to cost (which competition forces them to do), the prices themselves leak the underlying ratios. Anyone with a pen and paper can reverse engineer the KV cache size, the memory tier hierarchy, and the compute-vs-memory bottleneck profile of a frontier model. There is a lesson here for builders: in competitive markets, the prices tell you almost everything.

The 100x over-training result has interesting implications for what comes next. If the optimal balance shifts further toward inference (as adoption keeps growing), models should get smaller and longer-trained. That is good news for serving costs and bad news for training-compute-as-moat. The biggest determinant of model quality might increasingly be data quality and RL environment design, not raw pre-training compute. This squares with what is visible publicly: the leading labs are investing heavily in RL infrastructure, evaluations, and synthetic data pipelines.

The memory wall is the most underrated infrastructure story in AI. Most people think of compute as the bottleneck, but Reiner makes it clear that memory bandwidth is what actually limits context length, which limits how agentic a model can be in practice. If you cannot get to 100 million token contexts, you probably cannot have an AI agent that has been working with you for a month and remembers everything. Either some sparse attention scheme has to give us cheap effective context length, or we need a memory hardware breakthrough, or we have to invent some form of continual learning that does not rely on context windows. None of those paths are obviously easy, and the fact that context length has been flat for two years despite enormous investment suggests we are stuck against a real wall.

The cryptography parallel is the kind of cross-disciplinary insight that does not show up enough in AI discourse. Treating neural networks as a kind of differentiable cipher reframes a lot of the architecture choices (residual connections, layer normalization, attention) as deliberate efforts to make the function smooth and invertible enough to learn, in contrast to ciphers, which are deliberately designed to resist exactly that. Adversarial robustness research probably has a lot more to learn from cryptanalysis than it currently does.

Finally, the format itself is a win. Most AI podcasts are conversational, which is great for personality but bad for technical depth. A blackboard lecture with an interlocutor who asks naive questions at the right moments is a much higher bandwidth medium. More of this, please.
April 29, 2026
Andrej Karpathy on Vibe Coding vs Agentic Engineering: Why He Feels More Behind Than Ever in 2026
Andrej Karpathy, co-founder of OpenAI, former head of AI at Tesla, and now founder of Eureka Labs, returned to Sequoia Capital’s AI Ascent 2026 stage for a wide-ranging conversation with partner Stephanie Zhan. One year after coining the term “vibe coding,” Karpathy unpacked what has changed, why he has never felt more behind as a programmer, and why the discipline emerging on top of vibe coding, which he calls agentic engineering, is the more serious craft worth learning right now.

The conversation covered Software 3.0, the limits of verifiability, why LLMs are better understood as ghosts than animals, and why you can outsource your thinking but never your understanding. Below is a complete breakdown of the talk for anyone building, hiring, or learning in the agent era.

TLDW

Karpathy describes a sharp transition that happened in December 2025, when agentic coding tools crossed a threshold and code chunks just started coming out fine without correction. He frames the current moment as Software 3.0, where prompting an LLM is the new programming, and entire app categories are collapsing into a single model call. He distinguishes vibe coding (raising the floor for everyone) from agentic engineering (preserving the professional quality bar at much higher speed). Models remain jagged because they are trained on what labs choose to verify, so founders should look for valuable but neglected verifiable domains. Taste, judgment, oversight, and understanding remain uniquely human responsibilities, and tools that enhance understanding are the ones he is most excited about.

Key Takeaways
- December 2025 was a clear inflection point. Code chunks from agentic tools started arriving correct without edits, and Karpathy stopped correcting the system entirely.
- Software 3.0 means programming has become prompting. The context window is your lever over the LLM interpreter, which performs computation in digital information space.
- Open Code’s installer is a Software 3.0 example. Instead of a complex shell script, you copy-paste a block of text to your agent, and the agent figures out your environment.
- The Menu Gen anecdote illustrates how entire apps can become spurious. What used to require OCR, image generation, and a hosted Vercel app can now be a single Gemini plus Nano Banana prompt.
- Vibe coding raises the floor. Agentic engineering preserves the professional ceiling. The two are different disciplines.
- The 10x engineer multiplier is now far higher than 10x for people who are good at agentic engineering.
- Hiring processes have not caught up. Puzzle interviews are the old paradigm. New evaluations should look like building a full Twitter clone for agents and surviving simulated red team attacks from other agents.
- Models are jagged because reinforcement learning rewards what is verifiable, and labs choose which verifiable domains to invest in. Strawberry letter counts and the 50-meter car wash question show how state-of-the-art models can refactor 100,000-line codebases yet fail at trivial reasoning.
- If you are in a verifiable setting, you can run your own fine-tuning, build RL environments, and benefit even when the labs are not focused on your domain.
- LLMs are ghosts, not animals. They are statistical simulations summoned from pre-training and shaped by RL appendages, not creatures with curiosity or motivation. Yelling at them does not help.
- Taste, aesthetics, spec design, and oversight remain human jobs. Models still produce bloated, copy-paste-heavy code with brittle abstractions.
- Documentation is still written for humans. Agent-native infrastructure, where docs are explicitly designed to be copy-pasted into an agent, is a major opportunity.
- The future likely involves agent representation for people and organizations, with agents talking to other agents to coordinate meetings and tasks.
- You can outsource your thinking but not your understanding. Tools that help humans understand information faster are uniquely valuable.
Detailed Summary

Why Karpathy Feels More Behind Than Ever

Karpathy opens by describing how he has been using agentic coding tools for over a year. For most of that period, the experience was mixed. The tools could write chunks of code, but they often required edits and supervision. December 2025 changed everything. With more time during a holiday break and the release of newer models, Karpathy noticed that the chunks just came out fine. He kept asking for more. He cannot remember the last time he had to correct the agent. He started trusting the system, and what followed was a cascade of side projects.

He wants to stress that anyone whose model of AI was formed by ChatGPT in early 2025 needs to look again. A coherent agentic workflow that genuinely works is a fundamentally different experience, and the transition was stark.

Software 3.0 Explained

The Software 1.0 paradigm was writing explicit code. Software 2.0 was programming by curating datasets and training neural networks. Software 3.0 is programming by prompting. When you train a GPT-class model on a sufficiently large set of tasks, the model implicitly learns to multitask everything in the data. The result is a programmable computer where the context window is your interface, and the LLM is the interpreter performing computation in digital information space.

Karpathy gives two concrete examples. The first is Open Code’s installer. Normally a shell script handles installation across many platforms, and these scripts balloon in complexity. Open Code instead provides a block of text you copy-paste to your agent. The agent reads your environment, follows instructions, debugs in a loop, and gets things working. You no longer specify every detail. The agent supplies its own intelligence.

The Menu Gen Story

The second example is Karpathy’s Menu Gen project. He built an app that takes a photo of a restaurant menu, OCRs the items, generates pictures for each dish, and renders the enhanced menu. The app runs on Vercel and chains together multiple services. Then he saw a Software 3.0 alternative. You take a photo, give it to Gemini, and ask it to use Nano Banana to overlay generated images onto the menu. The model returns a single image with everything rendered. The entire app he built is now spurious. The neural network does the work. The prompt is the photo. The output is the photo. There is no app between them.

Karpathy uses this to argue that founders should not just think of AI as a speedup of existing patterns. Entirely new things become possible. His example is LLM-driven knowledge bases that compile a wiki for an organization from raw documents. That is not a faster version of older code. It is a new capability with no prior equivalent.

What Will Look Obvious in Hindsight

Stephanie Zhan asks what the equivalent of building websites in the 1990s or mobile apps in the 2010s looks like today. Karpathy speculates about completely neural computers. Imagine a device that takes raw video and audio as input, runs a neural net as the host process, and uses diffusion to render a unique UI for each moment. He notes that early computing in the 1950s and 60s was undecided between calculator-like and neural-net-like architectures. We went down the calculator path. He thinks the relationship may eventually flip, with neural networks becoming the host and CPUs becoming co-processors used for deterministic appendages.

Verifiability and Jagged Intelligence

Karpathy has spent significant time writing about verifiability. Classical computers automate what you can specify in code. The current generation of LLMs automates what you can verify. Frontier labs train models inside giant reinforcement learning environments, so the models peak in capability where verification rewards are strong, especially math and code. They stagnate or get rough around the edges elsewhere.

This explains the jagged intelligence puzzle. The classic example was counting letters in strawberry. The newer one Karpathy offers: a state-of-the-art model will refactor a 100,000-line codebase or find zero-day vulnerabilities, then tell you to walk to a car wash 50 meters away because it is so close. The two coexisting capabilities should be jarring. They reveal that you must stay in the loop, treat models as tools, and understand which RL circuits your task lands in.

He also points out that data distribution choices matter. The jump in chess capability from GPT-3.5 to GPT-4 came largely because someone at OpenAI added a huge amount of chess data to pre-training. Whatever ends up in the mix gets disproportionately good. You are at the mercy of what labs prioritize, and you have to explore the model the labs hand you because there is no manual.

Founder Advice in a Lab-Dominated World

Asked what founders should do given that labs are racing toward escape velocity in obvious verifiable domains, Karpathy points back to verifiability itself. If your domain is verifiable but currently neglected, you can build RL environments and run your own fine-tuning. The technology works. Pull the lever with diverse RL environments and a fine-tuning framework, and you get something useful. He hints there is one specific domain he finds undervalued but declines to name it on stage.

On the question of what is automatable only from a distance, Karpathy says almost everything can ultimately be made verifiable. Even writing can be assessed by councils of LLM judges. The differences are in difficulty, not in possibility.

From Vibe Coding to Agentic Engineering

Vibe coding raises the floor. Anyone can build something. Agentic engineering preserves the professional quality bar that existed before. You are still responsible for your software. You are still not allowed to ship vulnerabilities. The question is how you go faster without sacrificing standards. Karpathy calls it an engineering discipline because coordinating spiky, stochastic agents to maintain quality at speed requires real skill.

The ceiling on agentic engineering capability is very high. The old idea of a 10x engineer is now an understatement. People who are good at this peak far above 10x.

What Mediocre Versus AI-Native Looks Like

Karpathy compares this to how different generations use ChatGPT. The difference between a mediocre and an AI-native engineer using Claude Code, Codex, or Open Code is investment in setup and full use of available features. The same way previous generations of engineers got the most out of Vim or VSCode, today’s strong engineers tune their agentic environments deeply.

He thinks hiring processes have not caught up. Most companies still hand out puzzles. The new test should look like asking a candidate to build a full Twitter clone for agents, make it secure, simulate user activity with agents, and then run multiple Codex 5.4x high instances trying to break it. The candidate’s system should hold up.

What Humans Still Own

Agents are intern-level entities right now. Humans are responsible for aesthetics, judgment, taste, and oversight. Karpathy describes a Menu Gen bug where the agent tried to associate Stripe purchases with Google accounts using email addresses as the key, instead of a persistent user ID. Email addresses can differ between Stripe and Google accounts. This kind of specification-level mistake is exactly what humans must catch.
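A toy reconstruction of this class of bug is shown below. Every name here (`Account`, `Purchase`, the fields, the sample emails) is hypothetical and invented for illustration; none of it is taken from Karpathy's code or from the Stripe or Google APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Account:
    user_id: str        # persistent app-level ID (hypothetical)
    login_email: str    # email on the login provider side


@dataclass(frozen=True)
class Purchase:
    user_id: str        # same persistent ID, attached at checkout
    billing_email: str  # email typed into the payment form
    credits: int


def credits_by_email(accounts, purchases):
    """Fragile join: silently drops purchases whose emails do not match."""
    totals = {a.login_email: 0 for a in accounts}
    for p in purchases:
        if p.billing_email in totals:
            totals[p.billing_email] += p.credits
    return totals


def credits_by_user_id(accounts, purchases):
    """Robust join: keyed on the ID that both records share."""
    totals = {a.user_id: 0 for a in accounts}
    for p in purchases:
        if p.user_id in totals:
            totals[p.user_id] += p.credits
    return totals


accounts = [Account("u1", "alice@personal.example")]
purchases = [Purchase("u1", "alice@work.example", 10)]
print(credits_by_email(accounts, purchases))    # {'alice@personal.example': 0}
print(credits_by_user_id(accounts, purchases))  # {'u1': 10}
```

Both functions run without errors, which is why a mistake like this tends to surface only when someone reviews the spec rather than the output.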

He works with agents to design detailed specs and treats those as documentation. The agent fills in the implementation. He has stopped memorizing API details for things like NumPy axis arguments or PyTorch reshape versus permute. The intern handles recall. Humans handle architecture, design, and the right questions.

Reading the actual code agents produce can still cause heart attacks. It is bloated, full of copy-paste, riddled with awkward and brittle abstractions. His Micro GPT project, an attempt to simplify LLM training to its bare essence, was nearly impossible to drive through agents. The models hate simplification. That capability sits outside their RL circuits. Nothing is fundamentally preventing this from improving. The labs simply have not invested.

Animals Versus Ghosts

Karpathy returns to his framing that we are not building animals, we are summoning ghosts. Animal intelligence comes from evolution and is shaped by intrinsic motivation, fun, curiosity, and empowerment. LLMs are statistical simulation circuits where pre-training is the substrate and RL is bolted on as appendages. They are jagged. They do not respond to being yelled at. They have no real curiosity. The ghost framing is partly philosophical, but it changes how you approach them. You stay suspicious. You explore. You do not assume the system you used yesterday will behave the same on a new task.

Agent-Native Infrastructure

Most software, frameworks, libraries, and documentation are still written for humans. Karpathy’s pet peeve is being told to do something instead of being given a block of text to copy-paste to his agent. He wants agent-first infrastructure. The Menu Gen project’s hardest part was not writing code. It was deploying on Vercel, configuring DNS, navigating service settings, and stringing together integrations. He wants to give a single prompt and have the entire thing deployed without touching anything.

Long term, he expects agent representation for individuals and organizations. His agent will negotiate meeting details with your agent. The world becomes one of sensors, actuators, and agent-native data structures legible to LLMs.

Education and What Still Matters

The most striking line of the conversation comes near the end. Karpathy quotes a tweet that shaped his thinking: you can outsource your thinking but you cannot outsource your understanding. Information still has to make it into your brain. You still need to know what you are building and why. You cannot direct agents well if you do not understand the system.

This is part of why he is so excited about LLM-driven knowledge bases. Every time he reads an article, his personal wiki absorbs it, and he can query it from new angles. Every projection onto the same information yields new insight. Tools that enhance human understanding are uniquely valuable because LLMs do not excel at understanding. That bottleneck is yours to manage.

Thoughts

The most useful frame in this talk is the distinction between vibe coding and agentic engineering. It clarifies what has been muddled for the past year. Vibe coding is about access. Anyone can produce something. Agentic engineering is about discipline. You preserve the standards that made software trustworthy in the first place, while moving at speeds that would have seemed absurd two years ago. These are not the same activity, and conflating them is part of why so many shipped products feel half-built.

The Menu Gen anecdote is the kind of story that should make every solo developer pause. If a single Gemini plus Nano Banana prompt can replace a multi-service, Vercel-deployed app, the question for any builder becomes how much of what you are working on right now is going to be made superfluous by the next model release. The honest answer is probably more than you want to admit. The defensive posture is not building thicker apps. It is choosing problems where the model alone is not enough, where taste, distribution, infrastructure, or specific verifiable RL environments give you something the next model cannot collapse into a prompt.

The verifiability lens is also unusually practical. If you are a solo builder, the question shifts from what is possible to what is verifiable but neglected. The labs will eat the obvious verifiable domains because that is how their RL pipelines are set up. The opportunity is in domains where verification is possible but the labs have not yet invested. That is a much more concrete strategic filter than vague intuitions about defensibility.

The car wash example is going to stick. State of the art models can refactor enormous codebases and still tell you to walk somewhere a sane person would drive. That is the lived reality of jagged intelligence, and it argues strongly for staying in the loop on real decisions rather than handing off everything to agents. The agents are excellent fillers of blanks. They are not yet trustworthy specifiers of the spec.

Finally, the line about outsourcing thinking but not understanding is worth taping above the desk. The bottleneck is no longer typing speed, syntax recall, or even API knowledge. It is whether the human in the loop actually understands the system being built. Tools that genuinely improve human understanding, including personal knowledge bases that re-project information through different prompts, are likely the most undervalued category of products being built right now. The opportunity is not just in agents. It is in the cognitive scaffolding that makes humans good directors of agents.
April 29, 2026
