PJFP.com

Pursuit of Joy, Fulfillment, and Purpose

Tag: AI hiring

  • Benedict Evans on the Economics of AI Usage, Why Foundation Models May Become Commodities, and What Comes Next for SaaS

    Benedict Evans returns to the a16z podcast to update the thesis behind his widely read “AI eats the world” presentation, and the picture he paints is less about hype and more about hard economics. In this conversation he works through what has actually played out in the last year, why agentic coding became the one use case with real product market fit, and why he keeps arguing that foundation models may end up as commodities while the value moves somewhere else entirely. You can watch the full conversation here.

    TLDW

    Benedict Evans argues that the AI moment looks a lot like the early internet, the early PC era, and the rollout of mobile data, which means it is exciting, genuinely transformative, and almost impossible to predict use case by use case. Agentic coding is the only field with clear product market fit right now, with revenue run rates exploding from roughly nine billion to forty seven billion, while consumers still use chatbots weekly rather than daily. His central claim is that foundation models show no obvious network effect or sustainable differentiation, the chatbot is a limited v1 interface, and the model labs cannot build every application, so the value will likely move up the stack the way it did with chips, ISPs, and mobile networks rather than staying with the model providers. He covers the brutal supply and demand disequilibrium driving today’s token pricing and ten thousand dollar surprise bills, the financial gravity problem of hyperscalers spending over half their revenue on capex, the Jevons paradox and consumer surplus that may compete away productivity gains, the way the important questions move out of San Francisco and into industries like law, consulting, finance, and advertising, and the distinction between automating tasks and changing jobs. His closing image is an IBM ad from the 1950s promising “150 extra engineers,” a reminder that every platform shift feels unprecedented and that in twenty years we will simply say of course computers do that.

    Thoughts

    The most useful thing Evans does here is refuse to collapse uncertainty into a clean prediction, and then explain exactly why that refusal is the correct posture rather than a cop out. He distinguishes between the parts where he will commit to a view, that foundation models are probably not a product and the chatbot is probably not the right interface, and the parts where there are simply too many open paths to call. That discipline is rare in AI commentary, where the incentive is to sound certain. The commodity argument is not “models are worthless.” It is a chain of reasoning: there is no visible network effect, no durable differentiation beyond willingness to spend, no lock in comparable to Windows or iOS, and a likely structure of three to six well funded competitors plus open source and edge models all selling the same thing. Ask where price discipline comes from in that picture and the honest answer is that it probably does not come from anywhere, which is how you get a commodity even when demand is effectively infinite.

    The mobile data analogy is the load bearing comparison and it deserves to be taken seriously. Mobile data traffic rose something like fifteen hundred to two thousand times over fifteen years, the networks built an extraordinary piece of global infrastructure, everyone came to depend on it, and yet the operators captured almost none of the value because all the interesting stuff got built on top by someone else. Telco stocks were flat for two decades. If that is the template, then the trillion dollars of capex flowing into AI infrastructure can be both a worthwhile investment and a terrible place to expect outsized equity returns, because building the road is not the same as owning the traffic. The counterpoint Evans keeps fairly on the table is the operating system path, where Windows and iOS did capture value, but he notes they had levers and network effects that LLMs do not appear to have.
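
    To put the traffic figure in yearly terms, the quick calculation below converts "fifteen hundred to two thousand times over fifteen years" into an implied compound annual growth rate. It assumes smooth, constant growth, which is a simplification for illustration and not something claimed in the conversation; the function name is hypothetical.

```python
def implied_annual_growth(multiple: float, years: float) -> float:
    """Constant yearly growth rate that compounds to `multiple` over `years`."""
    return multiple ** (1.0 / years) - 1.0


# Figures quoted in the text: traffic rose 1,500x to 2,000x over 15 years.
low = implied_annual_growth(1500, 15)
high = implied_annual_growth(2000, 15)

print(f"Implied growth: {low:.0%} to {high:.0%} per year")
```

    Under that assumption the range works out to roughly 63 to 66 percent a year, sustained for a decade and a half, which is the scale of demand growth that still left operator equity flat.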

    His framing of where the questions live is the part most people in tech underweight. Once a technology works, the interesting questions stop being technology questions. Netflix is not a tech company in the sense that matters, because its real decisions are Los Angeles decisions about shows, talent, and sports, not San Francisco decisions about infrastructure. By the same logic, what AI means for a law firm is mostly a question for people who understand what associates actually do and what clients are actually paying for, not for model researchers. This is why the “the model will just do the whole thing” story keeps running aground. Most valuable software does not solve a problem the customer already knew they had. It often takes years to convince an industry that a problem even exists, and an LLM prompt does not surface latent problems that no one has articulated.

    The economic plumbing he describes is where the near term risk actually sits. We are in extreme disequilibrium, where twenty dollars a month can buy ten thousand dollars of tokens on one side and a weekend of experimentation can produce a ten thousand dollar bill on the other, exactly the pattern mobile data went through around 2009 and 2010. That gets resolved with the boring machinery of caps, throttling, and pricing tiers, not with magic. Layered on top is the financial gravity problem: Microsoft, Meta, and Google heading toward spending more than half of revenue on capex, with roughly seven hundred billion dollars of guidance across the big players, against a hard ceiling because there is not ten trillion dollars a year available to spend. And even when the productivity gains are real, the Jevons paradox and consumer surplus suggest much of the benefit gets competed away. If a discounted cash flow model used to take a week and now takes ten seconds, you do fifty of them and charge the client the same, which is great for clients and unremarkable for margins.
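
    The "boring machinery of caps, throttling, and pricing tiers" can be sketched in a few lines. Everything in the example is a hypothetical illustration: the `Plan` fields, the twenty dollar fee paired with a one hundred dollar allowance, the overage rate, and the two hundred dollar cap are assumptions chosen to show the mechanism, not any provider's real pricing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    """Hypothetical metered plan; every number used with it is illustrative."""
    flat_fee: float        # dollars per month
    included_usage: float  # dollars of token cost covered by the flat fee
    overage_rate: float    # dollars billed per dollar of usage past the allowance
    spend_cap: float       # maximum total bill; usage beyond it is throttled


def monthly_bill(plan: Plan, usage_cost: float) -> tuple[float, float]:
    """Return (amount billed, usage cost that is throttled instead of served)."""
    overage = max(0.0, usage_cost - plan.included_usage)
    uncapped = plan.flat_fee + overage * plan.overage_rate
    if uncapped <= plan.spend_cap:
        return uncapped, 0.0
    # The cap binds: serve only as much overage as the cap pays for.
    servable_overage = (plan.spend_cap - plan.flat_fee) / plan.overage_rate
    return plan.spend_cap, overage - servable_overage


# The disequilibrium described in the text: $20 buying $10,000 of tokens.
subsidy_multiple = 10_000 / 20

capped = Plan(flat_fee=20.0, included_usage=100.0, overage_rate=1.0, spend_cap=200.0)
light_user = monthly_bill(capped, 50.0)
heavy_user = monthly_bill(capped, 10_000.0)
```

    With these made-up numbers the light user still pays the flat twenty dollars, while the weekend experimenter is billed the two hundred dollar cap and has the remaining usage throttled rather than receiving either a free ride or a ten thousand dollar surprise.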

    The honest takeaway for builders is that the answer to “what does this do to software” is more software, probably one or two orders of magnitude more, just as SaaS itself produced an explosion rather than a consolidation. The SaaS apocalypse is real in the sense that some meaningful percentage of existing companies get wiped out, and unknowable in the sense that no one can yet say which ones, which is why thoughtful investors are reluctant to be long software in the dark. For anyone pursuing a more deliberate, purposeful relationship with technology, the closing note is the one to keep: every one of these shifts felt singular and world ending and world making at the time, it reshaped work and put people out of jobs and created things we love, and then it quietly became invisible. The goal is to stay clear eyed about which of those buckets a given change lands in rather than getting swept up in the noise of what someone said at a party yesterday.

    Key Takeaways

    • Agentic coding shifted from “kind of useful” to “really changing everything” at the start of the year, and it is the single field with unambiguous product market fit, where customers are pulling it out of your hands.
    • Coding working first was foreseeable in hindsight: software developers were the ones messing with the tools, and the first thing people do with a new kind of computer is build more computing, just as the first thing people did with PCs was make computers.
    • Anthropic, with less capital raised, chose to focus on coding and got it working, while OpenAI cycled through a more everything all at once strategy before narrowing in.
    • The intense focus on coding comes bundled with a supply crunch, a capacity crunch, and a price and capex imbalance that defines the current moment.
    • Most of the fundamental questions from two or three years ago still have no answers: whether there will be a winner in models, whether models capture value up the stack, how much they can do, and whether consumers will use this daily rather than weekly.
    • There is a wide gap between Valley insiders running clusters of Mac Studios all day and the roughly forty percent of people who say AI is “kind of useful, I used it last week for something.”
    • Outside tech, companies are adopting AI as one at a time point solutions for specific back office processes, like a commodities company using LLMs for better cash flow forecasting, not as a general purpose assistant.
    • Adoption always compounds on prior platforms: you could not have nine hundred million weekly active users in the Netscape era because there were not nine hundred million PCs on the planet.
    • Early in any platform shift almost nothing works smoothly, from sound cards and floppy disks with TCP/IP to computers that froze and lost your work, and AI is at that stage now.
    • Today’s token pricing crunch mirrors the mobile data shock of 2009 to 2010, where flat rate plans collided with surging usage and networks had to realign price with marginal cost through caps, fair use, and throttling.
    • Mobile data traffic rose roughly fifteen hundred to two thousand times in fifteen years, mobile networks earn around a trillion dollars and spend about two hundred billion a year on capex, yet their stocks have been flat for twenty years because all the value moved up the stack.
    • The central LLM question is whether the model can do the whole thing or whether you need hundreds of applications built on top, the same way you needed apps on Windows and iOS.
    • Evans sees no network effect and no sustainable differentiation between models beyond willingness to spend money, which points toward commodity infrastructure sold near marginal cost.
    • Chip companies, ISPs, and mobile operators did not capture the value; Windows and iOS did, but only because they had levers to move up the stack and real network effects, which models lack.
    • A useful comparison is semiconductors, where each generation gets more expensive and the field narrows to fewer players, suggesting three to six frontier model makers spending somewhere between two hundred billion and two trillion dollars a year.
    • Enterprises do not standardize on a model the way they once thought about AWS; the cloud and the model get abstracted away, so customers do not even know which one their SaaS product runs on.
    • Demand for tokens being effectively infinite does not prevent a price equilibrium, exactly as infinite demand for mobile bits still produced murderous price wars between commodity carriers.
    • History teaches that something will happen but rarely what; the smartest people in tech wrongly predicted Android would crush the iPhone on open versus closed grounds.
    • One characteristic of tech is that the moment you understand how something works is the moment to move on, which is why Evans stopped updating his Apple spreadsheet years ago.
    • The people who are good at using a tool are usually not the people who are good at designing what the tool should be, which is why model labs cannot build every skill or vertical application.
    • Claude skills and similar templates resemble file new in Excel: useful starting points that users eventually outgrow, raising the question of who builds the real software.
    • The questions increasingly move out of technology and into specific industries; what AI means for law, consulting, advertising, or accounting is partly an AI question and partly a deep domain question.
    • Netflix is not a tech company in the way that matters, because its real questions are media industry questions about shows, talent, and sports, not infrastructure; the same logic now applies across industries facing AI.
    • AI differs from prior platform shifts because the physical limits are unknown; in 1995 you knew PCs cost three thousand dollars and broadband could not reach everyone overnight, but no one knows how cheap, fast, or capable models will get.
    • Evans offers four buttons to press on any use case: is it just price elasticity and the Jevons paradox, does it remove a cost barrier to entry, does it unlock a new business model, or does it make something previously impossible now possible like trains over horses or Spotify over CDs.
    • Advertising and e-commerce are a standout opportunity because today’s systems know a SKU and a metadata field but not what a product actually is or why people buy it, and LLMs could change that level of understanding.
    • The valuable shift is not doing the old thing more, like more spreadsheets or better email, but doing genuinely new things, such as asking an LLM how to change prices to improve churn using all your call recordings, CRM flows, and product telemetry.
    • Enterprise software today splits into three buckets: big horizontal systems like SAP and Workday, three to four hundred vertical SaaS apps plus a thousand internal apps, and a fuzzy improvised middle of Excel, email, and shared files, with AI arriving as a new option across all three.
    • A core design tension is where to put the probabilistic software that can make mistakes versus the deterministic database that cannot, and whether the LLM sits at the top or the bottom of the stack; the answer is probably both depending on the task.
    • The net effect on software is way more software, since SaaS itself produced one to two orders of magnitude more software and all software companies exist to solve problems created by other software companies.
    • The SaaS apocalypse is real but unknowable: some percentage of SaaS companies get wiped out, but no one knows which, so you should not derate the whole sector fifty percent and many investors are wary of being long software for now.
    • Much of what an organization does is implicit, undocumented, and not in the training data, which is exactly the value McKinsey, Bain, and BCG provide by getting license to map how a company really works.
    • The real decisions are usually exception handling: the question is always what you cannot automate and what still requires human judgment about cases that were never written down.
    • Distinguish tasks from jobs: accountants spend almost none of their time the way they did fifty years ago, yet to the client the job looks the same.
    • LLMs excel where you want the average, the answer anyone would give, and struggle where you specifically do not want the average and cannot fully explain why you did it differently.
    • There is a financial gravity ceiling: Microsoft, Meta, and Google are on track to spend over fifty percent of revenue on capex versus fifteen to twenty percent for capital intensive telecoms, with seven hundred billion in guidance this year and no path to ten trillion.
    • Hyperscalers face an existential FOMO trap: returns look positive now, but they cannot let rivals build the future of compute without participating, even as the CFO asks how much participation is enough.
    • Token maxing will face a reckoning as the disequilibrium resolves, but measuring ROI is hard because most reported benefits so far, like better analytics, support, and productivity, are tough to put a financial value on.
    • Consumer surplus means many gains get competed away: if analysis that took a week now takes a day, you do five times more analysis and charge the same, the way investment banks did with spreadsheets.
    • Evans closes with a 1950s IBM ad promising “150 extra engineers,” a reminder that every fundamental technology change feels unprecedented, and that in twenty years AI will simply be invisible magic we take for granted.

    Detailed Summary

    What changed in the last year

    Evans frames the past year as a narrowing of focus. A year and a half after the first version of his presentation, the field has developed a much clearer sense of diverging product strategies and competitive tension that goes beyond simply building a bigger model with more compute. The dominant shift is that agentic coding started genuinely working, and the entire industry narrowed in on it because it has absolute product market fit, the kind where customers pull the product out of your hands. That success arrives alongside the supply crunch, capacity constraints, and price imbalance that now define the moment. At the same time, the charts keep climbing, models keep getting bigger, capex keeps growing, and usage keeps growing, while the deep questions from a few years ago remain unanswered.

    Why coding worked first

    That coding led was predictable at a naive level: the people experimenting with the tools were software developers, and they naturally tried to make software development work. Evans compares the moment to the internet around 1997 and 1998, and also to PCs in the late seventies and early eighties, when the technology was exciting but it was not clear what it was for and it did not quite work yet. The first thing people did with PCs was make computers, and since LLMs are in a sense computers, the first thing people are doing with them is making more compute. What was harder to foresee was the precise timing of the shift, the moment when agentic coding flipped from useful to transformative at the start of this year.

    Jobs, juniors, and what we have not learned

    On the question of what this means for engineers and team structure, Evans is blunt that we have learned almost nothing yet, because this did not even work six months ago and everyone is scrambling to interpret it. The pricing crunch alone means it will take a couple of years to settle. The newly concrete questions include whether you still hire junior people and what they would do, and why you were hiring juniors in the first place: to do the work itself, or to develop people. Because software development now genuinely automates a class of work that used to be done by people, those questions have moved from theoretical to real, but no one can responsibly claim to know what a software team or a software career looks like in three years.

    OpenAI, Anthropic, and the strategy split

    Evans dryly notes the drama around the model labs, including the disruption of a senior leadership medical leave at OpenAI. In the latter part of last year, OpenAI’s question was essentially what to build on top of the models, an everything all at once approach that looked almost like asking the model for fifteen ideas and then doing all of them. Anthropic, with less capital raised, instead committed to coding and got it working, whether by deliberate strategy or by stumbling into it. The result is that software development plus a few other fields are where things genuinely work, surrounded by a large population of people excited around the edges and corporations quietly automating specific back office processes. He cites a commodities company that wants LLMs for better cash flow forecasting across many small producers, a very different thing from asking a chatbot to summarize your meetings.

    The mobile data analogy and value capture

    The richest section is the comparison to mobile. Adoption always compounds on prior platforms, so AI inherits a far larger installed base than the internet or mobile did at their starts. Early on, nothing works smoothly, and Evans recalls the era of buying a three hundred dollar sound card or wrestling a floppy disk of TCP/IP into a machine. The pricing dynamics directly echo mobile data around 2009 and 2010, when flat rate plans met exploding usage and ten thousand dollar bills, forcing networks to realign price with marginal cost. Crucially, mobile data traffic then rose fifteen hundred to two thousand times, the networks built extraordinary global infrastructure with around a trillion dollars of revenue and two hundred billion in annual capex, and yet their stocks stayed flat for twenty years because all the cool stuff and all the value got built and captured by someone else higher up the stack. Chip companies, ISPs, and mobile operators did not capture value; Windows and iOS did, but they had levers and network effects that models do not appear to share.

    The case that models become commodities

    Evans lays out the building blocks of his commodity thesis. First, there is no clear way to build a model that is sustainably and fundamentally better than everyone else’s, with no visible network effect and no strategic lever comparable to what Instagram, YouTube, or Google search enjoy. Differences in emphasis and taste exist, but not durable competitive moats beyond spending. Second, the chatbot is a weird, limited v1 interface that works well for some tasks and people but requires tooling, the right data, configuration, control, and thoughtful design for most real jobs, and the people good at a job are rarely the people good at designing the tool for it. Third, the labs cannot build every application any more than Microsoft or Apple could build every Windows or iPhone app. Enterprises do not standardize on a model, just as they never standardized on a visible cloud provider, because both get abstracted away. Taken together, that points to low level infrastructure sold by perhaps half a dozen competitors plus open source and edge, with no obvious source of price discipline, which is the definition of a commodity even when demand is infinite.

    The questions move out of technology

    One of the next big questions is when models become good enough that you no longer need the largest, fastest, most expensive model, and can use an older model, an open source model, or one running on device where compute is effectively free to the developer. But the deeper shift is that the important questions move out of technology and into industries. Drawing on his own essays “content isn’t king” and “Netflix isn’t a tech company,” Evans argues that Netflix’s real decisions are Los Angeles media questions, not San Francisco infrastructure questions, and San Francisco does not even know what the right questions are. By the same logic, what AI means for a law firm is mostly a question for people who understand law firms, what generative video means for Hollywood is a question Ben Affleck can answer better than Evans can, and the questions become half AI and half something else.

    Four buttons and the new things AI unlocks

    To reason about impact, Evans offers four buttons. Is a use case just price elasticity, the Jevons paradox of doing the same thing for less or more for the same money? Does it remove a cost that was a barrier to entry, like a newspaper’s printing press? Does it unlock something in your business model? Or does it make something previously impossible now possible, the way steam engines made trains possible regardless of how many horses you bought, or Spotify turned fifteen dollars a month into all the music there is? He stresses that the same broad change can mean wildly different things by industry, just as the internet devastated newspapers but barely touched movie studios. His favorite tractable example is advertising and e-commerce, a trillion dollar advertising market against twenty five trillion in retail, where today’s systems know a SKU and a metadata field and that people who bought one thing bought another, but do not know what a product is or why people buy it. An LLM could in principle understand the product, recommend ten coats at different prices with pros and cons, or look at your Instagram and suggest a winter coat that changes your look but not too much, which would have been science fiction three years ago.

    More software, the SaaS apocalypse, and tasks versus jobs

    For software specifically, Evans expects more competition, cheaper and quicker building, and new categories that were impossible before, all under an uncertain new margin structure where outcome based pricing is hard because most software work cannot be tied cleanly to profit and loss. He frames enterprise software as three buckets, big horizontal systems, hundreds of vertical and internal apps, and a fuzzy improvised middle of Excel and email, with AI arriving as another option across all of them. The deeper design tension is where to place probabilistic software that can make mistakes versus deterministic systems that cannot, and whether the LLM sits at the top or bottom of the stack, with the answer being both depending on the task. The net result is way more software, since SaaS itself produced orders of magnitude more software and software exists to solve problems created by other software. That fuels the SaaS apocalypse anxiety: some companies clearly get wiped out, but since no one knows which, you should not derate the whole sector, even as many investors stay cautious about being long software.

    Implicit knowledge, exception handling, and where the average fails

    Much of what organizations do is implicit, undocumented, and absent from any training data, which is precisely the value of strategy consultancies that get license to map how a company really works versus how it is supposed to work. The real decisions tend to be exception handling, the cases that require human judgment because they were never written down or do not resemble anything seen before. Evans separates tasks from jobs, noting accountants do almost nothing the way they did fifty years ago while the client still buys the same thing. And he offers a sharp test: LLMs are excellent where you want the average, the answer anyone would give, and weak where you specifically do not want the average and cannot fully articulate why you did it differently.

    Capex, financial gravity, and the ROI question

    On spending, Evans describes a financial gravity problem. Microsoft, Meta, and Google are on track to spend over half their revenue on capex this year, against fifteen to twenty percent for capital intensive telecoms, with roughly seven hundred billion in guidance across the big players, a sum comparable to all of telecom or oil and gas. They cannot sustainably leap to one and a half trillion next year because the money is not there, so the curve must eventually taper. The hyperscalers are caught in an existential FOMO trap: returns look positive now, but they cannot sit out what might be the future of compute without risking becoming the next stranded incumbent, even as the CFO asks how much is enough. On token maxing, he expects a reckoning as the disequilibrium resolves, but measuring ROI is genuinely hard because most reported benefits so far are soft and hard to value, and consumer surplus means much of the gain gets competed away, the way faster spreadsheets simply meant more analysis at the same price.
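
    The ceiling argument is easy to check with back-of-envelope arithmetic. The sketch below uses only round figures quoted in this summary (about seven hundred billion in guidance, a hypothetical jump to one and a half trillion, the ten trillion a year that is not available, and mobile networks spending about two hundred billion on about a trillion of revenue). Extrapolating that one jump into a sustained growth rate is an assumption made for illustration, not a forecast from the conversation.

```python
import math

capex_now = 700e9   # roughly seven hundred billion of guidance (from the text)
next_year = 1.5e12  # the leap the text says cannot be sustained
ceiling = 10e12     # the ten trillion a year that is not there to spend

# Telecom benchmark from the text: about $200B capex on about $1T revenue.
telecom_capex_share = 200e9 / 1e12

# That hypothetical jump is a little more than a doubling in a single year.
growth_factor = next_year / capex_now

# Assumption: if spending kept compounding at that pace, how many years
# would pass before it reached the ceiling?
years_to_ceiling = math.log(ceiling / capex_now) / math.log(growth_factor)

print(f"Telecom capex share of revenue: {telecom_capex_share:.0%}")
print(f"Growth factor implied by the jump: {growth_factor:.2f}x")
print(f"Years to reach the ceiling at that pace: {years_to_ceiling:.1f}")
```

    At a pace of slightly more than doubling each year, spending would reach the ten trillion mark in about three and a half years, which is why the curve has to taper well before then.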

    Closing image

    Evans ends with an IBM advertisement from the early 1950s showing a sea of engineers holding slide rules, with the tagline that an IBM electronic calculator gives you 150 extra engineers, exactly the pitch behind countless modern startup decks. We move through these fundamental technology waves every ten or fifteen or twenty years, each one feeling completely unlike anything before, and AI is amazing and transformative in the same way mobile, the internet, and PCs were. The base case is that it will produce wonderful things, ruin some livelihoods, put people out of work, and eventually become invisible. His one line description of where it all ends up is that it will be magic, and in twenty years we will simply say of course computers do that, the way an hour of crash free streaming HD video over Wi-Fi already feels unremarkable.

    Notable Quotes

    “Agentic coding went from being kind of useful to really changing everything.”

    Benedict Evans, on the pivotal shift at the start of the year

    “We are in this extreme scarcity. We can’t spend $10 trillion a year on AI infrastructure cuz there isn’t $10 trillion a year there to spend on it.”

    Benedict Evans, on the hard ceiling of AI capex

    “I don’t think foundation models are a product. I don’t think a chatbot is a product. I think the value will be further up.”

    Benedict Evans, stating the core of his thesis

    “They built this amazing piece of global incredibly sophisticated very expensive global infrastructure with enormous growth in use, and they didn’t make any money from it because all the value moved up stack.”

    Benedict Evans, on the mobile network analogy

    “The moment that you understand something and you know how it works and what’s going to happen is the moment you should move on to something else.”

    Benedict Evans, on how to pay attention in tech

    “These are all Los Angeles questions. These are not San Francisco questions. No one in San Francisco even knows what the right questions are.”

    Benedict Evans, on why Netflix is not a tech company

    “The important stuff is not doing the old thing but more. It’s doing something new that you couldn’t have done with the old thing.”

    Benedict Evans, on where the real value of a new technology shows up

    “All software companies exist to solve problems created by other software companies.”

    Benedict Evans, on why AI produces more software, not less

    “It’s going to be magic, and in 20 years time we’ll just say, well, of course that’s how it is. Computers have always done that.”

    Benedict Evans, on how the whole shift ends up

    This is a dense, clear eyed conversation that rewards a full listen, especially if you are trying to think past the hype cycle about where AI value actually lands. Watch the full conversation here, and check out the “AI eats the world” presentation referenced throughout.

    Related Reading

    • Benedict Evans’ website: home of the “AI eats the world” presentation and his newsletter referenced throughout the conversation.
    • Andreessen Horowitz (a16z): the venture firm whose podcast hosted this discussion and where Evans was formerly a partner.
    • Jevons paradox (Wikipedia): background on the price elasticity idea Evans uses to explain how cheaper AI may lead to more usage rather than savings.
    • Stratechery by Ben Thompson: the analysis Evans cites on software as a designed workflow versus a process that grows out of how a business runs.
    • The Pursuit of Purpose: a PJFP look at finding direction and meaning in work as automation reshapes careers and industries.
  • Andrej Karpathy on Vibe Coding vs Agentic Engineering: Why He Feels More Behind Than Ever in 2026

    Andrej Karpathy, co-founder of OpenAI, former head of AI at Tesla, and now founder of Eureka Labs, returned to Sequoia Capital’s AI Ascent 2026 stage for a wide-ranging conversation with partner Stephanie Zhan. One year after coining the term “vibe coding,” Karpathy unpacked what has changed, why he has never felt more behind as a programmer, and why the discipline emerging on top of vibe coding, which he calls agentic engineering, is the more serious craft worth learning right now.

    The conversation covered Software 3.0, the limits of verifiability, why LLMs are better understood as ghosts than animals, and why you can outsource your thinking but never your understanding. Below is a complete breakdown of the talk for anyone building, hiring, or learning in the agent era.

    TLDW

    Karpathy describes a sharp transition that happened in December 2025, when agentic coding tools crossed a threshold and code chunks just started coming out fine without correction. He frames the current moment as Software 3.0, where prompting an LLM is the new programming, and entire app categories are collapsing into a single model call. He distinguishes vibe coding (raising the floor for everyone) from agentic engineering (preserving the professional quality bar at much higher speed). Models remain jagged because they are trained on what labs choose to verify, so founders should look for valuable but neglected verifiable domains. Taste, judgment, oversight, and understanding remain uniquely human responsibilities, and tools that enhance understanding are the ones he is most excited about.

    Key Takeaways

    • December 2025 was a clear inflection point. Code chunks from agentic tools started arriving correct without edits, and Karpathy stopped correcting the system entirely.
    • Software 3.0 means programming has become prompting. The context window is your lever over the LLM interpreter, which performs computation in digital information space.
    • Open Code’s installer is a Software 3.0 example. Instead of running a complex shell script, you copy paste a block of text to your agent, and the agent figures out your environment.
    • The Menu Gen anecdote illustrates how entire apps can become spurious. What used to require OCR, image generation, and an app hosted on Vercel can now be a single Gemini plus Nano Banana prompt.
    • Vibe coding raises the floor. Agentic engineering preserves the professional ceiling. The two are different disciplines.
    • The 10x engineer multiplier is now far higher than 10x for people who are good at agentic engineering.
    • Hiring processes have not caught up. Puzzle interviews are the old paradigm. New evaluations should look like building a full Twitter clone for agents and surviving simulated red team attacks from other agents.
    • Models are jagged because reinforcement learning rewards what is verifiable, and labs choose which verifiable domains to invest in. Strawberry letter counts and the 50 meter car wash question show how state-of-the-art models can refactor 100,000 line codebases yet fail at trivial reasoning.
    • If you are in a verifiable setting, you can run your own fine tuning, build RL environments, and benefit even when the labs are not focused on your domain.
    • LLMs are ghosts, not animals. They are statistical simulations summoned from pre training and shaped by RL appendages, not creatures with curiosity or motivation. Yelling at them does not help.
    • Taste, aesthetics, spec design, and oversight remain human jobs. Models still produce bloated, copy paste heavy code with brittle abstractions.
    • Documentation is still written for humans. Agent native infrastructure, where docs are explicitly designed to be copy pasted into an agent, is a major opportunity.
    • The future likely involves agent representation for people and organizations, with agents talking to other agents to coordinate meetings and tasks.
    • You can outsource your thinking but not your understanding. Tools that help humans understand information faster are uniquely valuable.

    Detailed Summary

    Why Karpathy Feels More Behind Than Ever

    Karpathy opens by describing how he has been using agentic coding tools for over a year. For most of that period, the experience was mixed. The tools could write chunks of code, but they often required edits and supervision. December 2025 changed everything. With more time during a holiday break and the release of newer models, Karpathy noticed that the chunks just came out fine. He kept asking for more. He cannot remember the last time he had to correct the agent. He started trusting the system, and what followed was a cascade of side projects.

    He stresses that anyone whose mental model of AI was formed by ChatGPT in early 2025 needs to look again. A coherent agentic workflow that genuinely works is a fundamentally different experience, and the transition was stark.

    Software 3.0 Explained

    The Software 1.0 paradigm was writing explicit code. Software 2.0 was programming by curating datasets and training neural networks. Software 3.0 is programming by prompting. When you train a GPT class model on a sufficiently large set of tasks, the model implicitly learns to multitask everything in the data. The result is a programmable computer where the context window is your interface, and the LLM is the interpreter performing computation in digital information space.

    Karpathy gives two concrete examples. The first is Open Code’s installer. Normally a shell script handles installation across many platforms, and these scripts balloon in complexity. Open Code instead provides a block of text you copy paste to your agent. The agent reads your environment, follows instructions, debugs in a loop, and gets things working. You no longer specify every detail. The agent supplies its own intelligence.

    The Menu Gen Story

    The second example is Karpathy’s Menu Gen project. He built an app that takes a photo of a restaurant menu, OCRs the items, generates pictures for each dish, and renders the enhanced menu. The app runs on Vercel and chains together multiple services. Then he saw a Software 3.0 alternative. You take a photo, give it to Gemini, and ask it to use Nano Banana to overlay generated images onto the menu. The model returns a single image with everything rendered. The entire app he built is now spurious. The neural network does the work. The prompt is the photo. The output is the photo. There is no app between them.

    Karpathy uses this to argue that founders should not just think of AI as a speedup of existing patterns. Entirely new things become possible. His example is LLM driven knowledge bases that compile a wiki for an organization from raw documents. That is not a faster version of older code. It is a new capability with no prior equivalent.

    What Will Look Obvious in Hindsight

    Stephanie Zhan asks what the equivalent of building websites in the 1990s or mobile apps in the 2010s looks like today. Karpathy speculates about completely neural computers. Imagine a device that takes raw video and audio as input, runs a neural net as the host process, and uses diffusion to render a unique UI for each moment. He notes that early computing in the 1950s and 60s was undecided between calculator like and neural net like architectures. We went down the calculator path. He thinks the relationship may eventually flip, with neural networks becoming the host and CPUs becoming co processors used for deterministic appendages.

    Verifiability and Jagged Intelligence

    Karpathy has written at length about verifiability. Classical computers automate what you can specify in code. The current generation of LLMs automates what you can verify. Frontier labs train models inside giant reinforcement learning environments, so the models peak in capability where verification rewards are strong, especially math and code. They stagnate or get rough around the edges elsewhere.
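
    To make "automating what you can verify" concrete, here is a minimal sketch of a verifiable reward, assuming a toy letter-counting task. The test cases, the `reward` function, and both candidate solutions are hypothetical illustrations, not anything shown in the talk or used by any lab.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical test cases for a "count the letter" task: ((word, letter), expected).
CASES: Sequence[Tuple[Tuple[str, str], int]] = (
    (("strawberry", "r"), 3),
    (("banana", "a"), 3),
    (("agent", "z"), 0),
)


def reward(candidate: Callable[[str, str], int]) -> float:
    """Return the fraction of cases the candidate answers exactly right."""
    passed = 0
    for args, expected in CASES:
        try:
            if candidate(*args) == expected:
                passed += 1
        except Exception:
            # A crashing candidate simply earns no credit for that case.
            pass
    return passed / len(CASES)


def good_solution(word: str, letter: str) -> int:
    return word.count(letter)


def sloppy_solution(word: str, letter: str) -> int:
    # Deliberately wrong: a run of repeated letters is counted only once.
    count = 0
    previous = ""
    for ch in word:
        if ch == letter and previous != letter:
            count += 1
        previous = ch
    return count


good_score = reward(good_solution)
sloppy_score = reward(sloppy_solution)
```

    The point of the sketch is that the score needs no human judgment. Any domain where a check like this can be written cheaply is a domain where reinforcement learning has something to optimize against.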

    This explains the jagged intelligence puzzle. The classic example was counting letters in strawberry. The newer one Karpathy offers: a state of the art model will refactor a 100,000 line codebase or find zero day vulnerabilities, then tell you to walk to a car wash 50 meters away because it is so close. The two coexisting capabilities should be jarring. They reveal that you must stay in the loop, treat models as tools, and understand which RL circuits your task lands in.

    He also points out that data distribution choices matter. The jump in chess capability from GPT 3.5 to GPT 4 came largely because someone at OpenAI added a huge amount of chess data to pre training. Whatever ends up in the mix gets disproportionately good. You are at the mercy of what labs prioritize, and you have to explore the model the labs hand you because there is no manual.

    Founder Advice in a Lab Dominated World

    Asked what founders should do given that labs are racing toward escape velocity in obvious verifiable domains, Karpathy points back to verifiability itself. If your domain is verifiable but currently neglected, you can build RL environments and run your own fine tuning. The technology works. Pull the lever with diverse RL environments and a fine tuning framework, and you get something useful. He hints there is one specific domain he finds undervalued but declines to name it on stage.

    On the question of what is automatable only from a distance, Karpathy says almost everything can ultimately be made verifiable. Even writing can be assessed by councils of LLM judges. The differences are in difficulty, not in possibility.
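
    A council of judges can be sketched as a simple majority vote. In the sketch below the three judges are hypothetical rule-based stand-ins so the example runs on its own; in practice each one would presumably be a separate LLM call, which is an assumption here and not something described in the talk.

```python
from collections import Counter
from typing import Callable, Iterable

# A judge takes a piece of writing and returns "pass" or "fail".
Judge = Callable[[str], str]


def length_judge(text: str) -> str:
    # Stand-in heuristic: require at least five words.
    return "pass" if len(text.split()) >= 5 else "fail"


def punctuation_judge(text: str) -> str:
    # Stand-in heuristic: require a closing period.
    return "pass" if text.strip().endswith(".") else "fail"


def filler_judge(text: str) -> str:
    # Stand-in heuristic: reject the filler word "very".
    return "fail" if "very" in text.lower().split() else "pass"


def council_verdict(text: str, judges: Iterable[Judge]) -> str:
    """Return "pass" only when strictly more judges pass than fail."""
    votes = Counter(judge(text) for judge in judges)
    return "pass" if votes["pass"] > votes["fail"] else "fail"


COUNCIL = (length_judge, punctuation_judge, filler_judge)

strong = council_verdict(
    "Agents fill in the implementation while humans own the spec.", COUNCIL
)
weak = council_verdict("very good", COUNCIL)
```

    Turning several noisy opinions into one verdict is what makes a fuzzy domain like writing usable as a reward signal, though the quality of that signal depends entirely on the judges.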

    From Vibe Coding to Agentic Engineering

    Vibe coding raises the floor. Anyone can build something. Agentic engineering preserves the professional quality bar that existed before. You are still responsible for your software. You are still not allowed to ship vulnerabilities. The question is how you go faster without sacrificing standards. Karpathy calls it an engineering discipline because coordinating spiky, stochastic agents to maintain quality at speed requires real skill.

    The ceiling on agentic engineering capability is very high. The old idea of a 10x engineer is now an understatement. People who are good at this peak far above 10x.

    What Mediocre Versus AI Native Looks Like

    Karpathy compares this to how different generations use ChatGPT. The difference between a mediocre engineer and an AI native one using Claude Code, Codex, or Open Code is investment in setup and full use of available features. Just as previous generations of engineers got the most out of Vim or VS Code, today’s strong engineers tune their agentic environments deeply.

    He thinks hiring processes have not caught up. Most companies still hand out puzzles. The new test should look like asking a candidate to build a full Twitter clone for agents, make it secure, simulate user activity with agents, and then run multiple Codex 5.4x high instances trying to break it. The candidate’s system should hold up.

    What Humans Still Own

    Agents are intern level entities right now. Humans are responsible for aesthetics, judgment, taste, and oversight. Karpathy describes a Menu Gen bug where the agent tried to associate Stripe purchases with Google accounts using email addresses as the key, instead of a persistent user ID. Email addresses can differ between Stripe and Google accounts. This kind of specification level mistake is exactly what humans must catch.
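
    The shape of that bug can be shown in a few lines. This is a minimal sketch under stated assumptions: the `User` and `CreditLedger` types, the `match_by_email` helper, and the example addresses are all hypothetical, and no real Stripe or Google API is called.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class User:
    user_id: str       # stable internal ID assigned at sign-up
    google_email: str  # login email, which may differ from the billing email


@dataclass
class CreditLedger:
    # Credits are keyed by the persistent user ID, never by email.
    credits_by_user_id: Dict[str, int] = field(default_factory=dict)

    def record_purchase(self, user_id: str, credits: int) -> None:
        current = self.credits_by_user_id.get(user_id, 0)
        self.credits_by_user_id[user_id] = current + credits


def match_by_email(users: List[User], billing_email: str) -> Optional[User]:
    """Fragile lookup: finds nobody when the billing email differs from the login email."""
    for user in users:
        if user.google_email == billing_email:
            return user
    return None


users = [User(user_id="u_123", google_email="ada@example.com")]

# The customer pays with a different address than the one used to log in.
lost_purchase = match_by_email(users, "ada@work.example")

# Keying on the ID that was attached to the purchase survives the mismatch.
ledger = CreditLedger()
ledger.record_purchase("u_123", 5)
ledger.record_purchase("u_123", 5)
```

    How the ID travels with the payment is left out here. The design point is only that the join key should be something the application controls, not something the customer can type differently in two places.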

    He works with agents to design detailed specs and treats those as documentation. The agent fills in the implementation. He has stopped memorizing API details for things like NumPy axis arguments or PyTorch reshape versus permute. The intern handles recall. Humans handle architecture, design, and the right questions.
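
    For readers who have not run into it, the reshape versus permute distinction is the kind of detail that is easy to forget and easy to verify. The sketch below uses plain nested lists as a stand-in for tensors and does not call NumPy or PyTorch; the helper names `reshape` and `permute` are chosen here for illustration only.

```python
from typing import List

Matrix = List[List[int]]


def reshape(matrix: Matrix, rows: int, cols: int) -> Matrix:
    """Keep the row-major element order and only change the shape."""
    flat = [value for row in matrix for value in row]
    if rows * cols != len(flat):
        raise ValueError("new shape must hold the same number of elements")
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]


def permute(matrix: Matrix) -> Matrix:
    """Swap the two axes (a transpose), which changes the element order."""
    return [list(column) for column in zip(*matrix)]


x = [[1, 2, 3],
     [4, 5, 6]]  # shape (2, 3)

reshaped = reshape(x, 3, 2)  # [[1, 2], [3, 4], [5, 6]]
permuted = permute(x)        # [[1, 4], [2, 5], [3, 6]]
```

    Both results have shape (3, 2), but only the permuted one keeps each original column together. Mixing the two up silently scrambles data, which is why knowing that the distinction exists matters more than memorizing the exact call.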

    Reading the actual code agents produce can still cause heart attacks. It is bloated, full of copy paste, riddled with awkward and brittle abstractions. His Micro GPT project, an attempt to simplify LLM training to its bare essence, was nearly impossible to drive through agents. The models hate simplification. That capability sits outside their RL circuits. Nothing is fundamentally preventing this from improving. The labs simply have not invested.

    Animals Versus Ghosts

    Karpathy returns to his framing that we are not building animals, we are summoning ghosts. Animal intelligence comes from evolution and is shaped by intrinsic motivation, fun, curiosity, and empowerment. LLMs are statistical simulation circuits where pre training is the substrate and RL is bolted on as appendages. They are jagged. They do not respond to being yelled at. They have no real curiosity. The ghost framing is partly philosophical, but it changes how you approach them. You stay suspicious. You explore. You do not assume the system you used yesterday will behave the same on a new task.

    Agent Native Infrastructure

    Most software, frameworks, libraries, and documentation are still written for humans. Karpathy’s pet peeve is being told to do something instead of being given a block of text to copy paste to his agent. He wants agent first infrastructure. The Menu Gen project’s hardest part was not writing code. It was deploying on Vercel, configuring DNS, navigating service settings, and stringing together integrations. He wants to give a single prompt and have the entire thing deployed without touching anything.

    Long term he expects agent representation for individuals and organizations. His agent will negotiate meeting details with your agent. The world becomes one of sensors, actuators, and agent native data structures legible to LLMs.

    Education and What Still Matters

    The most striking line of the conversation comes near the end. Karpathy quotes a tweet that shaped his thinking: you can outsource your thinking but you cannot outsource your understanding. Information still has to make it into your brain. You still need to know what you are building and why. You cannot direct agents well if you do not understand the system.

    This is part of why he is so excited about LLM driven knowledge bases. Every time he reads an article, his personal wiki absorbs it, and he can query it from new angles. Every projection onto the same information yields new insight. Tools that enhance human understanding are uniquely valuable because LLMs do not excel at understanding. That bottleneck is yours to manage.

    Thoughts

    The most useful frame in this talk is the distinction between vibe coding and agentic engineering. It clarifies what has been muddled for the past year. Vibe coding is about access. Anyone can produce something. Agentic engineering is about discipline. You preserve the standards that made software trustworthy in the first place, while moving at speeds that would have seemed absurd two years ago. These are not the same activity, and conflating them is part of why so many shipped products feel half built.

    The Menu Gen anecdote is the kind of story that should make every solo developer pause. If a single Gemini plus Nano Banana prompt can replace a multi service app deployed on Vercel, the question for any builder becomes how much of what you are working on right now is going to be made spurious by the next model release. The honest answer is probably more than you want to admit. The defensive posture is not building thicker apps. It is choosing problems where the model alone is not enough, where taste, distribution, infrastructure, or specific verifiable RL environments give you something the next model cannot collapse into a prompt.

    The verifiability lens is also unusually practical. If you are a solo builder, the question shifts from what is possible to what is verifiable but neglected. The labs will eat the obvious verifiable domains because that is how their RL pipelines are set up. The opportunity is in domains where verification is possible but the labs have not yet invested. That is a much more concrete strategic filter than vague intuitions about defensibility.

    The car wash example is going to stick. State of the art models can refactor enormous codebases and still tell you to walk somewhere a sane person would drive. That is the lived reality of jagged intelligence, and it argues strongly for staying in the loop on real decisions rather than handing off everything to agents. The agents are excellent fillers of blanks. They are not yet trustworthy specifiers of the spec.

    Finally, the line about outsourcing thinking but not understanding is worth taping above the desk. The bottleneck is no longer typing speed, syntax recall, or even API knowledge. It is whether the human in the loop actually understands the system being built. Tools that genuinely improve human understanding, including personal knowledge bases that re project information through different prompts, are likely the most undervalued category of products being built right now. The opportunity is not just in agents. It is in the cognitive scaffolding that makes humans good directors of agents.