PJFP.com

Pursuit of Joy, Fulfillment, and Purpose

Tag: NVLink

Jensen Huang at Stanford CS153 Frontier Systems on Co-Design, Agentic Computing, Vera Rubin, Open Models, and the Million-X Decade That Reshaped AI Infrastructure
https://www.youtube.com/watch?v=tsQB0n0YV3k

NVIDIA CEO Jensen Huang returned to Stanford for the CS153 Frontier Systems class (the room nicknamed itself “AI Coachella”) to lay out, in raw form, how he thinks about the computer being reinvented for the first time in over sixty years. Across roughly seventy minutes of student questions he walks through the codesign philosophy that gave NVIDIA a million-x decade, the architectural through-line from Hopper to Grace Blackwell to Vera Rubin to Feynman, the case for open source foundation models, the realities of tokens per watt and MFU, energy demand running a thousand times higher, the China and export-control debate, and his own biggest strategic mistakes. Watch the full conversation on YouTube.

TLDW

Huang argues every layer of computing has changed: the programming model, the system architecture, the deployment pattern, the economics. Co-design across CPUs, GPUs, networking, storage, switches and compilers gave NVIDIA roughly a million-x speed-up over ten years versus the ten-x Moore’s Law era, and that headroom is what let researchers say “just train on the whole internet.” Hopper was built for pre-training, Grace Blackwell NVLink72 for inference and reasoning (50x over Hopper in two years), Vera Rubin is built for agents that load long memory, call tools and need a low-latency single-threaded CPU bolted directly to the GPU, and Feynman extends that to swarms of agents that spawn sub-agents. Open weights matter because safety, sovereignty (230-plus languages no one else will fund) and domain models for biology, autonomy, robotics and climate need a foundation that NVIDIA is willing to seed. Compute is not really the scarce resource (Huang says place the order and the chips ship), the broken thing is institutional budgeting that can’t put a billion dollars into a shared university supercomputer. Energy demand is heading a thousand times higher and this is finally the moment market forces alone will fund sustainable generation. On geopolitics he rejects the GPUs-as-atomic-bombs framing and warns America will end up like its telecom industry if it cedes two thirds of the world. On career he advises seeking suffering on purpose. On strategy he says observe, reason from first principles, build a mental model, work backwards, minimize opportunity cost, maximize optionality.

Key Takeaways
- The computing model has been substantially unchanged since the IBM System 360, sixty-plus years ago. Huang’s first computer architecture book was the System 360 manual. AI is the first true reinvention.
- Old computing was pre-recorded retrieval. New computing is generated, contextually aware and continuous. Cloud was on-demand. Agentic systems run continuously.
- Codesign is NVIDIA’s central thesis. Inherited from the Hennessy and Patterson RISC era at Stanford, extended across CPUs, GPUs, networking, switches, storage, compilers and frameworks all optimized together.
- The result of full-stack codesign: roughly 1,000,000x faster compute over ten years, versus a generous 10x to 100x for Moore’s Law in the same period. Dennard scaling effectively ended a decade ago.
- That million-x speed-up is what unlocked “train on all of the internet” as a realistic AI strategy.
- After GPT, Huang says it was obvious thinking was next. Reasoning is just generating tokens consumed internally, then using tools is generating tokens consumed externally. Agentic systems followed predictably.
- Education needs AI baked into the curriculum, not just taught as a subject. Pre-recorded textbooks cannot keep pace with knowledge being generated in real time.
- Huang says he cannot learn anymore without AI. He has the AI read the paper, then read every related paper, then become a dedicated researcher he can interrogate.
- Mead and Conway and the first-principles methodology of semiconductor design are still worth learning even though most of the scaling tricks have been exhausted.
- NVIDIA itself is one of the largest consumers of Anthropic and OpenAI tokens in the world. One hundred percent of NVIDIA engineers are now agentically supported. Huang recommends Claude and similar tools by name and says open-source downloads will not match the integrated product harness.
- NVIDIA still invests heavily in open foundation models because language and intelligence represent the codification of human knowledge. Five pillars: Nemotron (language), BioNeMo (biology), Alphamayo (autonomous vehicles), Groot (humanoid robotics) and a climate science model (mesoscale multiphysics).
- Sovereign language models matter. Roughly 230 world languages will never be a top priority for a commercial frontier lab. Nemotron is near-frontier and fully fine-tunable so any country can adapt it.
- Safety and security require open weights. You cannot defend against or audit a black box. Transparent systems let researchers interrogate models and let defenders deploy swarms.
- The future of cyber defense is not bigger-model-versus-bigger-model. It is trillions of cheap fast small models like Nemotron Nano surrounding the threat.
- Domain models fuse language priors with world models. Alphamayo learned to drive safely on a few million miles instead of billions because it can reason like a human about the road.
- MFU (Model Flops Utilization) is a misleading metric. Huang says he wants low MFU, because that means he over-provisioned every resource and never gets pinned by Amdahl’s law during a spike.
- The xAI Memphis cluster running at 11 percent MFU is not necessarily a failure mode. In disaggregated prefill plus decode inference you can deliver very high tokens per watt with very low MFU.
- The right metric is performance, ultimately tokens per watt as a proxy for intelligence per watt, and even that needs adjustment because not all tokens are equal. Coding tokens are worth more than other tokens.
- Hopper was designed for pre-training. NVIDIA chose to build multi-billion-dollar systems when the largest existing scientific supercomputer cost $350 million, with no proven customer base. It worked.
- Grace Blackwell NVLink72 was designed for inference, especially the high-memory-bandwidth decode phase. It is the world’s first rack-scale computer and delivered a 50x speed-up over Hopper in two years, against an expected 2x from Moore’s Law.
- Vera Rubin is designed for agents. Long-term memory wired into storage and into the GPU fabric, working memory, heavy tool use, and Vera, a CPU optimized for low-latency multi-core single-threaded code so a multi-billion-dollar GPU system does not stall waiting on a slow tool call.
- Feynman is being shaped for swarms of agents with sub-agents and sub-sub-agents, a recursive software topology that demands a new compute pattern.
- Tokens per watt improved 50x in one generation. Compounding energy efficiency is the lever NVIDIA controls directly.
- Total compute energy demand is heading roughly a thousand times higher than today, possibly two orders of magnitude beyond that. Huang says he would not be surprised if the estimate is low.
- For the first time in history, market forces alone are enough to fund solar, nuclear and grid upgrades. Government subsidies are no longer required to make sustainable energy investment rational.
- Copper interconnect is becoming a bottleneck. Photonics is moving from optional to structural inside racks and across them.
- Comparing NVIDIA GPUs to atomic bombs, Huang says, is a stupid analogy. A billion people use NVIDIA GPUs. He advocates them to his family. He does not advocate atomic bombs to anyone.
- If the United States cedes two thirds of the global market to competitors on policy grounds, the American technology industry will end up like American telecommunications, which was policied out of existence.
- Huang directly rejects AI doom-by-singularity narratives. It is not true that we have no idea how these systems work. It is not true that the technology becomes infinitely powerful in a nanosecond. He calls the rhetoric irresponsible and harmful to the field students are about to enter.
- On Stanford specifically: if the university president places an order, NVIDIA will deliver the chips. The bottleneck is that no university department has a billion-dollar compute budget because budgeting is fragmented across grants. Stanford’s $40 billion endowment is more than enough to fix that.
- “It’s Stanford’s fault” is meant as empowerment. If something is your fault, you can solve it.
- Career advice: do not optimize purely for passion. Most people do not yet know what they love. Pick the job in front of you and do it as well as possible. Even as CEO, Huang says, 90 percent of the work is hard and he suffers through it.
- Suffering on purpose builds the muscle of resilience. When the company, the team or the family needs you to be tough, that muscle has to already exist.
- NVIDIA’s first generation of products was technically wrong in nearly every dimension: curved surfaces instead of triangles, no Z-buffer, forward instead of inverse texture mapping, no floating point. The strategic recovery, not the technology, taught Huang the lessons that have lasted decades.
- The biggest clean strategic mistake Huang names is the move into mobile chips (Tegra). It grew to a billion dollars then went to zero when Qualcomm’s modem dominance shut NVIDIA out of the 3G to 4G transition. The recovery into automotive and robotics (the Thor chip is the great great great grandson of that mobile lineage) was real, but Huang refuses to rationalize the original choice.
- Forecasting framework: observe, reason from first principles, ask “so what” and “what next” until you have a mental model of the future, place your company inside that model, then work backwards while minimizing opportunity cost and maximizing optionality.
- Best part of the CEO job: living at the intersection of vision, strategy and execution surrounded by people capable enough to make ambitious visions real. Worst part: the responsibility for everyone who joined the spaceship, especially in the near-death moments NVIDIA had four or five times early on.
- Underrated insider note: Huang’s first apple pie with cheese, first hot fudge sandwich and first milkshake all happened at Denny’s. The Superbird, the fried chicken and a custom Superbird-style ham and cheese with tomato and mustard are his order.
Detailed Summary

Computing reinvented from the ground up

Huang frames the moment as the first true rewrite of the computer in sixty-plus years. From the IBM System 360 forward, the mental model of writing code, running code, taking a computer to market and reasoning about applications stayed roughly constant. AI changes the programming model itself. Software is no longer a compiled binary running deterministically on a CPU. It is a neural network running on a GPU producing generated, contextual, real-time output. That cascades into how companies are organized, what tools developers use, what the network and storage stack look like, and what an application is even allowed to do. Robo-taxis, he notes, are an application no one would have attempted before deep learning unlocked perception.

Codesign and the million-x decade

Codesign is the philosophical center of the talk. Huang traces it to the RISC work of John Hennessy at Stanford, where simpler instruction sets won by being co-designed with the compiler rather than maximally optimized in isolation. NVIDIA extends the principle across every layer simultaneously: GPU architecture, CPU architecture, NVLink and NVSwitch fabrics, photonic interconnects, networking silicon, storage paths, CUDA libraries, frameworks and ultimately the model design. The numbers Huang gives are arresting. Moore’s Law in its prime delivered roughly 100x per decade. By the time Dennard scaling broke, real-world gains had compressed to roughly 10x. NVIDIA’s codesigned stack delivered between 100,000x and 1,000,000x over the same ten-year window. That non-linear speed-up is, in Huang’s telling, the precondition for modern AI: it is what allowed researchers to stop curating training sets and just feed the entire internet to the model.

Education has to fuse first principles with AI tools

Asked how curriculum should evolve, Huang argues AI must be integrated into the learning process, not just taught about. He recalls Hennessy writing his textbook by hand a chapter a week while Huang was a student, and says pre-recorded textbooks cannot keep up with the rate at which AI generates new knowledge. He describes his own learning workflow: hand the paper to an AI, then have it read the entire surrounding literature, then treat the AI as a dedicated researcher who can be interrogated. At the same time he defends the classics. Mead and Conway are still the foundation. Most modern semiconductor scaling tricks have been exhausted, but knowing where the field came from sharpens judgment when designing what comes next.

Open source and the five domain pillars

Huang gives one of the most detailed public accounts of why NVIDIA invests so heavily in open foundation models even while being a top customer of closed labs. He recommends Claude and OpenAI by name for production coding work, and says 100 percent of NVIDIA engineers are now agentically supported. The open-weights case rests on three legs. First, language is the codification of intelligence, and there are at least 230 languages that no commercial lab will ever prioritize. Nemotron is built near frontier and released so any country or community can fine-tune it. Second, the same representation-learning approach has to be replicated in domains where the data is not internet text, so NVIDIA seeded BioNeMo for biology, Alphamayo for autonomy, Groot for humanoid robotics and a climate model for mesoscale multiphysics. The economics of those fields would never produce a foundation model on their own. Third, safety and security require transparency. A black box cannot be defended or audited, and the future of cyber defense is not bigger-model-versus-bigger-model but swarms of cheap fast small models like Nemotron Nano surrounding the threat.

MFU is the wrong metric, tokens per watt is closer

A student raises the leaked memo that the xAI Memphis cluster is running at 11 percent Model Flops Utilization. Huang flips the framing. He says he would rather be at low MFU all the time, because that means he over-provisioned flops, memory bandwidth, memory capacity and network capacity. Bottlenecks shift constantly, so over-provisioning across every dimension is what lets the system absorb a spike without getting pinned by Amdahl’s law. In disaggregated inference, where prefill and decode are physically separated and decode is bandwidth-bound rather than flop-bound, NVLink72 can deliver extremely high tokens per watt while reporting very low MFU. Huang argues the right framing is performance, and ultimately tokens per watt as a rough proxy for intelligence per watt, adjusted for the fact that not all tokens are equal. A coding token is worth more than a generic token.

Hopper, Grace Blackwell NVLink72, Vera Rubin, Feynman

Huang gives the clearest public framing of NVIDIA’s roadmap as a sequence of architectural answers to evolving compute patterns. Hopper was built for pre-training, at a moment when NVIDIA chose to build multi-billion-dollar machines while the largest scientific supercomputer in the world cost $350 million and the marketplace for such systems was, on paper, zero. Grace Blackwell NVLink72 was the answer to inference and reasoning: a rack-scale computer that ganged 72 GPUs together because decode needs aggregate memory bandwidth far beyond a single chip. The generation-over-generation speed-up was 50x in two years, twenty-five times what Moore’s Law would have delivered. Vera Rubin is being built explicitly for agents. Agents load long-term memory from storage that has to be wired directly into the GPU fabric, they use working memory, they call tools that run on a CPU, and they wait. So the CPU has to be Vera, optimized for low-latency single-threaded code, because the multi-billion-dollar GPU system cannot afford to idle waiting on a slow tool call. Feynman extends the pattern to swarms of agents with sub-agents and sub-sub-agents, a recursive software topology that will demand its own compute pattern.

Energy demand and the grid

Huang’s energy projection is one of the most aggressive numbers in the talk. NVIDIA can compound tokens per watt by 50x per generation through codesign, but the total compute demand is heading roughly a thousand times higher, and Huang says he would not be surprised if the real figure is one or two orders of magnitude beyond that. The reason is structural: future computing is generative and continuous, not pre-recorded and on-demand. The good news, he argues, is that this is the best moment in the history of humanity to invest in sustainable generation. Market forces alone are now sufficient to fund solar, nuclear and grid upgrades. Government subsidies are no longer required to make the math work.

Adversarial countries, export controls and the telecom warning

This is the segment where Huang is visibly fired up. He attacks the GPUs-as-atomic-bombs framing on its face. NVIDIA GPUs power medical imaging, video games and soy sauce delivery. A billion people use them. He advocates them to his family. The analogy collapses at the first comparison. He attacks the second framing, that American companies should not compete abroad because they will lose anyway, as a self-fulfilling defeat. Competition makes the company better. The third framing, that depriving the rest of the world of general-purpose computing benefits the United States, also fails on first principles: it benefits one or two American companies at the cost of an entire industry. The cautionary parallel is telecommunications. The United States once had a leading position in telecom fundamental technology and policied itself out of it. Huang’s worry, voiced explicitly to a room of CS students, is that they will graduate into a shell of a computer industry if the same path is repeated.

AI doom and rational optimism

In the same arc Huang rejects the science-fiction framing of AI as a singularity that arrives suddenly on a Wednesday at 7pm and ends civilization. He calls those claims irresponsible, says they are not true, and points out that the people advancing them are believed by audiences who then make policy on that basis. It is not true that no one understands how these systems work. It is not true that intelligence becomes infinitely powerful instantaneously. It is not true that there is no defense. His framing, which the host echoes as “rational optimism,” is that the goal is to create a future where people care about computers because the technology students are learning is worth mastering.

Stanford’s compute problem is Stanford’s fault

A student presses on the scarcity of compute for independent researchers, startups and universities inside the United States. Huang’s answer is sharp: there is no shortage. Place the order and the chips will arrive. The actual broken thing is institutional. University grants are fragmented across departments. No researcher can raise enough on a single grant to fund a billion-dollar shared cluster, and no one shares. He compares it to showing up at the grocery store demanding a billion dollars of tomatoes today. The solution is planning, aggregation and a campus-scale supercomputer, the way Stanford once built the linear accelerator. The endowment is $40 billion. Pulling a billion off it, contracting cloud capacity and giving every student and researcher AI supercomputer access is, in Huang’s view, obviously doable. When he says “it is Stanford’s fault” the host laughs, but Huang clarifies: if it is your fault you have the power to fix it.

Career, suffering and resilience

Asked how a CS student should spend the next few years, Huang pushes back on the standard “follow your passion” advice. Most people do not know what they love yet, because no one knows what they do not know. The bar of demanding joy from every working day is too high. Whatever the job is, do it as well as you can. Even as CEO of NVIDIA he says he genuinely loves about 10 percent of his work. The other 90 percent is hard and he suffers through it. He recommends suffering on purpose, because resilience is a muscle that only builds under load, and when the company, the team or the family needs that muscle, it has to already exist. Earlier in his life that meant cleaning toilets and busing tables at Denny’s. He does it today running a multi-trillion-dollar company.

The biggest mistakes

Huang separates technical mistakes from strategic mistakes. NVIDIA’s first generation of products was technically wrong in almost every way: curved surfaces instead of triangles, no Z-buffer, forward instead of inverse texture mapping, no floating point inside. The company wasted two and a half years. But the strategic genius of the recovery, the reading of the market, the conservation of resources and the reapplication of talent, is what taught him strategy. The clean strategic mistake he names is mobile. NVIDIA’s Tegra line grew to a billion dollars of revenue and then collapsed to zero when Qualcomm’s modem dominance locked NVIDIA out of the 3G to 4G transition. Huang explicitly refuses the comforting rationalization that the Tegra effort fed the Thor automotive chip (“Thor is the great great great grandson”). The original decision, he says, was a waste of time. The lesson is to think one or two clicks further about whether a market is structurally winnable before committing the company.

Forecasting under fog of war

The final substantive exchange is on forecasting. Huang’s method has four steps. Observe what is actually happening (AlexNet crushing two decades of computer vision research in one shot, GPT producing reasoning by token generation). Reason from first principles about why it works. Ask “so what” and “what next” recursively until a mental model of the future emerges. Place the company inside that future and work backwards. Crucially, expect to be partly wrong. Some outcomes will absolutely happen, some will likely happen, some might happen, and the strategy has to be robust across that distribution. The real cost of any strategic choice is the opportunity cost of the alternatives you did not take, so the discipline is to minimize that cost and maximize optionality while letting the journey itself pay for the journey.

Thoughts

The most useful thing in this conversation is the explicit architectural mapping of compute patterns to chip generations. Hopper for pre-training. Grace Blackwell NVLink72 for inference, because decode is bandwidth-bound and a single chip cannot supply it. Vera Rubin for agents, because tool calls stall multi-billion-dollar GPU systems and so the CPU has to be optimized for low-latency single-threaded code. Feynman for swarms. That sequence is not marketing. It is a falsifiable thesis about where the bottleneck moves next, and every other infrastructure company should be measuring themselves against it. If Huang is right that swarms of sub-agents are the next dominant pattern, then the design pressure shifts from raw flops to fabric topology, memory hierarchy and storage-to-GPU latency. That has implications for everyone downstream, including the hyperscalers building competing accelerators.

The MFU section is the most intellectually generous moment in the talk. The instinct in the AI ops community has been to chase MFU as if it were a virtue. Huang argues, persuasively, that low MFU is consistent with high tokens per watt in a disaggregated inference setup, and that bottlenecks rotate fast enough that over-provisioning every resource is the rational design. That reframing matters because it changes what “scarce” means. Compute is not scarce in the way the discourse treats it. What is scarce is a coherent system designed end-to-end. The xAI 11 percent number, in that frame, is not embarrassing. It is the natural reading of a workload that is mostly decode.

The Stanford segment is the part most likely to be quoted out of context. “It’s Stanford’s fault” is a deliberately provocative line, but the underlying claim is correct and load-bearing. Compute is not gated by NVIDIA refusing to ship chips. It is gated by the fact that fragmented grant funding cannot aggregate into the billion-dollar order that NVIDIA can fulfill. The implication is that universities and national labs need a structural change in how they pool capital for compute, and that the current model of every researcher buying a handful of cards is genuinely obsolete. Huang’s nudge about pulling a billion off the endowment is concrete enough to be acted on, and other major research universities should read this segment as a direct prompt.

The geopolitical segment is the highest-stakes one. The telecommunications comparison is correct as a historical pattern, and Huang is one of the very few executives in a position to deliver that warning credibly. The unresolved tension is that the argument applies symmetrically. If American AI dominance is built by selling globally, that includes selling into adversarial states, and the policy question is where the line falls. Huang does not answer that question. He attacks the framing that lets the question be answered badly. That is a meaningful contribution to the discourse even if it does not resolve the underlying tradeoff.

The career advice section is the part the social-media clips will mishandle. “Seek suffering” reads as macho when extracted. In context it is a specific operational claim about how resilience compounds, and it is paired with the Tegra story where Huang himself paid the price of not thinking one more click ahead. That kind of self-implication is rare in CEO talks, and it is the reason the talk is worth listening to in full rather than only reading the recap.

Watch the full Stanford CS153 Frontier Systems conversation with Jensen Huang here.
May 13, 2026
How GPT-5, Claude, and Gemini Are Actually Trained and Served: The Real Math Behind Frontier AI Infrastructure
Reiner Pope, CEO of MatX and former TPU architect at Google, sat down with Dwarkesh Patel for a different kind of episode: a chalk-and-blackboard lecture on how frontier LLMs like GPT-5, Claude, and Gemini are actually trained and served. With nothing but a handful of equations and public API prices, Reiner reverse engineers an astonishing amount of what the labs are doing. If you have ever wondered why Fast Mode costs more, why context length stalls around 200k tokens, why models seem 100x over-trained, or why hyperscalers are pouring half a trillion dollars into memory, this is the most lucid explanation on the internet.

TLDW

Frontier LLM economics come down to two simple budgets: compute time and memory time. Once you write the rooflines on a blackboard, almost everything else falls out of them. Optimal batch size is roughly 300 times your sparsity ratio (around 2,000 to 3,000 tokens for a DeepSeek-style model). A new batch “train” departs every 20 milliseconds because that is how long it takes to read HBM end to end. Mixture of experts strongly favors staying inside a single rack, which is why scale-up domains went from 8 GPUs (Hopper) to 72 (Blackwell) to 500-plus (Rubin). Pipeline parallelism solves weight capacity but does nothing for KV cache, and adds painful per-hop latency, which is why Ilya famously said pipelining is not wise. Because of reinforcement learning and inference economics, frontier models are roughly 100x over-trained versus Chinchilla optimal, and a well-tuned model should output roughly as many tokens during deployment as went into its pre-training corpus. API prices leak the rest: Gemini’s 50% premium above 200k tokens reveals where KV memory time crosses weight memory time, prefill being 5x cheaper than decode confirms decode is memory bandwidth bound, and cache hit pricing tiers map directly to HBM, DDR, flash, and (yes) spinning disk. The lecture closes on a beautiful detour about the convergent evolution of neural nets and cryptographic ciphers.

Key Takeaways
- Two equations explain almost everything. A roofline analysis comparing compute time to memory fetch time predicts cost, latency, and architectural choices with shocking accuracy.
- Optimal batch size is about 300 times sparsity. For a DeepSeek model that activates 32 of 256 experts, that lands around 2,000 to 3,000 tokens per batch. Real deployments go a bit higher to leave headroom.
- The 20 millisecond train. A new batch departs every 20ms because that is how long it takes to read all of HBM once. Worst-case queue latency is roughly 40ms.
- Fast Mode is just smaller batches. Pay 6x more, get 2.5x faster decode by amortizing weights over fewer users. There is a hard latency floor at the HBM read time.
- Slow Mode would not save much. Once you are past the optimal batch size, the cost-per-token plateau is dominated by compute, not weight fetches. You cannot meaningfully amortize KV cache because it is unique per sequence.
- One rack is the natural MoE unit. Expert parallelism wants all-to-all communication, which strongly favors the scale-up network (NVLink) over the scale-out network (roughly 8x slower).
- Bigger scale-up domains drove model scaling. The jump from 8 (Hopper) to 72 (Blackwell) to 500-plus (Rubin) GPUs per rack increased aggregate memory bandwidth by 8x, which is why trillion-plus parameter models only became viable recently.
- Pipeline parallelism is overrated for inference. It saves on weight memory capacity but does nothing for KV cache memory. It also adds milliseconds of latency per hop in decode.
- Why Ilya said pipelining is not wise. Architectural constraints (cross-layer residuals like in Kimi) and the inability to amortize weight loads across micro-batches make pipelining a hassle in training too.
- The memory wall is real and paradoxical. Hyperscalers reportedly spend 50% of CapEx on memory, yet racks have far more HBM than a trillion-parameter model needs. The capacity is there for KV cache and batch size, not for weights.
- Frontier models are roughly 100x over-trained vs Chinchilla. When you minimize total cost across pre-training plus RL plus inference, smaller models trained on more data win.
- Each model should output roughly all human knowledge. If you equalize pre-training and inference compute, the total tokens served by a model during its lifetime should approximate its training corpus. Roughly 150 trillion in, 150 trillion out.
- API pricing reveals architecture. Gemini’s 50% premium above 200k context, the 5x decode-vs-prefill ratio, and cache duration tiers all leak detailed information about KV size, memory bottlenecks, and storage hierarchy.
- KV cache is roughly 2KB per token. Solving Gemini’s pricing equation gives a plausible 1.6 to 2 kilobytes per token at 100B active parameters and 200k context.
- Decode is memory bandwidth bound, prefill is compute bound. The 5x price gap is direct evidence.
- Cache pricing maps to memory tiers. The 5-minute and 1-hour cache durations probably correspond to flash and spinning disk drain times respectively. LLM serving uses spinning disk.
- Context length is stuck near 200k. Memory bandwidth, not compute, is the binding constraint. Sparse attention gives a square-root improvement but is not infinite.
- Cryptography and neural nets are mathematical cousins. Both rely on jumbling information across inputs. Feistel ciphers led directly to RevNets (reversible neural networks). Adversarial attacks mirror the cipher avalanche property.
Detailed Summary

The Roofline: Compute Time vs Memory Time

Reiner starts with the simplest possible model of LLM inference. The time to do a forward pass is bounded below by the maximum of compute time and memory fetch time. Compute time is the batch size times active parameters divided by FLOPs. Memory time is total parameters divided by memory bandwidth, plus a KV cache term that scales with batch size and context length. From these two equations, almost every economic and architectural fact about modern LLMs can be derived.

Plotting cost per token against batch size gives a clean picture: at low batch you pay enormous overhead because you cannot amortize the weight fetches, and at high batch you hit a compute floor. There is a sweet spot where memory bandwidth time equals compute time. That sweet spot is what Fast Mode and Slow Mode are tuning around.

Why Fast Mode Costs More: The Batch Trade-Off

When Claude Code or Codex offers Fast Mode at 6x the price for 2.5x the speed, what is really happening is that they are running you at a smaller batch size. Smaller batch means weight loads are amortized over fewer users, so cost per token goes up. But latency goes down because each forward pass touches less data. There is a hard floor on latency because you have to read every byte of HBM at least once per token, and that takes about 20 milliseconds on Blackwell-class hardware. There is also a soft ceiling on Slow Mode savings because the unamortizable parts (KV cache fetches, compute) eventually dominate.

The 20 Millisecond Train

HBM capacity divided by HBM bandwidth lands consistently around 20 milliseconds across generations of Nvidia hardware. That is the natural cadence at which a frontier model can run a forward pass over all its weights. Reiner uses a memorable analogy: a train departs every 20 milliseconds. Any users whose requests are ready board the train. If the train is full, they wait. If it is empty, it leaves anyway. This is why you do not need millions of concurrent users to saturate a model’s batch. You only need enough to fill a 2,000-token train every 20ms.

Why Optimal Batch Size Is About 300 Times Sparsity

Setting compute time equal to weight fetch time and rearranging gives a beautiful result: batch size needs to be greater than (FLOPs / memory bandwidth) times (total params / active params). The hardware ratio is a dimensionless 300 on most GPUs and has stayed remarkably stable from A100 through Hopper, Blackwell, and Rubin. The model term is just the sparsity ratio. For DeepSeek with 32 of 256 experts active, that is 8. So optimal batch is around 2,400 tokens. Real deployments push this to 3x to leave headroom for non-ideal efficiency. At 64 trains per second, that is roughly 128,000 tokens per second per replica, or about 1/1000 of Gemini’s reported global throughput.

Mixture of Experts Wants to Live Inside a Rack

MoE all-to-all routing means every token can be sent to any expert on any GPU. The communication pattern strongly prefers the fast scale-up network (NVLink) inside a rack to the slower scale-out network between racks. Scale-out is roughly 8x slower in bandwidth. This is why one rack ends up being the natural unit for an expert layer, and why Nvidia’s progression from 8 GPUs per rack (Hopper) to 72 (Blackwell) to 500-plus (Rubin) has been such a big deal for model size scaling.

Reiner walks through the physical constraints: cable density, bend radius, weight, power, cooling. Modern racks are pushing every dimension to the limit. Stuffing more GPUs into the scale-up domain is genuinely a hardware engineering problem.

Pipeline Parallelism: Why Ilya Said It Is Not Wise

Pipelining splits model layers across racks. It is the natural way to scale beyond the scale-up domain for very large models. But it has problems. In inference, pipelining does not save runtime, it only saves memory capacity per rack, which already is not the binding constraint because trillion-parameter models only need a terabyte and racks have 10x that. In training, pipelining creates the famous bubble (idle GPU time at the start and end of each pipeline pass) and forces micro-batching, which kills your ability to amortize weight loads across the global batch.

There is also an architectural cost. Models like Kimi use cross-layer residual connections where attention attends to layers a few back, and pipelining makes those patterns very hard to implement cleanly. Ilya’s quip “as we now know, pipelining is not wise” captures all of this.

The Memory Wall Paradox

Industry analysts report that hyperscalers are spending 50% of CapEx on memory this year, while smartphones and laptops are seeing 30% volume drops because there is not enough HBM and DDR to go around. Yet a Blackwell rack already has tens of terabytes of HBM, far more than a trillion-parameter model needs. The reason is that all that extra capacity goes to KV cache, batch size, and longer context. The bandwidth, not the capacity, is what matters most for weight loading. This also implies that hardware could be designed with less HBM per GPU if you commit to pipelining the weights, which is a real architectural option for a chip startup like MatX.

Reinforcement Learning and the 100x Over-Training of Frontier Models

Chinchilla scaling laws say a model with N active parameters should be trained on roughly 20N tokens for compute-optimal training. But frontier labs do not just minimize training cost. They minimize training plus inference cost across the model’s deployment lifetime. With reinforcement learning added to the mix, the cost equation has three terms: pre-training (6 times active params times tokens), RL (somewhere between 2x and 6x times active params times RL tokens, with a 30% efficiency penalty for decode-heavy rollouts), and inference (2 times active params times inference tokens).

If you assume those three roughly equalize at the optimum (a heuristic that holds for many cost curves), you get a clean conclusion: the data going into pre-training should be roughly equal to the data going into RL, which should be roughly equal to the tokens served at inference. With 100 billion active parameters and roughly 150 trillion training tokens, that is about 75x past Chinchilla optimal. Reiner rounds it to 100x. This is the most concrete first-principles argument for why frontier models are so deeply over-trained, and it implies that as inference traffic grows, models should keep getting smaller and longer-trained.

Each Model Should Output All of Human Knowledge

The most jaw-dropping consequence: if you equalize pre-training and inference compute, then the total tokens generated by a model across its deployment lifetime should approximate the size of its training corpus. GPT-5, served to hundreds of millions of users for two months, will collectively output something on the order of 150 trillion tokens. That is roughly the sum of human knowledge in textual form. Each frontier model is, in this sense, a one-shot universal author of a corpus the size of its source material.

API Prices Leak Architecture

This is where the lecture gets really fun. Gemini 3.1 charges 50% more for context above 200k tokens. Setting memory time equal to compute time at exactly 200k context and solving for KV cache size gives roughly 1.6 to 2 kilobytes per token, which is plausible for a model with 8 KV heads, dense attention, and head dimension of 128.

The 5x premium for output (decode) tokens versus input (prefill) tokens is direct evidence that decode is severely memory bandwidth bound and prefill is compute bound. Prefill processes many tokens per weight load, so it amortizes memory cost over the whole sequence. Decode processes one token per weight load, so it pays full memory cost every time.

Cache hits priced at one tenth of cache misses tell you that storing the KV cache in HBM (or DDR or flash) is much cheaper than recomputing it from scratch. The two cache duration tiers (5 minutes and 1 hour) probably correspond to memory tiers whose drain times match those durations: flash for the 5-minute tier, spinning disk for the 1-hour tier. Yes, spinning disk is in the modern LLM serving stack, despite being decades-old technology.

Why Context Length Has Plateaued at 200k

Context lengths shot up from 8k to roughly 200k during the GPT-3 to GPT-4 era and have stayed roughly flat for the past two years. Reiner argues this is the natural balance point where memory bandwidth cost crosses compute cost. Going to a million tokens is expensive. Going to 100 million tokens (which Dario has hinted is needed for true continual learning via in-context learning) is essentially impossible without either a memory technology breakthrough or a much more aggressive sparse attention scheme. Sparse attention helps with a square-root improvement, but it is not unlimited. Going too sparse trades off too much quality.

Cryptography Meets Neural Nets

The episode ends with a lovely intellectual detour. Cryptographic protocols and transformer architectures both rely on jumbling information across all inputs. They are doing inverse versions of the same operation: ciphers take structured input and produce randomness, while neural nets take noisy input and extract structure. Both fields use differentiation as their primary attack vector (differential cryptanalysis on ciphers, gradient descent on neural nets). Adversarial attacks on image classifiers exploit exactly the avalanche property that good ciphers are designed for.

The most concrete crossover: Feistel ciphers, which let you build invertible functions out of non-invertible ones, were ported into deep learning as RevNets (reversible networks) in 2017. RevNets let you run the entire network backwards during the backward pass, eliminating the need to store activations and dramatically reducing training memory footprint. It is the opposite trade-off of KV caching: spending compute to save memory rather than spending memory to save compute.

Thoughts

The most striking thing about this episode is how much can be deduced from a few equations and the public API price sheets of the major labs. The labs treat their architectures as trade secrets, but the moment they price tokens to be close to cost (which competition forces them to do), the prices themselves leak the underlying ratios. Anyone with a pen and paper can reverse engineer the KV cache size, the memory tier hierarchy, and the compute-vs-memory bottleneck profile of a frontier model. There is a lesson here for builders: in competitive markets, the prices tell you almost everything.

The 100x over-training result has interesting implications for what comes next. If the optimal balance shifts further toward inference (as adoption keeps growing), models should get smaller and longer-trained. That is good news for serving costs and bad news for training-compute-as-moat. The biggest determinant of model quality might increasingly be data quality and RL environment design, not raw pre-training compute. This squares with what is visible publicly: the leading labs are investing heavily in RL infrastructure, evaluations, and synthetic data pipelines.

The memory wall is the most underrated infrastructure story in AI. Most people think of compute as the bottleneck, but Reiner makes it clear that memory bandwidth is what actually limits context length, which limits how agentic a model can be in practice. If you cannot get to 100 million token contexts, you probably cannot have an AI agent that has been working with you for a month and remembers everything. Either some sparse attention scheme has to give us cheap effective context length, or we need a memory hardware breakthrough, or we have to invent some form of continual learning that does not rely on context windows. None of those paths are obviously easy, and the fact that context length has been flat for two years despite enormous investment suggests we are stuck against a real wall.

The cryptography parallel is the kind of cross-disciplinary insight that does not show up enough in AI discourse. Treating neural networks as a kind of differentiable cipher reframes a lot of the architecture choices (residual connections, layer normalization, attention) as deliberate efforts to make the function smooth and invertible enough to learn, in contrast to ciphers, which are deliberately designed to resist exactly that. Adversarial robustness research probably has a lot more to learn from cryptanalysis than it currently does.

Finally, the format itself is a win. Most AI podcasts are conversational, which is great for personality but bad for technical depth. A blackboard lecture with an interlocutor who asks naive questions at the right moments is a much higher bandwidth medium. More of this, please.
April 29, 2026
Jensen Huang on Lex Fridman: NVIDIA’s CEO Reveals His Vision for the AI Revolution, Scaling Laws, and Why Intelligence Is Now a Commodity

A deep breakdown of Lex Fridman Podcast #494 featuring Jensen Huang, CEO of NVIDIA, covering extreme co-design, the four AI scaling laws, CUDA’s origin story, the future of programming, AGI timelines, and what it takes to lead the world’s most valuable company.

TLDW (Too Long, Didn’t Watch)

Jensen Huang sat down with Lex Fridman for a sprawling two-and-a-half-hour conversation covering the full arc of NVIDIA’s evolution from a GPU gaming company to the engine of the AI revolution. Jensen explains how NVIDIA now thinks in terms of rack-scale and pod-scale computing rather than individual chips, breaks down his four AI scaling laws (pre-training, post-training, test time, and agentic), and reveals the near-existential bet the company made putting CUDA on GeForce. He shares his views on China’s tech ecosystem, his deep respect for TSMC, why he turned down the chance to become TSMC’s CEO, how Elon Musk’s systems engineering approach built Colossus in record time, and why he believes AGI already exists. He also discusses why the future of programming is really about “specification,” why intelligence is being commoditized while humanity is the true superpower, and how he manages the enormous pressure of leading a company that nations and economies depend on. His core message: do not let the democratization of intelligence cause you anxiety. Instead, let it inspire you.

Key Takeaways

1. NVIDIA No Longer Thinks in Chips. It Thinks in AI Factories.

Jensen’s mental model of what NVIDIA builds has fundamentally changed. He no longer picks up a chip to represent a new product generation. Instead, his mental model is a gigawatt-scale AI factory with power generation, cooling systems, and thousands of engineers bringing it online. The unit of computing at NVIDIA has evolved from GPU to computer to cluster to AI factory. His next mental “click” is planetary-scale computing.

2. Extreme Co-Design Is NVIDIA’s Secret Weapon

The reason NVIDIA dominates is not just better GPUs. It is the extreme co-design of the entire stack: GPU, CPU, memory, networking, switching, power, cooling, storage, software, algorithms, and applications. Jensen explains that when you distribute workloads across tens of thousands of computers and want them to go a million times faster (not just 10,000 times), every single component becomes a bottleneck. This is a restatement of Amdahl’s Law at scale. NVIDIA’s organizational structure directly reflects this co-design philosophy. Jensen has 60+ direct reports, holds no one-on-ones, and runs every meeting as a collective problem-solving session where specialists across all domains are present and contribute.

3. The Four AI Scaling Laws Are a Flywheel

Jensen outlined four distinct scaling laws that form a continuous loop:

Pre-training scaling: Larger models plus more data equals smarter AI. The industry panicked when people said data was running out, but synthetic data generation has removed that ceiling. Data is now limited by compute, not by human generation.

Post-training scaling: Fine-tuning, reinforcement learning from human feedback, and curated data continue to scale AI capabilities beyond what pre-training alone achieves.

Test-time scaling: Inference is not “easy” as many predicted. It is thinking, reasoning, planning, and search. It is far more compute-intensive than memorization and pattern matching. This is why inference chips cannot be commoditized the way many predicted.

Agentic scaling: A single AI agent can spawn sub-agents, creating teams. This is like scaling a company by hiring more employees rather than trying to make one person faster. The experiences generated by agents feed back into pre-training, creating a flywheel.

4. The CUDA Bet Nearly Killed NVIDIA

Putting CUDA on GeForce was one of the most consequential technology decisions in modern history. It increased GPU costs by roughly 50%, which crushed the company’s gross margins at a time when NVIDIA was a 35% gross margin business. The company’s market cap dropped from around $7-8 billion to approximately $1.5 billion. But Jensen understood that install base defines a computing architecture, not elegance. He pointed to x86 as proof: a less-than-elegant architecture that defeated beautifully designed RISC alternatives because of its massive install base. CUDA on GeForce put a supercomputer in the hands of every researcher, every scientist, every student. It took a decade to recover, but that install base became the foundation of the deep learning revolution.

5. NVIDIA’s Moat Is Trust, Velocity, and Install Base

Jensen was direct about NVIDIA’s competitive advantage. The CUDA install base is the number one asset. Developers target CUDA first because it reaches hundreds of millions of computers, is in every cloud, every OEM, every country, every industry. NVIDIA ships a new architecture roughly every year. No company in history has built systems of this complexity at this cadence. And the trust that NVIDIA will maintain, improve, and optimize CUDA indefinitely is something developers can count on. If someone created “GUDA” or “TUDA” tomorrow, it would not matter. The install base, velocity of execution, ecosystem breadth, and earned trust create a compounding advantage that is nearly impossible to replicate.

6. Jensen Believes AGI Is Already Here

When asked about AGI timelines, Jensen said he believes AGI has been achieved. His reasoning is practical: an agentic system today could plausibly create a web service, achieve virality, and generate a billion dollars in revenue, even if temporarily. This is not meaningfully different from many internet-era companies that did the same thing with technology no more sophisticated than what current AI agents can produce. He does not believe 100,000 agents could build another NVIDIA, but he believes a single agent-driven viral product is within reach right now.

7. The Future of Programming Is Specification, Not Syntax

Jensen believes the number of programmers in the world will increase dramatically, not decrease. His reasoning: the definition of coding is expanding to include specification and architectural description in natural language. This expands the population of “coders” from roughly 30 million professional developers to potentially a billion people. Every carpenter, plumber, accountant, and farmer who can describe what they want a computer to build is now a coder. The artistry of the future is knowing where on the spectrum of specification to operate, from highly prescriptive to exploratory and open-ended.

8. China Is the Fastest Innovating Country in the World

Jensen gave a nuanced and detailed explanation of why China’s tech ecosystem is so formidable. About 50% of the world’s AI researchers are Chinese. China’s tech industry emerged during the mobile cloud era, so it was built on modern software from the start. The country’s provincial competition creates an insane internal competitive environment. And the cultural norm of knowledge-sharing through school and family networks means China effectively operates as an open-source ecosystem at all times. This is why Chinese companies contribute disproportionately to open source. Their engineers’ brothers, friends, and schoolmates work at competing companies, and sharing knowledge is the cultural default.

9. The Power Grid Has Enormous Waste That AI Can Exploit

Jensen proposed a pragmatic solution to the energy problem for AI data centers. Power grids are designed for worst-case conditions with margin, but 99% of the time they run at around 60% of peak capacity. That idle capacity is simply wasted. Jensen wants data centers to negotiate flexible contracts where they absorb excess power most of the time and gracefully degrade during rare peak demand periods. This requires three things: customers accepting that “six nines” uptime may not always be necessary, data centers that can dynamically shift workloads, and utilities that offer tiered power delivery contracts instead of all-or-nothing commitments.

10. Jensen Turned Down the CEO Role at TSMC

In 2013, TSMC founder Morris Chang offered Jensen the chance to become CEO of TSMC. Jensen confirmed the story is true and said he was deeply honored. But he had already envisioned what NVIDIA could become and felt it was his sole responsibility to make that vision happen. He sees the relationship with TSMC as one built on three decades of trust, hundreds of billions of dollars in business, and zero formal contracts.

11. Elon Musk’s Systems Engineering Approach Is Instructive

Jensen praised Elon Musk’s approach to building the Colossus supercomputer in Memphis in just four months. He highlighted several principles: Elon questions everything relentlessly, strips every process down to the minimum necessary, is physically present at the point of action, and his personal urgency creates urgency in every supplier. Jensen drew a parallel to NVIDIA’s own “speed of light” methodology, where every process is benchmarked against the physical limits of what is possible, not against historical baselines.

12. Intelligence Is a Commodity. Humanity Is Not.

Perhaps the most philosophical takeaway from the conversation: Jensen argued that intelligence is a functional, measurable thing that is being commoditized. He surrounded himself with 60 direct reports who are all “superhuman” in their respective domains, more educated and deeper in their specialties than he is. Yet he sits in the middle orchestrating all of them. This proves that intelligence alone does not determine success. Character, compassion, grit, determination, tolerance for embarrassment, and the ability to endure suffering are the real differentiators. Jensen wants the audience to understand that the word we should elevate is not intelligence but humanity.

Detailed Summary

From GPU Maker to AI Infrastructure Company

The conversation opened with Jensen explaining NVIDIA’s evolution from chip-scale to rack-scale to pod-scale design. The Vera Rubin pod, announced at GTC, contains seven chip types, five purpose-built rack types, 40 racks, 1.2 quadrillion transistors, nearly 20,000 NVIDIA dies, over 1,100 Rubin GPUs, 60 exaflops of compute, and 10 petabytes per second of scale bandwidth. And that is just one pod. NVIDIA plans to produce roughly 200 of these pods per week.

Jensen explained that extreme co-design is necessary because the problems AI must solve no longer fit inside a single computer. When you distribute a workload across 10,000 computers but want a million-fold speedup, everything becomes a bottleneck: computation, networking, switching, memory, power, cooling. This is fundamentally an Amdahl’s Law problem at planetary scale. If computation represents only 50% of the workload, speeding it up infinitely only doubles total throughput. Every layer must be co-optimized simultaneously.

NVIDIA’s organizational structure is a direct reflection of this co-design philosophy. Jensen has more than 60 direct reports, almost all with deep engineering expertise. He does not do one-on-ones. Every meeting is a collective problem-solving session where the memory expert, the networking expert, the cooling expert, and the power delivery expert are all in the room together, attacking the same problem.

The Strategic History of CUDA

Jensen walked through the step-by-step journey from graphics accelerator to computing platform. The company invented a programmable pixel shader, then added IEEE-compatible FP32 to its shaders, then put C on top of that (called Cg), and eventually arrived at CUDA. The critical strategic decision was putting CUDA on GeForce, a consumer product.

This was nearly an existential move. It increased GPU costs by roughly 50% and consumed all of the company’s gross profit at a time when NVIDIA was a 35% gross margin business. The market cap cratered from around $7-8 billion to approximately $1.5 billion. But Jensen understood a principle that many technologists overlook: install base defines a computing architecture. x86 survived not because it was elegant but because it was everywhere. CUDA on GeForce put a supercomputing capability in the hands of every gamer, every student, every researcher who built their own PC. When the deep learning revolution arrived, CUDA was already the foundation.

How Jensen Leads and Makes Decisions

Jensen described a leadership philosophy built on continuous reasoning in public. He does not make announcements in the traditional sense. Instead, he shapes the belief systems of his employees, board, partners, and the broader industry over months and years by reasoning through decisions step by step, using every new piece of external information as a brick in the foundation. By the time he formally announces a strategic direction, the reaction is not surprise but rather, “What took you so long?”

He applies this same approach to his supply chain. He personally visits CEOs of DRAM companies, packaging companies, and infrastructure providers. He explains the dynamics of the industry, shares his vision of future demand, and helps them reason through why they should make multi-billion-dollar capital investments. Three years ago, he convinced DRAM CEOs that HBM memory would become mainstream for data centers, which sounded ridiculous at the time. Those companies had record years as a result.

Jensen’s “speed of light” methodology is his framework for decision-making. Every process, every design, every cost is benchmarked against the physical limits of what is theoretically possible. He prefers this to continuous improvement, which he views as incrementalism. He would rather strip a 74-day process back to zero and ask, “If we built this from scratch today, how long would it take?” Often the answer is six days, and the remaining 68 days are filled with accumulated compromises that can be challenged individually.

AI Scaling Laws and the Future of Compute

Jensen broke down the four scaling laws in detail. The pre-training scaling law, which depends on model size and data volume, was thought to be hitting a wall when the industry worried about running out of high-quality human-generated data. Jensen argued this concern is misplaced. Synthetic data generation has effectively removed the ceiling, and the constraint is now compute, not data.

Post-training continues to scale through fine-tuning and reinforcement learning. Test-time scaling was the most counterintuitive for the industry. Many predicted that inference would be “easy” and that inference chips would be small, cheap, and commoditized. Jensen saw this as fundamentally wrong. Inference is thinking: reasoning, planning, search, decomposing novel problems into solvable pieces. Thinking is much harder than reading, and test-time compute is intensely resource-hungry.

Agentic scaling is the newest frontier. A single AI agent can spawn sub-agents, effectively multiplying intelligence the way a company scales by hiring. The experiences and data generated by agentic systems feed back into pre-training, creating a continuous improvement loop. Jensen described this as the reason NVIDIA designed the Vera Rubin rack architecture differently from the Grace Blackwell architecture. Grace Blackwell was optimized for running large language models. Vera Rubin is designed for agents, which need to access files, use tools, do research, and spin off sub-agents. NVIDIA anticipated this architectural shift two and a half years before tools like OpenClaw arrived.

China, TSMC, and the Global Supply Chain

Jensen provided a thoughtful analysis of China’s tech ecosystem. He identified several structural advantages: 50% of the world’s AI researchers are Chinese, the tech industry was born during the mobile cloud era (making it natively modern), provincial competition creates internal Darwinian pressure, and the culture of knowledge-sharing through school and family networks makes China effectively open-source by default.

On TSMC, Jensen emphasized that the deepest misunderstanding about the company is that its technology is its only advantage. Their manufacturing orchestration system, which dynamically manages the shifting demands of hundreds of companies, is “completely miraculous.” Their culture uniquely balances bleeding-edge technology excellence with world-class customer service. And the trust that Jensen places in TSMC is extraordinary: three decades of partnership, hundreds of billions of dollars in business, and no formal contract.

Jensen also discussed the AI supply chain more broadly. NVIDIA has roughly 200 suppliers contributing technology to each rack. Jensen personally manages these relationships, flying to supplier sites, explaining industry dynamics, and helping CEOs reason through multi-billion-dollar investment decisions. When asked if supply chain bottlenecks keep him up at night, he said no, because he has already communicated what NVIDIA needs, his partners have told him what they will deliver, and he believes them.

The Energy Challenge and Space Computing

On the energy front, Jensen proposed a practical approach to the power problem. Rather than waiting for new power generation, he wants to capture the enormous waste already present in the grid. Power infrastructure is designed for worst-case peak demand, but 99% of the time it runs far below capacity. AI data centers could absorb this excess capacity with flexible contracts that allow graceful degradation during rare peak periods.

On space computing, NVIDIA already has GPUs in orbit for satellite imaging. Jensen acknowledged the cooling challenge (no conduction or convection in space, only radiation) but sees it as a future frontier worth cultivating. In the meantime, he is focused on the lower-hanging fruit of eliminating waste in the terrestrial power grid.

On AGI, Jobs, and the Human Future

Jensen stated directly that he believes AGI has been achieved, at least by the practical definition of an AI system capable of creating a billion-dollar company. He sees it as plausible that an agent could build a viral web service that briefly generates enormous revenue, just as many internet-era companies did with technology no more sophisticated than what current AI agents produce.

On jobs, Jensen was both compassionate and clear-eyed. He told the story of radiology: computer vision became superhuman around 2019-2020, and the prediction was that radiologists would disappear. Instead, the number of radiologists grew because AI allowed them to study more scans, diagnose better, and serve more patients. The purpose of the job (diagnosing disease) did not change, even though the tools changed completely.

He applied this principle broadly: the number of software engineers at NVIDIA will grow, not decline, because their purpose is solving problems, not writing lines of code. The number of programmers globally will grow because the definition of coding is expanding to include natural language specification, opening it up to potentially a billion people.

His advice to anyone worried about their job is straightforward: go use AI now. Become expert in it. Every profession, from carpenter to pharmacist to lawyer, will be elevated by AI tools. The people who learn to use AI will be the ones who get hired, promoted, and empowered.

Mortality, Succession, and Legacy

The conversation closed with deeply personal reflections. Jensen said he really does not want to die. He sees the current moment as a “once in a humanity experience.” He does not believe in traditional succession planning. Instead, he believes the best succession strategy is to pass on knowledge continuously, every single day, in every meeting, as fast as possible. His hope is to die on the job, instantaneously, with no long period of suffering.

He described a vision for a kind of digital continuity: sending a humanoid robot into space, continuously improving it in flight, and eventually uploading the consciousness derived from a lifetime of communications, decisions, and reasoning to catch up with it at the speed of light.

On the emotional experience of leading NVIDIA, Jensen was candid about hitting psychological low points regularly. His coping mechanism is decomposition: break the problem into pieces, reason about what you can control, tell someone who can help, share the burden, and then deliberately forget what is behind you. He compared this to the mental discipline of great athletes who focus only on the next point.

His final message was about the relationship between intelligence and humanity. Intelligence, he argued, is functional. It is being commoditized. Humanity, character, compassion, grit, tolerance for embarrassment, and the capacity for suffering are the true superpowers. The word society should elevate is not intelligence but humanity.

Thoughts

This is one of the most substantive CEO interviews of 2026. What makes it remarkable is not just the breadth of topics but the depth of reasoning Jensen demonstrates in real time. You can actually watch him think through problems on the spot, which is rare for someone at his level.

A few things stand out. First, the CUDA origin story is one of the great strategic narratives in tech history. The decision to absorb a 50% cost increase on a consumer product, watching your market cap collapse by 80%, and holding the course for a decade because you understood the power of install base is the kind of conviction that separates generational companies from everyone else.

Second, Jensen’s framing of the four scaling laws as a flywheel is the clearest articulation anyone has given of why AI compute demand will continue to accelerate. Most people understand pre-training. Fewer understand test-time scaling. Almost nobody is thinking about agentic scaling as a compute multiplier. Jensen has been thinking about it for years and already designed hardware for it before the software ecosystem caught up.

Third, the discussion on jobs deserves attention. The radiology example is powerful because it is a completed experiment, not a prediction. The profession that was supposed to be eliminated first by AI instead grew. The mechanism is straightforward: when you automate the task, you expand the capacity of the purpose, and demand for the purpose increases. This does not mean there will be no pain or dislocation. Jensen acknowledged that explicitly. But the historical pattern is clear.

Finally, the philosophical distinction between intelligence and humanity is the kind of framing that could genuinely help people navigate the anxiety of this moment. If you define your value by your intelligence alone, AI commoditization is terrifying. If you define your value by your character, your compassion, your tolerance for suffering, and your willingness to keep going when everything goes wrong, then AI is just the most powerful set of tools you have ever been given.

Jensen Huang is 62 years old, has been running NVIDIA for 34 years, and shows no signs of slowing down. If anything, his conviction about the future is accelerating alongside his company’s growth.

Watch the full episode: Lex Fridman Podcast #494 with Jensen Huang

March 23, 2026
How NVIDIA is Revolutionizing Computing with AI: Jensen Huang on AI Infrastructure, Digital Employees, and the Future of Data Centers

NVIDIA CEO Jensen Huang discusses the company’s role in revolutionizing computing through AI, emphasizing decade-long investments in scalable, interconnected AI infrastructure, breakthroughs in efficiency, and the future of digital and embodied AI as transformative for industries globally.

NVIDIA is transforming the landscape of computing, driving innovation at every level from data centers to digital employees. In a recent conversation with Jensen Huang, NVIDIA’s CEO, he offered a rare look at the strategic direction and long-term vision that has positioned NVIDIA as a leader in the AI revolution. Through decade-long infrastructure investments, NVIDIA is not just building hardware but creating “AI factories” that promise to impact industries globally.

Decade-Long Investments in AI Infrastructure

For NVIDIA, success has come from looking far into the future. Jensen Huang emphasized the company’s commitment to ten-year investments in scalable, efficient AI infrastructure. With an eye on exponential growth, NVIDIA has focused on creating solutions that can continue to meet demand as AI expands in complexity and scope. One of the cornerstones of this approach is NVLink technology, which enables GPUs to function as a unified supercomputer, allowing unprecedented scale for AI applications.

This vision aligns with Huang’s goal of optimizing data centers for high-performance AI, making NVIDIA’s infrastructure not only capable of tackling today’s AI challenges but prepared for tomorrow’s even larger-scale demands.

Outpacing Moore’s Law with Full-Stack Integration

Huang highlighted how NVIDIA aims to surpass the limits of traditional computing, especially Moore’s Law, by focusing on a full-stack integration strategy. This strategy involves designing hardware and software as a cohesive unit, enabling a 240x reduction in AI computation costs while increasing efficiency. With this approach, NVIDIA has managed to achieve performance improvements that far exceed conventional expectations, driving both cost and energy usage down across its AI operations.

The full-stack approach has enabled NVIDIA to continually upgrade its infrastructure and enhance performance, ensuring that each component of its architecture is optimized and aligned.

The Evolution of Data Centers: From Storage to AI Factories

One of NVIDIA’s groundbreaking shifts is the redefinition of data centers from traditional storage units to “AI factories” generating intelligence. Unlike conventional data centers focused on multi-tenant storage, NVIDIA’s new data centers produce “tokens” for AI models at an industrial scale. These tokens are used in applications across industries, from robotics to biotechnology. Huang believes that every industry will benefit from AI-generated intelligence, making this shift in data centers vital to global AI adoption.

This AI-centric infrastructure is already making waves, as seen with NVIDIA’s 100,000-GPU supercluster built for X.AI. NVIDIA demonstrated its logistical prowess by setting up this supercluster rapidly, paving the way for similar large-scale projects in the future.

The Role of AI in Science, Engineering, and Digital Employees

NVIDIA’s infrastructure investments and technological advancements have far-reaching impacts, particularly in science and engineering. Huang shared that AI-driven methods are now integral to NVIDIA’s chip design process, allowing them to explore new design options and optimize faster than human engineers alone could. This innovation is just the beginning, as Huang envisions AI reshaping fields like biotechnology, materials science, and theoretical physics, creating opportunities for breakthroughs at a previously impossible scale.

Beyond science, Huang foresees AI-driven digital employees as a major component of future workforces. AI employees could assist in roles like marketing, supply chain management, and chip design, allowing human workers to focus on higher-level tasks. This shift to digital labor marks a major milestone for AI and has the potential to redefine productivity and efficiency across industries.

Embodied AI and Real-World Applications

Huang believes that embodied AI—AI in physical form—will transform industries such as robotics and autonomous vehicles. Self-driving cars and robots equipped with AI will become more common, thanks to NVIDIA’s advancements in AI infrastructure. By training these AI models on NVIDIA’s systems, industries can integrate intelligent robots and vehicles without needing substantial changes to existing environments.

This embodied AI will serve as a bridge between digital intelligence and the physical world, enabling a new generation of applications that go beyond the screen to interact directly with people and environments.

Sustaining Innovation Through Compatibility and Software Longevity

Huang stressed that compatibility and sustainability are central to NVIDIA’s long-term vision. NVIDIA’s CUDA platform has enabled the company to build a lasting ecosystem, allowing software created on earlier NVIDIA systems to operate seamlessly on newer ones. This commitment to software longevity means companies can rely on NVIDIA’s systems for years, making it a trusted partner for businesses that prioritize innovation without disruption.

NVIDIA as the “AI Factory” of the Future

As Huang puts it, NVIDIA has evolved beyond a hardware company and is now an “AI factory”—a company that produces intelligence as a commodity. Huang sees AI as a resource as valuable as energy or raw materials, with applications across nearly every industry. From providing AI-driven insights to enabling new forms of intelligence, NVIDIA’s technology is poised to transform global markets and create value on an industrial scale.

Jensen Huang’s vision for NVIDIA is not just about staying ahead in the computing industry; it’s about redefining what computing means. NVIDIA’s investments in scalable infrastructure, software longevity, digital employees, and embodied AI represent a shift in how industries will function in the future. As Huang envisions, the company is no longer just producing chips or hardware but enabling an entire ecosystem of AI-driven innovation that will touch every aspect of modern life.

November 7, 2024