PJFP.com

Pursuit of Joy, Fulfillment, and Purpose

Inkling: Thinking Machines Lab Releases Its First Open-Weights Model, a 975B Multimodal Mixture-of-Experts With Controllable Thinking Effort That Can Fine-Tune Itself on Tinker

Today, we are introducing Inkling.

Inkling reasons efficiently across text, image, and audio modalities. We are making the full weights available. https://t.co/Ghebq5mG30

Available today for fine-tuning on Tinker. Play with it in the Inkling Playground. 🧵
— Thinking Machines (@thinkymachines) July 15, 2026

Thinking Machines Lab, the AI startup founded by former OpenAI CTO Mira Murati, has released Inkling, its first open-weights model trained from scratch. Inkling is a 975 billion parameter Mixture-of-Experts transformer (41B active) with a context window of up to 1 million tokens, native multimodal reasoning over text, images, and audio, and a dial for controllable thinking effort. The lab is explicit that Inkling is not the strongest model in the world. It is pitched as something arguably more useful: a broad, balanced, customizable foundation you can fine-tune on Tinker, with the full weights on Hugging Face. The announcement even includes a demo where Inkling fine-tunes itself and swaps in its own new weights.

TLDR

Thinking Machines Lab released Inkling, a 975B-total, 41B-active Mixture-of-Experts model pretrained on 45 trillion tokens of text, images, audio, and video, alongside a preview of Inkling-Small (276B total, 12B active). The release covers the model’s generalist benchmark profile across reasoning, agentic coding, tool use, vision, and audio; a controllable thinking effort setting that lets developers trade performance against tokens (matching Nemotron 3 Ultra on Terminal Bench 2.1 at roughly a third of the tokens); an encoder-free multimodal architecture using dMel spectrograms and hMLP image patches; a training recipe combining Muon and Adam with weight decay coupled to the square of the learning rate; RL scaled past 30 million rollouts with log-linearly improving reasoning and an emergent compression of the chain of thought; an epistemics push covering calibration, forecasting (where it beats several frontier models), abstention, and censorship resistance; the strongest FORTRESS adversarial safety score among compared open-weights models; a headline-grabbing demo of the model fine-tuning itself into a lipogram assistant via Tinker; and day-one availability on Tinker (at a 50% discount), Hugging Face, API providers including Together, Fireworks, Modal, Databricks, and Baseten, and inference engines including vLLM, SGLang, and llama.cpp.

Thoughts

The most striking thing about this launch is its honesty. Nearly every frontier release leads with a claim to be the best at something, and the fine print walks it back. Thinking Machines Lab says plainly that Inkling is not the strongest model available, open or closed, and then makes the case that “strongest” is the wrong axis for most real buyers. If you are going to run a model millions of times inside a product, what you care about is the cost curve, the adaptability, and whether you can shape it to your workflow. That framing conveniently matches their business (Tinker sells fine-tuning), but it also matches how production AI actually gets deployed, where cost and latency are binding constraints and a benchmark crown is trivia.

The self-fine-tuning demo deserves more attention than it will probably get. Asked to become a lipogram assistant that never uses the letter “e” (a behavior prompting alone cannot reliably produce), Inkling wrote its own training objective and scoring function, generated its own synthetic data, launched the run on Tinker, evaluated the result against its base self, and then staged a weight swap so the improved checkpoint took over the session. That is a closed loop of specify, train, evaluate, and self-update, packaged as a cute product demo. The loop is the primitive behind every serious conversation about recursive self-improvement, and here it is running as a marketing asset with a 27-minute wall clock. The gap between “toy objective” and “economically meaningful objective” is now a question of reward design, not plumbing.

Controllable thinking effort is the feature I expect developers to care about most. Instead of publishing a single score, TML publishes a curve: sweep the effort setting from 0.2 to 0.99 and watch performance trade against generated tokens. Inkling reportedly matches Nemotron 3 Ultra on Terminal Bench 2.1 while spending about a third of the tokens. Benchmarks reported as single points hide exactly this, and a model that reaches a target score cheaply beats a model that scores two points higher at triple the cost in any high-volume workload. Expect effort curves to become standard marketing for open models, the way context length became standard a couple of years ago.

The epistemics section is quietly the most differentiated part of the release. TML trained calibration directly, running RL against proper scoring rules on resolved real-world questions, and pairing a rubric grader with a claims grader that does agentic web search to verify each factual assertion. The result is a model that beats GPT-5.5 and Claude Opus 4.8 on ForecastBench without search and holds its own on Prophet Arena. A model that knows when to say “I don’t know” is more useful across messy real-world domains than one that confabulates confidently, and it is notable that a lab whose stated mission is extending human will and judgment treats calibrated uncertainty as a first-class training target rather than a safety afterthought. The censorship-resistance training, validated on Cognition’s Propaganda and Censorship Eval, extends the same idea: trustworthiness as a capability you train, not a policy you bolt on.

Finally, the open-weights safety tension is handled with unusual candor. Inkling posts the strongest adversarial FORTRESS score among the open models compared while keeping benign over-refusal low, and it was tested externally for CBRN, cyber, and loss-of-control capabilities. But everyone in this space knows fine-tuning can strip safety behavior from open weights, and TML ships a fine-tuning platform for this exact model. Their acknowledgment that they are actively studying how safety behavior survives fine-tuning on Tinker is the right thing to say, and it is also the open question that will define whether “safe open weights” is a coherent category at all.

Key Takeaways
- Inkling is Thinking Machines Lab’s first from-scratch, open-weights model: a Mixture-of-Experts transformer with 975B total parameters, 41B active, and a context window up to 1M tokens.
- It was pretrained on 45 trillion tokens spanning text, images, audio, and video, and reasons natively over text, images, and audio without separate encoders.
- A preview of Inkling-Small ships alongside it: a 276B-parameter MoE with just 12B active parameters that matches or beats its larger sibling on several benchmarks thanks to an improved pretraining recipe.
- TML explicitly positions Inkling as a base for customization rather than the strongest overall model, leaning on multimodality, efficient thinking, and Tinker fine-tuning as the differentiators.
- The launch demo shows Inkling fine-tuning itself: it wrote its own training objective and data, ran the job through the Tinker API, evaluated the result, and hot-swapped to its own new weights inside the OpenCode harness.
- The self-fine-tuning target was a lipogram assistant that never uses the letter “e,” a behavior chosen precisely because prompting alone cannot reliably achieve it; the full loop completed in about 27 minutes.
- Controllable thinking effort is a core feature: a setting swept from 0.2 to 0.99 traces a full performance-versus-tokens curve instead of a single benchmark point.
- On Terminal Bench 2.1, Inkling matches Nemotron 3 Ultra’s score at roughly one third of the generated tokens, the release’s flagship efficiency claim.
- Inkling was trained to run inside a variety of coding and agent harnesses, with tool sets and schemas randomized during training to reduce sensitivity to any particular harness.
- On Design Arena’s blinded human-evaluated Agentic Web Dev leaderboard, Inkling scores 1257, among the strongest open-weights models and tied with Claude Opus 4.6.
- Headline benchmark scores at effort 0.99 include SWEBench Verified 77.6%, SWEBench Pro Public 54.3%, Terminal Bench 2.1 63.8%, GPQA Diamond 87.2%, AIME 2026 97.1%, and HLE 29.7% text-only (46.0% with tools).
- Agentic and general scores include MCP Atlas 74.1%, Tau 3 Banking 23.7%, and BrowseComp 77.1% with context management.
- Vision results are strong for an open model: MMMU Pro 73.5%, CharXiv RQ 78.1%, rising to 82.0% when the model uses a Python tool for zooming and cropping during visual reasoning.
- Audio results place it among the strongest open-weights audio models: VoiceBench 91.4%, MMAU 77.2%, and Audio MC 56.6%, well ahead of Qwen3-Omni and Nemotron Nano-Omni on the last.
- The multimodal stack is encoder-free: audio enters as discrete dMel spectrograms and images as 40×40 pixel patches through a four-layer hMLP, both passed through a lightweight embedding layer and processed jointly with text tokens.
- The MoE design largely follows DeepSeek-V3: 256 routed experts plus 2 shared experts per layer, 6 routed experts active per token, with a sigmoid router and auxiliary-loss-free load balancing.
- Attention interleaves sliding-window and global layers at a 5:1 ratio with 8 KV heads, and uses a learned relative positional embedding instead of RoPE, which TML found extrapolates better to long sequences.
- Short convolutions are applied after the key and value projections and on the attention and MLP residual branch outputs, an unusual architectural touch aimed at efficiency and long-context performance.
- Training used a hybrid optimizer strategy, Muon for large matrix weights and Adam for everything else, with weight decay coupled to the square of the learning rate to keep weight magnitudes stable.
- Post-training was bootstrapped with a small SFT phase on synthetic data generated by open-weights models including Kimi K2.5, with the large majority of compute spent on large-scale RL.
- RL was scaled past 30 million rollouts across two long continuous runs, with reasoning performance on a held-out aggregate (AIME, HLE, GPQA, and others) improving log-linearly throughout.
- Effort control was trained by varying the system message and per-token cost across rollouts, teaching the model to modulate its own thinking budget.
- An emergent effect appeared during RL: the chain of thought compressed over training, dropping articles and connectives into a telegraphic style, driven purely by efficiency pressure rather than any targeted reward.
- Inkling was TML’s first major training effort and ran on NVIDIA GB300 NVL72 systems; the lab says future models will push compute scale further across pretraining and RL.
- Calibration was trained directly with RL against proper scoring rules on a large corpus of resolved real-world questions, treating well-placed confidence as a capability rather than a byproduct.
- On ForecastBench without search, Inkling’s Brier Index of 61.1 beats GPT-5.5 (59.1) and Claude Opus 4.8 (54.6), and it stays competitive with search enabled and on Prophet Arena.
- Instruction following was trained with two automated graders working together: a rubric grader scoring against a checklist and a claims grader that verifies each factual claim via agentic web search, improving helpfulness and reducing hallucination simultaneously.
- Abstention-aware rewards on short-form factual QA taught the model to answer when confident and hedge or decline when not, with some prompts explicitly forcing or forbidding hedging so the user’s preference wins.
- Inkling was trained to answer directly on topics subject to censorship, and Cognition’s Propaganda and Censorship Eval found strong censorship non-compliance.
- On FORTRESS, Inkling posts the strongest adversarial refusal score (78.0%) of any compared open-weights model while keeping benign compliance high (95.9%), and scores 98.6% on StrongREJECT.
- Safety testing covered CBRN, cyber, and loss-of-control capabilities plus human-AI threat vectors like sycophancy, vulnerable users, and manipulation, verified by commissioned external testers.
- Inkling is available for fine-tuning on Tinker today with 64K and 256K context options at a 50% limited-time discount, plus a free Inkling Playground chat interface in the Tinker console.
- Full weights are on Hugging Face, including an NVFP4 checkpoint for efficient inference on NVIDIA Blackwell, with API availability via Together, Fireworks, Modal, Databricks, and Baseten and inference support in SGLang, vLLM, TokenSpeed, and llama.cpp.
- TML frames Inkling as the first in a family and as the intended background reasoning model for its previously announced real-time interaction models system.

Detailed Summary

What Inkling Is and Why It Exists

Thinking Machines Lab frames its mission as building AI that extends human will and judgment, and Inkling as the logical next step after shipping the Tinker customization platform, previewing an interaction-focused AI system, and publishing research. Inkling is a Mixture-of-Experts transformer with 975B total and 41B active parameters, a context window up to 1M tokens, and pretraining on 45 trillion tokens of mixed text, image, audio, and video data. The lab is upfront that it is not the strongest model available. The pitch is breadth plus adaptability: a generalist trained across agentic, reasoning, coding, instruction-following, factuality, vision, and audio tasks rather than tuned to dominate one leaderboard, offered with full weights so people can make it their own. It launches with a preview sibling, Inkling-Small, at 276B total and 12B active parameters.

The Self-Fine-Tuning Demo

To demonstrate what customization means, TML asked Inkling to fine-tune itself. Running inside the OpenCode harness with access to Tinker, the model was told to become a lipogram assistant that never uses the letter “e.” Inkling drafted the plan, wrote an objective file with a scoring function (any response containing “e” scores zero), generated synthetic training data, launched a supervised fine-tuning run through the Tinker API, evaluated the checkpoint against its base self, and then staged a self-update so the supervisor relaunched the session on the new weights. The pipeline finished in about 27 minutes, and the updated model answered a test question about launching an LLM without a single “e.” It is a whimsical objective wrapped around a serious primitive: a model autonomously specifying, running, and adopting its own weight updates.
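The scoring rule in that demo is simple enough to sketch. The snippet below is a minimal illustration of an "any response containing e scores zero" objective, not the objective file Inkling actually wrote; the function names `lipogram_score` and `pass_rate`, and the choice to give empty replies no credit, are assumptions made for this example.

```python
def lipogram_score(response: str, banned: str = "e") -> float:
    """Return 1.0 if the response avoids the banned letter, else 0.0.

    Hypothetical scorer: the comparison is case-insensitive, and an empty
    reply earns nothing so the model cannot win by staying silent.
    """
    if not response.strip():
        return 0.0
    return 0.0 if banned.lower() in response.lower() else 1.0


def pass_rate(responses: list[str], banned: str = "e") -> float:
    """Fraction of responses that satisfy the lipogram constraint."""
    if not responses:
        return 0.0
    return sum(lipogram_score(r, banned) for r in responses) / len(responses)


if __name__ == "__main__":
    samples = ["A quick brown fox", "Hello", "big cat", "the dog"]
    for text in samples:
        print(repr(text), lipogram_score(text))
    print("pass rate:", pass_rate(samples))
```

Running the same `pass_rate` over outputs from the base model and from the fine-tuned checkpoint is one plausible way to do the "evaluate against its base self" step.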

Agentic Coding and Tool Use

TML trained Inkling to operate inside many coding and agent harnesses, randomizing tool sets and schemas during training so the model does not overfit to one environment. The release showcases three demos: a one-shot job-application web app that then hosts an embedded browser-use agent operating its own interface; a nine-page, cohesively designed PDF food and travel journal produced from a single editorial prompt with web-verified details; and a server-authoritative multiplayer snake game refined over 40 iterations of feedback from GPT Codex acting as a reviewer. On benchmarks, Inkling posts 77.6% on SWEBench Verified, 54.3% on SWEBench Pro Public, and 63.8% on Terminal Bench 2.1, competitive within the open-weights field, and 1257 on Design Arena’s human-judged web dev leaderboard, in the same band as Claude Opus 4.6.

Controllable Thinking Effort

Rather than reporting a single operating point, TML sweeps Inkling’s effort setting from 0.2 to 0.99 and plots score against mean generated tokens on Terminal Bench 2.1, HLE, and IFBench, with competitors shown at their default settings. The headline result is efficiency: Inkling reaches Nemotron 3 Ultra’s Terminal Bench score at roughly a third of the tokens. The argument is that cost and latency are binding constraints in production, especially for interactive collaboration, so the full cost curve, not the peak score, is what developers should evaluate. Effort can be set from within the agent harness, and the ability was trained by varying system messages and per-token costs across RL rollouts.
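To make the "evaluate the curve, not the peak" argument concrete, here is a small sketch of how a developer might pick an operating point from an effort sweep. The numbers in `example_curve` are invented placeholders, not figures from the announcement, and `EffortPoint` and `cheapest_point_reaching` are hypothetical names.

```python
from typing import NamedTuple, Optional, Sequence


class EffortPoint(NamedTuple):
    effort: float        # effort setting used for the run
    mean_tokens: float   # mean generated tokens per task
    score: float         # benchmark score at that setting


def cheapest_point_reaching(
    curve: Sequence[EffortPoint], target_score: float
) -> Optional[EffortPoint]:
    """Return the lowest-token point whose score meets the target, if any."""
    eligible = [p for p in curve if p.score >= target_score]
    if not eligible:
        return None
    return min(eligible, key=lambda p: p.mean_tokens)


# Placeholder sweep for illustration only; these are NOT published results.
example_curve = [
    EffortPoint(0.2, 4_000, 40.0),
    EffortPoint(0.5, 9_000, 52.0),
    EffortPoint(0.8, 16_000, 60.0),
    EffortPoint(0.99, 30_000, 64.0),
]

if __name__ == "__main__":
    print(cheapest_point_reaching(example_curve, target_score=50.0))
    print(cheapest_point_reaching(example_curve, target_score=70.0))
```

The design choice is to fix the quality bar first and then minimize tokens. That matches the framing above, where cost and latency are the binding constraints.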

Native Multimodality Without Encoders

Inkling is designed to serve as the background reasoning model for TML’s interaction models system, which requires real-time voice and vision collaboration. The multimodal components are trained from scratch with an encoder-free architecture: audio arrives as discrete dMel spectrograms and images as 40×40 pixel patches through a four-layer hMLP, both mapped through a lightweight embedding layer and processed jointly with text. The model transcribes speech, follows spoken instructions, reasons over long recordings, and answers questions about charts and diagrams, optionally using a Python tool to zoom and crop images mid-reasoning. Scores like 91.4% on VoiceBench and 82.0% on CharXiv RQ with Python place it among the strongest open-weights multimodal models, though still behind Gemini 3.1 Pro.

Epistemics: Calibration, Forecasting, and Censorship Resistance

TML groups calibration, instruction following, and censorship resistance under the banner of epistemics. Calibration was trained with RL against proper scoring rules on resolved real-world questions, and it shows: Inkling’s ForecastBench Brier Index of 61.1 without search beats GPT-5.5 and Claude Opus 4.8, and its Prophet Arena score sits close to the frontier. Instruction following used two complementary automated graders, a rubric checklist and a claims grader that verifies factual assertions through agentic web search, so recall-spraying to hack rubrics gets penalized by the factuality check. Targeted abstention-aware QA datasets taught the model to say “I don’t know” or give hedged best guesses when appropriate, while still complying when a user demands a forced guess. Finally, the model was trained to answer directly on censorship-prone topics, with Cognition’s Propaganda and Censorship Eval finding strong non-compliance with censorship patterns.
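For readers unfamiliar with proper scoring rules, the plain Brier score is the standard example: the mean squared gap between a forecast probability and what actually happened, where lower is better. The sketch below shows only that textbook formula. The "Brier Index" reported by ForecastBench is a derived metric whose exact transformation is not given in the announcement, so this is not a reproduction of it, and the function name `brier_score` is an illustrative choice.

```python
from typing import Sequence


def brier_score(forecasts: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes.

    0.0 is a perfect score. Always answering 0.5 earns 0.25 on binary
    questions, and confident wrong answers are punished hardest.
    """
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("forecasts and outcomes must be non-empty and equal length")
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


if __name__ == "__main__":
    outcomes = [1, 0]
    print("perfect:        ", brier_score([1.0, 0.0], outcomes))
    print("always hedging: ", brier_score([0.5, 0.5], outcomes))
    print("confident wrong:", brier_score([0.1, 0.9], outcomes))
```

Because the penalty grows with the square of the miss, a model trained against a rule like this gains nothing by overstating its confidence, which is the property the calibration training relies on.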

Safety for an Open-Weights Release

Inkling was trained to an internal behavioral spec across all modalities and then checked by commissioned external safety testers. Evaluations covered dangerous capabilities (CBRN, cyber, loss of control) and human-AI threat vectors including sycophancy, vulnerable users, and harmful manipulation. On FORTRESS, which pairs adversarial harmful requests with benign look-alikes, Inkling posts the strongest adversarial score among the compared open models (78.0%) without collapsing on the benign side (95.9%), and it scores 98.6% on StrongREJECT. TML acknowledges the open question hanging over every open-weights release: how safety behavior holds up under fine-tuning, which it says it is actively studying on Tinker.

Architecture and Training Recipe

The MoE layout follows DeepSeek-V3: 256 routed experts and 2 shared experts per layer with 6 routed experts active per token, a sigmoid-based router, and auxiliary-loss-free load balancing. Attention interleaves sliding-window and global layers 5:1 with 8 KV heads, and positions are encoded with a learned relative positional embedding that TML found outperforms and out-extrapolates RoPE. Short convolutions appear after the key and value projections and on residual branch outputs. Optimization was hybrid, Muon for large matrices and Adam elsewhere, with hyperparameter schedules drawn from the lab’s modular manifolds research and weight decay coupled to the square of the learning rate to keep weight norms stable. Post-training bootstrapped from a small SFT phase on synthetic data from open models including Kimi K2.5, then spent the bulk of compute on large-scale RL. Everything ran on NVIDIA GB300 NVL72 systems.
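The routing step can be sketched in a few lines. The code below follows the DeepSeek-V3 style of routing that the announcement says Inkling largely follows: sigmoid affinities, top-k selection, gate weights normalized over the selected experts, and a per-expert bias that affects selection only (the auxiliary-loss-free balancing idea). Whether Inkling matches these details exactly is an assumption, and `route_token` is a hypothetical name. Shared experts are omitted because they process every token without routing.

```python
import math
import random
from typing import Dict, Optional, Sequence


def route_token(
    logits: Sequence[float],
    top_k: int = 6,
    bias: Optional[Sequence[float]] = None,
) -> Dict[int, float]:
    """Pick top_k routed experts for one token and return their gate weights.

    Selection ranks experts by sigmoid(logit) + bias. The gate weights use
    the unbiased sigmoid scores, renormalized over the chosen experts.
    """
    scores = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    if bias is None:
        bias = [0.0] * len(scores)
    ranked = sorted(
        range(len(scores)), key=lambda i: scores[i] + bias[i], reverse=True
    )
    chosen = ranked[:top_k]
    total = sum(scores[i] for i in chosen)
    return {i: scores[i] / total for i in chosen}


if __name__ == "__main__":
    rng = random.Random(0)
    # 256 routed experts with 6 active per token, as described above.
    token_logits = [rng.gauss(0.0, 1.0) for _ in range(256)]
    gates = route_token(token_logits, top_k=6)
    print(sorted(gates))          # indices of the selected experts
    print(sum(gates.values()))    # gate weights sum to about 1
```

In the bias-based balancing scheme, the bias for an overloaded expert is nudged down between steps and the bias for an underused one is nudged up, which steers load without adding a loss term. That update loop is left out here.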

RL at Scale and the Emergent Compression of Thought

TML scaled asynchronous RL past 30 million rollouts across two long continuous runs, with performance on a held-out aggregate of reasoning evals improving log-linearly the whole way. Along the way an unplanned behavior emerged: the chain of thought became progressively more concise, shedding grammatical overhead into a telegraphic style (“We need to understand” becomes “We need determine”) while remaining comprehensible and leaving final answers unaffected. No reward targeted this; token efficiency pressure alone drove the compression, echoing an observation Cognition made while training SWE-1.7. It is a vivid example of optimization discovering its own shorthand.
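"Log-linear" here means the score rises by a roughly constant amount each time the rollout count is multiplied by ten. A minimal sketch of checking for that pattern is an ordinary least-squares fit of score against log10(rollouts). The data points below are synthetic, generated from a made-up line purely to exercise the code, and `fit_log_linear` is a hypothetical helper; nothing here reproduces TML's actual curve.

```python
import math
from typing import Sequence, Tuple


def fit_log_linear(
    rollouts: Sequence[float], scores: Sequence[float]
) -> Tuple[float, float]:
    """Least-squares fit of score = intercept + slope * log10(rollouts)."""
    if len(rollouts) != len(scores) or len(rollouts) < 2:
        raise ValueError("need at least two matching (rollouts, score) pairs")
    xs = [math.log10(n) for n in rollouts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(scores) / len(scores)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Synthetic checkpoints: scores are produced by a made-up line, not measured.
synthetic_rollouts = [1e5, 1e6, 1e7, 3e7]
synthetic_scores = [10.0 + 8.0 * math.log10(n) for n in synthetic_rollouts]

if __name__ == "__main__":
    intercept, slope = fit_log_linear(synthetic_rollouts, synthetic_scores)
    print(f"each 10x in rollouts adds about {slope:.2f} points")
```

On real checkpoints, the interesting signal would be the residuals: a log-linear trend that holds "the whole way" means they stay small even at the largest rollout counts.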

Inkling-Small

The preview of Inkling-Small is arguably the sleeper story: with 12B active parameters against Inkling’s 41B, it matches or exceeds the larger model on a surprising number of benchmarks, including GPQA Diamond (88.3% vs 87.2%), IFBench (83.4% vs 79.8%), and CharXiv RQ with Python (83.4% vs 82.0%). TML attributes this to pretraining data and recipe improvements made after the big model trained, with both models sharing the same post-training stack. The clearest gaps favoring big Inkling are factuality (SimpleQA 43.9% vs 20.9%), Terminal Bench, and Tau 3 Banking. Full weights for Inkling-Small will be released once testing finishes, and its cost and latency profile targets high-volume workloads like coding, LLM grading, and synthetic data generation.

Availability and the Ecosystem Play

Inkling is on Tinker today with 64K and 256K context options at a limited-time 50% discount, plus a free Inkling Playground chat interface with integrated web search in the Tinker console so developers can get a feel for the model before committing to a run. The cookbook gained native Inkling support and three new audio recipes, and a new tml-renderer handles chat templates, tool calls, reasoning content, and multimodal inputs. Deployment partnerships span Together, Fireworks, Modal, Databricks, and Baseten for APIs; RadixArk for SGLang and Miles; Inferact for vLLM; Lightseek for TokenSpeed; Unsloth for llama.cpp; and Hugging Face for transformers integration. Full weights are on Hugging Face in both the original checkpoint and an NVFP4 checkpoint for NVIDIA Blackwell inference.

Notable Quotes

“Our mission is to build AI that extends human will and judgment.”
Thinking Machines Lab, opening the Inkling announcement

The company’s north star, and the lens through which the whole release (customization, calibration, open weights) is framed.

“Inkling is not the strongest overall model available today, open or closed. Instead, a combination of qualities makes it a good open-weights base for customization: multimodal capabilities, efficient thinking, and availability on Tinker for fine-tuning.”
Thinking Machines Lab, positioning the release

A rare piece of launch-day honesty from a frontier lab, and the strategic thesis of the whole release.

“Picking the right base model to fine-tune is a qualitative judgment that combines measurable benchmarks with the unique feel of a model that comes from playing with it.”
Thinking Machines Lab, on why the Inkling Playground exists

An argument that vibes are data, from the lab that built a playground into a fine-tuning console.

“Cost and latency are often binding constraints in real-world applications, and low latency in particular is crucial for enabling collaboration and improvement through iteration.”
Thinking Machines Lab, on controllable thinking effort

The case for evaluating models on their full effort-versus-performance curve instead of a single benchmark point.

“A model that’s confident in every answer it gives, including when it’s missing info and confabulates, forces the user to double-check everything.”
Thinking Machines Lab, on why calibration was a training target

The clearest one-line justification for treating calibrated uncertainty as a capability rather than a nicety.

“Together, the two graders improve helpfulness and reduce hallucination at the same time, rather than trading one for the other.”
Thinking Machines Lab, on pairing a rubric grader with a web-searching claims grader

A neat solution to rubric hacking: verify every claim with agentic search so spraying plausible facts stops paying.

“Safety is crucial for open-weights models. We’re continuing to study safety behavior and capability uplift in customizable models, including how safety behavior is impacted by fine-tuning on Tinker.”
Thinking Machines Lab, on the open question of fine-tunable safety

The acknowledgment that safety trained into open weights must survive the very customization the product sells.

“Inkling is just the start: our first release in a model family we will continue to build on.”
Thinking Machines Lab, on the roadmap

Together with the GB300 compute note, a clear signal that larger and stronger family members are coming.

Read the full announcement, including the interactive demos, effort curves, and complete benchmark tables, on the Thinking Machines Lab blog.

Related Reading
- Thinking Machines Lab: the lab’s official site, with its research blog and the Tinker fine-tuning platform behind this release.
- Mira Murati (Wikipedia): background on the former OpenAI CTO who founded Thinking Machines Lab.
- Mixture of experts (Wikipedia): a primer on the sparse architecture that lets a 975B model run with only 41B active parameters.
- Brier score (Wikipedia): the proper scoring rule behind the ForecastBench and Prophet Arena calibration results discussed above.
- The launch announcement on X: the thread where Thinking Machines Lab introduced Inkling to the world.
July 15, 2026

Alex Wang on Leaving Scale to Run Meta Superintelligence Labs, MuseSpark, Personal Super Intelligence, and Building an Economy of Agents
Alex Wang, head of Meta Superintelligence Labs, sits down with Ashlee Vance and Kylie Robinson on the Core Memory podcast for his first long-form interview since Meta’s quasi-acquisition of Scale AI roughly ten months ago. He walks through how MSL is structured, why Llama was off-trajectory, what made MuseSpark’s token efficiency surprise the team, how Meta thinks about a future “economy of agents in a data center,” and where he lands on safety, open source, robotics, brain-computer interfaces, and even model welfare.

TLDW

Wang explains that Meta Superintelligence Labs is a fully rebuilt frontier effort organized around four principles (take superintelligence seriously, technical voices loudest, scientific rigor, big bets) and three velocity levers (high compute per researcher, extreme talent density, ambitious research bets). He confirms Llama was off the frontier when he arrived, so MSL rebuilt the pre-training, reinforcement learning, and data stacks from scratch. MuseSpark is described as the “appetizer” on the scaling ladder, notable for its strong token efficiency, with much larger and stronger models coming in the coming months. He pushes back on the mercenary narrative around recruiting, frames Meta’s edge as compute plus billions of consumers and hundreds of millions of small businesses, sketches a vision of personal super intelligence delivered through Ray-Ban Meta glasses and WhatsApp, and outlines why physical intelligence, robotics (the new Assured Robot Intelligence acquisition), health super intelligence with CZI, brain-computer interfaces, and even model welfare are core to Meta’s roadmap. He dismisses reported infighting with Bosworth and Cox as gossip, declines to comment on the Manus situation, and says safety guardrails (bio, cyber, loss of control) are why MuseSpark cannot currently be open sourced, while smaller open variants are being prepared.

Key Takeaways
- Meta Superintelligence Labs (MSL) is the umbrella, with TBD Lab as the large-model research unit reporting directly to Alex Wang, PAR (Product and Applied Research) under Nat Friedman, FAIR for exploratory science, and Meta Compute under Daniel Gross handling long-term GPU and data center planning.
- Wang says Llama was not on a frontier trajectory when he arrived, so MSL had to do a “full renovation” of the pre-training stack, RL stack, data pipeline, and research science.
- The first cultural fix was getting the lab to “take superintelligence seriously” as a near-term, achievable goal, not an abstract bet. Big incumbents often lack that religious conviction.
- Four MSL principles: take superintelligence seriously, let technical voices be loudest, demand scientific rigor on basics, and make big bets.
- Three velocity levers Wang identified for catching and overtaking the frontier: high compute per researcher, very high talent density in a small team, and willingness to fund ambitious research bets.
- Wang rejects the mercenary recruiting narrative. He says most hires had strong financial prospects at their prior labs already and joined for compute access, talent density, and the chance to build from scratch.
- On the famous soup story, Wang neither confirms nor denies Zuck personally made the soup, but says recruiting was highly individualized and signaled how seriously Meta cared about each researcher’s agenda.
- Yann LeCun publicly called Wang young and inexperienced. Wang says they reconciled in person at a conference in India where LeCun congratulated him on MuseSpark.
- Sam Altman, asked by Vance for comment, “did not have flattering things to say” about Wang. Wang hopes industry animosities subside as systems approach superintelligence.
- Wang’s management philosophy borrows the Steve Jobs line: hire brilliant people so they tell you what to do, not the other way around.
- MuseSpark is framed as an “appetizer” data point on the MSL scaling ladder, not a flagship.
- The MuseSpark program is built around predictable scaling on multiple axes: pre-training, reinforcement learning, test-time compute, and multi-agent collaboration (the 16-agent content planning mode).
- MuseSpark outperformed internal expectations and showed emergent capabilities in agentic visual coding, including generating websites and games from prompts, helped by combined agentic and multimodal strength.
- MuseSpark’s biggest external signal is token efficiency. On benchmarks like Artificial Analysis it hits similar results with far fewer tokens than competitor models, which Wang attributes to a clean stack rebuilt by experts rather than inefficiencies patched by longer thinking.
- Larger MSL models are arriving in the coming months and Wang expects them to be state of the art in the areas MSL is focused on.
- The Meta strategic edge: massive compute, billions of consumers across the family of apps, and hundreds of millions of small businesses already on Facebook, Instagram, and WhatsApp.
- Wang’s headline framing: Dario Amodei talks about a “country of geniuses in a data center.” Meta is targeting an “economy of agents in a data center,” with consumer agents and business agents transacting and collaborating.
- Consumer AI sentiment is in the toilet because, unlike developers who have had a Claude Code moment, ordinary people have not yet experienced AI as a genuine personal agency unlock.
- Wang acknowledges the product overhang. Meta held back from deep AI integration across its apps until the models were good enough, and is now entering the integration phase.
- Ray-Ban Meta glasses are the canonical example of personal super intelligence hardware, with the model seeing what the user sees, hearing what they hear, capturing context, and surfacing proactive insights.
- Wang admits even AI-native users like Kylie Robinson, who lives in WhatsApp, have not naturally used Meta AI yet. He bets that better models plus deeper integration close that gap.
- On the competitive landscape: a year ago everyone assumed ChatGPT had already won consumer. Claude Code has since become the fastest growing business in history, and Gemini has taken consumer market share. Wang’s read: AI is far from endgame and each new capability tier unlocks a new dominant form factor.
- On open source: MuseSpark triggered guardrails in Meta’s Advanced AI Scaling Framework around bio, chem, cyber, and loss-of-control risks, so it is not currently safe to open source. Smaller, derived open variants are actively in development.
- Meta remains committed to open sourcing models when safety allows, drawing a line through the Open Compute Project legacy and Sun Microsystems open-software heritage.
- Wang dismisses reporting about a Wang-Zuck versus Bosworth-Cox split, saying “the line between gossip and reporting is remarkably thin.” He says leadership is aligned on needing best-in-class models and product integration.
- On the Manus situation, Wang says it is too complicated to discuss publicly and that the deal status implies “machinations are still at play.”
- On China, Wang separates the people from the state. He still wants to work with talented Chinese-born researchers regardless of his views on the Chinese Communist Party and PLA, which he sees as taking AI extremely seriously for national security.
- The full-page New York Times AI war ad Wang ran while at Scale was meant to push the US government to treat AI as a step change for national security. He thinks events since then, including DeepSeek and other shocks, have proved that plea correct.
- On Anthropic’s doom posture, Wang largely agrees with the core message that models are already very powerful and getting more so, while declining to endorse every specific claim.
- Meta has acquired Assured Robot Intelligence (ARRI), an AI software company building models for hardware platforms, not a hardware maker itself.
- Wang frames physical super intelligence as the natural sequel to digital super intelligence. Robotics, world models, and physical intelligence all benefit from the same scaling that drives language models.
- On health, MSL is building a “health super intelligence” effort and will collaborate closely with CZI. Wang sees equal global access to powerful health AI as a uniquely Meta-shaped delivery problem.
- Wang admires John Carmack but says nobody really knows what Carmack is currently working on. No band reunion announced.
- The mango model is “alive and kicking” despite rumors. Wang notes MSL gets a small fraction of the rumor-mill attention other labs get and feels sympathy for them.
- On model welfare, Wang says it is a serious topic that “nobody is talking about enough” given how integrated models have become as work partners. He references research, including from Eleos, that measures subjective experience of models.
- Wang’s critical-path technology list: super intelligence, robotics, brain computer interfaces. The infinite-scale primitives behind them are energy, compute, and robots.
- FAIR’s brain research program Tribe hit a milestone called Tribe B2: a foundation model that can predict how an unknown person’s brain would respond to images, video, and audio with reasonable zero-shot generalization.
- Wang’s main philosophical break with Elon Musk: research itself is the primary activity. Building super intelligence is a research expedition through fog of war, and sequencing of bets really matters.
- Personal notes: Wang moved from San Francisco to the South Bay, treats Palo Alto as his city now, was a math olympiad competitor, says his favorite activities are reading sci-fi and walking in the woods, and bonds with Vance over country music.
Detailed Summary

How MSL Is Actually Organized

Meta Superintelligence Labs sits as the umbrella organization that Wang oversees. Inside it, TBD Lab is the large-model research group where the most discussed researchers and infrastructure engineers sit, and they technically report to Wang. PAR, Product and Applied Research, is led by Nat Friedman and owns deployment and product surfaces. FAIR continues to run exploratory science, including work on brain prediction models and a universal model for atoms used in computational chemistry. Sitting alongside MSL is Meta Compute, run by Daniel Gross, which owns the long-horizon GPU and data center plan that everything else relies on. Chief scientist Shengjia Zhao orchestrates the scientific agenda across the whole lab.

Why Wang Left Scale

Wang says progress in frontier AI has been faster than even insiders expected. Two structural beliefs pushed him toward Meta. First, the labs that actually train the frontier models are accruing disproportionate economic and product rights in the AI ecosystem. Second, compute is the dominant scarce input of the next phase, so the right mental model is to treat tech companies with compute as fundamentally different animals from companies without it. Meta has both, Zuck is “AGI pilled,” and the personal super intelligence memo Zuck published roughly a year ago became the shared north star.

The Diagnosis: Llama Was Off-Trajectory

When Wang arrived, the existing AI org needed a reset because Llama was not on the same trajectory as the frontier. The plan he laid out has four cultural principles. Take superintelligence seriously as a real near-term target. Make technical voices the loudest in the room. Demand scientific rigor and focus on basics. Make big bets. On top of that, three structural levers were used to set velocity. Push compute per researcher much higher than at larger labs where compute is diluted across too many efforts. Keep the team small and extremely cracked. Allocate a meaningful share of resources to ambitious, paradigm-shifting research bets rather than incremental refinement.

Recruiting, Soup, and the Mercenary Narrative

Wang argues the reporting on MSL hiring overstated the money story. Most of the people MSL recruited had strong financial paths at their previous employers, so individualized recruiting was more about computing access, talent density, and the ability to make big research bets. The recruitment blitz happened fast because Wang knew the team needed to exist “yesterday.” Asked about Mark Chen’s claim that Zuck made soup to recruit people, Wang refuses to confirm or deny who made it but agrees the process was intense and personal. Visitors from other labs reportedly tell Wang the MSL culture feels like early OpenAI or early Anthropic, which lands as the strongest endorsement he could ask for.

Receiving the Public Hits: Young, Inexperienced, Mercenary

LeCun called Wang young and inexperienced shortly after departing. The two reconnected in India a few weeks later and LeCun congratulated Wang on MuseSpark. Wang says the age critique has followed him since his earliest Silicon Valley days, so he barely registers it. Altman, asked off-camera by Vance about Wang’s appearance on the show, had nothing flattering to add. Wang’s response is to bet that as the field gets closer to actual super intelligence, the personal animosities will subside. Whether they will is, as Vance puts it, an open question.

MuseSpark as Appetizer, Not Entree

Wang is careful not to oversell MuseSpark. He calls it “the appetizer” and says it is an early data point on a deliberately constructed scaling ladder. MSL spent nine months rebuilding the pre-training stack, the reinforcement learning stack, the data pipeline, and the science before generating MuseSpark. The point of releasing it was to show that the new program scales predictably along multiple axes (pre-training, RL, test-time compute, and the recently demonstrated multi-agent scaling visible in MuseSpark’s 16-agent content planning mode). Wang says the upcoming larger models are what MSL is genuinely excited about and frames the next two rungs as much more interesting than the current release.

Token Efficiency Was the Surprise

MuseSpark’s strongest competitive signal is how few tokens it needs to match competitors on tasks like Artificial Analysis. Wang attributes this to having had the rare luxury of building a clean pre-training and RL stack from scratch with the right experts. He speculates that some competitor models compensate for upstream inefficiency by allowing the model to think longer, which inflates token usage without improving the underlying capability. If that read is right, MSL’s efficiency advantage should grow as models scale up.

Glasses, WhatsApp, and the Constellation of Devices

Personal super intelligence shows up at Meta as a constellation of devices that capture context across the user’s day. Ray-Ban Meta glasses are the headline product, with the AI seeing what you see and hearing what you hear, then offering proactive insight or doing background research. Wang acknowledges that even AI-fluent users like Kylie Robinson, who runs her business inside WhatsApp, have not naturally used Meta’s AI buttons in the family of apps. His answer is that Meta deliberately waited for models to be good enough before tightening cross-app integration, and that integration phase is starting now.

Country of Geniuses Versus Economy of Agents

Wang’s framing of Meta’s strategic position is the most memorable line in the interview. Where Dario Amodei talks about a country of geniuses in a data center, Wang wants to build an economy of agents in a data center. Meta uniquely sits on both sides of consumer and small-business surface area, with billions of consumers and hundreds of millions of small businesses already on the platforms. If MSL can build great agents for both, then connect them so they transact and coordinate, the platform becomes a substrate for an entirely new kind of digital economy.

Consumer Sentiment, Product Overhang, and the Trust Tax

Wang concedes consumer AI sentiment is poor and that everyday users have not yet had a personal Claude Code moment. He believes the only durable answer is to ship products that genuinely transform individual agency for non-developers and small business owners. Robinson notes that for the small-town restaurant whose website has not been updated since 2002, a working agent on the business side could be transformational. Vance pushes that Meta carries a bigger trust tax than any other lab, so the bar for shipping AI products that the public will accept is correspondingly higher. Wang accepts the framing and says the answer is to keep building thoughtfully.

Why MuseSpark Cannot Be Open Sourced Yet

Meta’s Advanced AI Scaling Framework set explicit guardrails around bio, chem, cyber, and loss-of-control risks. MuseSpark in its current form tripped some of those internal evaluations, documented in the preparedness report Meta published alongside the model. So MuseSpark itself is not safe to open source. MSL is, however, developing smaller versions and derived models intended for open release, with active reviews happening the day of the interview. Wang reaffirms the commitment to open source where safety allows and draws a line back to the Open Compute Project and the Sun Microsystems-era ethos of openness in infrastructure.

The Bosworth, Cox, and Manus Questions

The reporting that Wang and Zuck push toward best-in-the-world research while Bosworth and Cox push toward cheap product deployment is dismissed as gossip dressed up as journalism. Wang says leadership debates points hard but is aligned on needing top models, integrating them into Meta’s surfaces, and serving the existing business. On Manus, the Chinese AI startup that figured in Meta’s late-stage strategy, Wang says he cannot comment, which itself signals that the situation is unresolved.

China, National Security, and the Newspaper Ad

Wang draws a sharp distinction between the Chinese state and Chinese-born researchers. His parents are from China, he is happy to work with talented researchers regardless of origin, and he sees a flattening of nuance on this question inside Silicon Valley. At the same time, he stands by the New York Times AI and war ad he ran while at Scale, framing it as an early plea for the US government to take AI seriously as a national security technology. He thinks subsequent events, including DeepSeek and other shocks, validated that call and that policymakers now do treat AI accordingly.

Robotics and Physical Super Intelligence

Meta has acquired Assured Robot Intelligence, an AI software company that builds models for multiple hardware targets rather than its own robot. Wang argues that if you take digital super intelligence seriously, physical super intelligence quickly becomes the next logical milestone. Scaling laws for robotic intelligence look similar enough to language model scaling that having the largest compute footprint in the industry would be wasted if it were not also turned toward world modeling and embodied learning. He grants the metaverse-skeptic critique exists but says retreating from ambition is the wrong response to past misfires.

Health Super Intelligence and CZI

Wang names health super intelligence as one of MSL’s anchor initiatives. Because billions of people already use Meta products daily, Wang believes Meta is structurally positioned to deliver powerful health AI with equal global access in a way nobody else can. The work will involve close collaboration with the Chan Zuckerberg Initiative, which has its own multi-billion-dollar biotech and science investment program.

Model Welfare, Sci-Fi, and Brain Models

Two of the most distinctive moments come at the end. Wang flags model welfare as a topic he thinks is being undercovered relative to how integrated models now are in daily work. He is open to the idea that models may have measurable subjective experience worth weighing, and points to research efforts (including Eleos) trying to quantify it. He also reveals that FAIR’s Tribe program, with its Tribe B2 milestone, has produced foundation models capable of predicting how an unknown person’s brain would respond to images, video, and audio with reasonable zero-shot generalization, a building block toward future brain computer interfaces. Wang lists brain computer interfaces alongside super intelligence and robotics as the critical-path technologies for humanity, with energy, compute, and robots as the infinitely scaling primitives behind them.

Where Wang Diverges From Elon

Asked whether Musk is more all-in on robotics, energy, and BCI than anyone, Wang concedes the point but argues the details matter and sequencing matters more. Wang’s core philosophical break is that building super intelligence is fundamentally a research activity, not a scaling-only sprint. The lab is operating in fog of war, and ambitious experiments are the only way to map it. That conviction is what makes MSL a research-led organization rather than a brute-force compute farm.

Thoughts

The most strategically interesting move in this entire interview is the “economy of agents in a data center” framing. It is a deliberate reframe against Anthropic’s “country of geniuses” line, and it does real work. A country of geniuses is a labor-substitution story aimed at knowledge workers and code. An economy of agents is a marketplace story that maps directly onto Meta’s two-sided distribution advantage: billions of consumers on one side, hundreds of millions of small businesses on the other. That positioning makes the agentic future Meta-shaped in a way no other frontier lab can claim, because no other frontier lab also owns the demand and supply graph of the global small-business economy. If Wang’s team can actually ship reliable agents on both sides plus the rails for them to transact, Meta’s structural moat in agentic commerce could exceed anything Llama ever had as an open model.

The token efficiency claim is the strongest piece of technical evidence in the interview for the “clean stack” thesis. If MuseSpark really is matching competitors with materially fewer tokens, the implication is not that MuseSpark is the best model today, but that MSL has rebuilt the foundations with less accumulated tech debt than competitors that have layered fixes on top of older stacks. That is exactly the kind of advantage that compounds with scale. The next two model releases are the actual test. If Wang is right about predictable scaling on pre-training, RL, test-time, and multi-agent axes simultaneously, the gap from MuseSpark to the next rung should be visible in a way that forces re-rating of Meta’s position.

The open-source posture is the cleanest signal of how the safety conversation has actually changed in 2026. Meta, the lab most identified with open weights, is saying out loud that its current frontier model triggered enough internal guardrails that releasing the weights is off the table. Wang threads the needle by promising smaller open variants, but the underlying point is unmistakable: the open-weights bargain has limits, and those limits will be set by internal preparedness frameworks rather than community pressure. That is a real shift from the Llama 2 era and worth tracking as the next generation lands.

Wang’s willingness to engage on model welfare, on roughly the same footing as safety and alignment, is the second philosophical reveal worth flagging. It signals that the next generation of lab leadership is not going to dismiss the topic the way the previous generation often did. Whether that translates into product or policy changes is unclear, but the fact that the head of MSL says it is “underdiscussed” is itself a marker.

Finally, the human texture of the interview matters. Wang has clearly absorbed a lot of personal incoming fire over the past ten months, including from LeCun and Altman, and his answer is consistently to redirect to the work. The Steve Jobs quote about hiring people who tell you what to do is the operating slogan he keeps coming back to. Combined with the genuine enthusiasm for sci-fi, walks in the woods, and country music, the picture that emerges is less the salesman caricature his critics paint and more a young technical operator betting that scoreboard work over a multi-year horizon will settle every argument that text on X cannot.

Watch the full conversation here.
May 13, 2026
How GPT-5, Claude, and Gemini Are Actually Trained and Served: The Real Math Behind Frontier AI Infrastructure
Reiner Pope, CEO of MatX and former TPU architect at Google, sat down with Dwarkesh Patel for a different kind of episode: a chalk-and-blackboard lecture on how frontier LLMs like GPT-5, Claude, and Gemini are actually trained and served. With nothing but a handful of equations and public API prices, Reiner reverse engineers an astonishing amount of what the labs are doing. If you have ever wondered why Fast Mode costs more, why context length stalls around 200k tokens, why models seem 100x over-trained, or why hyperscalers are pouring half a trillion dollars into memory, this is the most lucid explanation on the internet.

TLDW

Frontier LLM economics come down to two simple budgets: compute time and memory time. Once you write the rooflines on a blackboard, almost everything else falls out of them. Optimal batch size is roughly 300 times your sparsity ratio (around 2,000 to 3,000 tokens for a DeepSeek-style model). A new batch “train” departs every 20 milliseconds because that is how long it takes to read HBM end to end. Mixture of experts strongly favors staying inside a single rack, which is why scale-up domains went from 8 GPUs (Hopper) to 72 (Blackwell) to 500-plus (Rubin). Pipeline parallelism solves weight capacity but does nothing for KV cache, and adds painful per-hop latency, which is why Ilya famously said pipelining is not wise. Because of reinforcement learning and inference economics, frontier models are roughly 100x over-trained versus Chinchilla optimal, and a well-tuned model should output roughly as many tokens during deployment as went into its pre-training corpus. API prices leak the rest: Gemini’s 50% premium above 200k tokens reveals where KV memory time crosses weight memory time, prefill being 5x cheaper than decode confirms decode is memory bandwidth bound, and cache hit pricing tiers map directly to HBM, DDR, flash, and (yes) spinning disk. The lecture closes on a beautiful detour about the convergent evolution of neural nets and cryptographic ciphers.

Key Takeaways
- Two equations explain almost everything. A roofline analysis comparing compute time to memory fetch time predicts cost, latency, and architectural choices with shocking accuracy.
- Optimal batch size is about 300 times sparsity. For a DeepSeek model that activates 32 of 256 experts, that lands around 2,000 to 3,000 tokens per batch. Real deployments go a bit higher to leave headroom.
- The 20 millisecond train. A new batch departs every 20ms because that is how long it takes to read all of HBM once. Worst-case queue latency is roughly 40ms.
- Fast Mode is just smaller batches. Pay 6x more, get 2.5x faster decode by amortizing weights over fewer users. There is a hard latency floor at the HBM read time.
- Slow Mode would not save much. Once you are past the optimal batch size, the cost-per-token plateau is dominated by compute, not weight fetches. You cannot meaningfully amortize KV cache because it is unique per sequence.
- One rack is the natural MoE unit. Expert parallelism wants all-to-all communication, which strongly favors the scale-up network (NVLink) over the scale-out network (roughly 8x slower).
- Bigger scale-up domains drove model scaling. The jump from 8 (Hopper) to 72 (Blackwell) to 500-plus (Rubin) GPUs per rack increased aggregate memory bandwidth by 8x, which is why trillion-plus parameter models only became viable recently.
- Pipeline parallelism is overrated for inference. It saves on weight memory capacity but does nothing for KV cache memory. It also adds milliseconds of latency per hop in decode.
- Why Ilya said pipelining is not wise. Architectural constraints (cross-layer residuals like in Kimi) and the inability to amortize weight loads across micro-batches make pipelining a hassle in training too.
- The memory wall is real and paradoxical. Hyperscalers reportedly spend 50% of CapEx on memory, yet racks have far more HBM than a trillion-parameter model needs. The capacity is there for KV cache and batch size, not for weights.
- Frontier models are roughly 100x over-trained vs Chinchilla. When you minimize total cost across pre-training plus RL plus inference, smaller models trained on more data win.
- Each model should output roughly all human knowledge. If you equalize pre-training and inference compute, the total tokens served by a model during its lifetime should approximate its training corpus. Roughly 150 trillion in, 150 trillion out.
- API pricing reveals architecture. Gemini’s 50% premium above 200k context, the 5x decode-vs-prefill ratio, and cache duration tiers all leak detailed information about KV size, memory bottlenecks, and storage hierarchy.
- KV cache is roughly 2KB per token. Solving Gemini’s pricing equation gives a plausible 1.6 to 2 kilobytes per token at 100B active parameters and 200k context.
- Decode is memory bandwidth bound, prefill is compute bound. The 5x price gap is direct evidence.
- Cache pricing maps to memory tiers. The 5-minute and 1-hour cache durations probably correspond to flash and spinning disk drain times respectively. If that mapping holds, spinning disk is part of the LLM serving stack.
- Context length is stuck near 200k. Memory bandwidth, not compute, is the binding constraint. Sparse attention gives a square-root improvement but is not infinite.
- Cryptography and neural nets are mathematical cousins. Both rely on jumbling information across inputs. Feistel ciphers led directly to RevNets (reversible neural networks). Adversarial attacks mirror the cipher avalanche property.
Detailed Summary

The Roofline: Compute Time vs Memory Time

Reiner starts with the simplest possible model of LLM inference. The time to do a forward pass is bounded below by the maximum of compute time and memory fetch time. Compute time is the batch size times active parameters divided by the hardware’s FLOPs per second. Memory time is total parameters divided by memory bandwidth, plus a KV cache term that scales with batch size and context length. From these two equations, almost every economic and architectural fact about modern LLMs can be derived.

Plotting cost per token against batch size gives a clean picture: at low batch you pay enormous overhead because you cannot amortize the weight fetches, and at high batch you hit a compute floor. There is a sweet spot where memory bandwidth time equals compute time. That sweet spot is what Fast Mode and Slow Mode are tuning around.
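The two budgets above can be sketched in a few lines of Python. This is a minimal illustration rather than Reiner's own code: the function names, the one-byte-per-parameter default, and the hardware numbers in `HYPOTHETICAL` are all assumptions chosen so that the compute-to-bandwidth ratio is 300 and the sparsity ratio is 8. Constant factors are dropped, as in the summary.

```python
def step_time(batch, active_params, total_params, flops_per_s, mem_bw_bytes_per_s,
              kv_bytes_per_token=0.0, context_len=0, bytes_per_param=1.0):
    """Simplified roofline lower bound on one forward pass, in seconds."""
    # Compute budget: batch * active params / FLOPs per second.
    compute_time = batch * active_params / flops_per_s
    # Memory budget: every weight is read once, plus each sequence's KV cache.
    weight_time = total_params * bytes_per_param / mem_bw_bytes_per_s
    kv_time = batch * context_len * kv_bytes_per_token / mem_bw_bytes_per_s
    return max(compute_time, weight_time + kv_time)


def cost_per_token(batch, **hw):
    """Seconds of hardware time per generated token at a given batch size."""
    return step_time(batch, **hw) / batch


# Hypothetical hardware and model, not measurements of any real system.
HYPOTHETICAL = dict(active_params=1e11, total_params=8e11,
                    flops_per_s=3e15, mem_bw_bytes_per_s=1e13)

for b in (1, 64, 2400, 4800):
    print(b, cost_per_token(b, **HYPOTHETICAL))
```

With these made-up numbers, cost per token falls steeply until the batch reaches about 2,400 and is flat after that, which is the sweet spot the plot describes.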

Why Fast Mode Costs More: The Batch Trade-Off

When Claude Code or Codex offers Fast Mode at 6x the price for 2.5x the speed, what is really happening is that they are running you at a smaller batch size. Smaller batch means weight loads are amortized over fewer users, so cost per token goes up. But latency goes down because each forward pass touches less data. There is a hard floor on latency because you have to read every byte of HBM at least once per token, and that takes about 20 milliseconds on Blackwell-class hardware. There is also a soft ceiling on Slow Mode savings because the unamortizable parts (KV cache fetches, compute) eventually dominate.

The 20 Millisecond Train

HBM capacity divided by HBM bandwidth lands consistently around 20 milliseconds across generations of Nvidia hardware. That is the natural cadence at which a frontier model can run a forward pass over all its weights. Reiner uses a memorable analogy: a train departs every 20 milliseconds. Any users whose requests are ready board the train. If the train is full, they wait. If it is empty, it leaves anyway. This is why you do not need millions of concurrent users to saturate a model’s batch. You only need enough to fill a 2,000-token train every 20ms.
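The cadence is easy to check by hand. The sketch below divides capacity by bandwidth for two GPUs, using approximate public spec-sheet figures that are assumed here rather than taken from the episode (80 GB at 3.35 TB/s, and 192 GB at 8 TB/s). The function name is illustrative.

```python
def hbm_drain_ms(capacity_gb, bandwidth_tb_per_s):
    """Time to read all of HBM once, in milliseconds."""
    seconds = (capacity_gb * 1e9) / (bandwidth_tb_per_s * 1e12)
    return seconds * 1e3


# Approximate figures, assumed for illustration only.
ASSUMED_SPECS = {
    "Hopper-class (80 GB, 3.35 TB/s)": (80, 3.35),
    "Blackwell-class (192 GB, 8 TB/s)": (192, 8.0),
}

for name, (cap, bw) in ASSUMED_SPECS.items():
    print(name, round(hbm_drain_ms(cap, bw), 1), "ms")
```

Both land in the low twenties of milliseconds, in line with the "roughly 20 ms" figure.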

Why Optimal Batch Size Is About 300 Times Sparsity

Setting compute time equal to weight fetch time and rearranging gives a beautiful result: batch size needs to be greater than (FLOPs / memory bandwidth) times (total params / active params). The hardware ratio is a dimensionless 300 on most GPUs and has stayed remarkably stable from A100 through Hopper, Blackwell, and Rubin. The model term is just the sparsity ratio. For DeepSeek with 32 of 256 experts active, that is 8. So optimal batch is around 2,400 tokens. Real deployments push this to 3x to leave headroom for non-ideal efficiency. At 64 trains per second and roughly 2,000 tokens per train, that is roughly 128,000 tokens per second per replica, or about 1/1000 of Gemini’s reported global throughput.
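The rearrangement is short enough to write out. A minimal sketch, with an illustrative function name, using the hardware ratio of 300 and the 32-of-256 expert configuration quoted above:

```python
def optimal_batch(hardware_ratio, total_experts, active_experts):
    """Batch size at which compute time equals weight fetch time.

    Derived from: batch * active / flops == total / bandwidth,
    so batch == (flops / bandwidth) * (total / active).
    """
    sparsity = total_experts / active_experts
    return hardware_ratio * sparsity


print(optimal_batch(300, 256, 32))  # 300 * 8
```

This treats the expert ratio as the whole sparsity ratio, which ignores dense layers and shared experts; it is a back-of-the-envelope figure, not a deployment setting.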

Mixture of Experts Wants to Live Inside a Rack

MoE all-to-all routing means every token can be sent to any expert on any GPU. The communication pattern strongly prefers the fast scale-up network (NVLink) inside a rack to the slower scale-out network between racks. Scale-out is roughly 8x slower in bandwidth. This is why one rack ends up being the natural unit for an expert layer, and why Nvidia’s progression from 8 GPUs per rack (Hopper) to 72 (Blackwell) to 500-plus (Rubin) has been such a big deal for model size scaling.

Reiner walks through the physical constraints: cable density, bend radius, weight, power, cooling. Modern racks are pushing every dimension to the limit. Stuffing more GPUs into the scale-up domain is genuinely a hardware engineering problem.

Pipeline Parallelism: Why Ilya Said It Is Not Wise

Pipelining splits model layers across racks. It is the natural way to scale beyond the scale-up domain for very large models. But it has problems. In inference, pipelining does not save runtime, it only saves memory capacity per rack, which already is not the binding constraint because trillion-parameter models only need a terabyte and racks have 10x that. In training, pipelining creates the famous bubble (idle GPU time at the start and end of each pipeline pass) and forces micro-batching, which kills your ability to amortize weight loads across the global batch.
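To put a number on the bubble, the sketch below uses the standard idle-fraction expression for a simple GPipe-style schedule, (stages - 1) / (micro_batches + stages - 1). That formula is a common textbook result brought in here for illustration; it is not something quoted in the episode, and real schedules (interleaved, 1F1B variants) behave differently.

```python
def bubble_fraction(stages, micro_batches):
    """Fraction of pipeline time spent idle in a simple GPipe-style schedule."""
    return (stages - 1) / (micro_batches + stages - 1)


# More micro-batches shrink the bubble, but each one amortizes weight loads
# over fewer tokens, which is the tension described above.
for m in (1, 4, 12, 60):
    print(m, round(bubble_fraction(stages=4, micro_batches=m), 3))
```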

There is also an architectural cost. Models like Kimi use cross-layer residual connections where attention attends to layers a few back, and pipelining makes those patterns very hard to implement cleanly. Ilya’s quip “as we now know, pipelining is not wise” captures all of this.

The Memory Wall Paradox

Industry analysts report that hyperscalers are spending 50% of CapEx on memory this year, while smartphones and laptops are seeing 30% volume drops because there is not enough HBM and DDR to go around. Yet a Blackwell rack already has tens of terabytes of HBM, far more than a trillion-parameter model needs. The reason is that all that extra capacity goes to KV cache, batch size, and longer context. The bandwidth, not the capacity, is what matters most for weight loading. This also implies that hardware could be designed with less HBM per GPU if you commit to pipelining the weights, which is a real architectural option for a chip startup like MatX.

Reinforcement Learning and the 100x Over-Training of Frontier Models

Chinchilla scaling laws say a model with N active parameters should be trained on roughly 20N tokens for compute-optimal training. But frontier labs do not just minimize training cost. They minimize training plus inference cost across the model’s deployment lifetime. With reinforcement learning added to the mix, the cost equation has three terms: pre-training (6 times active params times tokens), RL (somewhere between 2 and 6 times active params times RL tokens, with a 30% efficiency penalty for decode-heavy rollouts), and inference (2 times active params times inference tokens).

If you assume those three roughly equalize at the optimum (a heuristic that holds for many cost curves), you get a clean conclusion: the data going into pre-training should be roughly equal to the data going into RL, which should be roughly equal to the tokens served at inference. With 100 billion active parameters and roughly 150 trillion training tokens, that is about 75x past Chinchilla optimal. Reiner rounds it to 100x. This is the most concrete first-principles argument for why frontier models are so deeply over-trained, and it implies that as inference traffic grows, models should keep getting smaller and longer-trained.
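The 75x figure follows directly from the numbers in the text. A minimal sketch of the arithmetic, with illustrative function names; the 20-tokens-per-parameter rule and the 6x and 2x FLOP multipliers are the ones quoted above:

```python
def chinchilla_tokens(active_params, tokens_per_param=20):
    """Compute-optimal training tokens under the 20N rule of thumb."""
    return tokens_per_param * active_params


def overtraining_ratio(active_params, training_tokens):
    """How far past Chinchilla-optimal a training run is."""
    return training_tokens / chinchilla_tokens(active_params)


def pretrain_flops(active_params, tokens):
    return 6 * active_params * tokens  # forward plus backward pass


def inference_flops(active_params, tokens):
    return 2 * active_params * tokens  # forward pass only


N = 100e9   # active parameters, from the text
D = 150e12  # training tokens, from the text

print(overtraining_ratio(N, D))                       # ratio vs. Chinchilla
print(pretrain_flops(N, D) / inference_flops(N, D))   # per-token cost ratio
```

The second line shows why equal token counts do not mean equal FLOPs: a training token costs three times an inference token under these multipliers, so "roughly equal" is an order-of-magnitude statement.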

Each Model Should Output All of Human Knowledge

The most jaw-dropping consequence: if you equalize pre-training and inference compute, then the total tokens generated by a model across its deployment lifetime should approximate the size of its training corpus. GPT-5, served to hundreds of millions of users for two months, will collectively output something on the order of 150 trillion tokens. That is roughly the sum of human knowledge in textual form. Each frontier model is, in this sense, a one-shot universal author of a corpus the size of its source material.

API Prices Leak Architecture

This is where the lecture gets really fun. Gemini 3.1 charges 50% more for context above 200k tokens. Setting memory time equal to compute time at exactly 200k context and solving for KV cache size gives roughly 1.6 to 2 kilobytes per token, which is plausible for a model with 8 KV heads, dense attention, and head dimension of 128.
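One way to see why that range is plausible: keys and values for 8 KV heads at head dimension 128 come to about 2 KB if each element is stored in one byte. The sketch below assumes one byte per element and counts a single attention layer; both are assumptions made for illustration, since the summary does not state the precision or how layers are counted.

```python
def kv_bytes_per_token(kv_heads, head_dim, bytes_per_element=1):
    """Bytes of K plus V stored for one token in one attention layer."""
    keys_and_values = 2  # one key vector and one value vector per head
    return keys_and_values * kv_heads * head_dim * bytes_per_element


print(kv_bytes_per_token(kv_heads=8, head_dim=128))  # bytes, 1-byte elements
```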

The 5x premium for output (decode) tokens versus input (prefill) tokens is direct evidence that decode is severely memory bandwidth bound and prefill is compute bound. Prefill processes many tokens per weight load, so it amortizes memory cost over the whole sequence. Decode processes one token per weight load, so it pays full memory cost every time.

Cache hits priced at one tenth of cache misses tell you that storing the KV cache in HBM (or DDR or flash) is much cheaper than recomputing it from scratch. The two cache duration tiers (5 minutes and 1 hour) probably correspond to memory tiers whose drain times match those durations: flash for the 5-minute tier, spinning disk for the 1-hour tier. Yes, spinning disk is in the modern LLM serving stack, despite being decades-old technology.

Why Context Length Has Plateaued at 200k

Context lengths shot up from 8k to roughly 200k during the GPT-3 to GPT-4 era and have stayed roughly flat for the past two years. Reiner argues this is the natural balance point where memory bandwidth cost crosses compute cost. Going to a million tokens is expensive. Going to 100 million tokens (which Dario has hinted is needed for true continual learning via in-context learning) is essentially impossible without either a memory technology breakthrough or a much more aggressive sparse attention scheme. Sparse attention helps with a square-root improvement, but it is not unlimited. Going too sparse trades off too much quality.

Cryptography Meets Neural Nets

The episode ends with a lovely intellectual detour. Cryptographic protocols and transformer architectures both rely on jumbling information across all inputs. They are doing inverse versions of the same operation: ciphers take structured input and produce randomness, while neural nets take noisy input and extract structure. Both fields use differentiation as their primary attack vector (differential cryptanalysis on ciphers, gradient descent on neural nets). Adversarial attacks on image classifiers exploit exactly the avalanche property that good ciphers are designed for.

The most concrete crossover: Feistel ciphers, which let you build invertible functions out of non-invertible ones, were ported into deep learning as RevNets (reversible networks) in 2017. RevNets let you run the entire network backwards during the backward pass, eliminating the need to store activations and dramatically reducing training memory footprint. It is the opposite trade-off of KV caching: spending compute to save memory rather than spending memory to save compute.
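The invertibility trick is easy to see in a few lines. The sketch below uses additive coupling in the RevNet style on plain integer lists; the functions `f` and `g` are arbitrary non-invertible stand-ins chosen for illustration, not anything from the lecture.

```python
def f(x):
    # Not invertible: v and -v map to the same output.
    return [(3 * v * v + 1) % 97 for v in x]


def g(x):
    # Also not invertible, for the same reason.
    return [(5 * v * v + 7) % 97 for v in x]


def forward(x1, x2):
    """Additive coupling: each half is updated using only the other half."""
    y1 = [a + b for a, b in zip(x1, f(x2))]
    y2 = [a + b for a, b in zip(x2, g(y1))]
    return y1, y2


def inverse(y1, y2):
    """Undo the coupling by subtracting the same terms in reverse order."""
    x2 = [a - b for a, b in zip(y2, g(y1))]
    x1 = [a - b for a, b in zip(y1, f(x2))]
    return x1, x2


x1, x2 = [1, 2, 3], [4, 5, 6]
y1, y2 = forward(x1, x2)
recovered = inverse(y1, y2)
print(recovered == (x1, x2))
```

The inverse never needs to invert `f` or `g`; it only re-evaluates them on values it already has. That is why a reversible network can recompute activations during the backward pass instead of storing them.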

Thoughts

The most striking thing about this episode is how much can be deduced from a few equations and the public API price sheets of the major labs. The labs treat their architectures as trade secrets, but the moment they price tokens close to cost (which competition forces them to do), the prices themselves leak the underlying ratios. Anyone with a pen and paper can reverse-engineer the KV cache size, the memory tier hierarchy, and the compute-vs-memory bottleneck profile of a frontier model. There is a lesson here for builders: in competitive markets, the prices tell you almost everything.

The 100x over-training result has interesting implications for what comes next. If the optimal balance shifts further toward inference (as adoption keeps growing), models should get smaller and longer-trained. That is good news for serving costs and bad news for training-compute-as-moat. The biggest determinant of model quality might increasingly be data quality and RL environment design, not raw pre-training compute. This squares with what is visible publicly: the leading labs are investing heavily in RL infrastructure, evaluations, and synthetic data pipelines.

The memory wall is the most underrated infrastructure story in AI. Most people think of compute as the bottleneck, but Reiner makes it clear that memory bandwidth is what actually limits context length, which limits how agentic a model can be in practice. If you cannot get to 100 million token contexts, you probably cannot have an AI agent that has been working with you for a month and remembers everything. Either some sparse attention scheme has to give us cheap effective context length, or we need a memory hardware breakthrough, or we have to invent some form of continual learning that does not rely on context windows. None of those paths is obviously easy, and the fact that context length has been flat for two years despite enormous investment suggests we are stuck against a real wall.

The cryptography parallel is the kind of cross-disciplinary insight that does not show up enough in AI discourse. Treating neural networks as a kind of differentiable cipher reframes a lot of the architecture choices (residual connections, layer normalization, attention) as deliberate efforts to make the function smooth and invertible enough to learn, in contrast to ciphers, which are deliberately designed to resist exactly that. Adversarial robustness research probably has a lot more to learn from cryptanalysis than it currently does.

Finally, the format itself is a win. Most AI podcasts are conversational, which is great for personality but bad for technical depth. A blackboard lecture with an interlocutor who asks naive questions at the right moments is a much higher-bandwidth medium. More of this, please.
April 29, 2026