Tag: sparse attention

SubQ 1.1 Small Explained: How Subquadratic Sparse Attention Hits 98% Retrieval at 12 Million Tokens With 64.5x Less Compute Than Dense Attention
Subquadratic, a frontier AI research and infrastructure company, has released the model card and technical report for SubQ 1.1 Small, a long-context language model built on a new attention mechanism the company calls Subquadratic Sparse Attention (SSA). The headline claim is unusual in two directions at once: the model retains 98% single-fact retrieval accuracy at 12 million tokens, roughly twelve times the length it was primarily trained on, while cutting attention compute by 64.5x against dense attention at a 1 million token context. The deeper argument in the report is not really about a single model at all. It is about what happens to the entire retrieval-and-orchestration stack once reasoning over a complete artifact stops being prohibitively expensive.

TLDR

SubQ 1.1 Small is a small long-context model that replaces the dense attention of an existing open-weight frontier model with Subquadratic Sparse Attention, a learned, content-dependent sparse attention mechanism that scales linearly in compute and memory rather than quadratically. On retrieval it posts 99.12% on NVIDIA’s 13-task RULER suite at 128K tokens and 100% needle-in-a-haystack accuracy at 1M and 2M tokens, holding at 98% out to 6M and 12M tokens while attending to only 0.13% of token pairs. It keeps competitive general ability, scoring 85.4% on GPQA Diamond and 89.7% pass@4 on LiveCodeBench v6, and reaches 13% on the long-horizon AutomationBench Finance agentic benchmark, close to Opus 4.8 and GPT-5.5 and well ahead of mid and small tiers. The efficiency story is a scaling win rather than a constant-factor one: 64.5x fewer attention FLOPs than dense attention at 1M tokens and 56x faster than FlashAttention-2 on a single attention layer. The report frames cheap long-context compute as a research accelerator that let the team run more than one hundred million-token experiments and find a training recipe (long-context continued pretraining is the strongest lever) rather than guess at one, positions SSA against FlashAttention, DeepSeek’s Lightning Indexer line, state space models like Mamba, and hybrids, invokes Sutton’s Bitter Lesson to argue that RAG, chunking, and agentic scaffolding are partly workarounds for context scarcity, and was independently verified by Appen. Deployment is starting with design partners now, with a 2M to 12M token lineup planned by year end.

Thoughts

The most interesting move in this report is the framing, not the benchmark. Subquadratic plants its flag on Richard Sutton’s Bitter Lesson and argues that much of the modern AI stack, the retrieval pipelines, the chunkers, the re-rankers, the agentic orchestration, is scaffolding built around a single computational constraint: dense attention costs grow with the square of context length. If that constraint relaxes, a lot of hand-engineered machinery that exists to feed a model the right fragments at the right moment starts to look like the task-specific pipelines that learned representations eventually displaced. That is a genuinely provocative thesis, and it is the right lens for reading the rest of the document. The company is not selling a longer context window as a feature. It is betting that whole-artifact reasoning is a different shape of capability than retrieval over fragments, and that fragmentation destroys the cross-references a contract or a codebase actually depends on before the model ever sees them.

The part of the paper most teams will undervalue is the claim that the real payoff of efficient attention is not cheaper inference but cheaper experimentation. A dense long-context training campaign is expensive enough that most groups get a handful of attempts and are forced to guess at the recipe. Subquadratic says SSA let them run more than a hundred experiments across six model generations with per-step iteration under a minute at million-token context, which is how they discovered that long-context continued pretraining, not clever post-training, was the dominant lever. If that holds, algorithmic efficiency becomes a first-class scaling variable alongside parameters and data, because capability becomes responsive to iteration velocity rather than raw compute alone. It reframes efficiency from a deployment line item into a research multiplier, and that is a more durable advantage than any single benchmark number.

The generalization result deserves scrutiny precisely because it is so clean. A model trained overwhelmingly at 1M tokens, with a sliver at 2M and nothing beyond, holds 98% retrieval at 12M. The proposed explanation is that SSA routes attention by content relevance rather than fixed positional pattern, so there may simply be no obvious length boundary once the routing behavior is learned. That is plausible and the report is careful to say the 12M result emerged rather than being designed for. But single-needle NIAH is a deliberately clean probe with one target and a binary answer. The far harder RULER suite is only reported at 128K, the longest standardized length in the original benchmark, so the multi-hop, aggregation, and distractor-heavy capability that whole-artifact reasoning actually requires has public numbers at 128K, not at 12M. The honest read is that precise retrieval generalizes spectacularly and composite reasoning at extreme length is still an open question the report does not over-claim on.

What lends the report credibility is how much counter-evidence it volunteers. It walks through MiniMax abandoning its hybrid M1 architecture and returning to full attention for M2 after efficient variants showed multi-hop reasoning deficits at scale. It admits that earlier SubQ checkpoints improved retrieval while regressing on knowledge benchmarks, forcing dedicated capability-balancing work. It describes catching a case where the MRCR benchmark moved up while the model felt worse in real workflow spot-checks, and switching its development signal to RULER as a result. That last point is a quietly important methodological argument: benchmark score and deployment behavior diverged enough to change checkpoint selection, which is a warning every team shipping long-context models should internalize. A vendor confident enough to show where its own metrics misled it is more trustworthy than one that only shows the wins.

A few caveats keep the enthusiasm grounded. AutomationBench Finance at 13% is genuinely strong relative to peers, but it is a low absolute score across the board, including for GPT-5.5 at 18% and Opus 4.8 at 16%, so this is early evidence of agentic transfer rather than proof of a finished agent. The efficiency comparisons isolate a single attention layer rather than full end-to-end model throughput, which is the right way to expose the scaling shape but not the same as a wall-clock serving benchmark. The model is built from an unnamed donor open-weight frontier model, so some of its general-knowledge and coding strength is inherited rather than created here. And the most aggressive claims about the future, a 2M to 12M lineup and much higher sparsity, are roadmap, not released artifacts. None of that undercuts the core result. It just means the right posture is to treat SubQ 1.1 Small as a strong proof of concept for an architecture that, if it scales as advertised, could quietly remove a layer of the AI stack that everyone currently takes for granted.

Key Takeaways
- SubQ 1.1 Small is a long-context language model from Subquadratic AI, built on a new attention mechanism called Subquadratic Sparse Attention (SSA), released June 16, 2026 alongside a model card and technical report.
- SSA is a learned, content-dependent sparse attention mechanism that scales linearly in both compute and memory with sequence length, rather than quadratically like dense attention.
- The central result is context-length generalization: the model was trained primarily at 1M tokens, with some training at 2M and none beyond, yet retrieval held far past the training window.
- Needle-in-a-haystack accuracy is 100% at 1M and 2M tokens and 98% at both 6M and 12M tokens, roughly twelve times the primary training length.
- At 12M tokens the model attends to only 0.13% of token pairs, close to a 1,000x reduction in attention relationships, while still retrieving accurately.
- On NVIDIA’s 13-task RULER benchmark at 128K tokens, SubQ 1.1 Small scores 99.12%, with the remaining errors concentrated in aggregation-style tasks rather than retrieval.
- RULER tests beyond single-fact lookup: single-key and multi-key retrieval, common-word and frequent-word extraction, and multi-hop variable tracing across positions.
- At 1M tokens, SSA requires 64.5x fewer attention FLOPs than dense attention (3.9 PFLOP versus 252 PFLOP per attention layer).
- On a single attention layer, SSA runs 56x faster than FlashAttention-2 at 1M tokens (966 ms versus 54,164 ms on an H100), reaching parity near 16K tokens and pulling away as context grows.
- The efficiency gain is a scaling-law win, not a constant-factor speedup: the advantage over dense attention grows as context length increases.
- On general knowledge, SubQ 1.1 Small scores 85.4% on GPQA Diamond (pass@1), below GPT-5.5 (93.2) and Opus 4.8 (92), near Sonnet 4.6 and GPT-5.4-mini (87.5), and above GPT-5.4-nano (81.7) and Haiku 4.5 (67.2).
- On coding, it reaches 89.7% pass@4 on LiveCodeBench v6, close to the absolute frontier (GPT-5.5 92, Opus 4.8 92.2) and ahead of the smaller tiers.
- On AutomationBench Finance, a long-horizon agentic benchmark, it scores 13%, close to Opus 4.8 (16%) and GPT-5.5 (18%) and ahead of Sonnet 4.6 (8%), Haiku 4.5 (3%), and GPT-5.4-mini (0%). Absolute scores are low across all models.
- The model was not trained from scratch. The team converted an existing open-weight frontier model by replacing dense attention with SSA, then built long-context ability through staged context extension and continued pretraining.
- Context was extended in stages (262K, 512K, 1M, 2M) using YaRN positional scaling, with long-context continued pretraining performed between extension stages on naturally long data: books, long documents, and repository-scale code.
- Roughly one trillion tokens of continued pretraining were performed, most of it at the 1M-token stage.
- Long-context continued pretraining was the most consistent predictor of long-context retrieval gains across the experiments, more so than post-training tweaks.
- The team ran more than one hundred long-context experiments across six major model generations, which the report argues is only possible because SSA made million-token iteration cheap (under a minute per step).
- Capability balance was a recurring challenge: gains in long-context retrieval often regressed short-context knowledge and reasoning unless training was explicitly managed for both.
- Benchmark scores and real deployment behavior diverged. The MRCR benchmark moved up while qualitative workflow spot-checks got worse, so the team switched its primary development signal to RULER.
- The report frames RAG, chunking, summarization, and agentic orchestration as scaffolding built around context scarcity, drawing an analogy to Sutton’s Bitter Lesson, where hand-engineered mechanisms get displaced by larger-scale learning.
- SSA is positioned against FlashAttention (a memory optimization that does not change quadratic compute), fixed-pattern sparse attention, DeepSeek’s learned sparse line, state space models, and hybrid architectures.
- DeepSeek’s Lightning Indexer (used in DSA and CSA) is the closest published comparison. Its quadratic scoring overtakes the sparse attention it feeds around 52,000 tokens, reaching roughly 16x the attention cost at 1M and 190x at 12M.
- State space models like Mamba achieve linear cost through a compressed fixed-size state, but that compression is lossy and weakens exact retrieval, which is why production efficient models are usually hybrids with some dense attention layers retained.
- MiniMax is cited as a cautionary case: it moved from a hybrid M1 to a full-attention M2 after hybrids showed multi-hop reasoning deficits at scale and less mature supporting infrastructure.
- The benchmark results were independently verified by Appen, a third-party evaluation firm.
- The named use cases are financial analysis and due diligence, legal and contract work, and software engineering (architecture-level reasoning, cross-file refactoring, dependency tracing, planning, review, and long-horizon memory).
- Sparsity settings were deliberately conservative, tuned for maximum context length rather than maximum sparsity. Limited experiments at 4x the sparsity reported positive early results.
- The training infrastructure used a memory-scaling ladder: single node, intra-node sequence parallelism, CPU offload, multi-node sequence parallelism, nested offloading, and Ring Attention for the longest contexts.
- Beyond about 8M tokens, BF16 numerical underflow and stability became practical constraints on evaluation.
- The technical report is authored by Saul Ramirez, Alex Whedon, Ashmal Vayani, and Phong Vo of Subquadratic AI.
- Deployment is starting with a first cohort of design partners, with broader rollout through the quarter and a general model lineup ranging from 2M to 12M tokens by the end of the year.
- The company’s framing line is “Efficiency is intelligence,” and its broader thesis is that the point is not bigger context windows for their own sake but reasoning directly over complete artifacts with less surrounding scaffolding.
Detailed Summary

The problem: whole-artifact reasoning and context scarcity

The report opens by naming a class of tasks it calls whole-artifact reasoning: problems whose structure requires reasoning across a complete artifact rather than over isolated fragments. A legal agreement may define a term on page 2, qualify it on page 12, carve out an exception on page 46, and amend it in a schedule. A function may be defined in one file, called from forty others, and constrained by invariants encoded in the architecture rather than in comments. A financial review may require connecting filings, earnings reports, contracts, and internal records. In each case the difficulty is not locating a passage, it is reasoning over relationships distributed throughout a large artifact. Most production systems do not do this directly. They rely on retrieval pipelines, chunking, summaries, and agentic workflows that partition information and reconstruct fragments at inference time, because dense attention scales quadratically with context length and makes direct reasoning over large artifacts expensive. Subquadratic argues that much of the modern AI stack is therefore designed to manage context scarcity rather than reason over complete artifacts, and it connects this to Sutton’s Bitter Lesson: sophisticated hand-engineered mechanisms historically get displaced once larger-scale learning becomes practical.

What SSA is and the three requirements it targets

Subquadratic Sparse Attention is a content-dependent sparse attention mechanism designed to satisfy three requirements at once, a combination the report argues prior approaches never achieved in a practical long-context system. First, dense-attention-level retrieval and reasoning quality, which requires routing that is content-dependent (determined by the tokens themselves) rather than driven by a fixed positional pattern. Second, subquadratic scaling, where selection, retrieval, and attention are each linear in sequence length so the mechanism is linear end to end, not only within the attention read. Third, full-context training with standard autoregressive generation, so the model can optimize over the entire context during training while keeping efficient token-by-token decoding at inference. The internal mechanism by which SSA achieves this is held back as outside the scope of the report, which focuses instead on the requirements and the experimental program that followed.

Where SSA sits among prior approaches

The background section is effectively a taxonomy of long-context modeling. FlashAttention is treated not as a competitor but as the standard dense-attention baseline: it solved the memory problem by never materializing the full attention matrix, but it left the quadratic compute cost untouched, so doubling context still quadruples attention computation. Fixed-pattern sparse attention (sliding-window, strided, as in Longformer, BigBird, and the sliding window in Gemma) scales well but sacrifices content-dependent routing and tends to fail on retrieval benchmarks like RULER. Compression methods like Multi-head Latent Attention reduce KV-cache memory at inference but do not change the quadratic prefill cost. Learned sparse attention, exemplified by DeepSeek’s Native Sparse Attention and its Lightning Indexer, learns where to route but pays a quadratic cost in the indexer itself. State space models and linear attention (Mamba, Mamba-2 and Mamba-3, RetNet, RWKV, gated delta networks) achieve linear cost through a compressed fixed-size state, but that compression is lossy and weak on exact retrieval. Hybrids (Jamba, Kimi Linear, Qwen3 Next, Nemotron) keep a few dense layers to preserve retrieval, which means the quadratic component still dominates at long context. System-level workarounds (RAG, agentic frameworks, recursive language models) move retrieval outside the model entirely. The report’s stated open problem is to combine subquadratic scaling end to end with content-dependent retrieval, arbitrary-position access, and practical ultra-long-context training in one system, which it claims no widely deployed architecture provides and which SSA targets.

Training: conversion, staged context extension, and continued pretraining

Rather than training from scratch, the team converted an existing open-weight frontier model that supported a 262K-token context by replacing its dense attention with SSA. They then extended the context window in stages (262K to 512K to 1M to 2M) using YaRN to rescale positional representations, performing long-context continued pretraining between extension stages rather than jumping straight to the final length. The training mixture emphasized naturally long data such as books, long documents, and repository-scale code, packed to the target length with document separators and without masking cross-document attention boundaries. Most continued-pretraining tokens were trained at the 1M-token stage, with roughly one trillion tokens total. Post-training played a separate role: shaping how the long-context capability was expressed while preserving reasoning, coding, and instruction following. The team explored sample-level loss aggregation to keep a few extremely long examples from dominating gradient updates, and staged the post-training corpus across synthetic retrieval tasks, long-context reasoning, coding, educational material, and general instruction following, alternating capability-building phases with recovery phases.

Results: retrieval, knowledge, coding, and agentic tasks

On retrieval, SubQ 1.1 Small scores 99.12% on the 13-task RULER average at 128K, with errors concentrated in aggregation-style tasks like common-word and frequent-word extraction. On needle-in-a-haystack, evaluated on 50 held-out UUID samples per length, it scores 100% at 1M and 2M (within the training window) and 98% at 6M and 12M (held out), attending to only 0.13% of token pairs at 12M. On knowledge, GPQA Diamond pass@1 is 85.4%, landing between the small and mid frontier tiers and confirming that long-context optimization need not sacrifice reasoning, a result the report credits to its capability-balancing stages after earlier checkpoints showed retrieval gains coming at the cost of knowledge. On coding, LiveCodeBench v6 pass@4 is 89.7%, and the report notes coding data played a dual role, also improving non-code long-context retrieval because code is dense with the cross-position dependencies that train general routing. On long-horizon agentic work, AutomationBench Finance is 13%, where agents must discover the right endpoints among roughly 500 across 47 applications, make interdependent API calls, follow layered business rules, and ignore seeded distractors, graded on binary end-state correctness with no partial credit.

Efficiency and the DeepSeek comparison

Efficiency is measured on one attention layer against a dense baseline on the same backbone. Per-forward-pass attention FLOPs scale from a 2.1x reduction at 32K to 8x at 128K, 31.5x at 512K, and 64.5x at 1M tokens (3.9 PFLOP for SSA versus 252 PFLOP for dense). Measured against FlashAttention-2 in isolation, SSA reaches parity near 16K tokens and pulls away to 56x at 1M, where it runs in 966 ms versus 54,164 ms on an H100. The report devotes a discussion section to DeepSeek’s sparse attention line as the closest published comparison. DeepSeek’s Lightning Indexer is a learned selector, but it is a full-attention distilled transformer, so it scales quadratically: in a V3.2-style configuration the indexer is cheaper than the sparse attention it feeds only below about 52,000 tokens, then overtakes it, reaching roughly 16x the attention cost at 1M tokens and 190x at 12M. SSA targets that same selection role with a selector the report says is dramatically cheaper and linear throughout, and notes SSA could conceptually replace the selector over either uncompressed or compressed representations.

Efficiency as a research accelerator and the evaluation lessons

A recurring theme is that the most valuable effect of cheap long-context compute was on the research loop, not just inference. Where a dense campaign would allow a handful of attempts, SSA enabled more than a hundred experiments across six model generations with per-step iteration under a minute at million-token context. That throughput is what surfaced the finding that long-context continued pretraining is the strongest lever, and it leads the authors to argue that algorithmic efficiency should be treated as a first-class scaling variable alongside model and dataset size. The report is unusually candid about evaluation pitfalls. It describes how the MRCR benchmark diverged from deployment behavior, with MRCR-optimized checkpoints often feeling worse on repository-scale code reasoning, multi-document synthesis, and contract analysis, which pushed the team to rely on RULER and a fixed set of qualitative workflow spot-checks as development signals. It also cites MiniMax returning from a hybrid M1 to a full-attention M2 as evidence that reducing asymptotic cost is not sufficient on its own if retrieval quality, reasoning at scale, and system maturity are not preserved at the same time.

Implications, availability, and what comes next

The report’s deployment argument is that the most important enterprise implication of long-context models is not larger windows but the ability to reason directly over complete or more-complete artifacts, moving retrieval, re-ranking, and orchestration logic into the model where the task is naturally whole-artifact rather than naturally decomposable. It is careful not to declare retrieval obsolete: for corpora larger than any plausible context window, fast-changing knowledge, and genuinely multi-stage workflows, RAG and orchestration remain the right tools. The narrower claim is that the class of scaffolding that exists only to compensate for context limits gets smaller as efficient long-context models extend the reachable window. The benchmark results were independently verified by Appen. Subquadratic is deploying SubQ 1.1 Small with a first cohort of design partners now, with broader rollout through the quarter and a general lineup spanning 2M to 12M tokens planned by the end of the year, and it flags much higher sparsity as future work.

Notable Quotes

“Much of the modern AI stack is therefore designed to manage context scarcity rather than reason over complete artifacts directly.”
SubQ-1.1-Small Technical Report, framing retrieval and orchestration as workarounds for an architectural limit

“The hybrid has moved the line, but not changed its shape.”
SubQ-1.1-Small Technical Report, on why hybrid models keep their quadratic component at long context

“A routing mechanism intended to make long context affordable becomes the dominant long-context cost, reintroducing quadratic scaling after providing scalar compute savings.”
SubQ-1.1-Small Technical Report, on DeepSeek’s Lightning Indexer overtaking the attention it feeds

“If the cost of long-context experiments is too high, teams are forced to guess at the recipe. If the cost falls far enough, they can search for it.”
SubQ-1.1-Small Technical Report, on efficient attention as a research accelerator

“Fragmentation systematically destroys those relationships before the model ever sees them.”
SubQ-1.1-Small Technical Report, on why chunking hurts whole-artifact reasoning

“Holding the whole artifact in context changes the shape of the task rather than only the speed of it.”
SubQ-1.1-Small Technical Report, on the difference between bigger windows and direct reasoning

“The value of SSA is therefore not only that it makes long-context inference cheaper. It makes long-context experimentation cheaper.”
SubQ-1.1-Small Technical Report, conclusion

Read the full SubQ 1.1 Small technical report and model card here.

Related Reading
- Subquadratic (subq.ai) the company behind SubQ 1.1 Small and the Subquadratic Sparse Attention architecture, where you can join the waitlist.
- The Bitter Lesson by Richard Sutton the short essay whose argument the report leans on, that hand-engineered mechanisms lose to general methods that scale with computation.
- Attention Is All You Need the original Transformer paper that introduced the dense attention whose quadratic cost SSA is built to remove.
- RULER (arXiv) NVIDIA’s long-context benchmark that the report uses as its primary retrieval signal, and that fixed-pattern sparse methods historically struggle with.
- Retrieval-augmented generation (Wikipedia) background on the RAG approach that the report frames as scaffolding around context scarcity rather than a permanent fixture.
June 18, 2026
How GPT-5, Claude, and Gemini Are Actually Trained and Served: The Real Math Behind Frontier AI Infrastructure
Reiner Pope, CEO of MatX and former TPU architect at Google, sat down with Dwarkesh Patel for a different kind of episode: a chalk-and-blackboard lecture on how frontier LLMs like GPT-5, Claude, and Gemini are actually trained and served. With nothing but a handful of equations and public API prices, Reiner reverse engineers an astonishing amount of what the labs are doing. If you have ever wondered why Fast Mode costs more, why context length stalls around 200k tokens, why models seem 100x over-trained, or why hyperscalers are pouring half a trillion dollars into memory, this is the most lucid explanation on the internet.

TLDW

Frontier LLM economics come down to two simple budgets: compute time and memory time. Once you write the rooflines on a blackboard, almost everything else falls out of them. Optimal batch size is roughly 300 times your sparsity ratio (around 2,000 to 3,000 tokens for a DeepSeek-style model). A new batch “train” departs every 20 milliseconds because that is how long it takes to read HBM end to end. Mixture of experts strongly favors staying inside a single rack, which is why scale-up domains went from 8 GPUs (Hopper) to 72 (Blackwell) to 500-plus (Rubin). Pipeline parallelism solves weight capacity but does nothing for KV cache, and adds painful per-hop latency, which is why Ilya famously said pipelining is not wise. Because of reinforcement learning and inference economics, frontier models are roughly 100x over-trained versus Chinchilla optimal, and a well-tuned model should output roughly as many tokens during deployment as went into its pre-training corpus. API prices leak the rest: Gemini’s 50% premium above 200k tokens reveals where KV memory time crosses weight memory time, prefill being 5x cheaper than decode confirms decode is memory bandwidth bound, and cache hit pricing tiers map directly to HBM, DDR, flash, and (yes) spinning disk. The lecture closes on a beautiful detour about the convergent evolution of neural nets and cryptographic ciphers.

Key Takeaways
- Two equations explain almost everything. A roofline analysis comparing compute time to memory fetch time predicts cost, latency, and architectural choices with shocking accuracy.
- Optimal batch size is about 300 times sparsity. For a DeepSeek model that activates 32 of 256 experts, that lands around 2,000 to 3,000 tokens per batch. Real deployments go a bit higher to leave headroom.
- The 20 millisecond train. A new batch departs every 20ms because that is how long it takes to read all of HBM once. Worst-case queue latency is roughly 40ms.
- Fast Mode is just smaller batches. Pay 6x more, get 2.5x faster decode by amortizing weights over fewer users. There is a hard latency floor at the HBM read time.
- Slow Mode would not save much. Once you are past the optimal batch size, the cost-per-token plateau is dominated by compute, not weight fetches. You cannot meaningfully amortize KV cache because it is unique per sequence.
- One rack is the natural MoE unit. Expert parallelism wants all-to-all communication, which strongly favors the scale-up network (NVLink) over the scale-out network (roughly 8x slower).
- Bigger scale-up domains drove model scaling. The jump from 8 (Hopper) to 72 (Blackwell) to 500-plus (Rubin) GPUs per rack increased aggregate memory bandwidth by 8x, which is why trillion-plus parameter models only became viable recently.
- Pipeline parallelism is overrated for inference. It saves on weight memory capacity but does nothing for KV cache memory. It also adds milliseconds of latency per hop in decode.
- Why Ilya said pipelining is not wise. Architectural constraints (cross-layer residuals like in Kimi) and the inability to amortize weight loads across micro-batches make pipelining a hassle in training too.
- The memory wall is real and paradoxical. Hyperscalers reportedly spend 50% of CapEx on memory, yet racks have far more HBM than a trillion-parameter model needs. The capacity is there for KV cache and batch size, not for weights.
- Frontier models are roughly 100x over-trained vs Chinchilla. When you minimize total cost across pre-training plus RL plus inference, smaller models trained on more data win.
- Each model should output roughly all human knowledge. If you equalize pre-training and inference compute, the total tokens served by a model during its lifetime should approximate its training corpus. Roughly 150 trillion in, 150 trillion out.
- API pricing reveals architecture. Gemini’s 50% premium above 200k context, the 5x decode-vs-prefill ratio, and cache duration tiers all leak detailed information about KV size, memory bottlenecks, and storage hierarchy.
- KV cache is roughly 2KB per token. Solving Gemini’s pricing equation gives a plausible 1.6 to 2 kilobytes per token at 100B active parameters and 200k context.
- Decode is memory bandwidth bound, prefill is compute bound. The 5x price gap is direct evidence.
- Cache pricing maps to memory tiers. The 5-minute and 1-hour cache durations probably correspond to flash and spinning disk drain times respectively. LLM serving uses spinning disk.
- Context length is stuck near 200k. Memory bandwidth, not compute, is the binding constraint. Sparse attention gives a square-root improvement but is not infinite.
- Cryptography and neural nets are mathematical cousins. Both rely on jumbling information across inputs. Feistel ciphers led directly to RevNets (reversible neural networks). Adversarial attacks mirror the cipher avalanche property.
Detailed Summary

The Roofline: Compute Time vs Memory Time

Reiner starts with the simplest possible model of LLM inference. The time to do a forward pass is bounded below by the maximum of compute time and memory fetch time. Compute time is the batch size times active parameters divided by FLOPs. Memory time is total parameters divided by memory bandwidth, plus a KV cache term that scales with batch size and context length. From these two equations, almost every economic and architectural fact about modern LLMs can be derived.

Plotting cost per token against batch size gives a clean picture: at low batch you pay enormous overhead because you cannot amortize the weight fetches, and at high batch you hit a compute floor. There is a sweet spot where memory bandwidth time equals compute time. That sweet spot is what Fast Mode and Slow Mode are tuning around.

Why Fast Mode Costs More: The Batch Trade-Off

When Claude Code or Codex offers Fast Mode at 6x the price for 2.5x the speed, what is really happening is that they are running you at a smaller batch size. Smaller batch means weight loads are amortized over fewer users, so cost per token goes up. But latency goes down because each forward pass touches less data. There is a hard floor on latency because you have to read every byte of HBM at least once per token, and that takes about 20 milliseconds on Blackwell-class hardware. There is also a soft ceiling on Slow Mode savings because the unamortizable parts (KV cache fetches, compute) eventually dominate.

The 20 Millisecond Train

HBM capacity divided by HBM bandwidth lands consistently around 20 milliseconds across generations of Nvidia hardware. That is the natural cadence at which a frontier model can run a forward pass over all its weights. Reiner uses a memorable analogy: a train departs every 20 milliseconds. Any users whose requests are ready board the train. If the train is full, they wait. If it is empty, it leaves anyway. This is why you do not need millions of concurrent users to saturate a model’s batch. You only need enough to fill a 2,000-token train every 20ms.

Why Optimal Batch Size Is About 300 Times Sparsity

Setting compute time equal to weight fetch time and rearranging gives a beautiful result: batch size needs to be greater than (FLOPs / memory bandwidth) times (total params / active params). The hardware ratio is a dimensionless 300 on most GPUs and has stayed remarkably stable from A100 through Hopper, Blackwell, and Rubin. The model term is just the sparsity ratio. For DeepSeek with 32 of 256 experts active, that is 8. So optimal batch is around 2,400 tokens. Real deployments push this to 3x to leave headroom for non-ideal efficiency. At 64 trains per second, that is roughly 128,000 tokens per second per replica, or about 1/1000 of Gemini’s reported global throughput.

Mixture of Experts Wants to Live Inside a Rack

MoE all-to-all routing means every token can be sent to any expert on any GPU. The communication pattern strongly prefers the fast scale-up network (NVLink) inside a rack to the slower scale-out network between racks. Scale-out is roughly 8x slower in bandwidth. This is why one rack ends up being the natural unit for an expert layer, and why Nvidia’s progression from 8 GPUs per rack (Hopper) to 72 (Blackwell) to 500-plus (Rubin) has been such a big deal for model size scaling.

Reiner walks through the physical constraints: cable density, bend radius, weight, power, cooling. Modern racks are pushing every dimension to the limit. Stuffing more GPUs into the scale-up domain is genuinely a hardware engineering problem.

Pipeline Parallelism: Why Ilya Said It Is Not Wise

Pipelining splits model layers across racks. It is the natural way to scale beyond the scale-up domain for very large models. But it has problems. In inference, pipelining does not save runtime, it only saves memory capacity per rack, which already is not the binding constraint because trillion-parameter models only need a terabyte and racks have 10x that. In training, pipelining creates the famous bubble (idle GPU time at the start and end of each pipeline pass) and forces micro-batching, which kills your ability to amortize weight loads across the global batch.

There is also an architectural cost. Models like Kimi use cross-layer residual connections where attention attends to layers a few back, and pipelining makes those patterns very hard to implement cleanly. Ilya’s quip “as we now know, pipelining is not wise” captures all of this.

The Memory Wall Paradox

Industry analysts report that hyperscalers are spending 50% of CapEx on memory this year, while smartphones and laptops are seeing 30% volume drops because there is not enough HBM and DDR to go around. Yet a Blackwell rack already has tens of terabytes of HBM, far more than a trillion-parameter model needs. The reason is that all that extra capacity goes to KV cache, batch size, and longer context. The bandwidth, not the capacity, is what matters most for weight loading. This also implies that hardware could be designed with less HBM per GPU if you commit to pipelining the weights, which is a real architectural option for a chip startup like MatX.

Reinforcement Learning and the 100x Over-Training of Frontier Models

Chinchilla scaling laws say a model with N active parameters should be trained on roughly 20N tokens for compute-optimal training. But frontier labs do not just minimize training cost. They minimize training plus inference cost across the model’s deployment lifetime. With reinforcement learning added to the mix, the cost equation has three terms: pre-training (6 times active params times tokens), RL (somewhere between 2x and 6x times active params times RL tokens, with a 30% efficiency penalty for decode-heavy rollouts), and inference (2 times active params times inference tokens).

If you assume those three roughly equalize at the optimum (a heuristic that holds for many cost curves), you get a clean conclusion: the data going into pre-training should be roughly equal to the data going into RL, which should be roughly equal to the tokens served at inference. With 100 billion active parameters and roughly 150 trillion training tokens, that is about 75x past Chinchilla optimal. Reiner rounds it to 100x. This is the most concrete first-principles argument for why frontier models are so deeply over-trained, and it implies that as inference traffic grows, models should keep getting smaller and longer-trained.

Each Model Should Output All of Human Knowledge

The most jaw-dropping consequence: if you equalize pre-training and inference compute, then the total tokens generated by a model across its deployment lifetime should approximate the size of its training corpus. GPT-5, served to hundreds of millions of users for two months, will collectively output something on the order of 150 trillion tokens. That is roughly the sum of human knowledge in textual form. Each frontier model is, in this sense, a one-shot universal author of a corpus the size of its source material.

API Prices Leak Architecture

This is where the lecture gets really fun. Gemini 3.1 charges 50% more for context above 200k tokens. Setting memory time equal to compute time at exactly 200k context and solving for KV cache size gives roughly 1.6 to 2 kilobytes per token, which is plausible for a model with 8 KV heads, dense attention, and head dimension of 128.

The 5x premium for output (decode) tokens versus input (prefill) tokens is direct evidence that decode is severely memory bandwidth bound and prefill is compute bound. Prefill processes many tokens per weight load, so it amortizes memory cost over the whole sequence. Decode processes one token per weight load, so it pays full memory cost every time.

Cache hits priced at one tenth of cache misses tell you that storing the KV cache in HBM (or DDR or flash) is much cheaper than recomputing it from scratch. The two cache duration tiers (5 minutes and 1 hour) probably correspond to memory tiers whose drain times match those durations: flash for the 5-minute tier, spinning disk for the 1-hour tier. Yes, spinning disk is in the modern LLM serving stack, despite being decades-old technology.

Why Context Length Has Plateaued at 200k

Context lengths shot up from 8k to roughly 200k during the GPT-3 to GPT-4 era and have stayed roughly flat for the past two years. Reiner argues this is the natural balance point where memory bandwidth cost crosses compute cost. Going to a million tokens is expensive. Going to 100 million tokens (which Dario has hinted is needed for true continual learning via in-context learning) is essentially impossible without either a memory technology breakthrough or a much more aggressive sparse attention scheme. Sparse attention helps with a square-root improvement, but it is not unlimited. Going too sparse trades off too much quality.

Cryptography Meets Neural Nets

The episode ends with a lovely intellectual detour. Cryptographic protocols and transformer architectures both rely on jumbling information across all inputs. They are doing inverse versions of the same operation: ciphers take structured input and produce randomness, while neural nets take noisy input and extract structure. Both fields use differentiation as their primary attack vector (differential cryptanalysis on ciphers, gradient descent on neural nets). Adversarial attacks on image classifiers exploit exactly the avalanche property that good ciphers are designed for.

The most concrete crossover: Feistel ciphers, which let you build invertible functions out of non-invertible ones, were ported into deep learning as RevNets (reversible networks) in 2017. RevNets let you run the entire network backwards during the backward pass, eliminating the need to store activations and dramatically reducing training memory footprint. It is the opposite trade-off of KV caching: spending compute to save memory rather than spending memory to save compute.

Thoughts

The most striking thing about this episode is how much can be deduced from a few equations and the public API price sheets of the major labs. The labs treat their architectures as trade secrets, but the moment they price tokens to be close to cost (which competition forces them to do), the prices themselves leak the underlying ratios. Anyone with a pen and paper can reverse engineer the KV cache size, the memory tier hierarchy, and the compute-vs-memory bottleneck profile of a frontier model. There is a lesson here for builders: in competitive markets, the prices tell you almost everything.

The 100x over-training result has interesting implications for what comes next. If the optimal balance shifts further toward inference (as adoption keeps growing), models should get smaller and longer-trained. That is good news for serving costs and bad news for training-compute-as-moat. The biggest determinant of model quality might increasingly be data quality and RL environment design, not raw pre-training compute. This squares with what is visible publicly: the leading labs are investing heavily in RL infrastructure, evaluations, and synthetic data pipelines.

The memory wall is the most underrated infrastructure story in AI. Most people think of compute as the bottleneck, but Reiner makes it clear that memory bandwidth is what actually limits context length, which limits how agentic a model can be in practice. If you cannot get to 100 million token contexts, you probably cannot have an AI agent that has been working with you for a month and remembers everything. Either some sparse attention scheme has to give us cheap effective context length, or we need a memory hardware breakthrough, or we have to invent some form of continual learning that does not rely on context windows. None of those paths are obviously easy, and the fact that context length has been flat for two years despite enormous investment suggests we are stuck against a real wall.

The cryptography parallel is the kind of cross-disciplinary insight that does not show up enough in AI discourse. Treating neural networks as a kind of differentiable cipher reframes a lot of the architecture choices (residual connections, layer normalization, attention) as deliberate efforts to make the function smooth and invertible enough to learn, in contrast to ciphers, which are deliberately designed to resist exactly that. Adversarial robustness research probably has a lot more to learn from cryptanalysis than it currently does.

Finally, the format itself is a win. Most AI podcasts are conversational, which is great for personality but bad for technical depth. A blackboard lecture with an interlocutor who asks naive questions at the right moments is a much higher bandwidth medium. More of this, please.
April 29, 2026

Tag: sparse attention

SubQ 1.1 Small Explained: How Subquadratic Sparse Attention Hits 98% Retrieval at 12 Million Tokens With 64.5x Less Compute Than Dense Attention

TLDR

Thoughts

Key Takeaways

Detailed Summary

The problem: whole-artifact reasoning and context scarcity

What SSA is and the three requirements it targets

Where SSA sits among prior approaches

Training: conversion, staged context extension, and continued pretraining

Results: retrieval, knowledge, coding, and agentic tasks

Efficiency and the DeepSeek comparison

Efficiency as a research accelerator and the evaluation lessons

Implications, availability, and what comes next

Notable Quotes

Related Reading

How GPT-5, Claude, and Gemini Are Actually Trained and Served: The Real Math Behind Frontier AI Infrastructure

TLDW

Key Takeaways

Detailed Summary

The Roofline: Compute Time vs Memory Time

Why Fast Mode Costs More: The Batch Trade-Off

The 20 Millisecond Train

Why Optimal Batch Size Is About 300 Times Sparsity

Mixture of Experts Wants to Live Inside a Rack

Pipeline Parallelism: Why Ilya Said It Is Not Wise

The Memory Wall Paradox

Reinforcement Learning and the 100x Over-Training of Frontier Models

Each Model Should Output All of Human Knowledge

API Prices Leak Architecture

Why Context Length Has Plateaued at 200k

Cryptography Meets Neural Nets

Thoughts