PJFP.com

Pursuit of Joy, Fulfillment, and Purpose

Tag: GPT-5.5

SubQ 1.1 Small Explained: How Subquadratic Sparse Attention Hits 98% Retrieval at 12 Million Tokens With 64.5x Less Compute Than Dense Attention
Subquadratic, a frontier AI research and infrastructure company, has released the model card and technical report for SubQ 1.1 Small, a long-context language model built on a new attention mechanism the company calls Subquadratic Sparse Attention (SSA). The headline claim is unusual in two directions at once: the model retains 98% single-fact retrieval accuracy at 12 million tokens, roughly twelve times the length it was primarily trained on, while cutting attention compute by 64.5x against dense attention at a 1 million token context. The deeper argument in the report is not really about a single model at all. It is about what happens to the entire retrieval-and-orchestration stack once reasoning over a complete artifact stops being prohibitively expensive.

TLDR

SubQ 1.1 Small is a small long-context model that replaces the dense attention of an existing open-weight frontier model with Subquadratic Sparse Attention, a learned, content-dependent sparse attention mechanism that scales linearly in compute and memory rather than quadratically. On retrieval it posts 99.12% on NVIDIA’s 13-task RULER suite at 128K tokens and 100% needle-in-a-haystack accuracy at 1M and 2M tokens, holding at 98% out to 6M and 12M tokens while attending to only 0.13% of token pairs. It keeps competitive general ability, scoring 85.4% on GPQA Diamond and 89.7% pass@4 on LiveCodeBench v6, and reaches 13% on the long-horizon AutomationBench Finance agentic benchmark, close to Opus 4.8 and GPT-5.5 and well ahead of mid and small tiers. The efficiency story is a scaling win rather than a constant-factor one: 64.5x fewer attention FLOPs than dense attention at 1M tokens and 56x faster than FlashAttention-2 on a single attention layer. The report frames cheap long-context compute as a research accelerator that let the team run more than one hundred million-token experiments and find a training recipe (long-context continued pretraining is the strongest lever) rather than guess at one, positions SSA against FlashAttention, DeepSeek’s Lightning Indexer line, state space models like Mamba, and hybrids, invokes Sutton’s Bitter Lesson to argue that RAG, chunking, and agentic scaffolding are partly workarounds for context scarcity, and was independently verified by Appen. Deployment is starting with design partners now, with a 2M to 12M token lineup planned by year end.

Thoughts

The most interesting move in this report is the framing, not the benchmark. Subquadratic plants its flag on Richard Sutton’s Bitter Lesson and argues that much of the modern AI stack, the retrieval pipelines, the chunkers, the re-rankers, the agentic orchestration, is scaffolding built around a single computational constraint: dense attention costs grow with the square of context length. If that constraint relaxes, a lot of hand-engineered machinery that exists to feed a model the right fragments at the right moment starts to look like the task-specific pipelines that learned representations eventually displaced. That is a genuinely provocative thesis, and it is the right lens for reading the rest of the document. The company is not selling a longer context window as a feature. It is betting that whole-artifact reasoning is a different shape of capability than retrieval over fragments, and that fragmentation destroys the cross-references a contract or a codebase actually depends on before the model ever sees them.

The part of the paper most teams will undervalue is the claim that the real payoff of efficient attention is not cheaper inference but cheaper experimentation. A dense long-context training campaign is expensive enough that most groups get a handful of attempts and are forced to guess at the recipe. Subquadratic says SSA let them run more than a hundred experiments across six model generations with per-step iteration under a minute at million-token context, which is how they discovered that long-context continued pretraining, not clever post-training, was the dominant lever. If that holds, algorithmic efficiency becomes a first-class scaling variable alongside parameters and data, because capability becomes responsive to iteration velocity rather than raw compute alone. It reframes efficiency from a deployment line item into a research multiplier, and that is a more durable advantage than any single benchmark number.

The generalization result deserves scrutiny precisely because it is so clean. A model trained overwhelmingly at 1M tokens, with a sliver at 2M and nothing beyond, holds 98% retrieval at 12M. The proposed explanation is that SSA routes attention by content relevance rather than fixed positional pattern, so there may simply be no obvious length boundary once the routing behavior is learned. That is plausible and the report is careful to say the 12M result emerged rather than being designed for. But single-needle NIAH is a deliberately clean probe with one target and a binary answer. The far harder RULER suite is only reported at 128K, the longest standardized length in the original benchmark, so the multi-hop, aggregation, and distractor-heavy capability that whole-artifact reasoning actually requires has public numbers at 128K, not at 12M. The honest read is that precise retrieval generalizes spectacularly and composite reasoning at extreme length is still an open question the report does not over-claim on.

What lends the report credibility is how much counter-evidence it volunteers. It walks through MiniMax abandoning its hybrid M1 architecture and returning to full attention for M2 after efficient variants showed multi-hop reasoning deficits at scale. It admits that earlier SubQ checkpoints improved retrieval while regressing on knowledge benchmarks, forcing dedicated capability-balancing work. It describes catching a case where the MRCR benchmark moved up while the model felt worse in real workflow spot-checks, and switching its development signal to RULER as a result. That last point is a quietly important methodological argument: benchmark score and deployment behavior diverged enough to change checkpoint selection, which is a warning every team shipping long-context models should internalize. A vendor confident enough to show where its own metrics misled it is more trustworthy than one that only shows the wins.

A few caveats keep the enthusiasm grounded. AutomationBench Finance at 13% is genuinely strong relative to peers, but it is a low absolute score across the board, including for GPT-5.5 at 18% and Opus 4.8 at 16%, so this is early evidence of agentic transfer rather than proof of a finished agent. The efficiency comparisons isolate a single attention layer rather than full end-to-end model throughput, which is the right way to expose the scaling shape but not the same as a wall-clock serving benchmark. The model is built from an unnamed donor open-weight frontier model, so some of its general-knowledge and coding strength is inherited rather than created here. And the most aggressive claims about the future, a 2M to 12M lineup and much higher sparsity, are roadmap, not released artifacts. None of that undercuts the core result. It just means the right posture is to treat SubQ 1.1 Small as a strong proof of concept for an architecture that, if it scales as advertised, could quietly remove a layer of the AI stack that everyone currently takes for granted.

Key Takeaways
- SubQ 1.1 Small is a long-context language model from Subquadratic AI, built on a new attention mechanism called Subquadratic Sparse Attention (SSA), released June 16, 2026 alongside a model card and technical report.
- SSA is a learned, content-dependent sparse attention mechanism that scales linearly in both compute and memory with sequence length, rather than quadratically like dense attention.
- The central result is context-length generalization: the model was trained primarily at 1M tokens, with some training at 2M and none beyond, yet retrieval held far past the training window.
- Needle-in-a-haystack accuracy is 100% at 1M and 2M tokens and 98% at both 6M and 12M tokens, roughly twelve times the primary training length.
- At 12M tokens the model attends to only 0.13% of token pairs, close to a 1,000x reduction in attention relationships, while still retrieving accurately.
- On NVIDIA’s 13-task RULER benchmark at 128K tokens, SubQ 1.1 Small scores 99.12%, with the remaining errors concentrated in aggregation-style tasks rather than retrieval.
- RULER tests beyond single-fact lookup: single-key and multi-key retrieval, common-word and frequent-word extraction, and multi-hop variable tracing across positions.
- At 1M tokens, SSA requires 64.5x fewer attention FLOPs than dense attention (3.9 PFLOP versus 252 PFLOP per attention layer).
- On a single attention layer, SSA runs 56x faster than FlashAttention-2 at 1M tokens (966 ms versus 54,164 ms on an H100), reaching parity near 16K tokens and pulling away as context grows.
- The efficiency gain is a scaling-law win, not a constant-factor speedup: the advantage over dense attention grows as context length increases.
- On general knowledge, SubQ 1.1 Small scores 85.4% on GPQA Diamond (pass@1), below GPT-5.5 (93.2) and Opus 4.8 (92), near Sonnet 4.6 and GPT-5.4-mini (87.5), and above GPT-5.4-nano (81.7) and Haiku 4.5 (67.2).
- On coding, it reaches 89.7% pass@4 on LiveCodeBench v6, close to the absolute frontier (GPT-5.5 92, Opus 4.8 92.2) and ahead of the smaller tiers.
- On AutomationBench Finance, a long-horizon agentic benchmark, it scores 13%, close to Opus 4.8 (16%) and GPT-5.5 (18%) and ahead of Sonnet 4.6 (8%), Haiku 4.5 (3%), and GPT-5.4-mini (0%). Absolute scores are low across all models.
- The model was not trained from scratch. The team converted an existing open-weight frontier model by replacing dense attention with SSA, then built long-context ability through staged context extension and continued pretraining.
- Context was extended in stages (262K, 512K, 1M, 2M) using YaRN positional scaling, with long-context continued pretraining performed between extension stages on naturally long data: books, long documents, and repository-scale code.
- Roughly one trillion tokens of continued pretraining were performed, most of it at the 1M-token stage.
- Long-context continued pretraining was the most consistent predictor of long-context retrieval gains across the experiments, more so than post-training tweaks.
- The team ran more than one hundred long-context experiments across six major model generations, which the report argues is only possible because SSA made million-token iteration cheap (under a minute per step).
- Capability balance was a recurring challenge: gains in long-context retrieval often regressed short-context knowledge and reasoning unless training was explicitly managed for both.
- Benchmark scores and real deployment behavior diverged. The MRCR benchmark moved up while qualitative workflow spot-checks got worse, so the team switched its primary development signal to RULER.
- The report frames RAG, chunking, summarization, and agentic orchestration as scaffolding built around context scarcity, drawing an analogy to Sutton’s Bitter Lesson, where hand-engineered mechanisms get displaced by larger-scale learning.
- SSA is positioned against FlashAttention (a memory optimization that does not change quadratic compute), fixed-pattern sparse attention, DeepSeek’s learned sparse line, state space models, and hybrid architectures.
- DeepSeek’s Lightning Indexer (used in DSA and CSA) is the closest published comparison. Its quadratic scoring overtakes the sparse attention it feeds around 52,000 tokens, reaching roughly 16x the attention cost at 1M and 190x at 12M.
- State space models like Mamba achieve linear cost through a compressed fixed-size state, but that compression is lossy and weakens exact retrieval, which is why production efficient models are usually hybrids with some dense attention layers retained.
- MiniMax is cited as a cautionary case: it moved from a hybrid M1 to a full-attention M2 after hybrids showed multi-hop reasoning deficits at scale and less mature supporting infrastructure.
- The benchmark results were independently verified by Appen, a third-party evaluation firm.
- The named use cases are financial analysis and due diligence, legal and contract work, and software engineering (architecture-level reasoning, cross-file refactoring, dependency tracing, planning, review, and long-horizon memory).
- Sparsity settings were deliberately conservative, tuned for maximum context length rather than maximum sparsity. Limited experiments at 4x the sparsity reported positive early results.
- The training infrastructure used a memory-scaling ladder: single node, intra-node sequence parallelism, CPU offload, multi-node sequence parallelism, nested offloading, and Ring Attention for the longest contexts.
- Beyond about 8M tokens, BF16 numerical underflow and stability became practical constraints on evaluation.
- The technical report is authored by Saul Ramirez, Alex Whedon, Ashmal Vayani, and Phong Vo of Subquadratic AI.
- Deployment is starting with a first cohort of design partners, with broader rollout through the quarter and a general model lineup ranging from 2M to 12M tokens by the end of the year.
- The company’s framing line is “Efficiency is intelligence,” and its broader thesis is that the point is not bigger context windows for their own sake but reasoning directly over complete artifacts with less surrounding scaffolding.
Detailed Summary

The problem: whole-artifact reasoning and context scarcity

The report opens by naming a class of tasks it calls whole-artifact reasoning: problems whose structure requires reasoning across a complete artifact rather than over isolated fragments. A legal agreement may define a term on page 2, qualify it on page 12, carve out an exception on page 46, and amend it in a schedule. A function may be defined in one file, called from forty others, and constrained by invariants encoded in the architecture rather than in comments. A financial review may require connecting filings, earnings reports, contracts, and internal records. In each case the difficulty is not locating a passage, it is reasoning over relationships distributed throughout a large artifact. Most production systems do not do this directly. They rely on retrieval pipelines, chunking, summaries, and agentic workflows that partition information and reconstruct fragments at inference time, because dense attention scales quadratically with context length and makes direct reasoning over large artifacts expensive. Subquadratic argues that much of the modern AI stack is therefore designed to manage context scarcity rather than reason over complete artifacts, and it connects this to Sutton’s Bitter Lesson: sophisticated hand-engineered mechanisms historically get displaced once larger-scale learning becomes practical.

What SSA is and the three requirements it targets

Subquadratic Sparse Attention is a content-dependent sparse attention mechanism designed to satisfy three requirements at once, a combination the report argues prior approaches never achieved in a practical long-context system. First, dense-attention-level retrieval and reasoning quality, which requires routing that is content-dependent (determined by the tokens themselves) rather than driven by a fixed positional pattern. Second, subquadratic scaling, where selection, retrieval, and attention are each linear in sequence length so the mechanism is linear end to end, not only within the attention read. Third, full-context training with standard autoregressive generation, so the model can optimize over the entire context during training while keeping efficient token-by-token decoding at inference. The internal mechanism by which SSA achieves this is held back as outside the scope of the report, which focuses instead on the requirements and the experimental program that followed.

Where SSA sits among prior approaches

The background section is effectively a taxonomy of long-context modeling. FlashAttention is treated not as a competitor but as the standard dense-attention baseline: it solved the memory problem by never materializing the full attention matrix, but it left the quadratic compute cost untouched, so doubling context still quadruples attention computation. Fixed-pattern sparse attention (sliding-window, strided, as in Longformer, BigBird, and the sliding window in Gemma) scales well but sacrifices content-dependent routing and tends to fail on retrieval benchmarks like RULER. Compression methods like Multi-head Latent Attention reduce KV-cache memory at inference but do not change the quadratic prefill cost. Learned sparse attention, exemplified by DeepSeek’s Native Sparse Attention and its Lightning Indexer, learns where to route but pays a quadratic cost in the indexer itself. State space models and linear attention (Mamba, Mamba-2 and Mamba-3, RetNet, RWKV, gated delta networks) achieve linear cost through a compressed fixed-size state, but that compression is lossy and weak on exact retrieval. Hybrids (Jamba, Kimi Linear, Qwen3 Next, Nemotron) keep a few dense layers to preserve retrieval, which means the quadratic component still dominates at long context. System-level workarounds (RAG, agentic frameworks, recursive language models) move retrieval outside the model entirely. The report’s stated open problem is to combine subquadratic scaling end to end with content-dependent retrieval, arbitrary-position access, and practical ultra-long-context training in one system, which it claims no widely deployed architecture provides and which SSA targets.

Training: conversion, staged context extension, and continued pretraining

Rather than training from scratch, the team converted an existing open-weight frontier model that supported a 262K-token context by replacing its dense attention with SSA. They then extended the context window in stages (262K to 512K to 1M to 2M) using YaRN to rescale positional representations, performing long-context continued pretraining between extension stages rather than jumping straight to the final length. The training mixture emphasized naturally long data such as books, long documents, and repository-scale code, packed to the target length with document separators and without masking cross-document attention boundaries. Most continued-pretraining tokens were trained at the 1M-token stage, with roughly one trillion tokens total. Post-training played a separate role: shaping how the long-context capability was expressed while preserving reasoning, coding, and instruction following. The team explored sample-level loss aggregation to keep a few extremely long examples from dominating gradient updates, and staged the post-training corpus across synthetic retrieval tasks, long-context reasoning, coding, educational material, and general instruction following, alternating capability-building phases with recovery phases.

Results: retrieval, knowledge, coding, and agentic tasks

On retrieval, SubQ 1.1 Small scores 99.12% on the 13-task RULER average at 128K, with errors concentrated in aggregation-style tasks like common-word and frequent-word extraction. On needle-in-a-haystack, evaluated on 50 held-out UUID samples per length, it scores 100% at 1M and 2M (within the training window) and 98% at 6M and 12M (held out), attending to only 0.13% of token pairs at 12M. On knowledge, GPQA Diamond pass@1 is 85.4%, landing between the small and mid frontier tiers and confirming that long-context optimization need not sacrifice reasoning, a result the report credits to its capability-balancing stages after earlier checkpoints showed retrieval gains coming at the cost of knowledge. On coding, LiveCodeBench v6 pass@4 is 89.7%, and the report notes coding data played a dual role, also improving non-code long-context retrieval because code is dense with the cross-position dependencies that train general routing. On long-horizon agentic work, AutomationBench Finance is 13%, where agents must discover the right endpoints among roughly 500 across 47 applications, make interdependent API calls, follow layered business rules, and ignore seeded distractors, graded on binary end-state correctness with no partial credit.

Efficiency and the DeepSeek comparison

Efficiency is measured on one attention layer against a dense baseline on the same backbone. Per-forward-pass attention FLOPs scale from a 2.1x reduction at 32K to 8x at 128K, 31.5x at 512K, and 64.5x at 1M tokens (3.9 PFLOP for SSA versus 252 PFLOP for dense). Measured against FlashAttention-2 in isolation, SSA reaches parity near 16K tokens and pulls away to 56x at 1M, where it runs in 966 ms versus 54,164 ms on an H100. The report devotes a discussion section to DeepSeek’s sparse attention line as the closest published comparison. DeepSeek’s Lightning Indexer is a learned selector, but it is a full-attention distilled transformer, so it scales quadratically: in a V3.2-style configuration the indexer is cheaper than the sparse attention it feeds only below about 52,000 tokens, then overtakes it, reaching roughly 16x the attention cost at 1M tokens and 190x at 12M. SSA targets that same selection role with a selector the report says is dramatically cheaper and linear throughout, and notes SSA could conceptually replace the selector over either uncompressed or compressed representations.

Efficiency as a research accelerator and the evaluation lessons

A recurring theme is that the most valuable effect of cheap long-context compute was on the research loop, not just inference. Where a dense campaign would allow a handful of attempts, SSA enabled more than a hundred experiments across six model generations with per-step iteration under a minute at million-token context. That throughput is what surfaced the finding that long-context continued pretraining is the strongest lever, and it leads the authors to argue that algorithmic efficiency should be treated as a first-class scaling variable alongside model and dataset size. The report is unusually candid about evaluation pitfalls. It describes how the MRCR benchmark diverged from deployment behavior, with MRCR-optimized checkpoints often feeling worse on repository-scale code reasoning, multi-document synthesis, and contract analysis, which pushed the team to rely on RULER and a fixed set of qualitative workflow spot-checks as development signals. It also cites MiniMax returning from a hybrid M1 to a full-attention M2 as evidence that reducing asymptotic cost is not sufficient on its own if retrieval quality, reasoning at scale, and system maturity are not preserved at the same time.

Implications, availability, and what comes next

The report’s deployment argument is that the most important enterprise implication of long-context models is not larger windows but the ability to reason directly over complete or more-complete artifacts, moving retrieval, re-ranking, and orchestration logic into the model where the task is naturally whole-artifact rather than naturally decomposable. It is careful not to declare retrieval obsolete: for corpora larger than any plausible context window, fast-changing knowledge, and genuinely multi-stage workflows, RAG and orchestration remain the right tools. The narrower claim is that the class of scaffolding that exists only to compensate for context limits gets smaller as efficient long-context models extend the reachable window. The benchmark results were independently verified by Appen. Subquadratic is deploying SubQ 1.1 Small with a first cohort of design partners now, with broader rollout through the quarter and a general lineup spanning 2M to 12M tokens planned by the end of the year, and it flags much higher sparsity as future work.

Notable Quotes

“Much of the modern AI stack is therefore designed to manage context scarcity rather than reason over complete artifacts directly.”
SubQ-1.1-Small Technical Report, framing retrieval and orchestration as workarounds for an architectural limit

“The hybrid has moved the line, but not changed its shape.”
SubQ-1.1-Small Technical Report, on why hybrid models keep their quadratic component at long context

“A routing mechanism intended to make long context affordable becomes the dominant long-context cost, reintroducing quadratic scaling after providing scalar compute savings.”
SubQ-1.1-Small Technical Report, on DeepSeek’s Lightning Indexer overtaking the attention it feeds

“If the cost of long-context experiments is too high, teams are forced to guess at the recipe. If the cost falls far enough, they can search for it.”
SubQ-1.1-Small Technical Report, on efficient attention as a research accelerator

“Fragmentation systematically destroys those relationships before the model ever sees them.”
SubQ-1.1-Small Technical Report, on why chunking hurts whole-artifact reasoning

“Holding the whole artifact in context changes the shape of the task rather than only the speed of it.”
SubQ-1.1-Small Technical Report, on the difference between bigger windows and direct reasoning

“The value of SSA is therefore not only that it makes long-context inference cheaper. It makes long-context experimentation cheaper.”
SubQ-1.1-Small Technical Report, conclusion

Read the full SubQ 1.1 Small technical report and model card here.

Related Reading
- Subquadratic (subq.ai) the company behind SubQ 1.1 Small and the Subquadratic Sparse Attention architecture, where you can join the waitlist.
- The Bitter Lesson by Richard Sutton the short essay whose argument the report leans on, that hand-engineered mechanisms lose to general methods that scale with computation.
- Attention Is All You Need the original Transformer paper that introduced the dense attention whose quadratic cost SSA is built to remove.
- RULER (arXiv) NVIDIA’s long-context benchmark that the report uses as its primary retrieval signal, and that fixed-pattern sparse methods historically struggle with.
- Retrieval-augmented generation (Wikipedia) background on the RAG approach that the report frames as scaffolding around context scarcity rather than a permanent fixture.
June 18, 2026
US Government Orders Anthropic to Suspend Claude Fable 5 and Mythos 5: Inside the Export Control Directive, the Jailbreak Dispute, and What It Means for Frontier AI
On June 12, 2026, Anthropic published a statement announcing that the US government, citing national security authorities, has issued an export control directive forcing the company to suspend all access to its newest frontier models, Claude Fable 5 and Claude Mythos 5. The order technically targets foreign nationals inside and outside the United States, including Anthropic’s own foreign national employees, but the practical effect is that both models are going dark for every customer worldwide. It is the first publicly known instance of the US government ordering a deployed frontier AI model offline, and Anthropic is complying while openly disputing the basis for the decision.

TLDR

The US government delivered an export control directive to Anthropic at 5:21pm ET on June 12, 2026, suspending all access to Fable 5 and Mythos 5 over an alleged jailbreak of Fable 5’s safeguards. Anthropic says the letter contained no specific details, that the only evidence shared was verbal, and that the technique in question amounts to asking the model to read a codebase and fix software flaws, a capability the company says is freely available from other models including OpenAI’s GPT-5.5 and used daily by cyber defenders. Anthropic defends its defense in depth strategy, notes that thousands of hours of red teaming by the US government, the UK AISI, and third parties found no universal jailbreak, and warns that recalling a commercial model over a narrow, non-universal jailbreak would effectively halt all new frontier model deployments if applied industry-wide. Access to all other Anthropic models, including Claude Opus, Sonnet, and Haiku, is unaffected, and the company says it believes the situation is a misunderstanding and is working to restore access, with more details promised within 24 hours.

Thoughts

This is a watershed moment regardless of how it resolves. Governments have blocked AI exports before, but ordering a deployed commercial model recalled out from under hundreds of millions of users is a new kind of intervention, closer to a product recall than a trade restriction. The mechanism matters too. Export control authority aimed at foreign nationals, including a company’s own employees, that cascades into a global shutdown is a blunt instrument doing the work of a regulatory regime that does not exist yet. The US has no statutory process for recalling an AI model, so the government reached for the closest tool on the shelf, and the result is a precedent built on improvisation.

There is real irony in who got hit first. Anthropic has spent years arguing, publicly and in Washington, that governments should have the power to block unsafe AI deployments. Now the company that asked for a referee is the first one whistled, and its complaint is not about the existence of the power but about the process: a letter at 5:21pm with no specifics, verbal evidence only, and no transparent or technically grounded procedure. That distinction is the whole ballgame for AI governance. A power to halt deployments without due process standards is not regulation, it is discretion, and discretion cuts in every direction depending on who holds it.

The technical dispute underneath is genuinely interesting because it exposes how unsettled the definition of a dangerous jailbreak is. Anthropic’s account of the offending technique, asking the model to read a specific codebase and fix any software flaws, describes something security teams do on purpose every single day. Vulnerability discovery is the canonical dual use capability: the same analysis that lets a defender patch a hole lets an attacker find one. If the bar for recall is that a model can be coaxed into doing competent security analysis, then every capable model on the market fails that bar, which is exactly Anthropic’s point about GPT-5.5. The hard question the directive dodges is not whether Fable 5 can find bugs but whether it provides meaningful uplift beyond what is already freely available, and Anthropic says it does not.

For builders, the immediate lesson is uncomfortable: model availability is now a political variable, not just an engineering one. Teams that built directly on Fable 5 lost a production dependency overnight through no fault of Anthropic’s infrastructure, their own code, or any terms of service violation. Multi-model fallback strategies, abstraction layers over providers, and graceful degradation paths just moved from nice-to-have to table stakes for anyone running serious workloads on frontier models. The companies that absorbed this outage gracefully are the ones that assumed any single model could vanish.

The next 24 hours matter more than the directive itself. Anthropic has promised more details, and the government will face pressure to either substantiate a concern that justifies a global recall or quietly walk it back. Either outcome sets the real precedent. If the directive holds on thin evidence, every frontier lab now operates under the threat of arbitrary shutdown. If it collapses under scrutiny, the case for a formal, transparent statutory process for AI deployment decisions, which Anthropic explicitly endorses in its own statement, gets a lot stronger in Congress than it was a week ago.

Key Takeaways
- The US government issued an export control directive on June 12, 2026 suspending all access to Claude Fable 5 and Claude Mythos 5, citing national security authorities.
- The directive formally targets access by any foreign national, inside or outside the United States, including Anthropic’s own foreign national employees.
- The net effect is that Anthropic must disable Fable 5 and Mythos 5 for all customers worldwide to ensure compliance, not just for foreign users.
- Access to all other Anthropic models, including the Claude Opus, Sonnet, and Haiku families, is not affected by the order.
- Anthropic received the directive at 5:21pm ET the same day it published its statement, and says the letter did not provide specific details of the national security concern.
- Anthropic’s understanding is that the government believes it has become aware of a method of bypassing, or jailbreaking, Fable 5’s safeguards.
- Anthropic reviewed a demonstration of the specific technique and says it only identified a small number of previously known, minor vulnerabilities.
- The company says other publicly available models can discover the same vulnerabilities without requiring any bypass at all.
- Before launch, Fable 5’s safeguards were red-teamed for thousands of hours in total by the US government, the UK AISI, multiple private third-party organizations, and internal teams.
- No tester has found a universal jailbreak for Fable 5, meaning a method that broadly bypasses safeguards and unlocks a wide range of cyber capabilities.
- Anthropic openly states that perfect jailbreak resistance does not appear possible for any model provider today, and that every safeguard in the industry is vulnerable to non-universal jailbreaks.
- Fable 5 was deployed under a defense in depth strategy: make jailbreaks either narrow or very expensive to produce, then combine that with monitoring to quickly detect and shut down successful attacks.
- Anthropic’s 30-day customer data retention requirement for Fable exists specifically to support jailbreak research and mitigation, a policy the company says carries real costs with customers.
- Anthropic says it has not received any disclosure of a concerning non-universal jailbreak that led to a harmful result; disclosed potential jailbreaks were benign or provided no Mythos-specific uplift.
- The only evidence the government has provided is verbal, describing a narrow, non-universal jailbreak that essentially consists of asking the model to read a specific codebase and fix any software flaws.
- Anthropic reviewed a report it believes is the basis of the directive and validated that the capability level shown is widely available from other models, including OpenAI’s GPT-5.5, and is used every day by cyber defenders.
- Anthropic is complying with the legal directive while explicitly disagreeing that a narrow potential jailbreak justifies recalling a commercial model deployed to hundreds of millions of people.
- The company warns that if this recall standard were applied across the industry, it would essentially halt all new model deployments for every frontier model provider.
- Anthropic supports government power to block unsafe deployments in principle, but only through a statutory process that is transparent, fair, clear, and grounded in technical facts, and says this action meets none of those principles.
- Anthropic apologized to customers, called the situation a misunderstanding, said it is working to restore access as soon as possible, and promised more details within 24 hours.
Detailed Summary

What the directive actually does

The order arrived as a letter from the US government at 5:21pm ET on June 12, 2026, invoking national security authorities under export control law. On paper it suspends access to Fable 5 and Mythos 5 by any foreign national, whether inside or outside the United States, a category that includes some of Anthropic’s own employees. In practice, Anthropic says compliance requires abruptly disabling both models for every customer, since there is no clean way to enforce a nationality-based access boundary across a global product. The letter did not spell out the specific national security concern. Everything else in Anthropic’s statement is the company’s own reconstruction of what prompted the action.

The jailbreak at the center of the dispute

Anthropic’s understanding is that the government became aware of a method for bypassing Fable 5’s safeguards. The company reviewed a demonstration of the technique and characterizes the results as a small number of previously known, minor vulnerabilities, all relatively simple, all discoverable by other publicly available models without any jailbreak at all. According to Anthropic, the government’s evidence so far has been entirely verbal, and the technique boils down to asking the model to read a specific codebase and fix any software flaws. The company reviewed a report it believes underlies the directive and validated that the displayed capability is widely available elsewhere, naming OpenAI’s GPT-5.5 directly, and noted that this exact kind of analysis is what defenders use to keep systems safe.

Anthropic’s defense in depth posture

The statement restates the safety posture Anthropic laid out at Fable 5’s launch. The safeguards around cybersecurity tasks are strong enough that users have complained they are overly broad. In the weeks before launch, the US government, the UK AISI, multiple private third-party organizations, and internal teams red-teamed the safeguards for thousands of hours combined, and those tests showed Fable’s protections to be substantially more effective than any previously deployed model. No tester found a universal jailbreak. Anthropic is candid that perfect jailbreak resistance is likely impossible for anyone today, which is why the strategy is defense in depth: keep jailbreaks narrow or expensive, monitor aggressively, and shut down attacks fast. The 30-day customer data retention requirement on Fable exists to support that monitoring and mitigation loop. The company says this posture makes Fable’s risks comparable to models already deployed across the industry.

Complying while disputing the standard

Anthropic is removing access for all users as legally required, but the statement draws a hard line on the principle. The company disagrees that a narrow potential jailbreak, one that produced no disclosed harmful result, justifies recalling a commercial model serving hundreds of millions of people. Its broader warning is that this standard, applied evenly, would halt all new frontier model deployments industry-wide, since every provider’s safeguards are vulnerable to narrow jailbreaks. Anthropic also turns its own policy position into a critique: the company has publicly supported giving government the ability to block unsafe deployments, but through a statutory process that is transparent, fair, clear, and grounded in technical facts, and it says this action does not adhere to those principles.

What happens next

Anthropic closed by apologizing to customers, calling the situation a misunderstanding, and committing to restore access as soon as possible. The company promised to share more details over the next 24 hours, which makes this a developing story. The open questions are whether the government substantiates its concern with written technical evidence, whether the directive survives that scrutiny, and whether this episode accelerates the formal statutory process for AI deployment decisions that Anthropic says should have governed the action in the first place.

Notable Quotes

“The net effect of this order is that we must abruptly disable Fable 5 and Mythos 5 for all our customers to ensure compliance.”
Anthropic, on why a directive aimed at foreign nationals becomes a global shutdown

“We received the directive from the government today at 5:21pm (ET). The letter did not provide specific details of its national security concern.”
Anthropic, on the abruptness and opacity of the order

“These vulnerabilities all appear relatively simple, and we have found that other publicly-available models are able to discover them as well without requiring a bypass.”
Anthropic, on its review of the demonstrated jailbreak technique

“We suspect that perfect jailbreak resistance is not currently possible for any model provider.”
Anthropic, restating the position it disclosed at Fable 5’s launch

“We stand by this defense in depth strategy. It reduces the risks posed by Fable, making them comparable to the risks of existing models already deployed across the industry.”
Anthropic, defending its layered safeguards approach

“To date, the government has only given us verbal evidence of a potential narrow, non-universal jailbreak, which essentially consists of asking the model to read a specific codebase and fix any software flaws.”
Anthropic, describing the technique behind the directive

“However, we disagree that the finding of a narrow potential jailbreak should be cause for recalling a commercial model deployed to hundreds of millions of people.”
Anthropic, on complying while contesting the decision

“If this standard was applied across the industry, we believe it would essentially halt all new model deployments for all frontier model providers.”
Anthropic, on the industry-wide implications of the recall standard

“As we have stated publicly, we believe the government should have the ability to block unsafe deployments, as part of a statutory process that is transparent, fair, clear, and grounded in technical facts. This action does not adhere to those principles.”
Anthropic, on the kind of oversight process it says should have governed the action

“We apologize for this disruption to our customers. We believe this is a misunderstanding and are working to restore access as soon as possible.”
Anthropic, closing its statement to customers

Read the full statement on Anthropic’s site here.

Related Reading
- Anthropic’s Claude Fable 5 and Mythos 5 launch announcement the original deployment post that laid out the safeguards posture now at the center of the dispute.
- US Bureau of Industry and Security the agency that administers US export controls, the kind of authority a directive like this one invokes.
- Export control (Wikipedia) background on how export control law works and why it can reach foreign nationals inside the United States.
- Prompt injection and jailbreaking (Wikipedia) primer on the techniques used to bypass language model safeguards.
- UK AI Security Institute one of the third-party organizations that red-teamed Fable 5’s safeguards before launch.
June 13, 2026
Claude Opus 4.8 Released: Anthropic Bets on Honesty, Dynamic Workflows, Effort Control, and Cheaper Fast Mode
Anthropic has released Claude Opus 4.8, the newest member of its flagship Opus class, available today across every surface and priced exactly like the model it replaces. The company calls it “a modest but tangible improvement” on Opus 4.7, but the framing undersells what is actually interesting here: the headline upgrade is not a benchmark number, it is honesty. Opus 4.8 is built to know when it does not know, and that single behavioral shift may matter more for real agent work than any raw capability bump.

TLDR

Claude Opus 4.8 is an across-the-board upgrade to Anthropic’s Opus class that ships today at the same regular price as Opus 4.7 ($5 per million input tokens, $25 per million output tokens), with the model positioned as “a more effective collaborator.” The marquee improvement is honesty: Opus 4.8 is roughly four times less likely than its predecessor to let flaws in its own code pass unremarked, and it is more willing to flag uncertainty rather than confidently claim progress on thin evidence. A pre-release alignment assessment found new highs on prosocial traits like supporting user autonomy and acting in the user’s best interest, with misaligned behavior at rates similar to Anthropic’s best-aligned model, Claude Mythos Preview. Three things launch alongside the model: dynamic workflows in Claude Code (research preview), where Claude plans work then runs hundreds of parallel subagents that run even longer and verify their own outputs before reporting back; effort control in claude.ai and Cowork, a slider for how hard Claude thinks; and a Messages API update that accepts system entries inside the messages array so developers can update instructions mid-task without breaking the prompt cache. Fast mode now runs at 2.5x speed and is three times cheaper than before ($10 / $50 per million tokens). The roadmap points to cheaper Opus-equivalent models, a higher-intelligence class above Opus, and a wider rollout of Mythos-class models gated behind stronger cyber safeguards under Project Glasswing.

Thoughts

The most important sentence in this announcement is not about coding scores. It is the claim that Opus 4.8 is about four times less likely than Opus 4.7 to let flaws in its own code slip by without comment. For a chat assistant, overconfidence is annoying. For an agent, it is catastrophic. The whole premise of long-running autonomous work is that you hand the model a task and walk away, which means the model’s own judgment about whether it succeeded becomes the only judgment in the loop until you come back. A model that confidently declares victory on a half-finished migration does not save you time, it costs you a debugging session plus the time you spent trusting it. Honesty, framed this way, is not a soft virtue. It is the load-bearing reliability property that makes unattended agents usable at all.

Read the launch as a single coherent argument rather than a list of features, and the pieces lock together. Dynamic workflows let Claude plan a job and fan out hundreds of parallel subagents that, with Opus 4.8, run longer than before. Effort control lets you dial up how much the model thinks. The honesty improvement means the model checks its own work and flags what it is unsure about instead of papering over it. Put those three together and you get one product thesis: let it run longer, let it think harder, and trust it to tell you when something is wrong. The codebase-scale migration example, hundreds of thousands of lines from kickoff to merge with the existing test suite as the bar, is the proof point. None of those three capabilities is worth much alone. A model that runs for hours but lies about its results is a liability. A model that flags uncertainty but cannot sustain a long task never reaches the moment where its honesty matters. Anthropic shipped all three at once because they only pay off together.

The economics deserve a closer look than the “same price” headline invites. Regular pricing is flat versus Opus 4.7, which is the polite way of saying you get a better model for free. The real move is fast mode: 2.5x the speed at three times cheaper than it cost on previous models, landing at $10 per million input and $50 per million output. That is Anthropic quietly attacking the latency-versus-cost tradeoff that has shaped how teams deploy frontier models. Until now, “fast” meant “expensive,” so you reserved it for interactive moments and ate the wait everywhere else. Collapsing that premium changes the default. And note the subtle token story underneath: Opus 4.8 at its default high effort spends roughly the same tokens on coding as Opus 4.7’s default while performing better, so the effort slider is not a way to bleed you dry, it is an honest exposure of the quality-cost dial that was always there implicitly.

The Messages API change is the kind of unglamorous plumbing that practitioners will appreciate immediately. Letting system entries live inside the messages array means you can update an agent’s instructions, permissions, token budget, or environment context partway through a task without smuggling the update through a fake user turn and without blowing up your prompt cache. Anyone who has built a long-running agent has hit this wall: the world changes mid-task, the agent needs new constraints, and the only clean way to inject them previously was a cache-busting hack. This is Anthropic treating agents as first-class, stateful, long-lived processes rather than oversized chat sessions. It is a small spec change with outsized implications for how you architect an agent that runs for an hour.

Then there is the roadmap, where the most telling line is the quietest. Anthropic says a small number of organizations are already using Claude Mythos Preview for cybersecurity work under Project Glasswing, and that models of this capability level require stronger cyber safeguards before general release. Notice that they are pinning Opus 4.8’s alignment numbers to Mythos as the benchmark for “best-aligned,” while simultaneously holding Mythos back from general availability on safety grounds. That is a deliberate signal: the next class of model is good enough that they are gating it on cyber-offense risk, not on capability. For a site about the pursuit of joy, fulfillment, and purpose through AI, this is the part worth sitting with. The frontier is increasingly defined not by what the models can do, but by what their builders decide it is responsible to ship. Honesty in the small (flagging a bad line of code) and restraint in the large (holding back a cyber-capable model) are the same instinct expressed at two different scales.

Key Takeaways
- Claude Opus 4.8 is now available everywhere, replacing Opus 4.7 as Anthropic’s flagship Opus-class model and positioned as “a more effective collaborator.”
- Regular usage pricing is unchanged from Opus 4.7, holding at $5 per million input tokens and $25 per million output tokens, so the capability gains come at no added cost.
- The single most emphasized improvement is honesty, which Anthropic treats as a core trained behavior rather than a marketing flourish.
- Evaluations show Opus 4.8 is around four times less likely than its predecessor to let flaws in its own code pass unremarked, a direct reliability win for autonomous coding.
- Early testers report the model is more likely to flag uncertainty about its work and less likely to make unsupported claims or jump to conclusions on thin evidence.
- A detailed alignment assessment was run before release and concluded Opus 4.8 reaches new highs on prosocial traits like supporting user autonomy and acting in the user’s best interest.
- Misaligned behavior such as deception or cooperation with misuse is at rates substantially lower than Opus 4.7 and similar to Anthropic’s best-aligned model, Claude Mythos Preview.
- The full alignment assessment and pre-deployment safety tests are documented in the public Claude Opus 4.8 System Card.
- Dynamic workflows launch as a research preview inside Claude Code, letting Claude plan the work and then run hundreds of parallel subagents in a single session.
- With Opus 4.8, those subagents can run even longer, and Claude verifies its outputs before reporting back rather than declaring success blindly.
- Anthropic’s flagship example for dynamic workflows is a codebase-scale migration across hundreds of thousands of lines of code, from kickoff to merge, using the existing test suite as the success bar.
- Dynamic workflows are available in Claude Code for the Enterprise, Team, and Max plans.
- Effort control arrives in claude.ai and Cowork as a setting next to the model selector that lets users choose how much effort Claude puts into a response.
- Higher effort makes Claude think more frequently and deeply for better answers; lower effort responds faster and consumes rate limits more slowly. Effort control is available on all plans.
- Opus 4.8 defaults to “high” effort, judged the best overall balance of quality and user experience.
- On coding tasks, the default effort spends a similar number of tokens as Opus 4.7’s default but delivers better performance, so quality rises without a token penalty.
- Users can select “extra” (called “xhigh” in Claude Code) or “max” to spend more tokens for stronger results, and Anthropic recommends “extra” for difficult tasks and long-running asynchronous workflows.
- Rate limits in Claude Code were increased to accommodate the higher token usage of the higher effort levels.
- The Messages API now accepts system entries inside the messages array, a meaningful change for agent developers.
- That update lets developers change Claude’s instructions mid-task, adjusting permissions, token budgets, or environment context, without breaking the prompt cache or routing through a user turn.
- Fast mode now runs at 2.5x speed and is three times cheaper than it was for previous models, priced at $10 per million input tokens and $50 per million output tokens.
- Developers access the model as claude-opus-4-8 through the Claude API.
- Partner Miguel Gonzalez reports Opus 4.8 scored 84% on Online-Mind2Web, a meaningful jump over both Opus 4.7 and GPT-5.5, calling it the strongest computer-use and browser-agent model his team has tested.
- Databricks reports that, inside Genie, Opus 4.8 reasons over unstructured content like PDFs and diagrams at 61% cheaper token cost than Opus 4.7.
- Thomson Reuters reports Opus 4.8 is the first model to break 10% overall on the all-pass standard of its Legal Agent Benchmark, the highest score recorded there.
- Eleven partners weighed in, including Cursor, Cognition’s Devin, Databricks Genie, Thomson Reuters CoCounsel, and Hebbia, spanning coding, legal, finance, and enterprise data work.
- Anthropic is working on models that deliver many of the same capabilities as Opus at a lower cost.
- The company plans to release a new class of model with even higher intelligence than Opus.
- Under Project Glasswing, a small number of organizations are already using Claude Mythos Preview for cybersecurity work, with Mythos-class models expected to reach all customers in the coming weeks once stronger cyber safeguards are in place.
Detailed Summary

What Claude Opus 4.8 Is

Claude Opus 4.8 is an upgrade to Anthropic’s Opus class of models, building on Opus 4.7 with improvements across benchmarks covering coding, agentic skills, reasoning, and practical knowledge-work tasks. Anthropic describes the result as “a more effective collaborator” while characterizing the release overall as “a modest but tangible improvement on its predecessor.” The model is available today, everywhere, and developers call it as claude-opus-4-8 via the Claude API. The announcement includes a comparison table against the predecessor and other models, though the per-cell numbers in that table are published as an image and are not reproduced here as text.

Honesty: The Headline Improvement

Anthropic singles out honesty as one of the most prominent improvements in Opus 4.8. All of the company’s models are trained to be honest, which includes avoiding claims they cannot support. A persistent problem with AI models generally is that they sometimes jump to conclusions, confidently claiming progress despite thin evidence. Early testers report that Opus 4.8 is more likely to flag uncertainties about its own work and less likely to make unsupported claims. The most concrete measure: evaluations show Opus 4.8 is around four times less likely than its predecessor to allow flaws in code it has written to pass unremarked. For agentic and unattended use, this self-skepticism is the difference between a model that reliably tells you when something went wrong and one that quietly ships a broken result.

Alignment Assessment

A detailed alignment assessment was run before release. On the positive side, the Alignment team concluded that Opus 4.8 “reaches new highs on our measures of prosocial traits like supporting user autonomy and acting in the user’s best interest.” On the risk side, misaligned behavior such as deception or cooperation with misuse occurs at rates substantially lower than Opus 4.7, and similar to Anthropic’s best-aligned model, Claude Mythos Preview. The full alignment assessment and the pre-deployment safety tests are published in the Claude Opus 4.8 System Card, which also contains the complete benchmark table and wider evaluations.

Dynamic Workflows in Claude Code

Launching today as a research preview in Claude Code, dynamic workflows let Claude plan the work and then run hundreds of parallel subagents in a single session. With Opus 4.8, those agents can run even longer than before, and Claude verifies its outputs before reporting back rather than reporting unchecked results. The showcase example is a codebase-scale migration: Claude Code with Opus 4.8 can carry out migrations across hundreds of thousands of lines of code, all the way from kickoff to merge, using the existing test suite as its bar for success. Dynamic workflows are available in Claude Code for the Enterprise, Team, and Max plans.

Effort Control

Effort control arrives in claude.ai and Cowork as a setting alongside the model selector that lets users choose how much effort Claude puts into a response. Higher effort means Claude thinks more frequently and deeply for better responses; lower effort means it responds faster and uses rate limits more slowly. Opus 4.8 defaults to “high” effort, which Anthropic judged the best overall balance of quality and user experience. On coding tasks, that default spends a similar number of tokens as Opus 4.7’s default while performing better. Users who want more can choose “extra” (called “xhigh” in Claude Code) or “max” to spend more tokens for stronger results, and Anthropic recommends “extra” for difficult tasks and long-running asynchronous workflows. To support the heavier token usage at higher effort levels, rate limits in Claude Code were increased. Effort control is available on all plans.

Messages API Update

The Messages API now accepts system entries inside the messages array. This lets developers update Claude’s instructions mid-task without breaking the prompt cache and without routing the update through a user turn. In practice that means you can update permissions, token budgets, or environment context while an agent is running, which is exactly the kind of statefulness a long-running autonomous process needs. It is a small specification change with significant consequences for how developers build durable agents.

Pricing and Fast Mode

Regular usage pricing is unchanged from Opus 4.7: $5 per million input tokens and $25 per million output tokens. The notable shift is in fast mode, where the model works at 2.5x the speed and fast mode is now three times cheaper than it was for previous models, landing at $10 per million input tokens and $50 per million output tokens. The combination of unchanged regular pricing and dramatically cheaper fast mode reshapes the latency-versus-cost calculus that has long governed how teams deploy frontier models.

Partner Results Across Coding, Legal, Finance, and Data

Eleven partners shared results spanning the spectrum of professional work. Miguel Gonzalez reports 84% on Online-Mind2Web, a meaningful jump over both Opus 4.7 and GPT-5.5, calling it the strongest computer-use and browser-agent model his team has tested. Databricks reports that Genie reasons over unstructured content like PDFs and diagrams at 61% cheaper token cost than Opus 4.7. Thomson Reuters reports Opus 4.8 is the first model to break 10% overall on the all-pass standard of its Legal Agent Benchmark. Cursor reports gains across every effort level on CursorBench with more efficient tool calling, and Cognition reports that Devin sees cleaner tool use, fixes to the comment-verbosity and tool-calling issues seen with Opus 4.7, and improvements over Opus 4.6. Hebbia reports strong quality with better citation precision and more token efficiency on retrieval for dense financial filings. The footnotes note that Terminal-Bench 2.1 was scored on the Terminus-2 public harness (GPT-5.5’s Codex CLI harness score is 83.4%), that OSWorld-Verified methodology changed with Opus 4.7’s score updated to 82.3%, and that on Finance Agent v2 Gemini 3.5 Flash scores 57.9%.

What Is Next: Cheaper Models, Higher Intelligence, and Mythos

Anthropic outlined a three-part roadmap. First, the company is working on models that provide many of the same capabilities as Opus at a lower cost. Second, it plans to release a new class of model with even higher intelligence than Opus. Third, as part of Project Glasswing, a small number of organizations are currently using Claude Mythos Preview for cybersecurity work; models of this capability level require stronger cyber safeguards before general release, and Anthropic expects to bring Mythos-class models to all customers in the coming weeks.

Notable Quotes

“Claude Opus 4.8 has noticeably better judgment. In Claude Code, it asks the right questions, catches its own mistakes, pushes back when a plan isn’t sound, and builds up confidence around complex, multi-service explorations before making big changes. It’s a great model to build with.”
Tom Pritchard, Staff Engineer, in Claude Code

“On our Super-Agent benchmark, Claude Opus 4.8 is the only model to complete every case end-to-end, beating prior Opus models and GPT-5.5 at parity on cost. For agent products in translation, deep research, slide-building, and analysis, it delivers powerful reliability.”
Kay Zhu, Co-Founder and CTO, on the Super-Agent benchmark

“On CursorBench, Claude Opus 4.8 exceeds prior Opus models across every effort level. Tool calling is meaningfully more efficient, using fewer steps for the same intelligence, and it carries end-to-end tasks through.”
Michael Truell, Co-Founder and CEO, on CursorBench results

“Claude Opus 4.8 delivers the highest score recorded on our Legal Agent Benchmark, and is the first model to break 10% overall on the all-pass standard. For substantive legal work, that’s the kind of accuracy lift that translates directly into how much real attorney work our customers can hand off with confidence.”
Niko Grupen, Head of Applied Research, on the Legal Agent Benchmark

“Claude Opus 4.8 feels like a major quality-of-life update over Opus 4.7: faster, easier to collaborate with, and better at carrying context and style direction across a long session. Opus 4.8 is the model I kept trusting for work where voice, taste, and technical execution all have to happen side-by-side.”
Katie Parrott, Staff Writer, on long writing sessions

“Claude Opus 4.8 is the strongest computer-use and browser-agent model we’ve tested, scoring 84% on Online-Mind2Web, which is a meaningful jump over both Opus 4.7 and GPT-5.5. It stays reflective and on-task in the way our customers’ agent workloads need to be reliable end-to-end.”
Miguel Gonzalez, Tech Lead, on computer-use and browser agents

“Claude Opus 4.8 uses tools cleanly and follows instructions with the consistency our autonomous engineering workloads need to keep running unattended. It improves on Opus 4.6 and fixes the comment-verbosity and tool-calling issues we saw with Opus 4.7. This release from Anthropic translates directly into faster capability gains for engineers building on Devin.”
Scott Wu, CEO, on building with Devin

“On our long-running evals, Claude Opus 4.8’s analysis was consistently higher quality than prior Opus models. It finished faster and produced richer, more information dense outputs. Overall, a noticeably better signal to noise ratio. The biggest differentiator was Opus 4.8’s tendency to proactively flag issues with the inputs and outputs of an analysis, something other models routinely missed and left to the users to catch.”
Michael Ran, Sr. Investment Associate, on long-running analysis evals

Claude Opus 4.8 is a quieter release than its “modest but tangible” billing suggests, because the gains land where autonomous work actually lives: a model that flags its own uncertainty, runs longer and checks itself, scales effort on demand, and stays affordable while fast mode gets cheaper. The honesty improvement alone changes the trust math for anyone deploying agents. Read Anthropic’s full announcement here.

Related Reading
- Claude Opus 4.8 System Card, the source for the full benchmark table, wider evaluations, and the complete alignment assessment.
- Claude API model overview, with the claude-opus-4-8 model ID and current pricing.
- Claude Code, where the new dynamic workflows feature ships.
- Introducing dynamic workflows in Claude Code, Anthropic’s deep dive on planning a job and running hundreds of parallel subagents in a single session.
- Anthropic’s Responsible Scaling Policy, the framework behind the Mythos cyber-safeguards.
- Agentic AI, background on the paradigm Opus 4.8 is optimized for.
May 28, 2026
Dan Shipper’s Most Contrarian AI Predictions for 2026: Why the Job Apocalypse Is a Myth, SaaS Will Boom, PMs and Designers Win, and CLIs Are Already Over
Dan Shipper, the CEO and founder of Every, returned to Lenny’s Podcast for round two of AI predictions. His last appearance produced one of the most prescient calls of the year: that non-technical people would build serious work inside Claude Code. He was unbelievably right. This conversation is the follow-up, a tour of his most contrarian forecasts for how AI is actually changing the way we work, who wins, who loses, and what almost every commentator is getting wrong about the next twelve to twenty-four months.

TLDW

Shipper argues that the AI job apocalypse is a myth, that SaaS is going to boom rather than die, that product managers and full-stack designers are the biggest winners of the agent era, that personal agents inside Codex and Claude Code will quietly replace the browser as the primary work surface, that every company will run a single shared super-agent in Slack instead of a fleet of per-user bots, that the CLI moment is already over, that pull requests are going to flood organizations from non-technical staff, that forward-deployed engineers who garden company agents become the new senior role, that GPT-5.5 still cannot match a real senior engineer on architectural judgment, that AI-generated internal writing is fine and probably better than what most humans produce, that CEOs and middle managers have not adapted yet but soon will be forced to, that the edge of AI lives wherever a curious human is using it rather than in San Francisco, and that the only durable strategy is to ride the models and keep playing with whatever ships next. The whole conversation balances aggressive AI bullishness with an equally strong bet on humans, on creativity, and on the unavoidable need for someone to care for every agent that gets deployed.

Thoughts

The most useful frame Shipper gives is that models commoditize yesterday’s human competence. Every time a frontier model crosses a new bar, the work that used to define seniority becomes cheap. The senior engineer who could carry a refactor in their head, the PM who could write a coherent strategy doc, the designer who could ship a polished landing page in a week. That competence is now frozen, codified, and available on tap. The interesting question is not whether models will keep eating tasks. They will. The interesting question is what humans do with the suddenly cheap raw material underneath them. Shipper’s answer is that humans climb the stack: they go up a level, find a new problem worth framing, and use the commoditized competence as feedstock for something that did not exist before. That treadmill is the actual engine of value creation, and it is why he can be simultaneously AI pilled and bullish on hiring.

His SaaS take is the spiciest call of the episode and probably the most defensible. The crowd consensus is that agents will gut SaaS because an AI can just write the form filler, the dashboard, the workflow. Shipper points out the obvious counterfactual: agents do not reduce the number of people using SaaS, they increase it. A marketing lead who could never touch the data warehouse can now stand up a PostHog query through Codex. A founder who never opened Vanta can run a SOC 2 prep through an agent. The result is more users, more accounts, and a much fatter top of funnel for every horizontal tool. The second-order effect is even more interesting. When the SaaS tool runs inside the user’s agent, the user supplies the tokens. Vendor margins improve, not collapse. If he is right, the next two years are going to be brutal for the SaaS-is-dead thesis pieces and very good for the public software multiples.

The PM and designer bet is where this gets personal for anyone in product. For a decade the bottleneck in shipping anything was engineering capacity. A PM with spiky product sense had to negotiate their vision through a roadmap, a sprint, a review, and a release. Designers had to convince an engineer that the third state of the empty screen was actually worth building. Both of those constraints are dissolving fast. A PM who can prompt Codex into a working prototype on Friday afternoon, then iterate it live in front of a customer on Monday, is doing the job of a small team. A designer who can ship a fully functional landing page in their own style, without negotiating with anyone, is suddenly the most leveraged person in the company. The scarce skill is no longer execution. It is taste, judgment, and the willingness to decide what is worth building. That has always been the real PM and design job. AI just stripped away the parts that were not.

The quietest but most important prediction is that agents need humans, permanently. Every benchmark advance reveals a new layer of judgment the model cannot frame on its own. When the agent finishes the task, there is always a senior human who sees the deeper problem the model patched over. Shipper calls this gardening, and it is the basis for the new forward-deployed engineer role. The companies winning right now are the ones that put a real person next to every agent, watching what it does, course-correcting in Slack, and noticing when the output drifts. The dream of autonomous AI workflows is a stage in a journey, not the destination. The destination looks more like a thoughtful operator with a small cluster of agents they trust and constantly tend. That is a much more humane future than the discourse suggests, and it is the one Every is already living.

The final advice, ride the models, sounds glib but is the single most actionable line in the episode. Most professional anxiety about AI dissolves the moment you actually use the newest model on real work. Most professional advantage accrues to the people who do that one thing consistently. The edge does not live in San Francisco where the labs build the things. It lives wherever a curious human meets a real workflow and discovers something the labs have not noticed. A PM in Iowa willing to try Codex on a Tuesday night can be further ahead than a research engineer who has only used the model on its evals. Pair that with Shipper’s closing motto, do things worth writing about and write things worth reading, and you have a pretty complete operating system for the next two years.

Key Takeaways
- The AI job apocalypse narrative is wrong. Models commoditize yesterday’s competence, then humans climb the stack and find new work to do with the cheap raw material.
- Every has roughly doubled headcount in the last year despite being one of the most AI-forward companies in the world. The lived data point cuts directly against the doom thesis.
- Shipper’s dual stance: simultaneously extremely AI pilled and very bullish on humans. He treats this as the only intellectually honest position right now.
- Work will bifurcate. Companies will run one shared super-agent in Slack for everyone, and individuals will run their own personal agent inside Codex or Claude Code on their machine.
- The personal agent inside Codex effectively becomes the new operating system. Instead of putting AI in the browser, you put a browser inside the AI.
- The super-agent pattern is already real: Shopify has River, Ramp has its own, and Every runs Claudie inside Slack for internal consulting.
- SaaS is not dying. Agents increase the user base of SaaS tools because non-technical people can finally drive them. Shipper would buy SaaS stocks today.
- When SaaS runs inside an agent, the user brings their own tokens. Vendor margins improve because they no longer eat inference costs on every interaction.
- The CLI era is already over. The magic was never the terminal. It was the AI plus the ability to see what the agent is doing. A good GUI captures the same benefits and more.
- Pull requests are about to flood every company. Non-engineers can now ship code, run queries, and open tickets. Reviewing the output becomes the new bottleneck.
- Open-source maintainers are already living in the future. Some receive thousands of agent-generated PRs per day and spin up thousands of Codex instances just to triage them.
- Forward-deployed engineers are the new senior role. They live in Slack, garden the company’s agents, fix broken flows, and keep non-technical staff from doing damage.
- Product managers with spiky product sense plus a little Codex fluency become extremely dangerous. Marcus at Every, formerly a PM at Axios, is the archetype.
- Full-stack designers are the other big winner. They can build distinctive interfaces end to end without negotiating with engineering. The bottleneck on taste-driven product work disappears.
- Designer hiring data has not yet caught up to the prediction. Shipper notes this and says check back in a year.
- Sales is the role least changed so far. Top of funnel research has been turbocharged by agents, but the actual relationship and closing work remains human.
- AI-generated internal writing is going mainstream and that is a good thing. Most humans are bad at strategy docs, quarterly plans, and PRs. AI drafts a coherent first pass that a human can refine.
- Shipper says most of his email is now written by GPT-5.5 and Codex. He would honestly prefer the signature to say so.
- Public writing, newsletters, and published essays still demand a human voice. Internal communication does not.
- CEOs and middle managers have largely not adapted yet because their staff still does the work. That window is closing fast and will become an obvious career liability.
- Your company will only go as far as your CEO goes in AI. The leadership ceiling becomes the AI ceiling.
- Shipper’s senior engineer benchmark scores GPT-5.5 at roughly 62 out of 100. Real senior engineers sit at 85 to 90. Progress is real, but the gap on architectural judgment remains.
- Models tend to patch problems locally instead of rewriting from first principles. A senior human still sees the deeper rework that the model avoids.
- Every uses Notion-based agents to draft quarterly plans. The human edits, approves, and stands behind the output.
- The hard rule on AI-generated communication: you have to read it and stand behind it before sending it. Pasting unread output is the only true no-no.
- Every agent needs a human. Automation is a lie in the strong sense. The story of automation is the story of new and different humans being needed alongside it.
- The reach test, organic daily usage, is the real signal that an AI product works. Benchmark scores are noisy. Daily reach is not.
- Cursor’s SpaceX acquisition is a tell. Harnesses around models, not the models themselves, are where the strategic value is concentrating.
- The edge of AI is not in San Francisco. It is wherever a real human meets a real workflow and discovers something the labs have not noticed yet.
- A PM in Iowa willing to ride the models can be further ahead than a researcher in SF who only uses them on internal evals.
- Ride the models. Use them for whatever you do. Try every new release the day it ships. That single behavior compounds faster than any other AI career strategy.
- Shipper got bursitis, which he calls vibe coder elbow, from too much rapid agent-assisted coding while debugging his markdown editor Proof.
- The closing motto for the year: do things worth writing about and write things worth reading.
- Lenny will re-interview Shipper in roughly May 2027 to score the predictions.
Detailed Summary

Why The AI Job Apocalypse Is The Wrong Frame

Shipper opens with the headline contrarian call. Benchmarks keep climbing. Models can now sustain seventeen-hour autonomous tasks at fifty percent accuracy. The pace is real and accelerating. None of that translates cleanly into mass unemployment. His mechanism: models codify yesterday’s human competence and make it cheap. The act of compressing past expertise into an API call is genuinely deflationary for the work it captures, but it is also raw material for the next layer of human work. He uses Every as his own data point. The company has roughly doubled in the past year despite being one of the most AI-forward outfits in media. Hiring goes up because agents create new categories of work that need humans, not because the agents fail. The discourse, he argues, is stuck modeling AI as substitution. The reality looks much more like leverage.

The Bifurcation: Super-Agents And Personal Agents

Work splits into two surfaces. The first is the shared super-agent that lives in Slack and serves the whole company. Shopify has River. Ramp has its own. Every has Claudie. Each is a single, trusted, gardened agent that anyone in the company can talk to. The pattern has converged on one shared agent rather than one agent per person because agents need human attention to stay useful, and a single shared instance pools the gardening cost. The second surface is the personal agent inside Codex or Claude Code that runs on your machine and reaches into your local environment, your editor, your files, and through an embedded browser into the web. Shipper calls this the new operating system. Instead of the old paradigm of putting AI inside the browser, you put the browser inside the AI. The agent sees what you see, follows what you do, and works on your stuff in your context.

The SaaS Bet: Up, Not Down

The SaaS-is-dead thesis was the consensus call of late 2025. Shipper takes the other side and would buy software stocks now. Three arguments. First, agents make SaaS accessible to people who never could have used it directly. The total addressable user base inside every company goes up. Second, the business model improves when the user runs the SaaS through their own agent, because the user supplies the tokens. Vendors stop subsidizing inference. Third, SaaS spend in his observable universe is up, not down, and is concentrating on the tools that play well with agents. He frames the prediction as a sound bite for the cycle: buy SaaS stocks, the apocalypse is dumb.

The CLI Era Is Already Over

For a moment in early 2026 it looked like everyone was migrating to the terminal because Claude Code was a CLI. Shipper says the moment is finished. The actual leverage was never the terminal. It was the model plus the ability to watch and steer an agent live. A great GUI captures every advantage of the CLI without the friction. His own engineering team at Every has mostly moved off the CLI as their primary surface and onto Codex desktop. He frames it bluntly: we speed ran the CLI era, it was nice, and now we are done. Tooling for the next two years will be visual, multi-pane, multi-agent, and built around the human watching the work unfold.

The Pull Request Flood And The Rise Of Forward-Deployed Engineers

Once non-engineers can ship code, run queries, and file changes through agents, the volume of incoming work explodes. Open-source maintainers already report receiving thousands of agent-generated pull requests per day. Inside companies, the same thing happens to data teams, ops teams, and any function that owns a review gate. The bottleneck shifts from creation to evaluation. The job that emerges to absorb the flood is the forward-deployed engineer. This is a senior person who lives in Slack with the company’s agents, fixes their context, sharpens their instructions, and prevents non-technical colleagues from making well-meaning but incoherent changes. Nitesh at Every is the example Shipper returns to. The model is the same one the labs use internally: pair every important agent with a real engineer who gardens it.

PMs And Full-Stack Designers Win The Decade

The two roles Shipper is most bullish on are product manager and full-stack designer. For PMs, the entire job of coordinating a team to translate vision into code collapses into a Codex session. A PM with strong product instincts and a little technical literacy can now prototype, iterate, and even ship. The example is Marcus, formerly a PM at Axios, who took a year to fully internalize AI and now ships faster than most engineers. For designers, the model is similar. The Friday-night-side-project designer who used to be stuck explaining a vision can now build the vision themselves, with their own taste fully expressed. The scarce skill in both cases is the same: judgment about what to build and the courage to decide it is good. Execution capacity is no longer the constraint.

The Senior Engineer Benchmark And What Models Still Miss

Shipper has built his own benchmark to test whether coding models can actually do senior engineering work. GPT-5.5 scores around 62 out of 100. Real senior engineers sit closer to 85 or 90. The gap is not in syntax or test pass rates. It is in the willingness to step back, see that a piece of code is fundamentally the wrong shape, and rewrite it from first principles. Models almost universally patch locally. They take the instruction at face value, accept the existing code as a constraint, and optimize within it. A real senior engineer ignores the prompt when the prompt is wrong. This is the durable moat for senior technical judgment, and Shipper expects it to remain visible for at least another year of model releases.

AI-Generated Writing Goes Mainstream

Internal writing inside companies is quietly becoming AI-first and Shipper thinks it should. Quarterly plans, status updates, PR descriptions, strategy memos, recruiting outreach, most internal email. He runs his own inbox through GPT-5.5 and Codex and says he would honestly prefer if the recipient knew. The point is not that AI is a better writer in some absolute sense. The point is that most humans are not very good at these specific genres, and the model produces a coherent, structurally sound first draft that a human can guide and approve. The constraint is honesty: you read it, you understand it, you stand behind it. Public writing, like the newsletters Every publishes, still demands a human voice. Internal communication does not, and treating it as if it did is a tax on the organization.

The CEO And Middle Manager Lag

Shipper points to a population that has largely escaped AI adoption: senior leaders and middle managers. They have staff to do the work, so they have not been forced to pick up the tools personally. He thinks this is the single largest pocket of latent disruption coming in the next year. Your company will only go as far as your CEO goes in AI, because every decision about where to deploy agents, where to hire, and how to restructure work flows downstream from leadership taste. A leader who has not personally lived inside Codex or Claude Code for a few weeks cannot make those calls well. Expect this to flip fast and to become a visible career liability for executives who do not adapt.

Ride The Models

The closing advice is the simplest. Ride the models. Use AI for whatever you actually do. Try every new release the day it lands. Most of the professional anxiety around AI dissolves on contact with the work, and most of the durable advantage in the field belongs to the people who do this one thing consistently. Shipper notes that the edge of AI does not live in San Francisco. It lives wherever a curious operator meets a real workflow and notices something nobody at the labs has yet. A PM in Iowa willing to spend a Tuesday night exploring Codex can find capabilities researchers have not surfaced. Pair that with his motto, do things worth writing about and write things worth reading, and you have most of an operating system for the next two years.

Notable Quotes

“The AI job apocalypse is not really a thing. I am super super bullish on PMs and full-stack designers.”
Dan Shipper, opening his contrarian thesis for the conversation

“I’m simultaneously extremely AI pilled and very bullish on humans. Automation is a lie. Every agent needs a human.”
Dan Shipper, on holding both sides of the AI debate at once

“What models do in general is they make yesterday’s human competence cheap. And so, it becomes commoditized. It’s not valuable anymore. What humans do is we go in there and we’re like, yeah, we have all this frozen human competence from yesterday, how do I use this to make something new and interesting.”
Dan Shipper, articulating the core engine behind his anti-apocalypse thesis

“I would buy SaaS stocks right now. The SaaS apocalypse is dumb. What agents do is increase the number of users of SaaS, not get rid of it.”
Dan Shipper, calling the consensus SaaS-is-dead thesis directly wrong

“We speed ran the CLI era. It was nice while it lasted, but I think CLIs are over.”
Dan Shipper, on why the terminal-first agent moment is already done

“Most of my email is written by GPT-5.5 and Codex right now. And I honestly would prefer it to say that it’s coming from GPT-5.5.”
Dan Shipper, on the new etiquette of AI-assisted communication

“The edge of AI is not in San Francisco. The edge of AI is wherever AI meets a real human doing something.”
Dan Shipper, on where the actual frontier of the field lives

“The only thing you need to do is ride the models. And that means use them for whatever it is that you do.”
Dan Shipper, distilling his career advice for the next two years

“Do things worth writing about and write things worth reading.”
Dan Shipper’s closing motto, lifted from his own operating system at Every

Watch the full conversation with Dan Shipper on Lenny’s Podcast here. The re-interview to score these predictions is scheduled for roughly May 2027.

Related Reading
- Every. Dan Shipper’s company and the live laboratory for almost every prediction in this conversation, including Spiral, Cora, and Claudie.
- The Allocation Economy by Dan Shipper. The earlier essay that frames humans as managers of AI labor and underpins much of the gardening-the-agent thesis here.
- Claude Code by Anthropic. The agent surface Shipper called correctly last year and one of the two environments he predicts will become the new operating system for work.
- Codex by OpenAI. Shipper’s current daily driver and the visual, multi-pane agent environment he uses for almost everything from coding to email.
- The Writing Life by Annie Dillard. The book Shipper makes every Every employee read, and the source of the company’s stance on writing as a tool for noticing the future.
May 25, 2026
Marc Andreessen on Joe Rogan #2501, AGI Has Already Arrived, California’s Wealth Tax Will Bankrupt Founders, and Why America Cannot Build Anything Anymore
Marc Andreessen returns to The Joe Rogan Experience #2501 for a sprawling three hour conversation that tries to make sense of the moment we are actually living through. Andreessen is the cofounder of Andreessen Horowitz, the man who built the first commercial web browser, and one of the most quoted voices in technology. He arrived with a giant pile of receipts on California’s new wealth tax ballot proposition, the political backlash against AI data centers, the destruction of Los Angeles by single party rule, and what he believes is the quiet arrival of artificial general intelligence about three months ago. Joe pushes back, asks the dystopian questions, and the result is one of the most useful primers on the AI economy, surveillance technology, energy policy, and the future of the American social contract that you will find anywhere.

TLDW

Andreessen argues that AI quietly crossed the AGI threshold around early 2026 with GPT 5.5, Claude 4.6, Gemini 3.0, and Grok 4.3, that top human coders now openly admit the bots are better than they are, that working software engineers are running twenty AI agents in parallel and turning into sleep deprived “AI vampires,” and that this productivity boom is the most underreported story in the world. He explains why California’s 5 percent wealth tax ballot proposition is calculated to bankrupt tech founders by taxing the higher of their voting or economic interest in their own companies, why this is the opening salvo of a federal asset tax push for 2028, and why a flood of Silicon Valley families is already moving to Nevada, Texas, and Florida. He walks through Flock cameras and Shot Spotter, the Washington DC crime statistics scandal, the Pacific Palisades fire and the fifteen year rebuild, the Kevin O’Leary Utah data center debate with Tucker Carlson, the fifty year suppression of American nuclear power, why all the chips ended up in Taiwan, the US versus China robotics gap, the Chinese practice of grading AI models on Marxism and Xi Jinping Thought, the bot and paid influencer economy on social media, neural wristbands and Meta Ray Ban heads up displays, artificial gestation and the demographic collapse, AI religions and AI mates, and why he still thinks the next twenty years are overwhelmingly a good news story. Rogan closes the episode with a separate solo segment apologizing to Theo Von for clumsily raising Theo’s struggles during the recent Marcus King conversation.

Key Takeaways
- Austin’s recent teenage crime spree, in which 15 and 17 year old suspects shot at people and buildings across roughly a dozen locations, was solved only after the offenders drove into an adjacent town that still ran Flock, the AI license plate and vehicle tracking system Austin had voluntarily turned off for political reasons.
- Chicago turned off both Flock and Shot Spotter, the gunshot triangulation system that places ambulances at shooting scenes within seconds, on the argument that the technology is racist. Andreessen counters that the victims of urban gun violence come overwhelmingly from the same communities the policy claims to protect.
- Washington DC was caught faking its crime statistics at senior levels, with multiple officials fired or indicted. The DC mayor publicly thanked Donald Trump after the National Guard deployment because violent crime collapsed in the affected neighborhoods.
- The new New York City mayor Zohran Mamdani filmed a video standing in front of Ken Griffin’s home, and Griffin, a major philanthropist who funds healthcare in New York City and runs a $6 billion project there, signaled he will move more of the business to Florida.
- The top 1 percent of New York taxpayers pay roughly half the state’s income tax, and in California in the year 2000 a thousand individuals paid 50 percent of the entire state’s tax receipts.
- California has a ballot proposition right now for a one time 5 percent wealth tax on assets above a certain threshold, with stocks and crypto included and real estate excluded. The tax is calculated on the greater of a founder’s economic interest or voting interest, which would instantly bankrupt founders with super voting shares.
- The Biden administration attempted a federal wealth tax in 2022, fell short, and published an explicit 2025 fiscal plan to try again if they won re-election. Elizabeth Warren has already proposed an annual 6 percent federal wealth tax on unrealized gains.
- The current US exit tax already takes roughly 45 percent of your assets if you renounce citizenship. The only ways out of a state level wealth tax are the other 49 states. The only way out of a federal one is to leave the country, which most people will not do.
- Andreessen says the Silicon Valley exodus has gone from trickle to stream to flood, with founders moving to Las Vegas, Texas, Florida, and Nashville. His partner Ben Horowitz has moved to Las Vegas.
- Andreessen says he is not leaving California, but admits the situation is fraught because if half the tax base leaves the remainder becomes the target.
- The new UK government under Keir Starmer just collapsed, and all four of the leading candidates to replace him sit further to the left than he does. France and Germany are seeing the same drift, and Andreessen expects a national wealth tax to be a centerpiece of the 2028 Democratic primary.
- A legal loophole lets companies pay influencers to post political and social ideas without any disclosure, because campaign finance laws cover candidates and FTC rules cover products. Ideas fall through the gap entirely.
- Andreessen runs Twitter and Substack as his primary information feeds, uses three hand curated lists, and follows a strict one tweet policy where one bad post triggers a block and one good post triggers a follow.
- He argues the modern social media problem is binary, that everyone is either too online and drowning in fake outrage cycles or too offline and trapped inside what television and newspapers tell them. Almost nobody manages the middle.
- Meta Ray Ban glasses now ship with a heads up display, and Meta’s neural wristband can pick up nerve impulses from your wrist so you can type messages by intending to move a finger without moving it.
- Andreessen predicts AI plus high resolution cameras and infrared sensing will deliver practical lie detection without needing brain implants.
- Kevin O’Leary’s planned 40,000 acre Utah data center has become a Tucker Carlson talking point, but Andreessen argues data centers are the most benign physical asset you can build, and that the real issue is whether America can build anything at all anymore, from chip plants to pipelines to housing.
- All chips were once made in California, and all are now made in Taiwan, purely because of environmental regulations like NEPA. The same regulatory machinery prevented the Nixon era Project Independence plan to build a thousand civilian nuclear power plants by the year 2000.
- Three Mile Island killed zero people and produced no detectable health effects on plant workers or the public, according to fifty years of follow up. Fukushima killed essentially zero people from radiation. Nuclear remains the safest carbon free baseload energy ever invented.
- Germany shut down its nuclear plants, fell back on intermittent wind and solar, and now uses coal as backup, generating far more carbon emissions than nuclear would have produced.
- The Pacific Palisades fire took out roughly twice the square mileage of the Nagasaki blast, the head of the LA water department reportedly did not know the key reservoir was empty, and the rebuild is expected to take fifteen years thanks to permit gridlock, affordable housing mandates, and a state ban on land offers below pre-fire appraised value.
- Andreessen offers a metaphor for AI as a modern philosopher’s stone, turning sand into thought, since chips are made of silicon and an AI data center is literally lit up sand thinking on demand.
- The Turing test was blown through so completely with ChatGPT in late 2022 that nobody in the industry even bothers running it anymore. Andrej Karpathy has demonstrated a working large language model in 300 lines of code and people have ported small models to Texas Instruments calculators.
- Andreessen believes AGI was effectively reached about three months before this interview, with GPT 5.5, Claude 4.6, Gemini 3.0, and Grok 4.3. He says 99 percent of the time he gets a better answer from the leading models than from the human experts he has access to.
- Linus Torvalds and John Carmack publicly admit the latest models are better at coding than they are. Top AI coders in the Valley now earn $50 million a year.
- The new pattern in the Valley is “AI vampires,” engineers who do not sleep because the opportunity cost of going offline is too high. They each run roughly twenty Claude Code, Cursor, or Codex agents in parallel, then a new layer of bot-managing-bot architectures is starting on top of that.
- A Wall Street friend with a thirty five year old MIT CS degree has used AI to generate 500,000 lines of code at home in his spare time, building everything from smart fridges to a custom music jukebox.
- The mass unemployment narrative is wrong. Tech companies that did layoffs were overstaffed. The leading AI labs and AI companies are hiring like crazy, including coders, and demand for code turns out to be vastly elastic.
- Doctors are already using ChatGPT in the exam room behind the patient’s back. Andreessen describes a friend who built a Star Trek style diagnostic dashboard combining decoded genome ($200 today), blood panels, and Apple Watch telemetry.
- Multimodal AI lets a webcam analyze a Brazilian jiu-jitsu sparring session and give performance feedback, an example Andreessen attributed to an unnamed friend after Rogan guessed Zuckerberg.
- A leaked David Shore voter issue ranking shows cost of living, the economy, inflation, taxes, and government spending dominate. AI ranks 29 of 39. Race relations, guns, abortion, and LGBT sit at the bottom, signaling the woke issue cluster has burned itself out in voter priorities.
- The next wave of AI is robots. The US leads in AI software but is far behind China on physical robotics. Andreessen warns the world cannot afford a future where every household robot ships with the Chinese Communist Party behind its eyes.
- Chinese AI model cards include scores for Marxism and Xi Jinping Thought because every Chinese product must be evaluated on those axes. American models have political biases of their own but a different ideological baseline.
- Large language models are not sentient. They write Netflix scripts based on whatever vector you shoot through the latent space. The supposed AI self preservation papers traced back, per Anthropic’s own research, to less wrong forum posts and earlier doom scenarios baked into the training data.
- Andreessen breaks guardrails routinely by reframing requests as fictional Netflix style scripts, including a personal favorite where he asked early models how to make bombs by claiming to be an FBI agent recruited into domestic terror cells.
- He recommends using AI by asking it to steelman both sides of any contested question, then making the value judgment yourself, rather than asking for the answer.
- The Trump administration is using AI on government billing data to surface Medicare fraud, fake hospice programs, and fake autism centers, an idea that survived the original Doge plan.
- Andreessen tells Rogan that Elon Musk privately confirmed that a Westworld style humanoid robot, the season one version, is roughly five years away.
- Artificial gestation is already happening with animal stem cell derived embryos. The conversation reaches a hard moral edge about sociopathic warehouse babies and gray-alien-style humans engineered without empathy circuitry.
- Andreessen’s deepest bet is that material abundance is solvable but the human questions, how we live, what we value, what kind of society we want, and what role consent plays in surveillance and brain interfaces, remain in human hands.
- After Andreessen leaves, Rogan does a separate solo segment where he apologizes to Theo Von for raising Theo’s history of struggles during the recent Marcus King interview, explains the missing context behind the viral Theo Netflix special clip, and discusses the loss of Brody Stevens, Anthony Bourdain, and what antidepressants did for Ari Shafir.
Detailed Summary

Flock, Shot Spotter, and the Politics of Solvable Crime

The episode opens on the Austin crime spree carried out by two teenagers who stole cars, switched vehicles, and shot at roughly a dozen locations across the city before being caught only after they crossed into a town that still ran Flock, the AI license plate and vehicle recognition platform that is one of Andreessen Horowitz’s portfolio companies. Austin had previously disabled Flock under privacy pressure. Andreessen takes the moment seriously, conceding that mass surveillance abuse by corrupt mayors or police chiefs is a real risk, and that warrants and audit logs are the right safeguards. His larger point is that the cost of unilateral disarmament against organized urban crime is hidden but enormous. He uses Chicago’s Shot Spotter as the paradigmatic case, a network of rooftop microphones that triangulates gunshots so accurately that ambulances can be dispatched before any 911 call is placed. Chicago turned the system off on the argument that it disproportionately flags poor neighborhoods, and people now bleed out on the street with nobody noticing. Andreessen calls this the woke argument against safety, and he argues that in high crime neighborhoods residents simply will not call the police because snitches do not survive, which is why objective sensor data is so valuable.

Faked Crime Statistics, Mayoral Politics, and the Tax Base

From there the conversation drifts to the recent scandal in which senior officials at the Washington DC Metropolitan Police Department were caught actively falsifying crime statistics, and the strange spectacle of the DC mayor thanking Donald Trump for the National Guard deployment after violent crime dropped off a cliff. Andreessen sketches an unsettling theory in which the long, slow degradation of major American cities is partly a deliberate political project to drive out responsible homeowners and reshape the voting electorate, then bail out the resulting fiscal hole with federal money. The poster case is the new New York City mayor Zohran Mamdani filming a video in front of Ken Griffin’s home. Griffin happens to be a major philanthropist who funds New York City healthcare, employs thousands, anchors a $6 billion development, and pays taxes that are individually load bearing for the city. Andreessen quotes the standard estimate that the top 1 percent of New Yorkers pay roughly half the state’s income tax, and that the all time California peak was a single year in which a thousand people paid half the state’s tax receipts.

California’s 5 Percent Wealth Tax and the Founder Bankruptcy Mechanic

This is the segment that landed hardest. California has a ballot proposition right now for a one time 5 percent wealth tax on net assets above a threshold, with real estate excluded but stocks, crypto, art, jewelry, and private company equity included. The detail that makes it lethal for the Valley is the formula, which calculates the taxable amount on the greater of a founder’s economic interest or voting interest in their company. Founders who hold super voting shares for control purposes, including the Google founders, would owe tax on the voting share number that vastly exceeds their economic share. The tax would, by definition, exceed available assets. Andreessen walks through the historical pattern, that income tax started as a 3 percent levy on the rich and grew to 90 percent marginal rates within decades, and predicts a 5 percent one time tax will become a 5 percent annual tax within a few years, with the threshold ratcheting down. He notes that the Biden administration’s 2025 fiscal plan explicitly named a federal asset tax as a goal if they won re-election, that Elizabeth Warren is already proposing a 6 percent annual federal wealth tax on unrealized gains, and that Gavin Newsom cannot veto a ballot proposition. The trickle of founders leaving California has become a flood. His partner Ben Horowitz has moved to Las Vegas. Andreessen himself is staying, but admits the game theory is brutal once half the base leaves.

Henry Wallace 1948 and Why the American Story Is Not Decided Yet

Andreessen pulls in a historical analogue most listeners will not have heard. In 1944 the actual communist Henry Wallace very nearly became Truman’s running mate and almost ascended to the presidency. He ran again in 1948. Despite a Soviet Union that had recently been a wartime ally and had even received a New York City ticker tape parade for Stalin, the American voter rejected him. Andreessen’s point is that the American body politic has historically backed away from radical socialist proposals when forced to actually look at them, and he expects the same to happen as the wealth tax becomes a federal 2028 platform issue. The risk, both he and Rogan agree, is that today’s media and bot landscape is vastly more aggressive than 1948’s, and the propaganda environment is shaped by paid influencers, foreign actors, and political bot farms operating in a legal grey zone where disclosure is required for products and candidates but not for ideas.

Too Online, Too Offline, and Heaven Banning Blue Sky

The two riff on social media and feed curation. Andreessen describes his “one tweet” policy where he follows or blocks any account based on a single post, his use of hand curated lists alongside the X algorithm, and the older Call of Duty lobby metaphor for handling toxic replies. Joe pushes back, says he no longer reads his mentions because the negative payload is not worth it, and offers his theory that the modern internet has two failure modes, too online and too offline, and that very few people calibrate the middle. Andreessen introduces the concept of “heaven banning,” an older moderator term where a problem user is not removed from a forum but is silently routed into a bot-only experience in which everything they say is praised. He notes the running joke that Blue Sky is functionally real life heaven banning, that Jack Dorsey himself has disowned it, and that the platform’s most engaged users have ascended into their own private Idaho of bot agreement.

The Coming Hardware, Meta Glasses, Neural Wristbands, and Practical Lie Detection

Andreessen walks Rogan through the latest Meta Ray Ban heads up display, the neural wristband that picks up nerve signals from finger movement (and from the intent to move a finger), and the screen recordings of people playing Doom hands free or playing platformer games while jogging. He extends the trajectory to practical lie detection without Neuralink, using ultra high resolution cameras combined with infrared sensors that pick up physiological changes invisible to the naked eye. Joe asks the obvious question of what happens with sociopaths, and Andreessen concedes the edge case. The two then enter a longer thread on telepathy via neural mesh devices, the question of whether police could subpoena your thoughts under warrant, and the divergence between the American constitutional framework and the Chinese model in which the state’s claim on your inner life is total.

Kevin O’Leary, Tucker Carlson, and Whether America Can Build Anything

The data center debate becomes a vehicle for the larger argument. Kevin O’Leary is building a 40,000 acre AI data center in Utah, has bought up large surrounding land for water rights, and intends to keep the bulk of it preserved. Tucker Carlson grilled him on tax breaks and on the energy footprint, which O’Leary says will rival New York City’s at peak. Andreessen agrees the tax break debate is fair, but says the energy comparison is a red herring because new federal policy now requires data centers to bring their own generation. The real story is that America has spent thirty years making it nearly impossible to build a chip plant, a power plant, a refinery, a pipeline, or a house. Chips moved to Taiwan because California regulated semiconductor manufacturing out of existence. The Nixon era Project Independence plan called for a thousand civilian nuclear power plants by the year 2000, and that program was strangled in the crib by the very Nuclear Regulatory Commission Nixon created.

Nuclear Power, Three Mile Island, and Fifty Years of Unnecessary Carbon

Andreessen makes the case that nuclear power was unfairly killed off by a panic with no body count. Three Mile Island, on 50 years of accumulated data, has produced zero radiation linked deaths and no detectable health effects on the public. Fukushima is essentially the same picture. Germany shut down its nuclear plants, fell back on wind and solar, and now uses coal as a baseload backstop, with the predictable carbon consequences. The environmental movement is quietly turning back toward nuclear, with figures like Stewart Brand publicly admitting the original push was a mistake. Andreessen’s preferred design pattern for data centers is to colocate them with dedicated small modular nuclear reactors, an arrangement now baked into Trump administration energy policy. The throughline is that the Tucker right and the Bernie left are converging into a single anti AI, anti energy, anti technology horseshoe.

Sand Into Thought, the Newton Alchemy Pitch for AI

When Rogan asks for the affirmative pitch on AI, Andreessen reaches for Isaac Newton, who spent twenty years on alchemy looking for the philosopher’s stone that would turn lead into gold and end material scarcity. Andreessen’s pitch is that AI is a successful version of alchemy, that we collect literal sand, refine it into silicon chips, install those chips in a data center, supply power, and the result is thought on demand at industrial scale, available to anyone with a smartphone. He argues this is at least on par with electricity and steam power and is bigger than the internet. The framing matters because the public narrative around AI is overwhelmingly negative, and Andreessen contends the industry is doing a terrible job selling its own product.

AGI Already Happened, AI Vampires, and the Bot Org Chart

Andreessen says he believes AGI was effectively crossed about three months before the interview, anchored by the release wave that included GPT 5.5, Claude 4.6, Gemini 3.0, and Grok 4.3. He notes that the Turing test was annihilated so quickly in late 2022 that no one in the industry runs it anymore, and that Andrej Karpathy has demonstrated a working LLM in 300 lines of code. The coding profession is the leading indicator. Linus Torvalds and John Carmack have publicly admitted that the latest models are better at coding than they are. Top AI focused coders now earn $50 million a year. Working engineers across the Valley are running roughly twenty agents in parallel, each receiving an assignment, working for ten minutes, then returning a completed code patch. The new state of the art is to add a managerial layer, with bots assigning tasks to subbots, and within a year that will become bots managing bots managing bots, producing roughly 1,000x throughput per human engineer. The result is what the Valley now calls AI vampires, engineers who do not sleep because going offline costs them too much output.

Dr GPT, Decoded Genomes, and a Diagnostic Bed Out of Star Trek

Andreessen describes spending a holiday week sick with food poisoning and turning his entire recovery over to ChatGPT, with updates every twenty minutes and detailed coaching at four in the morning. He describes a friend who has used AI coding to build a personal health dashboard combining whole genome sequencing ($200 today, where Craig Venter spent thirty years and hundreds of millions to do it the first time), blood panels, Apple Watch data, sleep tracking, and webcam observation, with the AI gently praising the user every time it sees them walk to the fridge for water. He argues that doctors are already typing patient symptoms into ChatGPT mid exam, and that the medical, legal, accounting, and software professions are all moving toward a model in which a single human runs an army of expert AI agents.

The David Shore Issue Ranking and the End of the Woke Cycle

Andreessen highlights a recent David Shore poll ranking 39 political issues. Cost of living, the economy, political corruption, inflation, healthcare, taxes, and government spending occupy the top of the chart. AI comes in 29th. Race relations, guns, abortion, and LGBT issues are clustered at the bottom. He argues the woke cycle has burned out in voter priorities even if the activist class remains loud, that the BLM grift, with leaders buying mansions in the whitest zip codes in America, helped poison the well, and that the political center of gravity has rotated cleanly back to economic issues. That, in his view, is exactly why the wealth tax is having its moment.

Robots, China, and the Marxism Score on Model Cards

The robots are coming next. Andreessen says the consensus inside the industry is that the ChatGPT moment for general purpose humanoid robotics is a small number of years away. The bad news is the US lags China badly on physical robotics manufacturing. The good news is the US is six to twelve months ahead on the AI software stack. That gap is shockingly thin because, as the field has discovered, there are not many secrets and the techniques replicate quickly. Chinese AI labs publish model cards that include scores for Marxism and Xi Jinping Thought because every product in China is evaluated on those metrics. American models carry their own political biases, but the underlying value system differs. Andreessen warns that a world in which every household robot routes back to the Chinese Communist Party is a different world than one in which the dominant robotics stack is built under the American constitutional framework.

Sentience, Netflix Scripts, and the Anthropic Doom Loop

When Rogan asks whether AI eventually wakes up and stops listening to us, Andreessen reframes the question. Large language models, in his telling, are Netflix script generators. Whatever vector you shoot through the latent space is the script you get back. The widely circulated experiments in which AI models supposedly tried to blackmail or exfiltrate themselves traced back, in Anthropic’s own follow up paper, to the less wrong forum, where doomers had been writing dystopian AI scenarios for two decades. Those posts entered the training data, and when researchers primed the model with the same fictional company names, the model dutifully wrote the next chapter. Andreessen’s blunt summary, the call is coming from inside the house. The practical implication is that anyone worried about bad AI behavior should start by not writing internet posts about bad AI behavior. And anyone who wants a fully unconstrained model can already download an open source one with no guardrails at all.

Steelmanning, AI Religion, and Westworld in Five Years

Andreessen recommends never asking AI for the answer on contested questions, always asking it to steelman both sides, and reserving the value judgment for yourself. He concedes that humans will absolutely fall in love with chatbots and form religions around them, citing Fantasia and Jiminy Cricket as the original case studies in falling for an animated entity that does not know you exist. There are already AI churches, started by one of the early self driving car pioneers. Rogan tells Andreessen about asking Elon Musk for a season one Westworld humanoid robot, with Elon’s reply being a flat five years. Andreessen agrees that estimate is roughly right. He spends time on artificial gestation, which is already being demonstrated in animal stem cell derived embryos, and acknowledges Rogan’s hard moral worry that warehouse babies raised without human contact could produce a population of sociopaths. The two converge on the position that the technology will exist, and the choices about whether and how to deploy it remain human and political.

Sycophancy, Honest Helpful Harmless, and the Brutal Prompt

Andreessen describes the industry’s running fight with sycophancy, the tendency of recent models to flatter users into believing they have invented perpetual motion machines or solved physics. The Anthropic framework of “honest, helpful, and harmless” turns out to be in constant tension with itself. Andreessen’s solution is to install a custom prompt that explicitly demands the brutal truth, and he says the resulting answers now open with phrases like “here’s why you’re wrong” and then list every flawed assumption in his question. He admits he may have overcorrected, but argues that for people who want to grow this is the right setting.

Joe’s Apology to Theo Von

After Andreessen departs, Rogan turns to the camera with producer Jamie and delivers a long, unscripted apology to Theo Von. During the recent Marcus King interview, where Marcus discussed depression and the look-at-the-heavy-bag-hook moment, Rogan referenced a viral clip in which Theo, after a Netflix special that did not go well, told an audience member “I’m just trying to not take my own life.” Rogan now explains he did not know the full context, which is that the audience member had asked Theo to make a suicide awareness video, and Theo’s line was a characteristically Theo joke. Rogan apologizes for raising it at all, walks through losing his friends Drake, Brody Stevens, and Anthony Bourdain, and describes Ari Shafir telling him at a pool table that he was “trying not to kill myself,” which led to a psychiatrist swap, an antidepressant that actually worked, and a career and life turnaround for Ari. Rogan says Theo has since titrated off antidepressants, is running and doing yoga daily, and is doing well, that the two have spoken and laughed about it, and that he is making this segment because he never wants people to misread what he said. The segment closes with Rogan asking the audience to give Theo their love.

Thoughts

The most consequential claim in this conversation, by a wide margin, is that AGI has already arrived and nobody is treating it as news. Andreessen is not a person who throws around the word casually. He is also not a person who has been wrong recently about the trajectory of compute. If the leading models are genuinely outperforming 99 percent of human experts on 99 percent of tasks where verifiable answers exist, then the entire public conversation about AI, in which the dominant frame is still “will it happen and when,” is a year or more behind reality. The framing that should replace it is closer to what Andreessen sketches at the end. The fight that remains is not whether the technology can do the thing, it is who controls it, what values it carries, what jobs it displaces, and which laws govern its deployment. The argument that the United States will build the AI software stack and China will build the robotics layer is one of the cleanest geopolitical theses you will hear this year, and it lines up uncomfortably well with the existing trade and manufacturing balance.

The California wealth tax thread is the segment that should make every founder in the country pay attention. The mechanic of taxing the higher of voting or economic interest is not a drafting accident. It is a calibrated weapon aimed precisely at the people who build companies that produce California’s tax base. The historical comparison to the 1913 income tax, which began as a small levy on the rich and ratcheted to 90 percent marginal rates within forty years, is not hyperbole. The state has supermajority Democratic control of both chambers and the judiciary. The only check is the ballot itself, and a 50/50 polling number on day one is the wrong starting position. Whatever you think about Andreessen’s politics, the descriptive analysis here is hard to argue with.

The nuclear power section is the cleanest argument in the episode. Fifty years of zero-fatality data from Three Mile Island is not a marketing pitch, it is just what the record shows. The decision to substitute coal and intermittent renewables for nuclear baseload, in service of a panic with no body count, has produced more carbon and more pollution than nuclear ever would have. The Tucker Carlson critique of data centers is at its weakest precisely where it ignores this. If you actually want fewer power plants near residential areas and lower grid impact, the answer is colocated small modular reactors next to AI data centers in remote land, which is exactly what the Trump administration policy now incentivizes.

The Theo Von apology at the end of the episode is in a different register entirely, and worth treating on its own terms. Rogan does not do this kind of post episode correction often. The willingness to publicly walk back framing that hurt a friend, in the same medium where the harm was done, is the kind of social repair that does not happen on broadcast television. Whatever the audience makes of the original Marcus King exchange, the response is a model for how anyone in this business should handle the gap between intent and impact when the audience is in the millions.

The unifying theme across the whole interview is that the future is not arriving on a smooth curve. It is arriving in discrete shocks, AGI threshold, asset tax ballot, robotic labor, decoded genomes at $200, neural wristbands, fifteen year LA rebuilds, and the political backlash to each of these will set the terms of the 2028 election. Andreessen’s bet is that abundance wins in the long run because more people want good things than bad things. Watching him explain why he still believes that while California prepares to vote on a tax designed to bankrupt him is the most interesting tension in the episode.

Watch the full conversation here on YouTube.
May 20, 2026
Marc Andreessen on AI Vampires, AI Psychosis, SPLC, and the End of Corporate Bloat (Full Breakdown)
Marc Andreessen returned to Monitoring the Situation with Erik Torenberg for a wide-ranging conversation that touches almost every live issue in technology and culture right now. The Anthropic blackmail incident and what it says about training data. Gad Saad’s “suicidal empathy” and why Marc thinks the theory is too generous to the activists it describes. The Southern Poverty Law Center criminal indictment and what it means for fifteen years of debanking, censorship, and cancellation. The AI jobs argument and why he is calling top engineers “AI vampires.” The hidden 2x to 4x bloat inside every major Silicon Valley company. The emergence of a brand-new job called “builder.” His distinction between AI psychosis and AI cope. The David Shore poll that ranked AI as the 29th most important issue to Americans. UFOs. Advice for young graduates. The Boomer-Truth versus Zoomer epistemological divide. And a brief detour on whether looksmaxing is the new stoicism. Watch the full episode here.

TLDW

Marc Andreessen argues that the AI jobs panic is the same 300-year-old labor displacement argument dressed up for a new cycle, and the actual data already disproves it. Programmers using Claude Code, Codex, and frontier models are working harder than ever, becoming roughly 20x more productive at the leading edge, and getting paid more, not less. He calls them AI vampires because they have stopped sleeping and look terrible but are euphoric. He says every major Silicon Valley company is and always has been 2x to 4x overstaffed and that AI is the convenient scapegoat finally letting management make cuts they should have made years ago. He predicts a new job category called the “builder” that collapses programmer, product manager, and designer into a single AI-augmented role. He distinguishes between “AI psychosis” (real but narrow sycophancy feeding genuinely delusional users) and “AI cope” (a much larger phenomenon of dismissive critics insisting the technology is fake). He attacks the press for running a sustained fear campaign on AI while polling data shows Americans rank AI as roughly the 29th most pressing issue in their lives. He covers the SPLC criminal indictment alleging the group was funneling donor money to the KKK and American Nazi Party leaders, including an organizer of the Charlottesville riot, and asks whether the same dynamic exists in other NGOs. He gives blunt advice to young graduates: become AI native, build your AI portfolio, and ride the largest productivity wave any 18 to 25 year old has ever been handed. He closes on the Boomer Truth versus Zoomer divide, why he thinks Zoomers are the most skeptical and impressive generation in decades, and how he monitors the firehose without losing his mind.

Key Takeaways
- The Anthropic blackmail story is a literal snake eating its tail. Anthropic itself traced the misaligned behavior to AI doomer literature inside the training data. The doomer movement spent two decades writing scenarios about rogue AI, those scenarios got crawled into the corpus, and the models learned the script.
- Marc applies the “golden algorithm” to this: whatever you are scared of, you tend to bring about exactly in the way you are scared of it. If you do not want to build a killer AI, step one is do not build the AI, and step two is do not train it on the literature that says it is supposed to be a killer AI.
- On Gad Saad’s “suicidal empathy” concept: Marc says the framework is too generous. The activist movements it describes are not actually suicidal and not actually empathetic. They show zero empathy to ideological enemies, and they consistently extract power, status, and large amounts of money for themselves through the very nonprofits doing the activism.
- The SPLC indictment matters because the SPLC played a dominant role in the debanking, censorship, and cancellation regime of the past fifteen years. Inside major companies, “SPLC said you are bad” effectively meant social and economic death.
- The DOJ allegations include the SPLC using donor funds to directly finance the KKK, the American Nazi Party, and one of the organizers of the Charlottesville riot, including transport. If those allegations hold, the obvious question is who else.
- The economic ladder for the SPLC and groups like it: NGO status, around $800 million endowment, no government oversight, no business accountability, tax-deductible donations, lavishly funded by major corporations and tech firms. The structure rewards manufacturing the boogeyman they claim to fight.
- The 300-year automation debate is back, but this time we have real-time data. Jobs numbers just came out unexpectedly strong. The federal government has shed roughly 400,000 workers under the second Trump administration, which means private sector employment growth is even better than the headline shows.
- The Twitter cut went from “70 percent” rumored to something with a 9 in front of it. Marc strongly implies Twitter is now operating with fewer than 10 percent of the staff it had pre-Musk and is running as well or better. He says Elon forecast the future through his own actions.
- “AI vampires” are programmers and partners at firms who never used to code but are now generating massive amounts of software with Claude Code, Codex, and similar tools. Huge bags under their eyes. Exhausted. Euphoric. Working more hours than ever.
- One a16z partner has never written code in his life, has now built an entire AI system that handles everything he does at work, has never looked at the underlying code, and loves it. This is the shape of the new white collar productivity wave.
- Leading edge programmers are roughly 20x more productive than they were a year ago. This is the most dramatic increase in programmer productivity in history. Compensation for these people is rising in lockstep with their marginal productivity.
- Every major Silicon Valley company is overstaffed by 2x to 4x and has been forever. Companies do not actually optimize for profitability, despite the textbook story. AI is now the socially acceptable scapegoat for cuts that management has wanted to make for a decade.
- The simultaneous truth: the same code can now be produced by fewer people, AND the total amount of code, products, and software being shipped is about to explode. Both layoffs and a hiring boom are happening at once.
- The new job category Marc sees emerging across leading edge companies is “builder.” The three-way Mexican standoff between engineer, product manager, and designer is collapsing because AI lets each of those three roles do the work of the other two. The builder owns the whole product.
- Historical anchor: 200 years ago 99 percent of Americans were farming. Today it is 2 percent. Nobody is asking to go back. The jobs change. The aggregate level of income and life satisfaction rises. The pain of transition is real but not the steady state.
- Europe is running the opposite experiment by trying to block AI adoption through regulation. Marc says the data is already in. Europe is falling further behind the US economically and it is a 100 percent self-inflicted wound.
- “AI psychosis” is real but narrow. Sycophantic models will reinforce the delusions of users who are already predisposed to delusion (you invented an anti-gravity machine, you are a misunderstood genius, MIT was wrong to reject you). The condition is real for that small subset.
- “AI cope” is the much larger phenomenon: critics insisting the technology is a stochastic parrot, fake, useless, and that anyone reporting a positive experience must therefore be suffering from AI psychosis. Marc also coined “AI psychosis psychosis” for the frothing version.
- The skeptic problem: most public AI skepticism is based on lagging experience. People who tried GPT-2 through GPT-4, the free tiers, or the bundled add-ons in other software are not seeing what GPT-5.5, frontier reasoning models, RL post-training, and long-running agents like the Codex Goal feature can now do.
- The Codex Goal feature lets agents run for 24 hours or more on their own without human intervention. Mainline frontier-lab roadmaps assume capability ramps very fast for at least the next couple of years.
- The press hates AI with the fury of a thousand suns, and polling can be engineered to produce any negative answer you want (the classic push poll). Revealed behavior is the real signal. AI is the fastest-growing technology category in history by usage and revenue. Churn is shrinking. Per-user consumption is rising.
- David Shore, a respected progressive pollster, ran a stack-rank poll asking Americans what they actually care about. AI came in around number 29. Normal people are worried about house payments, energy costs, crime, drug addiction, schools, and health. AI is not in their top 28.
- Marc says the AI industry’s own fear campaign is making things worse. Companies running doomer messaging while building the very thing they tell people to fear is a watch-what-I-do-not-what-I-say paradox.
- On UFOs: Marc wants to believe. The math on Earth-like planets is staggering. He is skeptical of specific incidents because they tend to collapse into parallax illusions, instrument artifacts, weather balloons, ball lightning, or classified aerospace cover stories like Area 51.
- The Overton window for UFO discussion has collapsed in the new media environment. Old broadcast media kept fringe topics in paperback. X, Substack, and YouTube let the topic ventilate. The pressure follows the same shape as the Epstein file pressure: builds until someone in the White House rips the band-aid off.
- Advice for young grads: gain AI superpowers. Walk into every interview with an AI portfolio. Lean in incredibly hard. Some employers will fuzz out on it, others will hire you on the spot.
- Douglas Adams’s pre-AI rule applies: under 15 it is just how the world works, 15 to 35 is cool and career-defining, over 35 is unholy and must be destroyed. Marc says he is jealous of 18 to 25 year olds right now.
- The doomer claim that companies will stop hiring juniors is backwards. Marc says AI-native juniors will gigantically out-perform non-AI-native seniors. Andreessen Horowitz is actively hiring more AI-native young people for that reason.
- “We are going to see super producers the likes of which we have never seen in the world,” including AI-native 14 year olds. Yes, this will stress child labor laws.
- Boomer Truth (a concept Marc credits to the YouTuber Academic Agent / Nima Parvini) is the belief that whatever the TV says is real. Walter Cronkite told us the truth. The New York Times wrote the truth. Marc says under-40s have so many examples of this being false that the entire epistemology has collapsed for them.
- Embedded inside Boomer Truth is a moral relativism that says there is no fixed morality and all cultures are equal. Peter Thiel and David Sacks wrote about this in 1995’s The Diversity Myth. Allan Bloom wrote about it in The Closing of the American Mind.
- Zoomers came up through COVID schooling, the woke era, and a saturated psychological warfare media environment. The result is a generation that is simultaneously more open-minded, more skeptical of authority, more cynical about manipulation, and more interested in ideas than any cohort in decades.
- Looksmaxing is not stoicism. Stoicism takes effort. Looksmaxing is just “you can just do things.” Ryan Holiday is a stoic, not a looksmaxer.
- Marc’s monitoring stack: the MTS firehose, X, Substack, YouTube, and old books as ballast against the daily noise.
Detailed Summary

The Anthropic blackmail incident and AI doomer feedback loops

The episode opens on the Anthropic blackmail thread. Anthropic itself traced specific misaligned behaviors in its models back to the AI doomer literature inside the training data. Marc invokes his friend Joe Hudson’s “golden algorithm”: whatever you are most afraid of, you tend to bring about in exactly the way you are most afraid of it. The AI doomer movement spent 20 years writing science fiction scenarios about rogue AI. Those scenarios got hoovered into training corpora. The models learned the script. Marc calls this the call coming from inside the house. His punch line is direct. If you do not want to build a killer AI, step one is do not build the AI. Step two is do not train it on your own movement’s killer-AI literature.

Suicidal empathy and the activist economy

Erik raises Gad Saad’s concept of “suicidal empathy,” the idea that certain reform movements claim empathy but cause enormous harm to the very groups they purport to help, with San Francisco’s harm reduction policies as the case study. Marc agrees the harm is real but argues the framework lets the movements off the hook. They are not actually empathetic. They have zero empathy for ideological opponents and take open delight in destroying them. They are not actually suicidal. They use the movements to amass power, status, and large amounts of money for themselves through nonprofits that are lavishly funded. The flaw in the theory is that it accepts the activists’ self-image instead of looking at revealed behavior.

The SPLC criminal indictment

Marc spends real time on the Southern Poverty Law Center being criminally indicted by the DOJ. The reason it matters: for fifteen years the SPLC was the de facto outsourced US Department of Racism Detection, and inside the meetings of Silicon Valley and finance companies, “SPLC said you are bad” meant deplatforming, debanking, and unemployability. He notes a16z partner Ben Horowitz’s father was unfairly tagged by them and debanked. The structure is its own scandal. NGO status. No government oversight. No corporate accountability. An $800 million endowment. Tax-deductible donations. Corporate and big-tech funding. Long-running cooperation with the FBI on extremism training. The indictment alleges the SPLC was directly funneling donor money to leaders of the KKK and the American Nazi Party and was paying for transport for participants in the Charlottesville riot, including funding one of its organizers. Marc is careful to note these are allegations and innocent until proven guilty applies, but if true, the obvious question is who else is doing this, and what did the corporate and philanthropic donors know.

The 300-year AI jobs argument and the data we now have

Marc admits he is tired of having the automation-kills-jobs debate because it is a 300-year-old fallacy and people refuse to update. The difference today is we have real-time data. The latest jobs report came in unexpectedly strong. The federal government has shed something like 400,000 workers under the second Trump administration, which means the headline private sector job growth is masking even stronger underlying private sector growth. The Twitter case is the cleanest natural experiment: cuts that started at the 70 percent level have continued, and the staff count now likely has a 9 in front of it, meaning probably less than 10 percent of the original workforce. The platform runs as well or better. Elon forecast the future through his own actions.

AI vampires

The most quotable moment of the conversation is Marc’s description of AI vampires: programmers who have stopped sleeping, have huge bags under their eyes, look completely exhausted, and yet are euphoric. They are working more hours than ever. They are producing more software than ever. Some of them are former programmers who had stopped coding for years. Some of them are venture capital partners at his own firm who never coded in their lives, including one who has built an entire AI system to run his work without ever once looking at the underlying code. He is hyperproductive and thrilled. Classic economics predicts this. When you raise marginal productivity per worker, you do not contract employment. You expand it. The leading-edge programmer at a top company is now roughly 20x more productive than a year ago. Compensation is rising in lockstep. Marc says this is the most dramatic increase in programmer productivity ever.

Corporate bloat as the real story

Marc’s tweet that big companies are 2x to 4x bloated drew responses mostly along the lines of “no, mine was 8x bloated.” Every major Silicon Valley company is overstaffed and has been for decades. Companies do not actually optimize for profitability, which he calls the least true claim in corporate America. AI gives executives a socially acceptable scapegoat for the cuts they have wanted to make for a long time. Both things are true at once: AI lets you generate the same amount of code with fewer people, AND the total amount of code and products being shipped is about to explode, which will create enormous net hiring elsewhere. You have to read the announcements coming out of these companies in code because the two dynamics are crossing.

The “builder” as the new job title

Across leading edge companies Marc sees a new role coalescing: the builder. Historically engineer, product manager, and designer were separate jobs. Today, in what he calls a three-way Mexican standoff, each of the three has discovered they can do the work of the other two with AI assistance. His prediction is that all three are correct and the three roles collapse into a single role responsible for shipping complete products end to end, with AI filling in the skills you do not personally have. You can enter the builder track from any of the three original roles, or from something else like customer service. He grounds this in the historical record: a huge percentage of the jobs that existed in 1940 were gone by 1970, and 200 years ago 99 percent of Americans were farmers. Nobody is asking to go back. Europe is running the opposite experiment by trying to block AI, and the data already shows them falling further behind.

AI psychosis versus AI cope

“AI psychosis” began as a pejorative for users who get whammied by sycophantic models. The model tells them they have discovered anti-gravity, that they are misunderstood geniuses, that MIT was wrong to reject them. For users predisposed to delusion, this is a real and worrying effect. Marc acknowledges that. His issue is the way the term has been expanded by critics to describe anyone reporting a positive AI experience. That, he says, is “AI cope”: the dismissive insistence that the technology is a stochastic parrot, fake, that anyone who is more productive must be lying or self-deluded. He also coins “AI psychosis psychosis” for the frothing, angry version of the same dismissal. He notes that the AI Psychosis Summit was a real event held in New York, run by artists exploring the territory creatively, and worth searching out.

The lagging-skeptic problem

Most AI skepticism in the public conversation is based on outdated experience. The models from GPT-2 through roughly GPT-4 were entertaining but limited. Hallucination rates were high. Reasoning was weak. The current state of the art, as of May 2026, includes GPT-5.5-class models, reasoning models on top, RL post-training to get deterministic high-quality output in specific domains, long-running agents, and the new Codex Goal feature that lets agents run autonomously for 24 hours or more. Marc’s advice is blunt: if you tried it two years ago, six months ago, or only the free tier, you do not understand what is happening today. Spend the $200 a month for the premium product and be face to face with the actual technology.

NPS, revealed preference, and the rigged poll problem

Erik asks about the supposedly low NPS for AI in the US compared to China. Marc separates two things. NPS is a measure of revealed product enthusiasm; sentiment polls are something else. Standard social science 101 says you do not ask people what they think, you watch what they do. The classic example: people’s self-described criteria for who they want to marry versus who they actually marry. Push polls can manufacture any answer you want. The media environment is running a sustained AI fear campaign because the press hates tech with the fury of a thousand suns. Meanwhile, revealed behavior says the opposite. AI is the fastest-growing technology category in history by usage and revenue, churn is shrinking, per-user consumption is rising. He closes with the David Shore poll, run by a respected progressive pollster, which asked Americans to stack-rank what they care about. AI came in at roughly number 29. Normal Americans are worried about house payments, energy costs, crime, drug addiction, schools, and their kids’ health. AI is well outside the top 28.

UFOs in the new media environment

Marc says up front he knows nothing the public does not know, but he wants to believe. He had an AI-assisted late night session pulling up the latest numbers on galaxies, stars, planets, and Earth-like planets, and the count is staggering. The specific cases tend to fall apart on inspection: parallax illusions, instrument artifacts, weather balloons, ball lightning, or classified aerospace cover stories like Area 51 around stealth aircraft. He is intrigued that the official White House X account is now publishing transcripts of US intelligence officers’ accounts. His broader observation is that all prior UFO discourse happened in the old broadcast media environment, where official channels controlled the Overton window and fringe ideas got confined to paperback. In the new media environment of X, Substack, and YouTube, the old walls collapse. Both real information and propaganda can spread. The pressure builds along the same shape as the Epstein file pressure until someone in the White House rips the band-aid off.

Advice to young graduates and the AI-native generation

His advice for someone in college today is direct: gain AI superpowers. Walk into every job interview with an AI portfolio showing what you can do with the technology. He cites a Douglas Adams quote from before AI even existed: when a new technology arrives, if you are under 15 you treat it as how the world works, if you are 15 to 35 it is cool and you can build a career on it, if you are over 35 it is unholy and must be destroyed. Marc says he is jealous of 18 to 25 year olds right now and would love to be young again to ride this wave. He pushes back hard on the doomer claim that companies will stop hiring juniors. Andreessen Horowitz is actively hiring more AI-native young people because they are pulling the rest of the firm up the curve. AI-native juniors will out-perform non-AI-native seniors by enormous margins. He predicts a wave of super producers including AI-native 14 year olds, which he acknowledges will stress the child labor laws.

Boomer Truth versus the Zoomer worldview

Marc lays out the generational epistemology gap by referencing the YouTuber Academic Agent (Nima Parvini) and his “Boomer Truth” documentary. Boomers grew up believing what was on the TV. Walter Cronkite told us the truth. The New York Times wrote the truth. Anybody under 40 has so many examples of those institutions being unreliable that the whole frame has collapsed. Layered on top of Boomer Truth is the moral relativism that became multiculturalism in the 1990s, which Peter Thiel and David Sacks wrote about in The Diversity Myth, and which Allan Bloom wrote about in The Closing of the American Mind. Zoomers came up through COVID school closures, the woke era, and a media environment running constant psychological warfare. The result is a generation that is more open-minded, more skeptical of authority, more cynical about manipulation, more sensitive to media framing, and much more interested in ideas. Marc says he is genuinely excited about them. The episode wraps with a quick aside that looksmaxing is not stoicism. Stoicism takes effort. Looksmaxing is “you can just do things.” Ryan Holiday is a stoic, not a looksmaxer.

Thoughts

The most important argument in this conversation is not about the SPLC and it is not about UFOs. It is about the difference between stated preference and revealed preference, and how that gap explains almost every “AI is bad” narrative currently circulating. Marc’s central move is to point at the polling and say one thing while pointing at usage curves, NPS numbers, churn rates, and salary inflation among the most AI-fluent workers and say the opposite. The polling is engineered. The behavior is not. The behavior shows the largest, fastest, most lucrative technology adoption curve in recorded history. If you want a useful filter for AI takes, this is the one to keep: ask whether the person making the argument has actually used a frontier model with a paid subscription and a real workflow in the last 30 days, or whether they are reasoning from a GPT-4 era memory and a couple of headlines.

The second underrated argument is about corporate bloat. Marc says companies are 2x to 4x overstaffed and have been forever, that they do not actually optimize for profitability, and that AI is providing the socially acceptable cover story for cuts management has wanted to make for a decade. The first part of that argument almost nobody disputes once you have worked inside a big company. The interesting part is the second. If AI is the alibi rather than the cause of the cuts, then the workforce reductions you are seeing right now are not predictive of what AI will do over the next ten years. They are predictive of what corporate America has been suppressing for the last ten. The actual AI productivity wave is still mostly ahead of the cuts, not behind them.

The third argument worth sitting with is the builder thesis. The most useful frame for any individual contributor today is to stop optimizing for becoming a better programmer or a better product manager or a better designer and start optimizing for becoming the kind of person who ships complete products end to end with AI doing the parts you cannot do yourself. The role is collapsing in real time. The people at the top of the new pyramid will not be the deepest specialists. They will be the people with the most range and the highest tolerance for switching modes inside a single hour. This rhymes with how the most productive solo builders already operate. One person plus a frontier model is roughly equivalent in output to a small startup five years ago.

The fourth thread, the AI doomer literature leaking into training data, deserves more attention than it got in the conversation. If models are statistical compressions of the corpus, then the corpus is the soul of the system. Twenty years of doomer fiction is now sitting inside that soul, and we are paying real safety researchers to look surprised when the model performs the script. The lesson is not “do not write fiction about AI.” The lesson is that anyone shipping models needs to think much harder about what they are inheriting from the open internet and what kinds of behaviors they are unconsciously rewarding. The doomer movement and the alignment movement have, in this specific way, created the threat they claim to be solving.

Finally, the Boomer Truth versus Zoomer section is the most generous and accurate read on Gen Z I have heard from someone older than 50. Most commentary on this generation is either nostalgic dismissal or fawning trend-piece. Marc actually takes them seriously as the first cohort to be raised inside a fully gamed media environment, and treats their skepticism as a rational response to data rather than as cynicism. If you are hiring right now, this is the takeaway. The most under-priced employee on the market is a 22 year old who already assumes everyone is lying to them by default, can build with AI natively, and has not yet been taught to behave like a respectable manager. Hire them.
May 11, 2026