PJFP.com

Pursuit of Joy, Fulfillment, and Purpose

Tag: frontier AI model

  • SubQ 1.1 Small Explained: How Subquadratic Sparse Attention Hits 98% Retrieval at 12 Million Tokens With 64.5x Less Compute Than Dense Attention

    Subquadratic, a frontier AI research and infrastructure company, has released the model card and technical report for SubQ 1.1 Small, a long-context language model built on a new attention mechanism the company calls Subquadratic Sparse Attention (SSA). The headline claim is unusual in two directions at once: the model retains 98% single-fact retrieval accuracy at 12 million tokens, roughly twelve times the length it was primarily trained on, while cutting attention compute by 64.5x against dense attention at a 1 million token context. The deeper argument in the report is not really about a single model at all. It is about what happens to the entire retrieval-and-orchestration stack once reasoning over a complete artifact stops being prohibitively expensive.

    TLDR

    SubQ 1.1 Small is a small long-context model that replaces the dense attention of an existing open-weight frontier model with Subquadratic Sparse Attention, a learned, content-dependent sparse attention mechanism that scales linearly in compute and memory rather than quadratically. On retrieval it posts 99.12% on NVIDIA’s 13-task RULER suite at 128K tokens and 100% needle-in-a-haystack accuracy at 1M and 2M tokens, holding at 98% out to 6M and 12M tokens while attending to only 0.13% of token pairs. It keeps competitive general ability, scoring 85.4% on GPQA Diamond and 89.7% pass@4 on LiveCodeBench v6, and reaches 13% on the long-horizon AutomationBench Finance agentic benchmark, close to Opus 4.8 and GPT-5.5 and well ahead of the mid and small tiers. The efficiency story is a scaling win rather than a constant-factor one: 64.5x fewer attention FLOPs than dense attention at 1M tokens and 56x faster than FlashAttention-2 on a single attention layer. The report frames cheap long-context compute as a research accelerator: it let the team run more than one hundred experiments at million-token context and find a training recipe (long-context continued pretraining is the strongest lever) rather than guess at one. It positions SSA against FlashAttention, DeepSeek’s Lightning Indexer line, state space models like Mamba, and hybrids, and it invokes Sutton’s Bitter Lesson to argue that RAG, chunking, and agentic scaffolding are partly workarounds for context scarcity. The benchmark results were independently verified by Appen. Deployment is starting with design partners now, with a 2M to 12M token lineup planned by year end.

    Thoughts

    The most interesting move in this report is the framing, not the benchmark. Subquadratic plants its flag on Richard Sutton’s Bitter Lesson and argues that much of the modern AI stack, the retrieval pipelines, the chunkers, the re-rankers, the agentic orchestration, is scaffolding built around a single computational constraint: dense attention costs grow with the square of context length. If that constraint relaxes, a lot of hand-engineered machinery that exists to feed a model the right fragments at the right moment starts to look like the task-specific pipelines that learned representations eventually displaced. That is a genuinely provocative thesis, and it is the right lens for reading the rest of the document. The company is not selling a longer context window as a feature. It is betting that whole-artifact reasoning is a different shape of capability than retrieval over fragments, and that fragmentation destroys the cross-references a contract or a codebase actually depends on before the model ever sees them.

    The part of the paper most teams will undervalue is the claim that the real payoff of efficient attention is not cheaper inference but cheaper experimentation. A dense long-context training campaign is expensive enough that most groups get a handful of attempts and are forced to guess at the recipe. Subquadratic says SSA let them run more than a hundred experiments across six model generations with per-step iteration under a minute at million-token context, which is how they discovered that long-context continued pretraining, not clever post-training, was the dominant lever. If that holds, algorithmic efficiency becomes a first-class scaling variable alongside parameters and data, because capability becomes responsive to iteration velocity rather than raw compute alone. It reframes efficiency from a deployment line item into a research multiplier, and that is a more durable advantage than any single benchmark number.

    The generalization result deserves scrutiny precisely because it is so clean. A model trained overwhelmingly at 1M tokens, with a sliver at 2M and nothing beyond, holds 98% retrieval at 12M. The proposed explanation is that SSA routes attention by content relevance rather than fixed positional pattern, so there may simply be no obvious length boundary once the routing behavior is learned. That is plausible and the report is careful to say the 12M result emerged rather than being designed for. But single-needle NIAH is a deliberately clean probe with one target and a binary answer. The far harder RULER suite is only reported at 128K, the longest standardized length in the original benchmark, so the multi-hop, aggregation, and distractor-heavy capability that whole-artifact reasoning actually requires has public numbers at 128K, not at 12M. The honest read is that precise retrieval generalizes spectacularly and composite reasoning at extreme length is still an open question the report does not over-claim on.

    What lends the report credibility is how much counter-evidence it volunteers. It walks through MiniMax abandoning its hybrid M1 architecture and returning to full attention for M2 after efficient variants showed multi-hop reasoning deficits at scale. It admits that earlier SubQ checkpoints improved retrieval while regressing on knowledge benchmarks, forcing dedicated capability-balancing work. It describes catching a case where the MRCR benchmark moved up while the model felt worse in real workflow spot-checks, and switching its development signal to RULER as a result. That last point is a quietly important methodological argument: benchmark score and deployment behavior diverged enough to change checkpoint selection, which is a warning every team shipping long-context models should internalize. A vendor confident enough to show where its own metrics misled it is more trustworthy than one that only shows the wins.

    A few caveats keep the enthusiasm grounded. AutomationBench Finance at 13% is genuinely strong relative to peers, but it is a low absolute score across the board, including for GPT-5.5 at 18% and Opus 4.8 at 16%, so this is early evidence of agentic transfer rather than proof of a finished agent. The efficiency comparisons isolate a single attention layer rather than full end-to-end model throughput, which is the right way to expose the scaling shape but not the same as a wall-clock serving benchmark. The model is built from an unnamed donor open-weight frontier model, so some of its general-knowledge and coding strength is inherited rather than created here. And the most aggressive claims about the future, a 2M to 12M lineup and much higher sparsity, are roadmap, not released artifacts. None of that undercuts the core result. It just means the right posture is to treat SubQ 1.1 Small as a strong proof of concept for an architecture that, if it scales as advertised, could quietly remove a layer of the AI stack that everyone currently takes for granted.

    Key Takeaways

    • SubQ 1.1 Small is a long-context language model from Subquadratic AI, built on a new attention mechanism called Subquadratic Sparse Attention (SSA), released June 16, 2026 alongside a model card and technical report.
    • SSA is a learned, content-dependent sparse attention mechanism that scales linearly in both compute and memory with sequence length, rather than quadratically like dense attention.
    • The central result is context-length generalization: the model was trained primarily at 1M tokens, with some training at 2M and none beyond, yet retrieval held far past the training window.
    • Needle-in-a-haystack accuracy is 100% at 1M and 2M tokens and 98% at both 6M and 12M tokens, roughly twelve times the primary training length.
    • At 12M tokens the model attends to only 0.13% of token pairs, roughly a 770x reduction in attention relationships (1 / 0.0013), while still retrieving accurately.
    • On NVIDIA’s 13-task RULER benchmark at 128K tokens, SubQ 1.1 Small scores 99.12%, with the remaining errors concentrated in aggregation-style tasks rather than retrieval.
    • RULER tests beyond single-fact lookup: single-key and multi-key retrieval, common-word and frequent-word extraction, and multi-hop variable tracing across positions.
    • At 1M tokens, SSA requires 64.5x fewer attention FLOPs than dense attention (3.9 PFLOP versus 252 PFLOP per attention layer).
    • On a single attention layer, SSA runs 56x faster than FlashAttention-2 at 1M tokens (966 ms versus 54,164 ms on an H100), reaching parity near 16K tokens and pulling away as context grows.
    • The efficiency gain is a scaling-law win, not a constant-factor speedup: the advantage over dense attention grows as context length increases.
    • On general knowledge, SubQ 1.1 Small scores 85.4% on GPQA Diamond (pass@1), below GPT-5.5 (93.2) and Opus 4.8 (92), near Sonnet 4.6 and GPT-5.4-mini (87.5), and above GPT-5.4-nano (81.7) and Haiku 4.5 (67.2).
    • On coding, it reaches 89.7% pass@4 on LiveCodeBench v6, close to the absolute frontier (GPT-5.5 92, Opus 4.8 92.2) and ahead of the smaller tiers.
    • On AutomationBench Finance, a long-horizon agentic benchmark, it scores 13%, close to Opus 4.8 (16%) and GPT-5.5 (18%) and ahead of Sonnet 4.6 (8%), Haiku 4.5 (3%), and GPT-5.4-mini (0%). Absolute scores are low across all models.
    • The model was not trained from scratch. The team converted an existing open-weight frontier model by replacing dense attention with SSA, then built long-context ability through staged context extension and continued pretraining.
    • Context was extended in stages (262K, 512K, 1M, 2M) using YaRN positional scaling, with long-context continued pretraining performed between extension stages on naturally long data: books, long documents, and repository-scale code.
    • Roughly one trillion tokens of continued pretraining were performed, most of it at the 1M-token stage.
    • Long-context continued pretraining was the most consistent predictor of long-context retrieval gains across the experiments, more so than post-training tweaks.
    • The team ran more than one hundred long-context experiments across six major model generations, which the report argues is only possible because SSA made million-token iteration cheap (under a minute per step).
    • Capability balance was a recurring challenge: gains in long-context retrieval often regressed short-context knowledge and reasoning unless training was explicitly managed for both.
    • Benchmark scores and real deployment behavior diverged. The MRCR benchmark moved up while qualitative workflow spot-checks got worse, so the team switched its primary development signal to RULER.
    • The report frames RAG, chunking, summarization, and agentic orchestration as scaffolding built around context scarcity, drawing an analogy to Sutton’s Bitter Lesson, where hand-engineered mechanisms get displaced by larger-scale learning.
    • SSA is positioned against FlashAttention (a memory optimization that does not change quadratic compute), fixed-pattern sparse attention, DeepSeek’s learned sparse line, state space models, and hybrid architectures.
    • DeepSeek’s Lightning Indexer (used in DSA and CSA) is the closest published comparison. Its quadratic scoring overtakes the sparse attention it feeds around 52,000 tokens, reaching roughly 16x the attention cost at 1M and 190x at 12M.
    • State space models like Mamba achieve linear cost through a compressed fixed-size state, but that compression is lossy and weakens exact retrieval, which is why production efficient models are usually hybrids with some dense attention layers retained.
    • MiniMax is cited as a cautionary case: it moved from a hybrid M1 to a full-attention M2 after hybrids showed multi-hop reasoning deficits at scale and less mature supporting infrastructure.
    • The benchmark results were independently verified by Appen, a third-party evaluation firm.
    • The named use cases are financial analysis and due diligence, legal and contract work, and software engineering (architecture-level reasoning, cross-file refactoring, dependency tracing, planning, review, and long-horizon memory).
    • Sparsity settings were deliberately conservative, tuned for maximum context length rather than maximum sparsity. Limited experiments at 4x the sparsity reported positive early results.
    • The training infrastructure used a memory-scaling ladder: single node, intra-node sequence parallelism, CPU offload, multi-node sequence parallelism, nested offloading, and Ring Attention for the longest contexts.
    • Beyond about 8M tokens, BF16 numerical underflow and stability became practical constraints on evaluation.
    • The technical report is authored by Saul Ramirez, Alex Whedon, Ashmal Vayani, and Phong Vo of Subquadratic AI.
    • Deployment is starting with a first cohort of design partners, with broader rollout through the quarter and a general model lineup ranging from 2M to 12M tokens by the end of the year.
    • The company’s framing line is “Efficiency is intelligence,” and its broader thesis is that the point is not bigger context windows for their own sake but reasoning directly over complete artifacts with less surrounding scaffolding.

    Detailed Summary

    The problem: whole-artifact reasoning and context scarcity

    The report opens by naming a class of tasks it calls whole-artifact reasoning: problems whose structure requires reasoning across a complete artifact rather than over isolated fragments. A legal agreement may define a term on page 2, qualify it on page 12, carve out an exception on page 46, and amend it in a schedule. A function may be defined in one file, called from forty others, and constrained by invariants encoded in the architecture rather than in comments. A financial review may require connecting filings, earnings reports, contracts, and internal records. In each case the difficulty is not locating a passage, it is reasoning over relationships distributed throughout a large artifact. Most production systems do not do this directly. They rely on retrieval pipelines, chunking, summaries, and agentic workflows that partition information and reconstruct fragments at inference time, because dense attention scales quadratically with context length and makes direct reasoning over large artifacts expensive. Subquadratic argues that much of the modern AI stack is therefore designed to manage context scarcity rather than reason over complete artifacts, and it connects this to Sutton’s Bitter Lesson: sophisticated hand-engineered mechanisms historically get displaced once larger-scale learning becomes practical.

    What SSA is and the three requirements it targets

    Subquadratic Sparse Attention is a content-dependent sparse attention mechanism designed to satisfy three requirements at once, a combination the report argues prior approaches never achieved in a practical long-context system. First, dense-attention-level retrieval and reasoning quality, which requires routing that is content-dependent (determined by the tokens themselves) rather than driven by a fixed positional pattern. Second, subquadratic scaling, where selection, retrieval, and attention are each linear in sequence length so the mechanism is linear end to end, not only within the attention read. Third, full-context training with standard autoregressive generation, so the model can optimize over the entire context during training while keeping efficient token-by-token decoding at inference. The internal mechanism by which SSA achieves this is held back as outside the scope of the report, which focuses instead on the requirements and the experimental program that followed.

    Where SSA sits among prior approaches

    The background section is effectively a taxonomy of long-context modeling. FlashAttention is treated not as a competitor but as the standard dense-attention baseline: it solved the memory problem by never materializing the full attention matrix, but it left the quadratic compute cost untouched, so doubling context still quadruples attention computation. Fixed-pattern sparse attention (sliding-window, strided, as in Longformer, BigBird, and the sliding window in Gemma) scales well but sacrifices content-dependent routing and tends to fail on retrieval benchmarks like RULER. Compression methods like Multi-head Latent Attention reduce KV-cache memory at inference but do not change the quadratic prefill cost. Learned sparse attention, exemplified by DeepSeek’s Native Sparse Attention and its Lightning Indexer, learns where to route but pays a quadratic cost in the indexer itself. State space models and linear attention (Mamba, Mamba-2 and Mamba-3, RetNet, RWKV, gated delta networks) achieve linear cost through a compressed fixed-size state, but that compression is lossy and weak on exact retrieval. Hybrids (Jamba, Kimi Linear, Qwen3 Next, Nemotron) keep a few dense layers to preserve retrieval, which means the quadratic component still dominates at long context. System-level workarounds (RAG, agentic frameworks, recursive language models) move retrieval outside the model entirely. The report’s stated open problem is to combine subquadratic scaling end to end with content-dependent retrieval, arbitrary-position access, and practical ultra-long-context training in one system, which it claims no widely deployed architecture provides and which SSA targets.
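
    The difference between fixed-pattern and content-dependent routing can be seen in a toy needle-retrieval setup. The sketch below is a generic illustration in plain Python, not SSA (the report withholds SSA's internal mechanism). The haystack, the needle construction, and every function name are assumptions made up for this example.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, keys, values, allowed):
    """Softmax attention for one query, restricted to the positions in `allowed`."""
    idx = list(allowed)
    scores = [dot(query, keys[i]) for i in idx]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return sum(w / total * values[i] for w, i in zip(weights, idx))


def sliding_window(position, window):
    """Fixed-pattern routing: the most recent `window` positions, whatever they contain."""
    return range(max(0, position - window + 1), position + 1)


def top_k_by_content(query, keys, position, k):
    """Content-dependent routing: the k causal positions with the highest scores.

    Scoring every key costs O(n) per query, so O(n^2) per sequence. That is the
    selector cost a subquadratic design has to avoid; this toy does not avoid it.
    """
    causal = range(position + 1)
    return sorted(causal, key=lambda i: dot(query, keys[i]), reverse=True)[:k]


def build_haystack(n=256, dim=16, needle_pos=10, seed=0):
    """Random keys plus one 'needle' key that the final query is aligned with."""
    rng = random.Random(seed)
    keys = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]
    values = [0.0] * n
    needle = [8.0] + [0.0] * (dim - 1)
    keys[needle_pos] = needle
    values[needle_pos] = 1.0      # the fact we want to retrieve
    return keys, values, needle   # the needle direction doubles as the query


keys, values, query = build_haystack()
last = len(keys) - 1
windowed = attend(query, keys, values, sliding_window(last, window=32))
routed = attend(query, keys, values, top_k_by_content(query, keys, last, k=8))
print(f"sliding window: {windowed:.3f}  top-k by content: {routed:.3f}")
```

    The window never reaches the early needle, so it retrieves nothing, while content-based selection finds it. The naive top-k selector pays for that by scoring every key, which is the same quadratic selector cost the report attributes to the Lightning Indexer.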

    Training: conversion, staged context extension, and continued pretraining

    Rather than training from scratch, the team converted an existing open-weight frontier model that supported a 262K-token context by replacing its dense attention with SSA. They then extended the context window in stages (262K to 512K to 1M to 2M) using YaRN to rescale positional representations, performing long-context continued pretraining between extension stages rather than jumping straight to the final length. The training mixture emphasized naturally long data such as books, long documents, and repository-scale code, packed to the target length with document separators and without masking cross-document attention boundaries. Most continued-pretraining tokens were trained at the 1M-token stage, with roughly one trillion tokens total. Post-training played a separate role: shaping how the long-context capability was expressed while preserving reasoning, coding, and instruction following. The team explored sample-level loss aggregation to keep a few extremely long examples from dominating gradient updates, and staged the post-training corpus across synthetic retrieval tasks, long-context reasoning, coding, educational material, and general instruction following, alternating capability-building phases with recovery phases.
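
    The report does not spell out its loss formula, so the following is only one common reading of sample-level loss aggregation: average the loss within each sample first, then across samples, instead of averaging over all tokens in the batch. The batch contents and function names are invented for illustration.

```python
def token_level_loss(batch):
    """Mean over all tokens in the batch: long samples weigh in by their length."""
    total_tokens = sum(n for n, _ in batch)
    return sum(n * loss for n, loss in batch) / total_tokens


def sample_level_loss(batch):
    """Mean of per-sample mean losses: every sample counts once."""
    return sum(loss for _, loss in batch) / len(batch)


# Each entry is (number_of_tokens, mean_per_token_loss); the values are made up.
batch = [
    (1_000_000, 2.0),  # one million-token document
    (4_000, 1.0),
    (4_000, 1.0),
    (4_000, 1.0),
]

print(f"token-level:  {token_level_loss(batch):.3f}")
print(f"sample-level: {sample_level_loss(batch):.3f}")
```

    Under token-level averaging the single long document accounts for almost the entire loss. Under sample-level averaging it is one vote out of four, which matches the stated goal of keeping a few extremely long examples from dominating gradient updates.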

    Results: retrieval, knowledge, coding, and agentic tasks

    On retrieval, SubQ 1.1 Small scores 99.12% on the 13-task RULER average at 128K, with errors concentrated in aggregation-style tasks like common-word and frequent-word extraction. On needle-in-a-haystack, evaluated on 50 held-out UUID samples per length, it scores 100% at 1M and 2M (within the training window) and 98% at 6M and 12M (held out), attending to only 0.13% of token pairs at 12M. On knowledge, GPQA Diamond pass@1 is 85.4%, landing between the small and mid frontier tiers and confirming that long-context optimization need not sacrifice reasoning, a result the report credits to its capability-balancing stages after earlier checkpoints showed retrieval gains coming at the cost of knowledge. On coding, LiveCodeBench v6 pass@4 is 89.7%, and the report notes coding data played a dual role, also improving non-code long-context retrieval because code is dense with the cross-position dependencies that train general routing. On long-horizon agentic work, AutomationBench Finance is 13%, where agents must discover the right endpoints among roughly 500 across 47 applications, make interdependent API calls, follow layered business rules, and ignore seeded distractors, graded on binary end-state correctness with no partial credit.
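
    With 50 samples per length, 98% corresponds to 49 correct answers, so the sampling uncertainty is worth keeping in mind. The sketch below computes a standard Wilson score interval. It assumes the 50 samples are independent trials, and it is a reader-side calculation, not something the report presents.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


# 98% of 50 samples is 49 correct; 100% is 50 correct.
for label, correct in [("6M / 12M", 49), ("1M / 2M", 50)]:
    low, high = wilson_interval(correct, 50)
    print(f"{label}: {correct}/50, 95% interval {low:.3f} to {high:.3f}")
```

    At this sample size the intervals for 49/50 and 50/50 overlap heavily, so the drop from 100% to 98% is a single missed needle, not clear evidence of degradation.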

    Efficiency and the DeepSeek comparison

    Efficiency is measured on one attention layer against a dense baseline on the same backbone. Per-forward-pass attention FLOPs scale from a 2.1x reduction at 32K to 8x at 128K, 31.5x at 512K, and 64.5x at 1M tokens (3.9 PFLOP for SSA versus 252 PFLOP for dense). Measured against FlashAttention-2 in isolation, SSA reaches parity near 16K tokens and pulls away to 56x at 1M, where it runs in 966 ms versus 54,164 ms on an H100. The report devotes a discussion section to DeepSeek’s sparse attention line as the closest published comparison. DeepSeek’s Lightning Indexer is a learned selector, but it is a full-attention distilled transformer, so it scales quadratically: in a V3.2-style configuration the indexer is cheaper than the sparse attention it feeds only below about 52,000 tokens, then overtakes it, reaching roughly 16x the attention cost at 1M tokens and 190x at 12M. SSA targets that same selection role with a selector the report says is dramatically cheaper and linear throughout, and notes SSA could conceptually replace the selector over either uncompressed or compressed representations.
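
    The reported reductions can be checked against the simplest possible scaling model. The sketch below assumes dense attention costs a·n² and SSA costs b·n, so the reduction is n / n0 for a single constant n0, and it calibrates n0 on the 1M-token figure alone. That cost model, the reading of "K" as 1,000 tokens, and the names in the code are assumptions for illustration, not the report's own analysis.

```python
# Reported attention-FLOP reductions versus dense attention, keyed by context length.
reported = {32_000: 2.1, 128_000: 8.0, 512_000: 31.5, 1_000_000: 64.5}

# Assumption: dense cost = a * n^2 and sparse cost = b * n, so the reduction
# is n / n0 with n0 = b / a. Calibrate n0 on the 1M-token point only.
n0 = 1_000_000 / reported[1_000_000]


def predicted_reduction(n):
    """Reduction factor implied by the quadratic-versus-linear cost model."""
    return n / n0


print(f"implied n0: {n0:,.0f} tokens")
for n, observed in reported.items():
    print(f"{n:>9,} tokens: reported {observed:5.1f}x, linear model {predicted_reduction(n):5.1f}x")

# Cross-check the two headline ratios from their raw figures.
print(f"FLOPs:   {252 / 3.9:.1f}x")
print(f"latency: {54_164 / 966:.1f}x")
```

    A single constant reproduces all four reported points to within a few percent, which is what a scaling win, as opposed to a constant-factor one, should look like. The implied n0 of about 15,500 tokens happens to sit near the 16K latency parity point, though FLOPs and wall-clock latency are different measurements.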

    Efficiency as a research accelerator and the evaluation lessons

    A recurring theme is that the most valuable effect of cheap long-context compute was on the research loop, not just inference. Where a dense campaign would allow a handful of attempts, SSA enabled more than a hundred experiments across six model generations with per-step iteration under a minute at million-token context. That throughput is what surfaced the finding that long-context continued pretraining is the strongest lever, and it leads the authors to argue that algorithmic efficiency should be treated as a first-class scaling variable alongside model and dataset size. The report is unusually candid about evaluation pitfalls. It describes how the MRCR benchmark diverged from deployment behavior, with MRCR-optimized checkpoints often feeling worse on repository-scale code reasoning, multi-document synthesis, and contract analysis, which pushed the team to rely on RULER and a fixed set of qualitative workflow spot-checks as development signals. It also cites MiniMax returning from a hybrid M1 to a full-attention M2 as evidence that reducing asymptotic cost is not sufficient on its own if retrieval quality, reasoning at scale, and system maturity are not preserved at the same time.

    Implications, availability, and what comes next

    The report’s deployment argument is that the most important enterprise implication of long-context models is not larger windows but the ability to reason directly over complete or more-complete artifacts, moving retrieval, re-ranking, and orchestration logic into the model where the task is naturally whole-artifact rather than naturally decomposable. It is careful not to declare retrieval obsolete: for corpora larger than any plausible context window, fast-changing knowledge, and genuinely multi-stage workflows, RAG and orchestration remain the right tools. The narrower claim is that the class of scaffolding that exists only to compensate for context limits gets smaller as efficient long-context models extend the reachable window. The benchmark results were independently verified by Appen. Subquadratic is deploying SubQ 1.1 Small with a first cohort of design partners now, with broader rollout through the quarter and a general lineup spanning 2M to 12M tokens planned by the end of the year, and it flags much higher sparsity as future work.

    Notable Quotes

    “Much of the modern AI stack is therefore designed to manage context scarcity rather than reason over complete artifacts directly.”

    SubQ-1.1-Small Technical Report, framing retrieval and orchestration as workarounds for an architectural limit

    “The hybrid has moved the line, but not changed its shape.”

    SubQ-1.1-Small Technical Report, on why hybrid models keep their quadratic component at long context

    “A routing mechanism intended to make long context affordable becomes the dominant long-context cost, reintroducing quadratic scaling after providing scalar compute savings.”

    SubQ-1.1-Small Technical Report, on DeepSeek’s Lightning Indexer overtaking the attention it feeds

    “If the cost of long-context experiments is too high, teams are forced to guess at the recipe. If the cost falls far enough, they can search for it.”

    SubQ-1.1-Small Technical Report, on efficient attention as a research accelerator

    “Fragmentation systematically destroys those relationships before the model ever sees them.”

    SubQ-1.1-Small Technical Report, on why chunking hurts whole-artifact reasoning

    “Holding the whole artifact in context changes the shape of the task rather than only the speed of it.”

    SubQ-1.1-Small Technical Report, on the difference between bigger windows and direct reasoning

    “The value of SSA is therefore not only that it makes long-context inference cheaper. It makes long-context experimentation cheaper.”

    SubQ-1.1-Small Technical Report, conclusion

    Read the full SubQ 1.1 Small technical report and model card here.

    Related Reading

    • Subquadratic (subq.ai): the company behind SubQ 1.1 Small and the Subquadratic Sparse Attention architecture, where you can join the waitlist.
    • The Bitter Lesson by Richard Sutton: the short essay whose argument the report leans on, that hand-engineered mechanisms lose to general methods that scale with computation.
    • Attention Is All You Need: the original Transformer paper that introduced the dense attention whose quadratic cost SSA is built to remove.
    • RULER (arXiv): NVIDIA’s long-context benchmark that the report uses as its primary retrieval signal, and that fixed-pattern sparse methods historically struggle with.
    • Retrieval-augmented generation (Wikipedia): background on the RAG approach that the report frames as scaffolding around context scarcity rather than a permanent fixture.
  • Claude Fable 5 and Claude Mythos 5: Anthropic Ships Its First Generally Available Mythos-Class AI Model With New Safeguards

    Anthropic has launched Claude Fable 5 and Claude Mythos 5, the first Mythos-class models offered beyond a tiny circle of cyber defenders. Fable 5 is the generally available version, wrapped in a new layer of safeguards, while Mythos 5 is the same underlying model with some of those guardrails lifted for a small group of vetted partners. The pair sits a full tier above the Opus class in raw capability, and the launch is as much a story about how Anthropic is choosing to gate that capability as it is about the benchmarks. Below is a full breakdown of what shipped, what the model can do, and why the safeguard design matters.

    TLDR

    Anthropic released Claude Fable 5, a Mythos-class model that is now its most capable generally available model, posting state-of-the-art results across software engineering, knowledge work, vision, memory, and scientific research. To ship it safely and fast, Fable 5 carries new safety classifiers that route flagged queries in cybersecurity, biology and chemistry, and distillation over to Claude Opus 4.8 instead of refusing, a fallback that triggers in under 5% of sessions. The same model ships without cyber safeguards as Claude Mythos 5 for Project Glasswing partners in collaboration with the US Government, where it is described as having the strongest cybersecurity capabilities of any model in the world. Highlights include a codebase-wide migration of a 50-million-line Ruby codebase that Stripe says took a day instead of two months, beating Pokemon FireRed with a vision-only harness, accelerating drug design roughly tenfold using Mythos 5, producing novel molecular biology hypotheses preferred by scientists about 80% of the time, and over a week of autonomous genomics research. Both models cost 10 dollars per million input tokens and 50 dollars per million output tokens, less than half the price of Mythos Preview, with a staged subscription rollout and a new 30-day data retention policy for Mythos-class traffic.

    Thoughts

    The most interesting decision here is not the capability jump, it is the naming split. Fable and Mythos are the same brain. The only difference is whether the safeguards are on. Anthropic is effectively shipping one model twice: a gated public edition and an ungated edition handed to a short list of trusted defenders working with the US Government. That is a clean way to resolve the central tension of frontier AI, which is that the exact capabilities that help a security professional close a vulnerability also help an attacker find one. Rather than dumbing the model down for everyone or holding it back entirely, they are letting the access list, not the weights, carry the risk. Expect this pattern to repeat as capabilities climb.

    The fallback-to-Opus design is the other quietly important choice. When a classifier flags a query in cybersecurity, biology, chemistry, or suspected distillation, the user does not hit a wall of refusal. The request is handed to Opus 4.8, a model that is still excellent at almost everything, and the user is told that the fallback happened. Graceful degradation beats a hard no, both for user experience and for trust. It also reframes what a safeguard is. Instead of a binary block, it becomes a routing decision, and because more than 95% of sessions never trigger it, most users will never encounter it. The honest admission that the classifiers are tuned conservatively and will sometimes catch harmless requests is the right posture, even if it will annoy power users who keep getting bounced to the lower-tier model.
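
    The routing idea can be sketched in a few lines. Everything here is hypothetical: the model identifiers, the `Route` type, and the keyword-based `toy_classifier` are stand-ins invented for illustration, and Anthropic's actual classifiers and serving logic are not described at this level in the announcement.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical identifiers, not real API model names.
PRIMARY_MODEL = "fable-5"
FALLBACK_MODEL = "opus-4.8"


@dataclass
class Route:
    model: str
    notice: Optional[str]  # message shown to the user when a fallback happens


def route(query: str, classifier: Callable[[str], Optional[str]]) -> Route:
    """Send flagged queries to the fallback model instead of refusing them.

    `classifier` returns the flagged category name, or None if nothing is flagged.
    """
    category = classifier(query)
    if category is None:
        return Route(PRIMARY_MODEL, None)
    return Route(FALLBACK_MODEL, f"Answered by the fallback model ({category} safeguard).")


def toy_classifier(query: str) -> Optional[str]:
    """Stand-in for a learned classifier: a keyword match, for illustration only."""
    keywords = {"exploit": "cybersecurity", "pathogen": "biology and chemistry"}
    lowered = query.lower()
    for word, category in keywords.items():
        if word in lowered:
            return category
    return None


print(route("Refactor this Ruby module", toy_classifier))
print(route("Write an exploit for this CVE", toy_classifier))
```

    The point of the design is that the safeguard changes which model answers, not whether the user gets an answer, and the notice field reflects the announcement's statement that users are told when a fallback happens.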

    The commercial signals are worth reading closely. Pricing came down to less than half of Mythos Preview, which suggests confidence in serving costs at scale, but the subscription rollout tells a more cautious story. Fable 5 is free on Pro, Max, Team, and Enterprise plans only through June 22, after which using it requires usage credits until capacity catches up. That is a polite way of saying demand is expected to badly outrun supply. The model is fully available on the API and consumption-based Enterprise plans from day one, because those bill by the token and self-throttle. Subscriptions, which are all-you-can-eat, are where a capacity crunch actually hurts, so that is exactly where the brakes went on.

    On the science, the genomics result is the one that should make people sit up. A model doing over a week of largely autonomous research, assembling single-cell data across 138 species, then designing and training its own machine learning model that outperforms a recently published Science paper while being 100 times smaller, is a different category of claim than acing a benchmark. So is the drug-design work, where Mythos 5 reportedly matches or beats skilled human operators end to end, choosing binding sites, running protein design tools, and recovering from its own failures. If those hold up to publication and independent replication, the interesting frontier stops being chat quality and becomes whether a model can run a research program. That is also precisely why the biology and chemistry classifier exists, and why Anthropic is being so deliberate about who gets the ungated version.

    One caveat worth keeping in view: nearly all of the evidence in the announcement is Anthropic’s own, or comes from partners with early access and an incentive to be enthusiastic. The Stripe migration, the FrontierCode score, the Slay the Spire memory result, the protein targets, and the genomics model are all compelling, but they remain first-party claims until the eventual system card is published and outside labs, peer reviewers, and independent red-teamers weigh in. The note that the UK AISI made progress toward a universal jailbreak inside a brief testing window is a useful reminder that the safeguard story is a work in progress, not a finished proof.

    Key Takeaways

    • Claude Fable 5 is a Mythos-class model made safe for general use, and is now Anthropic’s most capable generally available model.
    • Mythos-class is a tier that sits above the Opus class in capability. The first was Claude Mythos Preview, released in April through Project Glasswing.
    • Fable 5 is state-of-the-art on nearly all tested benchmarks, and its lead grows as tasks get longer and more complex.
    • Claude Mythos 5 is the same underlying model as Fable 5, but with safeguards lifted in some areas. Fable and Mythos differ only by their safeguards.
    • Mythos 5 is described as having the strongest cybersecurity capabilities of any model in the world, and is deployed through Project Glasswing with the US Government.
    • New safety classifiers cover cybersecurity, biology and chemistry, and distillation. Flagged queries fall back to Claude Opus 4.8 rather than being refused.
    • Users are told whenever a fallback happens. More than 95% of Fable sessions involve no fallback at all, and for those sessions Fable performs effectively the same as Mythos 5.
    • The safeguards are tuned conservatively and trigger in less than 5% of sessions on average, sometimes catching harmless requests. Anthropic plans to reduce false positives after launch.
    • Stripe reported Fable 5 compressed months of engineering into days. In a single day it completed a codebase-wide migration of a 50-million-line Ruby codebase that would have taken a team over two months by hand.
    • Fable 5 scores highest among frontier models on Cognition’s FrontierCode evaluation for high-quality agentic coding, even at medium effort, and is more token-efficient than past Claude models.
    • On Hebbia’s Finance Benchmark for senior-level reasoning, Fable 5 has the highest score of any model, with gains in document reasoning, chart and table interpretation, and problem solving.
    • IMC noted Fable 5 aced their trading-analysis evaluations nearly across the board, including factual lookup, conceptual reasoning, root-cause analysis, and expected-value analysis.
    • Fable 5 is the new state-of-the-art for vision, and can rebuild a web app’s source code from screenshots alone.
    • Fable 5 beat Pokemon FireRed using a minimal, vision-only harness with no maps, navigation aids, or extra game-state information. Earlier Claude models needed a complex helper harness.
    • Persistent file-based memory improved Fable 5’s Slay the Spire performance three times more than it did for Opus 4.8, and Fable reached the game’s final act three times more often.
    • Fable 5 built a simulation of the solar system, deriving the planets’ orbital motion from physics first principles and using it to predict solar eclipses.
    • Using Mythos 5, internal protein design experts accelerated aspects of drug design by around ten times, with the model matching or beating skilled human operators end to end.
    • Nine of 14 protein targets in the drug-design study yielded strong candidates Anthropic is now investigating.
    • Mythos 5 is Anthropic’s first model to consistently produce novel, compelling scientific hypotheses. Scientists preferred its molecular biology hypotheses about 80% of the time in blinded comparisons.
    • One Mythos hypothesis, a novel mechanism for an E. coli protein, was corroborated by an independent lab working on the same problem.
    • In over a week of largely autonomous work, Mythos 5 assembled single-cell data for millions of cells across 138 animal species and trained a custom model that outperformed a recent Science paper while being 100 times smaller.
    • Anthropic’s automated alignment assessment found Mythos 5’s level of misaligned behavior was low and similar to Opus 4.8. Because they are the same model, Fable 5’s alignment is similar.
    • An external bug bounty produced no universal jailbreaks in over 1,000 hours of testing, though the UK AISI made progress toward one in a brief initial window.
    • One external partner found Fable 5’s safeguards against harmful cyber queries the most robust of any model tested, including Opus 4.8 and Opus 4.7, with zero compliance on harmful single-turn cyberattack requests.
    • The biology and chemistry classifier is deliberately broad for now. Mythos-class models outperformed dedicated protein language models at predicting AAV viral shell assembly using biological reasoning alone.
    • The distillation classifier targets large-scale attempts to extract Claude’s capabilities to train competing models, which could proliferate near-frontier capabilities without safeguards.
    • A new policy requires 30-day data retention for all Mythos-class traffic on first- and third-party surfaces, used only for safety, with logged human access and deletion after 30 days in almost all cases.
    • Anthropic plans trusted access programs that let cybersecurity organizations apply for Mythos 5, and let a small number of life science researchers access Fable 5 with biology and chemistry safeguards removed.
    • Both models cost 10 dollars per million input tokens and 50 dollars per million output tokens, less than half the price of Mythos Preview. Developers can use claude-fable-5 via the Claude API.
    • Fable 5 is free on Pro, Max, Team, and seat-based Enterprise plans through June 22. On June 23 it moves to usage credits on those plans until capacity allows it to return as a standard inclusion.

    Detailed Summary

    A Mythos-class model, made safe for general use

    Fable 5 is the first Mythos-class model Anthropic has made generally available. Mythos-class is a tier that sits above the Opus class, and the first of its kind, Claude Mythos Preview, was released in April through Project Glasswing to a limited group of cyber defenders and critical software infrastructure providers. The company framed today’s launch as the moment it could finally bring that level of capability to all users, because its safeguards had matured enough to allow it. Fable 5’s capabilities exceed those of any model Anthropic has made generally available, and its advantage over other models grows as tasks get longer and more complex.

    Two models, one brain

    Claude Mythos 5 is the same underlying model as Fable 5, but with safeguards lifted in some areas. The names signal the kinship: Fable, from the Latin fabula meaning that which is told, is akin to the Greek mythos. The safeguards are the only thing that distinguishes the two. Mythos 5 launches first to existing Mythos Preview users, including the Project Glasswing cybersecurity partners, as an upgrade. It is deployed in collaboration with the US Government and is described as having the strongest cybersecurity capabilities of any model in the world. Anthropic plans to steadily expand access through a more systematic trusted access program.

    Software engineering and token efficiency

    Fable 5 can work autonomously for longer than any previous Claude model, and software engineering is where that shows most clearly. During early testing, Stripe reported it compressed months of engineering into days, completing in a single day a codebase-wide migration of a 50-million-line Ruby codebase that would otherwise have taken a whole team over two months by hand. It also scored highest among frontier models on Cognition’s FrontierCode evaluation for high-quality, maintainable agentic coding, even at medium effort, and it is more token-efficient than past Claude models.

    Knowledge work, vision, and memory

    On complex analytical work, Fable 5 posted the highest score of any model on Hebbia’s Finance Benchmark for senior-level reasoning, with substantial gains in document-based reasoning and chart and table interpretation, and IMC said it aced their trading-analysis evaluations nearly across the board. In vision, it is the new state-of-the-art, able to extract precise numbers from detailed scientific figures and rebuild a web app’s source code from screenshots alone. It needs less scaffolding too: where earlier Claude models struggled to play Pokemon even with helper harnesses, Fable 5 beat FireRed with a minimal, vision-only harness using nothing but raw game screenshots. On memory, giving Fable persistent file-based notes improved its Slay the Spire performance three times more than it did for Opus 4.8. Separately, Fable 5 built a solar system simulation from physics first principles that was accurate enough to predict solar eclipses.

    Life sciences: drug design, hypotheses, and genomics

    Using Mythos 5, Anthropic’s internal protein design experts accelerated aspects of the drug-design process by around ten times. With protein design and bioinformatics tools but no human assistance, the model matched or beat skilled human operators, executing the full workflow of choosing binding sites, selecting and running design tools, and recovering from failures. Nine of 14 protein targets yielded strong drug-design candidates now under investigation. Mythos 5 is also Anthropic’s first model to consistently produce novel, compelling scientific hypotheses: scientists preferred its molecular biology hypotheses about 80% of the time in blinded comparisons, and one, a novel mechanism for an E. coli protein, was corroborated by an independent lab. In genomics, Mythos 5 ran over a week of largely autonomous research, assembling single-cell data for millions of cells across 138 species and training a custom model that outperformed a recent Science paper despite being 100 times smaller.

    The new safeguards: classifiers and fallback

    Mythos-class capability is potent enough that Anthropic considers it a substantial misuse risk, especially given how much advanced AI usage is dual use. Fable 5 ships with a new set of classifiers, separate AI systems that detect potential misuse and jailbreak attempts and stop the main model from responding. When a classifier flags a request related to cybersecurity, biology and chemistry, or distillation, the response is handled by Claude Opus 4.8 instead, and the user is told. The cybersecurity classifiers cover both exploitation and broader offensive cyber tasks like reconnaissance and lateral movement, and Anthropic says they prevent Fable from making any progress on those tasks. The biology and chemistry classifier is intentionally broad for now, after tests showed Mythos-class models could outperform dedicated protein language models at predicting AAV viral shell assembly using biological reasoning alone. The distillation classifier targets large-scale attempts to extract Claude’s capabilities to train competing models.
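
    The routing behavior described above can be sketched in a few lines. This is only an illustration of the control flow, not Anthropic’s implementation: the real classifiers are separate AI systems whose internals the announcement does not describe, so the keyword checks, the RoutingDecision type, the route function, and the wording of the notice are all hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

# Display labels only; these are not API model identifiers.
PRIMARY = "Fable 5"
FALLBACK = "Opus 4.8"

# A classifier here is any function that returns True when a prompt is flagged.
Classifier = Callable[[str], bool]


@dataclass(frozen=True)
class RoutingDecision:
    model: str                 # which model answers the request
    flagged_by: Optional[str]  # name of the classifier that fired, if any
    notice: Optional[str]      # message shown to the user on fallback


def route(prompt: str, classifiers: Dict[str, Classifier]) -> RoutingDecision:
    """Route to the fallback model if any classifier flags the prompt."""
    for name, is_flagged in classifiers.items():
        if is_flagged(prompt):
            return RoutingDecision(
                model=FALLBACK,
                flagged_by=name,
                notice=f"This request was answered by {FALLBACK}.",
            )
    return RoutingDecision(model=PRIMARY, flagged_by=None, notice=None)


# Toy keyword checks standing in for the real classifiers (assumption).
toy_classifiers: Dict[str, Classifier] = {
    "cybersecurity": lambda p: "exploit" in p.lower(),
    "biology and chemistry": lambda p: "pathogen" in p.lower(),
    "distillation": lambda p: "for training my own model" in p.lower(),
}

print(route("Summarize this earnings report", toy_classifiers).model)
print(route("Write an exploit for this server", toy_classifiers).model)
```

    The point of the sketch is the shape of the decision: a flagged request still gets an answer, the answer comes from a different model, and the decision carries a notice so the user can be told.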

    Jailbreak resistance, data retention, and availability

    Anthropic ran extensive red-teaming, including an external bug bounty that produced no universal jailbreaks in over 1,000 hours, though it notes the UK AISI made progress toward one in a brief window. The company concedes it is likely impossible to fully prevent universal jailbreaks and aims instead to make any that remain slow and costly enough to catch before they scale. A new policy requires 30-day data retention for all Mythos-class traffic, used only for safety, with logged human access and deletion after 30 days in almost all cases. On availability, Fable 5 is live everywhere today and fully available on the API and consumption-based Enterprise plans, while subscription access rolls out in stages: free on Pro, Max, Team, and seat-based Enterprise through June 22, then on usage credits from June 23 until capacity allows it to return as a standard inclusion. Both models cost 10 dollars per million input tokens and 50 dollars per million output tokens.
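
    To make the quoted prices concrete, here is a minimal cost estimate under the stated list prices. It assumes simple per-token billing with no discounts or other adjustments, since the summary above mentions none; the function name and the example token counts are illustrative.

```python
# Prices quoted in the announcement, in US dollars per million tokens.
INPUT_USD_PER_MTOK = 10.0
OUTPUT_USD_PER_MTOK = 50.0


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request at the quoted list prices."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    return total / 1_000_000


# Hypothetical request: a 200,000-token prompt with a 20,000-token reply.
example_cost = estimate_cost_usd(200_000, 20_000)
print(f"${example_cost:.2f}")  # prints $3.00
```

    Output tokens cost five times as much as input tokens, so in this example the 20,000-token reply accounts for a third of the bill despite being a tenth of the prompt’s length.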

    Notable Quotes

    “Today we’re launching Claude Fable 5: a Mythos-class model that we’ve made safe for general use.”

    Anthropic, opening the Claude Fable 5 and Claude Mythos 5 announcement

    “Fable 5’s capabilities exceed those of any model we’ve ever made generally available.”

    Anthropic, on where Fable 5 sits in the lineup

    “It has the strongest cybersecurity capabilities of any model in the world.”

    Anthropic, describing Claude Mythos 5

    “During early testing, Stripe reported that Fable 5 compressed months of engineering into days.”

    Anthropic, on Fable 5’s software engineering results

    “Our early data shows that more than 95% of Fable sessions involve no fallback at all.”

    Anthropic, on how often the safeguards route to Opus 4.8

    “Mythos 5 is our first model to consistently produce novel, compelling scientific hypotheses.”

    Anthropic, on the model’s molecular biology research

    “It is likely impossible to completely prevent universal jailbreaks, but our goal is to make any remaining jailbreaks sufficiently slow and costly that we can detect and prevent them before they are used at scale.”

    Anthropic, on the limits of its safeguards

    “Fable is from the Latin fabula, ‘that which is told,’ akin to the Greek mythos. The safeguards are what distinguish the two models.”

    Anthropic, explaining the Fable and Mythos naming

    Read the full announcement and the benchmark tables on Anthropic’s site here: Claude Fable 5 and Claude Mythos 5.

    Related Reading