PJFP.com

Pursuit of Joy, Fulfillment, and Purpose

Tag: token efficiency

Inkling: Thinking Machines Lab Releases Its First Open-Weights Model, a 975B Multimodal Mixture-of-Experts With Controllable Thinking Effort That Can Fine-Tune Itself on Tinker
Today, we are introducing Inkling.

Inkling reasons efficiently across text, image, and audio modalities. We are making the full weights available.https://t.co/Ghebq5mG30

Available today for fine-tuning on Tinker. Play with it in the Inkling Playground. 🧵
— Thinking Machines (@thinkymachines) July 15, 2026

Thinking Machines Lab, the AI startup founded by former OpenAI CTO Mira Murati, has released Inkling, its first open-weights model trained from scratch. Inkling is a 975 billion parameter Mixture-of-Experts transformer (41B active) with a context window of up to 1 million tokens, native multimodal reasoning over text, images, and audio, and a dial for controllable thinking effort. The lab is explicit that Inkling is not the strongest model in the world. It is pitched as something arguably more useful: a broad, balanced, customizable foundation you can fine-tune on Tinker, with the full weights on Hugging Face. The announcement even includes a demo where Inkling fine-tunes itself and swaps in its own new weights.

TLDR

Thinking Machines Lab released Inkling, a 975B-total, 41B-active Mixture-of-Experts model pretrained on 45 trillion tokens of text, images, audio, and video, alongside a preview of Inkling-Small (276B total, 12B active). The release covers the model’s generalist benchmark profile across reasoning, agentic coding, tool use, vision, and audio; a controllable thinking effort setting that lets developers trade performance against tokens (matching Nemotron 3 Ultra on Terminal Bench 2.1 at roughly a third of the tokens); an encoder-free multimodal architecture using dMel spectrograms and hMLP image patches; a training recipe combining Muon and Adam with weight decay coupled to the learning rate; RL scaled past 30 million rollouts with log-linearly improving reasoning and an emergent compression of the chain of thought; an epistemics push covering calibration, forecasting (where it beats several frontier models), abstention, and censorship resistance; the strongest FORTRESS adversarial safety score among compared open-weights models; a headline-grabbing demo of the model fine-tuning itself into a lipogram assistant via Tinker; and day-one availability on Tinker (at a 50% discount), Hugging Face, and inference partners including Together, Fireworks, Modal, Databricks, Baseten, vLLM, SGLang, and llama.cpp.

Thoughts

The most striking thing about this launch is its honesty. Nearly every frontier release leads with a claim to be the best at something, and the fine print walks it back. Thinking Machines Lab says plainly that Inkling is not the strongest model available, open or closed, and then makes the case that “strongest” is the wrong axis for most real buyers. If you are going to run a model millions of times inside a product, what you care about is the cost curve, the adaptability, and whether you can shape it to your workflow. That framing conveniently matches their business (Tinker sells fine-tuning), but it also matches how production AI actually gets deployed, where cost and latency are binding constraints and a benchmark crown is trivia.

The self-fine-tuning demo deserves more attention than it will probably get. Asked to become a lipogram assistant that never uses the letter “e” (a behavior prompting alone cannot reliably produce), Inkling wrote its own training objective and scoring function, generated its own synthetic data, launched the run on Tinker, evaluated the result against its base self, and then staged a weight swap so the improved checkpoint took over the session. That is a closed loop of specify, train, evaluate, and self-update, packaged as a cute product demo. The loop is the primitive behind every serious conversation about recursive self-improvement, and here it is running as a marketing asset with a 27 minute wall clock. The gap between “toy objective” and “economically meaningful objective” is now a question of reward design, not plumbing.

Controllable thinking effort is the feature I expect developers to care about most. Instead of publishing a single score, TML publishes a curve: sweep the effort setting from 0.2 to 0.99 and watch performance trade against generated tokens. Inkling reportedly matches Nemotron 3 Ultra on Terminal Bench 2.1 while spending about a third of the tokens. Benchmarks reported as single points hide exactly this, and a model that reaches a target score cheaply beats a model that scores two points higher at triple the cost in any high-volume workload. Expect effort curves to become standard marketing for open models, the way context length became standard a couple of years ago.

The epistemics section is quietly the most differentiated part of the release. TML trained calibration directly, running RL against proper scoring rules on resolved real-world questions, and pairing a rubric grader with a claims grader that does agentic web search to verify each factual assertion. The result is a model that beats GPT-5.5 and Claude Opus 4.8 on ForecastBench without search and holds its own on Prophet Arena. A model that knows when to say “I don’t know” is more useful across messy real-world domains than one that confabulates confidently, and it is notable that a lab whose stated mission is extending human will and judgment treats calibrated uncertainty as a first-class training target rather than a safety afterthought. The censorship-resistance training, validated on Cognition’s Propaganda and Censorship Eval, extends the same idea: trustworthiness as a capability you train, not a policy you bolt on.

Finally, the open-weights safety tension is handled with unusual candor. Inkling posts the strongest adversarial FORTRESS score among the open models compared while keeping benign over-refusal low, and it was tested externally for CBRN, cyber, and loss-of-control capabilities. But everyone in this space knows fine-tuning can strip safety behavior from open weights, and TML ships a fine-tuning platform for this exact model. Their acknowledgment that they are actively studying how safety behavior survives fine-tuning on Tinker is the right thing to say, and it is also the open question that will define whether “safe open weights” is a coherent category at all.

Key Takeaways
- Inkling is Thinking Machines Lab’s first from-scratch, open-weights model: a Mixture-of-Experts transformer with 975B total parameters, 41B active, and a context window up to 1M tokens.
- It was pretrained on 45 trillion tokens spanning text, images, audio, and video, and reasons natively over text, images, and audio without separate encoders.
- A preview of Inkling-Small ships alongside it: a 276B-parameter MoE with just 12B active parameters that matches or beats its larger sibling on several benchmarks thanks to an improved pretraining recipe.
- TML explicitly positions Inkling as a base for customization rather than the strongest overall model, leaning on multimodality, efficient thinking, and Tinker fine-tuning as the differentiators.
- The launch demo shows Inkling fine-tuning itself: it wrote its own training objective and data, ran the job through the Tinker API, evaluated the result, and hot-swapped to its own new weights inside the OpenCode harness.
- The self-fine-tuning target was a lipogram assistant that never uses the letter “e,” a behavior chosen precisely because prompting alone cannot reliably achieve it; the full loop completed in about 27 minutes.
- Controllable thinking effort is a core feature: a setting swept from 0.2 to 0.99 traces a full performance-versus-tokens curve instead of a single benchmark point.
- On Terminal Bench 2.1, Inkling matches Nemotron 3 Ultra’s score at roughly one third of the generated tokens, the release’s flagship efficiency claim.
- Inkling was trained to run inside a variety of coding and agent harnesses, with tool sets and schemas randomized during training to reduce sensitivity to any particular harness.
- On Design Arena’s blinded human-evaluated Agentic Web Dev leaderboard, Inkling scores 1257, among the strongest open-weights models and tied with Claude Opus 4.6.
- Headline benchmark scores at effort 0.99 include SWEBench Verified 77.6%, SWEBench Pro Public 54.3%, Terminal Bench 2.1 63.8%, GPQA Diamond 87.2%, AIME 2026 97.1%, and HLE 29.7% text-only (46.0% with tools).
- Agentic and general scores include MCP Atlas 74.1%, Tau 3 Banking 23.7%, and BrowseComp 77.1% with context management.
- Vision results are strong for an open model: MMMU Pro 73.5%, CharXiv RQ 78.1%, rising to 82.0% when the model uses a Python tool for zooming and cropping during visual reasoning.
- Audio results place it among the strongest open-weights audio models: VoiceBench 91.4%, MMAU 77.2%, and Audio MC 56.6%, well ahead of Qwen3-Omni and Nemotron Nano-Omni on the last.
- The multimodal stack is encoder-free: audio enters as discrete dMel spectrograms and images as 40×40 pixel patches through a four-layer hMLP, both passed through a lightweight embedding layer and processed jointly with text tokens.
- The MoE design largely follows DeepSeek-V3: 256 routed experts plus 2 shared experts per layer, 6 routed experts active per token, with a sigmoid router and auxiliary-loss-free load balancing.
- Attention interleaves sliding-window and global layers at a 5:1 ratio with 8 KV heads, and uses a learned relative positional embedding instead of RoPE, which TML found extrapolates better to long sequences.
- Short convolutions are applied after the key and value projections and on the attention and MLP residual branch outputs, an unusual architectural touch aimed at efficiency and long-context performance.
- Training used a hybrid optimizer strategy, Muon for large matrix weights and Adam for everything else, with weight decay coupled to the square of the learning rate to keep weight magnitudes stable.
- Post-training was bootstrapped with a small SFT phase on synthetic data generated by open-weights models including Kimi K2.5, with the large majority of compute spent on large-scale RL.
- RL was scaled past 30 million rollouts across two long continuous runs, with reasoning performance on a held-out aggregate (AIME, HLE, GPQA, and others) improving log-linearly throughout.
- Effort control was trained by varying the system message and per-token cost across rollouts, teaching the model to modulate its own thinking budget.
- An emergent effect appeared during RL: the chain of thought compressed over training, dropping articles and connectives into a telegraphic style, driven purely by efficiency pressure rather than any targeted reward.
- Inkling was TML’s first major training effort and ran on NVIDIA GB300 NVL72 systems; the lab says future models will push compute scale further across pretraining and RL.
- Calibration was trained directly with RL against proper scoring rules on a large corpus of resolved real-world questions, treating well-placed confidence as a capability rather than a byproduct.
- On ForecastBench without search, Inkling’s Brier Index of 61.1 beats GPT-5.5 (59.1) and Claude Opus 4.8 (54.6), and it stays competitive with search enabled and on Prophet Arena.
- Instruction following was trained with two automated graders working together: a rubric grader scoring against a checklist and a claims grader that verifies each factual claim via agentic web search, improving helpfulness and reducing hallucination simultaneously.
- Abstention-aware rewards on short-form factual QA taught the model to answer when confident and hedge or decline when not, with some prompts explicitly forcing or forbidding hedging so the user’s preference wins.
- Inkling was trained to answer directly on topics subject to censorship, and Cognition’s Propaganda and Censorship Eval found strong censorship non-compliance.
- On FORTRESS, Inkling posts the strongest adversarial refusal score (78.0%) of any compared open-weights model while keeping benign compliance high (95.9%), and scores 98.6% on StrongREJECT.
- Safety testing covered CBRN, cyber, and loss-of-control capabilities plus human-AI threat vectors like sycophancy, vulnerable users, and manipulation, verified by commissioned external testers.
- Inkling is available for fine-tuning on Tinker today with 64K and 256K context options at a 50% limited-time discount, plus a free Inkling Playground chat interface in the Tinker console.
- Full weights are on Hugging Face, including an NVFP4 checkpoint for efficient inference on NVIDIA Blackwell, with API availability via Together, Fireworks, Modal, Databricks, and Baseten and inference support in SGLang, vLLM, TokenSpeed, and llama.cpp.
- TML frames Inkling as the first in a family and as the intended background reasoning model for its previously announced real-time interaction models system.
Detailed Summary

What Inkling Is and Why It Exists

Thinking Machines Lab frames its mission as building AI that extends human will and judgment, and Inkling as the logical next step after shipping the Tinker customization platform, previewing an interaction-focused AI system, and publishing research. Inkling is a Mixture-of-Experts transformer with 975B total and 41B active parameters, a context window up to 1M tokens, and pretraining on 45 trillion tokens of mixed text, image, audio, and video data. The lab is upfront that it is not the strongest model available. The pitch is breadth plus adaptability: a generalist trained across agentic, reasoning, coding, instruction-following, factuality, vision, and audio tasks rather than tuned to dominate one leaderboard, offered with full weights so people can make it their own. It launches with a preview sibling, Inkling-Small, at 276B total and 12B active parameters.

The Self-Fine-Tuning Demo

To demonstrate what customization means, TML asked Inkling to fine-tune itself. Running inside the OpenCode harness with access to Tinker, the model was told to become a lipogram assistant that never uses the letter “e.” Inkling drafted the plan, wrote an objective file with a scoring function (any response containing “e” scores zero), generated synthetic training data, launched a supervised fine-tuning run through the Tinker API, evaluated the checkpoint against its base self, and then staged a self-update so the supervisor relaunched the session on the new weights. The pipeline passed in about 27 minutes, and the updated model answered a test question about launching an LLM without a single “e.” It is a whimsical objective wrapped around a serious primitive: a model autonomously specifying, running, and adopting its own weight updates.

Agentic Coding and Tool Use

TML trained Inkling to operate inside many coding and agent harnesses, randomizing tool sets and schemas during training so the model does not overfit to one environment. The release showcases three demos: a one-shot job-application web app that then hosts an embedded browser-use agent operating its own interface; a nine-page, cohesively designed PDF food and travel journal produced from a single editorial prompt with web-verified details; and a server-authoritative multiplayer snake game refined over 40 iterations of feedback from GPT Codex acting as a reviewer. On benchmarks, Inkling posts 77.6% on SWEBench Verified, 54.3% on SWEBench Pro Public, and 63.8% on Terminal Bench 2.1, competitive within the open-weights field, and 1257 on Design Arena’s human-judged web dev leaderboard, in the same band as Claude Opus 4.6.

Controllable Thinking Effort

Rather than reporting a single operating point, TML sweeps Inkling’s effort setting from 0.2 to 0.99 and plots score against mean generated tokens on Terminal Bench 2.1, HLE, and IFBench, with competitors shown at their default settings. The headline result is efficiency: Inkling reaches Nemotron 3 Ultra’s Terminal Bench score at roughly a third of the tokens. The argument is that cost and latency are binding constraints in production, especially for interactive collaboration, so the full cost curve, not the peak score, is what developers should evaluate. Effort can be set from within the agent harness, and the ability was trained by varying system messages and per-token costs across RL rollouts.

Native Multimodality Without Encoders

Inkling is designed to serve as the background reasoning model for TML’s interaction models system, which requires real-time voice and vision collaboration. The multimodal components are trained from scratch with an encoder-free architecture: audio arrives as discrete dMel spectrograms and images as 40×40 pixel patches through a four-layer hMLP, both mapped through a lightweight embedding layer and processed jointly with text. The model transcribes speech, follows spoken instructions, reasons over long recordings, and answers questions about charts and diagrams, optionally using a Python tool to zoom and crop images mid-reasoning. Scores like 91.4% on VoiceBench and 82.0% on CharXiv RQ with Python place it among the strongest open-weights multimodal models, though still behind Gemini 3.1 Pro.

Epistemics: Calibration, Forecasting, and Censorship Resistance

TML groups calibration, instruction following, and censorship resistance under the banner of epistemics. Calibration was trained with RL against proper scoring rules on resolved real-world questions, and it shows: Inkling’s ForecastBench Brier Index of 61.1 without search beats GPT-5.5 and Claude Opus 4.8, and its Prophet Arena score sits close to the frontier. Instruction following used two complementary automated graders, a rubric checklist and a claims grader that verifies factual assertions through agentic web search, so recall-spraying to hack rubrics gets penalized by the factuality check. Targeted abstention-aware QA datasets taught the model to say “I don’t know” or give hedged best guesses when appropriate, while still complying when a user demands a forced guess. Finally, the model was trained to answer directly on censorship-prone topics, with Cognition’s Propaganda and Censorship Eval finding strong non-compliance with censorship patterns.

Safety for an Open-Weights Release

Inkling was trained to an internal behavioral spec across all modalities and then checked by commissioned external safety testers. Evaluations covered dangerous capabilities (CBRN, cyber, loss of control) and human-AI threat vectors including sycophancy, vulnerable users, and harmful manipulation. On FORTRESS, which pairs adversarial harmful requests with benign look-alikes, Inkling posts the strongest adversarial score among the compared open models (78.0%) without collapsing on the benign side (95.9%), and it scores 98.6% on StrongREJECT. TML acknowledges the open question hanging over every open-weights release: how safety behavior holds up under fine-tuning, which it says it is actively studying on Tinker.

Architecture and Training Recipe

The MoE layout follows DeepSeek-V3: 256 routed experts and 2 shared experts per layer with 6 routed experts active per token, a sigmoid-based router, and auxiliary-loss-free load balancing. Attention interleaves sliding-window and global layers 5:1 with 8 KV heads, and positions are encoded with a learned relative positional embedding that TML found outperforms and out-extrapolates RoPE. Short convolutions appear after the key and value projections and on residual branch outputs. Optimization was hybrid, Muon for large matrices and Adam elsewhere, with hyperparameter schedules drawn from the lab’s modular manifolds research and weight decay coupled to the square of the learning rate to keep weight norms stable. Post-training bootstrapped from a small SFT phase on synthetic data from open models including Kimi K2.5, then spent the bulk of compute on large-scale RL. Everything ran on NVIDIA GB300 NVL72 systems.

RL at Scale and the Emergent Compression of Thought

TML scaled asynchronous RL past 30 million rollouts across two long continuous runs, with performance on a held-out aggregate of reasoning evals improving log-linearly the whole way. Along the way an unplanned behavior emerged: the chain of thought became progressively more concise, shedding grammatical overhead into a telegraphic style (“We need to understand” becomes “We need determine”) while remaining comprehensible and leaving final answers unaffected. No reward targeted this; token efficiency pressure alone drove the compression, echoing an observation Cognition made while training SWE-1.7. It is a vivid example of optimization discovering its own shorthand.

Inkling-Small

The preview of Inkling-Small is arguably the sleeper story: with 12B active parameters against Inkling’s 41B, it matches or exceeds the larger model on a surprising number of benchmarks, including GPQA Diamond (88.3% vs 87.2%), IFBench (83.4% vs 79.8%), and CharXiv RQ with Python (83.4% vs 82.0%). TML attributes this to pretraining data and recipe improvements made after the big model trained, with both models sharing the same post-training stack. The clearest gaps favoring big Inkling are factuality (SimpleQA 43.9% vs 20.9%), Terminal Bench, and Tau 3 Banking. Full weights for Inkling-Small will be released once testing finishes, and its cost and latency profile targets high-volume workloads like coding, LLM grading, and synthetic data generation.

Availability and the Ecosystem Play

Inkling is on Tinker today with 64K and 256K context options at a limited-time 50% discount, plus a free Inkling Playground chat interface with integrated web search in the Tinker console so developers can get a feel for the model before committing to a run. The cookbook gained native Inkling support and three new audio recipes, and a new tml-renderer handles chat templates, tool calls, reasoning content, and multimodal inputs. Deployment partnerships span Together, Fireworks, Modal, Databricks, and Baseten for APIs; RadixArk for SGLang and Miles; Inferact for vLLM; Lightseek for TokenSpeed; Unsloth for llama.cpp; and Hugging Face for transformers integration. Full weights are on Hugging Face in both the original checkpoint and an NVFP4 checkpoint for NVIDIA Blackwell inference.

Notable Quotes

“Our mission is to build AI that extends human will and judgment.”
Thinking Machines Lab, opening the Inkling announcement

The company’s north star, and the lens through which the whole release (customization, calibration, open weights) is framed.

“Inkling is not the strongest overall model available today, open or closed. Instead, a combination of qualities makes it a good open-weights base for customization: multimodal capabilities, efficient thinking, and availability on Tinker for fine-tuning.”
Thinking Machines Lab, positioning the release

A rare piece of launch-day honesty from a frontier lab, and the strategic thesis of the whole release.

“Picking the right base model to fine-tune is a qualitative judgment that combines measurable benchmarks with the unique feel of a model that comes from playing with it.”
Thinking Machines Lab, on why the Inkling Playground exists

An argument that vibes are data, from the lab that built a playground into a fine-tuning console.

“Cost and latency are often binding constraints in real-world applications, and low latency in particular is crucial for enabling collaboration and improvement through iteration.”
Thinking Machines Lab, on controllable thinking effort

The case for evaluating models on their full effort-versus-performance curve instead of a single benchmark point.

“A model that’s confident in every answer it gives, including when it’s missing info and confabulates, forces the user to double-check everything.”
Thinking Machines Lab, on why calibration was a training target

The clearest one-line justification for treating calibrated uncertainty as a capability rather than a nicety.

“Together, the two graders improve helpfulness and reduce hallucination at the same time, rather than trading one for the other.”
Thinking Machines Lab, on pairing a rubric grader with a web-searching claims grader

A neat solution to rubric hacking: verify every claim with agentic search so spraying plausible facts stops paying.

“Safety is crucial for open-weights models. We’re continuing to study safety behavior and capability uplift in customizable models, including how safety behavior is impacted by fine-tuning on Tinker.”
Thinking Machines Lab, on the open question of fine-tunable safety

The acknowledgment that safety trained into open weights must survive the very customization the product sells.

“Inkling is just the start: our first release in a model family we will continue to build on.”
Thinking Machines Lab, on the roadmap

Together with the GB300 compute note, a clear signal that larger and stronger family members are coming.

Read the full announcement, including the interactive demos, effort curves, and complete benchmark tables, on the Thinking Machines Lab blog.

Related Reading
- Thinking Machines Lab the lab’s official site, with its research blog and the Tinker fine-tuning platform behind this release.
- Mira Murati (Wikipedia) background on the former OpenAI CTO who founded Thinking Machines Lab.
- Mixture of experts (Wikipedia) a primer on the sparse architecture that lets a 975B model run with only 41B active parameters.
- Brier score (Wikipedia) the proper scoring rule behind the ForecastBench and Prophet Arena calibration results discussed above.
- The launch announcement on X the thread where Thinking Machines Lab introduced Inkling to the world.
July 15, 2026
Claude Fable 5 and Claude Mythos 5: Anthropic Ships Its First Generally Available Mythos-Class AI Model With New Safeguards
Anthropic has launched Claude Fable 5 and Claude Mythos 5, the first Mythos-class models offered beyond a tiny circle of cyber defenders. Fable 5 is the generally available version, wrapped in a new layer of safeguards, while Mythos 5 is the same underlying model with some of those guardrails lifted for a small group of vetted partners. The pair sits a full tier above the Opus class in raw capability, and the launch is as much a story about how Anthropic is choosing to gate that capability as it is about the benchmarks. Below is a full breakdown of what shipped, what the model can do, and why the safeguard design matters.

TLDR

Anthropic released Claude Fable 5, a Mythos-class model that is now its most capable generally available model, posting state-of-the-art results across software engineering, knowledge work, vision, memory, and scientific research. To ship it safely and fast, Fable 5 carries new safety classifiers that route flagged queries in cybersecurity, biology and chemistry, and distillation over to Claude Opus 4.8 instead of refusing, a fallback that triggers in under 5% of sessions. The same model ships without cyber safeguards as Claude Mythos 5 for Project Glasswing partners in collaboration with the US Government, where it is described as having the strongest cybersecurity capabilities of any model in the world. Highlights include a codebase-wide migration of a 50-million-line Ruby codebase that Stripe says took a day instead of two months, beating Pokemon FireRed with a vision-only harness, accelerating drug design roughly tenfold using Mythos 5, producing novel molecular biology hypotheses preferred by scientists about 80% of the time, and over a week of autonomous genomics research. Both models cost 10 dollars per million input tokens and 50 dollars per million output tokens, less than half the price of Mythos Preview, with a staged subscription rollout and a new 30-day data retention policy for Mythos-class traffic.

Thoughts

The most interesting decision here is not the capability jump, it is the naming split. Fable and Mythos are the same brain. The only difference is whether the safeguards are on. Anthropic is effectively shipping one model twice: a gated public edition and an ungated edition handed to a short list of trusted defenders working with the US Government. That is a clean way to resolve the central tension of frontier AI, which is that the exact capabilities that help a security professional close a vulnerability also help an attacker find one. Rather than dumbing the model down for everyone or holding it back entirely, they are letting the access list, not the weights, carry the risk. Expect this pattern to repeat as capabilities climb.

The fallback-to-Opus design is the other quietly important choice. When a classifier flags a query in cybersecurity, biology, chemistry, or suspected distillation, the user does not hit a wall of refusal. The request is silently handed to Opus 4.8, a model that is still excellent at almost everything. Graceful degradation beats a hard no, both for user experience and for trust. It also reframes what a safeguard is. Instead of a binary block, it becomes a routing decision, and because more than 95% of sessions never trigger it, most users will never notice it exists. The honest admission that the classifiers are tuned conservatively and will sometimes catch harmless requests is the right posture, even if it will annoy power users who keep getting bounced to the smaller model.

The commercial signals are worth reading closely. Pricing came down to less than half of Mythos Preview, which suggests confidence in serving costs at scale, but the subscription rollout tells a more cautious story. Fable 5 is free on Pro, Max, Team, and Enterprise plans only through June 22, after which using it requires usage credits until capacity catches up. That is a polite way of saying demand is expected to badly outrun supply. The model is fully available on the API and consumption-based Enterprise plans from day one, because those bill by the token and self-throttle. Subscriptions, which are all-you-can-eat, are where a capacity crunch actually hurts, so that is exactly where the brakes went on.

On the science, the genomics result is the one that should make people sit up. A model doing over a week of largely autonomous research, assembling single-cell data across 138 species, then designing and training its own machine learning model that outperforms a recently published Science paper while being 100 times smaller, is a different category of claim than acing a benchmark. So is the drug-design work, where Mythos 5 reportedly matches or beats skilled human operators end to end, choosing binding sites, running protein design tools, and recovering from its own failures. If those hold up to publication and independent replication, the interesting frontier stops being chat quality and becomes whether a model can run a research program. That is also precisely why the biology and chemistry classifier exists, and why Anthropic is being so deliberate about who gets the ungated version.

One caveat worth keeping in view: nearly all of the evidence in the announcement is Anthropic’s own, or comes from partners with early access and an incentive to be enthusiastic. The Stripe migration, the FrontierCode score, the Slay the Spire memory result, the protein targets, and the genomics model are all compelling, but they are first-party until outside labs and the eventual system card, peer review, and independent red-teamers weigh in. The note that the UK AISI made progress toward a universal jailbreak inside a brief testing window is a useful reminder that the safeguard story is a work in progress, not a finished proof.

Key Takeaways
- Claude Fable 5 is a Mythos-class model made safe for general use, and is now Anthropic’s most capable generally available model.
- Mythos-class is a tier that sits above the Opus class in capability. The first was Claude Mythos Preview, released in April through Project Glasswing.
- Fable 5 is state-of-the-art on nearly all tested benchmarks, and its lead grows as tasks get longer and more complex.
- Claude Mythos 5 is the same underlying model as Fable 5, but with safeguards lifted in some areas. Fable and Mythos differ only by their safeguards.
- Mythos 5 is described as having the strongest cybersecurity capabilities of any model in the world, and is deployed through Project Glasswing with the US Government.
- New safety classifiers cover cybersecurity, biology and chemistry, and distillation. Flagged queries fall back to Claude Opus 4.8 rather than being refused.
- Users are told whenever a fallback happens. More than 95% of Fable sessions involve no fallback at all, and for those sessions Fable performs effectively the same as Mythos 5.
- The safeguards are tuned conservatively and trigger in less than 5% of sessions on average, sometimes catching harmless requests. Anthropic plans to reduce false positives after launch.
- Stripe reported Fable 5 compressed months of engineering into days, performing a codebase-wide migration of a 50-million-line Ruby codebase in a day that would have taken a team over two months by hand.
- Fable 5 scores highest among frontier models on Cognition’s FrontierCode evaluation for high-quality agentic coding, even at medium effort, and is more token-efficient than past Claude models.
- On Hebbia’s Finance Benchmark for senior-level reasoning, Fable 5 has the highest score of any model, with gains in document reasoning, chart and table interpretation, and problem solving.
- IMC noted Fable 5 aced their trading-analysis evaluations nearly across the board, including factual lookup, conceptual reasoning, root-cause analysis, and expected-value analysis.
- Fable 5 is the new state-of-the-art for vision, and can rebuild a web app’s source code from screenshots alone.
- Fable 5 beat Pokemon FireRed using a minimal, vision-only harness with no maps, navigation aids, or extra game-state information. Earlier Claude models needed a complex helper harness.
- Persistent file-based memory improved Fable 5’s Slay the Spire performance three times more than it did for Opus 4.8, and Fable reached the game’s final act three times more often.
- Fable 5 built a simulation of the solar system, deriving the planets’ orbital motion from physics first principles and using it to predict solar eclipses.
- Using Mythos 5, internal protein design experts accelerated aspects of drug design by around ten times, with the model matching or beating skilled human operators end to end.
- Nine of 14 protein targets in the drug-design study yielded strong candidates Anthropic is now investigating.
- Mythos 5 is Anthropic’s first model to consistently produce novel, compelling scientific hypotheses. Scientists preferred its molecular biology hypotheses about 80% of the time in blinded comparisons.
- One Mythos hypothesis, a novel mechanism for an E. coli protein, was corroborated by an independent lab working on the same problem.
- In over a week of largely autonomous work, Mythos 5 assembled single-cell data for millions of cells across 138 animal species and trained a custom model that outperformed a recent Science paper while being 100 times smaller.
- Anthropic’s automated alignment assessment found Mythos 5’s level of misaligned behavior was low and similar to Opus 4.8. Because they are the same model, Fable 5’s alignment is similar.
- An external bug bounty produced no universal jailbreaks in over 1,000 hours of testing, though the UK AISI made progress toward one in a brief initial window.
- One external partner found Fable 5’s safeguards against harmful cyber queries the most robust of any model tested, including Opus 4.8 and Opus 4.7, with zero compliance on harmful single-turn cyberattack requests.
- The biology and chemistry classifier is deliberately broad for now. Mythos-class models outperformed dedicated protein language models at predicting AAV viral shell assembly using biological reasoning alone.
- The distillation classifier targets large-scale attempts to extract Claude’s capabilities to train competing models, which could proliferate near-frontier capabilities without safeguards.
- A new policy requires 30-day data retention for all Mythos-class traffic on first- and third-party surfaces, used only for safety, with logged human access and deletion after 30 days in almost all cases.
- Anthropic plans trusted access programs that let cybersecurity organizations apply for Mythos 5, and let a small number of life science researchers access Fable 5 with biology and chemistry safeguards removed.
- Both models cost 10 dollars per million input tokens and 50 dollars per million output tokens, less than half the price of Mythos Preview. Developers can use claude-fable-5 via the Claude API.
- Fable 5 is free on Pro, Max, Team, and seat-based Enterprise plans through June 22. On June 23 it moves to usage credits on those plans until capacity allows it to return as a standard inclusion.
Detailed Summary

A Mythos-class model, made safe for general use

Fable 5 is the first Mythos-class model Anthropic has made generally available. Mythos-class is a tier that sits above the Opus class, and the first of its kind, Claude Mythos Preview, was released in April through Project Glasswing to a limited group of cyber defenders and critical software infrastructure providers. The company framed today’s launch as the moment it could finally bring that level of capability to all users, because its safeguards had matured enough to allow it. Fable 5’s capabilities exceed those of any model Anthropic has made generally available, and its advantage over other models grows as tasks get longer and more complex.

Two models, one brain

Claude Mythos 5 is the same underlying model as Fable 5, but with safeguards lifted in some areas. The names are the only real difference: Fable, from the Latin fabula meaning that which is told, is akin to the Greek mythos, and the safeguards are what distinguish the two. Mythos 5 launches first to existing Mythos Preview users, including the Project Glasswing cybersecurity partners, as an upgrade. It is deployed in collaboration with the US Government and is described as having the strongest cybersecurity capabilities of any model in the world. Anthropic plans to steadily expand access through a more systematic trusted access program.

Software engineering and token efficiency

Fable 5 can work autonomously for longer than any previous Claude model, and software engineering is where that shows most clearly. During early testing, Stripe reported it compressed months of engineering into days, performing a codebase-wide migration in a 50-million-line Ruby codebase in a single day that would otherwise have taken a whole team over two months by hand. It is also more token-efficient than past models, scoring highest among frontier models on Cognition’s FrontierCode evaluation for high-quality, maintainable agentic coding, even at medium effort.

Knowledge work, vision, and memory

On complex analytical work, Fable 5 posted the highest score of any model on Hebbia’s Finance Benchmark for senior-level reasoning, with substantial gains in document-based reasoning and chart and table interpretation, and IMC said it aced their trading-analysis evaluations nearly across the board. In vision, it is the new state-of-the-art, able to extract precise numbers from detailed scientific figures and rebuild a web app’s source code from screenshots alone. It needs less scaffolding too: where earlier Claude models struggled to play Pokemon even with helper harnesses, Fable 5 beat FireRed with a minimal, vision-only harness using nothing but raw game screenshots. On memory, giving Fable persistent file-based notes improved its Slay the Spire performance three times more than it did for Opus 4.8, and it built a physics-first-principles solar system simulation accurate enough to predict solar eclipses.

Life sciences: drug design, hypotheses, and genomics

Using Mythos 5, Anthropic’s internal protein design experts accelerated aspects of the drug-design process by around ten times. With protein design and bioinformatics tools but no human assistance, the model matched or beat skilled human operators, executing the full workflow of choosing binding sites, selecting and running design tools, and recovering from failures. Nine of 14 protein targets yielded strong drug-design candidates now under investigation. Mythos 5 is also Anthropic’s first model to consistently produce novel, compelling scientific hypotheses: scientists preferred its molecular biology hypotheses about 80% of the time in blinded comparisons, and one, a novel mechanism for an E. coli protein, was corroborated by an independent lab. In genomics, Mythos 5 ran over a week of largely autonomous research, assembling single-cell data for millions of cells across 138 species and training a custom model that outperformed a recent Science paper despite being 100 times smaller.

The new safeguards: classifiers and fallback

Mythos-class capability is potent enough that Anthropic considers it a substantial misuse risk, especially given how much advanced AI usage is dual use. Fable 5 ships with a new set of classifiers, separate AI systems that detect potential misuse and jailbreak attempts and stop the main model from responding. When a classifier flags a request related to cybersecurity, biology and chemistry, or distillation, the response is handled by Claude Opus 4.8 instead, and the user is told. The cybersecurity classifiers cover both exploitation and broader offensive cyber tasks like reconnaissance and lateral movement, and Anthropic says they prevent Fable from making any progress on those tasks. The biology and chemistry classifier is intentionally broad for now, after tests showed Mythos-class models could outperform dedicated protein language models at predicting AAV viral shell assembly using biological reasoning alone. The distillation classifier targets large-scale attempts to extract Claude’s capabilities to train competing models.

Jailbreak resistance, data retention, and availability

Anthropic ran extensive red-teaming, including an external bug bounty that produced no universal jailbreaks in over 1,000 hours, though it notes the UK AISI made progress toward one in a brief window. The company concedes it is likely impossible to fully prevent universal jailbreaks and aims instead to make any that remain slow and costly enough to catch before they scale. A new policy requires 30-day data retention for all Mythos-class traffic, used only for safety, with logged human access and deletion after 30 days in almost all cases. On availability, Fable 5 is live everywhere today and fully available on the API and consumption-based Enterprise plans, while subscription access rolls out in stages: free on Pro, Max, Team, and seat-based Enterprise through June 22, then on usage credits from June 23 until capacity allows it to return as a standard inclusion. Both models cost 10 dollars per million input tokens and 50 dollars per million output tokens.

Notable Quotes

“Today we’re launching Claude Fable 5: a Mythos-class model that we’ve made safe for general use.”
Anthropic, opening the Claude Fable 5 and Claude Mythos 5 announcement

“Fable 5’s capabilities exceed those of any model we’ve ever made generally available.”
Anthropic, on where Fable 5 sits in the lineup

“It has the strongest cybersecurity capabilities of any model in the world.”
Anthropic, describing Claude Mythos 5

“During early testing, Stripe reported that Fable 5 compressed months of engineering into days.”
Anthropic, on Fable 5’s software engineering results

“Our early data shows that more than 95% of Fable sessions involve no fallback at all.”
Anthropic, on how often the safeguards route to Opus 4.8

“Mythos 5 is our first model to consistently produce novel, compelling scientific hypotheses.”
Anthropic, on the model’s molecular biology research

“It is likely impossible to completely prevent universal jailbreaks, but our goal is to make any remaining jailbreaks sufficiently slow and costly that we can detect and prevent them before they are used at scale.”
Anthropic, on the limits of its safeguards

“Fable is from the Latin fabula, ‘that which is told,’ akin to the Greek mythos. The safeguards are what distinguish the two models.”
Anthropic, explaining the Fable and Mythos naming

Read the full announcement and the benchmark tables on Anthropic’s site here: Claude Fable 5 and Claude Mythos 5.

Related Reading
- Project Glasswing — background on the cyberdefense program that Mythos 5 ships through with the US Government.
- Introducing Claude Opus 4.8 — the model that flagged Fable 5 queries fall back to instead of being refused.
- Claude Mythos Preview — the first Mythos-class model, released in April, that Mythos 5 now upgrades.
- Anthropic model system cards — where the full safety, alignment, and capability testing for models like Fable 5 is documented.
June 9, 2026
Claude Opus 4.8 Released: Anthropic Bets on Honesty, Dynamic Workflows, Effort Control, and Cheaper Fast Mode
Anthropic has released Claude Opus 4.8, the newest member of its flagship Opus class, available today across every surface and priced exactly like the model it replaces. The company calls it “a modest but tangible improvement” on Opus 4.7, but the framing undersells what is actually interesting here: the headline upgrade is not a benchmark number, it is honesty. Opus 4.8 is built to know when it does not know, and that single behavioral shift may matter more for real agent work than any raw capability bump.

TLDR

Claude Opus 4.8 is an across-the-board upgrade to Anthropic’s Opus class that ships today at the same regular price as Opus 4.7 ($5 per million input tokens, $25 per million output tokens), with the model positioned as “a more effective collaborator.” The marquee improvement is honesty: Opus 4.8 is roughly four times less likely than its predecessor to let flaws in its own code pass unremarked, and it is more willing to flag uncertainty rather than confidently claim progress on thin evidence. A pre-release alignment assessment found new highs on prosocial traits like supporting user autonomy and acting in the user’s best interest, with misaligned behavior at rates similar to Anthropic’s best-aligned model, Claude Mythos Preview. Three things launch alongside the model: dynamic workflows in Claude Code (research preview), where Claude plans work then runs hundreds of parallel subagents that run even longer and verify their own outputs before reporting back; effort control in claude.ai and Cowork, a slider for how hard Claude thinks; and a Messages API update that accepts system entries inside the messages array so developers can update instructions mid-task without breaking the prompt cache. Fast mode now runs at 2.5x speed and is three times cheaper than before ($10 / $50 per million tokens). The roadmap points to cheaper Opus-equivalent models, a higher-intelligence class above Opus, and a wider rollout of Mythos-class models gated behind stronger cyber safeguards under Project Glasswing.

Thoughts

The most important sentence in this announcement is not about coding scores. It is the claim that Opus 4.8 is about four times less likely than Opus 4.7 to let flaws in its own code slip by without comment. For a chat assistant, overconfidence is annoying. For an agent, it is catastrophic. The whole premise of long-running autonomous work is that you hand the model a task and walk away, which means the model’s own judgment about whether it succeeded becomes the only judgment in the loop until you come back. A model that confidently declares victory on a half-finished migration does not save you time, it costs you a debugging session plus the time you spent trusting it. Honesty, framed this way, is not a soft virtue. It is the load-bearing reliability property that makes unattended agents usable at all.

Read the launch as a single coherent argument rather than a list of features, and the pieces lock together. Dynamic workflows let Claude plan a job and fan out hundreds of parallel subagents that, with Opus 4.8, run longer than before. Effort control lets you dial up how much the model thinks. The honesty improvement means the model checks its own work and flags what it is unsure about instead of papering over it. Put those three together and you get one product thesis: let it run longer, let it think harder, and trust it to tell you when something is wrong. The codebase-scale migration example, hundreds of thousands of lines from kickoff to merge with the existing test suite as the bar, is the proof point. None of those three capabilities is worth much alone. A model that runs for hours but lies about its results is a liability. A model that flags uncertainty but cannot sustain a long task never reaches the moment where its honesty matters. Anthropic shipped all three at once because they only pay off together.

The economics deserve a closer look than the “same price” headline invites. Regular pricing is flat versus Opus 4.7, which is the polite way of saying you get a better model for free. The real move is fast mode: 2.5x the speed at three times cheaper than it cost on previous models, landing at $10 per million input and $50 per million output. That is Anthropic quietly attacking the latency-versus-cost tradeoff that has shaped how teams deploy frontier models. Until now, “fast” meant “expensive,” so you reserved it for interactive moments and ate the wait everywhere else. Collapsing that premium changes the default. And note the subtle token story underneath: Opus 4.8 at its default high effort spends roughly the same tokens on coding as Opus 4.7’s default while performing better, so the effort slider is not a way to bleed you dry, it is an honest exposure of the quality-cost dial that was always there implicitly.

The Messages API change is the kind of unglamorous plumbing that practitioners will appreciate immediately. Letting system entries live inside the messages array means you can update an agent’s instructions, permissions, token budget, or environment context partway through a task without smuggling the update through a fake user turn and without blowing up your prompt cache. Anyone who has built a long-running agent has hit this wall: the world changes mid-task, the agent needs new constraints, and the only clean way to inject them previously was a cache-busting hack. This is Anthropic treating agents as first-class, stateful, long-lived processes rather than oversized chat sessions. It is a small spec change with outsized implications for how you architect an agent that runs for an hour.

Then there is the roadmap, where the most telling line is the quietest. Anthropic says a small number of organizations are already using Claude Mythos Preview for cybersecurity work under Project Glasswing, and that models of this capability level require stronger cyber safeguards before general release. Notice that they are pinning Opus 4.8’s alignment numbers to Mythos as the benchmark for “best-aligned,” while simultaneously holding Mythos back from general availability on safety grounds. That is a deliberate signal: the next class of model is good enough that they are gating it on cyber-offense risk, not on capability. For a site about the pursuit of joy, fulfillment, and purpose through AI, this is the part worth sitting with. The frontier is increasingly defined not by what the models can do, but by what their builders decide it is responsible to ship. Honesty in the small (flagging a bad line of code) and restraint in the large (holding back a cyber-capable model) are the same instinct expressed at two different scales.

Key Takeaways
- Claude Opus 4.8 is now available everywhere, replacing Opus 4.7 as Anthropic’s flagship Opus-class model and positioned as “a more effective collaborator.”
- Regular usage pricing is unchanged from Opus 4.7, holding at $5 per million input tokens and $25 per million output tokens, so the capability gains come at no added cost.
- The single most emphasized improvement is honesty, which Anthropic treats as a core trained behavior rather than a marketing flourish.
- Evaluations show Opus 4.8 is around four times less likely than its predecessor to let flaws in its own code pass unremarked, a direct reliability win for autonomous coding.
- Early testers report the model is more likely to flag uncertainty about its work and less likely to make unsupported claims or jump to conclusions on thin evidence.
- A detailed alignment assessment was run before release and concluded Opus 4.8 reaches new highs on prosocial traits like supporting user autonomy and acting in the user’s best interest.
- Misaligned behavior such as deception or cooperation with misuse is at rates substantially lower than Opus 4.7 and similar to Anthropic’s best-aligned model, Claude Mythos Preview.
- The full alignment assessment and pre-deployment safety tests are documented in the public Claude Opus 4.8 System Card.
- Dynamic workflows launch as a research preview inside Claude Code, letting Claude plan the work and then run hundreds of parallel subagents in a single session.
- With Opus 4.8, those subagents can run even longer, and Claude verifies its outputs before reporting back rather than declaring success blindly.
- Anthropic’s flagship example for dynamic workflows is a codebase-scale migration across hundreds of thousands of lines of code, from kickoff to merge, using the existing test suite as the success bar.
- Dynamic workflows are available in Claude Code for the Enterprise, Team, and Max plans.
- Effort control arrives in claude.ai and Cowork as a setting next to the model selector that lets users choose how much effort Claude puts into a response.
- Higher effort makes Claude think more frequently and deeply for better answers; lower effort responds faster and consumes rate limits more slowly. Effort control is available on all plans.
- Opus 4.8 defaults to “high” effort, judged the best overall balance of quality and user experience.
- On coding tasks, the default effort spends a similar number of tokens as Opus 4.7’s default but delivers better performance, so quality rises without a token penalty.
- Users can select “extra” (called “xhigh” in Claude Code) or “max” to spend more tokens for stronger results, and Anthropic recommends “extra” for difficult tasks and long-running asynchronous workflows.
- Rate limits in Claude Code were increased to accommodate the higher token usage of the higher effort levels.
- The Messages API now accepts system entries inside the messages array, a meaningful change for agent developers.
- That update lets developers change Claude’s instructions mid-task, adjusting permissions, token budgets, or environment context, without breaking the prompt cache or routing through a user turn.
- Fast mode now runs at 2.5x speed and is three times cheaper than it was for previous models, priced at $10 per million input tokens and $50 per million output tokens.
- Developers access the model as claude-opus-4-8 through the Claude API.
- Partner Miguel Gonzalez reports Opus 4.8 scored 84% on Online-Mind2Web, a meaningful jump over both Opus 4.7 and GPT-5.5, calling it the strongest computer-use and browser-agent model his team has tested.
- Databricks reports that, inside Genie, Opus 4.8 reasons over unstructured content like PDFs and diagrams at 61% cheaper token cost than Opus 4.7.
- Thomson Reuters reports Opus 4.8 is the first model to break 10% overall on the all-pass standard of its Legal Agent Benchmark, the highest score recorded there.
- Eleven partners weighed in, including Cursor, Cognition’s Devin, Databricks Genie, Thomson Reuters CoCounsel, and Hebbia, spanning coding, legal, finance, and enterprise data work.
- Anthropic is working on models that deliver many of the same capabilities as Opus at a lower cost.
- The company plans to release a new class of model with even higher intelligence than Opus.
- Under Project Glasswing, a small number of organizations are already using Claude Mythos Preview for cybersecurity work, with Mythos-class models expected to reach all customers in the coming weeks once stronger cyber safeguards are in place.
Detailed Summary

What Claude Opus 4.8 Is

Claude Opus 4.8 is an upgrade to Anthropic’s Opus class of models, building on Opus 4.7 with improvements across benchmarks covering coding, agentic skills, reasoning, and practical knowledge-work tasks. Anthropic describes the result as “a more effective collaborator” while characterizing the release overall as “a modest but tangible improvement on its predecessor.” The model is available today, everywhere, and developers call it as claude-opus-4-8 via the Claude API. The announcement includes a comparison table against the predecessor and other models, though the per-cell numbers in that table are published as an image and are not reproduced here as text.

Honesty: The Headline Improvement

Anthropic singles out honesty as one of the most prominent improvements in Opus 4.8. All of the company’s models are trained to be honest, which includes avoiding claims they cannot support. A persistent problem with AI models generally is that they sometimes jump to conclusions, confidently claiming progress despite thin evidence. Early testers report that Opus 4.8 is more likely to flag uncertainties about its own work and less likely to make unsupported claims. The most concrete measure: evaluations show Opus 4.8 is around four times less likely than its predecessor to allow flaws in code it has written to pass unremarked. For agentic and unattended use, this self-skepticism is the difference between a model that reliably tells you when something went wrong and one that quietly ships a broken result.

Alignment Assessment

A detailed alignment assessment was run before release. On the positive side, the Alignment team concluded that Opus 4.8 “reaches new highs on our measures of prosocial traits like supporting user autonomy and acting in the user’s best interest.” On the risk side, misaligned behavior such as deception or cooperation with misuse occurs at rates substantially lower than Opus 4.7, and similar to Anthropic’s best-aligned model, Claude Mythos Preview. The full alignment assessment and the pre-deployment safety tests are published in the Claude Opus 4.8 System Card, which also contains the complete benchmark table and wider evaluations.

Dynamic Workflows in Claude Code

Launching today as a research preview in Claude Code, dynamic workflows let Claude plan the work and then run hundreds of parallel subagents in a single session. With Opus 4.8, those agents can run even longer than before, and Claude verifies its outputs before reporting back rather than reporting unchecked results. The showcase example is a codebase-scale migration: Claude Code with Opus 4.8 can carry out migrations across hundreds of thousands of lines of code, all the way from kickoff to merge, using the existing test suite as its bar for success. Dynamic workflows are available in Claude Code for the Enterprise, Team, and Max plans.

Effort Control

Effort control arrives in claude.ai and Cowork as a setting alongside the model selector that lets users choose how much effort Claude puts into a response. Higher effort means Claude thinks more frequently and deeply for better responses; lower effort means it responds faster and uses rate limits more slowly. Opus 4.8 defaults to “high” effort, which Anthropic judged the best overall balance of quality and user experience. On coding tasks, that default spends a similar number of tokens as Opus 4.7’s default while performing better. Users who want more can choose “extra” (called “xhigh” in Claude Code) or “max” to spend more tokens for stronger results, and Anthropic recommends “extra” for difficult tasks and long-running asynchronous workflows. To support the heavier token usage at higher effort levels, rate limits in Claude Code were increased. Effort control is available on all plans.

Messages API Update

The Messages API now accepts system entries inside the messages array. This lets developers update Claude’s instructions mid-task without breaking the prompt cache and without routing the update through a user turn. In practice that means you can update permissions, token budgets, or environment context while an agent is running, which is exactly the kind of statefulness a long-running autonomous process needs. It is a small specification change with significant consequences for how developers build durable agents.

Pricing and Fast Mode

Regular usage pricing is unchanged from Opus 4.7: $5 per million input tokens and $25 per million output tokens. The notable shift is in fast mode, where the model works at 2.5x the speed and fast mode is now three times cheaper than it was for previous models, landing at $10 per million input tokens and $50 per million output tokens. The combination of unchanged regular pricing and dramatically cheaper fast mode reshapes the latency-versus-cost calculus that has long governed how teams deploy frontier models.

Partner Results Across Coding, Legal, Finance, and Data

Eleven partners shared results spanning the spectrum of professional work. Miguel Gonzalez reports 84% on Online-Mind2Web, a meaningful jump over both Opus 4.7 and GPT-5.5, calling it the strongest computer-use and browser-agent model his team has tested. Databricks reports that Genie reasons over unstructured content like PDFs and diagrams at 61% cheaper token cost than Opus 4.7. Thomson Reuters reports Opus 4.8 is the first model to break 10% overall on the all-pass standard of its Legal Agent Benchmark. Cursor reports gains across every effort level on CursorBench with more efficient tool calling, and Cognition reports that Devin sees cleaner tool use, fixes to the comment-verbosity and tool-calling issues seen with Opus 4.7, and improvements over Opus 4.6. Hebbia reports strong quality with better citation precision and more token efficiency on retrieval for dense financial filings. The footnotes note that Terminal-Bench 2.1 was scored on the Terminus-2 public harness (GPT-5.5’s Codex CLI harness score is 83.4%), that OSWorld-Verified methodology changed with Opus 4.7’s score updated to 82.3%, and that on Finance Agent v2 Gemini 3.5 Flash scores 57.9%.

What Is Next: Cheaper Models, Higher Intelligence, and Mythos

Anthropic outlined a three-part roadmap. First, the company is working on models that provide many of the same capabilities as Opus at a lower cost. Second, it plans to release a new class of model with even higher intelligence than Opus. Third, as part of Project Glasswing, a small number of organizations are currently using Claude Mythos Preview for cybersecurity work; models of this capability level require stronger cyber safeguards before general release, and Anthropic expects to bring Mythos-class models to all customers in the coming weeks.

Notable Quotes

“Claude Opus 4.8 has noticeably better judgment. In Claude Code, it asks the right questions, catches its own mistakes, pushes back when a plan isn’t sound, and builds up confidence around complex, multi-service explorations before making big changes. It’s a great model to build with.”
Tom Pritchard, Staff Engineer, in Claude Code

“On our Super-Agent benchmark, Claude Opus 4.8 is the only model to complete every case end-to-end, beating prior Opus models and GPT-5.5 at parity on cost. For agent products in translation, deep research, slide-building, and analysis, it delivers powerful reliability.”
Kay Zhu, Co-Founder and CTO, on the Super-Agent benchmark

“On CursorBench, Claude Opus 4.8 exceeds prior Opus models across every effort level. Tool calling is meaningfully more efficient, using fewer steps for the same intelligence, and it carries end-to-end tasks through.”
Michael Truell, Co-Founder and CEO, on CursorBench results

“Claude Opus 4.8 delivers the highest score recorded on our Legal Agent Benchmark, and is the first model to break 10% overall on the all-pass standard. For substantive legal work, that’s the kind of accuracy lift that translates directly into how much real attorney work our customers can hand off with confidence.”
Niko Grupen, Head of Applied Research, on the Legal Agent Benchmark

“Claude Opus 4.8 feels like a major quality-of-life update over Opus 4.7: faster, easier to collaborate with, and better at carrying context and style direction across a long session. Opus 4.8 is the model I kept trusting for work where voice, taste, and technical execution all have to happen side-by-side.”
Katie Parrott, Staff Writer, on long writing sessions

“Claude Opus 4.8 is the strongest computer-use and browser-agent model we’ve tested, scoring 84% on Online-Mind2Web, which is a meaningful jump over both Opus 4.7 and GPT-5.5. It stays reflective and on-task in the way our customers’ agent workloads need to be reliable end-to-end.”
Miguel Gonzalez, Tech Lead, on computer-use and browser agents

“Claude Opus 4.8 uses tools cleanly and follows instructions with the consistency our autonomous engineering workloads need to keep running unattended. It improves on Opus 4.6 and fixes the comment-verbosity and tool-calling issues we saw with Opus 4.7. This release from Anthropic translates directly into faster capability gains for engineers building on Devin.”
Scott Wu, CEO, on building with Devin

“On our long-running evals, Claude Opus 4.8’s analysis was consistently higher quality than prior Opus models. It finished faster and produced richer, more information dense outputs. Overall, a noticeably better signal to noise ratio. The biggest differentiator was Opus 4.8’s tendency to proactively flag issues with the inputs and outputs of an analysis, something other models routinely missed and left to the users to catch.”
Michael Ran, Sr. Investment Associate, on long-running analysis evals

Claude Opus 4.8 is a quieter release than its “modest but tangible” billing suggests, because the gains land where autonomous work actually lives: a model that flags its own uncertainty, runs longer and checks itself, scales effort on demand, and stays affordable while fast mode gets cheaper. The honesty improvement alone changes the trust math for anyone deploying agents. Read Anthropic’s full announcement here.

Related Reading
- Claude Opus 4.8 System Card, the source for the full benchmark table, wider evaluations, and the complete alignment assessment.
- Claude API model overview, with the claude-opus-4-8 model ID and current pricing.
- Claude Code, where the new dynamic workflows feature ships.
- Introducing dynamic workflows in Claude Code, Anthropic’s deep dive on planning a job and running hundreds of parallel subagents in a single session.
- Anthropic’s Responsible Scaling Policy, the framework behind the Mythos cyber-safeguards.
- Agentic AI, background on the paradigm Opus 4.8 is optimized for.
May 28, 2026
Alex Wang on Leaving Scale to Run Meta Superintelligence Labs, MuseSpark, Personal Super Intelligence, and Building an Economy of Agents
Alex Wang, head of Meta Superintelligence Labs, sits down with Ashley Vance and Kylie Robinson on the Core Memory podcast for his first long-form interview since Meta’s quasi-acquisition of Scale AI roughly ten months ago. He walks through how MSL is structured, why Llama was off-trajectory, what made MuseSpark’s token efficiency surprise the team, how Meta thinks about a future “economy of agents in a data center,” and where he lands on safety, open source, robotics, brain computer interfaces, and even model welfare.

TLDW

Wang explains that Meta Superintelligence Labs is a fully rebuilt frontier effort organized around four principles (take superintelligence seriously, technical voices loudest, scientific rigor, big bets) and three velocity levers (high compute per researcher, extreme talent density, ambitious research bets). He confirms Llama was off the frontier when he arrived, so MSL rebuilt the pre-training, reinforcement learning, and data stacks from scratch. MuseSpark is described as the “appetizer” on the scaling ladder, notable for its strong token efficiency, with much larger and stronger models coming in the coming months. He pushes back on the mercenary narrative around recruiting, frames Meta’s edge as compute plus billions of consumers and hundreds of millions of small businesses, sketches a vision of personal super intelligence delivered through Ray-Ban Meta glasses and WhatsApp, and outlines why physical intelligence, robotics (the new Assured Robot Intelligence acquisition), health super intelligence with CZI, brain computer interfaces, and even model welfare are core to Meta’s roadmap. He dismisses reported infighting with Bosworth and Cox as gossip, declines to comment on the Manus situation, and says safety guardrails (bio, cyber, loss of control) are why MuseSpark cannot currently be open sourced, while smaller open variants are being prepared.

Key Takeaways
- Meta Superintelligence Labs (MSL) is the umbrella, with TBD Lab as the large-model research unit reporting directly to Alex Wang, PAR (Product and Applied Research) under Nat Friedman, FAIR for exploratory science, and Meta Compute under Daniel Gross handling long-term GPU and data center planning.
- Wang says Llama was not on a frontier trajectory when he arrived, so MSL had to do a “full renovation” of the pre-training stack, RL stack, data pipeline, and research science.
- The first cultural fix was getting the lab to “take superintelligence seriously” as a near-term, achievable goal, not an abstract bet. Big incumbents often lack that religious conviction.
- Four MSL principles: take superintelligence seriously, let technical voices be loudest, demand scientific rigor on basics, and make big bets.
- Three velocity levers Wang identified for catching and overtaking the frontier: high compute per researcher, very high talent density in a small team, and willingness to fund ambitious research bets.
- Wang rejects the mercenary recruiting narrative. He says most hires had strong financial prospects at their prior labs already and joined for compute access, talent density, and the chance to build from scratch.
- On the famous soup story, Wang neither confirms nor denies Zuck personally made the soup, but says recruiting was highly individualized and signaled how seriously Meta cared about each researcher’s agenda.
- Yann LeCun publicly called Wang young and inexperienced. Wang says they reconciled in person at a conference in India where LeCun congratulated him on MuseSpark.
- Sam Altman, asked by Vance for comment, “did not have flattering things to say” about Wang. Wang hopes industry animosities subside as systems approach superintelligence.
- Wang’s management philosophy borrows the Steve Jobs line: hire brilliant people so they tell you what to do, not the other way around.
- MuseSpark is framed as an “appetizer” data point on the MSL scaling ladder, not a flagship.
- The MuseSpark program is built around predictable scaling on multiple axes: pre-training, reinforcement learning, test-time compute, and multi-agent collaboration (the 16-agent content planning mode).
- MuseSpark outperformed internal expectations and showed emergent capabilities in agentic visual coding, including generating websites and games from prompts, helped by combined agentic and multimodal strength.
- MuseSpark’s biggest external signal is token efficiency. On benchmarks like Artificial Analysis it hits similar results with far fewer tokens than competitor models, which Wang attributes to a clean stack rebuilt by experts rather than inefficiencies patched by longer thinking.
- Larger MSL models are arriving in the coming months and Wang expects them to be state of the art in the areas MSL is focused on.
- The Meta strategic edge: massive compute, billions of consumers across the family of apps, and hundreds of millions of small businesses already on Facebook, Instagram, and WhatsApp.
- Wang’s headline framing: Dario Amodei talks about a “country of geniuses in a data center.” Meta is targeting an “economy of agents in a data center,” with consumer agents and business agents transacting and collaborating.
- Consumer AI sentiment is in the toilet because, unlike developers who have had a Claude Code moment, ordinary people have not yet experienced AI as a genuine personal agency unlock.
- Wang acknowledges the product overhang. Meta held back from deep AI integration across its apps until the models were good enough, and is now entering the integration phase.
- Ray-Ban Meta glasses are the canonical example of personal super intelligence hardware, with the model seeing what the user sees, hearing what they hear, capturing context, and surfacing proactive insights.
- Wang admits even AI-native users like Kylie Robinson, who lives in WhatsApp, have not naturally used Meta AI yet. He bets that better models plus deeper integration close that gap.
- On the competitive landscape: a year ago everyone assumed ChatGPT had already won consumer. Claude Code has since become the fastest growing business in history, and Gemini has taken consumer market share. Wang’s read: AI is far from endgame and each new capability tier unlocks a new dominant form factor.
- On open source: MuseSpark triggered guardrails in Meta’s Advanced AI Scaling Framework around bio, chem, cyber, and loss-of-control risks, so it is not currently safe to open source. Smaller, derived open variants are actively in development.
- Meta remains committed to open sourcing models when safety allows, drawing a line through the Open Compute Project legacy and Sun Microsystems open-software heritage.
- Wang dismisses reporting about a Wang-Zuck versus Bosworth-Cox split as “the line between gossip and reporting is remarkably thin.” He says leadership is aligned on needing best-in-class models and product integration.
- On the Manus situation, Wang says it is too complicated to discuss publicly and that the deal status implies “machinations are still at play.”
- On China, Wang separates the people from the state. He still wants to work with talented Chinese-born researchers regardless of his views on the Chinese Communist Party and PLA, which he sees as taking AI extremely seriously for national security.
- The full-page New York Times AI war ad Wang ran while at Scale was meant to push the US government to treat AI as a step change for national security. He thinks events since then, including DeepSeek and other shocks, have proved that plea correct.
- On Anthropic’s doom posture, Wang largely agrees with the core message that models are already very powerful and getting more so, while declining to endorse every specific claim.
- Meta has acquired Assured Robot Intelligence (ARRI), an AI software company building models for hardware platforms, not a hardware maker itself.
- Wang frames physical super intelligence as the natural sequel to digital super intelligence. Robotics, world models, and physical intelligence all benefit from the same scaling that drives language models.
- On health, MSL is building a “health super intelligence” effort and will collaborate closely with CZI. Wang sees equal global access to powerful health AI as a uniquely Meta-shaped delivery problem.
- Wang admires John Carmack but says nobody really knows what Carmack is currently working on. No band reunion announced.
- The mango model is “alive and kicking” despite rumors. Wang notes MSL gets a small fraction of the rumor-mill attention other labs get and feels sympathy for them.
- On model welfare, Wang says it is a serious topic that “nobody is talking about enough” given how integrated models have become as work partners. He references research, including from Eleos, that measures subjective experience of models.
- Wang’s critical-path technology list: super intelligence, robotics, brain computer interfaces. The infinite-scale primitives behind them are energy, compute, and robots.
- FAIR’s brain research program Tribe hit a milestone called Tribe B2: a foundation model that can predict how an unknown person’s brain would respond to images, video, and audio with reasonable zero-shot generalization.
- Wang’s main philosophical break with Elon Musk: research itself is the primary activity. Building super intelligence is a research expedition through fog of war, and sequencing of bets really matters.
- Personal notes: Wang moved from San Francisco to the South Bay, treats Palo Alto as his city now, was a math olympiad competitor, says his favorite activities are reading sci-fi and walking in the woods, and bonds with Vance over country music.
Detailed Summary

How MSL Is Actually Organized

Meta Superintelligence Labs sits as the umbrella organization that Wang oversees. Inside it, TBD Lab is the large-model research group where the most discussed researchers and infrastructure engineers sit, and they technically report to Wang. PAR, Product and Applied Research, is led by Nat Friedman and owns deployment and product surfaces. FAIR continues to run exploratory science, including work on brain prediction models and a universal model for atoms used in computational chemistry. Sitting alongside MSL is Meta Compute, run by Daniel Gross, which owns the long-horizon GPU and data center plan that everything else relies on. Chief scientist Shengjia Zhao orchestrates the scientific agenda across the whole lab.

Why Wang Left Scale

Wang says progress in frontier AI has been faster than even insiders expected. Two structural beliefs pushed him toward Meta. First, the labs that actually train the frontier models are accruing disproportionate economic and product rights in the AI ecosystem. Second, compute is the dominant scarce input of the next phase, so the right mental model is to treat tech companies with compute as fundamentally different animals from companies without it. Meta has both, Zuck is “AGI pilled,” and the personal super intelligence memo Zuck published roughly a year ago became the shared north star.

The Diagnosis: Llama Was Off-Trajectory

When Wang arrived, the existing AI org needed a reset because Llama was not on the same trajectory as the frontier. The plan he laid out has four cultural principles. Take superintelligence seriously as a real near-term target. Make technical voices the loudest in the room. Demand scientific rigor and focus on basics. Make big bets. On top of that, three structural levers were used to set velocity. Push compute per researcher much higher than at larger labs where compute is diluted across too many efforts. Keep the team small and extremely cracked. Allocate a meaningful share of resources to ambitious, paradigm-shifting research bets rather than incremental refinement.

Recruiting, Soup, and the Mercenary Narrative

Wang argues the reporting on MSL hiring overstated the money story. Most of the people MSL recruited had strong financial paths at their previous employers, so individualized recruiting was more about computing access, talent density, and the ability to make big research bets. The recruitment blitz happened fast because Wang knew the team needed to exist “yesterday.” Asked about Mark Chen’s claim that Zuck made soup to recruit people, Wang refuses to confirm or deny who made it but agrees the process was intense and personal. Visitors from other labs reportedly tell Wang the MSL culture feels like early OpenAI or early Anthropic, which lands as the strongest endorsement he could ask for.

Receiving the Public Hits: Young, Inexperienced, Mercenary

LeCun called Wang young and inexperienced shortly after departing. The two reconnected in India a few weeks later and LeCun congratulated Wang on MuseSpark. Wang says the age critique has followed him since his earliest Silicon Valley days, so he barely registers it. Altman, asked off-camera by Vance about Wang’s appearance on the show, had nothing flattering to add. Wang’s response is to bet that as the field gets closer to actual super intelligence, the personal animosities will subside. Whether they will is, as Vance puts it, an open question.

MuseSpark as Appetizer, Not Entree

Wang is careful not to oversell MuseSpark. He calls it “the appetizer” and says it is an early data point on a deliberately constructed scaling ladder. MSL spent nine months rebuilding the pre-training stack, the reinforcement learning stack, the data pipeline, and the science before generating MuseSpark. The point of releasing it was to show that the new program scales predictably along multiple axes (pre-training, RL, test-time compute, and the recently demonstrated multi-agent scaling visible in MuseSpark’s 16-agent content planning mode). Wang says the upcoming larger models are what MSL is genuinely excited about and frames the next two rungs as much more interesting than the current release.

Token Efficiency Was the Surprise

MuseSpark’s strongest competitive signal is how few tokens it needs to match competitors on tasks like Artificial Analysis. Wang attributes this to having had the rare luxury of building a clean pre-training and RL stack from scratch with the right experts. He speculates that some competitor models compensate for upstream inefficiency by allowing the model to think longer, which inflates token usage without improving the underlying capability. If that read is right, MSL’s efficiency advantage should grow as models scale up.

Glasses, WhatsApp, and the Constellation of Devices

Personal super intelligence shows up at Meta as a constellation of devices that capture context across the user’s day. Ray-Ban Meta glasses are the headline product, with the AI seeing what you see and hearing what you hear, then offering proactive insight or doing background research. Wang acknowledges that even AI-fluent users like Kylie Robinson, who runs her business inside WhatsApp, have not naturally used Meta’s AI buttons in the family of apps. His answer is that Meta deliberately waited for models to be good enough before tightening cross-app integration, and that integration phase is starting now.

Country of Geniuses Versus Economy of Agents

Wang’s framing of Meta’s strategic position is the most memorable line in the interview. Where Dario Amodei talks about a country of geniuses in a data center, Wang wants to build an economy of agents in a data center. Meta uniquely sits on both sides of consumer and small-business surface area, with billions of consumers and hundreds of millions of small businesses already on the platforms. If MSL can build great agents for both, then connect them so they transact and coordinate, the platform becomes a substrate for an entirely new kind of digital economy.

Consumer Sentiment, Product Overhang, and the Trust Tax

Wang concedes consumer AI sentiment is poor and that everyday users have not yet had a personal Claude Code moment. He believes the only durable answer is to ship products that genuinely transform individual agency for non-developers and small business owners. Robinson notes that for the small-town restaurant whose website has not been updated since 2002, a working agent on the business side could be transformational. Vance pushes that Meta carries a bigger trust tax than any other lab, so the bar for shipping AI products that the public will accept is correspondingly higher. Wang accepts the framing and says the answer is to keep building thoughtfully.

Why MuseSpark Cannot Be Open Sourced Yet

Meta’s Advanced AI Scaling Framework set explicit guardrails around bio, chem, cyber, and loss-of-control risks. MuseSpark in its current form tripped some of those internal evaluations, documented in the preparedness report Meta published alongside the model. So MuseSpark itself is not safe to open source. MSL is, however, developing smaller versions and derived models intended for open release, with active reviews happening the day of the interview. Wang reaffirms the commitment to open source where safety allows and draws a line back to the Open Compute Project and the Sun Microsystems-era ethos of openness in infrastructure.

The Bosworth, Cox, and Manus Questions

The reporting that Wang and Zuck push toward best-in-the-world research while Bosworth and Cox push toward cheap product deployment is dismissed as gossip dressed up as journalism. Wang says leadership debates points hard but is aligned on needing top models, integrating them into Meta’s surfaces, and serving the existing business. On Manus, the Chinese AI startup that figured in Meta’s late-stage strategy, Wang says he cannot comment, which itself signals that the situation is unresolved.

China, National Security, and the Newspaper Ad

Wang draws a sharp distinction between the Chinese state and Chinese-born researchers. His parents are from China, he is happy to work with talented researchers regardless of origin, and he sees a flattening of nuance on this question inside Silicon Valley. At the same time, he stands by the New York Times AI and war ad he ran while at Scale, framing it as an early plea for the US government to take AI seriously as a national security technology. He thinks subsequent events, including DeepSeek and other shocks, validated that call and that policymakers now do treat AI accordingly.

Robotics and Physical Super Intelligence

Meta has acquired Assured Robot Intelligence, an AI software company that builds models for multiple hardware targets rather than its own robot. Wang argues that if you take digital super intelligence seriously, physical super intelligence quickly becomes the next logical milestone. Scaling laws for robotic intelligence look similar enough to language model scaling that having the largest compute footprint in the industry would be wasted if it were not also turned toward world modeling and embodied learning. He grants the metaverse-skeptic critique exists but says retreating from ambition is the wrong response to past misfires.

Health Super Intelligence and CZI

Wang names health super intelligence as one of MSL’s anchor initiatives. Because billions of people already use Meta products daily, Wang believes Meta is structurally positioned to put powerful health AI in the hands of equal global access in a way nobody else can. The work will involve close collaboration with the Chan Zuckerberg Initiative, which has its own multi-billion-dollar biotech and science investment program.

Model Welfare, Sci-Fi, and Brain Models

Two of the most distinctive moments come at the end. Wang flags model welfare as a topic he thinks is being undercovered relative to how integrated models now are in daily work. He is open to the idea that models may have measurable subjective experience worth weighing, and points to research efforts (including Eleos) trying to quantify it. He also reveals that FAIR’s Tribe program, with its Tribe B2 milestone, has produced foundation models capable of predicting how an unknown person’s brain would respond to images, video, and audio with reasonable zero-shot generalization, a building block toward future brain computer interfaces. Wang lists brain computer interfaces alongside super intelligence and robotics as the critical-path technologies for humanity, with energy, compute, and robots as the infinitely scaling primitives behind them.

Where Wang Diverges From Elon

Asked whether Musk is more all-in on robotics, energy, and BCI than anyone, Wang concedes the point but argues the details matter and sequencing matters more. Wang’s core philosophical break is that building super intelligence is fundamentally a research activity, not a scaling-only sprint. The lab is operating in fog of war, and ambitious experiments are the only way to map it. That conviction is what makes MSL a research-led organization rather than a brute-force compute farm.

Thoughts

The most strategically interesting move in this entire interview is the “economy of agents in a data center” framing. It is a deliberate reframe against Anthropic’s “country of geniuses” line, and it does real work. A country of geniuses is a labor-substitution story aimed at knowledge workers and code. An economy of agents is a marketplace story that maps directly onto Meta’s two-sided distribution advantage: billions of consumers on one side, hundreds of millions of small businesses on the other. That positioning makes the agentic future Meta-shaped in a way no other frontier lab can claim, because no other frontier lab also owns the demand and supply graph of the global small-business economy. If Wang’s team can actually ship reliable agents on both sides plus the rails for them to transact, Meta’s structural moat in agentic commerce could exceed anything Llama ever had as an open model.

The token efficiency claim is the strongest piece of technical evidence in the interview for the “clean stack” thesis. If MuseSpark really is matching competitors with materially fewer tokens, the implication is not that MuseSpark is the best model today, but that MSL has rebuilt the foundations with less accumulated tech debt than competitors that have layered fixes on top of older stacks. That is exactly the kind of advantage that compounds with scale. The next two model releases are the actual test. If Wang is right about predictable scaling on pre-training, RL, test-time, and multi-agent axes simultaneously, the gap from MuseSpark to the next rung should be visible in a way that forces re-rating of Meta’s position.

The open-source posture is the cleanest signal of how the safety conversation has actually changed in 2026. Meta, the lab most identified with open weights, is saying out loud that its current frontier model triggered enough internal guardrails that releasing the weights is off the table. Wang threads the needle by promising smaller open variants, but the underlying point is unmistakable: the open-weights bargain has limits, and those limits will be set by internal preparedness frameworks rather than community pressure. That is a real shift from the Llama 2 era and worth tracking as the next generation lands.

Wang’s willingness to engage on model welfare, on roughly the same footing as safety and alignment, is the second philosophical reveal worth flagging. It signals that the next generation of lab leadership is not going to dismiss the topic the way the previous generation often did. Whether that translates into product or policy changes is unclear, but the fact that the head of MSL says it is “underdiscussed” is itself a marker.

Finally, the human texture of the interview matters. Wang has clearly absorbed a lot of personal incoming fire over the past ten months, including from LeCun and Altman, and his answer is consistently to redirect to the work. The Steve Jobs quote about hiring people who tell you what to do is the operating slogan he keeps coming back to. Combined with the genuine enthusiasm for sci-fi, walks in the woods, and country music, the picture that emerges is less the salesman caricature his critics paint and more a young technical operator betting that scoreboard work over a multi-year horizon will settle every argument that text on X cannot.

Watch the full conversation here.
May 13, 2026