
Alibaba's Qwen 3.5 matches frontier models at a fraction of the cost

VERIFIED · Confidence: 80%

What Happened

On February 16, 2026, Alibaba released the Qwen 3.5 flagship model (397B-A17B), followed on February 24 by the full medium model series: Qwen3.5-35B-A3B, Qwen3.5-122B-A10B, Qwen3.5-27B, and Qwen3.5-Flash. All models support text, image, and video input, cover 201 languages (up from 82 in the prior generation), and are released under the Apache 2.0 license — meaning any developer or company can download and deploy them without licensing fees.

The headline claim is frontier-class performance at a fraction of the price. Qwen3.5-Flash, the hosted API version of the 35B-A3B model, is priced at $0.10 per million input tokens and $0.40 per million output tokens. Claude Sonnet 4.5 costs $3.00 per million input tokens, 30 times more. On the MMLU-Pro benchmark, which tests broad knowledge reasoning, the 122B-A10B model scores 86.7 against Claude Sonnet 4.5's 80.8. On SWE-bench Verified, which measures real-world software engineering performance, the 27B model scores 72.4 compared to Claude Sonnet 4.5's 62.0. On BFCL V4, which evaluates AI agent tool use, the 122B-A10B scores 72.2, roughly 30 percent higher than GPT-5 mini's 55.5 and Claude Sonnet 4.5's 54.8.
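The arithmetic behind the 30x figure is easy to verify. The sketch below uses the per-million-token rates quoted above; the monthly token volume is an illustrative assumption, not a figure from this story.

```python
# Back-of-envelope check of the input-price gap between the two APIs.
# Rates ($ per million input tokens) are the ones quoted in the article.

QWEN_FLASH_IN = 0.10   # Qwen3.5-Flash
SONNET_IN = 3.00       # Claude Sonnet 4.5

def input_cost(tokens: int, rate_per_million: float) -> float:
    """Dollar cost of sending `tokens` input tokens at the given rate."""
    return tokens / 1_000_000 * rate_per_million

# Illustrative workload: 50M input tokens per month (an assumption).
monthly_tokens = 50_000_000
print(input_cost(monthly_tokens, QWEN_FLASH_IN))  # 5.0
print(input_cost(monthly_tokens, SONNET_IN))      # 150.0
print(round(SONNET_IN / QWEN_FLASH_IN))           # 30
```

The same ratio holds at any volume, since both APIs bill linearly per token; output-token rates for Sonnet are not quoted in this story, so the comparison covers input pricing only.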

One caveat applies to all benchmark figures in this story: these numbers are primarily vendor-reported by Alibaba. Independent third-party verification is still underway, and scores may reflect conditions that favor the models' strongest task categories.

Why It Matters

A year ago, running a model competitive with Anthropic's Sonnet tier meant either paying Anthropic's API rates or maintaining expensive GPU infrastructure. Qwen 3.5 changes that arithmetic in two ways.


First, it demonstrates that architectural efficiency can substitute for parameter scale. The 35B-A3B model uses a Mixture-of-Experts (MoE) architecture — a design that keeps 35 billion parameters in storage but activates only 3 billion of them on any given input. That drastically reduces compute cost per query. The result: the 35B-A3B outperforms Alibaba's own previous-generation 235B-A22B model, which was nearly seven times larger by active parameter count. Smaller active footprints also mean faster inference. Alibaba reports the series decodes 19 times faster than Qwen3-Max at 256,000-token context lengths and costs 60% less to operate than its predecessor.
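The "open a few doors per query" idea can be sketched with a toy top-k router. This is a generic MoE illustration under standard top-k gating assumptions, not Qwen 3.5's actual routing code, and all sizes (`N_EXPERTS`, `TOP_K`, `D`) are made-up toy values.

```python
# Toy Mixture-of-Experts layer: store N_EXPERTS weight matrices,
# but run only TOP_K of them for any given token.
import numpy as np

rng = np.random.default_rng(0)
N_EXPERTS, TOP_K, D = 8, 2, 16  # toy sizes, not Qwen's real configuration

experts = [rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(N_EXPERTS)]
router = rng.standard_normal((D, N_EXPERTS)) / np.sqrt(D)

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Route a token vector x to its top-k experts and mix their outputs."""
    logits = x @ router                   # score every expert (cheap)
    top = np.argsort(logits)[-TOP_K:]     # keep the k highest-scoring experts
    weights = np.exp(logits[top])
    weights /= weights.sum()              # softmax over the chosen experts only
    # Only TOP_K of the N_EXPERTS expert matmuls execute — the compute saving.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

token = rng.standard_normal(D)
out = moe_layer(token)
print(out.shape)  # (16,)
```

All 8 expert matrices sit in memory, but each token pays for only 2 matrix multiplies; scaled up, that is the gap between 35B stored parameters and 3B active ones.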

Second, the Apache 2.0 open license means developers can self-host these models rather than paying per-token API fees. For organizations processing high volumes of text, this shifts costs from a per-query operational expense to a capital investment — a meaningful option for teams that already operate GPU infrastructure.
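That capex-versus-opex shift can be made concrete with a hedged break-even sketch. The API rate is the Qwen3.5-Flash input price quoted earlier; the monthly hosting cost is a purely illustrative assumption, not a figure from this story.

```python
# Break-even sketch: at what monthly input-token volume does a fixed
# self-hosting cost beat per-token API billing?

API_RATE_IN = 0.10         # Qwen3.5-Flash, $/M input tokens (from the article)
GPU_MONTHLY_COST = 1200.0  # assumed amortized GPU server cost, $/month (illustrative)

def breakeven_tokens(api_rate_per_million: float, fixed_monthly: float) -> float:
    """Monthly input tokens at which API spend equals the fixed hosting cost."""
    return fixed_monthly / api_rate_per_million * 1_000_000

print(breakeven_tokens(API_RATE_IN, GPU_MONTHLY_COST))  # ~12 billion tokens/month
```

Under these assumed numbers the break-even point is very high against Qwen's own cheap API, which is the article's point: self-hosting mainly pays off against pricier proprietary APIs or for teams that already run GPU infrastructure.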

Where Qwen 3.5 does not lead: Claude Sonnet 4.5 retains advantages in terminal coding and embodied reasoning. The competitive picture is not uniform across all task types. Additionally, self-hosting the largest variants — the 122B and 397B models — requires substantial GPU capacity, so the cost advantage for those tiers primarily applies via Alibaba Cloud API pricing rather than local deployment.

The broader pattern, which began with DeepSeek's January 2025 open-source release, continues: Chinese open-weight models reaching frontier-adjacent performance at a fraction of Western proprietary pricing. If that trend holds, near-term pressure on mid-tier API pricing from Anthropic and OpenAI will intensify.
