
Chinese AI and technology companies continue to impress with the development of cutting-edge AI language models.
Today, the one catching the eye is Alibaba Cloud’s Qwen team of AI researchers, which has unveiled a new proprietary reasoning model, Qwen3-Max-Thinking.
You may recall, as VentureBeat explained last year, that Qwen made a name for itself in the rapidly evolving global AI market by offering a variety of powerful open source models across modalities, from text to image to spoken audio. The company has even gained the support of US travel giant Airbnb, whose CEO and co-founder Brian Chesky said the company leverages Qwen’s free and open source models as a more affordable alternative to American offerings like those from OpenAI.
Now, with the proprietary Qwen3-Max-Thinking solution, the Qwen team aims to match and, in some cases, exceed the reasoning capabilities of GPT-5.2 and Gemini 3 Pro through architectural efficiency and agent autonomy.
The release comes at a critical time. Western laboratories have largely defined the "reasoning" category (often called "System 2" thinking), but Qwen’s latest benchmarks suggest that the gap has narrowed.
Additionally, the company’s aggressive API pricing strategy targets enterprise adoption. However, because it is a Chinese model, some U.S. companies with strict national security requirements may be hesitant to adopt it.
Architecture: "Test-Time Scaling" Redefined
The main innovation behind Qwen3-Max-Thinking is a departure from standard inference methods. While most models generate tokens linearly, Qwen3 uses a "heavy mode" driven by a technique known as test-time scaling.
Simply put, this technique lets the model trade computation for intelligence. But unlike naive "best of N" sampling, where a model might generate 100 answers and pick the best one, Qwen3-Max-Thinking uses a multi-round strategy of experience accumulation.
This approach mimics human problem solving. When the model encounters a complex query, it doesn’t just guess; it engages in iterative self-reflection, using a proprietary "experience" mechanism to distill information from previous reasoning steps. This allows the model to:
- Identify dead ends: recognize when a line of reasoning is failing without having to play it out completely.
- Focus computation: redirect processing power to "unresolved uncertainties" rather than re-deriving known conclusions.
The efficiency gains are tangible. By avoiding redundant reasoning, the model fits richer historical context into the same context window. The Qwen team reports that this method significantly increased performance without exploding token costs:
- GPQA (PhD-level science): scores improved from 90.3 to 92.8.
- LiveCodeBench v6: performance went from 88.0 to 91.4.
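To see why multi-round experience accumulation differs from naive best-of-N sampling, here is a toy sketch of the control flow. The model call is simulated with a stub (the real Qwen mechanism is proprietary and not publicly documented); the point is that each round starts from distilled notes instead of from scratch.

```python
# Toy sketch: multi-round "experience accumulation" vs. naive best-of-N sampling.
# `generate` is a stub standing in for a model call; its quality improves with
# accumulated notes, loosely imitating reuse of prior reasoning.

import random

def generate(question, context=(), rng=None):
    """Stub model call: quality rises with the amount of accumulated context."""
    rng = rng or random
    return {"text": f"attempt on {question!r}",
            "quality": rng.random() + 0.1 * len(context)}

def grade(answer):
    return answer["quality"]

def best_of_n(question, n=8, seed=0):
    """Naive baseline: n independent samples; no information is shared between them."""
    rng = random.Random(seed)
    return max((generate(question, rng=rng) for _ in range(n)), key=grade)

def multi_round(question, rounds=8, seed=0):
    """Each round feeds distilled notes from earlier attempts back in, so later
    rounds build on 'experience' instead of re-deriving known conclusions."""
    rng = random.Random(seed)
    notes, best = [], None
    for _ in range(rounds):
        answer = generate(question, context=notes, rng=rng)
        notes.append(answer["text"])  # distill the attempt into reusable context
        if best is None or grade(answer) > grade(best):
            best = answer
    return best
```

With the same random seed, the multi-round variant can never do worse than the independent-sampling baseline here, because later rounds inherit the context bonus.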
Beyond pure thinking: adaptive tools
While "thinking" models are powerful, they have historically been siloed: great at math, but poor at browsing the web or running code. Qwen3-Max-Thinking fills this gap by efficiently integrating thinking and non-thinking modes.
The model has adaptive tool usage capabilities, meaning it autonomously selects the right tool for the job without manual prompts from the user. It can seamlessly switch between:
- Web search and retrieval: for real-time factual queries.
- Memory: to store and recall user-specific context.
- Code interpreter: to write and run Python snippets for computational tasks.
In thinking mode, the model can invoke these tools simultaneously. This functionality is essential for enterprise applications where a model may need to verify a fact (search), calculate a projection (code interpreter), and then reason about the strategic implications (thinking), all in one pass.
Empirically, the team notes that this combination "effectively reduces hallucinations," because the model can base its reasoning on verifiable external data rather than relying solely on its training weights.
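Through an OpenAI-compatible chat API, adaptive tool use of this kind is typically expressed by declaring the available tools and letting the model pick ("tool_choice": "auto"). A minimal sketch follows; the tool names and schemas below are illustrative assumptions, not confirmed Qwen identifiers.

```python
# Hedged sketch of a tool-enabled request in the OpenAI-compatible format.
# The web_search / code_interpreter schemas are hypothetical examples; only the
# model id (qwen3-max-2026-01-23) comes from the article.

import json

def build_request(user_query):
    return {
        "model": "qwen3-max-2026-01-23",
        "messages": [{"role": "user", "content": user_query}],
        "tools": [
            {"type": "function", "function": {
                "name": "web_search",        # hypothetical tool
                "description": "Retrieve real-time facts from the web",
                "parameters": {"type": "object",
                               "properties": {"query": {"type": "string"}},
                               "required": ["query"]}}},
            {"type": "function", "function": {
                "name": "code_interpreter",  # hypothetical tool
                "description": "Run a Python snippet and return its output",
                "parameters": {"type": "object",
                               "properties": {"code": {"type": "string"}},
                               "required": ["code"]}}},
        ],
        "tool_choice": "auto",  # the model decides which tool fits each step
    }

payload = json.dumps(build_request("Verify the latest figure, then project 3% growth."))
```

Grounding each step in a tool result rather than in the model's weights is exactly the mechanism the team credits for reduced hallucinations.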
Benchmark Analysis: The Data Story
Qwen doesn’t shy away from making direct comparisons.
On HMMT February 2025, a rigorous mathematical reasoning benchmark, Qwen3-Max-Thinking scored 98.0, besting Gemini 3 Pro (97.5) and significantly edging out DeepSeek V3.2 (92.5).
However, arguably the most significant signal for developers is agentic search. On "Humanity’s Last Exam" (HLE), a benchmark of 3,000 "Google-proof" graduate-level questions spanning mathematics, science, computer science, the humanities and engineering, Qwen3-Max-Thinking, equipped with web search tools, scored 49.8, beating Gemini 3 Pro (45.8) and GPT-5.2-Thinking (45.5).
This suggests that the Qwen3-Max-Thinking architecture is particularly suited to complex, multi-step agent workflows where external data retrieval is required.
In coding tasks, the model also shines. On Arena-Hard v2, it posted a score of 90.2, leaving competitors like Claude Opus 4.5 (76.7) far behind.
The economics of reasoning: a pricing breakdown
For the first time, we have a clear view of the economics of Qwen’s frontier reasoning model. Alibaba Cloud has positioned qwen3-max-2026-01-23 as a premium but accessible offering on its API.
- Input: $1.20 per 1 million tokens (for standard contexts ≤32K).
- Output: $6.00 per 1 million tokens.
At a basic level, here’s how Qwen3-Max-Thinking compares:
| Model | Input (/1M) | Output (/1M) | Total cost |
| --- | --- | --- | --- |
| Qwen3 Turbo | $0.05 | $0.20 | $0.25 |
| Grok 4.1 Fast (reasoning) | $0.20 | $0.50 | $0.70 |
| Grok 4.1 Fast (non-reasoning) | $0.20 | $0.50 | $0.70 |
| DeepSeek chat (V3.2-Exp) | $0.28 | $0.42 | $0.70 |
| DeepSeek reasoner (V3.2-Exp) | $0.28 | $0.42 | $0.70 |
| Qwen3 Plus | $0.40 | $1.20 | $1.60 |
| ERNIE 5.0 | $0.85 | $3.40 | $4.25 |
| Gemini 3 Flash Preview | $0.50 | $3.00 | $3.50 |
| Claude Haiku 4.5 | $1.00 | $5.00 | $6.00 |
| Qwen3-Max-Thinking (2026-01-23) | $1.20 | $6.00 | $7.20 |
| Gemini 3 Pro (≤200K) | $2.00 | $12.00 | $14.00 |
| GPT-5.2 | $1.75 | $14.00 | $15.75 |
| Claude Sonnet 4.5 | $3.00 | $15.00 | $18.00 |
| Gemini 3 Pro (>200K) | $4.00 | $18.00 | $22.00 |
| Claude Opus 4.5 | $5.00 | $25.00 | $30.00 |
| GPT-5.2 Pro | $21.00 | $168.00 | $189.00 |
This pricing structure is aggressive, undercutting the prices of many existing flagship models while still providing industry-leading performance.
However, developers should note the granular pricing of the new agentic features, as Qwen separates the cost of "thinking" (tokens) from the cost of "doing" (tool use).
- Agentic search strategies: the two tiers, `search_strategy: agent` and the more advanced `search_strategy: agent_max`, are priced at $10 per 1,000 calls.
  - Note: the `agent_max` strategy is currently marked as a "limited-time offer," which suggests that its price could increase later.
- Web search: priced at $10 per 1,000 calls via the Responses API.
Promotional free tier: to encourage adoption of its most advanced features, Alibaba Cloud is currently offering two key tools free for a limited time:
- Web Extractor: free (limited time).
- Code interpreter: free (limited time).
This pricing model (low token costs plus à la carte tool pricing) allows developers to build complex, cost-effective agents for text processing, paying extra only when external actions, like a live web search, are explicitly triggered.
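Under this split-billing model, estimating the cost of an agent run is simple arithmetic. The rates below are the ones quoted above; the workload figures are made-up examples.

```python
# Rough cost estimator under the pricing quoted in this article.
# Rates come from the article; the sample workload numbers are invented.

INPUT_PER_M = 1.20              # $ per 1M input tokens (<=32K context tier)
OUTPUT_PER_M = 6.00             # $ per 1M output tokens
SEARCH_PER_CALL = 10.00 / 1000  # $ per agentic search call ($10 / 1,000 calls)

def run_cost(input_tokens, output_tokens, search_calls):
    """Total dollars for one agent run: token costs plus tool-call costs."""
    tokens = (input_tokens / 1e6) * INPUT_PER_M + (output_tokens / 1e6) * OUTPUT_PER_M
    tools = search_calls * SEARCH_PER_CALL
    return round(tokens + tools, 4)

# Example: 20K input tokens, 5K output ("thinking") tokens, 3 agentic searches.
cost = run_cost(20_000, 5_000, 3)
```

Note that for this small workload the three tool calls ($0.03) cost more than all the output tokens ($0.03 for 5K tokens), which is why the free Code Interpreter and Web Extractor windows matter.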
Developer ecosystem
Recognizing that performance is useless without integration, Alibaba Cloud has made Qwen3-Max-Thinking easy to drop into existing stacks.
- OpenAI compatibility: the API supports the standard OpenAI format, allowing teams to switch models by simply changing the `base_url` and `model` name.
- Anthropic compatibility: in a smart move to capture the coding market, the API also supports the Anthropic protocol. This makes Qwen3-Max-Thinking compatible with Claude Code, a popular agentic coding environment.
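The OpenAI-compatible switch described above amounts to changing two values. A minimal sketch follows; the endpoint URL is Alibaba Cloud's commonly documented OpenAI-compatible path for DashScope and should be verified for your account and region.

```python
# Sketch of the two-value switch to Qwen via an OpenAI-compatible client.
# The base_url is an assumption (verify against Alibaba Cloud's current docs);
# the model id is the one quoted in this article.

def qwen_client_config(api_key):
    """Kwargs for an OpenAI-style client pointed at Alibaba Cloud."""
    return {
        "api_key": api_key,  # a DashScope key instead of an OpenAI key
        "base_url": "https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
    }

def chat_request(prompt):
    """Arguments for chat.completions.create(); unchanged from an OpenAI setup."""
    return {
        "model": "qwen3-max-2026-01-23",
        "messages": [{"role": "user", "content": prompt}],
    }

# Usage (requires the `openai` package and a valid key):
#   client = OpenAI(**qwen_client_config(os.environ["DASHSCOPE_API_KEY"]))
#   resp = client.chat.completions.create(**chat_request("Hello"))
```

Everything else in an existing OpenAI-based codebase (streaming, tool calls, message history) is intended to keep working unmodified.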
The verdict
Qwen3-Max-Thinking represents a maturation of the AI market in 2026. It moves the conversation beyond "who has the smartest chatbot" to "who has the most competent agent."
By combining high-efficiency reasoning with the use of adaptive and autonomous tools (and pricing them to move), Qwen has firmly established itself as a leading contender for the enterprise AI throne.
For developers and businesses, the limited-time free windows on Code Interpreter and Web Extractor suggest that now is the time to experiment. The reasoning wars are far from over, but Qwen just fielded a heavy hitter.




