
Google’s Gemini 3.1 Pro Doubles Its Reasoning Score

Summary

– Google has released Gemini 3.1 Pro, a new AI model building on the progress of Gemini 3.
– The model shows improved reasoning, scoring 44.4% on the Humanity’s Last Exam benchmark and 77.1% on the ARC-AGI-2 logic benchmark.
– It is designed as the core intelligence for daily use, underpinning the science-focused Gemini 3 Deep Think mode.
– Despite these improvements, Anthropic’s Claude models currently lead in some external rankings for text capability and safety.
– Experts caution that true performance is relative and will be clearer after broader testing and competitor releases like GPT-5.3.

Google has officially launched Gemini 3.1 Pro, marking a significant step forward in its AI model development. This new release builds directly on the foundation set by Gemini 3, which itself garnered positive attention for its performance against rivals. The company reports that the latest iteration demonstrates a substantial leap in logical reasoning, a claim supported by its performance on key industry benchmarks designed to measure advanced cognitive abilities.

The announcement highlights a 77.1% score on the ARC-AGI-2 benchmark, which Google states represents more than double the reasoning performance of its predecessor, Gemini 3 Pro. This benchmark is specifically crafted to evaluate an AI’s ability to handle entirely new logic patterns, a critical measure of general intelligence. This progress follows closely on the heels of last week’s “Deep Think” upgrade to Gemini 3, which introduced enhanced capabilities for complex scientific and engineering challenges. Google positions Gemini 3.1 Pro as the upgraded core intelligence that enables such specialized breakthroughs.

Benchmark performance provides a useful, if incomplete, picture. Late last year, Gemini 3 set a new high score of 38.3% on the rigorous Humanity’s Last Exam (HLE) benchmark; Gemini 3.1 Pro now achieves 44.4% on the same test. The specialized Deep Think mode scores even higher, at 48.4%, but it operates with longer processing times optimized for heavy-duty research tasks. For a model designed for broader, daily use, the gains shown by 3.1 Pro are particularly noteworthy.

However, the competitive landscape remains fierce. Anthropic’s Claude Opus 4.6 currently leads the Center for AI Safety’s text capability leaderboard, which averages scores from several reasoning and knowledge benchmarks. On safety assessments, Anthropic’s models also outperform Gemini 3 according to CAIS data. This underscores a key reality in AI development: today’s state-of-the-art model can quickly be overshadowed by a competitor’s next release.

Experts advise a measured perspective on these announcements. While the benchmark numbers suggest a meaningful improvement, the true capabilities and limitations of Gemini 3.1 Pro will become clearer through widespread, real-world testing over time. Its value is also relative, as the industry awaits future releases from other labs, which could reset competitive expectations once again.

For those interested in hands-on experience, Gemini 3.1 Pro is available in preview through several Google platforms. Developers can access it via API in Google AI Studio, Android Studio, and other developer tools. Enterprise customers will find it in Vertex AI and Gemini Enterprise, while general users can explore its capabilities within NotebookLM and the standard Gemini app. This tiered rollout allows different user groups to evaluate how the model’s enhanced reasoning translates into practical utility for their specific needs.
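For developers wondering what the API route looks like in practice, the following is a minimal sketch using the google-genai Python SDK against the Google AI Studio API. The model identifier "gemini-3.1-pro-preview" is an assumption, as the article does not give the exact preview string; check the model list in Google AI Studio or Vertex AI before relying on it.

```python
# Minimal sketch: sending a prompt to the Gemini API with the google-genai SDK.
# The model name below is an assumed preview identifier, not confirmed by the article.
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")  # or set the GEMINI_API_KEY env var

response = client.models.generate_content(
    model="gemini-3.1-pro-preview",  # assumption: replace with the actual preview ID
    contents="In three steps, outline how you would check a novel logic puzzle's solution.",
)

print(response.text)
```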

(Source: ZDNET)
