Reasoning reimagined: Introducing Phi-4-mini-flash-reasoning | Microsoft Azure Blog

2025-07-09 16:00:00

By Weizhu Chen, Jianfeng Gao, and Liliang Ren

Unlock faster, efficient reasoning with Phi-4-mini-flash-reasoning—optimized for edge, mobile, and real-time applications.

State-of-the-art architecture redefines speed for reasoning models

Microsoft is excited to unveil a new addition to the Phi model family: Phi-4-mini-flash-reasoning. Purpose-built for scenarios where compute, memory, and latency are tightly constrained, this new model is engineered to bring advanced reasoning capabilities to edge devices, mobile applications, and other resource-constrained environments. This new model follows Phi-4-mini, but is built on a new hybrid architecture that achieves up to 10 times higher throughput and a 2 to 3 times average reduction in latency, enabling significantly faster inference without sacrificing reasoning performance. Ready to power real-world solutions that demand efficiency and flexibility, Phi-4-mini-flash-reasoning is available on Azure AI Foundry, NVIDIA API Catalog, and Hugging Face today.

Efficiency without compromise 

Phi-4-mini-flash-reasoning balances math reasoning ability with efficiency, making it potentially suitable for educational applications, real-time logic-based applications, and more. 

Similar to its predecessor, Phi-4-mini-flash-reasoning is a 3.8 billion parameter open model optimized for advanced math reasoning. It supports a 64K token context length and is fine-tuned on high-quality synthetic data to deliver reliable, logic-intensive performance.

What’s new?

At the core of Phi-4-mini-flash-reasoning is the newly introduced decoder-hybrid-decoder architecture, SambaY, whose central innovation is the Gated Memory Unit (GMU), a simple yet effective mechanism for sharing representations between layers. The architecture includes a self-decoder that combines Mamba (a State Space Model) and Sliding Window Attention (SWA), along with a single layer of full attention. It also includes a cross-decoder that interleaves expensive cross-attention layers with the new, efficient GMUs. This new architecture with GMU modules drastically improves decoding efficiency, boosts long-context retrieval performance, and delivers exceptional results across a wide range of tasks.
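To make the idea concrete, here is a minimal, illustrative sketch of an element-wise gated memory unit in PyTorch. The blog describes the GMU only as a mechanism for sharing representations between layers, so the projections, gating function, and shapes below are assumptions for illustration rather than the exact SambaY formulation; the key point is that reusing a shared memory state in the cross-decoder avoids recomputing expensive attention at every layer.

```python
import torch
import torch.nn as nn

class GatedMemoryUnit(nn.Module):
    """Illustrative sketch of a Gated Memory Unit (GMU).

    Cross-decoder layers reuse a memory state produced by the self-decoder
    instead of recomputing cross-attention. The element-wise gating below is
    one plausible reading of that idea, not the exact SambaY formulation.
    """

    def __init__(self, d_model: int):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_model, bias=False)  # gate derived from the current hidden state
        self.out_proj = nn.Linear(d_model, d_model, bias=False)   # mixes the gated memory back into the residual stream

    def forward(self, hidden: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # hidden: current layer's hidden states, shape (batch, seq, d_model)
        # memory: representation shared from an earlier (self-decoder) layer, same shape
        gate = torch.sigmoid(self.gate_proj(hidden))   # element-wise gate in [0, 1]
        return self.out_proj(gate * memory)            # cheap per token: no attention over the sequence

# Usage: a cross-decoder layer can apply this in place of cross-attention,
# so per-token decoding cost does not grow with context length.
gmu = GatedMemoryUnit(d_model=64)
h = torch.randn(2, 8, 64)   # current hidden states
m = torch.randn(2, 8, 64)   # memory shared from the self-decoder
out = gmu(h, m)             # shape (2, 8, 64)
```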

Key benefits of the SambaY architecture include: 

  • Enhanced decoding efficiency.
  • Preserved linear prefilling time complexity.
  • Increased scalability and enhanced long context performance.
  • Up to 10 times higher throughput.
Figure: Our decoder-hybrid-decoder architecture taking Samba [RLL+25] as the self-decoder. Gated Memory Units (GMUs) are interleaved with the cross-attention layers in the cross-decoder to reduce the decoding computation complexity. As in YOCO [SDZ+24], the full attention layer only computes the KV cache during the prefilling with the self-decoder, leading to linear computation complexity for the prefill stage.

Phi-4-mini-flash-reasoning benchmarks 

Like all models in the Phi family, Phi-4-mini-flash-reasoning is deployable on a single GPU, making it accessible for a broad range of use cases. However, what sets it apart is its architectural advantage. This new model achieves significantly lower latency and higher throughput compared to Phi-4-mini-reasoning, particularly in long-context generation and latency-sensitive reasoning tasks. 

This makes Phi-4-mini-flash-reasoning a compelling option for developers and enterprises looking to deploy intelligent systems that require fast, scalable, and efficient reasoning—whether on premises or on-device. 
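As a rough sketch of what single-GPU deployment can look like, the snippet below loads the model with Hugging Face transformers and runs a single reasoning prompt. The model id, dtype, and generation settings are assumptions based on the Hugging Face availability mentioned above; consult the model card for the officially recommended configuration.

```python
# Minimal sketch: running Phi-4-mini-flash-reasoning on a single GPU with Hugging Face transformers.
# The model id and generation settings below are assumptions; see the model card for the
# officially recommended configuration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-4-mini-flash-reasoning"  # assumed Hugging Face id
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    trust_remote_code=True,  # the hybrid SambaY architecture may ship custom modeling code
)

messages = [{"role": "user", "content": "Solve 2x + 3 = 11 and explain each step."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512, temperature=0.6, do_sample=True)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```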

Figure: The top plot shows inference latency as a function of generation length, while the bottom plot illustrates how inference latency varies with throughput. Both experiments were conducted using the vLLM inference framework on a single A100-80GB GPU with tensor parallelism (TP) set to 1.
Figure: Pass@1 accuracy, averaged over 64 samples for AIME24/25 and 8 samples for Math500 and GPQA Diamond for a more reliable evaluation. In this graph, Phi-4-mini-flash-reasoning outperforms Phi-4-mini-reasoning and models twice its size.
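For readers who want to reproduce the flavor of these latency and throughput measurements, the sketch below runs the model through vLLM on a single GPU with tensor parallelism set to 1, mirroring the setup described in the caption above. The model id, sampling settings, and timing loop are illustrative assumptions, not the benchmark harness used to produce the plots.

```python
# Illustrative only: a single-GPU vLLM run (TP=1) to get rough latency/throughput numbers.
# The model id and settings are assumptions; this is not the harness behind the plots above.
import time
from vllm import LLM, SamplingParams

llm = LLM(
    model="microsoft/Phi-4-mini-flash-reasoning",  # assumed Hugging Face id
    tensor_parallel_size=1,
    trust_remote_code=True,
)
params = SamplingParams(temperature=0.6, max_tokens=2048)

prompts = ["Prove that the sum of two even integers is even."] * 8
start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tokens/s")
```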

What are the potential use cases? 

Thanks to its reduced latency, improved throughput, and focus on math reasoning, the model is ideal for: 

  • Adaptive learning platforms, where real-time feedback loops are essential.
  • On-device reasoning assistants, such as mobile study aids or edge-based logic agents.
  • Interactive tutoring systems that dynamically adjust content difficulty based on a learner’s performance.

Its strength in math and structured reasoning makes it especially valuable for education technology, lightweight simulations, and automated assessment tools that require reliable logic inference with fast response times. 

Developers are encouraged to connect with peers and Microsoft engineers through the Microsoft Developer Discord community to ask questions, share feedback, and explore real-world use cases together. 

Microsoft’s commitment to trustworthy AI 

Organizations across industries are leveraging Azure AI and Microsoft 365 Copilot capabilities to drive growth, increase productivity, and create value-added experiences. 

We’re committed to helping organizations use and build AI that is trustworthy, meaning it is secure, private, and safe. We bring best practices and learnings from decades of researching and building AI products at scale to provide industry-leading commitments and capabilities that span our three pillars of security, privacy, and safety. Trustworthy AI is only possible when you combine our commitments, such as our Secure Future Initiative and our responsible AI principles, with our product capabilities to unlock AI transformation with confidence.  

Phi models are developed in accordance with Microsoft AI principles: accountability, transparency, fairness, reliability and safety, privacy and security, and inclusiveness.  

The Phi model family, including Phi-4-mini-flash-reasoning, employs a robust safety post-training strategy that integrates Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Reinforcement Learning from Human Feedback (RLHF). These techniques are applied using a combination of open-source and proprietary datasets, with a strong emphasis on ensuring helpfulness, minimizing harmful outputs, and addressing a broad range of safety categories. Developers are encouraged to apply responsible AI best practices tailored to their specific use cases and cultural contexts. 

Read the model card to learn more about risks and mitigation strategies.

Learn more about the new model 

Create with Azure AI Foundry


Summary

Microsoft introduces Phi-4-mini-flash-reasoning, a new model optimized for edge devices, mobile applications, and real-time scenarios with up to 10 times higher throughput and reduced latency. This hybrid architecture model, featuring Gated Memory Units (GMUs), enhances decoding efficiency and long-context performance without compromising reasoning capabilities. Available on Azure AI Foundry, NVIDIA API Catalog, and Hugging Face, Phi-4-mini-flash-reasoning is ideal for educational platforms, real-time logic applications, and interactive tutoring systems requiring fast response times.