Here’s the key info, at a glance:
The launch of Granite 4.0 initiates a new era for IBM’s family of enterprise-ready large language models, leveraging novel architectural advancements to double down on small, efficient language models that provide competitive performance at reduced costs and latency. The Granite 4.0 models were developed with a particular emphasis on essential tasks for agentic workflows, both in standalone deployments and as cost-efficient building blocks in complex systems alongside larger reasoning models.
The Granite 4.0 collection comprises multiple model sizes and architecture styles to provide optimal performance across a wide array of hardware constraints, including:
Granite 4.0-H Small is a workhorse model for strong, cost-effective performance on enterprise workflows like multi-tool agents and customer support automation. The Tiny and Micro models are designed for low latency, edge and local applications, and can also serve as a building block within larger agentic workflows for fast execution of key tasks such as function calling.
Granite 4.0 benchmark performance shows substantial improvements over prior generations—even the smallest Granite 4.0 models significantly outperform Granite 3.3 8B, despite being less than half its size—but their most notable strength is a remarkable increase in inference efficiency. Relative to conventional LLMs, our hybrid Granite 4.0 models require significantly less RAM to run, especially for tasks involving long context lengths (like ingesting a large codebase or extensive documentation) and multiple sessions at the same time (like a customer service agent handling many detailed user inquiries simultaneously).
Most importantly, this dramatic reduction in Granite 4.0’s memory requirements entails a similarly dramatic reduction in the cost of hardware needed to run heavy workloads at high inference speeds. Our aim is to lower barriers to entry by providing enterprises and open-source developers alike with cost-effective access to highly competitive LLMs.
IBM’s prioritization of practical inference efficiency on any hardware is matched by our emphasis on the safety, security and transparency of our model ecosystem. Following an extensive, months-long external audit of IBM’s AI development process, IBM Granite recently became the only open language model family to achieve ISO 42001 certification, meeting the world’s first international standard for accountability, explainability, data privacy and reliability in AI management systems (AIMS). That foundational trustworthiness is further bolstered by our recent partnership with HackerOne on a bug bounty program for Granite, as well as our new practice of cryptographically signing all 4.0 model checkpoints available on Hugging Face (enabling developers and enterprises to verify the models’ provenance and authenticity).
Select enterprise partners, including EY and Lockheed Martin, were given early access to test Granite 4.0’s capabilities at scale on key use cases. Feedback from these early release partners, alongside feedback from the open-source community, will be used to improve and optimize the models for future updates.
Today’s release includes both Base and Instruct variants of Micro, Tiny and Small. Additional model sizes (both bigger and smaller), as well as variants with explicit reasoning support, are planned for release by the end of 2025.
The hybrid Granite 4.0 models are significantly faster and more memory-efficient than comparably sized models built with standard transformer architectures. The Granite 4 hybrid architecture combines a small number of standard transformer-style attention layers with a majority of Mamba layers—more specifically, Mamba-2. Mamba processes the nuances of language in a way that’s wholly distinct from, and significantly more efficient than, that of conventional language models.
LLMs’ GPU memory requirements are often reported in terms of how much RAM is needed just to load the model weights. But many enterprise use cases—especially those involving large-scale deployment, agentic AI in complex environments or RAG systems—entail lengthy context, batched inference across several concurrent model instances, or both. In keeping with IBM’s emphasis on enterprise practicality, we evaluated and optimized Granite 4 with long context and concurrent sessions in mind.
Compared to conventional transformer-based models, Granite 4.0-H can offer over 70% reduction in RAM needed to handle long inputs and multiple concurrent batches.
The hybrid Granite 4.0 models are compatible with AMD Instinct™ MI300X GPUs, enabling even further reduction of their memory footprint.
Conventional LLMs struggle to maintain throughput as context length or batch size increases. Our hybrid models continue to accelerate their output even at workloads where most models slow to a crawl or outright exceed hardware capacity. The more you throw at them, the more their advantages are apparent.
IBM worked with Qualcomm Technologies, Inc. and Nexa AI to ensure Granite 4.0 models’ compatibility with Hexagon™ NPUs1 to further optimize inference speed for on-device deployment on smartphones and PCs.
Of course, the actual utility of those efficiency advantages is driven by the fact that the quality of Granite 4.0 models’ output is competitive with that of models at or above their respective weight classes—especially on benchmarks that evaluate performance on key agentic AI tasks like instruction following and function calling.
All Granite 4.0 models offer major across-the-board performance improvements over the previous generation of Granite models. While the new Granite hybrid architecture contributes to the efficiency and efficacy of model training, most of the improvement in model accuracy is derived from advancements in our training (and post-training) methodologies and the ongoing expansion and refinement of the Granite training data corpus. This is how and why even Granite 4.0-Micro, built on a conventional transformer architecture similar to that of past Granite models, significantly outperforms Granite 3.3 8B.
They particularly excel on tasks essential to enterprise use cases and agentic AI workflows. As evaluated by Stanford HELM, Granite-4.0-H-Small exceeds all open weight models (with the sole exception of Llama 4 Maverick, a 402B-parameter model over 12 times its size) on IFEval, a widely used benchmark for evaluating a model’s ability to follow explicit instructions.
In many agentic workflows, it’s crucial for instructions to not only be reliably followed, but also accurately translated into effective tool calls. To that end, Granite-4.0-H-Small keeps pace with much larger models, both open and closed, on the Berkeley Function Calling Leaderboard v3 benchmark (BFCLv3). Moreover, it achieves this at a price point unmatched within this competitive set.
Granite 4.0 likewise excels on MTRAG, a benchmark measuring performance and reliability on complex retrieval augmented generation (RAG) tasks entailing multiple turns, unanswerable questions, non-standalone questions and information spanning multiple domains.
Additional evaluation metrics are available on Granite 4.0’s Hugging Face model cards.
All Granite models are built with security, safety and responsible governance at their core.
Earlier this month, IBM Granite became the first open language model family to receive accreditation under ISO/IEC 42001:2023, certifying that Granite is aligned with internationally recognized best practices for safe, responsible AI and that IBM’s AI management system (AIMS) meets the highest levels of scrutiny. Organizations can confidently build with Granite 4.0 models even in high-stakes contexts like highly regulated industries and mission-critical deployment environments.
Like all Granite models, Granite 4.0 models were trained entirely on carefully curated, ethically acquired and enterprise-cleared data. Reflecting our full confidence in our models’ trustworthiness, IBM provides an uncapped indemnity for third party IP claims against content generated by Granite models when used on IBM watsonx.ai.
Going beyond our extensive internal testing and red-teaming, IBM has also recently partnered with HackerOne to launch a bug bounty program for Granite, offering up to $100,000 for the identification of any unforeseen flaws, failure modes or vulnerabilities to jailbreaking and other adversarial attacks. Any such invaluable information uncovered by researchers participating in the bug bounty program will inform ongoing enhancements and updates to our models’ security—particularly through the generation of synthetic data to improve model alignment.
IBM is focused on the safety and security of not only our models themselves, but of the model distribution chain as well. To that end, IBM has initiated the novel practice of cryptographically signing all Granite 4 model checkpoints before release: all Granite model checkpoints now ship with a model.sig file that enables easy, public verification of their provenance, integrity and authenticity.
Despite their many upsides, transformer models have a critical downside: their computational needs scale quadratically with sequence length. If context length doubles, the number of calculations a transformer model must perform (and store in memory) quadruples. This “quadratic bottleneck” inevitably decreases speed and increases cost as context length increases. At long context lengths it can quickly exhaust the RAM capacity of even high-end consumer GPUs.
Whereas transformers rely on self-attention, Mamba uses an entirely distinct selectivity mechanism that’s inherently more efficient. Mamba’s computational requirements scale linearly with sequence length: when context doubles, Mamba performs only double—not quadruple—the calculations. Even better, Mamba’s memory requirements remain constant, regardless of sequence length. The more work you throw at a Mamba model, the greater its advantages over transformers.
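To make that scaling difference concrete, here is a minimal back-of-the-envelope sketch. The layer counts, hidden sizes and state dimensions below are illustrative assumptions rather than Granite 4.0’s actual configuration; the point is simply that attention compute grows quadratically and the KV cache grows with context length, while an SSM scan is linear in context length and its recurrent state stays a fixed size.

```python
# Back-of-the-envelope comparison of attention vs. SSM scaling.
# All sizes below are illustrative assumptions, not Granite 4.0's real configuration.

def transformer_costs(context_len, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_val=2):
    """Attention compute grows ~quadratically with context length,
    and the KV cache grows linearly with it."""
    attn_flops = n_layers * context_len ** 2 * head_dim  # rough score computation, ~O(L^2)
    kv_cache_bytes = n_layers * 2 * context_len * n_kv_heads * head_dim * bytes_per_val
    return attn_flops, kv_cache_bytes

def mamba_costs(context_len, n_layers=32, d_model=4096, d_state=128, bytes_per_val=2):
    """An SSM scan is ~linear in context length, and its recurrent
    state has a fixed size no matter how long the context is."""
    scan_flops = n_layers * context_len * d_model * d_state  # ~O(L)
    state_bytes = n_layers * d_model * d_state * bytes_per_val  # constant in L
    return scan_flops, state_bytes

for L in (8_192, 32_768, 131_072):
    t_flops, t_mem = transformer_costs(L)
    m_flops, m_mem = mamba_costs(L)
    print(f"context {L:>7}: attention/SSM compute ratio ~{t_flops / m_flops:5.1f}x, "
          f"KV cache {t_mem / 1e9:5.1f} GB vs. fixed SSM state {m_mem / 1e9:5.3f} GB")
```

Under these toy assumptions, the per-sequence KV cache grows from roughly 1 GB at 8K tokens to about 17 GB at 128K tokens (and scales again with batch size), while the SSM state stays a few tens of megabytes throughout.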
Nevertheless, transformers and self-attention do still have some advantages over Mamba and Mamba-2, particularly for performance on tasks that entail in-context learning (like few-shot prompting). Fortunately, combining both in a hybrid model provides the best of both worlds. For more insight, revisit our sneak peek of Granite-4.0-Tiny-Preview.
The architecture powering Granite 4.0-H-Micro, Granite 4.0-H-Tiny and Granite 4.0-H-Small combines Mamba-2 layers and conventional transformer blocks sequentially in a 9:1 ratio. Essentially, the Mamba-2 blocks efficiently process global context and periodically pass that contextual information through a transformer block that delivers a more nuanced parsing of local context through self-attention before passing it along to the next grouping of Mamba-2 layers.
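As a purely structural illustration of that layering pattern, the short sketch below lays out the 9:1 interleaving described above. The block labels and group count are hypothetical placeholders, not Granite 4.0’s actual module names or layer count.

```python
# Illustrative sketch of the 9:1 Mamba-2 / attention interleaving described above.
# Block labels and group count are placeholders, not Granite 4.0's actual layout.
MAMBA_PER_ATTENTION = 9

def build_hybrid_layer_plan(n_groups: int) -> list[str]:
    """Return the block sequence: 9 Mamba-2 blocks, then 1 attention block, repeated."""
    plan: list[str] = []
    for _ in range(n_groups):
        plan.extend(["mamba2"] * MAMBA_PER_ATTENTION)  # efficient global-context processing
        plan.append("attention")                       # periodic self-attention pass for local nuance
    return plan

print(build_hybrid_layer_plan(4))  # 40 blocks: mamba2 x9, attention, mamba2 x9, attention, ...
```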
It’s worth noting that most of the world’s LLM-serving infrastructure was historically tailored to transformer-only models. Following our experimental launch of Granite 4.0-Tiny-Preview earlier this year, we’ve collaborated extensively with ecosystem partners to establish support for the Granite 4 Hybrid architecture in inference frameworks including vLLM, llama.cpp, NexaML and MLX in preparation for today’s release.
Granite-4.0-H-Tiny and Granite-4.0-H-Small pass the output of each Mamba-2 and transformer block to a fine-grained mixture of experts (MoE) block (whose specifications have changed slightly since Granite 4.0-Tiny-Preview). While fine-grained MoEs have been an area of active IBM research since the release of Granite 3.0 in 2024, Tiny and Small are our first MoEs to utilize shared experts that are always activated, which improves their parameter efficiency and enables the other “experts” to better develop distinctly specialized knowledge.
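For readers who want to see the idea in code, here is a minimal PyTorch-style sketch of a fine-grained MoE block with a single always-active shared expert alongside routed experts. The expert counts, dimensions and top-k value are illustrative assumptions, not Granite 4.0’s actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedExpertMoE(nn.Module):
    """Toy fine-grained MoE block with one always-active shared expert.
    Sizes and expert counts are illustrative, not Granite 4.0's configuration."""

    def __init__(self, d_model=1024, d_expert=256, n_experts=32, top_k=8):
        super().__init__()
        def make_expert():
            return nn.Sequential(nn.Linear(d_model, d_expert), nn.SiLU(), nn.Linear(d_expert, d_model))
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([make_expert() for _ in range(n_experts)])
        # The shared expert runs on every token, so broadly useful knowledge does not
        # have to be duplicated across the routed experts.
        self.shared_expert = make_expert()
        self.top_k = top_k

    def forward(self, x):                                    # x: (n_tokens, d_model)
        routed = torch.zeros_like(x)
        probs = F.softmax(self.router(x), dim=-1)            # routing probabilities
        top_w, top_idx = probs.topk(self.top_k, dim=-1)      # pick top-k experts per token
        top_w = top_w / top_w.sum(dim=-1, keepdim=True)      # renormalize over chosen experts
        for e, expert in enumerate(self.experts):
            chosen = (top_idx == e)                          # (n_tokens, top_k) bool
            token_mask = chosen.any(dim=-1)
            if token_mask.any():
                w = (top_w * chosen).sum(dim=-1)[token_mask].unsqueeze(-1)
                routed[token_mask] += w * expert(x[token_mask])
        return self.shared_expert(x) + routed                # shared expert is always active

tokens = torch.randn(16, 1024)
print(SharedExpertMoE()(tokens).shape)  # torch.Size([16, 1024])
```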
Granite 4.0-H-Micro utilizes conventional dense feedforward layers in lieu of MoE blocks, but otherwise mirrors the architecture shared by Tiny and Small.
One of the more tantalizing aspects of state space model (SSM)-based language models like Mamba is their theoretical potential to handle infinitely long sequences. All Granite 4.0 models have been trained on data samples up to 512K tokens in context length. Performance has been validated on tasks involving context length of up to 128K tokens, but theoretically, the context length can extend further.
In standard transformer models, the maximum context window is fundamentally constrained by the limitations of positional encoding. Because a transformer’s attention mechanism processes every token at once, it doesn’t preserve any information about the order of tokens. Positional encoding (PE) adds that information back in. Some research suggests that models using common PE techniques such as rotary positional encoding (RoPE) struggle on sequences longer than what they’ve seen in training.2
The Granite 4.0-H architecture uses no positional encoding (NoPE). We found that, simply put, the models don’t need it: Mamba inherently preserves information about the order of tokens, because it “reads” them sequentially.
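A tiny sketch illustrates the contrast: with no positional encoding, plain (non-causal) self-attention is permutation-equivariant, so token order contributes no signal of its own, whereas a sequential scan of the kind SSMs perform is inherently order-dependent. The code below is a toy illustration, not Granite 4.0’s actual layers.

```python
import torch

torch.manual_seed(0)
x = torch.randn(6, 16)        # a toy sequence of 6 token embeddings
perm = torch.randperm(6)

def attention_no_pe(x):
    """Plain (non-causal) self-attention with no positional encoding:
    reordering the inputs just reorders the outputs identically."""
    scores = x @ x.T / x.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ x

def recurrent_scan(x, decay=0.9):
    """A toy linear recurrence in the spirit of an SSM: each step depends on a
    running state, so the result is inherently sensitive to token order."""
    state = torch.zeros(x.shape[-1])
    out = []
    for token in x:
        state = decay * state + token
        out.append(state)
    return torch.stack(out)

# Attention without PE: permuting the inputs gives an identically permuted output.
print(torch.allclose(attention_no_pe(x)[perm], attention_no_pe(x[perm]), atol=1e-6))  # True
# The scan does not commute with permutation: order genuinely changes the result.
print(torch.allclose(recurrent_scan(x)[perm], recurrent_scan(x[perm]), atol=1e-6))    # False
```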
Across their varying architecture implementations, all Granite 4.0 models are trained on samples drawn from the same carefully compiled 22T-token corpus of enterprise-focused training data, and share the same improved pre-training methodologies, post-training regimen and chat template.
Granite 4.0 was pre-trained on a broad spectrum of samples curated from DataComp-LM (DCLM), GneissWeb, TxT360 subsets, Wikipedia and other enterprise-relevant sources. The models were further post-trained to excel at enterprise tasks, leveraging both synthetic and open datasets across domains including language, code, math and reasoning, multilinguality, safety, tool calling, RAG and cybersecurity. All training datasets were prepared with the open-source Data Prep Kit framework.
A notable departure from prior generations of Granite models is the decision to split our post-trained Granite 4.0 models into separate instruction-tuned (released today) and reasoning variants (to be released later this fall). Echoing the findings of recent industry research, we found in training that splitting the two resulted in better instruction-following performance for the Instruct models and better complex reasoning performance for the Thinking models. This has the added benefit of simplifying chat templates for both variants.
Later this fall, the Base and Instruct variants of Granite 4.0 models will be joined by their “Thinking” counterparts, whose post-training for enhanced performance on complex logic-driven tasks is ongoing.
By the end of the year, we plan to also release additional model sizes, including not only Granite 4.0 Medium, but also Granite 4.0 Nano, an array of significantly smaller models designed for (among other things) inference on edge devices.
Granite 4.0 models are now available across a broad spectrum of platform providers and inference frameworks for use both as fast, efficient standalone workhorse models and as key building blocks of ensemble workflows alongside leading large frontier models. You can also try them out on the Granite Playground.
The new Granite Hybrid architecture has full, optimized support in vLLM 0.10.2 and Hugging Face Transformers. The Granite Hybrid architecture is also supported in llama.cpp and MLX, though work to fully optimize throughput in these runtimes is still ongoing. We thank our ecosystem partners for their collaboration and hope that our work will help facilitate further experimentation with hybrid models.
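As a quick start, the snippet below shows a minimal Hugging Face Transformers workflow. The model identifier is an assumption based on the naming used in this post; check the ibm-granite organization on Hugging Face for the exact published ids, and note that a recent transformers release with Granite 4.0 hybrid support is assumed.

```python
# Minimal sketch of running a Granite 4.0 Instruct model with Hugging Face Transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ibm-granite/granite-4.0-h-tiny"  # hypothetical id; substitute the published one

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

messages = [{"role": "user", "content": "Summarize the key benefits of a hybrid Mamba/transformer architecture."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0, inputs.shape[-1]:], skip_special_tokens=True))
```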
Granite 4.0 Instruct models are available now in IBM watsonx.ai, IBM’s integrated AI development studio for making AI deployment simple and scalable. Granite 4.0 Instruct models are also available through platform partners including—alphabetically—Dell Technologies (on Dell Pro AI Studio and Dell Enterprise Hub), Docker Hub, Hugging Face, Kaggle, LM Studio, NVIDIA NIM, Ollama, OPAQUE and Replicate. Granite 4.0 Base models are available through Hugging Face.
Granite 4.0 models are also supported in Unsloth for fast, memory-efficient fine-tuning, and can be leveraged in Continue to power customized AI coding assistants.
Guides and recipes in Granite Docs can help you get started, including helpful tutorials such as: