By Marcus Mendes

During WWDC25, Apple announced new versions of its on-device and cloud-based foundation models. Now, they have published a tech report detailing how those models were trained, optimized, and evaluated. And the report includes some genuinely interesting under-the-hood tidbits.
In a comprehensive document called “Apple Intelligence Foundation Language Models – Tech Report 2025”, the company walks through multiple aspects of the new models, including their architecture, data sources, pre-training, post-training, tool use development, optimizations, and benchmarks.

It is a very technical but very worthwhile read if you like to get into the nuts and bolts of this sort of stuff. Here are a few particularly interesting highlights.
We already knew that Apple’s on-device model (the one developers will get to tap into) has around 3 billion parameters. Now, the company has detailed that this model is actually divided into two blocks:
“Block 1 contains 62.5% of the total transformer layers, while Block 2 contains the remaining 37.5% of the transformer layers, but had the key and value projections removed.”
In practice, this means that the local model requires 37.5% less memory for key-value caching, and the time it takes to output the first token (basically, a fragment of a word) is also cut by about 37.5%. Still, Apple says it structured the split in a way that preserves the model’s overall performance and output quality.
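For a rough sense of where that 37.5% figure comes from, here is a quick back-of-the-envelope sketch in Python. The layer count, head count, head dimension, and sequence length below are made-up placeholders rather than numbers from Apple’s report; the point is just that if only Block 1’s layers keep their key and value projections (and therefore a KV cache), the cache shrinks in proportion to Block 2’s share of the layers.

```python
# Back-of-the-envelope sketch of the KV-cache savings described above.
# All sizes here are illustrative assumptions, not figures from Apple's report.

def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Memory needed to cache keys and values across the given layers (fp16)."""
    return num_layers * 2 * num_kv_heads * head_dim * seq_len * bytes_per_elem

total_layers = 32                              # hypothetical layer count
block1_layers = round(total_layers * 0.625)    # layers that keep K/V projections
block2_layers = total_layers - block1_layers   # layers with K/V projections removed

baseline = kv_cache_bytes(total_layers, num_kv_heads=8, head_dim=128, seq_len=4096)
with_split = kv_cache_bytes(block1_layers, num_kv_heads=8, head_dim=128, seq_len=4096)

print(f"KV cache shrinks by {1 - with_split / baseline:.1%}")  # -> 37.5%
```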

As a side note, a few years ago, Apple published this study, which looked at swapping parts of an LLM between RAM and flash storage as needed, in order to run a local model bigger than what would otherwise fit in the device’s memory.
While Apple ultimately took a different route, it is interesting to note the different ways the company has been experimenting to offer good local performance, even on memory-constrained devices.
For its server model, Apple built a custom architecture that was tailor-made for its Private Cloud Compute platform. It’s called Parallel-Track Mixture-of-Experts (PT-MoE), and the way it works is pretty neat.
In a nutshell (and at the risk of oversimplifying things), a Mixture of Experts design splits one huge AI model into smaller subnetworks (or experts) that are only activated when the task relates to something they’re… well, an expert in.
So if your prompt is about cooking, only cooking-related experts are activated, while others remain dormant. The result is still a massive overall model, but its modular design allows it to respond faster (and often more accurately) than if every prompt had to run through the entire, unified model.
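If you prefer code to analogies, here is a deliberately tiny sketch of the core routing idea in Python/NumPy. It is not Apple’s implementation (real MoE layers add load balancing, batching, and parallel expert execution), but it shows the mechanic: score the experts, keep the top few for each token, and leave the rest idle.

```python
# Minimal, illustrative Mixture-of-Experts routing sketch (not Apple's code).
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 8, 2

# Each "expert" is just a small feed-forward weight matrix in this toy version.
experts = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts)) * 0.02

def moe_layer(x):
    """Route one token vector to its top-k experts and mix their outputs."""
    scores = x @ router                          # one routing score per expert
    top = np.argsort(scores)[-top_k:]            # indices of the best-scoring experts
    weights = np.exp(scores[top]) / np.exp(scores[top]).sum()  # softmax over chosen experts
    # Only the selected experts run; the others stay dormant for this token.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

token = rng.standard_normal(d_model)
print(moe_layer(token).shape)  # (64,)
```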
Here is an IBM Mixture of Experts explainer, in case you have 8 minutes to spare:
Apple built a new kind of Transformer called the Parallel Track Transformer, then scaled it up with Mixture of Experts (MoE) layers. That sounds way too complicated, but the gist of it is:
Traditional Transformers process tokens through a single stack of layers, one after the other. Rather than using that single-track approach to process every token, Apple’s design splits the model into multiple parallel tracks. Each track processes tokens independently, and only syncs up at certain points.
Then, inside each of those tracks, Apple replaced every other regular transformer layer with an MoE layer, which activates just a few experts for each token, while the rest stay idle. And because each track has its own local experts, the model avoids the processing bottlenecks that happen when everything has to coordinate across the entire system.
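To make the structure a little more concrete, here is a stripped-down sketch of that wiring in Python. The track count, depth, and the every-other-layer MoE pattern inside each track follow the description above, but the layer internals are stubbed out and every number is a placeholder, not a detail from Apple’s report.

```python
# Structural sketch of the parallel-track idea (wiring only, not Apple's model).
import numpy as np

N_TRACKS, LAYERS_PER_TRACK, D = 4, 8, 64       # all hypothetical

def regular_layer(x, track_id):
    # Stand-in for an ordinary transformer layer with track-local weights.
    return x + 0.1 * np.tanh(x + track_id)

def moe_layer(x, track_id):
    # Stand-in for an MoE layer whose experts are local to this track.
    return x + 0.1 * np.tanh(-x + track_id)

def run_track(x, track_id):
    # Inside each track, every other layer is an MoE layer.
    for i in range(LAYERS_PER_TRACK):
        x = moe_layer(x, track_id) if i % 2 else regular_layer(x, track_id)
    return x

def parallel_track_block(x):
    # Each track processes the token stream independently...
    outputs = [run_track(x, t) for t in range(N_TRACKS)]
    # ...and the tracks only sync up here, at the block boundary.
    return np.mean(outputs, axis=0)

tokens = np.zeros((16, D))                     # 16 dummy token vectors
print(parallel_track_block(tokens).shape)      # (16, 64)
```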

Add to that a clever setup that balances local context with big-picture understanding (called Interleaving Global and Local Attention Layers), and the result is a very modular, efficient, and scalable model that’s faster and leaner, but still pretty smart.
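Here is a small sketch of what that interleaving can look like in terms of attention masks. The window size and the three-local-to-one-global pattern are illustrative assumptions, not Apple’s actual configuration; the idea is simply that most layers only look at nearby tokens (cheap), while the occasional global layer looks at everything (big picture).

```python
# Illustrative sketch of interleaving local (sliding-window) and global attention.
# Sequence length, window size, and the layer pattern are made up.
import numpy as np

seq_len, window = 8, 2

def local_mask(n, w):
    """Causal mask where each token only sees the previous w tokens."""
    i, j = np.indices((n, n))
    return (j <= i) & (i - j <= w)

def global_mask(n):
    """Plain causal mask: each token sees the full preceding context."""
    i, j = np.indices((n, n))
    return j <= i

# A repeating pattern like [local, local, local, global] keeps most layers cheap
# while a few layers still integrate context from the whole sequence.
pattern = [local_mask(seq_len, window)] * 3 + [global_mask(seq_len)]
for k, mask in enumerate(pattern):
    print(f"layer {k}: tokens attend to {mask.sum(axis=1)} positions each")
```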
One of the biggest knocks against the initial rollout of Apple Intelligence was (and still is) limited language support beyond English. With its new models, Apple has expanded language support, and the document details the steps it took in order to do that.
According to the document, Apple increased the amount of multilingual data used during training from 8% to 30%. This includes both organic and synthetic content.
Apple also expanded its tokenizer (basically, the model’s token vocabulary) by 50%, meaning the model now knows 150K different tokens, up from the previous 100K.
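Why does a bigger vocabulary matter for other languages? The toy example below (not Apple’s tokenizer, and with hypothetical vocabulary entries) shows the basic effect: when a word has its own dedicated token, it no longer gets chopped into several smaller pieces, so non-English text turns into fewer tokens and the model has less work to do per sentence.

```python
# Toy illustration (not Apple's tokenizer): a bigger, more multilingual
# vocabulary splits the same non-English word into fewer tokens.

def greedy_tokenize(word, vocab):
    """Greedy longest-match segmentation against a vocabulary set."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:   # fall back to single characters
                tokens.append(word[i:j])
                i = j
                break
    return tokens

small_vocab = {"un", "glaub", "lich"}             # hypothetical entries
large_vocab = small_vocab | {"unglaublich"}       # adds a dedicated whole-word token

word = "unglaublich"  # German for "unbelievable"
print(greedy_tokenize(word, small_vocab))  # ['un', 'glaub', 'lich'] -> 3 tokens
print(greedy_tokenize(word, large_vocab))  # ['unglaublich']         -> 1 token
```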
The company says that these changes led to “significant gains” in performance across non-English benchmarks, especially after reinforcement learning fine-tuning.
In the document, Apple explains that evaluations were conducted using prompts written by native speakers (rather than translations), and that the model was tested on both accuracy and how natural its responses sounded in local contexts. If this sounds familiar, you probably read our recent coverage of this Apple Research study.
In practice, all of this means that features like Writing Tools should work more reliably in the supported languages.

As with its first models, most of the training data came from crawling the web. But Apple says its Applebot crawler respects robots.txt exclusions, meaning that if a website doesn’t want Apple to scrape its content, it can say so, and Applebot will leave it alone.
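For the curious, you can check how a given site handles Applebot with Python’s standard library (the URL below is just a placeholder). A site that wants out simply adds a Disallow rule under a "User-agent: Applebot" entry in its robots.txt.

```python
# Quick check of whether a site's robots.txt allows Applebot, using only
# Python's standard library. The URL is an example, not from the report.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# "Applebot" is the user-agent string Apple's crawler identifies itself with.
print(rp.can_fetch("Applebot", "https://example.com/some-article"))
```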
That said, here is how Apple says it sourced the data for its new models:
There has been no shortage of news on Apple’s internal drama, technical struggles, and overall inability to gain the momentum it needs to bridge the gap (which some might call a chasm) between its AI offerings and the competition. All of that is true.
Yet, the fact that Apple is largely perceived as being behind on AI doesn’t mean the company is standing still. This report offers an interesting insight into the under-the-hood improvements (and shortcomings) of Apple’s newest models, along with extensive details on a privacy-conscious approach that few companies are even attempting.