Apple JUST Dropped a Game-Changer

Alex Ziskind

156,575 views yesterday

Video Summary

This video explores a breakthrough in building Mac clusters for running large AI models, where adding more machines now delivers faster performance. Previously, adding more Macs to a cluster led to diminishing returns and slower speeds due to limitations in inter-machine communication. However, the new Exo 1.0, combined with Apple's MLX framework and macOS 26.2, leverages RDMA over Thunderbolt 5 to achieve true parallelism, significantly boosting inference speeds. This makes it possible to run massive models efficiently, even on more affordable Mac Minis (though RDMA requires M4 Pro or higher chips with Thunderbolt 5). A remarkable demonstration shows a 480 billion parameter model achieving 40 tokens per second on a cluster.

The core innovation lies in the combination of three technologies: Exo, a simplified installer for cluster setup; MLX distributed, the distributed mode of Apple's MLX array framework optimized for Apple Silicon; and macOS 26.2, which enables RDMA over Thunderbolt 5. This synergistic advancement eliminates previous bottlenecks, making it feasible to build powerful, scalable AI computing clusters using Apple hardware, achieving impressive speeds even with enormous models like the 578 GB Kimi K2.

Short Highlights

  • Building Mac clusters for AI models previously suffered from slower speeds as more machines were added.
  • Exo 1.0 simplifies the setup of Mac clusters, allowing machines to share unified memory for larger models.
  • The key to faster cluster performance is the combination of Exo, MLX distributed, and macOS 26.2 enabling RDMA over Thunderbolt 5.
  • This new technology allows for true parallelism via tensor parallelism, making more machines result in faster processing.
  • Demonstrations show significant speed improvements, such as a 480 billion parameter model achieving 40 tokens per second on a cluster.

Key Details

The Painful Past and Promising Present of Mac Clusters [0:00]

  • Previous Mac clusters consistently faced a "painful ending" with performance degradation as more machines were added.
  • The breakthrough introduced allows more machines to actually increase processing speed.
  • This advancement is achievable even with less expensive Mac Minis costing $500 each, not just high-end Mac Studios.

"Every Mac cluster I've built in the past has had the same painful ending ... considerably worse. Until this one."

Exo 1.0: Simplifying Mac Clustering for AI [0:59]

  • Exo, a previously showcased tool for Mac Mini clusters, has reached version 1.0.
  • The project was thought to be abandoned but has resurfaced with significant improvements, promising linear scaling.
  • Exo 1.0 is described as a simple installer that runs on each machine intended for the cluster.

"Well, they've been tweeting, 'Linear scaling achieved.' ... Well, finally, they've reached version 1.0."

The Synergy of Technologies: Exo, MLX, and macOS [1:30]

  • Achieving this advancement required the alignment of three key technological levels.
  • Exo 1.0 has paradoxically become even simpler, functioning primarily as a straightforward installer.
  • This facilitates the use of large models like DeepSeek with 671 billion parameters.

"It's not just Exo. There are three levels of technologies that had to align, like the stars aligning in order for us to get here."

Unleashing Performance: RDMA and Tensor Parallelism [04:39]

  • The previous method of sharding models, pipeline parallelism, offered no speed improvement because only one device was active at a time.
  • Low-latency RDMA enables tensor parallelism, allowing true parallelism where computations happen simultaneously across devices.
  • This means splitting each layer into separate pieces of computation that run in parallel, significantly boosting performance.

"Now with low latency RDMA, that enables different kinds of parallelism, namely tensor parallelism, which enables you to actually get true parallelism."
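The distinction between the two sharding strategies can be illustrated with a toy layer. This is a minimal NumPy sketch, not the actual MLX implementation: tensor parallelism splits a layer's weight matrix column-wise across devices, each device computes its partial output at the same time, and a gather over the fast interconnect reassembles the result.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "layer": y = x @ W, with W split column-wise across 4 devices.
x = rng.standard_normal((1, 8))          # one token's activations
W = rng.standard_normal((8, 16))         # full layer weight matrix

n_devices = 4
shards = np.split(W, n_devices, axis=1)  # each device holds an 8x4 slice

# Each device computes its partial output simultaneously (true parallelism);
# an all-gather over the low-latency link (RDMA) reassembles the full row.
partials = [x @ shard for shard in shards]
y_parallel = np.concatenate(partials, axis=1)

# Same result as running the whole layer on one device.
y_full = x @ W
assert np.allclose(y_parallel, y_full)
```

In pipeline parallelism, by contrast, whole layers are assigned to devices in sequence, so each device sits idle while the others work; that is why it shards memory but does not add speed.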

The Power of Thunderbolt 5 and RDMA [07:35]

  • Apple's Thunderbolt 5, released less than a year prior, was already impressive for file transfers and driving displays.
  • Thunderbolt can be used for networking machines, as demonstrated in previous Mac Mini and Mac Studio cluster videos.
  • macOS 26.2 has enabled RDMA over Thunderbolt, increasing inter-machine communication speed by an order of magnitude, thus eliminating bottlenecks.

"With macOS Tahoe, say what you will about macOS Tahoe's UI design, but with 26.2, they really cooked. They've enabled RDMA over Thunderbolt, which means communication between machines can now be 10 times faster."

MLX: The Optimized Framework for Apple Silicon [09:06]

  • MLX is an array framework specifically designed for efficient machine learning research on Apple Silicon.
  • It is analogous to CUDA on Nvidia hardware: optimized for Apple's architecture, it outperforms cross-platform alternatives there.
  • While Llama CPP is cross-platform, MLX is faster on Apple Silicon.

"MLX is an array framework designed for efficient and flexible machine learning research on Apple Silicon. It's specifically designed and optimized for Apple Silicon."

Sharding and Quantization: Optimizing Model Performance [11:30]

  • Sharding involves splitting a model into parts that run on different machines.
  • Dense models shard well, unlike Mixture of Experts (MoE) models which don't always split as effectively across machines.
  • Quantization reduces model size and memory requirements by discarding some information, allowing them to run on smaller hardware.

"Quantization is ... so that it can run on smaller hardware. Typically, that's what you do when you quantize things."
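The trade-off described above can be shown in a few lines. This is an illustrative sketch of simple symmetric 8-bit quantization, not the scheme MLX actually uses: the fp32 weights shrink to a quarter of their size, at the cost of bounded rounding error.

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.standard_normal(1024).astype(np.float32)  # fp32 weights: 4 bytes each

# Symmetric 8-bit quantization: map [-max|w|, +max|w|] onto [-127, 127].
scale = np.abs(w).max() / 127.0
q = np.round(w / scale).astype(np.int8)           # 1 byte each: 4x smaller

# Dequantize to use the weights; some precision is discarded for good.
w_restored = q.astype(np.float32) * scale
max_err = np.abs(w - w_restored).max()            # bounded by scale / 2
```

The discarded precision is what lets a model like a 4-bit Kimi K2 fit in 578 GB instead of well over 2 TB at full precision.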

MLX Distributed vs. Llama CPP RPC [13:40]

  • MLX distributed, when combined with RDMA, offers significantly faster performance for clustering compared to Llama CPP RPC.
  • Llama CPP RPC, which does not utilize RDMA and tensor parallelism, shows a substantial drop in tokens per second for the same models.
  • The Exo installer automates the complex setup required for MLX distributed, which would otherwise involve manual scripting for SSH, networking, and model distribution.

"That's where it goes down. That's because Llama CPP RPC is not using RDMA and tensor parallelism."

Pushing the Limits: Massive Models on Mac Clusters [16:43]

  • The video demonstrates running increasingly larger models, including the 480 billion parameter Qwen coder model at 40 tokens per second and the 578 GB Kimi K2 model.
  • Even larger models like DeepSeek V3.1 (8-bit) are tested, consuming up to 200 GB of memory per machine, achieving 25 tokens per second.
  • This highlights the capability of Apple hardware to handle extremely demanding AI workloads through clustering.

"And we got 40 tokens per second here on four nodes. 40 tokens per second for a 480 billion parameter model is pretty nice."
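A rough back-of-envelope check shows why these models need a cluster at all. This sketch assumes memory is dominated by the weights (the KV cache and runtime overhead add more on top):

```python
# Weight memory for an N-billion-parameter model stored at a given bit width.
def weight_gb(params_billion: float, bits: int) -> float:
    return params_billion * 1e9 * bits / 8 / 1e9  # bytes -> GB

# DeepSeek V3.1 at 8 bits: ~671 GB of weights in total...
deepseek_total = weight_gb(671, 8)   # ~671 GB
per_machine = deepseek_total / 4     # ~168 GB per node on a 4-node cluster
```

That ~168 GB of weights per node, plus KV cache and overhead, is consistent with the roughly 200 GB per machine observed in the demonstration.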

Future Potential and Hardware Requirements [21:13]

  • Support for larger dense models at the MLX distributed and Exo levels is still developing but can be facilitated by opening issues.
  • The same benefits of Exo's easy setup can be achieved on M4 Mac Minis, though Thunderbolt 4 lacks RDMA support.
  • RDMA support requires M4 Pro chips or higher with Thunderbolt 5.

"To get RDMA support, you need M4 Pro or higher chips so that you can get Thunderbolt 5."
