Apple did what NVIDIA wouldn't.
jakkuh
Video Summary
The video explores the exciting advancements in running large language models (LLMs) locally on Apple hardware, specifically showcasing a cluster of four Mac Studios. It highlights how a new macOS beta feature, RDMA over Thunderbolt, combined with the Exo software, enables significant performance gains. This setup allows users to run massive AI models, previously confined to data centers, on their own machines, boasting 1.5 terabytes of unified memory and costing considerably less than comparable enterprise solutions. A key demonstration shows a 70-billion-parameter Llama 3.3 model running 3.25 times faster with this new configuration, reaching 15.5 tokens per second.
A particularly striking revelation is that four Mac Studios running a dense model like Llama 3.3 consume around 600 watts, a power draw comparable to, or even less than, a single high-end Nvidia H200 GPU, which costs significantly more and offers far less memory. This positions the Mac Studio cluster as a powerful, relatively cost-effective, and energy-efficient alternative for running AI models locally.
Short Highlights
- Running large language models (LLMs) locally on consumer hardware is now feasible, even on devices like the Mac Studio.
- A cluster of four Mac Studios, with a combined 1.5 terabytes of unified memory, can run powerful AI models without sending data to cloud providers.
- The key enablers for this performance leap are Apple's RDMA over Thunderbolt feature in the macOS 26.2 beta and the Exo 1.0 software.
- A 70-billion-parameter Llama 3.3 model, when run on the four-Mac cluster with RDMA, achieved a 3.25x speed improvement, reaching 15.5 tokens per second.
- This Mac Studio cluster setup is significantly more cost-effective and energy-efficient than comparable enterprise solutions: running dense models, it draws around 600 watts, in the same range as a single Nvidia H200 GPU that costs roughly $30,000 and offers far less memory.
Key Details
Running AI Locally on Consumer Hardware [0:07]
- It's now possible to run large language models (LLMs) locally on consumer devices, including the computer or phone being used to watch the video.
- For those who are hardware enthusiasts or interested in AI, running models locally offers an alternative to subscription-based cloud AI services like OpenAI or Gemini.
- A cluster of four Mac Studios, boasting a combined 1.5 terabytes of unified memory, can run "the beefiest and absolute baddest large language models" without sharing data with companies.
"Whether you're for or against AI, it's hard to deny that at least on the hardware and engineering side of things that it's super interesting."
The Mac Studio AI Cluster [0:15]
- The setup involves a cluster of Mac Studios described as multiple times faster than before, drawing less than half the capacity of a North American power circuit, and costing multiple times less than comparable solutions.
- This configuration is positioned as being in a "class of its own" due to its price and capabilities.
"This Mac Studio cluster should be multiple times faster than it was even just yesterday. All while still consuming less than around a half of a North American power circuit and while realistically costing multiple times less than a solution that can run models in the same class as this."
Key Technologies Enabling the Cluster [0:41]
- Two primary advancements are highlighted: Apple's RDMA over Thunderbolt, introduced in the macOS 26.2 beta, and the public release of Exo 1.0, the software that allows four Mac Studios to form an AI cluster.
- RDMA over Thunderbolt is crucial for enabling the observed performance increases.
"First is Apple's sneaky release of RDMA over Thunderbolt in the recent Mac OS 26.2 beta. You probably wouldn't think much of it, but that is going to enable most of the performance increases we're seeing today."
The Challenge of Local LLMs vs. Cloud Services [2:06]
- Smaller local LLMs may appear less intelligent than cloud-based services like ChatGPT or Gemini due to the vast difference in model size.
- Cloud models (ChatGPT, Gemini) are in the range of hundreds or thousands of gigabytes and require massive, expensive data centers to run.
- Smaller models are suitable for resource-constrained devices like security cameras or smartphones, where large models cannot be deployed.
"Don't get me wrong, those small models have their uses. You can't exactly fit a Mac Studio amount of performance or memory for that matter in a security camera or a smartphone."
Apple's Unified Memory Architecture [3:00]
- Apple's M-series silicon features unified memory shared between CPU and graphics cores.
- This architecture, particularly on M3 Ultra chips, enables configurations with up to 512 GB of RAM on a single Mac Studio, with the units in the cluster having either 512 GB or 256 GB.
- This substantial memory is sufficient to run "insane AI models" capable of advanced tasks like one-shotting code challenges (a rough memory estimate is sketched below).
"See, Apple designed their M series of silicon with unified memory that's shared between both the CPU and graphics cores, which enabled them on the launch of the M3 Ultra chips that are in these Mac Studios to release a skew with a whopping 512 GB of RAM..."
Performance Benchmarking with Llama 3.3 [4:54]
- A 70-billion-parameter Llama 3.3 model in full FP16 precision was tested on a single Mac Studio.
- The initial response time was around 2 seconds, with an inference speed of approximately 5 tokens per second.
- This is notable given the model's large size of roughly 150 GB (a back-of-envelope bandwidth estimate follows below).
"Let's see how fast that goes across one of our Mac studios. Can you write me a thousandword story about how cool BMWs are? Include discussion of oil leaks. Oh god."
Networking Bottlenecks in Multi-Mac Clusters [6:49]
- When a model is spread across multiple machines, each machine processes its "chapters" of the model and passes partial results to the others, creating a bottleneck.
- This process is compared to a relay race where the handoff time significantly impacts overall speed, with each output token requiring a "lap" through the network.
- Using standard 10 Gb Ethernet for these transfers is likened to forcing fast runners through airport security for each baton pass, leading to significant latency (the toy model below puts rough numbers on the effect).
"Now, you can imagine the data of our model as a book. And when we load that book across the cluster, each machine gets its own set of chapters, but it doesn't necessarily have access to the entire book."
RDMA Over Thunderbolt: The Solution [8:38]
- Directly connecting Macs with Thunderbolt cables offers lower latency than Ethernet.
- RDMA (Remote Direct Memory Access) over Thunderbolt bypasses this networking bottleneck by enabling direct memory access between devices, dramatically reducing latency.
- Enabling RDMA requires the macOS 26.2 beta and machines with Thunderbolt 5 (M4 Pro/Max or M3 Ultra).
"Now, you could already do better than this simply by using Thunderbolt cables directly connecting all the Macs together rather than using Ethernet with the switch. It's just a little bit lower latency on Apple Stack. Or we can bypass this hypothetical security problem entirely by using RDMA."
RDMA Performance Gains with Llama 3.3 [9:50]
- With RDMA enabled and using MLX with tensor sharding, the 70-billion-parameter Llama 3.3 model achieved 9 tokens per second, nearly double the speed of a single machine (a minimal illustration of tensor sharding follows below).
- Further testing on four machines with RDMA resulted in a start time of just over 1 second and 15.5 tokens per second, a 3.25x improvement over the single-machine baseline.
"Holy Nine tokens a second. That's almost double the speed we had running on just one. I'm a little bit in shock. That was a a bit of a dumpster fire getting that all going."
Handling Extremely Large Models: Kimi K2 [10:26]
- The Kimi K2 instruct model, even at a 4-bit quantization, requires approximately 540 GB of memory.
- Running this model on a single machine is impossible. Pipelining it across two machines yielded about 25 tokens per second (a minimal sketch of pipeline sharding follows below).
- Distributing it across four machines increased the throughput to nearly 35 tokens per second, demonstrating scaling, though less efficiently than dense models. The time to first token also improved significantly.
"My quantitative 540 GB. For a baseline, I'm going to start with the standard pipeline MLX ring sharding so that we get a good idea of what it would be like before we use RDMA."
Software Development and Stability Challenges [11:29]
- Getting the RDMA setup to work involved about 12 hours of troubleshooting, indicating it's at the "frontier of this technology."
- There have been numerous software updates from Exo to improve stability.
- Current limitations include the inability to upload custom models and a requirement for models to be in the MLX format. Some quirks exist, such as strict naming conventions for Macs.
"It seems like we're finally there. It wasn't just my changes. They actually have made something like 40 different version updates in the last 2 days just to get this working a little bit more stable."
Performance Comparison: Dense vs. Mixture-of-Experts Models [13:53]
- Dense models like Llama 3.3 show much better scaling across multiple machines with RDMA compared to Mixture-of-Experts (MoE) models like Kimi K2.
- MoE models involve smaller, more frequent calculations, leading to more overhead and less efficient parallelization in the current software setup (the routing sketch below shows why only a small fraction of an MoE's parameters is active per token).
"The scaling is nowhere near as good as a dense model, but it is still a large performance increase. And it does also show that the time to first token going from around 6 and 1/2 seconds down to 1 and a half."
Cost and Power Efficiency vs. Enterprise Solutions [14:35]
- The Mac Studio cluster offers a compelling alternative to enterprise options; for example, eight Nvidia DGX Spark units would together cost around $32,000, draw far more power, and offer only as much memory as a single one of these Mac Studios.
- Running a dense model like Llama 3.3 on the four Mac Studios draws about 600 watts, which is comparable to or less than a single Nvidia H200 GPU.
- An Nvidia H200 GPU costs approximately $30,000. The Mac cluster achieved 26 tokens per second on an FP8 model, while a single H200 achieved about 50 tokens per second on an FP8 model half the size (a rough efficiency comparison is sketched below).
"I guess you could buy like eight Nvidia DGX Spark units, but then it would draw way more power. You would only have the same amount of VRAM as one of these Mac Studios, and it would cost $32,000."
Power Consumption and Stability Across Model Types [17:23]
- Running a Mixture-of-Experts model like DeepSeek 3.1 (671 billion parameters, but only 37 billion active) drew roughly 115-125 watts per Mac, for a total of around 480 watts across the four Macs when the hardware wasn't fully utilized.
- Running the dense Llama 3.3 model on the four-Mac cluster resulted in a power draw of around 600 watts, indicating better hardware utilization.
- The system experienced some instability, attributed to beta features in Apple's MLX for syncing CPU and GPU, leading to Exo being closed and restarted.
"Put that all together and we're at around 480 watts for a mixture of experts model where it's really not making perfect use of all that hardware."
Reconciling AI Use Cases [18:54]
- While acknowledging the potential for misuse of AI, the speaker emphasizes that there are also many beneficial applications.
- The idea of using this setup for an advanced home voice assistant, replacing devices like Alexa, is considered.
- However, current Home Assistant integrations with OpenAI-compatible APIs have not worked well for the speaker (a minimal client sketch for such an endpoint appears below).
"I'm not going to lie and say there isn't a part of me that goes, I wish the reason we were pushing for optimizations like this wasn't for something that so many people use for bad stuff."