Unlocking the Power of LLMs with MLX and Apple's Neural Accelerators
The Future of AI Development on Mac
Apple silicon has become a compelling platform for AI development. With MLX, developers and researchers can explore and run Large Language Models (LLMs) directly on their Macs, opening up new possibilities for experimentation and innovation.
Apple's Neural Accelerators, introduced with the M5 GPU, provide dedicated matrix-multiplication hardware, which is critical for many machine learning workloads. Combined with MLX, they deliver a substantial boost in performance.
Let's dive into the world of MLX and uncover its potential.
What is MLX?
MLX is an open-source array framework tailored for Apple silicon. It is efficient, flexible, and highly optimized, making it a good fit for a wide range of applications, from scientific computing to machine learning.
One of its standout features is the built-in support for neural network training and inference, including text and image generation. With MLX, developers can effortlessly generate text or fine-tune large language models on their Apple silicon devices.
MLX leverages Apple silicon's unified memory architecture: arrays live in memory shared by the CPU and GPU, so operations can run on either device without copying data back and forth. Its API closely follows NumPy's, making it immediately familiar to most developers.
Getting started with MLX is a breeze. Simply install it via:
pip install mlx
For those eager to explore further, the MLX documentation and examples are a treasure trove of information and inspiration.
MLX Swift and Beyond
MLX Swift builds on the core MLX library, offering a Swift-based front-end for developing machine learning applications. It comes with its own set of examples to get you started on your Swift ML journey.
For those who prefer a lower-level approach, MLX provides easy-to-use C and C++ APIs, ensuring compatibility with any Apple silicon platform.
Running LLMs on Apple Silicon with MLX LM
MLX LM is a specialized package built on MLX, designed specifically for text generation and language model fine-tuning. It supports a wide range of LLMs available on Hugging Face, making it a versatile tool for language model enthusiasts.
Installation is straightforward:
pip install mlx-lm
And with a simple terminal command, you can initiate a chat with your favorite language model:
mlx_lm.chat
MLX LM also supports quantization, a compression technique that reduces the memory footprint of language models by using lower precision for parameter storage. With mlx_lm.convert, you can quantize a model downloaded from Hugging Face in a matter of seconds.
For example, the following command quantizes the 7B Mistral model to 4-bit precision and uploads the result to Hugging Face:
mlx_lm.convert \
--hf-path mistralai/Mistral-7B-Instruct-v0.3 \
-q \
--upload-repo mlx-community/Mistral-7B-Instruct-v0.3-4bit
Inference Performance on M5 with MLX
The M5 chip's GPU Neural Accelerators are a powerhouse for machine learning tasks. They provide dedicated matrix-multiplication operations, which are critical for many ML workloads. MLX leverages these accelerators, along with the Tensor Operations and Metal Performance Primitives framework introduced with Metal 4, to deliver exceptional performance.
To showcase the capabilities of M5 with MLX, we benchmarked a diverse set of LLMs with varying sizes and architectures. These LLMs were run on a MacBook Pro with M5 and 24GB of unified memory, and the results were compared against a similarly configured MacBook Pro M4.
We evaluated Qwen models with 1.7B and 8B parameters, both in native BF16 precision, and also in 4-bit quantized versions for Qwen 8B and 14B. Additionally, we benchmarked two Mixture of Experts (MoE) models: Qwen 30B (3B active parameters, 4-bit quantized) and GPT OSS 20B (in native MXFP4 precision).
Evaluation was performed using mlx_lm.generate, and results are reported as time to first token (TTFT) and generation speed.
In LLM inference, generating the first token is compute-bound and fully exercises the Neural Accelerators. The M5 chip excels in this regard, pushing time to first token below 10 seconds for a dense 14B architecture and below 3 seconds for a 30B MoE, impressive performance for a MacBook Pro.
Generating subsequent tokens is more memory-bandwidth-limited than compute-limited. On the tested architectures, the M5 provides a 19-27% performance boost compared to the M4, thanks to its higher memory bandwidth (120GB/s for M4 vs. 153GB/s for M5, a 28% increase).
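The bandwidth limit can be sanity-checked with a back-of-the-envelope roofline: each decoded token must stream every weight from memory, so decode speed is bounded by bandwidth divided by model size (a rough upper bound that ignores the KV cache and other traffic):

```python
# Rough decode-speed ceiling: memory bandwidth / bytes of weights read per token.
params = 8e9            # dense 8B model
bytes_per_param = 2     # BF16
weights_bytes = params * bytes_per_param  # 16 GB read per token

for chip, bandwidth in [("M4", 120e9), ("M5", 153e9)]:
    print(f"{chip}: at most {bandwidth / weights_bytes:.1f} tokens/s")

# The ceiling scales directly with bandwidth: 153/120 = 1.275, the ~28% gap above.
```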
The MacBook Pro with 24GB of memory can easily accommodate an 8B model in BF16 precision or a 30B MoE in 4-bit quantized format, keeping the inference workload under 18GB for both architectures.
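The weight footprints behind that claim are easy to estimate (actual inference memory adds the KV cache and activations on top):

```python
def weight_gb(params, bits_per_param):
    """Size of the raw model weights in GB."""
    return params * bits_per_param / 8 / 1e9

dense_8b_bf16 = weight_gb(8e9, 16)   # 16.0 GB
moe_30b_4bit = weight_gb(30e9, 4)    # 15.0 GB
print(dense_8b_bf16, moe_30b_4bit)   # both fit comfortably in 24 GB
```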
| Model | TTFT Speedup (M5 vs. M4) | Generation Speedup (M5 vs. M4) | Memory (GB) |
| --- | --- | --- | --- |
| Qwen3-1.7B-MLX-bf16 | 3.57 | 1.27 | 4.40 |
| Qwen3-8B-MLX-bf16 | 3.62 | 1.24 | 17.46 |
| Qwen3-8B-MLX-4bit | 3.97 | 1.24 | 5.61 |
| Qwen3-14B-MLX-4bit | 4.06 | 1.19 | 9.16 |
| gpt-oss-20b-MXFP4-Q4 | 3.33 | 1.24 | 12.08 |
| Qwen3-30B-A3B-MLX-4bit | 3.52 | 1.25 | 17.31 |
The GPU Neural Accelerators, when combined with MLX, truly shine in ML workloads involving large matrix multiplications. They deliver up to 4x speedup compared to an M4 baseline for time-to-first-token in language model inference. Similarly, generating a 1024x1024 image with FLUX-dev-4bit (12B parameters) is more than 3.8x faster on an M5 compared to an M4.
As MLX continues to evolve, we eagerly anticipate the new architectures and models that the ML community will explore and run on Apple silicon.
Get Started with MLX:
- Learn more about MLX and its capabilities: MLX Framework
- Explore MLX LM on GitHub: MLX LM
- Join the MLX Hugging Face community: MLX Hugging Face