2026-05-30
Your Local LLMs Just Got 2x Faster on Mac (May 2026)
What actually changed for local AI in May 2026?
The honest answer: no new open-weight model worth adding to your Mac arrived between May 10 and May 30, 2026. The April wave (Qwen3.6, Gemma 4, Llama 4, DeepSeek V4) still holds the top of the SWE-Bench Verified open-weight leaderboard. DeepSeek V4-Pro-Max leads at 80.6%, unchanged since April.
The action moved one layer down, to the runtime. Three tools — Ollama, LM Studio, and Apple's MLX — shipped speed and quality upgrades that make the models you already have run faster. That matters more than another benchmark-chasing release.
Two announcements grabbed headlines but stay out of reach for local users. Alibaba previewed Qwen 3.7 at its Cloud Summit on May 20, but it shipped API-only, with no open weights yet (Yotta Labs). And inclusionAI's Ring-2.6-1T is open-weight but roughly one trillion parameters (Hugging Face). Even at 4-bit the weights alone need ~500GB, which leaves almost nothing of the 512GB ceiling of an M3 Ultra Mac Studio for the KV cache and macOS itself. Neither runs on a normal Mac.
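The ~500GB figure follows from simple arithmetic: parameters times bits per weight, divided by eight. A minimal sketch, assuming exactly 1e12 parameters and a flat 4 bits per weight (real quantization formats add per-block scales, so actual files run somewhat larger); the function name is illustrative:

```python
def weight_memory_gb(params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    return params * bits_per_weight / 8 / 1e9

# Assumption: ~1T parameters, stored at a flat 4 bits each.
ring_gb = weight_memory_gb(1e12, 4)
# What would remain on a 512GB machine for KV cache, activations, and the OS.
headroom_gb = 512 - ring_gb
print(f"weights: {ring_gb:.0f} GB, headroom: {headroom_gb:.0f} GB")
# prints "weights: 500 GB, headroom: 12 GB"
```

The same function gives a quick first estimate for any model: a 31B model at 4-bit comes out near 15.5 GB before runtime overhead.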
What is MTP speculative decoding, and why is it 2x faster?
Multi-token prediction (MTP) speculative decoding lets a model emit several tokens per verification pass instead of one token per forward pass. Models like Gemma 4 ship with built-in MTP heads. The heads draft a short run of tokens, then the model verifies the whole run in a single pass and keeps the drafted tokens up to the first one that disagrees with its own prediction.
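The draft-then-verify loop can be sketched with toy stand-ins. Everything here is hypothetical: `target` and `draft` are small integer functions, not real models, and a separate `draft` function stands in for built-in MTP heads. A real runtime verifies all drafted positions in one batched forward pass; this sketch calls `target` per position only to keep the logic readable, and it covers greedy decoding only.

```python
from typing import List, Tuple

def target(seq: List[int]) -> int:
    """Toy 'full model': next token is a fixed function of the last one."""
    return (seq[-1] * 3 + 1) % 7

def draft(seq: List[int]) -> int:
    """Toy drafter: agrees with the target except when len(seq) % 3 == 0."""
    t = target(seq)
    return (t + 1) % 7 if len(seq) % 3 == 0 else t

def greedy_decode(prompt: List[int], n: int) -> List[int]:
    """Baseline: one token per target call."""
    seq = list(prompt)
    for _ in range(n):
        seq.append(target(seq))
    return seq[len(prompt):]

def speculative_decode(prompt: List[int], n: int, k: int = 4) -> Tuple[List[int], int]:
    """Draft k tokens, verify them, keep the matching prefix plus one target token."""
    seq, passes = list(prompt), 0
    while len(seq) - len(prompt) < n:
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(seq + proposal))
        passes += 1  # one verification pass per drafted run
        accepted: List[int] = []
        for tok in proposal:
            expected = target(seq + accepted)
            accepted.append(expected)  # always keep the target's own token
            if tok != expected:
                break  # first miss ends the run
        else:
            accepted.append(target(seq + accepted))  # bonus token if all matched
        seq.extend(accepted)
    return seq[len(prompt):len(prompt) + n], passes

tokens, passes = speculative_decode([1], n=12)
assert tokens == greedy_decode([1], 12)  # same output as plain decoding
print(f"{len(tokens)} tokens in {passes} verification passes")
```

Because every kept token is the one the target itself would have produced, the result is identical to plain greedy decoding; only the number of passes drops.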
The payoff is throughput. Ollama v0.23.1 (May 5, 2026) describes it directly: "Gemma 4 MTP speculative decoding is now supported on Macs. This can give over a 2x speed increase for the Gemma 4 31B model on coding tasks." Source: Ollama GitHub releases.
Coding workloads gain the most. Code is repetitive and structured, so drafted tokens hit more often. That raises the acceptance rate and pushes decode speed up. Chat and prose see smaller but real gains.
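To see why the acceptance rate matters so much, a common simplification treats each drafted token as accepted independently with probability `alpha`. Under that assumption, a run of `k` drafted tokens yields `(1 - alpha**(k+1)) / (1 - alpha)` tokens per verification pass on average. The `alpha` values below are illustrative, not measurements from Ollama or LM Studio, and the result ignores drafting cost, so it is an upper bound on speedup rather than a prediction.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Mean tokens per verification pass, assuming i.i.d. acceptance rate alpha."""
    if alpha >= 1.0:
        return float(k + 1)  # every draft accepted, plus the bonus token
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# Illustrative acceptance rates only (assumed, not benchmarked).
for alpha in (0.5, 0.7, 0.9):
    print(f"alpha={alpha}: {expected_tokens_per_pass(alpha, k=4):.2f} tokens/pass")
```

With four drafted tokens, moving from a 0.5 to a 0.9 acceptance rate roughly doubles the tokens gained per pass, which is the mechanism behind structured code outperforming free-form prose.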
Ollama in May 2026: MTP, MLX rework, and a llama.cpp pivot
Ollama had the busiest month of the three. Every claim below is from the official GitHub release notes.
| Version | Date | Key change for Mac users |
|---|---|---|
| v0.23.1 | May 5 | Gemma 4 MTP speculative decoding — "over a 2x speed increase" on Gemma 4 31B coding |
| v0.23.2 | May 7 | /api/show caching — ~6.7x faster median response time |
| v0.24.0 | May 14 | "Reworked the MLX sampler for improved generation quality on Apple Silicon"; Codex App support |
| v0.30.0-rc31 | May 13 | Pre-release: architecture now calls llama.cpp directly, adds GGUF file-format compatibility |
The MLX sampler rework in v0.24.0 is the quiet win. It improves generation quality, not just speed, on Apple Silicon. If your outputs felt slightly off after the April MLX backend landed, this update is for you.
The v0.30.0 release candidate signals a strategic shift. Ollama is re-architecting "to directly support llama.cpp instead of building on top of GGML," which adds GGUF compatibility while keeping MLX for Apple Silicon acceleration. It is still a pre-release as of late May — wait for the stable tag before depending on it.
LM Studio in May 2026: stable MTP and faster vision models
LM Studio matched the MTP wave. All points are from the official changelog.
- 0.4.13 (May 13): mlx-engine v1.8.1 "significantly improves performance and adds parallel predictions for vision-capable models such as Qwen 3.5/3.6 and Gemma 4."
- 0.4.14 (May 22): stable MTP speculative decoding that "speeds up generation with models that include built-in multi-token prediction heads."
- 0.4.15 (May 29): tensor parallelism for multi-GPU loads, a Physical Batch Size load option, and a Claude Code API fix.
The pattern across the runtimes is clear. MTP decoding landed in Ollama, LM Studio, and upstream llama.cpp inside one month. The April model families, especially Gemma 4 and Qwen3.6, are the main beneficiaries on Apple Silicon.
How do I get the speedup on my Mac?
Update your runtime. The faster decoding ships in the runtime, not the model weights, so a re-download is unnecessary.
1. Ollama: upgrade to v0.24.0 or later. Run ollama --version to check, then pull the latest build from ollama.com. MTP for Gemma 4 works automatically once you are on v0.23.1+.
2. LM Studio: update to 0.4.14 or later for stable MTP. Enable speculative decoding in the model load settings.
3. Pick an MTP-capable model: Gemma 4 31B gets the headline 2x on coding. See our April 2026 model wave breakdown for the full lineup and RAM costs.
4. Match the model to your RAM with the ModelFit wizard or browse the full model list.
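The version check in step 1 can be scripted. This is a minimal sketch that assumes the output of ollama --version contains a plain MAJOR.MINOR.PATCH string, optionally followed by an -rcN suffix (for example "ollama version is 0.24.0"); the exact wording of that line is an assumption, and the helper names are illustrative.

```python
import re
from typing import Optional, Tuple

def parse_version(text: str) -> Optional[Tuple[int, int, int, bool]]:
    """Return (major, minor, patch, is_release_candidate), or None if no version found."""
    m = re.search(r"(\d+)\.(\d+)\.(\d+)(-rc\d+)?", text)
    if m is None:
        return None
    major, minor, patch = (int(m.group(i)) for i in (1, 2, 3))
    return major, minor, patch, m.group(4) is not None

def meets_minimum(text: str, minimum: Tuple[int, int, int] = (0, 23, 1)) -> bool:
    """True if the reported version is at least `minimum` (default: the MTP release)."""
    v = parse_version(text)
    return v is not None and v[:3] >= minimum

# Sample strings in the assumed output format.
for line in ("ollama version is 0.23.0", "ollama version is 0.24.0", "0.30.0-rc31"):
    v = parse_version(line)
    print(line, "| MTP-ready:", meets_minimum(line), "| pre-release:", bool(v and v[3]))
```

To check a real install, pass in the text captured from the command, for example with subprocess.run(["ollama", "--version"], capture_output=True, text=True).stdout. The pre-release flag is worth checking too, since a release candidate satisfies the minimum version without being a stable build.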
If you already run Gemma 4 31B on a 24GB or larger Mac, the upgrade alone can roughly double your coding throughput, going by Ollama's "over a 2x" figure. That is the cheapest performance win of the year.
Should you wait for Qwen 3.7 or run something now?
Run something now. Qwen 3.7 is API-only and Ring-2.6-1T is too large for any Mac, so the best local pick has not changed since April. Qwen3.6-27B stays the default coding model at 24GB, and Gemma 4 31B is now the speed champion with MTP. There is no reason to wait.
FAQ
Was there a new local AI model in May 2026?
No new open-weight model worth running locally shipped between May 10 and May 30, 2026. Qwen 3.7 launched API-only on May 20, and Ring-2.6-1T at ~1T parameters is too large for consumer Macs. The April 2026 families remain the local picks.
Does MTP speculative decoding change output quality?
No. MTP drafts multiple tokens, then verifies them in a single pass and keeps only the correct ones. The output matches standard decoding — you get the same text, faster. Ollama's separate v0.24.0 MLX sampler rework improves quality independently.
Which Ollama version do I need for the 2x Gemma 4 speedup?
Ollama v0.23.1 or later, on a Mac. The release notes cite "over a 2x speed increase for the Gemma 4 31B model on coding tasks." Upgrade to v0.24.0+ to also get the reworked MLX sampler.
Is Ollama 0.30 safe to use yet?
Not for production. As of late May 2026, v0.30.0 exists only as a release candidate (v0.30.0-rc31). It re-architects Ollama to call llama.cpp directly and adds GGUF support. Wait for the stable release before relying on it.
Did the local LLM leaderboard move in May 2026?
No. DeepSeek V4-Pro-Max still leads SWE-Bench Verified among open-weight models at 80.6%, a score set in April. No May release overtook the April wave, so the speed gains — not new models — are the story this month.