2026-02-26
Ollama 0.17 Update: 15% Faster on Apple Silicon (2026)
Ollama 0.17 dropped on February 21, 2026, with a major overhaul of its inference engine. Performance gains hit 40% on NVIDIA GPUs and 10-15% on Apple Silicon. Here's what actually changes for Mac users.
What's New in Ollama 0.17
The most significant Ollama update in months. The project replaces its legacy llama.cpp server mode with a new integrated engine that handles scheduling and memory directly. Result: better performance across all platforms and improved handling of large models.
The official changelog also lists OpenClaw integration, support for new models (GLM-5, MiniMax-M2.5), and tokenizer improvements.
Performance Gains by the Numbers
The new engine delivers variable gains depending on hardware:
| Platform | Prompt Processing | Token Generation |
|---|---|---|
| NVIDIA RTX 4090 | +40% | +18% |
| Apple Silicon (M2 Pro/M3 Max/M4) | +10-15% | Modest improvement |
| AMD RDNA 4 (new) | Supported | Supported |
On an RTX 4090, a 2,000-token prompt drops from ~1.5 seconds to under 1 second. Generation jumps from ~60 to over 70 tokens/sec (Web And IT News, February 2026).
On Apple Silicon, gains are more modest but real. Chips with large unified memory (M2 Pro, M3 Max, M4 Pro/Max) benefit most: large models stay entirely in GPU memory, avoiding costly round-trips to system RAM.
8-Bit KV Cache: Longer Conversations
One of the most useful daily-driver changes. Ollama 0.17 now supports 8-bit KV cache quantization, down from 16-bit. In practice, this reduces cache memory usage by ~50% with minimal impact on response quality.
Why it matters: the KV cache stores your conversation context. The more compact it is, the longer you can maintain exchanges without the model forgetting the start of the discussion. On a MacBook Air M3 with 24GB RAM, this makes a real difference when running a 14B model with extended context.
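To opt into the more compact cache explicitly, a minimal sketch using the environment variables Ollama has exposed in earlier releases (`OLLAMA_KV_CACHE_TYPE` and `OLLAMA_FLASH_ATTENTION`; exact names and defaults may differ in 0.17, so treat this as an assumption):

```shell
# Assumed knobs from prior Ollama releases; verify against the 0.17 docs.
# "q8_0" selects the 8-bit KV cache; "f16" restores the 16-bit default.
export OLLAMA_KV_CACHE_TYPE="q8_0"
# Quantized KV cache has historically required flash attention to be on.
export OLLAMA_FLASH_ATTENTION=1
# Then restart the server so the settings take effect:
#   ollama serve
```

If memory headroom is tight on a 16-24GB machine, this is the first switch worth trying.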
Automatic Context Based on Your RAM
Ollama now automatically adjusts context length based on available VRAM:
| Available VRAM | Default Context |
|---|---|
| Less than 24GB | 4,096 tokens |
| 24-48GB | 32,768 tokens |
| 48GB and more | 262,144 tokens |
On Apple Silicon, unified memory counts as VRAM. A Mac mini M4 Pro with 48GB therefore gets 32K context automatically, while a Mac Studio M4 Ultra with 128GB enjoys 256K context without any manual configuration.
You can still adjust manually via the num_ctx parameter in your Modelfile or via the API.
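A manual override can look like the sketch below: a Modelfile with `num_ctx`, or a per-request `options` field on the REST API. The model name and context value are illustrative, not prescriptive:

```shell
# Bake a fixed context window into a custom model via a Modelfile.
cat > Modelfile <<'EOF'
FROM llama3.3:8b
PARAMETER num_ctx 32768
EOF
# Then build and run it:
#   ollama create my-long-context -f Modelfile
#   ollama run my-long-context

# Or set it per request through the REST API (default port 11434):
#   curl http://localhost:11434/api/generate -d '{
#     "model": "llama3.3:8b",
#     "prompt": "Summarize this file.",
#     "options": { "num_ctx": 32768 }
#   }'
```

The Modelfile route is handy when one model should always get a large window; the API route avoids creating a variant just for occasional long prompts.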
MLX Runner: Expanded Support
The MLX runner, which uses Apple's framework optimized for the Neural Engine, now supports additional architectures:
- Gemma 3 - Google's open model family
- Llama - Meta's models
- Qwen 3 - Alibaba's models
This means these models run with native Apple Silicon acceleration via the MLX framework, potentially faster than the standard llama.cpp backend for certain configurations.
New Available Models
Version 0.17 adds support for:
- GLM-5: Reasoning model with 744B total parameters (40B active via Mixture of Experts). Try it with `ollama run glm5`
- MiniMax-M2.5: Optimized for productivity and code
- Qwen2, Command R: Additional supported architectures
Expanded GGUF file support and simplified Hugging Face Safetensors conversion also make custom models easier to use.
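Importing a local GGUF file follows the same Modelfile pattern Ollama has used for a while; a minimal sketch (the file path and model name are placeholders):

```shell
# Point a Modelfile at a local GGUF weights file (path is illustrative).
cat > Modelfile.custom <<'EOF'
FROM ./my-model.gguf
EOF
# Register it under a name of your choosing, then run it:
#   ollama create my-custom-model -f Modelfile.custom
#   ollama run my-custom-model
```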
How to Update on Your Mac
If you installed Ollama via Homebrew:
```shell
brew upgrade ollama
ollama --version
```
If you use the macOS app, it updates automatically. Verify you're on 0.17.0:
```shell
ollama --version
```
To test performance gains, run a model you use regularly and compare response times:
```shell
ollama run llama3.3:8b
```
Note the initial loading time and generation speed. If you were already on 0.16, you should see a difference in prompt processing.
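For numbers rather than impressions, `ollama run` accepts a `--verbose` flag that prints timing stats (prompt eval rate and eval rate in tokens/sec) after each response. The speedup arithmetic itself is trivial; a sketch using the article's ballpark figures:

```shell
# Print per-response timing stats, then compare tokens/sec across versions:
#   ollama run llama3.3:8b --verbose
# Back-of-envelope speedup from the article's ~60 -> ~70+ tok/s figures:
old_tps=60
new_tps=71
gain=$(( (new_tps - old_tps) * 100 / old_tps ))
echo "generation speedup: ${gain}%"   # prints: generation speedup: 18%
```

Run the same prompt a few times and discard the first result; the initial run includes model load time.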
What This Means in Practice
For Mac users in daily use, the three most impactful improvements are:
1. 8-bit KV cache. Long conversations consume half the memory. If you were running a 14B model and context was saturating your RAM after 20 exchanges, you should now go much further.
2. Automatic context. No more guessing the right `num_ctx` value. Ollama chooses the maximum reasonable for your config. A Mac with 48GB jumps straight to 32K context.
3. Expanded MLX runner. Gemma 3, Llama, and Qwen 3 now benefit from native Apple Silicon acceleration via MLX. For these models, this can make a noticeable difference in tokens/sec compared to the standard backend.
Ollama 0.17 isn't revolutionary, but it's a solid update that improves daily use. Update and enjoy.
Related: Check our Ollama setup guide if you're new, or find the best model for your MacBook. See how Qwen 3.5 models perform with the new engine.

Frequently Asked Questions
How much faster is Ollama 0.17 on Apple Silicon?
Ollama 0.17 delivers 10-15% faster prompt processing on Apple Silicon Macs. NVIDIA GPUs see larger gains (up to 40%). The improvement comes from the new integrated inference engine that handles scheduling and memory directly, replacing the legacy llama.cpp server mode.
Does Ollama 0.17 work with all Mac models?
Yes. Ollama 0.17 works on any Mac with Apple Silicon (M1, M2, M3, M4 and their Pro/Max/Ultra variants). Macs with more unified memory benefit most because large models stay entirely in GPU memory. See our device recommendations for your specific Mac.
What is 8-bit KV cache and why does it matter?
The KV cache stores your conversation context. Ollama 0.17 supports 8-bit quantization (down from 16-bit), cutting cache memory usage by roughly 50%. This means longer conversations before the model forgets earlier context, especially useful on MacBooks with 16-24GB RAM.
How do I update Ollama to 0.17?
If installed via Homebrew, run `brew upgrade ollama`. The macOS desktop app updates automatically. Verify with `ollama --version`. No configuration changes are needed; the new engine activates automatically.
Does Ollama 0.17 support MLX on Apple Silicon?
Yes. The MLX runner now supports Gemma 3, Llama, and Qwen 3 models. MLX uses Apple's framework optimized for the Neural Engine, potentially delivering faster inference than the standard llama.cpp backend for supported models.
---
Sources: Ollama GitHub Releases, Web And IT News, WebProNews

Have questions? Reach out on X/Twitter