2026-05-14
The April 2026 Local LLM Wave: Qwen3.6, Gemma 4, Llama 4, DeepSeek V4
What actually shipped in April 2026?
Between April 2 and April 30, four major labs released frontier-class open-weight models. None of them is a refresh: each brings a new architecture, new benchmark results, or new licensing terms.
Qwen3.6 (April 16 and 21, 2026). Alibaba shipped Qwen3.6-35B-A3B (MoE, 3B active) and the flagship dense Qwen3.6-27B. Both are Apache 2.0, both have 262K native context extensible to 1M. The 27B dense model hits 86.2 on MMLU-Pro and 77.2% on SWE-Bench Verified, the highest score yet from an open-weight dense model. Source: Qwen blog, Hugging Face model card.

Gemma 4 (April 2, 2026). Google DeepMind released a four-tier family: a 31B dense flagship, a 26B-A4B MoE, plus E4B (4.5B effective) and E2B (2.3B effective) on-device variants using Per-Layer Embeddings. The 31B scores 85.2 MMLU-Pro and 84.3 GPQA Diamond. The E2B runs at ~158 tokens/sec on M5 Max via MLX, the fastest model in the 2026 batch by a wide margin. Source: Google DeepMind, Ollama gemma4 library.

Llama 4 Scout and Maverick (April 5, 2026). Meta finally shipped the Llama 4 MoE family. Scout: 109B total / 17B active, 10M context, multimodal. Maverick: 400B total / 17B active, 1M context, multimodal. The Behemoth variant (288B active) is still in training. Source: Meta AI blog, Ollama llama4.

DeepSeek V4 (April 24, 2026). DeepSeek released both a Flash (284B / 13B active) and a Pro (1.6T / 49B active) variant, both MIT licensed. Flash with Thinking Mode rivals OpenAI's o-series on competitive math and PhD-level science. Local quantization requires antirez's llama.cpp fork or upstream PR #22378. Source: Hugging Face, MIT Tech Review.

Plus three from the second tier: Mistral Medium 3.5 (April 30, 128B dense, SWE-Bench Verified 77.6%), GLM-5.1 from Z.ai (April 7, 744B / 40B active MoE, MIT, SWE-Bench Pro 58.4%), and Moonshot's Kimi K2.6 (April 20, 1T / 32B active MoE, SWE-Bench Pro 58.6%, open weights via Unsloth GGUF).
Which model should you actually run?
The answer changed in April 2026. RAM-by-RAM:
| RAM | Best pick | Why |
|---|---|---|
| 8 GB | Gemma 4 E4B | 4.5B effective params, multimodal, runs on iPhone-class Macs |
| 16 GB | Qwen3.5-9B or Gemma 4 26B-A4B (partial offload) | Mature 9B still leads quality-per-GB |
| 24 GB | Qwen3.6-27B | 77.2% SWE-Bench Verified — new sweet spot |
| 32 GB | Qwen3.6-35B-A3B or Gemma 4 31B | MoE wins for agent loops, dense wins for raw coding |
| 64 GB | DeepSeek V4 Flash (IQ2_XXS) or Llama 4 Scout | Frontier reasoning at home |
| 128 GB+ | Llama 4 Scout (full Q4) | 10M context, multimodal |
| 256 GB+ | Llama 4 Maverick, Kimi K2.6, GLM-5.1 | Mac Studio Ultra territory |
The previous sweet spot was Qwen3.5-9B on a 16GB Mac. That's still valid, but if you have 24GB or more, Qwen3.6-27B is now the default coding model. On SWE-Bench Verified it's only 10.4 points behind Claude Opus 4.7 and just 0.4 points behind Mistral Medium 3.5, a cloud-tier API model.
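If you want to sanity-check RAM requirements for a model or quant that isn't in the table, a useful back-of-the-envelope rule is: weight size in GB ≈ total parameters (in billions) × bits per weight ÷ 8. Here is a minimal Python sketch of that estimate; the bits-per-weight figure is a rough average for llama.cpp quant formats, and the estimate ignores KV cache and runtime overhead, which add more on top, especially at long contexts.

```python
def model_weight_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate in-memory size of the quantized weights, in GB.

    params_b        -- total parameter count in billions (total, not active, for MoE)
    bits_per_weight -- effective bits per weight, e.g. ~4.85 for Q4_K_M (approximate)
    """
    return params_b * bits_per_weight / 8

# Llama 4 Scout, 109B total, Q4_K_M   -> ~66 GB (close to the ~67 GB figure below)
print(f"Scout Q4_K_M:    {model_weight_gb(109, 4.85):.0f} GB")
# Llama 4 Maverick, 400B total, Q4_K_M -> ~242 GB (Mac Studio Ultra territory)
print(f"Maverick Q4_K_M: {model_weight_gb(400, 4.85):.0f} GB")
# DeepSeek V4 Flash, 284B total, Q4_K_M -> ~172 GB (matches the ~170 GB figure in the FAQ)
print(f"V4 Flash Q4_K_M: {model_weight_gb(284, 4.85):.0f} GB")
```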
How does this compare to cloud frontier in May 2026?
The cloud side moved too. Claude Opus 4.7 (April 16) leads SWE-Bench Verified at 87.6%. GPT-5.5 (April 23) and Gemini 3.1 Pro (April 2026) followed. Grok 4.3 (April 17) extended to 1M context on April 30.
On coding, the gap between best open and best closed is now ~10 percentage points on SWE-Bench Verified — the smallest it has ever been. On reasoning, DeepSeek V4 Flash Thinking Mode reportedly approaches OpenAI o-series on PhD-level science and competitive math, though independent verification is still in progress.
Apple Silicon: Ollama 0.19 MLX backend
The hardware story moved as much as the models. Ollama 0.19 (March 31, 2026) shipped a native MLX backend, replacing the llama.cpp Metal path for several model families on Apple Silicon.
Numbers from Ollama's own benchmark, run on Qwen3.5-35B-A3B:
- Prefill: 1,810 tok/s (MLX) vs 1,154 tok/s (llama.cpp Metal) — 1.57x speedup
- Decode: 112 tok/s (MLX) vs 58 tok/s (llama.cpp Metal) — 1.93x speedup
- With NVFP4 int4: 1,851 tok/s prefill, 134 tok/s decode
Source: Ollama blog, AppleInsider coverage.
Apple's own ML research team published M5 vs M4 numbers: 4.06x time-to-first-token speedup on Qwen 14B 4-bit, 3.97x on Qwen 8B 4-bit. Source: Apple ML Research.
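If you want to reproduce throughput numbers like these on your own hardware rather than take a blog post's word for it, Ollama's local REST API returns prompt_eval_count/prompt_eval_duration and eval_count/eval_duration (durations in nanoseconds) with every response, so prefill and decode tokens/sec are one division each. A minimal Python sketch follows; the model tag and prompt are placeholders, swap in whatever you have pulled.

```python
import json
import urllib.request

# Ask the local Ollama server (default port 11434) for a single non-streamed completion.
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps({
        "model": "qwen3.6:27b",   # placeholder: any model tag you have pulled locally
        "prompt": "Explain unified memory in one paragraph.",
        "stream": False,
    }).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    stats = json.load(resp)

# Durations are reported in nanoseconds.
prefill_tps = stats["prompt_eval_count"] / (stats["prompt_eval_duration"] / 1e9)
decode_tps = stats["eval_count"] / (stats["eval_duration"] / 1e9)
print(f"prefill: {prefill_tps:.0f} tok/s, decode: {decode_tps:.0f} tok/s")
```

The same timing stats show up in the terminal if you run `ollama run <model> --verbose`.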
llama.cpp didn't stand still either. April builds shipped tensor parallelism (b8738, ~3-4x improvement), 1-bit Q1_0 quantization with Metal support (b8712, fits a 7B model in under 1GB), and Walsh-Hadamard KV rotation (b8607) that lifts Q4_0 AIME25 accuracy to 21.7%.
What's the catch?
Three real ones.
License fragmentation. Qwen3.6 and Gemma 4 are Apache 2.0, which means clean commercial use. Llama 4 still uses the Llama Community License with the 700M monthly active user trigger. DeepSeek V4, Kimi K2.6, and GLM-5.1 are MIT. Check before shipping.

RAM math is brutal at the top. Llama 4 Maverick needs 245GB at Q4_K_M, which puts it in Mac Studio M4 Ultra 256GB territory ($6,000+). DeepSeek V4 Pro is unrunnable on consumer hardware even at IQ2_XXS. The frontier-class wave excludes anyone on a 16GB Mac.

Mac Studio M5 Ultra slipped. Bloomberg's Mark Gurman reported on April 19, 2026 that the M5 Ultra Mac Studio is delayed from mid-2026 to October 2026 due to DRAM shortages. Source: MacRumors. If you want a 256GB+ Mac for Llama 4 Maverick or Kimi K2.6, you're waiting.

Ollama commands
The straightforward ones:
ollama run qwen3.6:27b # New default coding model
ollama run qwen3.6:35b-a3b # MoE for agents
ollama run gemma4:31b # Google dense flagship
ollama run gemma4:26b # MoE, multimodal
ollama run gemma4:e4b # On-device, 8GB Macs
ollama run gemma4:e2b # IoT and mobile tier
ollama run llama4:scout # 109B/17B MoE, 67GB
ollama run llama4:maverick # 400B/17B MoE, 245GB
Cloud-only (route through Ollama Cloud):
ollama run deepseek-v4-flash:cloud
ollama run glm-5.1:cloud
ollama run mistral-medium-3.5:cloud
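Once a model is pulled, you don't have to stay in the interactive CLI: the local Ollama server exposes a chat endpoint on port 11434, which is how you'd wire any of these models into an editor or agent loop. Here is a minimal Python sketch; the model tag is a placeholder for whichever of the April models you actually pulled.

```python
import json
import urllib.request

def chat(model: str, messages: list[dict]) -> str:
    """Send a non-streamed chat request to the local Ollama server and return the reply."""
    req = urllib.request.Request(
        "http://localhost:11434/api/chat",
        data=json.dumps({"model": model, "messages": messages, "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["message"]["content"]

reply = chat("qwen3.6:27b", [  # placeholder tag: use any model from the list above
    {"role": "user", "content": "Write a unit test for a binary search function."},
])
print(reply)
```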
FAQ
Is Qwen3.6 better than Qwen3.5 for everyday coding?
Yes, on every benchmark we tracked. Qwen3.6-27B beats Qwen3.5-35B-A3B by 6+ points on SWE-Bench Verified and runs in less RAM. If you're on a 24GB Mac, switch.

Should I delete my Qwen3.5 models?
No. Qwen3.5-9B is still the best 14GB-budget option in May 2026 and Qwen3.5-4B remains the fastest competent small model. Keep both.

Can I run DeepSeek V4 Flash on a Mac?
With a 128GB+ Mac Studio and antirez's IQ2_XXS quantization fork, yes. The full Q4_K_M is ~170GB, feasible only on 192GB+ unified memory. For most users, route via Ollama Cloud.

What about Llama 4 Behemoth?
Still training as of May 2026. Meta has not given a release window. Expect late 2026 at the earliest.

Will Apple announce Foundation Models v2 at WWDC 2026?
Highly likely. WWDC runs June 8-12. The Core ML → Core AI framework rename was reported in March 2026 (AppleInsider). Watch for on-device model size disclosures.
What's coming in May-June 2026
- Apple WWDC 2026 (June 8-12): Core AI framework, Foundation Models updates
- Kimi K2.6 Code formal release (any day now)
- Mac Studio M5 Ultra delayed to October 2026
- DeepSeek V4 distills expected within weeks — that's when the V4 reasoning quality lands on consumer Macs
- GLM-5.1 Air: the community is asking for it, but there's no official confirmation
The pace doesn't look like it's slowing down. Six months ago a 9B model matching GPT-4o on MMLU was news. Today an open 27B model is within 10 points of Claude Opus 4.7 on SWE-Bench Verified — and it runs on a $2,000 Mac.
Have questions? Reach out on X/Twitter