By ModelFit Team · 2026-06-15

Local LLM Weekly: Gemma 4 12B Hits Ollama (June 2026)

Gemma 4 12B is this week's standout local release: a multimodal model that runs on a 16GB Mac via ollama run gemma4:12b and scores 77.2 on MMLU-Pro. The cloud frontier moved too — Anthropic shipped Claude Fable 5, while Z.ai, Moonshot, MiniMax, NVIDIA and Alibaba all launched new flagships in the first two weeks of June.

This is our weekly model refresh for the window of June 1–15, 2026. Every model below was checked two ways before it went into the modelfit.io database: the exact Ollama tag had to resolve in the registry, and every benchmark figure had to appear verbatim in the lab's own published source. Models that did not clear both gates are listed separately on the watchlist, not in the database. Here is what is new, what you can install today, and what is still a week or two out.
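The two gates described above can be sketched in a few lines of Python. This is a minimal illustration, not the modelfit.io pipeline itself: the registry manifest URL, the function names and the standalone-number matching rule are all assumptions made for the example.

```python
import re
import urllib.error
import urllib.request

# Assumption: the Ollama registry serves manifests at this path.
# The article does not say which endpoint the real check uses.
REGISTRY = "https://registry.ollama.ai/v2/library"


def manifest_url(tag: str) -> str:
    """Build the manifest URL for a tag such as 'gemma4:12b'."""
    name, _, version = tag.partition(":")
    return f"{REGISTRY}/{name}/manifests/{version or 'latest'}"


def tag_resolves(tag: str, timeout: float = 10.0) -> bool:
    """Gate 1: True if the registry answers 200 for the tag's manifest."""
    try:
        with urllib.request.urlopen(manifest_url(tag), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, TimeoutError):
        return False


def figure_is_verbatim(figure: str, source_text: str) -> bool:
    """Gate 2: True if the figure appears in the source as a standalone number."""
    # Reject matches embedded in a longer number, e.g. '77.2' inside '77.25'.
    pattern = rf"(?<![\d.]){re.escape(figure)}(?!\d)"
    return re.search(pattern, source_text) is not None


def clears_both_gates(tag, figures, source_text, resolver=tag_resolves) -> bool:
    """A model enters the database only if both gates pass."""
    return resolver(tag) and all(figure_is_verbatim(f, source_text) for f in figures)
```

Passing the resolver as a parameter keeps the benchmark check testable offline. A model that fails either gate would go to the watchlist instead of the database.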

What you can actually run: Gemma 4 12B

The one new model most readers can run on a current laptop is Gemma 4 12B from Google DeepMind, released June 3, 2026. It is a dense, encoder-free multimodal model (text, image and audio) with a 256K-token context window. The 12B size fits a 16GB Mac at Q4, and it pulls with a single command:

ollama run gemma4:12b
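Once the model has been pulled, it can also be called from a script through the local HTTP API that Ollama exposes. The sketch below assumes an Ollama server is running on its default port (11434) and that the gemma4:12b tag is already downloaded. The helper names are hypothetical and the prompt is only an example.

```python
import json
import urllib.request

# Ollama's default local endpoint; adjust if your server listens elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "gemma4:12b") -> urllib.request.Request:
    """Prepare a non-streaming generate request for the local Ollama server."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "gemma4:12b", timeout: float = 300.0) -> str:
    """Send the prompt and return the model's full text reply."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

Calling generate("Explain mixture-of-experts in one sentence.") should return the reply as a single string. The first call may be slow while the model loads into memory.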

On Google's own evaluations it beats the older Gemma 3 27B while using less memory. The published numbers for the instruction-tuned 12B model:

Benchmark              Gemma 4 12B
MMLU-Pro               77.2
GPQA Diamond           78.8
AIME 2026 (no tools)   77.5
LiveCodeBench v6       72.0
MMMU-Pro               69.1

All five figures are from the model card at huggingface.co/google/gemma-4-12B-it. For a 12B model that runs on a mid-range laptop, GPQA Diamond at 78.8 is the headline: graduate-level science reasoning that used to need a 30B-class model now fits in roughly 8GB of RAM at Q4.
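The "roughly 8GB at Q4" figure can be sanity-checked with back-of-envelope arithmetic. The sketch below rests on assumptions that are not from the model card: about 4.5 bits per weight for a 4-bit quantization, a flat 1.25GB allowance for the KV cache and runtime, and 4GB reserved for the operating system. Treat the output as an estimate, not a measurement.

```python
def q4_footprint_gb(params_billion: float,
                    bits_per_weight: float = 4.5,   # assumed average for a Q4 quant
                    overhead_gb: float = 1.25) -> float:  # assumed KV cache + runtime
    """Estimate load-time memory in GB for a quantized dense model."""
    weights_gb = params_billion * bits_per_weight / 8  # bits to bytes
    return weights_gb + overhead_gb


def fits(params_billion: float, ram_gb: float, os_reserve_gb: float = 4.0) -> bool:
    """True if the estimate fits in RAM after reserving memory for the OS."""
    return q4_footprint_gb(params_billion) <= ram_gb - os_reserve_gb


print(q4_footprint_gb(12))   # 12 * 4.5 / 8 + 1.25 = 8.0
print(fits(12, ram_gb=16))   # True: 8.0 <= 12.0
print(fits(27, ram_gb=16))   # False: about 16.4 > 12.0
```

Under the same assumptions a 27B dense model does not fit in 16GB. Long contexts raise the KV cache well above the flat allowance used here, so real usage depends on how much of the 256K window you actually fill.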

If you have a 24GB Mac, Gemma 4 12B now shows up as a local_feasible option in the modelfit.io wizard and the modelfit CLI alongside the existing Gemma 4 26B-A4B and the Qwen3 family.

The June 2026 cloud frontier moved

Six new cloud or API-only flagships shipped in the same window. None of them run on consumer hardware, but they reset the comparison line that modelfit.io tracks for "how close is local to the frontier."

  • Claude Fable 5 (Anthropic) — the clearest leaderboard move of the window. Released June 9, it is Anthropic's first publicly available Mythos-class model, a tier above Opus 4.8. Anthropic reports it took the open coding lead on its own benchmarks; precise scores are not stated in plain text on the announcement page, so we list it without a published number rather than repeat a figure from a chart image.
  • MiniMax M3 — a native-multimodal mixture-of-experts model (~428B total, ~23B active) with a 1M-token context. MiniMax reports SWE-Bench Pro 59.0% and Terminal-Bench 2.1 66.0% on its launch blog.
  • NVIDIA Nemotron 3 Ultra — a 550B-total / 55B-active hybrid Mamba-2 + MoE model. Its card lists SWE-Bench Verified 70.7, and Artificial Analysis rates it at an Intelligence Index of 48, the leading US open-weights model this month. It needs datacenter GPUs, so it is API-only for almost everyone.
  • GLM-5.2 (Z.ai) — a coding-first flagship launched June 13 to GLM Coding Plan users, positioned to rival Claude Opus and already wired into Claude Code, Cline and OpenCode. MIT open weights are promised shortly after launch.
  • Kimi K2.7-Code (Moonshot AI) — a 1T-total / 32B-active MoE coding model built on K2.6 that cuts thinking-token usage by about 30%. Open weights under a modified MIT license, but 1T scale keeps it cloud-side for most users.
  • Qwen3.7-Plus (Alibaba) — a proprietary multimodal agent model with vision and GUI control on a 1M-token context, served through Alibaba Cloud Model Studio. API-only, no open weights.

All six are now in the database as cloud entries, with parameter counts left undisclosed where the vendor never published them.

On the watchlist: real, but not on Ollama yet

Three notable releases are genuine but did not clear the install gate, so they are not in the recommender yet:

  • DiffusionGemma 26B-A4B (Google) — the first open discrete-diffusion model from Google (about 25B total, ~4B active). It scores 77.6 on MMLU-Pro and Google's card reports over 1,100 tokens per second on an H100. There is no Ollama tag yet — llama.cpp support is still an open pull request. Worth watching.
  • Cohere North Mini Code 1.0 — a 30B-total / 3B-active Apache-2.0 coding model. Confirmed via Cohere's blog, but no Ollama distribution exists yet.
  • Apple Foundation Models 3 — Apple's WWDC 2026 on-device model (a 20B sparse architecture activating 1–4B parameters per request). It runs only inside Apple's own framework, with no Ollama tag and no open weights, so there is no pull path for general use.

We will re-check each of these next week and add them the moment a verified tag lands.

What changed on modelfit.io this week

Seven models were added: Gemma 4 12B as a local option, plus Claude Fable 5, MiniMax M3, NVIDIA Nemotron 3 Ultra, GLM-5.2, Kimi K2.7-Code and Qwen3.7-Plus as cloud entries. The homepage and benchmark tiles are unchanged: no new model posted a source-confirmed SWE-Bench Verified score above the current local leader (Qwen3.6-27B) or cloud leader, so we left the headline figures alone rather than swap in a number from a different benchmark.

FAQ

Can I run Gemma 4 12B on a 16GB Mac?

Yes. At Q4 it needs roughly 8GB to load, which fits comfortably on a 16GB machine with headroom for the OS and context. Run ollama run gemma4:12b.

Is Gemma 4 12B better than Gemma 4 26B-A4B?

The 26B-A4B mixture-of-experts model still scores higher overall, but the dense 12B is close on reasoning benchmarks at lower memory and is a good fit for 16GB machines that cannot hold the 26B model.

Why isn't Claude Fable 5 shown with a SWE-Bench score?

Anthropic's announcement makes a qualitative coding-lead claim but does not publish the exact figure in plain text on the page. We only show benchmark numbers that appear verbatim in a primary source, so Fable 5 is listed without one.

What about DiffusionGemma — can I install it?

Not yet. It is a real Google release, but no Ollama tag resolves and llama.cpp support is still unmerged. It is on our watchlist for a future update.

How often does this list update?

Weekly. Each refresh verifies every Ollama tag against the registry and every benchmark number against the lab's published source before anything enters the database.

Sources

  • Gemma 4 12B model card: https://huggingface.co/google/gemma-4-12B-it/raw/main/README.md
  • Gemma 4 on Ollama: https://ollama.com/library/gemma4
  • MiniMax M3 launch blog: https://www.minimax.io/blog/minimax-m3
  • NVIDIA Nemotron 3 Ultra card: https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16/raw/main/README.md
  • DiffusionGemma card: https://huggingface.co/google/diffusiongemma-26B-A4B-it/raw/main/README.md
  • Claude Fable 5 announcement: https://www.anthropic.com/news/claude-fable-5-mythos-5