What is the best local LLM for coding right now?

Qwen3.6-27B leads consumer-runnable open-weight models on SWE-Bench Verified at 77.2% and runs on a 24GB Mac. Pull it with: ollama run qwen3.6:27b.

How close are local models to Claude and GPT on SWE-Bench?

The top closed model, Claude Fable 5, scores 95.0% on SWE-Bench Verified; the best consumer-runnable open-weight model, Qwen3.6-27B, scores 77.2%, a gap of 17.8 points.

Are these benchmark numbers measured by ModelFit?

No. Every score is third-party SWE-Bench Verified, re-confirmed against the model’s primary source (an independent leaderboard or the model card) and refreshed weekly. ModelFit runs no benchmarks; tokens-per-second figures elsewhere on the site are estimates.

Local vs Cloud: SWE-Bench Verified Leaderboard

Name: Local LLMs vs Cloud Flagships: SWE-Bench Verified Leaderboard
Creator: ModelFit
License: https://creativecommons.org/licenses/by/4.0/

The best consumer-runnable open-weight LLM on SWE-Bench Verified is Qwen3.6-27B (77.2%), which runs on a 24GB Mac and sits 17.8 points behind Claude Fable 5 (95.0%). DeepSeek V4 Pro scores slightly higher (77.4%) but needs a 512GB Mac Studio. Each score is confirmed against its primary source and re-checked weekly.

Auto-verified weekly · updated 2026-07-02

Key facts

Qwen3.6-27B (77.2%) is the top open-weight model on SWE-Bench Verified that runs on a normal 24GB Mac, 17.8 points behind the cloud leader Claude Fable 5 (95.0%).

$ ollama run qwen3.6:27b

BEST LOCAL

Qwen3.6-27B · 77.2%

CLOUD LEADER

Claude Fable 5 · 95.0%

GAP

17.8 pts

RUNS ON

24GB Mac

Scores are third-party (SWE-Bench Verified) and confirmed against the raw text of each model's primary source. ModelFit runs no benchmarks.

BEST LOCAL CODING

77.2% · Qwen3.6-27B · SWE-Bench Verified

CLOUD LEADER

95.0% · Claude Fable 5 · SWE-Bench Verified

GAP TO CLOUD

17.8 pts · best consumer-runnable local vs best cloud

MODELS TRACKED

147 local · refreshed weekly

Local vs Cloud: How Close Are We?

SWE-Bench Verified

Real-world software-engineering benchmark, the headline coding metric. Open-weight models you can run at home keep narrowing the gap with the best closed APIs.

Claude Fable 5: 95.0% (Cloud)
Claude Opus 4.8: 88.6% (Cloud)
GPT-5.5: 82.6% (Cloud)
Claude Opus 4.7: 82.0% (Cloud)
Gemini 3.1 Pro: 78.8% (Cloud)
Kimi K2.7-Code: 78.2% (Cloud)
DeepSeek V4 Pro: 77.4% (Local)
Qwen3.6-27B: 77.2% (Local)

Legend: Local = open-weight, Cloud = closed.

Full Leaderboard


#	Model	Parameters	SWE-Bench Verified	Type	Runs on	Get it	Source	Verified
1	Claude Fable 5	Cloud API	95.0%	Cloud	Cloud API	API only	vals.ai SWE-Bench Verified leaderboard	2026-07-02
2	Claude Opus 4.8	Cloud API	88.6%	Cloud	Cloud API	API only	vals.ai SWE-Bench Verified leaderboard	2026-07-02
3	GPT-5.5	Cloud API	82.6%	Cloud	Cloud API	API only	vals.ai SWE-Bench Verified leaderboard	2026-07-02
4	Claude Opus 4.7	Cloud API	82.0%	Cloud	Cloud API	API only	vals.ai SWE-Bench Verified leaderboard	2026-07-02
5	Gemini 3.1 Pro	Cloud API	78.8%	Cloud	Cloud API	API only	vals.ai SWE-Bench Verified leaderboard	2026-07-02
6	Kimi K2.7-Code	Cloud API	78.2%	Cloud	Cloud API	API only	vals.ai SWE-Bench Verified leaderboard	2026-07-02
7	DeepSeek V4 Pro	Open-weight MoE	77.4%	Local	512GB Mac Studio	n/a	vals.ai SWE-Bench Verified leaderboard	2026-07-02
8	Qwen3.6-27B	27B	77.2%	Local	24GB Mac	qwen3.6:27b (Ollama tag, registry-verified)	Qwen3.6-27B HuggingFace model card	2026-07-02
9	MiniMax M3	428B / 23B MoE	75.0%	Local	256GB+ Mac Studio	n/a	vals.ai SWE-Bench Verified leaderboard	2026-07-02
10	Qwen3.5-27B	27B	75.0%	Local	24GB Mac	qwen3.5:27b (Ollama tag, registry-verified)	Qwen3.5-27B HuggingFace model card	2026-07-02
11	Qwen3.6-35B-A3B	35B / 3B MoE	73.4%	Local	24GB Mac	qwen3.6:35b-a3b (Ollama tag, registry-verified)	Qwen3.6-35B-A3B HuggingFace model card	2026-07-02
12	NVIDIA Nemotron 3 Ultra	Cloud API	69.0%	Cloud	Cloud API	API only	vals.ai SWE-Bench Verified leaderboard	2026-07-02
13	Mistral Medium 3.5	128B dense	66.4%	Local	96GB+ Mac	n/a	vals.ai SWE-Bench Verified leaderboard	2026-07-02
14	Devstral 2 (2512)	Open-weight 24B	62.8%	Local	32GB Mac	n/a	vals.ai SWE-Bench Verified leaderboard	2026-07-02

Third-party SWE-Bench Verified scores. Each % is re-confirmed against the linked primary source and each Ollama tag re-probed against the registry (HTTP 200) weekly. Closed-API parameter counts are undisclosed ("Cloud API"). Tokens-per-second figures elsewhere on ModelFit are estimates, not measured benchmarks.
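
The headline gap can be reproduced from the leaderboard rows. The following is a minimal Python sketch, not ModelFit's own tooling: the `Row` type, the hand-copied subset of rows, and the 24GB cutoff used to mean "consumer-runnable" are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    model: str
    score: float      # SWE-Bench Verified, in percent
    kind: str         # "Cloud" or "Local", as in the Type column
    min_ram_gb: int   # from the "Runs on" column; 0 for cloud rows


# A subset of the leaderboard, copied by hand for illustration.
ROWS = [
    Row("Claude Fable 5", 95.0, "Cloud", 0),
    Row("Kimi K2.7-Code", 78.2, "Cloud", 0),
    Row("DeepSeek V4 Pro", 77.4, "Local", 512),
    Row("Qwen3.6-27B", 77.2, "Local", 24),
    Row("Qwen3.5-27B", 75.0, "Local", 24),
]

CONSUMER_RAM_GB = 24  # assumed cutoff for "consumer-runnable"


def best(rows):
    """Return the highest-scoring row."""
    return max(rows, key=lambda r: r.score)


def headline(rows, ram_limit_gb=CONSUMER_RAM_GB):
    """Return (cloud leader, best local row within the RAM limit, gap in points)."""
    cloud = best(r for r in rows if r.kind == "Cloud")
    local = best(
        r for r in rows if r.kind == "Local" and r.min_ram_gb <= ram_limit_gb
    )
    return cloud, local, round(cloud.score - local.score, 1)
```

With the default 24GB limit, `headline(ROWS)` selects Qwen3.6-27B and a 17.8-point gap. Raising the limit to 512 selects DeepSeek V4 Pro instead and narrows the gap to 17.6 points, which is why the headline figure is stated for consumer-runnable models.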

What Should You Run Today?

MacBook 16GB ⭐

Capable coding on a thin laptop

Gemma 4 E4B + Qwen3.5-9B
Qwen3.5-4B for coding (88.8% MMLU-Redux)

MacBook 24GB ⭐

The sweet spot

Qwen3.6-27B (77.2% SWE-Bench)
Qwen3.6-35B-A3B for agents
Gemma 4 26B-A4B for multimodal

Mac Studio 128GB+

Frontier-class local

DeepSeek V4 Pro (77.4% SWE-Bench, needs a 512GB Mac Studio)
MiniMax M3 (75.0%, 256GB+)
Mistral Medium 3.5 (128B dense)
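
The tiers above can also be read as a lookup from available memory to the scored models that fit. This is a rough Python sketch under stated assumptions: the `LOCAL_MODELS` list is copied by hand from the leaderboard's "Runs on" column, values such as "96GB+" and "256GB+" are treated as minimums, and `models_that_fit` is a hypothetical helper, not part of any ModelFit tool.

```python
# (model, SWE-Bench Verified %, assumed minimum unified memory in GB)
LOCAL_MODELS = [
    ("DeepSeek V4 Pro", 77.4, 512),
    ("Qwen3.6-27B", 77.2, 24),
    ("MiniMax M3", 75.0, 256),
    ("Qwen3.5-27B", 75.0, 24),
    ("Qwen3.6-35B-A3B", 73.4, 24),
    ("Mistral Medium 3.5", 66.4, 96),
    ("Devstral 2 (2512)", 62.8, 32),
]


def models_that_fit(ram_gb):
    """Return (model, score) pairs that fit in ram_gb, best score first."""
    fits = [(name, score) for name, score, need in LOCAL_MODELS if need <= ram_gb]
    # sorted() is stable, so tied scores keep their leaderboard order.
    return sorted(fits, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for ram in (24, 128, 512):
        print(ram, models_that_fit(ram))
```

A 16GB machine returns an empty list here because the models suggested for that tier have no SWE-Bench Verified score on the leaderboard, not because nothing runs on it.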

Methodology

How this leaderboard stays honest

Every score is SWE-Bench Verified (not the harder SWE-Bench Pro) and links to the primary source it was confirmed against: either an independent leaderboard or the model's own card. A weekly routine re-extracts each number from that raw source, re-probes each Ollama tag against the registry (expecting HTTP 200), and swaps a row only when a newly confirmed score beats the incumbent. Closed-API parameter counts are not disclosed by vendors, so cloud rows show "Cloud API". ModelFit runs no benchmarks; tokens-per-second figures elsewhere on the site are estimates.
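
The weekly routine described above might look roughly like the sketch below. It is an illustration only, not ModelFit's actual code: the function names are hypothetical, the regular expression is one simple way to pull a percentage from raw text, and the registry URL pattern is an assumption that should be checked against Ollama's own documentation. The HTTP call is injected as a function so the logic can be exercised without network access.

```python
import re
from typing import Callable, Optional

# Assumed URL pattern for an Ollama registry manifest; verify before relying on it.
ASSUMED_REGISTRY_URL = "https://registry.ollama.ai/v2/library/{name}/manifests/{version}"


def extract_score(raw_text: str, model: str) -> Optional[float]:
    """Return the first percentage that follows the model name on the same line."""
    pattern = re.escape(model) + r"[^\n%]*?(\d{1,3}(?:\.\d+)?)\s*%"
    match = re.search(pattern, raw_text)
    return float(match.group(1)) if match else None


def tag_is_live(
    tag: str,
    http_status: Callable[[str], int],
    url_template: str = ASSUMED_REGISTRY_URL,
) -> bool:
    """Probe a tag such as 'qwen3.6:27b'; live means the registry answers 200."""
    name, _, version = tag.partition(":")
    url = url_template.format(name=name, version=version or "latest")
    return http_status(url) == 200


def should_swap(incumbent_score: float, candidate_score: Optional[float]) -> bool:
    """Swap a row only when a confirmed candidate strictly beats the incumbent."""
    return candidate_score is not None and candidate_score > incumbent_score
```

In real use, `http_status` would wrap an HTTP client and return the response status code. A score that cannot be re-extracted yields `None`, so `should_swap` leaves the incumbent row in place rather than promoting an unconfirmed number.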

→ Full compatibility dataset → JSON export → Open dataset (CC BY 4.0)
