Which Local LLM Is Best for Coding? The Tier List
Your RAM (or VRAM) is the hard ceiling, so the ranking below is organized by what actually fits. Benchmark figures shown are the ones published in each vendor's own model card and confirmed in the raw source; where no confirmed figure exists, we say so instead of inventing one. Every ollama run command was probed against the Ollama registry (HTTP 200) on 2026-06-12.
| Model | RAM tier | Download | Context | Sourced score |
|---|---|---|---|---|
| Qwen3.6 27B | 32GB | 17GB | 256K | 77.2 SWE-bench Verified |
| Devstral Small 2 (24B) | 24-32GB | 15GB | 384K | 68.0% SWE-bench Verified |
| Qwen3-Coder 30B (MoE) | 32GB | 19GB | 256K | — |
| gpt-oss-20b | 16-24GB | 14GB | 128K | — |
| Qwen2.5-Coder 7B | 8-16GB | 4.7GB | 32K | — |
| Qwen2.5-Coder 1.5B | any 8GB+ | 986MB | 32K | — |
| gpt-oss-120b | 96-128GB | 65GB | 128K | 41.8% aider polyglot |
| Devstral 2 (123B) | 128GB | 75GB | 256K | 72.2% SWE-bench Verified |
Qwen3.6 27B
32GBThe headline local coder of 2026 — frontier-class agentic coding score in a download that fits a 32GB Mac or 24GB GPU.
ollama run qwen3.6:27b
Devstral Small 2 (24B)
24-32GBMistral’s agentic-coding specialist. Slightly smaller than Qwen3.6 27B with the largest context window at this tier.
ollama run devstral-small-2:24b
Qwen3-Coder 30B (MoE)
32GBMixture-of-experts with only 3.3B active parameters — noticeably faster per token than dense 27B models on the same hardware.
ollama run qwen3-coder:30b
gpt-oss-20b
16-24GBOpenAI’s small open-weight MoE — the strongest coding option below the 32GB tier.
ollama run gpt-oss:20b
Qwen2.5-Coder 7B
8-16GBThe 16GB-and-under workhorse for chat-style coding help. Pair it with the 1.5B variant for autocomplete.
ollama run qwen2.5-coder:7b
Qwen2.5-Coder 1.5B
any 8GB+continue.dev’s officially recommended autocomplete model — small enough to return inline completions instantly.
ollama run qwen2.5-coder:1.5b
gpt-oss-120b
96-128GBFor Mac Studio / 128GB-class machines. Scores 41.8% on the aider polyglot benchmark — above any 32B-class open model on that board.
ollama run gpt-oss:120b
Devstral 2 (123B)
128GBThe strongest open-weight coder that runs on a single high-memory machine.
ollama run devstral-2:123b
Sources: Qwen3.6-27B model card (77.2 SWE-bench Verified, 59.3 Terminal-Bench 2.0), Devstral Small 2 model card (68.0% SWE-bench Verified; Devstral 2 123B at 72.2%), aider polyglot leaderboard (gpt-oss-120b 41.8%). SWE-bench figures are vendor self-reported on their model cards; aider scores are independently run.
What Changed in 2026: the Local Coding Gap Collapsed
A year ago the standard answer was Qwen2.5-Coder 32B — which scores just 16.4% on the aider polyglot benchmark. The 2026 generation rewrote that completely: Devstral Small 2 posts a 68.0% SWE-bench Verified score in a 15GB download, and Qwen3.6 27B reports 77.2 — numbers that were frontier-cloud territory in 2025. If your "best local coding model" knowledge dates from the Qwen2.5-Coder era, every recommendation you remember is out of date.
Two caveats keep cloud models ahead for some work: very long agentic sessions (local context windows fill up faster in practice) and the biggest open-weight coders (DeepSeek, Kimi K2) are 600B-1T-parameter models that no consumer machine runs — open-weight is not the same as locally runnable. For a deeper comparison, see our local vs cloud flagship benchmark breakdown.
How Do I Use a Local LLM in VS Code? (continue.dev)
continue.dev is the most popular open-source Copilot alternative for VS Code and JetBrains. Configuration now lives in ~/.continue/config.yaml (the old config.json is deprecated). A complete local setup — chat on a strong model, autocomplete on a fast one:
models:
- name: Qwen3.6 27B
provider: ollama
model: qwen3.6:27b
- name: Autocomplete
provider: ollama
model: qwen2.5-coder:1.5b
roles:
- autocompleteQwen2.5-Coder 1.5B is continue.dev's officially recommended autocomplete model. Their docs also warn against using thinking/reasoning models for autocomplete — completions need to arrive in milliseconds, not after a reasoning chain.
How Do I Use aider with Ollama?
aider is the terminal-native AI pair programmer. Point it at your local Ollama server:
export OLLAMA_API_BASE=http://127.0.0.1:11434 aider --model ollama_chat/qwen3.6:27b
Use the ollama_chat/ prefix — aider's docs state it is recommended over plain ollama/. The critical gotcha: Ollama defaults to a 2K context window and silently discards anything beyond it (aider's own warning). Coding needs far more. Raise it server-side:
OLLAMA_CONTEXT_LENGTH=32768 ollama serve
How Do I Use a Local LLM in Zed?
Zed auto-discovers every model Ollama has pulled — pull a model, then pick it from the Agent panel's model dropdown. One thing worth overriding: Zed defaults local models to a 4096-token context. In settings.json:
{
"language_models": {
"ollama": {
"api_url": "http://localhost:11434",
"available_models": [
{
"name": "qwen3.6:27b",
"display_name": "Qwen3.6 27B",
"max_tokens": 32768,
"supports_tools": true
}
]
}
}
}Frequently Asked Questions
What is the best local LLM for coding in 2026?
Qwen3.6 27B for most machines — 77.2 SWE-bench Verified (per its model card) in a 17GB download that fits a 32GB Mac or 24GB GPU. Devstral Small 2 is the strongest agentic-coding alternative at the same tier, with a 384K context window.
How much RAM do I need to run a coding LLM locally?
16GB runs 7B-class models plus a fast autocomplete model. 32GB is the sweet spot for the current best local coders (Qwen3.6 27B, Devstral Small 2). 96-128GB unlocks gpt-oss-120b and Devstral 2 123B.
What is the best local model for code autocomplete?
Qwen2.5-Coder 1.5B — continue.dev's official recommendation. Under 1GB, fast enough for inline suggestions. Never use a thinking model for autocomplete.
Can a local LLM replace GitHub Copilot or Claude?
For autocomplete and single-file edits, yes — fully offline. For long agentic sessions on large repos, frontier cloud models still lead, but 2026's local 24-27B models closed most of the gap.