2026-04-04
Run Claude Code Free: Ollama Local Setup in 4 Steps (2026)
Claude Code is Anthropic's AI coding agent — and you can run it locally with Ollama instead of paying $100/month for Claude Max. Since Ollama v0.14 shipped native Anthropic Messages API compatibility in January 2026, the setup is dead simple: two environment variables, one command. As of April 2026, Ollama 0.19 with MLX support makes this even better — local inference is now up to 2x faster on Apple Silicon.
TL;DR: Run ANTHROPIC_BASE_URL=http://localhost:11434 ANTHROPIC_AUTH_TOKEN=ollama claude --model qwen3.5:27b to use Claude Code with free local models. Works on any Apple Silicon Mac. Ollama 0.19 with MLX nearly doubles decode speed (58 to 112 tok/s). Best model depends on your RAM: 4B for 8GB, 9B for 16GB, 27B for 24GB+.
This setup went viral after @itsafiz shared it on X, pulling 51,000 views and 781 bookmarks in 48 hours (March 2026). The community response confirmed what many solo developers suspected: Claude Code's real value is in its CLI harness, not just the model behind it.
How Does Claude Code Work with Local Models?
Claude Code accepts two environment variables that redirect all API calls to any Anthropic-compatible endpoint. Since Ollama v0.14 (January 2026), Ollama natively exposes an Anthropic Messages API on localhost:11434 — no proxy needed, no translation layer. Just point and run.
The architecture is simple:
Claude Code CLI → localhost:11434 → Ollama (Anthropic API) → Local Model
Claude Code's file operations, git integration, and project context management all work the same. The only difference is which model generates the responses. Tool calling, multi-turn conversations, vision input, and extended thinking all work through this native API layer.
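Under the hood, the redirected traffic is ordinary Anthropic Messages API calls. The sketch below is a dry run that prints the request shape instead of sending it; the `/v1/messages` path and headers follow Anthropic's published API, and the commented-out curl assumes `ollama serve` is already running on this machine.

```shell
# Dry run: print the Messages API request Claude Code would send.
BASE_URL="${ANTHROPIC_BASE_URL:-http://localhost:11434}"
PAYLOAD='{
  "model": "qwen3.5:27b",
  "max_tokens": 512,
  "messages": [{"role": "user", "content": "Explain this function."}]
}'
echo "POST ${BASE_URL}/v1/messages"
echo "$PAYLOAD"
# To actually send it once ollama serve is up:
#   curl -s "${BASE_URL}/v1/messages" \
#     -H "x-api-key: ollama" \
#     -H "anthropic-version: 2023-06-01" \
#     -H "content-type: application/json" \
#     -d "$PAYLOAD"
```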
Step 1: Install Ollama (0.19+ Recommended)
Open Terminal and run:
brew install ollama
If you already have Ollama installed, update to 0.19 for MLX acceleration:
brew upgrade ollama
Start the Ollama server:
ollama serve
Verify it's running:
curl http://localhost:11434/api/tags
You should see a JSON response with your available models. If you just installed Ollama, the list will be empty — that's fine. For a full walkthrough, see our complete Ollama installation guide for Mac.
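If you want model names rather than a wall of JSON, a few lines of stdlib Python can pull them out. This sketch parses a canned sample of the `/api/tags` response shape; in practice you would pipe `curl -s http://localhost:11434/api/tags` into the same snippet.

```shell
# Extract model names from an /api/tags-style response.
# SAMPLE is a canned example; swap in the live curl output.
SAMPLE='{"models":[{"name":"qwen3.5:9b"},{"name":"glm-4.7:9b"}]}'
echo "$SAMPLE" | python3 -c '
import json, sys
for m in json.load(sys.stdin).get("models", []):
    print(m["name"])
'
```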
What's New in Ollama 0.19 (March 2026)
Ollama 0.19 is the biggest Mac performance update yet. It rebuilds Apple Silicon inference on top of Apple's MLX framework, taking full advantage of unified memory. The results (Ollama blog, March 31, 2026):
| Metric | Ollama 0.18 | Ollama 0.19 (MLX) | Improvement |
|---|---|---|---|
| Prefill speed | 1,154 tok/s | 1,810 tok/s | +57% |
| Decode speed | 58 tok/s | 112 tok/s | +93% |
That's nearly 2x faster response generation. On M5 chips, Ollama also leverages the new GPU Neural Accelerators for even bigger gains. The improved cache system reuses data across conversations to lower memory use and speed up prompt processing — a big win for coding workflows with branching prompts.
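The improvement column checks out against the raw figures. A quick awk sanity check using the numbers from the table above:

```shell
# Recompute the improvement percentages from the raw tok/s figures.
awk 'BEGIN {
  printf "prefill: +%.0f%%\n", (1810 / 1154 - 1) * 100;
  printf "decode:  +%.0f%%\n", (112  / 58   - 1) * 100;
}'
# prefill: +57%
# decode:  +93%
```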
For more Ollama performance details, see our Ollama 0.17 Apple Silicon benchmarks article, which covers the previous inference engine overhaul.
Note: MLX preview currently requires 32GB+ unified memory and supports Qwen3.5 models. Support for more models is rolling out. Macs with less than 32GB still get the standard llama.cpp backend, which also improved in 0.19.
Step 2: Pull a Coding Model
Which model you pull depends on how much RAM your Mac has. Ollama downloads models on first pull and caches them locally.
RAM-to-Model Guide (April 2026)
| Model | Download Size | RAM Needed | Best For | Mac Compatibility |
|---|---|---|---|---|
| qwen3.5:4b | ~2.5 GB | ~4 GB total | Quick fixes, simple tasks | Any Mac (8GB+) |
| qwen3.5:9b | ~5 GB | ~7 GB total | General coding | MacBook Air M4 16GB |
| qwen3.5:27b | ~16 GB | ~20 GB total | Complex refactoring | MacBook Pro 24GB+ |
| glm-4.7:9b | ~5.5 GB | ~8 GB total | Fast + large context (128K) | MacBook Air 16GB |
| qwen3.5:35b-a3b | ~22 GB | ~22 GB total | Best MoE efficiency | Mac Mini/Pro 24GB+ |
Pull your model:
# For 8GB Macs — small but surprisingly capable
ollama pull qwen3.5:4b
# For 16GB MacBook Air — the sweet spot
ollama pull qwen3.5:9b
# For 24GB+ Mac — best local coding quality
ollama pull qwen3.5:27b
# Alternative: GLM 4.7 for speed + large context
ollama pull glm-4.7:9b
Qwen 3.5 4B matches GPT-4o on independent testing with a 49.9% win rate across 1,000 real-world prompts (N8Programs, March 2026). Even the smallest model here is genuinely useful. For a deeper comparison between model families, read our DeepSeek V3 vs Qwen 3.5 Mac comparison.
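If you script your setup, the RAM table above reduces to a small lookup. Here is a sketch with thresholds taken from the table; the `pick_model` helper is ours, not part of Ollama.

```shell
# Map total RAM (GB) to a model tier, per the table above.
pick_model() {
  if   [ "$1" -ge 24 ]; then echo "qwen3.5:27b"
  elif [ "$1" -ge 16 ]; then echo "qwen3.5:9b"
  else                       echo "qwen3.5:4b"
  fi
}
pick_model 16   # prints qwen3.5:9b
# On macOS you can feed it the real hardware figure:
#   pick_model "$(( $(sysctl -n hw.memsize) / 1024 / 1024 / 1024 ))"
```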
Step 3: Install Claude Code
You need Node.js 18+ installed. Then:
npm install -g @anthropic-ai/claude-code
Verify the installation:
claude --version
Claude Code gets frequent updates — multi-agent collaboration, computer use, and auto mode all shipped in Q1 2026. Keep it current with npm update -g @anthropic-ai/claude-code.
Step 4: Run Claude Code with Your Local Model
This is the one-liner that makes it all work:
ANTHROPIC_BASE_URL=http://localhost:11434 \
ANTHROPIC_AUTH_TOKEN=ollama \
claude --model qwen3.5:27b
Replace qwen3.5:27b with whichever model you pulled in Step 2.
Make It Permanent
Add this to your ~/.zshrc so you don't have to type it every time:
# Claude Code with local Ollama
alias claude-local='ANTHROPIC_BASE_URL=http://localhost:11434 ANTHROPIC_AUTH_TOKEN=ollama claude --model qwen3.5:27b'
Then reload your shell:
source ~/.zshrc
Now just type claude-local to start a session.
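If you want to pick a model per session, a shell function is slightly more flexible than the alias. Same two environment variables; the `CLAUDE_BIN` indirection is only there so the sketch can be dry-run with `echo` standing in for the real binary.

```shell
# Function variant of the alias: optional model argument,
# defaulting to qwen3.5:27b. CLAUDE_BIN exists only for dry runs.
claude_local() {
  ANTHROPIC_BASE_URL=http://localhost:11434 \
  ANTHROPIC_AUTH_TOKEN=ollama \
  "${CLAUDE_BIN:-claude}" --model "${1:-qwen3.5:27b}"
}
# Dry run, substituting echo for the real binary:
CLAUDE_BIN=echo claude_local qwen3.5:9b   # prints: --model qwen3.5:9b
```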
What Works and What Doesn't?
Be honest with yourself about what local models can and cannot do. Claude Code's CLI is excellent, but the model quality gap is real. See our local LLMs vs cloud flagships benchmark for detailed numbers.
Works Well
- File reading and context — Claude Code indexes your project the same way regardless of backend
- Code generation — Single-file functions, components, utilities
- Refactoring — Renaming, restructuring, pattern application
- Git operations — Commits, diffs, branch management
- Debugging — Reading error messages and suggesting fixes
- Explaining code — Summarizing what files and functions do
Works Poorly with Small Models
- Multi-step agentic tasks — Local 4B-9B models lack the reasoning depth for complex chains of edits across multiple files
- Tool calling reliability — Varies significantly by model; smaller models fail more often
- Large context windows — 27B models handle this better, but 4B models struggle past 8K tokens
- Architectural decisions — Don't expect a 4B model to design your system architecture
As @debdoot_x noted: "So basically running Claude Code without a Claude model. Funny." It's a fair point. You're getting Claude Code's interface and tooling with a local model's intelligence. For many tasks, that's enough.
Ollama 0.19 MLX: What It Means for Claude Code Users
The March 31, 2026 release of Ollama 0.19 is a game-changer for this workflow. Here is why it matters specifically for Claude Code with local models.
Faster First Response
Prefill speed jumped from 1,154 to 1,810 tok/s. When Claude Code sends your project context to the model, it processes that context 57% faster. You wait less for the first token of every response.
Faster Code Generation
Decode speed nearly doubled — from 58 to 112 tok/s. When the model writes a 200-line function, it now takes roughly half the time. Over a full coding session, this adds up to minutes saved.
Smarter Caching for Coding Workflows
Ollama 0.19 takes intelligent cache snapshots for branching prompts. Claude Code sessions are full of branching — "try this approach, no wait, try that instead." The new cache reuses previous computation instead of reprocessing everything from scratch.
How to Enable MLX
If you have 32GB+ unified memory and run Qwen3.5 models, MLX activates automatically in Ollama 0.19. No configuration needed. Just update:
brew upgrade ollama
ollama serve
For Macs with 16GB or less, you still get the improved llama.cpp backend. Not as fast as MLX, but still faster than 0.18.
RAM Reality Check
This matters more than anything else for Mac users. As @pinkham warned: "Dont expect to run Chrome, VS Code, Slack and Zoom while doing so."
Here's the real math. macOS itself uses 4-6 GB of RAM. Every browser tab, editor, and communication app adds more. Whatever is left over is what the model weights and context cache have to fit into.
| Your Mac | Available for Model | Recommended Model | What You'll Close |
|---|---|---|---|
| 8GB MacBook Air | ~3-4 GB | qwen3.5:4b | Everything except Terminal |
| 16GB MacBook Air | ~8-10 GB | qwen3.5:9b | Browser tabs, Slack |
| 24GB Mac Mini/Pro | ~16-18 GB | qwen3.5:27b | Maybe Chrome |
| 36GB+ Mac Pro | ~28+ GB | qwen3.5:27b + apps | Nothing |
The Mac Mini M4 Pro with 64GB is the sweet spot for running 27B models alongside your normal workflow. At $1,999-$2,499, it pays for itself in about 20 months compared to Claude Max at $100/month. For a complete breakdown of which models run best on each configuration, see our MacBook Air M4 16GB guide and MacBook Pro M4 Pro 24GB guide.
How Much Money Does This Actually Save?
The cost comparison is straightforward:
| Option | Monthly Cost | Annual Cost | What You Get |
|---|---|---|---|
| Claude Max | $100/month | $1,200/year | Full Opus 4.6, unlimited usage |
| Claude Pro | $20/month | $240/year | Opus with usage limits |
| Claude API | Variable | $50-$500+/year | Pay per token, unpredictable |
| Ollama local | ~$2/month electricity | ~$24/year | Free inference, local models |
As @bygregorr pointed out: this "eliminates unpredictable API costs for solo devs." If you're burning through $50-$200/month in API credits for coding assistance, switching to local models for routine tasks and saving the API budget for complex ones is a legitimate strategy.
Pro tip: Use claude-local for routine refactoring, file edits, and code explanation. Switch to the real Claude API for complex multi-file agentic tasks. This hybrid approach can cut your monthly bill by 60-80%.
Advanced: LiteLLM Proxy for Model Routing
For power users, LiteLLM lets you map different Claude model names to different local models. This means Claude Code can automatically use a small model for simple queries and a large one for complex tasks.
pip install litellm
Create a litellm_config.yaml:
model_list:
- model_name: claude-opus-4-6-20250915
litellm_params:
model: ollama/qwen3.5:27b
api_base: http://localhost:11434
- model_name: claude-sonnet-4-6-20250514
litellm_params:
model: ollama/qwen3.5:9b
api_base: http://localhost:11434
- model_name: claude-haiku-4-5-20251001
litellm_params:
model: ollama/qwen3.5:4b
api_base: http://localhost:11434
Start the proxy:
litellm --config litellm_config.yaml --port 4000
Then point Claude Code at LiteLLM instead of Ollama directly:
ANTHROPIC_BASE_URL=http://localhost:4000 \
ANTHROPIC_AUTH_TOKEN=local \
claude
This adds a layer of complexity but gives you model routing without changing your Claude Code workflow.
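Conceptually, the YAML above is just a name-to-model lookup. Here is a plain-shell equivalent for sanity-checking which local model a given Claude model name would route to; the `route` helper is illustrative only, not part of LiteLLM.

```shell
# Mirror of the litellm_config.yaml routing, as a case lookup.
route() {
  case "$1" in
    claude-opus-*)   echo "ollama/qwen3.5:27b" ;;
    claude-sonnet-*) echo "ollama/qwen3.5:9b"  ;;
    claude-haiku-*)  echo "ollama/qwen3.5:4b"  ;;
    *)               echo "unmapped: $1" ;;
  esac
}
route claude-sonnet-4-6-20250514   # prints ollama/qwen3.5:9b
```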
Alternatives Worth Knowing
Claude Code with Ollama isn't the only option. Other tools work with local models natively:
- OpenCode — Open-source Claude Code alternative with native Ollama support. No environment variable hacks needed.
- Aider — AI coding agent with direct Ollama integration and strong git awareness.
- LM Studio — Added an Anthropic-compatible /v1/messages endpoint in v0.4.1; works as a drop-in Ollama alternative for Claude Code.
- Bifrost — Lightweight API proxy as an alternative to LiteLLM.
Each has trade-offs. Claude Code has the most polished CLI experience. Aider has better git integration. OpenCode is fully open-source. LM Studio offers a GUI for model management.
Security Considerations
@tasa2379 pointed out that most guides skip security hardening. Fair criticism. A few things to lock down:
1. Bind Ollama to localhost only — The default ollama serve already binds to 127.0.0.1:11434. Don't change this unless you know what you're doing.
2. Don't expose port 11434 to your network — No need for external access if Claude Code runs on the same machine.
3. Review model permissions — Claude Code can read and modify files in your project directory. Local doesn't mean safer if the model hallucinates a destructive command.
4. Keep Ollama updated — brew upgrade ollama regularly for security patches.
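Point 1 can be checked from the shell. Ollama reads its bind address from the `OLLAMA_HOST` environment variable and defaults to loopback; this small guard (our helper, not an Ollama command) warns when the binding has been widened.

```shell
# Warn if OLLAMA_HOST has been widened beyond loopback.
# ollama serve defaults to 127.0.0.1:11434 when it is unset.
check_ollama_bind() {
  case "${OLLAMA_HOST:-127.0.0.1}" in
    *0.0.0.0*) echo "WARNING: Ollama listening on all interfaces" ;;
    *)         echo "OK: bound to ${OLLAMA_HOST:-127.0.0.1}" ;;
  esac
}
check_ollama_bind
```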
FAQ
Can I run this on a Mac Mini?
Yes. The Mac Mini M4 is the best value local AI machine in 2026. The 24GB model runs qwen3.5:27b comfortably. The 16GB base runs qwen3.5:4b or qwen3.5:9b.
Is the local version as good as real Claude Code?
No. You get Claude Code's CLI, file management, and tooling — but the model intelligence depends on what you run locally. A 4B model won't match Opus 4.6 for complex multi-step reasoning. For single-file edits and routine coding, local models handle 80%+ of tasks well.
Which model should I use for coding?
Qwen 3.5 models are the current best for local coding. Start with qwen3.5:4b if you have limited RAM. Move to qwen3.5:27b if you have 24GB+. The 9B variant is the sweet spot for 16GB MacBook Air users. GLM-4.7 is a strong alternative if you need a 128K context window.
Does Ollama 0.19 MLX work on all Macs?
The MLX preview requires 32GB+ unified memory and currently supports Qwen3.5 models. Macs with less than 32GB use the standard llama.cpp backend, which also received performance improvements in 0.19. MLX support for more models is planned.
Does this work on Intel Macs?
Technically yes, but performance will be poor. Ollama on Intel Macs uses CPU-only inference, which is 5-10x slower than Apple Silicon's GPU acceleration. We don't recommend it for practical use.
Can I switch between local and cloud models?
Yes. Use the alias approach — set claude-local for Ollama and keep the default claude command pointed at Anthropic's API. Use local for routine tasks and cloud for complex agentic work. This hybrid approach gives you the best of both worlds while keeping API costs low.
What about Claude Code's new features like computer use?
Claude Code's Q1 2026 features — computer use, multi-agent collaboration, auto mode — work with the cloud API. With local models, you get the core CLI features: file operations, git integration, code generation, and tool calling. The advanced agentic features require the full Claude API.
Related Model Families:
- Qwen Models — Best local coding models for Claude Code
- DeepSeek Models — Strong reasoning for complex tasks
- How to Install Ollama on Mac — Full setup from scratch
- Claude Code Local LLM Setup — Hardware-focused guide with GPU setups
- Ollama 0.17 Apple Silicon Benchmarks — Previous engine overhaul
- MacBook Air vs Pro for LLMs — Which Mac to buy for local AI
Have questions? Reach out on X/Twitter