2026-04-04

Run Claude Code Free: Ollama Local Setup in 4 Steps (2026)

Claude Code is Anthropic's AI coding agent — and you can run it locally with Ollama instead of paying $100/month for Claude Max. Since Ollama v0.14 shipped native Anthropic Messages API compatibility in January 2026, the setup is dead simple: two environment variables, one command. As of April 2026, Ollama 0.19 with MLX support makes this even better — local inference is now up to 2x faster on Apple Silicon.

TL;DR: Run ANTHROPIC_BASE_URL=http://localhost:11434 ANTHROPIC_AUTH_TOKEN=ollama claude --model qwen3.5:27b to use Claude Code with free local models. Works on any Apple Silicon Mac. Ollama 0.19 with MLX nearly doubles decode speed (58 to 112 tok/s). Best model depends on your RAM: 4B for 8GB, 9B for 16GB, 27B for 24GB+.

This setup went viral after @itsafiz shared it on X, pulling 51,000 views and 781 bookmarks in 48 hours (March 2026). The community response confirmed what many solo developers suspected: Claude Code's real value is in its CLI harness, not just the model behind it.

How Does Claude Code Work with Local Models?

Claude Code accepts two environment variables that redirect all API calls to any Anthropic-compatible endpoint. Since Ollama v0.14 (January 2026), Ollama natively exposes an Anthropic Messages API on localhost:11434 — no proxy needed, no translation layer. Just point and run.

The architecture is simple:

Claude Code CLI → localhost:11434 → Ollama (Anthropic API) → Local Model

Claude Code's file operations, git integration, and project context management all work the same. The only difference is which model generates the responses. Tool calling, multi-turn conversations, vision input, and extended thinking all work through this native API layer.
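For the curious, here is roughly what one of those calls looks like on the wire. This is a sketch: the request body follows Anthropic's public Messages API format, and the commented-out curl shows how you would send it once ollama serve is running. The endpoint path and headers are assumptions based on that API; the model name and prompt are placeholders.

```shell
# Build a request body in the Anthropic Messages format (placeholders throughout).
cat > /tmp/messages_req.json <<'EOF'
{
  "model": "qwen3.5:27b",
  "max_tokens": 512,
  "messages": [
    {"role": "user", "content": "Explain what this function does."}
  ]
}
EOF

# With `ollama serve` running, you would POST it like this (not executed here):
#   curl -s http://localhost:11434/v1/messages \
#     -H "x-api-key: ollama" \
#     -H "anthropic-version: 2023-06-01" \
#     -H "content-type: application/json" \
#     -d @/tmp/messages_req.json

# Sanity-check that the body is valid JSON and print the target model:
python3 -c "import json; print(json.load(open('/tmp/messages_req.json'))['model'])"
```

Claude Code builds these requests for you; the environment variables in Step 4 only change where they are sent.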

Step 1: Install Ollama (0.19+ Recommended)

Open Terminal and run:

brew install ollama

If you already have Ollama installed, update to 0.19 for MLX acceleration:

brew upgrade ollama

Start the Ollama server:

ollama serve

Verify it's running:

curl http://localhost:11434/api/tags

You should see a JSON response with your available models. If you just installed Ollama, the list will be empty — that's fine. For a full walkthrough, see our complete Ollama installation guide for Mac.
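If you want just the model names rather than the raw JSON, a short parse does it. This sketch works on a saved sample response so it runs offline; the sample mirrors the general shape of the tags response, and with the server up you would pipe the curl output through the same one-liner instead.

```shell
# Sample of the tags response shape (an illustration, not live server output).
cat > /tmp/tags.json <<'EOF'
{"models":[{"name":"qwen3.5:27b"},{"name":"qwen3.5:4b"}]}
EOF

# Print just the model names, one per line.
python3 -c "import json; [print(m['name']) for m in json.load(open('/tmp/tags.json'))['models']]"
```

Against a live server, the equivalent is `curl -s http://localhost:11434/api/tags | python3 -c "import json,sys; [print(m['name']) for m in json.load(sys.stdin)['models']]"`.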

What's New in Ollama 0.19 (March 2026)

Ollama 0.19 is the biggest Mac performance update yet. It rebuilds Apple Silicon inference on top of Apple's MLX framework, taking full advantage of unified memory. The results (Ollama blog, March 31, 2026):

Metric          Ollama 0.18   Ollama 0.19 (MLX)   Improvement
Prefill speed   1,154 tok/s   1,810 tok/s         +57%
Decode speed    58 tok/s      112 tok/s           +93%

That's nearly 2x faster response generation. On M5 chips, Ollama also leverages the new GPU Neural Accelerators for even bigger gains. The improved cache system reuses data across conversations to lower memory use and speed up prompt processing — a big win for coding workflows with branching prompts.
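The percentages in the table follow directly from the raw throughput numbers; a quick check:

```shell
# Recompute the 0.18 -> 0.19 improvements from the benchmark figures above.
awk 'BEGIN {
  printf "prefill: +%.0f%%\n", (1810 - 1154) / 1154 * 100;  # 1,154 -> 1,810 tok/s
  printf "decode:  +%.0f%%\n", (112 - 58) / 58 * 100;       # 58 -> 112 tok/s
}'
# prefill: +57%
# decode:  +93%
```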

For more Ollama performance details, see our Ollama 0.17 Apple Silicon benchmarks article, which covers the previous inference engine overhaul.

Note: MLX preview currently requires 32GB+ unified memory and supports Qwen3.5 models. Support for more models is rolling out. Macs with less than 32GB still get the standard llama.cpp backend, which also improved in 0.19.

Step 2: Pull a Coding Model

Which model you pull depends on how much RAM your Mac has. Ollama downloads models on first pull and caches them locally.

RAM-to-Model Guide (April 2026)

Model             Download Size   RAM Needed     Best For                      Mac Compatibility
qwen3.5:4b        ~2.5 GB         ~4 GB total    Quick fixes, simple tasks     Any Mac (8GB+)
qwen3.5:9b        ~5 GB           ~7 GB total    General coding                MacBook Air M4 16GB
qwen3.5:27b       ~16 GB          ~20 GB total   Complex refactoring           MacBook Pro 24GB+
glm-4.7:9b        ~5.5 GB         ~8 GB total    Fast + large context (128K)   MacBook Air 16GB
qwen3.5:35b-a3b   ~22 GB          ~22 GB total   Best MoE efficiency           Mac Mini/Pro 24GB+

Pull your model:

# For 8GB Macs — small but surprisingly capable
ollama pull qwen3.5:4b

# For 16GB MacBook Air — the sweet spot
ollama pull qwen3.5:9b

# For 24GB+ Mac — best local coding quality
ollama pull qwen3.5:27b

# Alternative: GLM 4.7 for speed + large context
ollama pull glm-4.7:9b

Qwen 3.5 4B matches GPT-4o on independent testing with a 49.9% win rate across 1,000 real-world prompts (N8Programs, March 2026). Even the smallest model here is genuinely useful. For a deeper comparison between model families, read our DeepSeek V3 vs Qwen 3.5 Mac comparison.
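The RAM table above collapses into a tiny helper. This is a sketch: the thresholds follow the table's recommendations and are rules of thumb, not hard limits.

```shell
# Suggest a model tier from installed RAM in GB (thresholds per the table above).
pick_model() {
  local ram_gb=$1
  if   [ "$ram_gb" -ge 24 ]; then echo "qwen3.5:27b"
  elif [ "$ram_gb" -ge 16 ]; then echo "qwen3.5:9b"
  else                            echo "qwen3.5:4b"
  fi
}

pick_model 8    # -> qwen3.5:4b
pick_model 16   # -> qwen3.5:9b
pick_model 24   # -> qwen3.5:27b
```

On a Mac you could feed it the real figure with `pick_model $(( $(sysctl -n hw.memsize) / 1024 / 1024 / 1024 ))`.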

Step 3: Install Claude Code

You need Node.js 18+ installed. Then:

npm install -g @anthropic-ai/claude-code

Verify the installation:

claude --version

Claude Code gets frequent updates — multi-agent collaboration, computer use, and auto mode all shipped in Q1 2026. Keep it current with npm update -g @anthropic-ai/claude-code.

Step 4: Run Claude Code with Your Local Model

This is the one-liner that makes it all work:

ANTHROPIC_BASE_URL=http://localhost:11434 \
ANTHROPIC_AUTH_TOKEN=ollama \
claude --model qwen3.5:27b

Replace qwen3.5:27b with whichever model you pulled in Step 2.

Make It Permanent

Add this to your ~/.zshrc so you don't have to type it every time:

# Claude Code with local Ollama
alias claude-local='ANTHROPIC_BASE_URL=http://localhost:11434 ANTHROPIC_AUTH_TOKEN=ollama claude --model qwen3.5:27b'

Then reload your shell:

source ~/.zshrc

Now just type claude-local to start a session.
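If you sometimes want a different model without editing the alias, a shell function is a small step up. Same environment variables as above, just parameterized; a sketch:

```shell
# Like the alias, but the first argument (if any) selects the model.
claude-local() {
  local model="${1:-qwen3.5:27b}"   # default when called with no arguments
  if [ $# -gt 0 ]; then shift; fi   # remaining args pass through to claude
  ANTHROPIC_BASE_URL=http://localhost:11434 \
  ANTHROPIC_AUTH_TOKEN=ollama \
  claude --model "$model" "$@"
}
```

`claude-local` starts with the 27B default; `claude-local qwen3.5:9b` switches models for one session.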

What Works and What Doesn't?

Be honest with yourself about what local models can and cannot do. Claude Code's CLI is excellent, but the model quality gap is real. See our local LLMs vs cloud flagships benchmark for detailed numbers.

Works Well

  • File reading and context — Claude Code indexes your project the same way regardless of backend
  • Code generation — Single-file functions, components, utilities
  • Refactoring — Renaming, restructuring, pattern application
  • Git operations — Commits, diffs, branch management
  • Debugging — Reading error messages and suggesting fixes
  • Explaining code — Summarizing what files and functions do

Works Poorly with Small Models

  • Multi-step agentic tasks — Local 4B-9B models lack the reasoning depth for complex chains of edits across multiple files
  • Tool calling reliability — Varies significantly by model; smaller models fail more often
  • Large context windows — 27B models handle this better, but 4B models struggle past 8K tokens
  • Architectural decisions — Don't expect a 4B model to design your system architecture

As @debdoot_x noted: "So basically running Claude Code without a Claude model. Funny." It's a fair point. You're getting Claude Code's interface and tooling with a local model's intelligence. For many tasks, that's enough.

Ollama 0.19 MLX: What It Means for Claude Code Users

The March 31, 2026 release of Ollama 0.19 is a game-changer for this workflow. Here is why it matters specifically for Claude Code with local models.

Faster First Response

Prefill speed jumped from 1,154 to 1,810 tok/s. When Claude Code sends your project context to the model, it processes that context 57% faster. You wait less for the first token of every response.

Faster Code Generation

Decode speed nearly doubled — from 58 to 112 tok/s. When the model writes a 200-line function, it now takes roughly half the time. Over a full coding session, this adds up to minutes saved.
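To put that in coding terms, assume roughly 10 tokens per line of code (an assumption; real tokenization varies by language and style), so a 200-line function is about 2,000 output tokens:

```shell
# Rough generation-time estimate for ~2,000 output tokens (200 lines x ~10 tok/line).
awk 'BEGIN {
  tokens = 2000
  printf "Ollama 0.18: %.0f s\n", tokens / 58    # pre-MLX decode speed
  printf "Ollama 0.19: %.0f s\n", tokens / 112   # MLX decode speed
}'
# Ollama 0.18: 34 s
# Ollama 0.19: 18 s
```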

Smarter Caching for Coding Workflows

Ollama 0.19 takes intelligent cache snapshots for branching prompts. Claude Code sessions are full of branching — "try this approach, no wait, try that instead." The new cache reuses previous computation instead of reprocessing everything from scratch.

How to Enable MLX

If you have 32GB+ unified memory and run Qwen3.5 models, MLX activates automatically in Ollama 0.19. No configuration needed. Just update:

brew upgrade ollama

ollama serve

For Macs with 16GB or less, you still get the improved llama.cpp backend. Not as fast as MLX, but still faster than 0.18.

RAM Reality Check

This matters more than anything else for Mac users. As @pinkham warned: "Dont expect to run Chrome, VS Code, Slack and Zoom while doing so."

Here's the real math. macOS itself uses 4-6 GB of RAM, and every browser tab, editor, and communication app adds more. Whatever is left is what the model weights and context have to fit into.

Your Mac            Available for Model   Recommended Model    What You'll Close
8GB MacBook Air     ~3-4 GB               qwen3.5:4b           Everything except Terminal
16GB MacBook Air    ~8-10 GB              qwen3.5:9b           Browser tabs, Slack
24GB Mac Mini/Pro   ~16-18 GB             qwen3.5:27b          Maybe Chrome
36GB+ Mac Pro       ~28+ GB               qwen3.5:27b + apps   Nothing

The Mac Mini M4 Pro with 64GB is the sweet spot for running 27B models alongside your normal workflow. At $1,999-$2,499, it pays for itself in about 20 months compared to Claude Max at $100/month. For a complete breakdown of which models run best on each configuration, see our MacBook Air M4 16GB guide and MacBook Pro M4 Pro 24GB guide.
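The payback estimate is simple division, using the low end of the price range:

```shell
# Months until a $1,999 Mac Mini offsets Claude Max at $100/month.
awk 'BEGIN { printf "payback: %.0f months\n", 1999 / 100 }'
# payback: 20 months
```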

How Much Money Does This Actually Save?

The cost comparison is straightforward:

Option         Monthly Cost            Annual Cost      What You Get
Claude Max     $100/month              $1,200/year      Full Opus 4.6, unlimited usage
Claude Pro     $20/month               $240/year        Opus with usage limits
Claude API     Variable                $50-$500+/year   Pay per token, unpredictable
Ollama local   ~$2/month electricity   ~$24/year        Free inference, local models

As @bygregorr pointed out: this "eliminates unpredictable API costs for solo devs." If you're burning through $50-$200/month in API credits for coding assistance, switching to local models for routine tasks and saving the API budget for complex ones is a legitimate strategy.

Pro tip: Use claude-local for routine refactoring, file edits, and code explanation. Switch to the real Claude API for complex multi-file agentic tasks. This hybrid approach can cut your monthly bill by 60-80%.

Advanced: LiteLLM Proxy for Model Routing

For power users, LiteLLM lets you map different Claude model names to different local models. This means Claude Code can automatically use a small model for simple queries and a large one for complex tasks.

pip install litellm

Create a litellm_config.yaml:

model_list:
  - model_name: claude-opus-4-6-20250915
    litellm_params:
      model: ollama/qwen3.5:27b
      api_base: http://localhost:11434
  - model_name: claude-sonnet-4-6-20250514
    litellm_params:
      model: ollama/qwen3.5:9b
      api_base: http://localhost:11434
  - model_name: claude-haiku-4-5-20251001
    litellm_params:
      model: ollama/qwen3.5:4b
      api_base: http://localhost:11434

Start the proxy:

litellm --config litellm_config.yaml --port 4000

Then point Claude Code at LiteLLM instead of Ollama directly:

ANTHROPIC_BASE_URL=http://localhost:4000 \
ANTHROPIC_AUTH_TOKEN=local \
claude

This adds a layer of complexity but gives you model routing without changing your Claude Code workflow.

Alternatives Worth Knowing

Claude Code with Ollama isn't the only option. Other tools work with local models natively:

  • OpenCode — Open-source Claude Code alternative with native Ollama support. No environment variable hacks needed.
  • Aider — AI coding agent with direct Ollama integration and strong git awareness.
  • LM Studio — Added Anthropic-compatible /v1/messages endpoint in v0.4.1, works as a drop-in Ollama alternative for Claude Code.
  • Bifrost — Lightweight API proxy as an alternative to LiteLLM.

Each has trade-offs. Claude Code has the most polished CLI experience. Aider has better git integration. OpenCode is fully open-source. LM Studio offers a GUI for model management.

Security Considerations

@tasa2379 pointed out that most guides skip security hardening. Fair criticism. A few things to lock down:

1. Bind Ollama to localhost only — The default ollama serve already binds to 127.0.0.1:11434. Don't change this unless you know what you're doing.

2. Don't expose port 11434 to your network — No need for external access if Claude Code runs on the same machine.

3. Review model permissions — Claude Code can read and modify files in your project directory. Local doesn't mean safer if the model hallucinates a destructive command.

4. Keep Ollama updated: run brew upgrade ollama regularly for security patches.

FAQ

Can I run this on a Mac Mini?

Yes. The Mac Mini M4 is the best value local AI machine in 2026. The 24GB model runs qwen3.5:27b comfortably. The 16GB base runs qwen3.5:4b or qwen3.5:9b.

Is the local version as good as real Claude Code?

No. You get Claude Code's CLI, file management, and tooling — but the model intelligence depends on what you run locally. A 4B model won't match Opus 4.6 for complex multi-step reasoning. For single-file edits and routine coding, local models handle 80%+ of tasks well.

Which model should I use for coding?

Qwen 3.5 models are the current best for local coding. Start with qwen3.5:4b if you have limited RAM. Move to qwen3.5:27b if you have 24GB+. The 9B variant is the sweet spot for 16GB MacBook Air users. GLM-4.7 is a strong alternative if you need a 128K context window.

Does Ollama 0.19 MLX work on all Macs?

The MLX preview requires 32GB+ unified memory and currently supports Qwen3.5 models. Macs with less than 32GB use the standard llama.cpp backend, which also received performance improvements in 0.19. MLX support for more models is planned.

Does this work on Intel Macs?

Technically yes, but performance will be poor. Ollama on Intel Macs uses CPU-only inference, which is 5-10x slower than Apple Silicon's GPU acceleration. We don't recommend it for practical use.

Can I switch between local and cloud models?

Yes. Use the alias approach — set claude-local for Ollama and keep the default claude command pointed at Anthropic's API. Use local for routine tasks and cloud for complex agentic work. This hybrid approach gives you the best of both worlds while keeping API costs low.

What about Claude Code's new features like computer use?

Claude Code's Q1 2026 features — computer use, multi-agent collaboration, auto mode — work with the cloud API. With local models, you get the core CLI features: file operations, git integration, code generation, and tool calling. The advanced agentic features require the full Claude API.


Have questions? Reach out on X/Twitter