2026-02-25

How to Run Claude Code on Local LLMs: The Complete Setup

Want to use Claude Code without Anthropic's API? Here's how @sudoingX hacked it to run on local Qwen models with impressive results.

The Hardware

Component | Specs
GPUs | 2x NVIDIA RTX 3090 (48GB VRAM total)
CPU | AMD EPYC 7K62 (96 cores, 192 threads)
RAM | 220GB DDR4
PCIe | Gen 4
Cost | ~$1,400

Why 3090s? They're the sweet spot for VRAM per dollar. Two cards give you 48GB — enough for large MoE models.

The Model: Qwen3-Coder-Next

Specifications:
  • Total params: 80B
  • Active params: 3B per token
  • Architecture: Mixture of Experts (512 experts, 10 active)
  • Context: 256K tokens
  • Quantization: Q4_K_M (45.19 GiB)

The magic is in the sparse activation. You're not computing 80B parameters every token — only 3B. This shifts the bottleneck from compute to memory bandwidth.
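A back-of-envelope calculation makes the bandwidth argument concrete. The parameter counts and quant size come from the article; the ~936 GB/s figure is the RTX 3090's rated memory bandwidth. This ignores activations, KV-cache reads, and the CPU-offloaded experts, which is why measured speeds land well below the ceiling:

```python
# Figures from the article: 80B total params, 3B active per token,
# Q4_K_M quant weighing 45.19 GiB on disk.
TOTAL_PARAMS = 80e9
ACTIVE_PARAMS = 3e9

bytes_per_param = 45.19 * 2**30 / TOTAL_PARAMS   # ~0.61 bytes/weight at Q4_K_M
bytes_per_token = ACTIVE_PARAMS * bytes_per_param

# Weight reads alone cap generation speed at bandwidth / bytes-per-token.
BANDWIDTH = 936e9  # RTX 3090 rated memory bandwidth, bytes/s
ceiling = BANDWIDTH / bytes_per_token

print(f"{bytes_per_token / 2**30:.2f} GiB of weights read per token")
print(f"~{ceiling:.0f} tok/s theoretical per-GPU ceiling")
```

A dense 80B model would read ~27x more weight bytes per token, which is the whole case for MoE on consumer cards.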

The Stack

Claude Code (Anthropic API) → LiteLLM (translation) → llama-server (inference) → 2x RTX 3090

1. llama-server

Exposes an OpenAI-compatible API locally. This is your inference engine.

2. LiteLLM

The translator. Converts Anthropic's Messages API format to OpenAI's Chat Completions format.
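The core of that translation is simple to sketch. This is an illustrative toy, not LiteLLM's actual internals: Anthropic's Messages API carries the system prompt in a top-level "system" field, while OpenAI's Chat Completions API expects it as the first message in the list.

```python
# Minimal sketch of the Anthropic -> OpenAI request translation
# (illustrative only; LiteLLM handles many more fields and edge cases).
def anthropic_to_openai(payload: dict) -> dict:
    messages = []
    if "system" in payload:
        # Anthropic: top-level system field. OpenAI: first chat message.
        messages.append({"role": "system", "content": payload["system"]})
    messages.extend(payload.get("messages", []))
    return {
        "model": payload["model"],  # remapped to the local model via config
        "messages": messages,
        "max_tokens": payload.get("max_tokens", 1024),
    }

req = {
    "model": "claude-sonnet-4-6",
    "system": "You are Claude Code.",
    "max_tokens": 512,
    "messages": [{"role": "user", "content": "Hello"}],
}
print(anthropic_to_openai(req))
```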

3. The Environment Hack

ANTHROPIC_BASE_URL=http://localhost:4000 \
ANTHROPIC_AUTH_TOKEN=local \
claude

Critical: Use ANTHROPIC_AUTH_TOKEN (not ANTHROPIC_API_KEY) to bypass Anthropic's server validation.

What It Built

Test 1: Particle Simulation

  • Single prompt: "Build an interactive particle simulation"
  • Result: 564 lines, physics engine, mouse gravity, collision detection
  • Iterations: Added trails, explosions, gravity wells, bloom effects
  • Status: Worked on first run

Test 2: Benchmark CLI Tool

  • Spec: Full benchmark harness for LLM endpoints
  • Result: 13 files, modular architecture, Rich TUI
  • Features: Speed benchmarks, coding tests, GPU monitoring
  • Meta moment: The model benchmarked itself with the tool it built
  • Score: 7/7 tests passed

The Speed Journey

Config | Speed | Notes
Single 3090 | 1.3 tok/s | CPU offloading, unusable
2x 3090 (ngl 21) | 21.7 tok/s | Baseline
2x 3090 (ngl 48) | 46.2 tok/s | Full GPU layers
2x 3090 (-ot flag) | 79 tok/s | Expert offloading optimization

The -ot breakthrough: it selectively moves MoE expert FFN weights to CPU while keeping attention layers on GPU. Since only 10 of 512 experts activate per token, the CPU handles minimal compute.
# The winning config
-ot "blk.[2][0-3].ffn_.exps=CPU"
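The arithmetic behind "minimal compute" is worth spelling out. Using the article's expert counts (and ignoring layer-by-layer layout details), the CPU only ever touches a small slice of the weights parked in its RAM on any given token:

```python
# Why the CPU side of expert offloading stays cheap: per token, only
# 10 of the 512 experts fire (figures from the article).
EXPERTS_TOTAL = 512
EXPERTS_ACTIVE = 10

active_fraction = EXPERTS_ACTIVE / EXPERTS_TOTAL
print(f"{active_fraction:.1%} of offloaded expert weights used per token")
```

So the GPU keeps the dense, always-hot attention layers, while the CPU serves a ~2% random sample of the expert FFNs each step.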

The Gotchas

Context Window

Claude Code's system prompt is ~17,500 tokens, so you need a minimum of 32K context. Run 128K if you want deep file reading.

Model Name Mapping

Claude Code sends requests to multiple model names:

  • claude-sonnet-4-6
  • claude-haiku-4-5-20251001
  • Others depending on task

Map all of them in LiteLLM's config or requests will 404.
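A LiteLLM config.yaml along these lines covers the mapping. The port, backing model alias, and api_key placeholder here are assumptions for illustration; adjust to wherever your llama-server is listening:

```yaml
# Sketch of a LiteLLM config.yaml mapping Claude Code's model names
# to one local llama-server endpoint (port and alias are assumptions).
model_list:
  - model_name: claude-sonnet-4-6
    litellm_params:
      model: openai/qwen3-coder-next
      api_base: http://localhost:8080/v1
      api_key: none
  - model_name: claude-haiku-4-5-20251001
    litellm_params:
      model: openai/qwen3-coder-next
      api_base: http://localhost:8080/v1
      api_key: none
```

Every model name Claude Code might request needs an entry pointing at the same local backend, or those requests will 404.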

Node.js Version

Claude Code needs Node.js v20+. The default v12 on many Linux containers causes silent SyntaxErrors.

The Auth Token Trick

  • ANTHROPIC_API_KEY → Validates against real servers (fails)
  • ANTHROPIC_AUTH_TOKEN → Bypasses validation (works)

vLLM vs llama.cpp: Why Simpler Won

Engine | Result | Why
vLLM | ❌ OOM | Marlin MoE repack needs a 256MB temp buffer
SGLang | ❌ OOM | Same Marlin kernel, worse memory usage
llama.cpp | ✅ Works | Loads GGUF directly, no conversion

At 96% VRAM utilization, that 256MB buffer doesn't exist. llama.cpp's simplicity wins.

Prefill vs Generation: The Hidden Bottleneck

Two nodes, different stories:

Node | Generation | Prefill | PCIe Gen
Xeon E5 | 46.2 tok/s | 146.6 tok/s | Gen 3
EPYC 7K62 | 32.75 tok/s | 1,072 tok/s | Gen 4
Prefill (prompt ingestion) is bandwidth-bound. Generation is compute-bound. With 17K+ token system prompts hitting every turn, prefill speed dominates wall time.
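You can see the effect with simple wall-time arithmetic on the measured rates. The ~500-token reply length is an assumption for illustration, and this pessimistically assumes the whole prompt is re-processed each turn (prefix caching would soften it):

```python
# Per-turn wall time at the article's measured rates, assuming the full
# ~17,500-token system prompt is ingested every turn and a ~500-token
# reply (reply length is an assumption).
PROMPT_TOKENS = 17_500
REPLY_TOKENS = 500

def turn_seconds(prefill_tps: float, gen_tps: float) -> float:
    return PROMPT_TOKENS / prefill_tps + REPLY_TOKENS / gen_tps

xeon = turn_seconds(146.6, 46.2)    # Gen 3 node
epyc = turn_seconds(1072.0, 32.75)  # Gen 4 node
print(f"Xeon E5:   {xeon:.1f} s/turn")
print(f"EPYC 7K62: {epyc:.1f} s/turn")
```

The EPYC node generates tokens more slowly, yet finishes each turn roughly 4x faster, because almost all the time goes to ingesting the prompt.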

The Verdict

Is the local inference stack ready for real agent work? Yes.

  • Models are good enough (Qwen3-Coder-Next rivals Claude Sonnet on coding)
  • Hardware is cheap enough (2x 3090s for ~$1,400)
  • Tooling works if you know where to push
  • The optimization surface is barely explored

The local stack isn't coming. It's here.

Related: See our DeepSeek-V3 vs Qwen 3.5 comparison for Mac-specific benchmarks, or check the best models for MacBook Pro. New to Ollama? Start with our setup guide.

Frequently Asked Questions

Can I run Claude Code on a Mac with Apple Silicon?

Yes, but the approach differs from the NVIDIA setup. On a Mac Studio with 128GB RAM, you can run Qwen3-Coder-Next via Ollama with llama.cpp as the backend. Performance will be lower than dual RTX 3090s but usable for smaller models. See our MacBook Pro recommendations for optimal models.

What is the minimum hardware for running Claude Code locally?

You need at least 48GB of VRAM (or unified memory on Mac) for the Qwen3-Coder-Next model. Two RTX 3090s ($1,400 total) provide 48GB VRAM. On Mac, a Mac Studio M2 Ultra with 64GB or more can run smaller coding models that work with Claude Code.

Why use LiteLLM instead of calling llama-server directly?

Claude Code sends requests in Anthropic's Messages API format. LiteLLM translates these to OpenAI's Chat Completions format that llama-server understands. Without this translation layer, Claude Code can't communicate with local models.

What speed should I expect for local Claude Code?

With optimized expert offloading (-ot flag), dual RTX 3090s achieve 79 tokens per second. On Apple Silicon, expect 15-40 tok/s depending on model size and RAM. For practical coding work, 20+ tok/s is the minimum comfortable speed.

Is the quality comparable to Claude Sonnet?

Qwen3-Coder-Next scored 7/7 on coding tests and generated production-quality code including a 564-line particle simulation and a modular benchmark CLI tool. Quality approaches Claude Sonnet for coding tasks, though cloud models still lead on complex multi-step reasoning.

---

Based on the excellent thread by @sudoingX. Follow him for more local LLM experiments. Try modelfit.io to find the best local models for your exact hardware setup.

Have questions? Reach out on X/Twitter