2026-02-25
How to Run Claude Code on Local LLMs: The Complete Setup
The Hardware
| Component | Specs |
|---|---|
| GPUs | 2x NVIDIA RTX 3090 (48GB VRAM total) |
| CPU | AMD EPYC 7K62 (96 cores, 192 threads) |
| RAM | 220GB DDR4 |
| PCIe | Gen 4 |
| Cost | ~$1,400 |
The Model: Qwen3-Coder-Next
Specifications:
- Total params: 80B
- Active params: 3B per token
- Architecture: Mixture of Experts (512 experts, 10 active)
- Context: 256K tokens
- Quantization: Q4_K_M (45.19 GiB)
The magic is in the sparse activation. You're not computing 80B parameters every token — only 3B. This shifts the bottleneck from compute to memory bandwidth.
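A rough back-of-envelope makes the point concrete (assumed numbers: ~4.5 effective bits per weight for Q4_K_M, which is consistent with the 45 GiB file size; these are estimates, not measurements):

```shell
# Sketch: weight bytes read per generated token with 3B active params
# at ~4.5 bits/weight (3e9 * 4.5 / 8 bytes).
bytes_per_token=$(( 3 * 1000000000 * 45 / 10 / 8 ))   # ~1.69 GB of weights per token
echo "sparse (3B active): $bytes_per_token bytes/token"

# A dense 80B model at the same quantization would read ~27x more:
dense_bytes=$(( 80 * 1000000000 * 45 / 10 / 8 ))      # ~45 GB per token
echo "dense (80B):        $dense_bytes bytes/token"
```

At roughly 900 GB/s of memory bandwidth on a single 3090, ~1.7 GB per token leaves plenty of theoretical headroom; the dense equivalent would be bandwidth-starved immediately.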
The Stack
Claude Code   →   LiteLLM    →   llama-server   →   2x RTX 3090
(Anthropic API)  (Translation)   (Inference)
1. llama-server
Exposes an OpenAI-compatible API locally. This is your inference engine.
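A minimal launch might look like this (the model path and port are placeholders; `-ngl`, `-c`, and `--tensor-split` are standard llama.cpp flags):

```shell
# Illustrative llama-server launch; adjust the model path to your download.
# -ngl 48          : offload all layers to GPU (matches the speed table below)
# -c 32768         : minimum context for Claude Code's ~17.5K-token system prompt
# --tensor-split   : spread weights evenly across the two 3090s
llama-server -m ./Qwen3-Coder-Next-Q4_K_M.gguf \
  -ngl 48 -c 32768 --tensor-split 1,1 \
  --host 127.0.0.1 --port 8080
```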
2. LiteLLM
The translator. Converts Anthropic's Messages API format to OpenAI's Chat Completions format.
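One quick way to stand up the translation layer is LiteLLM's proxy CLI pointed at the local OpenAI-compatible endpoint (the `openai/local` alias and port numbers here are illustrative):

```shell
pip install 'litellm[proxy]'
# Proxy any OpenAI-compatible backend; "openai/local" is an arbitrary alias.
litellm --model openai/local \
  --api_base http://localhost:8080/v1 \
  --port 4000
```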
3. The Environment Hack
```shell
ANTHROPIC_BASE_URL=http://localhost:4000 \
ANTHROPIC_AUTH_TOKEN=local \
claude
```
Critical: Use ANTHROPIC_AUTH_TOKEN (not ANTHROPIC_API_KEY) to bypass Anthropic's server validation.
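Before launching Claude Code, you can sanity-check the chain with curl (this assumes your LiteLLM version exposes the Anthropic-style `/v1/messages` passthrough; older versions may only serve `/chat/completions`):

```shell
curl -s http://localhost:4000/v1/messages \
  -H "x-api-key: local" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{"model": "claude-sonnet-4-6", "max_tokens": 64,
       "messages": [{"role": "user", "content": "Say ok."}]}'
```

If this returns a JSON message body instead of an auth error, the full path is working.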
What It Built
Test 1: Particle Simulation
- Single prompt: "Build an interactive particle simulation"
- Result: 564 lines, physics engine, mouse gravity, collision detection
- Iterations: Added trails, explosions, gravity wells, bloom effects
- Status: Worked on first run
Test 2: Benchmark CLI Tool
- Spec: Full benchmark harness for LLM endpoints
- Result: 13 files, modular architecture, Rich TUI
- Features: Speed benchmarks, coding tests, GPU monitoring
- Meta moment: The model benchmarked itself with the tool it built
- Score: 7/7 tests passed
The Speed Journey
| Config | Speed | Notes |
|---|---|---|
| Single 3090 | 1.3 tok/s | CPU offloading, unusable |
| 2x 3090 (ngl 21) | 21.7 tok/s | Baseline |
| 2x 3090 (ngl 48) | 46.2 tok/s | Full GPU layers |
| 2x 3090 (-ot flag) | 79 tok/s | Expert offloading optimization |
```shell
# The winning config: pin the expert FFN tensors of layers 20-23 to CPU
-ot "blk.[2][0-3].ffn_.*_exps=CPU"
```
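The `-ot` argument is a regex matched against GGUF tensor names. You can check exactly which tensors a pattern captures with grep before committing to a long model load (the tensor names below are illustrative GGUF-style names):

```shell
# [2][0-3] matches layer numbers 20-23; ffn_.*_exps matches expert FFN tensors.
# Prints the two matching names (layers 20 and 23); 19 and 24 fall outside the range.
printf '%s\n' blk.19.ffn_down_exps.weight blk.20.ffn_down_exps.weight \
              blk.23.ffn_gate_exps.weight blk.24.ffn_up_exps.weight \
  | grep -E 'blk.[2][0-3].ffn_.*_exps'
```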
The Gotchas
Context Window
Claude Code's system prompt is ~17,500 tokens. You need minimum 32K context. Run 128K for deep file reading.
Model Name Mapping
Claude Code sends requests to multiple model names:
- claude-sonnet-4-6
- claude-haiku-4-5-20251001
- Others, depending on the task
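In LiteLLM's proxy config you can alias every name Claude Code sends to the same local model (a sketch using LiteLLM's `model_list` schema; the alias and URL are placeholders):

```yaml
model_list:
  - model_name: claude-sonnet-4-6
    litellm_params:
      model: openai/local
      api_base: http://localhost:8080/v1
      api_key: none
  - model_name: claude-haiku-4-5-20251001
    litellm_params:
      model: openai/local
      api_base: http://localhost:8080/v1
      api_key: none
```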
Node.js Version
Claude Code needs Node.js v20+. The default v12 on many Linux containers causes silent SyntaxErrors.
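A small guard in your launch script catches this before the silent failure (the version parsing is the only logic here; the launch-script usage is hypothetical):

```shell
# Extract the major version from `node --version` output (e.g. v12.22.9 -> 12).
node_major() { printf '%s\n' "$1" | sed 's/^v\([0-9]*\).*/\1/'; }

# Hypothetical usage in a launch script:
#   [ "$(node_major "$(node --version)")" -ge 20 ] || { echo "need Node v20+" >&2; exit 1; }
node_major v12.22.9   # prints 12
```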
The Auth Token Trick
- ANTHROPIC_API_KEY → validated against Anthropic's real servers (fails)
- ANTHROPIC_AUTH_TOKEN → bypasses that validation (works)
vLLM vs llama.cpp: Why Simpler Won
| Engine | Result | Why |
|---|---|---|
| vLLM | ❌ OOM | Marlin MoE repack needs 256MB temp buffer |
| SGLang | ❌ OOM | Same Marlin kernel, worse memory usage |
| llama.cpp | ✅ Works | Loads GGUF directly, no conversion |
At 96% VRAM utilization, that 256MB buffer doesn't exist. llama.cpp's simplicity wins.
Prefill vs Generation: The Hidden Bottleneck
Two nodes, different stories:

| Node | Generation | Prefill | PCIe Gen |
|---|---|---|---|
| Xeon E5 | 46.2 tok/s | 146.6 tok/s | Gen 3 |
| EPYC 7K62 | 32.75 tok/s | 1,072 tok/s | Gen 4 |
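The prefill gap matters because Claude Code front-loads its ~17,500-token system prompt on every session; time-to-first-token is roughly prompt length divided by prefill rate:

```shell
# Seconds to prefill the ~17,500-token system prompt (integer math, rough):
echo "Xeon E5   (146 tok/s prefill): $(( 17500 / 146 )) s"    # ~119 s
echo "EPYC 7K62 (1072 tok/s prefill): $(( 17500 / 1072 )) s"  # ~16 s
```

Two minutes of dead air versus sixteen seconds is the difference between unusable and pleasant, even though the Xeon generates tokens faster once it finally starts.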
The Verdict
Is the local inference stack ready for real agent work? Yes.
- Models are good enough (Qwen3-Coder-Next rivals Claude Sonnet on coding)
- Hardware is cheap enough (2x 3090s for ~$1,400)
- Tooling works if you know where to push
- The optimization surface is barely explored
Frequently Asked Questions
Can I run Claude Code on a Mac with Apple Silicon?
Yes, but the approach differs from the NVIDIA setup. On a Mac Studio with 128GB RAM, you can run Qwen3-Coder-Next via Ollama with llama.cpp as the backend. Performance will be lower than dual RTX 3090s but usable for smaller models. See our MacBook Pro recommendations for optimal models.
What is the minimum hardware for running Claude Code locally?
You need at least 48GB of VRAM (or unified memory on Mac) for the Qwen3-Coder-Next model. Two RTX 3090s ($1,400 total) provide 48GB VRAM. On Mac, a Mac Studio M2 Ultra with 64GB or more can run smaller coding models that work with Claude Code.
Why use LiteLLM instead of calling llama-server directly?
Claude Code sends requests in Anthropic's Messages API format. LiteLLM translates these to OpenAI's Chat Completions format that llama-server understands. Without this translation layer, Claude Code can't communicate with local models.
What speed should I expect for local Claude Code?
With optimized expert offloading (-ot flag), dual RTX 3090s achieve 79 tokens per second. On Apple Silicon, expect 15-40 tok/s depending on model size and RAM. For practical coding work, 20+ tok/s is the minimum comfortable speed.
Is the quality comparable to Claude Sonnet?
Qwen3-Coder-Next scored 7/7 on coding tests and generated production-quality code including a 564-line particle simulation and a modular benchmark CLI tool. Quality approaches Claude Sonnet for coding tasks, though cloud models still lead on complex multi-step reasoning.
---
- Based on the excellent thread by @sudoingX. Follow him for more local LLM experiments.
- Try modelfit.io to find the best local models for your exact hardware setup.
- Have questions? Reach out on X/Twitter.