2026-02-25

How to Run Claude Code on Local LLMs: The Complete Setup

Want to use Claude Code without Anthropic's API? Here's how @sudoingX hacked it to run on local Qwen models with impressive results.

The Hardware

Component | Specs
GPUs | 2x NVIDIA RTX 3090 (48GB VRAM total)
CPU | AMD EPYC 7K62 (96 cores, 192 threads)
RAM | 220GB DDR4
PCIe | Gen 4
Cost | ~$1,400

Why 3090s? They're the sweet spot for VRAM per dollar. Two cards give you 48GB — enough for large MoE models.

The Model: Qwen3-Coder-Next

Specifications:
  • Total params: 80B
  • Active params: 3B per token
  • Architecture: Mixture of Experts (512 experts, 10 active)
  • Context: 256K tokens
  • Quantization: Q4_K_M (45.19 GiB)

The magic is in the sparse activation. You're not computing 80B parameters every token — only 3B. This shifts the bottleneck from compute to memory bandwidth.
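A back-of-envelope calculation makes the bandwidth argument concrete. The parameter counts and quant size come from the article; the ~936 GB/s figure is the RTX 3090's rated memory bandwidth. This ignores activations, KV-cache reads, and the CPU-offloaded experts, which is why measured speeds land well below the ceiling:

```python
# Figures from the article: 80B total params, 3B active per token,
# Q4_K_M quant weighing 45.19 GiB on disk.
TOTAL_PARAMS = 80e9
ACTIVE_PARAMS = 3e9

bytes_per_param = 45.19 * 2**30 / TOTAL_PARAMS   # ~0.61 bytes/weight at Q4_K_M
bytes_per_token = ACTIVE_PARAMS * bytes_per_param

# Weight reads alone cap generation speed at bandwidth / bytes-per-token.
BANDWIDTH = 936e9  # RTX 3090 rated memory bandwidth, bytes/s
ceiling = BANDWIDTH / bytes_per_token

print(f"{bytes_per_token / 2**30:.2f} GiB of weights read per token")
print(f"~{ceiling:.0f} tok/s theoretical per-GPU ceiling")
```

A dense 80B model would read ~27x more weight bytes per token, which is the whole case for MoE on consumer cards.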

The Stack

Claude Code (Anthropic API) → LiteLLM (translation) → llama-server (inference) → 2x RTX 3090

1. llama-server

Exposes an OpenAI-compatible API locally. This is your inference engine.

2. LiteLLM

The translator. Converts Anthropic's Messages API format to OpenAI's Chat Completions format.
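The core of that translation is simple to sketch. This is an illustrative toy, not LiteLLM's actual internals: Anthropic's Messages API carries the system prompt in a top-level "system" field, while OpenAI's Chat Completions API expects it as the first message in the list.

```python
# Minimal sketch of the Anthropic -> OpenAI request translation
# (illustrative only; LiteLLM handles many more fields and edge cases).
def anthropic_to_openai(payload: dict) -> dict:
    messages = []
    if "system" in payload:
        # Anthropic: top-level system field. OpenAI: first chat message.
        messages.append({"role": "system", "content": payload["system"]})
    messages.extend(payload.get("messages", []))
    return {
        "model": payload["model"],  # remapped to the local model via config
        "messages": messages,
        "max_tokens": payload.get("max_tokens", 1024),
    }

req = {
    "model": "claude-sonnet-4-6",
    "system": "You are Claude Code.",
    "max_tokens": 512,
    "messages": [{"role": "user", "content": "Hello"}],
}
print(anthropic_to_openai(req))
```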

3. The Environment Hack

ANTHROPIC_BASE_URL=http://localhost:4000 \
ANTHROPIC_AUTH_TOKEN=local \
claude

Critical: Use ANTHROPIC_AUTH_TOKEN (not ANTHROPIC_API_KEY) to bypass Anthropic's server validation.

What It Built

Test 1: Particle Simulation

  • Single prompt: "Build an interactive particle simulation"
  • Result: 564 lines, physics engine, mouse gravity, collision detection
  • Iterations: Added trails, explosions, gravity wells, bloom effects
  • Status: Worked on first run

Test 2: Benchmark CLI Tool

  • Spec: Full benchmark harness for LLM endpoints
  • Result: 13 files, modular architecture, Rich TUI
  • Features: Speed benchmarks, coding tests, GPU monitoring
  • Meta moment: The model benchmarked itself with the tool it built
  • Score: 7/7 tests passed

The Speed Journey

Config | Speed | Notes
Single 3090 | 1.3 tok/s | CPU offloading, unusable
2x 3090 (ngl 21) | 21.7 tok/s | Baseline
2x 3090 (ngl 48) | 46.2 tok/s | Full GPU layers
2x 3090 (-ot flag) | 79 tok/s | Expert offloading optimization

The -ot breakthrough: it selectively moves MoE expert FFN weights to CPU while keeping attention layers on GPU. Since only 10 of 512 experts activate per token, the CPU handles minimal compute.
# The winning config
-ot "blk.[2][0-3].ffn_.exps=CPU"
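The arithmetic behind "minimal compute" is worth spelling out. Using the article's expert counts (and ignoring layer-by-layer layout details), the CPU only ever touches a small slice of the weights parked in its RAM on any given token:

```python
# Why the CPU side of expert offloading stays cheap: per token, only
# 10 of the 512 experts fire (figures from the article).
EXPERTS_TOTAL = 512
EXPERTS_ACTIVE = 10

active_fraction = EXPERTS_ACTIVE / EXPERTS_TOTAL
print(f"{active_fraction:.1%} of offloaded expert weights used per token")
```

So the GPU keeps the dense, always-hot attention layers, while the CPU serves a ~2% random sample of the expert FFNs each step.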

The Gotchas

Context Window

Claude Code's system prompt is ~17,500 tokens, so you need a minimum of 32K context. Run 128K if you want deep file reading.

Model Name Mapping

Claude Code sends requests to multiple model names:

  • claude-sonnet-4-6
  • claude-haiku-4-5-20251001
  • Others depending on task

Map all of them in LiteLLM's config or requests will 404.
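A LiteLLM config.yaml along these lines covers the mapping. The port, backing model alias, and api_key placeholder here are assumptions for illustration; adjust to wherever your llama-server is listening:

```yaml
# Sketch of a LiteLLM config.yaml mapping Claude Code's model names
# to one local llama-server endpoint (port and alias are assumptions).
model_list:
  - model_name: claude-sonnet-4-6
    litellm_params:
      model: openai/qwen3-coder-next
      api_base: http://localhost:8080/v1
      api_key: none
  - model_name: claude-haiku-4-5-20251001
    litellm_params:
      model: openai/qwen3-coder-next
      api_base: http://localhost:8080/v1
      api_key: none
```

Every model name Claude Code might request needs an entry pointing at the same local backend, or those requests will 404.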

Node.js Version

Claude Code needs Node.js v20+. The default v12 on many Linux containers causes silent SyntaxErrors.

The Auth Token Trick

  • ANTHROPIC_API_KEY → Validates against real servers (fails)
  • ANTHROPIC_AUTH_TOKEN → Bypasses validation (works)

vLLM vs llama.cpp: Why Simpler Won

Engine | Result | Why
vLLM | ❌ OOM | Marlin MoE repack needs a 256MB temp buffer
SGLang | ❌ OOM | Same Marlin kernel, worse memory usage
llama.cpp | ✅ Works | Loads GGUF directly, no conversion

At 96% VRAM utilization, that 256MB buffer doesn't exist. llama.cpp's simplicity wins.

Prefill vs Generation: The Hidden Bottleneck

Two nodes, different stories:

Node | Generation | Prefill | PCIe Gen
Xeon E5 | 46.2 tok/s | 146.6 tok/s | Gen 3
EPYC 7K62 | 32.75 tok/s | 1,072 tok/s | Gen 4
Prefill (prompt ingestion) is bandwidth-bound. Generation is compute-bound. With 17K+ token system prompts hitting every turn, prefill speed dominates wall time.
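You can see the effect with simple wall-time arithmetic on the measured rates. The ~500-token reply length is an assumption for illustration, and this pessimistically assumes the whole prompt is re-processed each turn (prefix caching would soften it):

```python
# Per-turn wall time at the article's measured rates, assuming the full
# ~17,500-token system prompt is ingested every turn and a ~500-token
# reply (reply length is an assumption).
PROMPT_TOKENS = 17_500
REPLY_TOKENS = 500

def turn_seconds(prefill_tps: float, gen_tps: float) -> float:
    return PROMPT_TOKENS / prefill_tps + REPLY_TOKENS / gen_tps

xeon = turn_seconds(146.6, 46.2)    # Gen 3 node
epyc = turn_seconds(1072.0, 32.75)  # Gen 4 node
print(f"Xeon E5:   {xeon:.1f} s/turn")
print(f"EPYC 7K62: {epyc:.1f} s/turn")
```

The EPYC node generates tokens more slowly, yet finishes each turn roughly 4x faster, because almost all the time goes to ingesting the prompt.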

The Verdict

Is the local inference stack ready for real agent work? Yes.

  • Models are good enough (Qwen3-Coder-Next rivals Claude Sonnet on coding)
  • Hardware is cheap enough (2x 3090s for ~$1,400)
  • Tooling works if you know where to push
  • The optimization surface is barely explored

The local stack isn't coming. It's here.

Related: See our DeepSeek-V3 vs Qwen 3.5 comparison for Mac-specific benchmarks, or check the best models for MacBook Pro. New to Ollama? Start with our setup guide.

Frequently Asked Questions

Can I run Claude Code on a Mac with Apple Silicon?

Yes, but the approach differs from the NVIDIA setup. On a Mac Studio with 128GB RAM, you can run Qwen3-Coder-Next via Ollama with llama.cpp as the backend. Performance will be lower than dual RTX 3090s but usable for smaller models. See our MacBook Pro recommendations for optimal models.

What is the minimum hardware for running Claude Code locally?

You need at least 48GB of VRAM (or unified memory on Mac) for the Qwen3-Coder-Next model. Two RTX 3090s ($1,400 total) provide 48GB VRAM. On Mac, a Mac Studio M2 Ultra with 64GB or more can run smaller coding models that work with Claude Code.

Why use LiteLLM instead of calling llama-server directly?

Claude Code sends requests in Anthropic's Messages API format. LiteLLM translates these to OpenAI's Chat Completions format that llama-server understands. Without this translation layer, Claude Code can't communicate with local models.

What speed should I expect for local Claude Code?

With optimized expert offloading (-ot flag), dual RTX 3090s achieve 79 tokens per second. On Apple Silicon, expect 15-40 tok/s depending on model size and RAM. For practical coding work, 20+ tok/s is the minimum comfortable speed.

Is the quality comparable to Claude Sonnet?

Qwen3-Coder-Next scored 7/7 on coding tests and generated production-quality code including a 564-line particle simulation and a modular benchmark CLI tool. Quality approaches Claude Sonnet for coding tasks, though cloud models still lead on complex multi-step reasoning.

---

Based on the excellent thread by @sudoingX. Follow him for more local LLM experiments. Try modelfit.io to find the best local models for your exact hardware setup.

Have questions? Reach out on X/Twitter