LLM Hardware Requirements Calculator
Enter your RAM or GPU VRAM and see exactly which local AI models you can run, the single best pick, and how fast it will go.
What LLM can I run?
A local model at Q4 quantization needs roughly 0.6 GB of memory per billion parameters. ModelFit budgets about 70% of unified memory for the model on machines up to 32 GB, scaling to ~85% at 128 GB and above, and about 90% of a discrete GPU's VRAM. That means an 8 GB device runs models up to ~8B, a 16 GB device comfortably runs up to ~14B, 32 GB unlocks ~35B-class models, and 64 GB or more runs 70B-class models. Use the calculator below to size your exact hardware against 75 local models.
With 16 GB of unified memory, ModelFit budgets about 11 GB for the model and comfortably runs local LLMs up to ~14B parameters at Q4. The best single pick is Qwen3.5 9B Instruct.
Tokens-per-second figures are ModelFit estimates derived from memory bandwidth and model size, not measured benchmarks. Ollama commands are registry-verified.
Local LLM memory requirements by RAM tier
Every row is derived from ModelFit's catalog of 75 local models across 20 families. Click a tier for the full list.
| Memory | Model budget (~70-85%) | Max model size | Models that fit | Top pick |
|---|---|---|---|---|
| 8 GB | ~5.6 GB | ~8.3B params | 23 / 75 | LFM2.5 8B-A1B |
| 16 GB | ~11.2 GB | ~14B params | 37 / 75 | Qwen3.5 9B Instruct (Q8) |
| 24 GB | ~16.8 GB | ~27B params | 45 / 75 | Gemma 4 12B (Q8) |
| 32 GB | ~22.4 GB | ~35B params | 54 / 75 | Qwen3.6 35B-A3B |
| 48 GB | ~34.8 GB | ~46.7B params | 59 / 75 | Qwen3.6 27B (Q8) |
| 64 GB | ~48 GB | ~70B params | 64 / 75 | Qwen3.6 35B-A3B (Q8) |
| 96 GB | ~76.8 GB | ~122B params | 70 / 75 | Qwen3.5 122B-A10B Instruct |
| 128 GB | ~108.8 GB | ~122B params | 71 / 75 | Qwen3.5 122B-A10B Instruct |
Q4_K_M assumed. Fit and tok/s are ModelFit estimates from the dataset, not measured benchmarks. Updated 2026-07-02.
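The budget column above is consistent with a simple rule: 70% of unified memory up to 32 GB, rising in a straight line to 85% at 128 GB. Here is a minimal Python sketch under that assumption. The linear interpolation is inferred from the table values rather than taken from ModelFit's code, and the function name `unified_budget_gb` is hypothetical.

```python
def unified_budget_gb(total_gb: float) -> float:
    """Estimate the memory budget for a model on a unified-memory machine.

    Assumption (inferred from the table, not ModelFit's published code):
    the usable fraction is 70% up to 32 GB, 85% at 128 GB and above,
    and linearly interpolated in between.
    """
    if total_gb <= 32:
        fraction = 0.70
    elif total_gb >= 128:
        fraction = 0.85
    else:
        # Move from 0.70 at 32 GB to 0.85 at 128 GB in a straight line.
        fraction = 0.70 + (total_gb - 32) / (128 - 32) * 0.15
    return total_gb * fraction


# Print the budget for each memory tier listed in the table.
for gb in (8, 16, 24, 32, 48, 64, 96, 128):
    print(f"{gb:>3} GB -> ~{unified_budget_gb(gb):.1f} GB budget")
```

Under this assumption the loop matches the budget column for all eight tiers, for example ~34.8 GB at 48 GB and ~76.8 GB at 96 GB.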
Frequently asked questions
How much RAM do I need to run a local LLM?
At Q4 quantization a local LLM needs roughly 0.6 GB of memory per billion parameters, and ModelFit budgets ~70% of unified memory for the model up to 32 GB, scaling to ~85% at 128 GB and above. In practice 8 GB runs models up to ~8B, 16 GB comfortably runs up to ~14B, 32 GB unlocks ~35B-class models, and 64 GB or more runs 70B-class models.
What LLM can I run with my GPU VRAM?
VRAM is the hard ceiling for a discrete GPU, and about 90% of it is usable for model weights. 8 GB fits a 7-8B model, 12 GB fits 7-9B with more context, 16 GB reaches 14B, 24 GB runs 32B-class models, and 32 GB comfortably runs 32B with long context. Switch the calculator to GPU VRAM mode and enter your card memory to see the exact picks.
How does the calculator work?
Enter your memory amount and pick Apple unified memory or GPU VRAM. The calculator runs ModelFit’s recommendation engine in your browser: it sizes each model at ~0.6 GB per billion parameters, applies the memory budget for your hardware, and ranks the models that fit by quality and speed. Tokens per second are ModelFit estimates from memory bandwidth and model size, not measured benchmarks.
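The sizing and speed steps can be illustrated with a rough Python sketch. The 0.6 GB per billion parameters figure comes from the text above. Everything else is an assumption for illustration: the function names are hypothetical, the 120 GB/s bandwidth is a made-up example value, and the speed estimate uses the common rule of thumb that decoding a dense model is memory-bound (each token reads the weights once). ModelFit's actual engine is not shown here and likely reserves extra headroom, for context among other things, so treat these numbers as optimistic.

```python
GB_PER_BILLION_Q4 = 0.6  # rule of thumb quoted in the text


def weights_gb(params_billion: float) -> float:
    """Approximate Q4 weight size in GB (weights only, no context cache)."""
    return params_billion * GB_PER_BILLION_Q4


def fits(params_billion: float, budget_gb: float) -> bool:
    """True if the estimated weights fit inside the memory budget."""
    return weights_gb(params_billion) <= budget_gb


def est_tokens_per_sec(params_billion: float, bandwidth_gb_s: float) -> float:
    """Rough upper bound on decode speed for a dense model.

    Assumption: generation is memory-bound, so speed is about
    memory bandwidth divided by the bytes read per token.
    """
    return bandwidth_gb_s / weights_gb(params_billion)


budget = 11.2  # GB, the 16 GB unified-memory tier
for size in (9, 14, 32):
    print(f"{size}B fits in {budget} GB: {fits(size, budget)}")

# Hypothetical 120 GB/s memory bandwidth, 9B dense model.
print(f"~{est_tokens_per_sec(9, 120.0):.0f} tok/s (estimate)")
```

Ranking the models that fit by quality and speed would need per-model scores from the dataset, so that step is left out of the sketch.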
Can I combine two GPUs to run bigger local models?
Yes. Ollama and llama.cpp split model layers across cards automatically, so two or three GPUs pool their VRAM for fit: about 90% of the combined VRAM is usable for weights. Expect real throughput below a single card with the same total VRAM, because inter-GPU transfers add overhead and mixed cards run at the slower card’s pace. Switch the calculator to Multi-GPU rig mode to pick your exact cards and see which models fit, and use the Copy link button to share the setup.
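As a rough illustration of the pooled-VRAM fit check, here is a minimal Python sketch. It assumes the 90% usable fraction and the 0.6 GB per billion parameters rule quoted on this page. The function name and the example rig are hypothetical, and the sketch ignores how layers are actually divided between cards, so a model close to the limit may still not fit in practice.

```python
GB_PER_BILLION_Q4 = 0.6  # Q4 rule of thumb quoted on this page


def pooled_vram_budget_gb(cards_gb: list[float], usable_fraction: float = 0.90) -> float:
    """Estimate the VRAM usable for weights across all cards combined.

    Assumption: the usable share is a flat fraction of the summed VRAM.
    """
    return sum(cards_gb) * usable_fraction


rig = [24.0, 16.0]  # hypothetical rig: one 24 GB card plus one 16 GB card
budget = pooled_vram_budget_gb(rig)
print(f"Pooled budget: ~{budget:.1f} GB")

# Check a few model sizes (billions of parameters) against the pooled budget.
for size in (32, 70):
    needed = size * GB_PER_BILLION_Q4
    print(f"{size}B needs ~{needed:.1f} GB, fits: {needed <= budget}")
```

A single card is just the one-element case, for example `pooled_vram_budget_gb([24.0])`.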
Is the ModelFit calculator free?
Yes. The calculator is completely free, needs no sign-up, and runs entirely in your browser with no data sent to a server. The underlying compatibility dataset is open under CC BY 4.0, and the same engine ships as the free npx @wecko-ai/modelfit command-line tool.
Go deeper
Pick your exact MacBook, iPhone or GPU for a tuned recommendation.
Every model against every hardware tier, open under CC BY 4.0.
The model-size-to-memory matrix explained, tier by tier.
Citable key facts on RAM, VRAM and model fit.
The same engine runs offline as a one-line command that detects your machine and names the best local model:
npx @wecko-ai/modelfit