Llama 3.1 8B Instruct
Llama / 8B / Q4_K_M / ~6.5 GB
Best for: Chat, Coding · Pop: 94/100
Perf: ~87.0 tok/s · first token ~0.4s
Fits in 24 GB VRAM with room to spare. Best for chat, coding on RTX 3090.
ollama run llama3.1:8b-instruct-q4_K_M
The RTX 3090 is the community favorite for local AI. With 24GB of VRAM at $800-1000 on the used market, it runs 32B-parameter models that most consumer cards cannot touch, and at roughly 87 tokens per second on an 8B model at Q4_K_M it delivers flagship-class speed at a fraction of current-gen prices.
24GB GDDR6X at 936 GB/s unlocks a tier of models that 16GB cards cannot reach. DeepSeek-R1 32B, Qwen 2.5 32B, and Command-R 35B all fit comfortably at Q4 quantization. You get about 23GB usable, so 32B Q4 models (~20GB) load fully in VRAM with 3GB left for context. The 3090 is the cheapest way to run 32B models without CPU offloading, making it the darling of the r/LocalLLaMA community.
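A quick way to sanity-check whether a model/quant combination fits is to multiply the parameter count by the effective bits per weight of the quantization and compare the result against usable VRAM. The sketch below is a rough rule of thumb, not a measurement: the bits-per-weight values are approximate averages for llama.cpp K-quants, and the per-model sizes quoted in the cards on this page run a little larger because they also budget for context and runtime overhead.

```python
# Rough VRAM-fit check for GGUF-quantized models on a 24 GB card.
# Bits-per-weight values are approximate averages for llama.cpp K-quants;
# real file sizes vary slightly by architecture and tensor layout.

BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q5_K_M": 5.7, "Q8_0": 8.5}
USABLE_VRAM_GB = 23.0       # ~24 GB card minus driver/desktop overhead
CONTEXT_BUDGET_GB = 2.0     # headroom reserved for KV cache and activations


def weights_gb(params_billion: float, quant: str) -> float:
    """Approximate weight size in decimal GB."""
    return params_billion * 1e9 * BITS_PER_WEIGHT[quant] / 8 / 1e9


def fits(params_billion: float, quant: str) -> bool:
    return weights_gb(params_billion, quant) + CONTEXT_BUDGET_GB <= USABLE_VRAM_GB


for name, params, quant in [
    ("Llama 3.1 8B", 8, "Q4_K_M"),
    ("32B class", 32, "Q4_K_M"),
    ("70B class", 70, "Q4_K_M"),
]:
    size = weights_gb(params, quant)
    print(f"{name} {quant}: ~{size:.1f} GB weights -> {'fits' if fits(params, quant) else 'does not fit'}")
```

Run as-is, this reproduces the page's ballpark: an 8B Q4 model needs roughly 5 GB of weights, a 32B Q4 model roughly 19-20 GB, and a 70B Q4 model far exceeds 23 GB of usable VRAM.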
| GPU | VRAM | Speed (8B Q4_K_M) | Bandwidth | Price (USD) |
|---|---|---|---|---|
| RTX 3090 | 24 GB | 87 tok/s | 936 GB/s | $900 |
| RTX 5080 | 16 GB | 94 tok/s | 960 GB/s | $999 |
| RTX 4080 SUPER | 16 GB | 79 tok/s | 736 GB/s | $1,597 |
| RTX 4090 | 24 GB | 104 tok/s | 1008 GB/s | $2,574 |
Llama / 8B / Q4_K_M / ~6.5 GB
Best for: Chat, Coding · Pop: 94/100
Perf: ~87.0 tok/s · first token ~0.4s
Fits in 24 GB VRAM with room to spare. Best for chat, coding on RTX 3090.
ollama run llama3.1:8b-instruct-q4_K_M
Qwen / 9B / Q4_K_M / ~7 GB
Best for: Quality, Coding, Reasoning · Pop: 86/100
Perf: ~78.7 tok/s · first token ~0.4s
Fits in 24 GB VRAM with room to spare. Best for quality, coding, reasoning on RTX 3090.
ollama run qwen3.5:9b-instruct-q4_K_M
Qwen / 8B / Q4_K_M / ~6.5 GB
Best for: Chat, Coding · Pop: 88/100
Perf: ~87.0 tok/s · first token ~0.4s
Fits in 24 GB VRAM with room to spare. Best for chat, coding on RTX 3090.
ollama run qwen3:8b-q4_K_M
LFM2 / 24B / Q4_K_M / ~14 GB
Best for: Local AI agents, privacy-first tool calling, MCP workflows · Pop: 80/100
Perf: ~34.2 tok/s · first token ~0.4s
Fits in 24 GB VRAM with room to spare. Best for local AI agents, privacy-first tool calling, and MCP workflows on RTX 3090.
ollama run liquidai/lfm2:24b-a2b-instruct-q4_K_M
Llama / 8B / Q5_K_M / ~8 GB
Best for: Chat, Coding · Pop: 82/100
Perf: ~74.8 tok/s · first token ~0.4s
Fits in 24 GB VRAM with room to spare. Best for chat, coding on RTX 3090.
ollama run llama3.1:8b-instruct-q5_K_M
Gemma / 9B / Q4_K_M / ~7 GB
Best for: Chat, Coding · Pop: 81/100
Perf: ~78.7 tok/s · first token ~0.4s
Fits in 24 GB VRAM with room to spare. Best for chat, coding on RTX 3090.
ollama run gemma2:9b-instruct-q4_K_M
Qwen / 14B / Q4_K_M / ~11 GB
Best for: Coding, Quality · Pop: 84/100
Perf: ~54.1 tok/s · first token ~0.4s
Fits in 24 GB VRAM with room to spare. Best for coding, quality on RTX 3090.
ollama run qwen3:14b-q4_K_M
Qwen / 14B / Q4_K_M / ~11 GB
Best for: Coding, Chat · Pop: 80/100
Perf: ~54.1 tok/s · first token ~0.4s
Fits in 24 GB VRAM with room to spare. Best for coding, chat on RTX 3090.
ollama run qwen2.5:14b-instruct-q4_K_M
Qwen / 14B / Q4_K_M / ~11 GB
Best for: Coding · Pop: 79/100
Perf: ~54.1 tok/s · first token ~0.4s
Fits in 24 GB VRAM with room to spare. Best for coding on RTX 3090.
ollama run qwen2.5-coder:14b-q4_K_M
Mistral / 12B / Q4_K_M / ~9.5 GB
Best for: Chat, Translation · Pop: 78/100
Perf: ~61.6 tok/s · first token ~0.4s
Fits in 24 GB VRAM with room to spare. Best for chat, translation on RTX 3090.
ollama run mistral-nemo:12b-q4_K_M
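The Perf figures above depend on prompt length, context window, and driver version, so it is worth reproducing them on your own card. One way, assuming Ollama is running on its default localhost:11434 endpoint and one of the model tags above has already been pulled, is to stream a generation through the HTTP API and read the timing fields Ollama reports in its final chunk. A minimal sketch (the prompt and model tag are just placeholders):

```python
import json
import time

import requests  # third-party; pip install requests

OLLAMA_URL = "http://localhost:11434/api/generate"   # Ollama's default local endpoint
MODEL = "llama3.1:8b-instruct-q4_K_M"                # any tag already pulled from the list above

payload = {"model": MODEL, "prompt": "Explain KV caching in two sentences.", "stream": True}

start = time.perf_counter()
first_token_s = None
final = None

with requests.post(OLLAMA_URL, json=payload, stream=True, timeout=600) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        if first_token_s is None and chunk.get("response"):
            first_token_s = time.perf_counter() - start   # time to first streamed token
        if chunk.get("done"):
            final = chunk                                  # final chunk carries the timing stats

# eval_count = generated tokens, eval_duration = decode time in nanoseconds
tok_per_s = final["eval_count"] / (final["eval_duration"] / 1e9)
print(f"first token: {first_token_s:.2f}s  decode: {tok_per_s:.1f} tok/s")
```

The first run includes model load time, so repeat the request once the model is resident before comparing against the numbers quoted on this page.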
Qwen (Alibaba Cloud) — Widest size range (0.5B to 235B)
Llama (Meta) — Most popular open-weight model family
DeepSeek (DeepSeek AI) — Best-in-class reasoning with R1 models
Mistral (Mistral AI) — Excellent performance-per-parameter ratio
Gemma (Google DeepMind) — Excellent quality at small sizes (1B-9B)
Phi (Microsoft) — Best quality-per-parameter in small sizes
The RTX 3090 has 24GB GDDR6X VRAM with 936 GB/s bandwidth. About 23GB is usable for models. This is the cheapest GPU that can run 32B parameter models entirely in VRAM at Q4 quantization.
Up to 32B parameter models at Q4 quantization. Top picks: DeepSeek-R1 32B, Qwen 2.5 32B, and Command-R 35B. For 70B models, you would need Q2 quantization or dual GPUs.
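The headroom left after the weights load translates into a context budget through the KV cache: every token in the prompt or the response stores one key and one value vector per layer per KV head. The sketch below uses illustrative layer/head/dimension values for a 32B-class model with grouped-query attention; these are assumptions, not figures from any specific config, so check the model card before relying on them.

```python
# Back-of-the-envelope KV-cache budget for the VRAM left over after the weights load.
# Layer/head/dim values are illustrative for a 32B-class GQA model (assumed values,
# not taken from a specific config); check the actual model card before relying on them.

def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int, bytes_per_elem: int = 2) -> int:
    # one key vector plus one value vector per layer per KV head, FP16 cache
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


headroom_gb = 3.0                                  # left over after a ~20 GB 32B Q4 model
per_token = kv_bytes_per_token(n_layers=64, n_kv_heads=8, head_dim=128)
max_tokens = int(headroom_gb * 1e9 / per_token)
print(f"~{per_token / 1024:.0f} KiB per token -> roughly {max_tokens:,} tokens of context headroom")
```

With those assumed values the leftover ~3 GB supports on the order of ten thousand tokens of context; quantizing the KV cache or picking a smaller model stretches that further.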
The RTX 3090 is the best value GPU for large model inference in 2026. At $800-1000 used, its 24GB VRAM handles 32B models that $999 16GB cards cannot. The r/LocalLLaMA community consistently ranks it as the top recommendation.
The RTX 4090 is about 20% faster (104 vs 87 tok/s) with the same 24GB VRAM, but it costs nearly three times as much ($2,574 vs ~$900 used). The 3090 offers much better value per dollar for AI workloads.
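Using the speed and price figures from the table above, a quick throughput-per-dollar calculation makes the value argument concrete (prices are the used/street numbers quoted on this page and will drift over time):

```python
# Throughput per dollar using the speed and price figures from the GPU table above.
gpus = {"RTX 3090 (used)": (87, 900), "RTX 4090": (104, 2574)}

for name, (tok_s, price_usd) in gpus.items():
    print(f"{name}: {tok_s / price_usd:.3f} tok/s per dollar")
# 87/900 is about 0.097 and 104/2574 about 0.040: the 3090 delivers ~2.4x the throughput per dollar.
```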
Check eBay, r/hardwareswap, and local marketplaces. Prices range from $800-1000. Look for cards that were not used for cryptocurrency mining. The Founders Edition and EVGA models have good cooling for sustained AI workloads.
Use our interactive wizard to compare models across Apple Silicon and NVIDIA GPUs.