The RTX 3090 is the community favorite for local AI. With 24 GB of VRAM at $800–$1,000 on the used market, it runs 32B-parameter models that most cards cannot touch. At roughly 87 tokens per second on an 8B Q4 model, it delivers flagship-class speed at a fraction of current-gen prices.
| GPU | VRAM | Speed (8B Q4_K_M) | Bandwidth | Price |
|---|---|---|---|---|
| RTX 3090 | 24 GB | 87 tok/s | 936 GB/s | $900 |
| RTX 5080 | 16 GB | 94 tok/s | 960 GB/s | $999 |
| RTX 4080 SUPER | 16 GB | 79 tok/s | 736 GB/s | $1,597 |
| RTX 4090 | 24 GB | 104 tok/s | 1008 GB/s | $2,574 |
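A quick way to sanity-check the fit claims in this table: a quantized model needs its file size plus some runtime headroom (CUDA context, buffers, KV cache) to stay under the card's VRAM. A minimal sketch, assuming a flat ~1.5 GB overhead and the approximate file sizes listed on this page; the ~32B entry and its ~20 GB size are hypothetical illustrations:

```python
# Rough VRAM fit check: quantized model file size + runtime headroom vs. card VRAM.
# The ~1.5 GB overhead (CUDA context, buffers, a modest KV cache) is an assumed
# rule of thumb, not a measured value.

MODELS_GB = {
    "llama3.1:8b-instruct-q4_K_M": 6.5,      # from the list below
    "qwen3:14b-q4_K_M": 11.0,                # from the list below
    "hypothetical-32b-q4_K_M": 20.0,         # assumed size for a ~32B model
}
RUNTIME_OVERHEAD_GB = 1.5

def fits(model: str, vram_gb: float) -> bool:
    """True if weights plus the assumed overhead fit in the given VRAM."""
    return MODELS_GB[model] + RUNTIME_OVERHEAD_GB <= vram_gb

for name in MODELS_GB:
    print(f"{name}: {'fits' if fits(name, 24.0) else 'tight'} on a 24 GB card")
```

On this estimate every model in the list below fits a 24 GB card easily, while a ~32B Q4 model fits but leaves less headroom for a long context.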
Top recommended models for the RTX 3090. All ten fit in 24 GB of VRAM with room to spare, and time to first token is roughly 0.4 s across the board.

| Model | Quant | Size | Best for | Popularity | Speed | Command |
|---|---|---|---|---|---|---|
| Llama 3.1 8B Instruct | Q4_K_M | ~6.5 GB | Chat, coding | 94/100 | ~87.0 tok/s | `ollama run llama3.1:8b-instruct-q4_K_M` |
| Qwen3.5 9B Instruct | Q4_K_M | ~7 GB | Quality, coding, reasoning | 86/100 | ~78.7 tok/s | `ollama run qwen3.5:9b-instruct-q4_K_M` |
| Qwen3 8B | Q4_K_M | ~6.5 GB | Chat, coding | 88/100 | ~87.0 tok/s | `ollama run qwen3:8b-q4_K_M` |
| LFM2 24B (A2B) | Q4_K_M | ~14 GB | Local AI agents, privacy-first tool calling, MCP workflows (example below) | 80/100 | ~34.2 tok/s | `ollama run liquidai/lfm2:24b-a2b-instruct-q4_K_M` |
| Llama 3.1 8B Instruct | Q5_K_M | ~8 GB | Chat, coding | 82/100 | ~74.8 tok/s | `ollama run llama3.1:8b-instruct-q5_K_M` |
| Gemma 2 9B Instruct | Q4_K_M | ~7 GB | Chat, coding | 81/100 | ~78.7 tok/s | `ollama run gemma2:9b-instruct-q4_K_M` |
| Qwen3 14B | Q4_K_M | ~11 GB | Coding, quality | 84/100 | ~54.1 tok/s | `ollama run qwen3:14b-q4_K_M` |
| Qwen2.5 14B Instruct | Q4_K_M | ~11 GB | Coding, chat | 80/100 | ~54.1 tok/s | `ollama run qwen2.5:14b-instruct-q4_K_M` |
| Qwen2.5-Coder 14B | Q4_K_M | ~11 GB | Coding | 79/100 | ~54.1 tok/s | `ollama run qwen2.5-coder:14b-q4_K_M` |
| Mistral NeMo 12B | Q4_K_M | ~9.5 GB | Chat, translation | 78/100 | ~61.6 tok/s | `ollama run mistral-nemo:12b-q4_K_M` |
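The LFM2 entry above targets local agents and tool calling; here is what that looks like against Ollama's local HTTP API. A minimal sketch, assuming `ollama serve` is listening on the default port 11434 and a model whose chat template supports tools (shown with the Llama 3.1 tag from the table); the `get_weather` schema is a made-up example, not a real tool:

```python
import json
import urllib.request

# Minimal tool-calling sketch against Ollama's local /api/chat endpoint.
# Assumes the Ollama server is running on the default port and the chosen
# model supports tool use; get_weather is a hypothetical example tool.
request_body = {
    "model": "llama3.1:8b-instruct-q4_K_M",
    "stream": False,
    "messages": [{"role": "user", "content": "What's the weather in Oslo?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool name
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
}

req = urllib.request.Request(
    "http://localhost:11434/api/chat",
    data=json.dumps(request_body).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    message = json.load(resp)["message"]

# If the model decided to call the tool, the reply carries structured calls
# rather than plain text.
for call in message.get("tool_calls", []):
    print(call["function"]["name"], call["function"]["arguments"])
```

When the model opts to call the tool, the text content is typically empty and the structured call carries the arguments your agent code would execute before sending the result back as a follow-up message.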
With 24 GB of VRAM, the RTX 3090 can run models up to roughly 32B parameters. Top recommendations include Llama 3.1 8B Instruct, Qwen3.5 9B Instruct, and Qwen3 8B.
The RTX 3090 achieves about 87 tokens per second with Qwen3 8B at Q4 quantization; smaller models run faster, larger models slower.
24 GB of VRAM is excellent for local AI: you can comfortably run models up to roughly 32B parameters with room left for the KV cache. All ten of our recommended models run at full speed.
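The "room for KV cache" point is worth making concrete: at fp16 the cache stores one key and one value vector per layer per token, so it grows linearly with context length. A back-of-envelope sketch using Llama 3.1 8B's published shape (32 layers, 8 KV heads, head dimension 128):

```python
# Back-of-envelope KV cache size for a GQA transformer at fp16.
# Llama 3.1 8B shape: 32 layers, 8 KV heads, head_dim 128 (published config).

def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 context_len: int, bytes_per_elem: int = 2) -> float:
    """The factor of 2 accounts for storing both keys and values."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return per_token * context_len / 2**30

print(kv_cache_gib(32, 8, 128, 8192))    # -> 1.0 GiB at an 8K context
print(kv_cache_gib(32, 8, 128, 32768))   # -> 4.0 GiB at a 32K context
```

So an 8K context adds about 1 GiB on top of the ~6.5 GB of weights, and a 32K context about 4 GiB; comfortable either way in 24 GB.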
Install Ollama from ollama.com, then run models directly, for example `ollama run llama3.1:8b-instruct-q4_K_M`. Ollama automatically detects your NVIDIA GPU and uses CUDA acceleration.
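To verify the GPU is actually doing the work, watch `nvidia-smi` while a model is loaded, or measure throughput through Ollama's API. A minimal sketch, assuming the default server address; Ollama's `/api/generate` response reports `eval_count` and `eval_duration` (in nanoseconds), from which tokens per second follow directly:

```python
import json
import urllib.request

# Quick throughput check against Ollama's local /api/generate endpoint.
# Assumes the server runs at the default address; per Ollama's API,
# eval_duration is reported in nanoseconds.
body = json.dumps({
    "model": "llama3.1:8b-instruct-q4_K_M",
    "prompt": "Explain KV caching in two sentences.",
    "stream": False,
}).encode()

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=body,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.load(resp)

tok_per_s = result["eval_count"] / result["eval_duration"] * 1e9
print(f"{tok_per_s:.1f} tok/s")
```

A figure far below the ~87 tok/s listed above for 8B Q4 would suggest the model spilled to CPU or system RAM.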
Use our interactive wizard to compare models across Apple Silicon and NVIDIA GPUs.