Best Local AI Models for RTX 3090 (24GB)

The RTX 3090 is a community favorite for local AI. With 24 GB of VRAM at $800-$1,000 on the used market, it runs 32B-parameter models that cards with less memory cannot fit. At 87 tokens per second on an 8B model at Q4 quantization, it delivers flagship-class speed at a fraction of current-gen prices.

Specifications

VRAM: 24 GB GDDR6X
Speed (8B Q4): 87 tok/s
Price: $900 (used market price)
Architecture: Ampere
Bandwidth: 936 GB/s
Max Model Size: Up to 32B parameter models
Compatibility: 10 excellent, 0 workable
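
The bandwidth and speed figures above are related. During single-stream decoding, each generated token requires reading roughly the full set of weights from VRAM, so memory bandwidth divided by model size gives a rough upper bound on tokens per second. The following is a minimal sketch of that rule of thumb, assuming the ~6.5 GB Q4_K_M size this page lists for 8B models. The function name is illustrative, and the bound ignores KV-cache reads and compute overhead.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Rough upper bound on decode speed when generation is memory-bandwidth bound.

    Assumption: every token requires one full pass over the weights.
    """
    return bandwidth_gb_s / model_size_gb


# Figures taken from this page: 936 GB/s bandwidth, ~6.5 GB for an 8B Q4_K_M model.
ceiling = decode_ceiling_tok_s(936, 6.5)
measured = 87.0  # listed speed for 8B Q4
efficiency = measured / ceiling

print(f"ceiling ~{ceiling:.0f} tok/s, listed {measured:.0f} tok/s "
      f"({efficiency:.0%} of ceiling)")
```

Under these assumptions the ceiling comes out at 144 tok/s, so the listed 87 tok/s is about 60% of it. The same division is a quick sanity check for any other model size on this page.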

Compare Similar GPUs

GPU              VRAM    Speed       Bandwidth   Price
RTX 3090         24 GB   87 tok/s    936 GB/s    $900
RTX 5080         16 GB   94 tok/s    960 GB/s    $999
RTX 4080 SUPER   16 GB   79 tok/s    736 GB/s    $1,597
RTX 4090         24 GB   104 tok/s   1008 GB/s   $2,574
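
One way to read this comparison is price per unit of speed. The sketch below divides each listed price by its listed speed. The numbers are copied from the table above, the helper name is illustrative, and the result should be treated as a rough guide because the RTX 3090 price is a used-market figure.

```python
# (price in USD, speed in tok/s), copied from the comparison table.
gpus = {
    "RTX 3090": (900, 87),
    "RTX 5080": (999, 94),
    "RTX 4080 SUPER": (1597, 79),
    "RTX 4090": (2574, 104),
}


def dollars_per_tok_s(price_usd: float, speed_tok_s: float) -> float:
    """Price paid for each token-per-second of listed throughput."""
    return price_usd / speed_tok_s


# Cheapest throughput first.
ranked = sorted(gpus, key=lambda name: dollars_per_tok_s(*gpus[name]))

for name in ranked:
    print(f"{name:<15} ${dollars_per_tok_s(*gpus[name]):.2f} per tok/s")
```

This metric ignores VRAM capacity, which is the other reason to pick a 24 GB card over a 16 GB one.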

Recommended Models

10 models
01

Llama 3.1 8B Instruct

Llama / 8B / Q4_K_M / ~6.5 GB

Best for: Chat, Coding · Pop: 94/100

Perf: ~87.0 tok/s · first token ~0.4s

Local OK//Excellent

Fits in 24 GB VRAM with room to spare. Best for chat and coding on RTX 3090.

ollama
ollama run llama3.1:8b-instruct-q4_K_M
02

Qwen3.5 9B Instruct

Qwen / 9B / Q4_K_M / ~7 GB

Best for: Quality, Coding, Reasoning · Pop: 86/100

Perf: ~78.7 tok/s · first token ~0.4s

Local OK//Excellent

Fits in 24 GB VRAM with room to spare. Best for quality, coding, and reasoning on RTX 3090.

ollama
ollama run qwen3.5:9b-instruct-q4_K_M
03

Qwen3 8B

Qwen / 8B / Q4_K_M / ~6.5 GB

Best for: Chat, Coding · Pop: 88/100

Perf: ~87.0 tok/s · first token ~0.4s

Local OK//Excellent

Fits in 24 GB VRAM with room to spare. Best for chat and coding on RTX 3090.

ollama
ollama run qwen3:8b-q4_K_M
04

LFM2 24B-A2B Instruct

LFM2 / 24B / Q4_K_M / ~14 GB

Best for: Local AI agents, privacy-first tool calling, MCP workflows · Pop: 80/100

Perf: ~34.2 tok/s · first token ~0.4s

Local OK//Excellent

Fits in 24 GB VRAM with room to spare. Best for local AI agents, privacy-first tool calling, and MCP workflows on RTX 3090.

ollama
ollama run liquidai/lfm2:24b-a2b-instruct-q4_K_M
05

Llama 3.1 8B Instruct (Q5)

Llama / 8B / Q5_K_M / ~8 GB

Best for: Chat, Coding · Pop: 82/100

Perf: ~74.8 tok/s · first token ~0.4s

Local OK//Excellent

Fits in 24 GB VRAM with room to spare. Best for chat and coding on RTX 3090.

ollama
ollama run llama3.1:8b-instruct-q5_K_M
06

Gemma 2 9B Instruct

Gemma / 9B / Q4_K_M / ~7 GB

Best for: Chat, Coding · Pop: 81/100

Perf: ~78.7 tok/s · first token ~0.4s

Local OK//Excellent

Fits in 24 GB VRAM with room to spare. Best for chat and coding on RTX 3090.

ollama
ollama run gemma2:9b-instruct-q4_K_M
07

Qwen3 14B

Qwen / 14B / Q4_K_M / ~11 GB

Best for: Coding, Quality · Pop: 84/100

Perf: ~54.1 tok/s · first token ~0.4s

Local OK//Excellent

Fits in 24 GB VRAM with room to spare. Best for coding and quality on RTX 3090.

ollama
ollama run qwen3:14b-q4_K_M
08

Qwen2.5 14B Instruct

Qwen / 14B / Q4_K_M / ~11 GB

Best for: Coding, Chat · Pop: 80/100

Perf: ~54.1 tok/s · first token ~0.4s

Local OK//Excellent

Fits in 24 GB VRAM with room to spare. Best for coding and chat on RTX 3090.

ollama
ollama run qwen2.5:14b-instruct-q4_K_M
09

Qwen2.5 Coder 14B

Qwen / 14B / Q4_K_M / ~11 GB

Best for: Coding · Pop: 79/100

Perf: ~54.1 tok/s · first token ~0.4s

Local OK//Excellent

Fits in 24 GB VRAM with room to spare. Best for coding on RTX 3090.

ollama
ollama run qwen2.5-coder:14b-q4_K_M
10

Mistral Nemo 12B

Mistral / 12B / Q4_K_M / ~9.5 GB

Best for: Chat, Translation · Pop: 78/100

Perf: ~61.6 tok/s · first token ~0.4s

Local OK//Excellent

Fits in 24 GB VRAM with room to spare. Best for chat and translation on RTX 3090.

ollama
ollama run mistral-nemo:12b-q4_K_M

Frequently Asked Questions

What AI models can I run on an RTX 3090?

With 24 GB of VRAM, the RTX 3090 can run models of up to 32B parameters. Top recommendations include Llama 3.1 8B Instruct, Qwen3.5 9B Instruct, and Qwen3 8B.

How fast is the RTX 3090 for local AI?

The RTX 3090 achieves 87 tokens per second with Qwen3 8B at Q4 quantization. Smaller models run faster, and larger models run slower.

Is 24GB VRAM enough for local AI?

24 GB of VRAM is excellent for local AI. You can comfortably run models of up to 32B parameters with room left for the KV cache. All 10 of our top recommended models run at full speed.
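
How much room the KV cache needs depends on the model architecture and the context length. The sketch below estimates it with the standard formula (keys plus values, per layer, per KV head, per token). The architecture numbers are assumptions for Llama 3.1 8B (32 layers, 8 KV heads, head dimension 128) with a 16-bit cache; check the model card for the model you actually run.

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_tokens: int, bytes_per_value: int = 2) -> float:
    """Estimated KV-cache size in GiB for one sequence.

    The leading factor of 2 accounts for storing both keys and values.
    """
    total_bytes = (2 * n_layers * n_kv_heads * head_dim
                   * context_tokens * bytes_per_value)
    return total_bytes / 2**30


# Assumed Llama 3.1 8B shape: 32 layers, 8 KV heads, head_dim 128, fp16 cache.
for ctx in (8_192, 32_768, 131_072):
    kv = kv_cache_gib(32, 8, 128, ctx)
    # ~6.5 GB of Q4_K_M weights per this page; GB and GiB are treated as
    # interchangeable here, so this is only a rough headroom figure.
    print(f"context {ctx:>7}: KV cache ~{kv:.1f} GiB, "
          f"weights + cache ~{6.5 + kv:.1f} of 24")
```

With these assumed numbers the cache grows linearly with context, so very long contexts can use more VRAM than the weights themselves.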

How do I run AI models on RTX 3090 with Ollama?

Install Ollama from ollama.com, then run models directly. For example: ollama run llama3.1:8b-instruct-q4_K_M. Ollama automatically detects your NVIDIA GPU and uses CUDA acceleration.
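
Once a model is running, Ollama also serves a local HTTP API, which makes it possible to measure generation speed on your own card. The following is a minimal sketch, assuming Ollama is listening on its default port (11434) and the model has already been pulled. The helper names are illustrative; the `eval_count` and `eval_duration` fields come from the `/api/generate` response, with the duration reported in nanoseconds.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request (POST, because data is set)."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def tokens_per_second(result: dict) -> float:
    """Decode speed from a response: generated tokens / generation time."""
    return result["eval_count"] / (result["eval_duration"] / 1e9)


def generate(model: str, prompt: str) -> dict:
    """Send the request to a running Ollama server and parse the JSON reply."""
    with urllib.request.urlopen(build_request(model, prompt)) as resp:
        return json.loads(resp.read())


# Example usage (requires a running Ollama server):
# result = generate("llama3.1:8b-instruct-q4_K_M", "Explain VRAM in one sentence.")
# print(result["response"])
# print(f"{tokens_per_second(result):.1f} tok/s")
```

Results will vary with prompt length, context size, and whatever else is using the GPU, so a single run is only a rough comparison against the figures on this page.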