
Best Local AI Models for RTX 5090 (32GB)

The RTX 5090 redefines what is possible with consumer GPUs. With 32 GB of GDDR7 and roughly 145 tokens per second on an 8B model at Q4, it can handle models up to the 70B class at Q4 quantization. It is the ultimate card for running the largest local AI models.
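A quick way to sanity-check whether a model fits before downloading it: Q4_K_M weights come to roughly 0.6 bytes per parameter (the ~20 GB listed below for a 35B model implies about that ratio), plus a couple of gigabytes of runtime overhead. A minimal Python sketch; the ratio and overhead figures are rough assumptions, not exact numbers:

def q4_fit(params_b, vram_gb=32.0):
    # Assumed ~0.6 bytes/param for Q4_K_M weights; roughly matches the
    # ~20 GB listed below for a 35B model. Overhead covers KV cache and buffers.
    weights_gb = params_b * 0.6
    total_gb = weights_gb + 2.0
    return total_gb, total_gb <= vram_gb

for size in (8, 14, 27, 35, 70):
    gb, ok = q4_fit(size)
    # Models that exceed VRAM can still run via partial CPU offload, just slower.
    print(f"{size:>3}B Q4 ~{gb:4.1f} GB -> {'fits in VRAM' if ok else 'needs partial offload'}")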

Specifications
VRAM: 32 GB GDDR7
Speed (8B Q4): 145 tok/s
Price: $2,499
Architecture: Blackwell
Memory bandwidth: 1792 GB/s
Max model size: up to 70B parameters
Compatibility: 10 recommended models rated excellent, 0 merely workable

Compare Similar GPUs

GPU         VRAM    Speed (8B Q4)   Bandwidth   Price
RTX 3090    24 GB   87 tok/s        936 GB/s    $900
RTX 5080    16 GB   94 tok/s        960 GB/s    $999
RTX 5090    32 GB   145 tok/s       1792 GB/s   $2,499
RTX 4090    24 GB   104 tok/s       1008 GB/s   $2,574
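Memory bandwidth largely explains these speeds: at batch size 1, every generated token streams all active weights through the GPU once, so bandwidth divided by model size gives a rough ceiling. A hedged sketch using the table's numbers (the ~5 GB weight size for an 8B Q4 model is an assumption):

# Decode at batch size 1 is roughly memory-bound:
#   tok/s ceiling ~= bandwidth (GB/s) / active weight size (GB)
MODEL_GB = 5.0  # assumed ~5 GB for an 8B model at Q4_K_M

for gpu, bw_gbs in [("RTX 3090", 936), ("RTX 5080", 960), ("RTX 5090", 1792), ("RTX 4090", 1008)]:
    print(f"{gpu}: ceiling ~{bw_gbs / MODEL_GB:.0f} tok/s")
# The measured speeds in the table land around 40-50% of these ceilings,
# which is typical once kernel overheads and KV-cache reads are included.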

Recommended Models

10 models
01

Qwen3.5 35B-A3B Instruct

Qwen / 35B / Q4_K_M / ~20 GB

Best for: Reasoning, Coding, Agent scenarios · Popularity: 90/100

Perf: ~41.4 tok/s · first token ~1.0s

Local: Excellent

Fits in 32 GB VRAM with room to spare. Best for reasoning, coding, agent scenarios on RTX 5090.

ollama
ollama run qwen3.5:35b-a3b-instruct-q4_K_M
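Beyond the CLI, Ollama also serves a local HTTP API on port 11434, which is handy when you want to call a model from a script rather than a terminal session. A minimal sketch using only the Python standard library (the prompt text is illustrative; the model tag mirrors the command above):

import json
import urllib.request

# Ollama's REST API is available at localhost:11434 once the server is running.
payload = {
    "model": "qwen3.5:35b-a3b-instruct-q4_K_M",
    "prompt": "Explain KV caching in two sentences.",
    "stream": False,  # return one JSON object instead of a token stream
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])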
02

Qwen3.5 27B Instruct

Qwen / 27B / Q4_K_M / ~16 GB

Best for: Chat, Coding, Complex reasoning · Popularity: 82/100

Perf: ~51.6 tok/s · first token ~0.4s

Local: Excellent

Fits in 32 GB VRAM with room to spare. Best for chat, coding, complex reasoning on RTX 5090.

ollama
ollama run qwen3.5:27b-instruct-q4_K_M
03

LFM2 24B-A2B Instruct

LFM2 / 24B / Q4_K_M / ~14 GB

Best for: Local AI agents, privacy-first tool calling, MCP workflows · Popularity: 80/100

Perf: ~57.0 tok/s · first token ~0.4s

Local: Excellent

Fits in 32 GB VRAM with room to spare. Best for local AI agents, privacy-first tool calling, and MCP workflows on RTX 5090.

ollama
ollama run liquidai/lfm2:24b-a2b-instruct-q4_K_M
04

Qwen3 14B

Qwen / 14B / Q4_K_M / ~11 GB

Best for: Coding, Quality · Popularity: 84/100

Perf: ~90.1 tok/s · first token ~0.4s

Local: Excellent

Fits in 32 GB VRAM with room to spare. Best for coding, quality on RTX 5090.

ollama
ollama run qwen3:14b-q4_K_M
05

Qwen2.5 14B Instruct

Qwen / 14B / Q4_K_M / ~11 GB

Best for: Coding, Chat · Popularity: 80/100

Perf: ~90.1 tok/s · first token ~0.4s

Local: Excellent

Fits in 32 GB VRAM with room to spare. Best for coding, chat on RTX 5090.

ollama
ollama run qwen2.5:14b-instruct-q4_K_M
06

Qwen2.5 Coder 14B

Qwen / 14B / Q4_K_M / ~11 GB

Best for: Coding · Popularity: 79/100

Perf: ~90.1 tok/s · first token ~0.4s

Local: Excellent

Fits in 32 GB VRAM with room to spare. Best for coding on RTX 5090.

ollama
ollama run qwen2.5-coder:14b-q4_K_M
07

Mistral Nemo 12B

Mistral / 12B / Q4_K_M / ~9.5 GB

Best for: Chat, Translation · Popularity: 78/100

Perf: ~102.7 tok/s · first token ~0.3s

Local: Excellent

Fits in 32 GB VRAM with room to spare. Best for chat, translation on RTX 5090.

ollama
ollama run mistral-nemo:12b-q4_K_M
08

Gemma 3 12B Instruct

Gemma / 12B / Q4_K_M / ~9.5 GB

Best for: Chat, Quality · Popularity: 76/100

Perf: ~102.7 tok/s · first token ~0.3s

Local: Excellent

Fits in 32 GB VRAM with room to spare. Best for chat, quality on RTX 5090.

ollama
ollama run gemma3:12b-instruct-q4_K_M
09

DeepSeek-R1 Distill Qwen 14B

DeepSeek / 14B / Q4_K_M / ~11 GB

Best for: Reasoning, Quality · Popularity: 74/100

Perf: ~90.1 tok/s · first token ~0.4s

Local: Excellent

Fits in 32 GB VRAM with room to spare. Best for reasoning, quality on RTX 5090.

ollama
ollama run deepseek-r1-distill:qwen-14b-q4_K_M
10

Phi-3 Medium 14B

Phi / 14B / Q4_K_M / ~11 GB

Best for: Coding, Quality · Popularity: 69/100

Perf: ~90.1 tok/s · first token ~0.4s

Local: Excellent

Fits in 32 GB VRAM with room to spare. Best for coding, quality on RTX 5090.

ollama
ollama run phi3:medium-q4_K_M


Frequently Asked Questions

What AI models can I run on an RTX 5090?

With 32 GB of VRAM, the RTX 5090 can run models up to the 70B class. Top recommendations include Qwen3.5 35B-A3B Instruct, Qwen3.5 27B Instruct, and LFM2 24B-A2B Instruct.

How fast is the RTX 5090 for local AI?

The RTX 5090 achieves around 145 tokens per second with Qwen3 8B at Q4 quantization. Smaller models run faster and larger models slower, roughly in proportion to model size; see the sketch below.
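Because decode speed at a fixed quantization scales roughly inversely with parameter count, you can estimate other sizes from the 8B figure. A hedged rule-of-thumb check against the numbers in the model list above:

# At fixed quantization, decode speed scales ~inversely with parameter
# count, since each generated token reads all active weights once.
base_tok_s, base_params_b = 145, 8  # the 8B Q4 figure quoted above

for params_b in (12, 14):
    print(f"{params_b}B: ~{base_tok_s * base_params_b / params_b:.0f} tok/s")
# Predicts ~97 and ~83 tok/s; the list above measures ~102.7 and ~90.1.
# Mixture-of-experts models (e.g. the 35B-A3B) beat this rule because
# only a few billion parameters are active per token.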

Is 32GB VRAM enough for local AI?

32 GB of VRAM is excellent for local AI. You can comfortably run models up to the 70B class with room left over for the KV cache, and all 10 of our top recommended models run at full speed. The sketch below gives a feel for what the KV cache actually costs.
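For a rough sense of what "room for KV cache" means: the cache stores one key and one value vector per layer per token. A hedged sketch with illustrative architecture numbers (the layer, head, and dimension counts are assumptions for a typical 14B model with grouped-query attention, not exact figures for any listed model):

# KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim * tokens * bytes/elem
layers, kv_heads, head_dim = 40, 8, 128  # assumed, roughly 14B-class shapes
context_tokens = 32_768
bytes_per_elem = 2  # fp16

kv_bytes = 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_elem
print(f"KV cache at {context_tokens} tokens: ~{kv_bytes / 1024**3:.1f} GB")
# ~5.0 GB here -- which is why an ~11 GB Q4 model plus a long context
# still fits comfortably in 32 GB.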

How do I run AI models on RTX 5090 with Ollama?

Install Ollama from ollama.com, then run models directly. For example: ollama run qwen3.5:35b-a3b-instruct-q4_K_M. Ollama automatically detects your NVIDIA GPU and uses CUDA acceleration.
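For scripting, the official Python client (pip install ollama) talks to the same local server the CLI uses. A minimal hedged sketch (the prompt is illustrative):

import ollama  # official Python client: pip install ollama

# Sends a chat request to the local Ollama server started by the CLI or app.
reply = ollama.chat(
    model="qwen3.5:35b-a3b-instruct-q4_K_M",
    messages=[{"role": "user", "content": "Summarize the RTX 5090 in one sentence."}],
)
print(reply["message"]["content"])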

Want Personalized Recommendations?

Use our interactive wizard to compare models across Apple Silicon and NVIDIA GPUs.

Open ModelFit Wizard →