
Best Local AI Models for RTX 5090 (32GB)

The RTX 5090 redefines what is possible with consumer GPUs. With 32GB of GDDR7 and roughly 145 tokens per second on 8B Q4 models, it can load 70B parameter models at Q3 quantization. It is the ultimate card for running the largest local AI models.

Specifications
VRAM: 32 GB GDDR7
Speed (8B Q4): 145 tok/s
Price: $2,499
Architecture: Blackwell
Bandwidth: 1,792 GB/s
Max model size: up to 70B parameters
Compatibility: 10 models rated excellent, 0 workable
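
Before loading a large model, it is worth confirming how much of the 32GB is actually free, since the desktop compositor and other applications claim some VRAM. A quick check with NVIDIA's standard tool:

# Report total, used, and free VRAM on the card
nvidia-smi --query-gpu=name,memory.total,memory.used,memory.free --format=csv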

RTX 5090 VRAM for AI: What Actually Fits?

32GB of GDDR7 at 1,792 GB/s is unprecedented in a consumer card. It is enough to load 32B models at Q6 quantization with room to spare, delivering higher quality than the Q4 builds smaller cards are forced into. The 5090 also handles multi-model setups: run a 14B chat model alongside a 7B coding model simultaneously, as sketched below. The largest 70B models at Q3_K_M (~30GB) fit with about 1GB of headroom. For AI enthusiasts who want to run the biggest models locally, this is the endgame card.
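
A minimal sketch of that multi-model setup, assuming Ollama as the runtime (OLLAMA_MAX_LOADED_MODELS is a standard Ollama server setting; the two model tags are taken from the list below):

# Let the Ollama server keep two models resident in VRAM at once
# (set where the server runs; skip ollama serve if it already runs as a service)
export OLLAMA_MAX_LOADED_MODELS=2
ollama serve &

# Load a 14B chat model and a 7B coding model side by side
ollama run qwen2.5:14b-instruct-q4_K_M "Say hello"
ollama run qwen2.5-coder:7b "Write hello world in Go"

# Confirm both stay resident and run 100% on the GPU
ollama ps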

RTX 5090 vs Similar GPUs

GPU      | VRAM  | Speed (8B Q4) | Bandwidth  | Price
RTX 3090 | 24 GB | 87 tok/s      | 936 GB/s   | $900
RTX 5080 | 16 GB | 94 tok/s      | 960 GB/s   | $999
RTX 5090 | 32 GB | 145 tok/s     | 1,792 GB/s | $2,499
RTX 4090 | 24 GB | 104 tok/s     | 1,008 GB/s | $2,574

Recommended Models

10 models
01. Qwen3.5 35B-A3B Instruct

Qwen / 35B / Q4_K_M / ~20 GB

Best for: Reasoning, Coding, Agent scenarios · Popularity: 90/100

Perf: ~41.4 tok/s · first token ~1.0s

Local compatibility: Excellent

Fits in 32 GB VRAM with room to spare. Best for reasoning, coding, agent scenarios on RTX 5090.

ollama run qwen3.5:35b-a3b-instruct-q4_K_M
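
Once a model is downloaded, it is worth confirming what you actually got. A minimal check, assuming a recent Ollama CLI (older releases need flags such as --parameters rather than the bare form):

# Print parameter count, quantization, and context length for the download
ollama show qwen3.5:35b-a3b-instruct-q4_K_M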
02. Qwen3.5 27B Instruct

Qwen / 27B / Q4_K_M / ~16 GB

Best for: Chat, Coding, Complex reasoning · Popularity: 82/100

Perf: ~51.6 tok/s · first token ~0.4s

Local compatibility: Excellent

Fits in 32 GB VRAM with room to spare. Best for chat, coding, complex reasoning on RTX 5090.

ollama run qwen3.5:27b-instruct-q4_K_M
03. LFM2 24B-A2B Instruct

LFM2 / 24B / Q4_K_M / ~14 GB

Best for: Local AI agents, privacy-first tool calling, MCP workflows · Popularity: 80/100

Perf: ~57.0 tok/s · first token ~0.4s

Local compatibility: Excellent

Fits in 32 GB VRAM with room to spare. Best for local AI agents, privacy-first tool calling, and MCP workflows on RTX 5090.

ollama run liquidai/lfm2:24b-a2b-instruct-q4_K_M
04. Qwen3 14B

Qwen / 14B / Q4_K_M / ~11 GB

Best for: Coding, Quality · Popularity: 84/100

Perf: ~90.1 tok/s · first token ~0.4s

Local compatibility: Excellent

Fits in 32 GB VRAM with room to spare. Best for coding, quality on RTX 5090.

ollama run qwen3:14b-q4_K_M
05. Qwen2.5 14B Instruct

Qwen / 14B / Q4_K_M / ~11 GB

Best for: Coding, Chat · Popularity: 80/100

Perf: ~90.1 tok/s · first token ~0.4s

Local compatibility: Excellent

Fits in 32 GB VRAM with room to spare. Best for coding, chat on RTX 5090.

ollama run qwen2.5:14b-instruct-q4_K_M
06. Qwen2.5 Coder 14B

Qwen / 14B / Q4_K_M / ~11 GB

Best for: Coding · Popularity: 79/100

Perf: ~90.1 tok/s · first token ~0.4s

Local compatibility: Excellent

Fits in 32 GB VRAM with room to spare. Best for coding on RTX 5090.

ollama run qwen2.5-coder:14b-q4_K_M
07. Mistral Nemo 12B

Mistral / 12B / Q4_K_M / ~9.5 GB

Best for: Chat, Translation · Popularity: 78/100

Perf: ~102.7 tok/s · first token ~0.3s

Local compatibility: Excellent

Fits in 32 GB VRAM with room to spare. Best for chat, translation on RTX 5090.

ollama run mistral-nemo:12b-q4_K_M
08. Gemma 3 12B Instruct

Gemma / 12B / Q4_K_M / ~9.5 GB

Best for: Chat, Quality · Popularity: 76/100

Perf: ~102.7 tok/s · first token ~0.3s

Local compatibility: Excellent

Fits in 32 GB VRAM with room to spare. Best for chat, quality on RTX 5090.

ollama run gemma3:12b-it-q4_K_M
09. DeepSeek-R1 Distill Qwen 14B

DeepSeek / 14B / Q4_K_M / ~11 GB

Best for: Reasoning, Quality · Popularity: 74/100

Perf: ~90.1 tok/s · first token ~0.4s

Local compatibility: Excellent

Fits in 32 GB VRAM with room to spare. Best for reasoning, quality on RTX 5090.

ollama run deepseek-r1:14b-qwen-distill-q4_K_M
10. Phi-3 Medium 14B

Phi / 14B / Q4_K_M / ~11 GB

Best for: Coding, Quality · Popularity: 69/100

Perf: ~90.1 tok/s · first token ~0.4s

Local compatibility: Excellent

Fits in 32 GB VRAM with room to spare. Best for coding, quality on RTX 5090.

ollama run phi3:medium-q4_K_M


RTX 5090 FAQ: Common Questions

How much VRAM does the RTX 5090 have for LLMs?

The RTX 5090 has 32GB of GDDR7 VRAM with 1,792 GB/s of bandwidth, the most of any consumer GPU. About 31GB is usable for models. This fits 70B models at Q3 quantization and 32B models at Q6_K (~26GB) with headroom for context; a full Q8 build of a 32B model (~34GB) does not quite fit.
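
These figures follow from a back-of-envelope rule: file size ≈ parameters × bits per weight / 8. A sketch (the bits-per-weight values are approximate effective rates for Q3_K_M and Q6_K, not exact):

# ~3.5 bits/weight for Q3_K_M: 70B -> ~30.6 GB
awk 'BEGIN { printf "70B @ ~3.5 bpw: %.1f GB\n", 70e9 * 3.5 / 8 / 1e9 }'
# ~6.6 bits/weight for Q6_K: 32B -> ~26.4 GB
awk 'BEGIN { printf "32B @ ~6.6 bpw: %.1f GB\n", 32e9 * 6.6 / 8 / 1e9 }'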

What size LLM can I run on an RTX 5090?

Up to 70B parameter models at Q3 quantization, or 32B models at Q6 for top quality. The 5090 is the only single consumer GPU that can hold Llama 3.1 70B entirely in VRAM. For 7B-14B models, it delivers roughly 145 tok/s.
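
As a concrete sketch of the 70B case (the exact tag is an assumption; quant-specific tags vary, so verify it against the llama3.1 tag list on ollama.com):

# ~30 GB download; fits in 32 GB with modest context
ollama run llama3.1:70b-instruct-q3_K_M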

Is the RTX 5090 good for local AI in 2026?

The RTX 5090 is the best consumer GPU for local AI. At 145 tok/s on 8B models, responses feel instant, and its 32GB of VRAM unlocks 70B models that no other single consumer card can hold. The $2,499 price is justified if you need large models.

RTX 5090 vs RTX 4090 for running AI models?

The RTX 5090 is 39% faster (145 vs 104 tok/s) with 33% more VRAM (32 vs 24GB). It runs 70B models that the 4090 cannot. At similar pricing (~$2,500), the 5090 is the clear choice for new purchases in 2026.

Can the RTX 5090 replace cloud AI services?

For many use cases, yes. A 32B model at Q6 on the 5090 rivals GPT-4 quality for coding and reasoning tasks, running at interactive speed with no network latency and complete privacy. The main gap is context length: cloud models support 128K+ tokens, while a local 70B model on 32GB is practically limited to 8K-16K before the KV cache exhausts VRAM.
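
Locally, the context window is a setting rather than a hard product limit; you trade VRAM for context. A minimal sketch using Ollama's HTTP API (num_ctx is a standard Ollama option; the model tag is taken from the list above):

# Request a 16K context window for this generation
curl http://localhost:11434/api/generate -d '{
  "model": "qwen2.5:14b-instruct-q4_K_M",
  "prompt": "Summarize the following text: ...",
  "options": { "num_ctx": 16384 }
}'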
