Qwen3.5 35B-A3B Instruct
Qwen / 35B / Q4_K_M / ~20 GB
Best for: Reasoning, Coding, Agent scenarios · Pop: 90/100
Perf: ~41.4 tok/s · first token ~1.0s
Fits in 32 GB VRAM with room to spare. Best for reasoning, coding, agent scenarios on RTX 5090.
The RTX 5090 redefines what is possible with consumer GPUs. With 32GB of GDDR7 and up to 145 tokens per second on 7B-14B models, it can load 70B-parameter models at Q3 quantization. It is the ultimate card for running the largest local AI models.
32GB of GDDR7 at 1,792 GB/s is unprecedented in a consumer card. That is enough to load 32B models at Q6 or even Q8 quantization with room to spare, delivering higher quality than Q4 on smaller cards. The 5090 also handles multi-model setups: run a 14B chat model alongside a 7B coding model simultaneously. The largest 70B models fit at Q3_K_M (~30GB) with about 1GB of headroom. For AI enthusiasts who want to run the biggest models locally, this is the endgame card.
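The headroom arithmetic in the paragraph above can be sketched as a small check. This is a minimal illustration, not a sizing tool: the 31 GB usable figure and the model sizes are taken from this page, the `reserve_gb` parameter is a hypothetical knob added here, and the sketch ignores KV cache and context-length memory, which grow with prompt size.

```python
# Usable VRAM on the 32 GB card, per this page's "about 31GB is usable" figure.
USABLE_VRAM_GB = 31.0


def headroom_gb(model_sizes_gb, usable_gb=USABLE_VRAM_GB):
    """Return VRAM left over after loading every model in model_sizes_gb."""
    return usable_gb - sum(model_sizes_gb)


def fits(model_sizes_gb, usable_gb=USABLE_VRAM_GB, reserve_gb=0.0):
    """True if the models fit while leaving at least reserve_gb free.

    reserve_gb is a hypothetical safety margin (for example, for KV cache);
    it is not a figure from this page.
    """
    return headroom_gb(model_sizes_gb, usable_gb) >= reserve_gb


if __name__ == "__main__":
    # A 70B model at Q3_K_M, using the page's ~30 GB size.
    print("70B Q3_K_M headroom:", headroom_gb([30.0]), "GB")
    # Two models side by side, using the ~11 GB and ~9.5 GB listing sizes.
    print("14B + 12B headroom:", headroom_gb([11.0, 9.5]), "GB")
    print("70B with a 2 GB reserve fits:", fits([30.0], reserve_gb=2.0))
```

With only about 1 GB left over, the 70B case leaves very little room for context, so treat it as a tight fit rather than a comfortable one.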
Qwen / 27B / Q4_K_M / ~16 GB
Best for: Chat, Coding, Complex reasoning · Pop: 82/100
Perf: ~51.6 tok/s · first token ~0.4s
Fits in 32 GB VRAM with room to spare. Best for chat, coding, complex reasoning on RTX 5090.
Qwen / 27B / Q4_K_M / ~18 GB
Best for: Coding, Quality, Long context · Pop: 92/100
Perf: ~51.6 tok/s · first token ~0.4s
Fits in 32 GB VRAM with room to spare. Best for coding, quality, long context on RTX 5090.
Gemma / 26B / Q4_K_M / ~16 GB
Best for: Chat, Coding, Multimodal · Pop: 86/100
Perf: ~53.2 tok/s · first token ~0.4s
Fits in 32 GB VRAM with room to spare. Best for chat, coding, multimodal on RTX 5090.
LFM2 / 24B / Q4_K_M / ~14 GB
Best for: Local AI agents, privacy-first tool calling, MCP workflows · Pop: 80/100
Perf: ~57.0 tok/s · first token ~0.4s
Fits in 32 GB VRAM with room to spare. Best for local AI agents, privacy-first tool calling, MCP workflows on RTX 5090.
Qwen / 14B / Q4_K_M / ~11 GB
Best for: Coding, Quality · Pop: 84/100
Perf: ~90.1 tok/s · first token ~0.4s
Fits in 32 GB VRAM with room to spare. Best for coding, quality on RTX 5090.
Mistral / 12B / Q4_K_M / ~9.5 GB
Best for: Chat, Translation · Pop: 78/100
Perf: ~102.7 tok/s · first token ~0.3s
Fits in 32 GB VRAM with room to spare. Best for chat, translation on RTX 5090.
Gemma / 12B / Q4_K_M / ~9.5 GB
Best for: Chat, Quality · Pop: 76/100
Perf: ~102.7 tok/s · first token ~0.3s
Fits in 32 GB VRAM with room to spare. Best for chat, quality on RTX 5090.
Gemma / 31B / Q4_K_M / ~20 GB
Best for: Quality, Coding, Multimodal · Pop: 84/100
Perf: ~45.8 tok/s · first token ~1.0s
Fits in 32 GB VRAM with room to spare. Best for quality, coding, multimodal on RTX 5090.
Qwen / 14B / Q4_K_M / ~11 GB
Best for: Coding · Pop: 68/100
Perf: ~90.1 tok/s · first token ~0.4s
Fits in 32 GB VRAM with room to spare. Best for coding on RTX 5090.
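To turn the Perf lines above into something more tangible, the sketch below estimates how long a full reply takes: first-token latency plus the number of tokens divided by throughput. The throughput and latency values are copied from the cards; the 500-token reply length is an assumption chosen for illustration, and real runs vary with prompt length and sampling settings.

```python
def response_time_s(n_tokens: int, tok_per_s: float, first_token_s: float) -> float:
    """Estimated wall-clock seconds to stream n_tokens of output."""
    if tok_per_s <= 0:
        raise ValueError("tok_per_s must be positive")
    return first_token_s + n_tokens / tok_per_s


# (tok/s, first-token seconds) copied from the Perf lines on this page.
CARDS = {
    "35B Q4_K_M": (41.4, 1.0),
    "14B Q4_K_M": (90.1, 0.4),
    "12B Q4_K_M": (102.7, 0.3),
}

if __name__ == "__main__":
    reply_tokens = 500  # assumed reply length, not a figure from the page
    for name, (tps, ttft) in CARDS.items():
        seconds = response_time_s(reply_tokens, tps, ttft)
        print(f"{name}: ~{seconds:.1f} s for {reply_tokens} tokens")
```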
Qwen (Alibaba Cloud): Widest size range (0.5B to 235B)
Llama (Meta): Most popular open-weight model family
DeepSeek (DeepSeek AI): Best-in-class reasoning with R1 models
Mistral (Mistral AI): Excellent performance-per-parameter ratio
Gemma (Google DeepMind): Excellent quality at small sizes (1B-9B)
Phi (Microsoft): Best quality-per-gigabyte at small sizes
The RTX 5090 has 32GB GDDR7 VRAM with 1,792 GB/s bandwidth — the most of any consumer GPU. About 31GB is usable for models. This fits 70B models at Q3 quantization and 32B models at Q8 for maximum quality.
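The bandwidth figure also gives a rough ceiling on generation speed. A common rule of thumb, assumed here rather than taken from this page, is that a dense model has to read all of its weights from VRAM once per generated token, so tokens per second cannot exceed bandwidth divided by model size. The sketch below applies that rule to sizes listed on this page. Mixture-of-experts models such as the 35B-A3B read only part of their weights per token, so the rule does not apply to them directly.

```python
# Memory bandwidth of the RTX 5090 as stated on this page.
BANDWIDTH_GB_S = 1792.0


def decode_ceiling_tok_s(model_size_gb: float,
                         bandwidth_gb_s: float = BANDWIDTH_GB_S) -> float:
    """Upper bound on tokens/s for a dense model of the given size.

    Assumes every weight is read from VRAM once per token (a rule of thumb);
    measured speeds are lower because of compute and overhead.
    """
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s / model_size_gb


if __name__ == "__main__":
    # Sizes in GB taken from the listings and the 70B Q3_K_M figure on this page.
    for label, size_gb in [("12B Q4_K_M", 9.5), ("14B Q4_K_M", 11.0),
                           ("31B Q4_K_M", 20.0), ("70B Q3_K_M", 30.0)]:
        print(f"{label}: ceiling ~{decode_ceiling_tok_s(size_gb):.0f} tok/s")
```

Under this assumption, a speed like 145 tok/s is only reachable by models of roughly 12 GB or less, which is consistent with the page quoting that figure for 7B-14B models.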
Up to 70B parameter models at Q3 quantization, or 32B models at Q8 for best quality. The 5090 is the only consumer GPU that can run Llama 3.1 70B locally. For 7B-14B models, it delivers an incredible 145 tok/s.
The RTX 5090 is the absolute best consumer GPU for local AI. At up to 145 tok/s on 7B-14B models, responses feel instant. Its 32GB of VRAM unlocks 70B models that no other consumer card can handle. The $2,499 price is justified if you need large models.
The RTX 5090 is 39% faster (145 vs 104 tok/s) with 33% more VRAM (32 vs 24GB). It runs 70B models that the 4090 cannot. At similar pricing (~$2,500), the 5090 is the clear choice for new purchases in 2026.
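The percentages in this comparison follow directly from the quoted figures. A quick check, using only the numbers stated above:

```python
def percent_more(new: float, old: float) -> float:
    """Percentage by which new exceeds old."""
    return (new / old - 1.0) * 100.0


if __name__ == "__main__":
    # 145 vs 104 tok/s and 32 vs 24 GB, as quoted in the comparison.
    print(f"Speed: {percent_more(145, 104):.0f}% faster")
    print(f"VRAM:  {percent_more(32, 24):.0f}% more")
```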
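For many use cases, yes. A 32B model at Q6 on the 5090 rivals GPT-4 quality for coding and reasoning tasks, with no network round trip and complete privacy. Models around 30B run at roughly 45 tok/s at Q4 in the listings above, and a Q6 build will be somewhat slower; the 145 tok/s figure applies to 7B-14B models. The main gap is context length: cloud models support 128K+ tokens, while local 70B models are limited to 8K-16K.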