gpu optimized

Best Local AI Models for RTX 4090 (24GB)

The RTX 4090 is the current king of local AI inference. With 24GB GDDR6X and 104 tokens per second, it handles everything from small chat models to 32B parameter reasoning models with ease. The gold standard for serious AI enthusiasts.

Specifications

VRAM

24 GB GDDR6X

Speed (8B Q4)

104 tok/s

Price

$2,574

Architecture

Ada Lovelace

Bandwidth

1008 GB/s

Max Model Size

Up to 32B parameter models

Compatibility

10 excellent, 0 workable

RTX 4090 VRAM for AI: What Actually Fits?

24GB GDDR6X at 1,008 GB/s gives the RTX 4090 enormous headroom. 32B models at Q4 (~20GB) load fully with 3GB left for KV cache. 14B models at Q5 or Q6 fit easily for higher quality inference. At 104 tok/s with 8B models, the 4090 delivers near-instant responses. The only consumer card faster is the RTX 5090 (145 tok/s, 32GB). For 24GB workloads, the 4090 remains unmatched in speed.

RTX 4090 vs Similar GPUs

GPU	VRAM	Speed	Bandwidth	Price
RTX 3090	24 GB	87 tok/s	936 GB/s	$900
RTX 5080	16 GB	94 tok/s	960 GB/s	$999
RTX 5090	32 GB	145 tok/s	1792 GB/s	$2,499
RTX 4090	24 GB	104 tok/s	1008 GB/s	$2,574

Recommended Models

10 models

Llama 3.1 8B Instruct

Llama / 8B / Q4_K_M / ~6.5 GB

Best for: Chat, Coding·Pop: 94/100

Perf: ~104.0 tok/s · first token ~0.3s

Local OK//Excellent

Fits in 24 GB VRAM with room to spare. Best for chat, coding on RTX 4090.

ollama

ollama run llama3.1:8b-instruct-q4_K_M

Qwen3.5 9B Instruct

Qwen / 9B / Q4_K_M / ~7 GB

Best for: Quality, Coding, Reasoning·Pop: 86/100

Perf: ~94.1 tok/s · first token ~0.4s

Local OK//Excellent

Fits in 24 GB VRAM with room to spare. Best for quality, coding, reasoning on RTX 4090.

ollama

ollama run qwen3.5:9b-instruct-q4_K_M

Qwen3 8B

Qwen / 8B / Q4_K_M / ~6.5 GB

Best for: Chat, Coding·Pop: 88/100

Perf: ~104.0 tok/s · first token ~0.3s

Local OK//Excellent

Fits in 24 GB VRAM with room to spare. Best for chat, coding on RTX 4090.

ollama

ollama run qwen3:8b-q4_K_M

LFM2 24B-A2B Instruct

LFM2 / 24B / Q4_K_M / ~14 GB

Best for: Local AI agents, privacy-first tool calling, MCP workflows·Pop: 80/100

Perf: ~40.9 tok/s · first token ~0.4s

Local OK//Excellent

Fits in 24 GB VRAM with room to spare. Best for local ai agents, privacy-first tool calling, mcp workflows on RTX 4090.

ollama

ollama run liquidai/lfm2:24b-a2b-instruct-q4_K_M

Llama 3.1 8B Instruct (Q5)

Llama / 8B / Q5_K_M / ~8 GB

Best for: Chat, Coding·Pop: 82/100

Perf: ~89.4 tok/s · first token ~0.4s

Local OK//Excellent

Fits in 24 GB VRAM with room to spare. Best for chat, coding on RTX 4090.

ollama

ollama run llama3.1:8b-instruct-q5_K_M

Gemma 2 9B Instruct

Gemma / 9B / Q4_K_M / ~7 GB

Best for: Chat, Coding·Pop: 81/100

Perf: ~94.1 tok/s · first token ~0.4s

Local OK//Excellent

Fits in 24 GB VRAM with room to spare. Best for chat, coding on RTX 4090.

ollama

ollama run gemma2:9b-instruct-q4_K_M

Qwen3 14B

Qwen / 14B / Q4_K_M / ~11 GB

Best for: Coding, Quality·Pop: 84/100

Perf: ~64.6 tok/s · first token ~0.4s

Local OK//Excellent

Fits in 24 GB VRAM with room to spare. Best for coding, quality on RTX 4090.

ollama

ollama run qwen3:14b-q4_K_M

Qwen2.5 14B Instruct

Qwen / 14B / Q4_K_M / ~11 GB

Best for: Coding, Chat·Pop: 80/100

Perf: ~64.6 tok/s · first token ~0.4s

Local OK//Excellent

Fits in 24 GB VRAM with room to spare. Best for coding, chat on RTX 4090.

ollama

ollama run qwen2.5:14b-instruct-q4_K_M

Qwen2.5 Coder 14B

Qwen / 14B / Q4_K_M / ~11 GB

Best for: Coding·Pop: 79/100

Perf: ~64.6 tok/s · first token ~0.4s

Local OK//Excellent

Fits in 24 GB VRAM with room to spare. Best for coding on RTX 4090.

ollama

ollama run qwen2.5-coder:14b-q4_K_M

Mistral Nemo 12B

Mistral / 12B / Q4_K_M / ~9.5 GB

Best for: Chat, Translation·Pop: 78/100

Perf: ~73.7 tok/s · first token ~0.4s

Local OK//Excellent

Fits in 24 GB VRAM with room to spare. Best for chat, translation on RTX 4090.

ollama

ollama run mistral-nemo:12b-q4_K_M

Similar GPUs for Local AI

RTX 3090 (24GB · 87 tok/s)RTX 5090 (32GB · 145 tok/s)RTX 5080 (16GB · 94 tok/s)

Compatible Model Families

Qwen

Alibaba Cloud — Widest size range (0.5B to 235B)

Llama

Meta — Most popular open-weight model family

DeepSeek

DeepSeek AI — Best-in-class reasoning with R1 models

Mistral

Mistral AI — Excellent performance-per-parameter ratio

Gemma

Google DeepMind — Excellent quality at small sizes (1B-9B)

Phi

Microsoft — Best quality-per-parameter in small sizes

RTX 4090 FAQ: Common Questions

How much VRAM does the RTX 4090 have for LLMs?

The RTX 4090 has 24GB GDDR6X VRAM with 1,008 GB/s bandwidth. About 23GB is usable for models. It runs 32B models at Q4 with room for 8K+ context windows.

What size LLM can I run on an RTX 4090?

Up to 32B parameter models at Q4 quantization. This includes DeepSeek-R1 32B, Qwen 2.5 32B, and larger reasoning models. For 70B models, you need Q2 or dual GPUs — or step up to the RTX 5090.

Is the RTX 4090 worth $2,574 for AI?

For AI-only use, the RTX 3090 at ~$900 used offers 83% of the speed with the same 24GB VRAM. The 4090 is worth it if you also game at 4K or need the absolute fastest 24GB card. For pure AI value, the 3090 wins.

RTX 4090 vs RTX 5090 for AI: which is better?

The RTX 5090 is 39% faster (145 vs 104 tok/s) with 8GB more VRAM (32 vs 24GB), enabling 70B models. Priced similarly (~$2,500). If buying new in 2026, the 5090 is the clear choice.

Can the RTX 4090 run 70B models?

Not at full quality. A 70B Q4 model needs ~42GB VRAM. The 4090 has 24GB, so you would need Q2 quantization (lower quality) or run with partial CPU offloading (much slower). For 70B, the RTX 5090 (32GB) is recommended.

Related Guides & Benchmarks

Local LLMs vs Cloud Flagships

32B models on a 4090 vs GPT-4 and Claude API.

DeepSeek-V3 vs Qwen 3.5

Run the best 32B models on your RTX 4090.

Run Claude Code Free with Ollama

The 4090 makes local Claude Code fast and seamless.

Browse All NVIDIA GPUs for AI

RTX 3060 RTX 4060 Ti RTX 5060 Ti RTX 4070 RTX 4070 SUPER RTX 5070 RTX 5070 Ti RTX 4070 Ti SUPER RTX 4080 SUPER RTX 5080 RTX 3090 RTX 5090

Want Personalized Recommendations?

Use our interactive wizard to compare models across Apple Silicon and NVIDIA GPUs.

Open ModelFit Wizard →View Benchmark Tool