Llama 3.1 8B Instruct
Llama / 8B / Q4_K_M / ~6.5 GB
Best for: Chat, Coding · Popularity: 94/100
Perf: ~79.0 tok/s · first token ~0.4s
Fits in 16 GB VRAM with room to spare. A strong pick for chat and coding on the RTX 4080 SUPER.
ollama run llama3.1:8b-instruct-q4_K_M
The RTX 4080 SUPER delivers near-flagship performance with 16 GB of GDDR6X. At roughly 79 tokens per second on an 8B model at Q4_K_M, it runs 7B-14B models with excellent speed, and its high memory bandwidth makes it one of the fastest 16 GB cards available.
16 GB of GDDR6X at 736 GB/s puts the 4080 SUPER near the top of 16 GB cards for bandwidth. It handles 14B models at Q4 with ease; 8B models run at ~79 tok/s and 7B models at ~88 tok/s. That 736 GB/s sits between the RTX 5070 Ti (896 GB/s) and the RTX 4070 Ti SUPER (672 GB/s). At a street price of around $1,597, however, the value proposition is weak next to the $749 RTX 5070 Ti; the card makes most sense for existing owners or used-market buyers.
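A quick back-of-envelope check on those numbers: during decoding, each generated token streams the full set of weights from VRAM, so memory bandwidth caps throughput. A minimal sketch, using figures quoted on this page (real-world efficiency is lower than the ceiling):

```shell
# Upper bound on decode speed: tok/s <= bandwidth / model size,
# since every generated token reads all model weights once.
bandwidth_gbs=736   # RTX 4080 SUPER memory bandwidth (GB/s)
model_gb=6.5        # Llama 3.1 8B at Q4_K_M (GB)
awk -v bw="$bandwidth_gbs" -v sz="$model_gb" \
  'BEGIN { printf "theoretical ceiling: ~%.0f tok/s\n", bw / sz }'
```

The quoted ~79 tok/s is about 70% of that ~113 tok/s ceiling, which is typical once compute and kernel overhead are accounted for.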
| GPU | VRAM | Speed | Bandwidth | Price (market) |
|---|---|---|---|---|
| RTX 5080 | 16 GB | 94 tok/s | 960 GB/s | $999 |
| RTX 4070 Ti SUPER | 16 GB | 72 tok/s | 672 GB/s | $1,148 |
| RTX 4080 SUPER | 16 GB | 79 tok/s | 736 GB/s | $1,597 |
| RTX 4090 | 24 GB | 104 tok/s | 1008 GB/s | $2,574 |
Qwen / 9B / Q4_K_M / ~7 GB
Best for: Quality, Coding, Reasoning · Popularity: 86/100
Perf: ~71.5 tok/s · first token ~0.4s
Fits in 16 GB VRAM with room to spare. A strong pick for quality, coding, and reasoning on the RTX 4080 SUPER.
ollama run qwen3.5:9b-instruct-q4_K_M
Qwen / 8B / Q4_K_M / ~6.5 GB
Best for: Chat, Coding · Popularity: 88/100
Perf: ~79.0 tok/s · first token ~0.4s
Fits in 16 GB VRAM with room to spare. A strong pick for chat and coding on the RTX 4080 SUPER.
ollama run qwen3:8b-q4_K_M
Mistral / 7B / Q4_K_M / ~5.5 GB
Best for: Chat, Coding · Popularity: 90/100
Perf: ~88.5 tok/s · first token ~0.4s
Fits in 16 GB VRAM with room to spare. A strong pick for chat and coding on the RTX 4080 SUPER.
ollama run mistral:7b-instruct-q4_K_M
Qwen / 7B / Q4_K_M / ~5.5 GB
Best for: Coding · Popularity: 85/100
Perf: ~88.5 tok/s · first token ~0.4s
Fits in 16 GB VRAM with room to spare. A strong pick for coding on the RTX 4080 SUPER.
ollama run qwen2.5-coder:7b-q4_K_M
Qwen / 7B / Q4_K_M / ~5.5 GB
Best for: Chat, Coding · Popularity: 86/100
Perf: ~88.5 tok/s · first token ~0.4s
Fits in 16 GB VRAM with room to spare. A strong pick for chat and coding on the RTX 4080 SUPER.
ollama run qwen2.5:7b-instruct-q4_K_M
LFM2 / 8B / Q4_K_M / ~6 GB
Best for: Local agents, tool calling, fast chat · Popularity: 75/100
Perf: ~79.0 tok/s · first token ~0.4s
Fits in 16 GB VRAM with room to spare. A strong pick for local agents, tool calling, and fast chat on the RTX 4080 SUPER.
ollama run liquidai/lfm2:8b-a1b-instruct-q4_K_M
DeepSeek / 7B / Q4_K_M / ~5.5 GB
Best for: Reasoning, Coding · Popularity: 77/100
Perf: ~88.5 tok/s · first token ~0.4s
Fits in 16 GB VRAM with room to spare. A strong pick for reasoning and coding on the RTX 4080 SUPER.
ollama run deepseek-r1:7b-qwen-distill-q4_K_M
Llama / 8B / Q5_K_M / ~8 GB
Best for: Chat, Coding · Popularity: 82/100
Perf: ~67.9 tok/s · first token ~0.4s
Fits in 16 GB VRAM with room to spare. A strong pick for chat and coding on the RTX 4080 SUPER.
ollama run llama3.1:8b-instruct-q5_K_M
Gemma / 9B / Q4_K_M / ~7 GB
Best for: Chat, Coding · Popularity: 81/100
Perf: ~71.5 tok/s · first token ~0.4s
Fits in 16 GB VRAM with room to spare. A strong pick for chat and coding on the RTX 4080 SUPER.
ollama run gemma2:9b-instruct-q4_K_M
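All of the commands above follow the same `ollama run <model:tag>` shape. A small convenience loop (tags copied from this list) can prefetch several at once; shown here as a dry run that only prints the commands:

```shell
# Dry run: prints the pull commands; remove `echo` to actually download.
for tag in \
  llama3.1:8b-instruct-q4_K_M \
  mistral:7b-instruct-q4_K_M \
  qwen2.5-coder:7b-q4_K_M
do
  echo ollama pull "$tag"
done
```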
- Qwen (Alibaba Cloud): widest size range (0.5B to 235B)
- Llama (Meta): most popular open-weight model family
- DeepSeek (DeepSeek AI): best-in-class reasoning with R1 models
- Mistral (Mistral AI): excellent performance-per-parameter ratio
- Gemma (Google DeepMind): excellent quality at small sizes (1B-9B)
- Phi (Microsoft): best quality-per-parameter in small sizes
The RTX 4080 SUPER has 16 GB of GDDR6X VRAM with 736 GB/s of bandwidth; about 15.5 GB is usable for models after driver and desktop overhead. It has the same capacity as the 4070 Ti SUPER and 5070 Ti, with bandwidth between the two.
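Weights are not the whole story: the KV cache grows with context length and also lives in VRAM. A rough fit check, assuming Llama 3.1 8B's geometry (32 layers, 8 KV heads, head dimension 128, fp16 cache); other models will differ:

```shell
usable_gb=15.5   # usable VRAM on the 4080 SUPER
model_gb=6.5     # Q4_K_M weights for an 8B model
ctx=8192         # context window in tokens
awk -v u="$usable_gb" -v m="$model_gb" -v c="$ctx" 'BEGIN {
  # per token: K and V (x2) * layers * KV heads * head dim * 2 bytes (fp16)
  kv_per_tok = 2 * 32 * 8 * 128 * 2
  kv_gb = kv_per_tok * c / (1024 ^ 3)
  printf "weights %.1f GB + KV cache %.1f GB = %.1f GB of %.1f GB usable\n",
         m, kv_gb, m + kv_gb, u
}'
```

At an 8K context the cache adds about 1 GB, leaving comfortable headroom; very long contexts or larger models shrink that margin quickly.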
Up to 14B parameter models at Q4 quantization. This includes all popular 14B models like DeepSeek-R1 14B and Qwen 2.5 14B. At 79 tok/s, responses feel instant for chat workloads.
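The "up to 14B at Q4" limit follows from a simple size estimate: Q4_K_M averages roughly 4.85 bits per weight (an assumed figure; exact size varies by architecture), and this covers weights alone, while the per-model GB figures above also include runtime overhead:

```shell
# Approximate weight size at Q4_K_M: params (billions) * ~4.85 bits / 8.
# The 4.85 bits/weight average is an assumption for Q4_K_M's mixed scheme.
awk 'BEGIN {
  split("7 9 14", sizes, " ")
  for (i = 1; i <= 3; i++)
    printf "%2sB at Q4_K_M: ~%.1f GB of weights\n", sizes[i], sizes[i] * 4.85 / 8
}'
```

Even a 14B model's ~8.5 GB of weights leaves room for the KV cache within 15.5 GB of usable VRAM.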
It is excellent for AI performance but poor value at current prices. The RTX 5070 Ti costs less than half as much and delivers comparable speed (87 vs 79 tok/s). Buy the 4080 SUPER only on the used market at $800-900.
The RTX 5080 is 19% faster (94 vs 79 tok/s) at a lower price ($999 MSRP vs ~$1,597 street for the 4080 SUPER). Both have 16 GB of VRAM. If you already own the 4080 SUPER, the upgrade is modest; if buying new, go straight to the 5080.
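The percentage claims on this page are straightforward to check from the tok/s figures quoted above:

```shell
# Relative speedups computed from the tok/s numbers on this page.
awk 'BEGIN {
  printf "RTX 5080 vs 4080 SUPER: +%.0f%%\n",    (94 - 79) / 79 * 100
  printf "RTX 5070 Ti vs 4080 SUPER: +%.0f%%\n", (87 - 79) / 79 * 100
  printf "RTX 4090 vs 4080 SUPER: +%.0f%%\n",    (104 - 79) / 79 * 100
}'
```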
Use our interactive wizard to compare models across Apple Silicon and NVIDIA GPUs.