GPT-OSS 120B
GPT-OSS / 117B / MXFP4 / ~65.4 GB
Best for: Reasoning, Coding, Agents · Pop: 88/100
Perf: ~50.5 tok/s · first token ~1.0s
Fits in 96 GB VRAM with room to spare. Best for reasoning, coding, agents on RTX PRO 6000.
The RTX PRO 6000 Blackwell is NVIDIA's workstation flagship, built for professionals rather than gamers. Its 96GB of ECC GDDR7 is three times the RTX 5090's VRAM, at the same 1,792 GB/s bandwidth. That combination lets it hold 120B-class dense models entirely in VRAM while still running 7B-8B models as fast as the 5090.
The best local LLM for the RTX PRO 6000 is GPT-OSS 120B, which runs at ~50.5 tok/s and uses ~65.4GB of the card's 96GB VRAM. The RTX PRO 6000 handles models up to roughly 120B parameters at Q4. A 14B model runs at ~90 tok/s.
Speeds are ModelFit estimates from memory bandwidth and model size, not measured benchmarks.
*Launch MSRP. Workstation-channel pricing has since risen with AI GPU demand.
| Model Size | Est. Speed | Fit on 96GB |
|---|---|---|
| 7B | ~162 tok/s | Fits in VRAM |
| 14B | ~90 tok/s | Fits in VRAM |
| 32B | ~45 tok/s | Fits in VRAM |
| 70B | ~23 tok/s | Fits in VRAM |
ModelFit estimates from the RTX PRO 6000's 1,792 GB/s bandwidth and model size at Q4_K_M, not measured benchmarks. Sizes marked "CPU offload" exceed the 96GB VRAM and run far slower than the figure shown.
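A bandwidth-based estimate of this kind can be sketched as a simple ceiling: if every generated token requires reading all the weights once, tokens per second cannot exceed bandwidth divided by model size. This is a minimal sketch, not ModelFit's actual formula. The function names are hypothetical, and the ~42 GB size for a 70B model at Q4 is the figure quoted elsewhere on this page.

```python
def bandwidth_ceiling_tok_s(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on decode speed if each token reads all weights once."""
    if model_gb <= 0:
        raise ValueError("model_gb must be positive")
    return bandwidth_gb_s / model_gb


def implied_efficiency(listed_tok_s: float, bandwidth_gb_s: float, model_gb: float) -> float:
    """Fraction of the bandwidth ceiling that a listed estimate corresponds to."""
    return listed_tok_s / bandwidth_ceiling_tok_s(bandwidth_gb_s, model_gb)


# 70B at Q4 (~42 GB) on 1,792 GB/s; the table lists ~23 tok/s for this size.
ceiling_70b = bandwidth_ceiling_tok_s(1792, 42)
eff_70b = implied_efficiency(23, 1792, 42)
print(f"ceiling ~{ceiling_70b:.1f} tok/s, implied efficiency ~{eff_70b:.2f}")
# prints: ceiling ~42.7 tok/s, implied efficiency ~0.54
```

The gap between the ceiling and the listed figure suggests the estimates apply some discount for real-world overhead. Mixture-of-experts models such as GPT-OSS 120B read only their active experts per token, which is presumably why their listed speeds can sit above this whole-model ceiling.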
A Gen4 M.2 drive keeps your whole GGUF and quant collection on fast local storage, loading models straight off NVMe.
40Gbps external storage fast enough to run models from. Pair it with an M.2 drive for a portable model vault.
ModelFit may earn a commission on purchases through these links, at no extra cost to you. Prices shown are approximate street references.
96GB of ECC GDDR7 at 1,792 GB/s gives the RTX PRO 6000 the same per-token speed as the RTX 5090 on models that fit both cards, since throughput is bandwidth-bound and the two cards share an identical bandwidth spec. The difference is capacity: a 70B model at Q4 (~42GB) leaves over 40GB free, and models up to roughly 120B parameters at Q4 fit inside the usable budget with room for a long context window. That headroom also means the card can hold two or three mid-size models resident at once, useful for running a chat model alongside a coding model without reloading. The 600W power draw and workstation pricing put this card well outside consumer territory; it is aimed at AI builders and studios that need the largest local models on a single GPU.
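The headroom argument is plain arithmetic, sketched below under stated assumptions. The 86 GB budget is this page's "about 86GB usable" figure, the model sizes are the ones listed on this page, and the helper name is hypothetical. The sketch counts weights only and ignores KV cache and context, which take additional memory.

```python
USABLE_VRAM_GB = 86.0  # assumption: the page's "about 86GB usable" figure


def resident_headroom_gb(model_sizes_gb, budget_gb=USABLE_VRAM_GB):
    """Free GB left after loading every model; negative means a shortfall."""
    return budget_gb - sum(model_sizes_gb)


# One 70B model at Q4 (~42 GB).
single = resident_headroom_gb([42.0])

# A ~38.7 GB model kept resident next to a ~28.1 GB model.
pair = resident_headroom_gb([38.7, 28.1])

# Two large models (~65.4 GB and ~50.4 GB) do not fit together.
too_much = resident_headroom_gb([65.4, 50.4])

for label, free in [("single", single), ("pair", pair), ("too_much", too_much)]:
    status = "fits" if free >= 0 else "does not fit"
    print(f"{label}: {status}, {free:+.1f} GB")
```

A negative result means the models would need to be swapped in and out rather than held resident together.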
| Hardware | Memory | Speed (7B-8B) | Bandwidth | Price |
|---|---|---|---|---|
| Ryzen AI Max+ 395 | 110 GB | 30 tok/s | 256 GB/s | $1,999 |
| RTX 5090 | 32 GB | 145 tok/s | 1792 GB/s | $2,499 |
| RTX 4090 | 24 GB | 104 tok/s | 1008 GB/s | $2,574 |
| RTX PRO 6000 | 96 GB | 145 tok/s | 1792 GB/s | $8,565 |
Qwen / 80B / Q4_K_M / ~50.4 GB
Best for: Chat, Coding, Long Context · Pop: 80/100
Perf: ~82.7 tok/s · first token ~1.0s
Fits in 96 GB VRAM with room to spare. Best for chat, coding, long context on RTX PRO 6000.
Qwen / 122B / Q4_K_M / ~72 GB
Best for: Frontier-level reasoning, Complex tasks · Pop: 75/100
Perf: ~41.4 tok/s · first token ~1.0s
Fits in 96 GB VRAM with room to spare. Best for frontier-level reasoning, complex tasks on RTX PRO 6000.
Llama / 109B / Q4_K_M / ~67 GB
Best for: Long context, Quality, Multimodal · Pop: 86/100
Perf: ~34.7 tok/s · first token ~1.0s
Fits in 96 GB VRAM with room to spare. Best for long context, quality, multimodal on RTX PRO 6000.
Qwen / 35B / Q8_0 / ~38.7 GB
Best for: Reasoning, Coding, Agents · Pop: 88/100
Perf: ~72.8 tok/s · first token ~1.0s
Fits in 96 GB VRAM with room to spare. Best for reasoning, coding, agents on RTX PRO 6000.
Qwen / 35B / Q8_0 / ~38.7 GB
Best for: Reasoning, Coding, Agent scenarios · Pop: 90/100
Perf: ~72.8 tok/s · first token ~1.0s
Fits in 96 GB VRAM with room to spare. Best for reasoning, coding, agent scenarios on RTX PRO 6000.
Gemma / 26B / Q8_0 / ~28.1 GB
Best for: Chat, Coding, Multimodal · Pop: 86/100
Perf: ~73.1 tok/s · first token ~0.4s
Fits in 96 GB VRAM with room to spare. Best for chat, coding, multimodal on RTX PRO 6000.
Qwen / 27B / Q8_0 / ~30 GB
Best for: Coding, Quality, Long context · Pop: 92/100
Perf: ~32.0 tok/s · first token ~0.5s
Fits in 96 GB VRAM with room to spare. Best for coding, quality, long context on RTX PRO 6000.
Qwen / 80B / Q8_0 / ~84.8 GB
Best for: Chat, Coding, Long Context · Pop: 80/100
Perf: ~51.3 tok/s · first token ~1.0s
Fits in 96 GB VRAM, but with little headroom: ~84.8 GB is close to the roughly 86 GB usable after driver overhead. Best for chat, coding, long context on RTX PRO 6000.
Llama / 70B / Q6_K / ~57.9 GB
Best for: Quality, Coding · Pop: 82/100
Perf: ~17.0 tok/s · first token ~1.2s
Fits in 96 GB VRAM with room to spare. Best for quality, coding on RTX PRO 6000.
The RTX PRO 6000 tops out at roughly 120B-parameter models. For anything bigger, a rented GPU runs the same open weights with the same Ollama workflow, billed by the hour, with no hardware purchase needed.
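One way to picture "the same Ollama workflow" is that client code only needs a different host. The sketch below builds a request for Ollama's `/api/generate` HTTP endpoint using only the standard library. The host URLs and the `gpt-oss:120b` model tag are assumptions for illustration, so substitute whatever `ollama list` reports on your machine.

```python
import json
import urllib.request


def build_generate_request(host: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST to an Ollama server's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(host: str, model: str, prompt: str, timeout: float = 600.0) -> str:
    """Send the request and return the generated text (needs a running server)."""
    request = build_generate_request(host, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


# Local workstation (Ollama's default port) versus a rented machine:
# generate("http://localhost:11434", "gpt-oss:120b", "Summarize this log.")
# generate("http://<rented-host>:11434", "<larger-model-tag>", "Summarize this log.")
```

Only the host and model tag change between the two calls. A rented endpoint should be reached over a secured tunnel or private network rather than exposed directly.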
ModelFit may earn a commission on sign-ups made through these links, at no extra cost to you.
Qwen (Alibaba Cloud): Widest size range (0.5B to 235B)
Llama (Meta): Most popular open-weight model family
DeepSeek (DeepSeek AI): Best-in-class reasoning with R1 models
Mistral (Mistral AI): Excellent performance-per-parameter ratio
Gemma (Google DeepMind): Excellent quality at small sizes (1B-9B)
Phi (Microsoft): Best quality-per-gigabyte at small sizes
The RTX PRO 6000 Blackwell has 96GB of ECC GDDR7 VRAM with 1,792 GB/s bandwidth (NVIDIA, 2026). About 86GB is usable for model loading after driver overhead, enough for dense models up to roughly 120B parameters at Q4 quantization.
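The "roughly 120B at Q4" figure can be checked with a back-of-the-envelope size estimate: parameters times bits per weight, divided by eight. This is a rough sketch. The 4.8 bits-per-weight value is an assumption back-derived from this page's "70B at Q4 (~42GB)" figure, not an official constant, and real GGUF files vary by architecture and quant mix.

```python
Q4_BITS_PER_WEIGHT = 4.8   # assumption: back-derived from 70B at Q4 being ~42 GB
USABLE_VRAM_GB = 86.0      # the "about 86GB usable" figure above


def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB (weights only, no KV cache)."""
    return params_billion * bits_per_weight / 8


def headroom_gb(params_billion: float, bits_per_weight: float = Q4_BITS_PER_WEIGHT) -> float:
    """GB left for KV cache and context after loading the weights."""
    return USABLE_VRAM_GB - weight_size_gb(params_billion, bits_per_weight)


for size in (70, 120):
    gb = weight_size_gb(size, Q4_BITS_PER_WEIGHT)
    print(f"{size}B at Q4: ~{gb:.0f} GB weights, ~{headroom_gb(size):.0f} GB left")
```

Under these assumptions a 120B model leaves about 14 GB for KV cache and context, consistent with treating 120B as the practical ceiling rather than the absolute one.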
Models up to roughly 120B parameters at Q4 quantization fit inside the RTX PRO 6000's 96GB VRAM. Smaller 7B-32B models run with a large context window and headroom to keep a second model loaded at the same time.
It is a professional workstation card, not a consumer buy. At around $8,565 MSRP, it costs far more than an RTX 5090, but it is one of the few single GPUs that holds 120B-class models entirely in VRAM without splitting across multiple cards.
The RTX PRO 6000 and the RTX 5090 share the same 1,792 GB/s bandwidth, so 7B-8B token speed is nearly identical (~145 tok/s). The RTX PRO 6000 has three times the VRAM (96GB vs 32GB), letting it hold much larger models, but it costs several times more and is not built for gaming.
No special driver setup is needed for Ollama. The card uses the same NVIDIA CUDA driver stack as GeForce cards, so Ollama detects it automatically. Keep drivers current, since workstation cards often get certified driver updates on a slightly different cadence than GeForce.
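To confirm the driver sees the card before starting Ollama, a short check with `nvidia-smi` can help. This is a sketch that assumes `nvidia-smi` is on the PATH. The helper names are hypothetical, and the exact device name string depends on the driver.

```python
import subprocess


def parse_gpu_lines(text: str):
    """Parse 'name, memory_mib' CSV lines into (name, memory_mib) tuples."""
    gpus = []
    for line in text.strip().splitlines():
        name, mem = line.rsplit(",", 1)
        gpus.append((name.strip(), int(mem.strip())))
    return gpus


def query_gpus():
    """Ask the NVIDIA driver for each GPU's name and total memory in MiB."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_lines(result.stdout)


# On a machine with the driver installed:
# for name, mib in query_gpus():
#     print(f"{name}: {mib / 1024:.0f} GiB")
```

If the card shows up here, `ollama ps` should list a loaded model as running on the GPU. If it does not, the driver installation is the first thing to check.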