Mixtral 8x7B Instruct
Mistral / 46.7B / Q4_K_M / ~30 GB
Best for: Coding, Quality · Pop: 72/100
Perf: ~6.7 tok/s · first token ~1.6s
Fits in 110 GB unified memory with room to spare. Best for coding, quality on Ryzen AI Max+ 395.
The Ryzen AI Max+ 395 is AMD's answer to Apple Silicon: a single APU with 128GB of unified memory, of which roughly 110GB is GPU-addressable on Linux. That capacity lets a sub-$2,000 mini PC load 70B and even 120B-class models that no consumer GPU can hold. The trade-off is bandwidth — at 256 GB/s, dense large models load but generate slowly, so MoE models in the 30B-120B range are the sweet spot.
For the Ryzen AI Max+ 395 (~110GB of GPU-addressable unified memory), the best local LLM is Mixtral 8x7B Instruct at ~6.7 tok/s (est.). It uses ~30GB of unified memory; the Ryzen AI Max+ 395 handles models of up to 120B parameters at Q4.
Speeds are ModelFit estimates from memory bandwidth and model size, not measured benchmarks.
*128GB GMKtec EVO-X2
| Model Size | Est. Speed | Fit on 110GB |
|---|---|---|
| 7B | ~34 tok/s | Fits in unified memory |
| 14B | ~19 tok/s | Fits in unified memory |
| 32B | ~9 tok/s | Fits in unified memory |
| 70B | ~5 tok/s | Fits in unified memory |
ModelFit estimates from the Ryzen AI Max+ 395's 256 GB/s bandwidth and model size at Q4_K_M — not measured benchmarks. "CPU offload" sizes exceed the 110GB unified memory and run far slower than the figure shown.
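A bandwidth-based estimate of this kind can be sketched as follows. This is not ModelFit's actual formula, which the page does not give; it assumes each generated token reads the full set of weights once, and the `efficiency` factor of 0.8 is a guess chosen only for illustration. The model sizes come from the cards on this page.

```python
# Rough bandwidth-bound estimate: if every generated token reads all weights
# once, tokens/sec is at most (memory bandwidth) / (model size in memory).
# Illustrative approximation only, not ModelFit's actual formula.

def estimate_tok_per_sec(model_size_gb: float,
                         bandwidth_gb_s: float = 256.0,
                         efficiency: float = 0.8) -> float:
    """Estimate decode speed; `efficiency` is an assumed fudge factor."""
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s * efficiency / model_size_gb


# Sizes in GB at Q4_K_M, taken from the model cards on this page.
for label, size_gb in [("70B", 42.0), ("46.7B", 30.0), ("27B", 16.0)]:
    print(f"{label}: ~{estimate_tok_per_sec(size_gb):.1f} tok/s")
```

With these assumptions the results land near, but not exactly on, the page's figures, which suggests the real estimator applies additional corrections.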
ModelFit may earn a commission on purchases made through these links, at no extra cost to you. Recommendations are based on local-AI performance, not commissions.
Unlike a discrete GPU's fixed VRAM, Strix Halo shares one 128GB LPDDR5x pool between CPU and GPU. On Linux, kernel GTT tuning exposes about 110GB of that to the Radeon 8060S iGPU — more than triple an RTX 5090's 32GB. A 70B model at Q4 (~42GB) or a 120B MoE (~65GB) fits with headroom. The catch is memory bandwidth: 256 GB/s (about 215 GB/s measured) is a fraction of a discrete GPU's, and since token generation is bandwidth-bound, dense 70B models run around 5 tok/s. Mixture-of-Experts models — which activate only a few billion parameters per token — are where this chip shines, hitting 50-70+ tok/s. On Windows the GPU is capped at a fixed BIOS allocation with no equivalent shared pool, so the big-model capability is mainly a Linux story today.
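The capacity and bandwidth arithmetic in this paragraph can be sketched as below. The 0.6 GB per billion parameters is derived from the text's "70B at Q4 (~42GB)"; the 8 GB headroom for KV cache and runtime is an assumed value, and the function names are hypothetical. The decode figure is a ceiling that counts only weight reads, so real MoE speeds (the 50-70+ tok/s quoted above) sit well below it.

```python
# Figures from the text above.
GPU_ADDRESSABLE_GB = 110.0        # Linux, with GTT tuning
MEASURED_BANDWIDTH_GB_S = 215.0   # measured, versus 256 GB/s nominal

# Assumption: 70B at Q4 is ~42 GB, i.e. ~0.6 GB per billion parameters.
GB_PER_BILLION_PARAMS_Q4 = 0.6


def fits(total_params_b: float, headroom_gb: float = 8.0) -> bool:
    """True if the weights plus an assumed headroom fit in GPU-addressable memory."""
    weights_gb = total_params_b * GB_PER_BILLION_PARAMS_Q4
    return weights_gb + headroom_gb <= GPU_ADDRESSABLE_GB


def decode_ceiling_tok_s(active_params_b: float) -> float:
    """Upper bound on decode speed: bandwidth / bytes of weights read per token.

    For a dense model, active parameters equal total parameters; for an MoE
    model, only the experts selected for each token are read.
    """
    return MEASURED_BANDWIDTH_GB_S / (active_params_b * GB_PER_BILLION_PARAMS_Q4)


print("70B fits:", fits(70), "| 235B fits:", fits(235))
print(f"dense 70B ceiling: ~{decode_ceiling_tok_s(70):.1f} tok/s")
print(f"MoE with 3B active ceiling: ~{decode_ceiling_tok_s(3):.0f} tok/s")
```

Memory capacity depends on total parameters, while decode speed depends on active parameters, which is why a large MoE model can both fit and run quickly on this chip.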
Qwen / 35B / Q4_K_M / ~22 GB
Best for: Reasoning, Coding, Agents · Pop: 88/100
Perf: ~8.6 tok/s · first token ~1.5s
Fits in 110 GB unified memory with room to spare. Best for reasoning, coding, agents on Ryzen AI Max+ 395.
Qwen / 35B / Q4_K_M / ~20 GB
Best for: Reasoning, Coding, Agent scenarios · Pop: 90/100
Perf: ~8.6 tok/s · first token ~1.5s
Fits in 110 GB unified memory with room to spare. Best for reasoning, coding, agent scenarios on Ryzen AI Max+ 395.
Qwen / 27B / Q4_K_M / ~16 GB
Best for: Chat, Coding, Complex reasoning · Pop: 82/100
Perf: ~10.7 tok/s · first token ~0.8s
Fits in 110 GB unified memory with room to spare. Best for chat, coding, complex reasoning on Ryzen AI Max+ 395.
Qwen / 27B / Q4_K_M / ~18 GB
Best for: Coding, Quality, Long context · Pop: 92/100
Perf: ~10.7 tok/s · first token ~0.8s
Fits in 110 GB unified memory with room to spare. Best for coding, quality, long context on Ryzen AI Max+ 395.
Gemma / 26B / Q4_K_M / ~16 GB
Best for: Chat, Coding, Multimodal · Pop: 86/100
Perf: ~11.0 tok/s · first token ~0.8s
Fits in 110 GB unified memory with room to spare. Best for chat, coding, multimodal on Ryzen AI Max+ 395.
Llama / 109B / Q4_K_M / ~67 GB
Best for: Long context, Quality, Multimodal · Pop: 86/100
Perf: ~3.3 tok/s · first token ~2.4s
Fits in 110 GB unified memory at ~67 GB. Expect reduced speed compared to smaller models.
Gemma / 31B / Q4_K_M / ~20 GB
Best for: Quality, Coding, Multimodal · Pop: 84/100
Perf: ~9.5 tok/s · first token ~1.4s
Fits in 110 GB unified memory with room to spare. Best for quality, coding, multimodal on Ryzen AI Max+ 395.
Qwen / 30B / Q4_K_M / ~22 GB
Best for: Quality, Coding · Pop: 78/100
Perf: ~9.8 tok/s · first token ~1.4s
Fits in 110 GB unified memory with room to spare. Best for quality, coding on Ryzen AI Max+ 395.
Llama / 70B / Q4_K_M / ~42 GB
Best for: Quality, Coding · Pop: 82/100
Perf: ~4.7 tok/s · first token ~2.0s
Fits in 110 GB unified memory at ~42 GB. Expect reduced speed compared to smaller models.
The Ryzen AI Max+ 395 tops out at around 120B-parameter models. For anything bigger, an hourly rented GPU runs the same open weights with the same Ollama workflow: no hardware purchase, billed by the hour.
RunPod: Hourly GPU pods (RTX 4090 to H100) with one-click Ollama/vLLM templates.
Vast.ai: Marketplace of rented GPUs — usually the cheapest per-hour prices.
ModelFit may earn a commission on sign-ups made through these links, at no extra cost to you.
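As an illustration of how the same Ollama workflow carries over, the sketch below sends one prompt to an Ollama server and derives a measured decode speed from the response. It assumes a server is reachable at the given host with the model already pulled; the host URL and model tag in the example are placeholders, and the helper names are hypothetical. Only the host changes between a local machine and a rented pod.

```python
import json
import urllib.request


def tokens_per_second(result: dict) -> float:
    """Decode speed from Ollama's response fields (eval_duration is in nanoseconds)."""
    return result["eval_count"] / (result["eval_duration"] / 1e9)


def generate(host: str, model: str, prompt: str) -> dict:
    """Send a non-streaming request to an Ollama server's /api/generate endpoint."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        f"{host}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=600) as response:
        return json.load(response)


# Example (requires a running server; host and model tag are placeholders):
# result = generate("http://localhost:11434", "mixtral:8x7b", "Explain MoE briefly.")
# print(result["response"])
# print(f"{tokens_per_second(result):.1f} tok/s")
```

A measurement like this is also a way to check the estimated speeds on this page against real hardware.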
Qwen (Alibaba Cloud): Widest size range (0.5B to 235B)
Llama (Meta): Most popular open-weight model family
DeepSeek (DeepSeek AI): Best-in-class reasoning with R1 models
Mistral (Mistral AI): Excellent performance-per-parameter ratio
Gemma (Google DeepMind): Excellent quality at small sizes (1B-9B)
Phi (Microsoft): Best quality-per-gigabyte at small sizes
Up to 120B-parameter models. Its 128GB unified memory (~110GB GPU-addressable on Linux) holds a 70B model at Q4 (~42GB) or a 120B MoE (~65GB) with room to spare — far beyond any consumer GPU. Mixture-of-Experts models in the 30B-120B range run best.
It depends on the model type. Dense 70B models generate around 5 tok/s because the 256 GB/s memory bandwidth is the bottleneck. MoE models like Qwen3 30B-A3B or gpt-oss-120b run much faster — 50-70+ tok/s — since only a few billion parameters are active per token. All figures are estimates.
They win on different axes. The RTX 5090 (32GB, 1,792 GB/s) is far faster per token for models that fit in 32GB. The Ryzen AI Max+ 395 (110GB usable, 256 GB/s) is slower but holds models 3x larger. AMD's claim of up to 3x the performance of a 5090-class card applies only when a model exceeds the Nvidia card's VRAM and spills to system RAM.
For the largest models, effectively yes. On Linux, GTT kernel tuning lets the GPU address roughly 110GB of the 128GB pool. On Windows the GPU is limited to a fixed BIOS memory carve-out with no equivalent shared pool, so the very-large-model capability is mainly a Linux feature today.
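On Linux, one way to check how much memory the GPU can actually address is to read the GTT total that the amdgpu driver exposes in sysfs. The sketch below assumes the iGPU is `card0` (the index can differ per system) and that the driver provides `mem_info_gtt_total` in bytes; the helper names are hypothetical.

```python
from pathlib import Path


def bytes_to_gib(num_bytes: int) -> float:
    """Convert a byte count to GiB."""
    return num_bytes / 2**30


def read_gtt_total_gib(card: str = "card0") -> float:
    """Read the GTT size reported by amdgpu for the given DRM card.

    Assumes the amdgpu sysfs attribute `mem_info_gtt_total` exists and holds
    a byte count; the card index is system-dependent.
    """
    path = Path("/sys/class/drm") / card / "device" / "mem_info_gtt_total"
    return bytes_to_gib(int(path.read_text().strip()))


# Example (Linux with the amdgpu driver only):
# print(f"GPU-addressable GTT: {read_gtt_total_gib():.1f} GiB")
```

If the reported value is far below the roughly 110GB described above, the GTT limit has probably not been tuned yet.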
The 128GB GMKtec EVO-X2 launched around $1,999, with street prices roughly $1,800-$2,300. AMD's own first-party dev kit is reported at $3,999. The cheaper $1,499 EVO-X2 is the 64GB version, which cannot hold a 235B model.