Best Local AI Models for AMD Ryzen AI Max+ 395 (Strix Halo)

The Ryzen AI Max+ 395 is AMD's answer to Apple Silicon: a single APU with 128GB of unified memory, of which roughly 110GB is GPU-addressable on Linux. That capacity lets a sub-$2,000 mini PC load 70B and even 120B-class models that no consumer GPU can hold. The trade-off is bandwidth — at 256 GB/s, dense large models load but generate slowly, so MoE models in the 30B-120B range are the sweet spot.

Quick answer

For the Ryzen AI Max+ 395 (about 110GB of GPU-addressable unified memory), the top local LLM pick is Mixtral 8x7B Instruct at an estimated ~6.7 tok/s. It needs ~30GB of unified memory, and the chip can load models of up to 120B parameters at Q4.

$ ollama run mixtral:8x7b

Top pick: Mixtral 8x7B Instruct
Est. speed: ~6.7 tok/s
Memory needed: ~30 GB

Speeds are ModelFit estimates from memory bandwidth and model size, not measured benchmarks.
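Because these figures are estimates, it can be worth measuring your own machine. The sketch below assumes an Ollama server is running on its default local port and that the model has already been pulled. It reads the `eval_count` and `eval_duration` timing fields that Ollama returns with a non-streaming `/api/generate` response. The prompt text and the helper names are illustrative, not part of ModelFit's method.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def tokens_per_second(response: dict) -> float:
    """Generation speed from the timing fields of a non-streaming Ollama response."""
    # eval_count is the number of generated tokens; eval_duration is in nanoseconds.
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def measure(model: str = "mixtral:8x7b",
            prompt: str = "Explain unified memory in one paragraph.") -> float:
    """Send one prompt to a running Ollama server and return the measured tok/s."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as reply:
        return tokens_per_second(json.load(reply))


# Requires a running Ollama server, so the call is left commented out:
# print(f"{measure():.1f} tok/s")
```

A single short prompt gives a noisy number. Averaging several runs with a longer prompt is likely to be more representative.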

Unified memory: 110 GB unified LPDDR5x
Speed (8B Q4): 30 tok/s
Bandwidth: 256 GB/s
Architecture: Zen 5 + RDNA 3.5
Price: $1,999*
Max model size: Up to 120B parameter models
Compatibility: 8 excellent, 2 workable

*128GB GMKtec EVO-X2

Ryzen AI Max+ 395 Estimated Tokens/sec by Model Size

Q4_K_M · ModelFit estimate
Model size   Est. speed   Fit on 110GB
7B           ~34 tok/s    Fits in unified memory
14B          ~19 tok/s    Fits in unified memory
32B          ~9 tok/s     Fits in unified memory
70B          ~5 tok/s     Fits in unified memory

ModelFit estimates from the Ryzen AI Max+ 395's 256 GB/s bandwidth and model size at Q4_K_M — not measured benchmarks. "CPU offload" sizes exceed the 110GB unified memory and run far slower than the figure shown.
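To see why bandwidth and model size drive these numbers, a rough upper bound helps. The sketch below assumes a dense model must read all of its weights from memory once per generated token, so speed cannot exceed bandwidth divided by model size. This is a simplified ceiling, not ModelFit's formula, and the function name is illustrative. The 42 GB size, 256 GB/s spec bandwidth and 215 GB/s measured bandwidth are the article's own figures.

```python
def decode_ceiling_tok_s(model_gb: float, bandwidth_gb_s: float = 256.0) -> float:
    """Upper bound on tok/s if every token requires one full pass over the weights."""
    return bandwidth_gb_s / model_gb


# A dense 70B model at Q4 (~42 GB) against spec and measured bandwidth.
for label, bandwidth in (("spec 256 GB/s", 256.0), ("measured 215 GB/s", 215.0)):
    print(f"{label}: at most {decode_ceiling_tok_s(42.0, bandwidth):.1f} tok/s")
```

Real throughput lands below this ceiling because of compute, cache and runtime overheads, which is consistent with the table's estimates sitting under the bound.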

Where to Buy the Ryzen AI Max+ 395

$1,999 · 128GB GMKtec EVO-X2

ModelFit may earn a commission on purchases made through these links, at no extra cost to you. Recommendations are based on local-AI performance, not commissions.

Ryzen AI Max+ 395 Unified Memory for AI: What Actually Fits?

Unlike a discrete GPU's fixed VRAM, Strix Halo shares one 128GB LPDDR5x pool between CPU and GPU. On Linux, kernel GTT tuning exposes about 110GB of that to the Radeon 8060S iGPU, more than triple an RTX 5090's 32GB. A 70B model at Q4 (~42GB) or a 120B MoE (~65GB) fits with headroom.

The catch is memory bandwidth: 256 GB/s on paper (about 215 GB/s measured), roughly a seventh of an RTX 5090's 1,792 GB/s. Token generation is bandwidth-bound, so dense 70B models run at around 5 tok/s. Mixture-of-Experts models, which activate only a few billion parameters per token, are where this chip shines, hitting 50-70+ tok/s. On Windows the GPU is capped at a fixed BIOS allocation with no equivalent shared pool, so the big-model capability is mainly a Linux story today.
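The fit arithmetic can be sketched in a few lines. The example below takes the article's 110 GB usable figure and model sizes. The 8 GB reserve for context (KV cache) and runtime buffers is a placeholder assumption, not a measured value, and the real reserve depends on context length and runtime.

```python
USABLE_GB = 110.0  # GPU-addressable memory on Linux, per the article


def headroom_gb(model_gb: float, reserve_gb: float = 8.0,
                usable_gb: float = USABLE_GB) -> float:
    """Memory left after loading the weights and reserving space for context."""
    # reserve_gb is an assumed placeholder for KV cache and runtime buffers.
    return usable_gb - model_gb - reserve_gb


def fits(model_gb: float, reserve_gb: float = 8.0,
         usable_gb: float = USABLE_GB) -> bool:
    """True if the weights plus the reserve fit in the usable pool."""
    return headroom_gb(model_gb, reserve_gb, usable_gb) >= 0


for name, size_gb in [("70B dense at Q4", 42.0), ("120B MoE", 65.0)]:
    print(f"{name}: fits={fits(size_gb)}, headroom={headroom_gb(size_gb):.0f} GB")

# The same 70B model against a 32 GB discrete card.
print("70B on a 32 GB card:", fits(42.0, usable_gb=32.0))
```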

Ryzen AI Max+ 395 vs Top GPUs

Hardware            Memory   Speed (8B Q4)   Bandwidth     Price
Ryzen AI Max+ 395   110 GB   30 tok/s        256 GB/s      $1,999
RTX 5090            32 GB    145 tok/s       1,792 GB/s    $2,499
RTX 4090            24 GB    104 tok/s       1,008 GB/s    $2,574

Recommended Models

registry-verified · 10 models
01

Mixtral 8x7B Instruct

Mistral / 46.7B / Q4_K_M / ~30 GB

Best for: Coding, Quality · Pop: 72/100

Perf: ~6.7 tok/s · first token ~1.6s

Local OK · Excellent

Fits in 110 GB unified memory with room to spare. Best for coding and quality on the Ryzen AI Max+ 395.

ollama · registry-verified
02

Qwen3.6 35B-A3B

Qwen / 35B / Q4_K_M / ~22 GB

Best for: Reasoning, Coding, Agents · Pop: 88/100

Perf: ~8.6 tok/s · first token ~1.5s

Local OK · Excellent

Fits in 110 GB unified memory with room to spare. Best for reasoning, coding, and agents on the Ryzen AI Max+ 395.

ollama · registry-verified
03

Qwen3.5 35B-A3B Instruct

Qwen / 35B / Q4_K_M / ~20 GB

Best for: Reasoning, Coding, Agent scenarios · Pop: 90/100

Perf: ~8.6 tok/s · first token ~1.5s

Local OK · Excellent

Fits in 110 GB unified memory with room to spare. Best for reasoning, coding, and agent scenarios on the Ryzen AI Max+ 395.

ollama · registry-verified
04

Qwen3.5 27B Instruct

Qwen / 27B / Q4_K_M / ~16 GB

Best for: Chat, Coding, Complex reasoning · Pop: 82/100

Perf: ~10.7 tok/s · first token ~0.8s

Local OK · Excellent

Fits in 110 GB unified memory with room to spare. Best for chat, coding, and complex reasoning on the Ryzen AI Max+ 395.

ollama · registry-verified
05

Qwen3.6 27B

Qwen / 27B / Q4_K_M / ~18 GB

Best for: Coding, Quality, Long context · Pop: 92/100

Perf: ~10.7 tok/s · first token ~0.8s

Local OK · Excellent

Fits in 110 GB unified memory with room to spare. Best for coding, quality, and long context on the Ryzen AI Max+ 395.

ollama · registry-verified
06

Gemma 4 26B-A4B

Gemma / 26B / Q4_K_M / ~16 GB

Best for: Chat, Coding, Multimodal · Pop: 86/100

Perf: ~11.0 tok/s · first token ~0.8s

Local OK · Excellent

Fits in 110 GB unified memory with room to spare. Best for chat, coding, and multimodal work on the Ryzen AI Max+ 395.

ollama · registry-verified
07

Llama 4 Scout

Llama / 109B / Q4_K_M / ~67 GB

Best for: Long context, Quality, Multimodal · Pop: 86/100

Perf: ~3.3 tok/s · first token ~2.4s

Local slow · Excellent

At ~67 GB, fits in 110 GB unified memory, but expect slower generation than the smaller models in this list.

ollama · registry-verified
08

Gemma 4 31B

Gemma / 31B / Q4_K_M / ~20 GB

Best for: Quality, Coding, Multimodal · Pop: 84/100

Perf: ~9.5 tok/s · first token ~1.4s

Local OK · Excellent

Fits in 110 GB unified memory with room to spare. Best for quality, coding, and multimodal work on the Ryzen AI Max+ 395.

ollama · registry-verified
09

Qwen3 30B

Qwen / 30B / Q4_K_M / ~22 GB

Best for: Quality, Coding · Pop: 78/100

Perf: ~9.8 tok/s · first token ~1.4s

Local OK · Excellent

Fits in 110 GB unified memory with room to spare. Best for quality and coding on the Ryzen AI Max+ 395.

ollama · registry-verified
10

Llama 3.3 70B Instruct

Llama / 70B / Q4_K_M / ~42 GB

Best for: Quality, Coding · Pop: 82/100

Perf: ~4.7 tok/s · first token ~2.0s

Local slow · Excellent

At ~42 GB, fits in 110 GB unified memory, but as a dense 70B model it is bandwidth-bound, so expect slow generation.

ollama · registry-verified

Models Too Big for 110GB? Rent a Cloud GPU


The Ryzen AI Max+ 395 tops out at around 120B-parameter models. For anything bigger, a GPU rented by the hour runs the same open weights with the same Ollama workflow, with no hardware purchase.

RunPod: Hourly GPU pods (RTX 4090 to H100) with one-click Ollama/vLLM templates.

Vast.ai: Marketplace of rented GPUs — usually the cheapest per-hour prices.

ModelFit may earn a commission on sign-ups made through these links, at no extra cost to you.

Ryzen AI Max+ 395 FAQ: Common Questions

What size LLM can the Ryzen AI Max+ 395 run?

Up to 120B-parameter models. Its 128GB unified memory (~110GB GPU-addressable on Linux) holds a 70B model at Q4 (~42GB) or a 120B MoE (~65GB) with room to spare — far beyond any consumer GPU. Mixture-of-Experts models in the 30B-120B range run best.

How fast is the Ryzen AI Max+ 395 for local AI?

It depends on the model type. Dense 70B models generate around 5 tok/s because the 256 GB/s memory bandwidth is the bottleneck. MoE models like Qwen3 30B-A3B or gpt-oss-120b run much faster — 50-70+ tok/s — since only a few billion parameters are active per token. All figures are estimates.
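The gap between dense and MoE speeds follows from how many weights are read per token. The sketch below assumes speed is capped by bandwidth divided by the bytes of active weights, and it derives a rough 0.6 GB per billion parameters from the article's own 70B at ~42 GB figure. The 3B active-parameter count is taken from the "A3B" in the model name. Both the per-parameter size and the function name are illustrative assumptions.

```python
GB_PER_BILLION_PARAMS = 42.0 / 70.0  # derived from the article's 70B model at ~42 GB (Q4)


def ceiling_tok_s(active_params_b: float, bandwidth_gb_s: float = 256.0) -> float:
    """Bandwidth-bound upper limit on tok/s for a given count of active parameters."""
    active_gb = active_params_b * GB_PER_BILLION_PARAMS
    return bandwidth_gb_s / active_gb


dense_70b = ceiling_tok_s(70.0)  # dense: all 70B parameters are read for each token
moe_a3b = ceiling_tok_s(3.0)     # MoE: roughly 3B active parameters per token

print(f"dense 70B ceiling: {dense_70b:.1f} tok/s")
print(f"3B-active MoE ceiling: {moe_a3b:.1f} tok/s")
```

These are ceilings, not predictions. Expert routing, attention over the context and runtime overhead all reduce real MoE throughput, which is why the quoted 50-70+ tok/s sits well under the theoretical limit.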

Ryzen AI Max+ 395 vs RTX 5090 for local LLMs?

They win on different axes. The RTX 5090 (32GB, 1,792 GB/s) is far faster per token for models that fit in 32GB. The Ryzen AI Max+ 395 (110GB usable, 256 GB/s) is slower but holds models about 3x larger. AMD's claim of up to 3x the performance of a 5090-class card applies only when a model exceeds the Nvidia card's VRAM and spills to system RAM.

Do I need Linux to run large models on Strix Halo?

For the largest models, effectively yes. On Linux, GTT kernel tuning lets the GPU address roughly 110GB of the 128GB pool. On Windows the GPU is limited to a fixed BIOS memory carve-out with no equivalent shared pool, so the very-large-model capability is mainly a Linux feature today.
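On Linux you can check how much GTT memory the driver currently exposes. The sketch below assumes the amdgpu driver's `mem_info_gtt_total` sysfs file is present under `/sys/class/drm`. The card index varies between systems, so it scans all cards. The file reports bytes, and the conversion here is to GiB, which may differ slightly from the article's rounded "GB" figure.

```python
from pathlib import Path


def bytes_to_gib(n: int) -> float:
    """Convert a byte count to GiB."""
    return n / 2**30


def gtt_totals(drm_root: str = "/sys/class/drm") -> dict[str, float]:
    """Map each DRM card that reports a GTT total to its size in GiB."""
    totals = {}
    # amdgpu exposes the GTT size in bytes in this per-device sysfs file.
    for entry in sorted(Path(drm_root).glob("card*/device/mem_info_gtt_total")):
        card = entry.parts[-3]  # e.g. "card0"
        totals[card] = bytes_to_gib(int(entry.read_text().strip()))
    return totals


for card, gib in gtt_totals().items():
    print(f"{card}: {gib:.1f} GiB GTT")
```

On a machine without an amdgpu device the function returns an empty mapping and prints nothing.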

How much does a 128GB Strix Halo mini PC cost?

The 128GB GMKtec EVO-X2 launched around $1,999, with street prices roughly $1,800-$2,300. AMD's own first-party dev kit is reported at $3,999. The cheaper $1,499 EVO-X2 is the 64GB version, which is too small for the largest models covered here, such as a 120B MoE (~65GB).
