AMD Strix Halo: 128GB Local AI Under $2,000 (2026)

TL;DR: The AMD Ryzen AI Max+ 395 ("Strix Halo") puts 128GB of unified memory in a mini PC, letting it load models up to ~200 billion parameters locally, something no consumer GPU can do. On Linux it exposes about 108GB to the GPU (CraftRigs), versus 24-32GB on an RTX 4090/5090. The catch: its 256 GB/s memory bandwidth trails Apple's M-series (546-819 GB/s), so big models run but generate slowly (~11 tok/s on a 235B model, per Tech Times). At ~$1,999 for the 128GB config, it is the cheapest way to run frontier-size models at home.

Illustration of a compact mini PC representing the AMD Ryzen AI Max+ 395 Strix Halo platform

The AMD Ryzen AI Max+ 395 (Strix Halo) packs 128GB of unified memory into a book-sized mini PC.

AMD just made a move that matters for anyone running AI at home. At the CES 2026 keynote on January 6, CEO Lisa Su held up a mini PC and called it "the smallest AI development system in the world, capable of running models with up to 200 billion parameters locally" (Rev transcript). No datacenter. No cloud GPU rental. A box the size of a thick book.

The chip behind it, the Ryzen AI Max+ 395, is the first x86 part to fuse a fast CPU, a large GPU, and 128GB of shared memory on one package. That unified-memory design is exactly what Apple has shipped since the M1, and it is what lets these machines hold models that would never fit in a discrete graphics card. This guide covers what the chip actually does for local LLMs, where the hype overshoots, and how it stacks up against the Apple Silicon Macs that pioneered the approach. For sizing any model to any memory budget, see our how much RAM for a local LLM guide.

What Is the Ryzen AI Max+ 395 ("Strix Halo")?

Strix Halo is an APU, one chip carrying both processor and graphics, built for AI workloads. The headline specs:

Spec	Ryzen AI Max+ 395
CPU	16 Zen 5 cores / 32 threads
GPU	Radeon 8060S, 40 compute units (RDNA 3.5)
NPU	XDNA 2, 50 TOPS
Memory	Up to 128GB LPDDR5x-8000, unified
Memory bandwidth	256 GB/s (theoretical)
Default TDP	55W (configurable 45-120W)

The NPU figure of 50 TOPS comes from AMD's own announcement (AMD blog). The part that matters for local LLMs is not the NPU, though; it is the 128GB of memory the GPU can borrow from.

Why Unified Memory Beats a Big GPU for Large Models

A local LLM has to fit its weights in memory the accelerator can reach. On a normal PC, that means GPU VRAM, and VRAM is small and fixed.

Hardware	Memory the GPU can use
RTX 4090	24 GB
RTX 5080	16 GB
RTX 5090	32 GB
Strix Halo (Linux)	~108 GB

On Linux, kernel tuning lets the Radeon 8060S address roughly 108GB of the 128GB pool through a shared GTT allocation (CraftRigs). That is more than triple an RTX 5090's 32GB. On Windows the GPU is capped at whatever you carve out in the BIOS, with no equivalent shared pool, so the big-model story is largely a Linux story today.

This capacity is the whole pitch. A 70B model at 4-bit needs about 42GB; a 120B model needs ~65GB. Neither fits on any single consumer Nvidia card. Both fit on Strix Halo with room to spare.

How Fast Is It, Really?

Here is where expectations need calibrating. Loading a model is not the same as running it quickly. Token generation is bound by memory bandwidth, and Strix Halo's 256 GB/s (about 215 GB/s measured in practice, per Tech Times) is modest next to a discrete GPU's or a high-end Mac's.

The result: dense large models load but crawl, while Mixture-of-Experts (MoE) models, which activate only a few billion parameters per token, fly.

Model	Type	Est. generation speed	Source
Qwen3 235B (Q3)	MoE, huge	~11 tok/s	Tech Times
Qwen3 30B-A3B	MoE, 3B active	~72 tok/s	Level1Techs
gpt-oss-120b	MoE	~52 tok/s	fromthematrix
Llama 3 70B (Q4)	Dense	~5 tok/s	Level1Techs

Third-party community measurements, not ModelFit benchmarks. Speeds vary by quantization, backend (ROCm/Vulkan), and context length.

The takeaway: a 235B model running at ~11 tok/s is a genuine novelty, but it is reading-speed slow. The sweet spot for daily use is MoE models in the 30B-120B range, where Strix Halo delivers 50-70+ tokens per second, fast enough for interactive chat and agents.

Watch the benchmark wording. Some viral charts cite "144 tok/s on Qwen3 235B." That is prompt processing (how fast it reads your input), not generation (how fast it writes the answer). Generation on that model is ~11 tok/s. The two are routinely conflated.

The "3x Faster Than an RTX 5080" Claim, Explained

AMD published a benchmark in March 2025 showing the Ryzen AI Max+ 395 running DeepSeek R1 up to 3x faster than an RTX 5080 (TweakTown). True, with an asterisk.

That gap only appears once the model exceeds the RTX 5080's 16GB of VRAM. Below 16GB the Nvidia card is far faster. Above it, the 5080 has to spill weights into slow system RAM and falls off a cliff, while Strix Halo keeps everything in its 128GB pool. This is an AMD-run, vendor benchmark, and the sources do not state the exact DeepSeek R1 size or quantization, so treat the multiplier as a capacity story, not a raw-compute one. Strix Halo wins by fitting the model, not by being a faster chip.

Strix Halo vs Apple Silicon: The Honest Comparison

Apple shipped unified memory five years before Strix Halo. For our Mac-owning readers, the fair question is how AMD's version compares, and the answer splits cleanly between price, capacity, and bandwidth.

Bar chart comparing memory bandwidth of AMD Strix Halo, Apple M4 Max, M2 Ultra, and M3 Ultra in GB per second

Memory bandwidth sets local LLM token speed. Strix Halo's 256 GB/s trails every Apple Ultra and Max chip. Manufacturer specs.

Chip	Max unified memory	Memory bandwidth	Notes
AMD Strix Halo	128 GB	256 GB/s	~$1,999 mini PC
Apple M4 Max	128 GB	546 GB/s	top configuration
Apple M2 Ultra	192 GB	800 GB/s	Mac Studio
Apple M3 Ultra	512 GB	819 GB/s	Mac Studio

Sources: AMD/TechPowerUp, Apple M4 Max, Apple M2 Ultra, Apple Mac Studio specs.

Three honest conclusions:

On price, AMD wins decisively. A 128GB Strix Halo mini PC runs about $1,999. A 128GB M4 Max Mac Studio costs roughly twice that, and a 512GB M3 Ultra costs far more.
On capacity, it ties the M4 Max (both 128GB) but loses to the Ultra chips; only an M3 Ultra (512GB) can hold dramatically larger models.
On bandwidth, the metric that sets token speed, Apple wins everywhere. Even the two-generation-old M2 Ultra moves data ~3x faster. That is why a Mac of equal memory generates tokens faster than Strix Halo.

For a Mac owner, the verdict is plain: if you already run an M-series machine with enough RAM, Strix Halo offers little upside. Its real audience is price-sensitive builders who need 100GB-class model capacity cheaply and are comfortable on Linux/ROCm. See our Mac Studio local AI guide for the Apple side of this trade.

The Cost-vs-Cloud Math

The pitch beyond privacy is economics. Frontier AI subscriptions stack up: Claude, ChatGPT Pro, and Cursor can each run $20-$200 a month. A $1,999 box that never sends a token off your machine pays for itself in well under a year against a heavy subscription stack, and it removes token limits, rate caps, and 3 a.m. service outages.

The honest caveat: a local 235B model at ~11 tok/s is not a drop-in replacement for a frontier cloud model on hard reasoning tasks. Where local hardware shines is private RAG over your own documents, prototyping, local coding agents, and offline workflows: use cases where privacy and unlimited usage matter more than absolute peak quality.

Should You Buy One?

You are...	Verdict
A Mac owner with 64GB+ unified memory	Skip it: you already have faster unified memory
A budget builder who needs 70B-120B models locally	Strong buy at ~$1,999, especially on Linux
Chasing maximum token speed	Get a discrete GPU (for models ≤24GB) instead
Privacy-first, running RAG/agents on private data	Excellent fit

The Ryzen AI Max+ 395 does not beat a high-end Mac or a fast GPU at their own games. What it does is collapse the price of large-model capacity, putting 100GB-class local AI within reach of a sub-$2,000 budget for the first time. That is the real story, and it is a meaningful one.

FAQ

What is the AMD Ryzen AI Max+ 395?

It is AMD's flagship "Strix Halo" APU: a single chip with 16 Zen 5 CPU cores, a 40-compute-unit Radeon 8060S GPU, a 50 TOPS NPU, and up to 128GB of unified LPDDR5x memory. The unified memory lets the GPU access far more than a discrete card's VRAM, so it can load very large local LLMs.

Can Strix Halo really run a 235B model?

Yes, it can load a quantized 235-billion-parameter MoE model in its 128GB memory pool, which AMD demonstrated as part of its CES 2026 "up to 200 billion parameters" claim. But generation runs at roughly 11 tokens per second (Tech Times): usable, but slow. MoE models in the 30B-120B range are the practical sweet spot at 50-70+ tok/s.

How does Strix Halo compare to Apple Silicon for local AI?

It matches the M4 Max on memory capacity (128GB) and beats every Mac on price, but its 256 GB/s memory bandwidth is roughly half the M4 Max's 546 GB/s and a third of the M3 Ultra's 819 GB/s. Since token generation is bandwidth-bound, an equivalent-memory Mac generates tokens faster. Strix Halo's edge is cost, not speed.

How much does a 128GB Strix Halo mini PC cost?

The GMKtec EVO-X2 with 128GB launched at a pre-sale price of about $1,999, with street prices fluctuating between roughly $1,800 and $2,300 (Notebookcheck). AMD's own first-party dev kit is reported at $3,999. The cheaper $1,499 EVO-X2 is the 64GB version, which cannot hold a 235B model.

Do I need Linux to get the most out of it?

For the largest models, effectively yes. On Linux the GPU can address roughly 108GB of the 128GB pool via a shared GTT allocation. On Windows the GPU is limited to a fixed BIOS memory carve-out with no equivalent shared pool, so the very-large-model capability is mainly a Linux feature today.