TL;DR: The MacBook Air M2 with 16GB RAM comfortably runs models up to ~14B parameters at Q4. Qwen3.5 9B is the best all-rounder — ~7GB loaded, native multimodal, near-frontier quality. Qwen3.5 4B is the speed pick, Gemma 4 E4B covers multimodal chat, and Qwen3 8B is the proven runner-up. Expect roughly 17% slower token generation than an M4 Air — the M2's 100 GB/s memory bus is the limit.
The MacBook Air M2 is still a capable local-AI machine in 2026. Apple's unified memory architecture means there's no separate GPU VRAM pool to fight — every gigabyte of the 16GB works for inference. The M2's 100 GB/s memory bandwidth is the same as the M3 and about 17% behind the M4's 120 GB/s, so it generates tokens a little slower, but the model lineup it can run is identical.
This guide covers exactly which models work on an M2 Air 16GB, which to skip, and how fast each runs. For the full spec rundown and other chips, see the MacBook Air device page.
How Much RAM Do You Actually Have for Models?
macOS reserves memory aggressively, so the 16GB on the box is not all yours. On a 16GB MacBook Air M2, the real budget looks like this:
| Allocation | Typical Size |
|---|---|
| macOS kernel + services | ~2–3 GB |
| Active apps (browser, editor) | ~2–4 GB |
| Available for LLM | ~9–12 GB |
With nothing else open you can push a model toward ~12GB. With a browser and a few tabs, plan for ~9GB. The rule of thumb: Q4_K_M quantization costs roughly 0.55 GB per billion parameters. A 4B model needs ~3.5GB, a 9B needs ~7GB, and a 14B needs ~9.5GB — doable but tight.
The M2's 100 GB/s LPDDR5 bandwidth is the single most important number for inference speed, because LLM token generation is memory-bandwidth-bound. It matches the M3 and trails the M4's 120 GB/s by about 17%. The model you can load is the same; the speed is a touch lower.
Performance Expectations
On the M2 Air 16GB, realistic token generation with Ollama at Q4_K_M looks like this:
| Model | RAM Used | Tokens/sec (est.) | Best For |
|---|---|---|---|
| Qwen3.5 9B Q4_K_M | ~7.0 GB | 18–23 tok/s | Quality all-rounder |
| Qwen3.5 4B Q4_K_M | ~3.5 GB | 35–42 tok/s | Speed, coding |
| Gemma 4 E4B Q4_K_M | ~4.0 GB | 30–38 tok/s | Multimodal chat |
| Qwen3 8B Q4_K_M | ~5.5 GB | 25–33 tok/s | Proven runner-up |
| Gemma 3 12B QAT | ~8.0 GB | 18–23 tok/s | Quality writing |
Models above ~14B parameters are a gamble at 16GB — they technically load but swap into CPU-backed memory under load, dropping speed below 5 tok/s. Skip them until you have 24GB.
The Top Picks
1. Qwen3.5 9B — Best All-Rounder
Qwen3.5 9B is the model the 16GB Air was waiting for. At ~7GB loaded it fits with your browser open, and its output quality competes with 30B-class models from the previous generation. It handles text and images natively — no separate vision model — and its 262K context window swallows long documents whole.
ollama run qwen3.5:9b
Why it wins: Alibaba packed near-frontier quality under 10GB. Writing, analysis, coding, image questions — one model covers them all at 18–23 tok/s on the M2. That is comfortably readable for interactive chat, and the quality lift over 8B-class models is obvious on nuanced instructions.
2. Qwen3.5 4B — Best Speed
When responsiveness matters more than depth, Qwen3.5 4B delivers the best quality-per-gigabyte in its class. At ~3.5GB loaded and 35–42 tok/s, it answers faster than you can read, and it punches above its size on coding and agent tasks. Pair it with an editor through our coding on MacBook Air guide for a fast local autocomplete setup.
ollama run qwen3.5:4b
It shares the 9B's multimodal input and 262K context. For quick chat, summaries, and inline code suggestions, this is the model you keep loaded all day on the M2.
3. Gemma 4 E4B — Best Efficient Multimodal
Google's Gemma 4 E4B uses Per-Layer Embeddings to act like a larger model while loading only ~4GB. It handles text and image input, runs at 30–38 tok/s on the M2, and is the same model that powers on-device AI on high-end phones — so it is tuned hard for efficiency.
ollama run gemma4:e4b
Use case: screenshot questions, chart reading, everyday chat with images in the mix. On a fanless M2, its light memory and compute footprint also means less thermal pressure on long sessions.
4. Qwen3 8B — Proven Runner-Up
The previous-generation favorite still earns a slot. Qwen3 8B is battle-tested, widely documented, and runs at 25–33 tok/s in ~5.5GB. Its hybrid thinking mode handles reasoning chains on-device with no extra memory cost.
ollama run qwen3:8b
If you already use it and it covers your needs, there is no urgency to switch — but new installs should start with Qwen3.5 9B, which is sharper at a similar speed class.
5. Gemma 3 12B QAT — Quality Writing Fallback
Google's QAT (Quantization-Aware Training) variant of Gemma 3 12B survives aggressive quantization with minimal quality loss. At ~8GB loaded it remains a solid choice for creative writing and nuanced instructions, though Qwen3.5 9B now matches or beats it in less memory.
ollama run gemma3:12b
Pick it if you prefer Gemma's prose style — but on the M2 it is near the top of your memory budget, so close other apps before a long writing session.
What to Avoid
70B models — They load but swap to CPU-backed inference. Expect 2–4 tok/s. Not usable on 16GB. 32B models at Q4 — Technically ~20GB, which exceeds the 16GB ceiling. Extreme quantization (Q2_K, IQ2_XS) can squeeze them in, but quality degrades badly. Not worth it. Q8_0 of 12B+ models — At Q8 a 12B model occupies ~13GB — your entire addressable budget with no room for macOS. It will swap and crawl. Use Q4_K_M or QAT instead.Cooling Reality Check
The MacBook Air M2 is fanless. That is fine for 95% of use cases. But under continuous LLM load — generating a long document or a deep reasoning chain — the chip will throttle after 15–25 minutes of sustained inference, a little sooner than the M4 Air because the M2 runs warmer per watt.
Practical impact: for interactive chat (short exchanges) you will never notice. For batch processing or long-context generation, expect a 15–25% slowdown once throttling kicks in. If you run AI for hours at a stretch, a MacBook Pro or Mac Mini with active cooling holds throughput steady. For the next-gen leap in sustained speed and model size, see our Apple M5 Pro & M5 Max local LLM guide.Quick Comparison Table
| Use Case | Recommended Model | Command |
|---|---|---|
| General assistant | Qwen3.5 9B | ollama run qwen3.5:9b |
| Maximum speed | Qwen3.5 4B | ollama run qwen3.5:4b |
| Multimodal chat | Gemma 4 E4B | ollama run gemma4:e4b |
| Proven fallback | Qwen3 8B | ollama run qwen3:8b |
| Quality writing | Gemma 3 12B QAT | ollama run gemma3:12b |
New to Ollama? Our Ollama setup guide walks through installation in under five minutes, and the best LLM for MacBook overview compares picks across every configuration.
FAQ
Can a MacBook Air M2 with 16GB really run LLMs?
Yes. The M2's unified memory means all 16GB is available to the GPU for inference — unlike PC setups limited to discrete VRAM (often 8–12GB on mid-range cards). Models up to 13–14B parameters run smoothly at Q4 quantization, just slightly slower than on an M4 Air.
How much slower is the M2 than the M4 Air for local AI?
About 17% slower on token generation, tracking the memory bandwidth gap (100 GB/s on the M2 vs 120 GB/s on the M4). The M2 matches the M3. The model lineup you can run is identical — only the speed differs.
What's the maximum model size for 16GB on the M2?
A 13B model at Q4_K_M (~9GB) is the practical sweet spot. A 14B fits but is tight. Anything above 14B risks swapping to CPU memory, which drops speed below 5 tok/s — unusable for interactive chat.
Is 16GB enough, or should I get 24GB?
16GB handles the large majority of local-AI use cases on the M2. If you are buying new and plan to run models for extended sessions, 24GB (on a current M-series Air) unlocks 14–20B models cleanly and is the better long-term investment.
Does the MacBook Air M2 throttle during AI tasks?
Yes, but only during sustained inference (15+ minutes continuous), and a bit sooner than the M4. For normal chat — even 1–2 hour bursty conversations — throttling rarely triggers. Batch generation or long-document processing may see a 15–25% slowdown.
Related Model Families:- Qwen Models — all Qwen variants, RAM requirements, and benchmarks
- Gemma Models — Google's efficient models from E2B to 31B
- Phi Models — Microsoft's small-but-mighty models for low-RAM devices
Where to Buy for Local AI
best configsModelFit may earn a commission on purchases made through these links, at no extra cost to you. Recommendations are based on local-AI performance, not commissions.
The weekly local-AI refresh
New open-weight models, real Apple Silicon benchmarks, and the one model worth running on your Mac this week. Free, one email a week, unsubscribe anytime.
Have questions? Reach out on X/Twitter