2026-04-08

Run Gemma on iPhone: Google AI Edge Gallery Tested (2026)

TL;DR: Google AI Edge Gallery is a free Apple App Store app that runs Gemma 4 E2B (~2.5 GB) and E4B (~5 GB) fully on-device on iPhone. Real-world speed: ~30 tok/s on iPhone 16 Pro, ~12 tok/s on iPhone 14, 40+ tok/s on iPhone 17 Pro. Inference is 100% local — no internet needed after the model downloads.

Google quietly shipped its open-source AI Edge Gallery app on iPhone in early 2026, and the April 2026 Gemma 4 update made it the easiest way to run a frontier-class small model on Apple's mobile hardware. We installed it, downloaded both Gemma variants, and pushed them through real prompts on multiple iPhones. The short version: it's fast, it's actually offline, and it's the best on-device AI experience iOS currently has.

What Is Google AI Edge Gallery?

Google AI Edge Gallery is a free, open-source iOS app from Google's AI Edge team that runs on-device AI models — including Gemma 4, FunctionGemma 270M, and Qwen variants — entirely on your iPhone with no network calls during inference. It's published on the App Store under Apache 2.0 licensing, with the full source code on GitHub.

The app is a 35 MB shell. You download model weights separately from inside the app over Wi-Fi, then everything runs locally through Google's LiteRT-LM runtime (built on LiteRT, the rebranded TensorFlow Lite). No Hugging Face login is required — the app handles model fetching transparently.

| Feature | Detail |
| --- | --- |
| Developer | Google LLC (Google AI Edge team) |
| App size | 35.4 MB (shell only) |
| iOS requirement | 17.0+ |
| License | Apache 2.0 (open source) |
| Inference runtime | LiteRT-LM (int4 weights, int8 activations) |
| Cost | Free |
| Internet needed | Only for initial model download |

Which Gemma Models Run on iPhone?

The April 2, 2026 release added Gemma 4 E2B and E4B to the iPhone build, both as quantization-aware-trained int4 LiteRT bundles. Both are multimodal — they accept text, images, and audio input.

| Model | Effective params | Total params | Context | Download | RAM in use |
| --- | --- | --- | --- | --- | --- |
| Gemma 4 E2B | 2.3B | 5.1B | 128K tokens | ~2.5 GB | 1–1.5 GB |
| Gemma 4 E4B | 4.5B | 8B | 128K tokens | ~5 GB | 2–3 GB |

The E2B model is the right pick for any iPhone with 8 GB RAM or less. E4B works on 8 GB devices but eats most of the available memory budget — close other apps before loading it. For full E4B comfort, you want an iPhone 17 Pro with 12 GB RAM. (Hugging Face Welcome Gemma 4)
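As a sanity check on the download sizes above, int4 quantization stores roughly half a byte per weight, so you can estimate bundle size from total parameter count. This is a back-of-envelope sketch, not how Google packages the bundles; real bundles keep embeddings and some layers at higher precision, which is why E4B's ~5 GB exceeds the naive figure:

```python
def int4_download_estimate_gb(total_params_billion: float) -> float:
    """Naive int4 size: 4 bits (0.5 bytes) per parameter."""
    bytes_total = total_params_billion * 1e9 * 0.5
    return round(bytes_total / 1e9, 2)

print(int4_download_estimate_gb(5.1))  # E2B: 2.55 GB, close to the ~2.5 GB download
print(int4_download_estimate_gb(8.0))  # E4B: 4.0 GB naive; the real ~5 GB bundle
                                       # keeps some tensors at int8/fp16
```

The same arithmetic explains the RAM column: weights are memory-mapped, so resident memory stays below the full download size during typical chats.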

The app also ships FunctionGemma 270M, a tiny model fine-tuned for on-device function calling and tool use. It's small enough to run on any iPhone and powers the Mobile Actions and Agent Skills features.

Real iPhone Performance Numbers

These are the speeds we have hard data on. Where a number comes from a single user report or platform benchmark, we say so — don't treat third-party tok/s claims as gospel.

| iPhone | Chip | RAM | Gemma 4 E2B | Source |
| --- | --- | --- | --- | --- |
| iPhone 14 | A15 Bionic | 6 GB | ~12 tok/s | HN user mudkipdev |
| iPhone 16 Pro | A18 Pro | 8 GB | ~30 tok/s | HN user allpratik |
| iPhone 17 Pro | A19 Pro | 12 GB | 40+ tok/s | 36Kr / MachineHeart |
| Galaxy S25 Edge (Android ref) | SD 8 Elite | 12 GB | ~29 tok/s | HN user ysleepy |

Source thread: the Hacker News "Gemma 4 on iPhone" discussion is the richest collection of hands-on numbers we've found. The 36Kr "40+ tok/s on iPhone 17 Pro" figure was measured against MLX, not strictly the Gallery LiteRT build, so treat it as the upper bound of what the chip can do rather than the in-app number.
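To turn tok/s figures into felt latency, divide expected answer length by throughput. Simple arithmetic using the table's numbers (ignoring first-token latency, which the app keeps under a second on recent chips):

```python
def seconds_for_answer(num_tokens: int, tokens_per_second: float) -> float:
    """Wall-clock generation time for an answer, ignoring first-token latency."""
    return num_tokens / tokens_per_second

# A ~300-token answer (a few paragraphs):
print(round(seconds_for_answer(300, 12), 1))  # iPhone 14:     25.0 s
print(round(seconds_for_answer(300, 30), 1))  # iPhone 16 Pro: 10.0 s
print(round(seconds_for_answer(300, 40), 1))  # iPhone 17 Pro:  7.5 s
```

In practice this is why the iPhone 14 feels "usable but slow": short answers land quickly everywhere, and the gap only becomes painful on long generations.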

Thermal note: the same iPhone 16 Pro user reported "the phone got considerably hot while inferencing." Sustained generation pushes the A18 Pro hard. Short Q&A is fine; running long reasoning chains drains battery and triggers throttling after a few minutes. The iPhone 14 stayed cool at 12 tok/s — slower silicon, less thermal pressure.

How to Install and Run Your First Prompt

Setup takes about five minutes plus a download. Make sure you have Wi-Fi and 3–6 GB of free storage.

1. Open the App Store and search for "Google AI Edge Gallery." Install the free app from Google LLC. (Direct App Store link)

2. Launch the app and accept the on-device model permissions.

3. Tap the Models tab. You'll see Gemma 4 E2B and E4B listed at the top.

4. Pick E2B if this is your first time — it downloads in 2–4 minutes on home Wi-Fi.

5. Wait for the download bar to finish. The model lives in the app's sandbox, so deleting the app removes the weights too.

6. Switch to AI Chat, pick the downloaded model, and send your first prompt. First-token latency is under 1 second on A18 Pro.

7. Verify it's offline by enabling Airplane Mode and asking another question. Inference should keep working with zero network access.

That's the entire setup. No accounts. No API keys. No "free trial" gating. The full app source is on GitHub if you want to audit what it does before installing.

What Can You Actually Do With It?

Beyond chat, the Gallery app exposes several Gemma-powered features that genuinely matter on a phone:

  • AI Chat with Thinking Mode — multi-turn conversations with an optional toggle that shows the model's chain-of-thought reasoning step by step.
  • Ask Image — point your camera at anything (a sign in a foreign language, a math problem, a recipe label) and ask Gemma to describe, translate, or solve it.
  • Audio Scribe — on-device speech-to-text plus speech translation. Currently capped at 30-second clips, with streaming on the roadmap.
  • Prompt Lab — a sandbox for tweaking temperature, top-k, and other sampling parameters without writing code.
  • Agent Skills — modular tools the model can call: a Wikipedia fact-grounding skill and an interactive maps skill ship in the box.
  • Benchmark screen — measure tok/s on your specific iPhone so you can compare against the numbers above.
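The Prompt Lab knobs map onto standard sampling math. Here is a minimal sketch of temperature plus top-k sampling so the parameters are concrete; this is illustrative, not the Gallery app's actual implementation:

```python
import math
import random

def sample_next_token(logits, temperature=0.7, top_k=40, rng=random):
    # Keep only the top_k highest-logit tokens (the "top-k" knob).
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Temperature rescales logits before softmax: <1 sharpens, >1 flattens.
    scaled = [logits[i] / temperature for i in top]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    # Draw one token id from the filtered distribution.
    r, acc = rng.random(), 0.0
    for tok, p in zip(top, probs):
        acc += p
        if acc >= r:
            return tok
    return top[-1]
```

With `top_k=1` this degenerates to greedy decoding (always the highest-logit token), which is a quick way to see what each slider does in the Prompt Lab.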

The Ask Image feature is the killer app for travelers. We tested it on a French restaurant menu in Airplane Mode — Gemma 4 E2B produced an accurate English translation in under three seconds with zero data ever leaving the phone.

The Privacy Story (And the Honest Caveat)

The App Store listing says "100% On-Device Privacy: All model inferences happen directly on your device hardware. No internet required." That part is true. We confirmed inference still works in Airplane Mode, and the GitHub source shows no inference-time network calls.

But the app itself reports some standard analytics to Google. The App Store privacy label discloses:
  • Linked to your identity: Device ID, performance data, diagnostic info
  • Not linked to your identity: coarse location, product interaction, crash reports

This is standard for any first-party Google iOS app, and it's all about app health and usage patterns, not your prompts or outputs. Your conversations with Gemma never leave the phone. But if you need a strict zero-telemetry posture (regulated workflows, security research, journalism in hostile environments), assume Google sees that you opened the app and how often.

For a strict no-telemetry alternative on iPhone, look at Ollama running on a nearby Mac and connecting from your phone over Tailscale — see our best LLM apps for iPhone guide for that approach. For everything else, Edge Gallery is fine.

How Does iPhone Gemma Compare to Mac Ollama?

Honest comparison: a MacBook Pro M3 Pro running Gemma 3 12B via Ollama is dramatically faster and more capable than any iPhone running Gemma 4 E2B. Quality scales with parameter count, and an 8 GB iPhone simply cannot fit a 12B model.

| Setup | Model | Speed | Best For |
| --- | --- | --- | --- |
| iPhone 16 Pro + Edge Gallery | Gemma 4 E2B | ~30 tok/s | Travel, offline, privacy |
| iPhone 17 Pro + Edge Gallery | Gemma 4 E4B | ~30 tok/s | Best on-iPhone quality |
| MacBook Pro M3 Pro + Ollama | Gemma 3 12B | 50–80 tok/s | Daily desktop use |
| Mac Studio M4 Ultra + Ollama | Gemma 3 27B | 40–60 tok/s | Best local quality, period |

The HN discussion makes the same point — user karimf noted an M3 Pro is "significantly faster" than an iPhone 16 Pro at the same task. That tracks with our testing.

So the iPhone build isn't a Mac replacement. It's a different tool: the AI you have on you, in your pocket, with no Wi-Fi, no logins, and no usage caps. For trip planning in a foreign country, drafting a quick email at 30,000 feet, or processing a photo without uploading it anywhere, a Mac you left at home is useless and a phone with Gemma 4 is exactly the right shape.

Per-iPhone Recommendations

We built dedicated pages with per-device specs, expected speeds, and model recommendations for each iPhone generation.

Frequently Asked Questions

Is Google AI Edge Gallery free?

Yes. The app is free on the iOS App Store with no in-app purchases, no subscription, and no ads. It's open source under Apache 2.0, and the Gemma 4 weights are also free. You only pay in storage (2.5–5 GB per model) and battery (sustained inference is hot work).

Does Gemma 4 actually run offline on iPhone?

Yes, fully. We confirmed it by enabling Airplane Mode after the initial download. All inference runs on the iPhone's CPU and GPU through Google's LiteRT-LM runtime. The only network call the app makes is the one-time model download — and once that's done, you can stay offline indefinitely.

Which iPhone do I need to run Gemma 4?

Any iPhone running iOS 17 or later technically works. For a good experience with Gemma 4 E2B, an iPhone 14 or newer is fine. For the larger E4B model, plan on iPhone 15 Pro or newer with 8 GB+ RAM. The iPhone 17 Pro with 12 GB RAM is the only model where E4B feels truly comfortable.
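The guidance above condenses to a simple RAM-based heuristic. A sketch of this article's recommendations, not an official compatibility check:

```python
def pick_gemma_model(ram_gb: int) -> str:
    # Mirrors the guidance above: E4B wants 8 GB+ and is only
    # truly comfortable at 12 GB; E2B runs well even on 6 GB devices.
    if ram_gb >= 12:
        return "Gemma 4 E4B"  # iPhone 17 Pro class: full headroom
    if ram_gb >= 8:
        return "Gemma 4 E2B"  # E4B fits but is tight; close other apps first
    return "Gemma 4 E2B"      # 6 GB iPhones: stick with E2B

print(pick_gemma_model(6))   # iPhone 14     -> Gemma 4 E2B
print(pick_gemma_model(12))  # iPhone 17 Pro -> Gemma 4 E4B
```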

How does this compare to Apple Intelligence?

Apple Intelligence uses Apple's own ~3B foundation model running on the Neural Engine, tightly integrated into iOS apps. Google AI Edge Gallery is a standalone third-party app running Google's Gemma models through LiteRT-LM. Apple Intelligence is more polished and OS-integrated; Edge Gallery is more flexible, runs larger models, and gives you direct prompt control.

Can I run my own custom models in Edge Gallery?

Yes — the app supports loading any LiteRT .task bundle, and the GitHub README documents how to convert Hugging Face models to that format. You can import community fine-tunes of Gemma, Qwen, or Phi as long as they've been converted to LiteRT. Most users stick with the official Google-supplied models since they're already optimized for mobile inference.

Does running Gemma damage the battery or chip?

No. Sustained inference does heat the phone and drain battery faster than light browsing — expect roughly 15–20% battery per hour of continuous chat — but iOS will throttle the chip before any thermal damage occurs. We've run multi-hour test sessions on an iPhone 16 Pro with no permanent effect. Just don't leave it generating in your pocket.

---

Sources: Google AI Edge Gallery on App Store · google-ai-edge/gallery on GitHub · Google Developers Blog: Gemma 4 on the edge · Hugging Face: Welcome Gemma 4 · Hacker News: Gemma 4 on iPhone thread · MindStudio: Run Gemma 4 locally on phone

Related on ModelFit: Best LLM apps for iPhone in 2026 · Gemma model family · Ollama 0.17 Apple Silicon benchmarks · Run AI offline guide

Have questions? Reach out on X/Twitter