2026-03-06
LLMfit: One Command to Find What Runs on Your Mac (2026)
brew install llmfit and run llmfit to get an interactive model browser.
The Problem LLMfit Solves
Every local LLM user hits the same wall: you find a model on HuggingFace, download 8GB of weights, launch it, and it either crawls at 2 tokens/sec or crashes with an out-of-memory error. Then you try a smaller quant, then a different model, then give up and go back to ChatGPT.
LLMfit kills that loop. One command scans your system and tells you exactly which models will run well — before you download a single byte. The tool launched on GitHub in February 2026 and already has 11,700+ stars and 9,200+ monthly downloads on crates.io.
How It Works
LLMfit operates in three steps:
1. Hardware Detection. It reads your total and available RAM, counts CPU cores, identifies your GPU (NVIDIA, AMD, Intel Arc, or Apple Silicon), and determines your acceleration backend (CUDA, Metal, ROCm, or CPU-only).
2. Dynamic Quantization. For each of the 200+ models in its database, LLMfit walks down a quantization ladder — Q8_0, Q6_K, Q5_K_M, Q4_K_M, all the way to Q2_K — and selects the highest quality that fits your available memory. No guessing.
3. Multi-Dimensional Scoring. Each model gets scored 0-100 on four axes:
| Dimension | What It Measures |
|---|---|
| Quality | Parameter count, model reputation, quantization penalty |
| Speed | Estimated tokens/sec based on your specific hardware |
| Fit | Memory utilization efficiency (sweet spot: 50-80% usage) |
| Context | Window size capability for your config |
The final composite score weights these dimensions by use case. Coding tasks favor speed. Reasoning tasks favor quality. General use balances all four.
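The ladder walk in step 2 can be sketched in a few lines. The bits-per-weight figures below are rough GGUF approximations (assumptions, not LLMfit's actual table), and a real estimate would also budget for KV-cache and runtime overhead:

```python
# Illustrative sketch of the quantization-ladder walk described above.
# Bits-per-weight values are rough GGUF approximations (assumptions).
LADDER = [("Q8_0", 8.5), ("Q6_K", 6.6), ("Q5_K_M", 5.7),
          ("Q4_K_M", 4.8), ("Q3_K_M", 3.9), ("Q2_K", 3.4)]

def pick_quant(params_billions, budget_gb):
    """Return the highest-quality quant whose weights fit the budget."""
    for name, bits_per_weight in LADDER:
        size_gb = params_billions * bits_per_weight / 8
        if size_gb <= budget_gb:
            return name
    return None  # nothing fits, even Q2_K

print(pick_quant(8, 12.0))  # an 8B model in a 12 GB budget → "Q8_0"
print(pick_quant(8, 6.0))   # tighter budget → "Q5_K_M"
```

The same 8B model lands on Q8_0 with 12 GB to spare but drops to Q5_K_M at 6 GB — exactly the "highest quality that fits" behavior.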
Why Apple Silicon Users Get the Best Experience
LLMfit correctly treats unified memory as available VRAM. A MacBook Pro with 36GB of RAM sees models scored against the full 36GB — not the 8GB that a naive GPU-only check would report.
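The rule is simple enough to state as code. This is assumed logic for illustration, not LLMfit's internals:

```python
# Minimal sketch of the unified-memory rule (assumed logic): on Apple
# Silicon, score models against total system RAM rather than the
# discrete-VRAM figure a GPU-only probe would report.
def usable_vram_gb(total_ram_gb, discrete_vram_gb, apple_silicon):
    return total_ram_gb if apple_silicon else discrete_vram_gb

print(usable_vram_gb(36, 8, apple_silicon=True))   # 36
print(usable_vram_gb(36, 8, apple_silicon=False))  # 8
```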
The speed estimation uses actual memory bandwidth numbers for each chip:
| Hardware | Bandwidth | Efficiency | Estimated 7B Q4 Speed |
|---|---|---|---|
| M4 Max (128GB) | 546 GB/s | 0.55 | ~83 tok/s |
| M4 Pro (48GB) | 273 GB/s | 0.55 | ~41 tok/s |
| M3 (24GB) | 100 GB/s | 0.55 | ~15 tok/s |
| M1 (16GB) | 68 GB/s | 0.55 | ~10 tok/s |
The formula is straightforward: (bandwidth_GB_s / model_size_GB) × 0.55. The 0.55 efficiency factor accounts for kernel overhead and KV-cache reads — validated against published llama.cpp benchmarks.
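Applying the formula to the table's numbers is a one-liner. The ~3.6 GB weight size for a 7B Q4 model is an assumption chosen to be consistent with the table; exact GGUF file sizes vary by quant variant:

```python
# The bandwidth formula above, applied to the table's hardware numbers.
# model_size_gb ≈ 3.6 for a 7B Q4 model is an assumption (actual GGUF
# sizes depend on the specific quant variant).
def est_tokens_per_sec(bandwidth_gb_s, model_size_gb, efficiency=0.55):
    return bandwidth_gb_s / model_size_gb * efficiency

print(round(est_tokens_per_sec(546, 3.6)))  # M4 Max → 83
print(round(est_tokens_per_sec(68, 3.6)))   # M1 → 10
```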
For Mixture-of-Experts models like DeepSeek-V3, LLMfit only counts active parameters toward VRAM. Mixtral 8x7B drops from 23.9GB to roughly 6.6GB through expert offloading.
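A back-of-envelope check of those Mixtral figures, assuming the published parameter counts (~46.7B total, ~12.9B active per token):

```python
# Rough check of the Mixtral 8x7B numbers above. Parameter counts are
# the published Mixtral figures (an assumption about what LLMfit uses);
# bytes/param is backed out of the 23.9 GB total-weights figure.
total_params = 46.7e9
active_params = 12.9e9
bytes_per_param = 23.9e9 / total_params
active_gb = active_params * bytes_per_param / 1e9
print(round(active_gb, 1))  # ≈ 6.6
```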
Installation and Usage
Install (macOS)
brew install llmfit
Or with the quick install script:
curl -fsSL https://llmfit.axjns.dev/install.sh | sh
Interactive Mode (Default)
Just run llmfit — you get a full terminal UI built on ratatui with keyboard navigation, search, filters, and 6 color themes (Dracula, Nord, Solarized, Monokai, Gruvbox, and default).
CLI Mode
llmfit --cli fit
Key subcommands:
| Command | What It Does |
|---|---|
| llmfit system | Show detected hardware specs |
| llmfit fit | Ranked models filtered by fit level |
| llmfit recommend --json | Top picks in JSON (for scripting) |
| llmfit search "qwen" | Filter by model name or provider |
| llmfit info "llama-3.1-8b" | Detailed model breakdown |
| llmfit plan --model "llama-3.1-70b" | Estimate hardware needed |
| llmfit serve | Launch REST API for cluster scheduling |
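The JSON output is what makes the tool scriptable. A hypothetical sketch of consuming it — the payload shape below is an assumption for illustration, not the documented schema; in practice you would capture the command's stdout, e.g. with subprocess.run(["llmfit", "recommend", "--json"], capture_output=True, text=True):

```python
import json

# Hypothetical sketch of scripting against `llmfit recommend --json`.
# The field names ("name", "score") are assumptions, not the documented
# schema; replace the hardcoded payload with the command's real stdout.
payload = ('[{"name": "llama-3.1-8b", "score": 87},'
           ' {"name": "qwen2.5-7b", "score": 84}]')
top = max(json.loads(payload), key=lambda m: m["score"])
print(top["name"])  # llama-3.1-8b
```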
Ollama Integration
LLMfit detects your installed Ollama models and lets you download new ones directly from the TUI. If you run a remote Ollama instance:
OLLAMA_HOST="http://192.168.1.100:11434" llmfit
It also integrates with llama.cpp (direct GGUF downloads) and MLX (Apple Silicon optimized inference).
What the Community Says
LLMfit hit the Hacker News front page with 297 points and 70 comments. The reception was split:
The praise: Users called it "exactly what I needed" for the first-time model selection problem. The Ollama integration and JSON output for automation were highlighted as standout features.

The criticism: The most common complaint was that this should be a website, not a CLI. As one commenter put it: "That's like 4 or 5 fields to fill in on a form. Way less intrusive than installing this thing." Others noted the model database can lag behind new releases.

The accuracy debate: One user reported LLMfit said they couldn't run Qwen 3.5 — while it was already running on their machine. The estimates are calculated, not benchmarked. They're directionally correct but not gospel.

LLMfit vs. ModelFit: Different Approaches
LLMfit is a CLI that auto-detects your hardware and scores models locally. ModelFit is a web app where you select your device and get instant recommendations — no installation required.
| Feature | LLMfit | ModelFit |
|---|---|---|
| Platform | CLI (Rust) | Web app |
| Hardware detection | Automatic | Manual selection |
| Model database | 200+ models | Curated per device |
| Ollama integration | Yes (download from TUI) | Setup guides |
| Use case filtering | Yes (coding, reasoning, etc.) | Yes (by category) |
| Works on mobile | No | Yes |
| No install needed | No | Yes |
They complement each other. Use LLMfit if you want precise, auto-detected recommendations. Use ModelFit if you want quick answers from your phone or before buying a Mac.
Should You Use It?
Yes, if you're new to local LLMs and don't know where to start. LLMfit answers the first question everyone asks — "what can my machine actually run?" Run it once, sort by fit score, and you have a list of models to try.

Yes, if you manage multiple machines or build agent systems that need to select models at runtime. The llmfit serve REST API exposes /api/v1/models/top for exactly this.
Maybe not, if you already know your hardware limits. If you've been running Llama 3.1 70B Q4 on your 96GB Mac Studio for months, LLMfit won't tell you much you don't know.
Keep in mind: The quality scores come from model reputation, not empirical benchmarks. A model scored 85 might outperform one scored 90 on your specific use case. Use the scores as a starting point, not a final answer.
FAQ
Does LLMfit work on Linux and Windows?
Yes. It supports CUDA (NVIDIA), ROCm (AMD), SYCL (Intel Arc), and CPU-only backends. Apple Silicon gets Metal acceleration. Install via brew, scoop (Windows), cargo install llmfit, or the install script.
How accurate are the speed estimates?
They use a memory-bandwidth model validated against llama.cpp benchmarks. Expect roughly 80% accuracy for GPU models. CPU-only and hybrid CPU+GPU modes are less predictable due to cache effects and thermal throttling.
Does it detect my exact GPU model?
For ~80 popular GPU models, yes — with specific bandwidth numbers. Unknown GPUs get a per-backend fallback (CUDA: 220 GB/s, Metal: 160 GB/s, ROCm: 180 GB/s, CPU x86: 70 GB/s).
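That lookup-with-fallback behavior can be sketched directly; the fallback numbers come from the FAQ answer above, while the known-GPU entry is an illustrative example, not LLMfit's actual data:

```python
# Sketch of the per-backend bandwidth fallback described above.
# Fallback figures are from the FAQ; the KNOWN_GPUS entry is an
# illustrative example (RTX 4090 peak bandwidth ≈ 1008 GB/s).
FALLBACK_GB_S = {"cuda": 220, "metal": 160, "rocm": 180, "cpu_x86": 70}
KNOWN_GPUS = {"RTX 4090": 1008}

def bandwidth_for(gpu_name, backend):
    # Recognized GPUs get a specific figure; unknowns fall back per backend.
    return KNOWN_GPUS.get(gpu_name, FALLBACK_GB_S[backend])

print(bandwidth_for("RTX 4090", "cuda"))     # 1008
print(bandwidth_for("SomeNewGPU", "metal"))  # 160
```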
How often is the model database updated?
The database is compiled into the binary, so new models require a new release. The project has shipped 36 releases since February 2026 (roughly one every few days), so updates come fast.
Can I use LLMfit with a remote Ollama server?
Yes. Set OLLAMA_HOST="http://ip:port" before running LLMfit. It will detect models on the remote instance and show them alongside local compatibility scores.
Related
- How to Install Ollama on Mac — Step-by-step setup guide
- MacBook Air vs Pro for Local LLMs — Which MacBook to buy for AI
- Best LLMs for Mac M1/M2/M3/M4 — Our recommendations by chip and RAM
Have questions? Reach out on X/Twitter