2026-03-06
LLMfit: One Command to Find What Runs on Your Mac (2026)
brew install llmfit and run llmfit to get an interactive model browser.
The Problem LLMfit Solves
Every local LLM user hits the same wall: you find a model on HuggingFace, download 8GB of weights, launch it, and it either crawls at 2 tokens/sec or crashes with an out-of-memory error. Then you try a smaller quant, then a different model, then give up and go back to ChatGPT.
LLMfit kills that loop. One command scans your system and tells you exactly which models will run well — before you download a single byte. The tool launched on GitHub in February 2026 and already has 11,700+ stars and 9,200+ monthly downloads on crates.io.
How It Works
LLMfit operates in three steps:
1. Hardware Detection. It reads your total and available RAM, counts CPU cores, identifies your GPU (NVIDIA, AMD, Intel Arc, or Apple Silicon), and determines your acceleration backend (CUDA, Metal, ROCm, or CPU-only).
2. Dynamic Quantization. For each of the 200+ models in its database, LLMfit walks down a quantization ladder — Q8_0, Q6_K, Q5_K_M, Q4_K_M, all the way to Q2_K — and selects the highest quality that fits your available memory. No guessing.
3. Multi-Dimensional Scoring. Each model gets scored 0-100 on four axes:
| Dimension | What It Measures |
|---|---|
| Quality | Parameter count, model reputation, quantization penalty |
| Speed | Estimated tokens/sec based on your specific hardware |
| Fit | Memory utilization efficiency (sweet spot: 50-80% usage) |
| Context | Window size capability for your config |
The final composite score weights these dimensions by use case. Coding tasks favor speed. Reasoning tasks favor quality. General use balances all four.
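The ladder walk in step 2 can be sketched in a few lines. The bits-per-weight figures below are rough GGUF approximations (assumptions, not LLMfit's actual table), and a real estimate would also budget for KV-cache and runtime overhead:

```python
# Illustrative sketch of the quantization-ladder walk described above.
# Bits-per-weight values are rough GGUF approximations (assumptions).
LADDER = [("Q8_0", 8.5), ("Q6_K", 6.6), ("Q5_K_M", 5.7),
          ("Q4_K_M", 4.8), ("Q3_K_M", 3.9), ("Q2_K", 3.4)]

def pick_quant(params_billions, budget_gb):
    """Return the highest-quality quant whose weights fit the budget."""
    for name, bits_per_weight in LADDER:
        size_gb = params_billions * bits_per_weight / 8
        if size_gb <= budget_gb:
            return name
    return None  # nothing fits, even Q2_K

print(pick_quant(8, 12.0))  # an 8B model in a 12 GB budget → "Q8_0"
print(pick_quant(8, 6.0))   # tighter budget → "Q5_K_M"
```

The same 8B model lands on Q8_0 with 12 GB to spare but drops to Q5_K_M at 6 GB — exactly the "highest quality that fits" behavior.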
Why Apple Silicon Users Get the Best Experience
LLMfit correctly treats unified memory as available VRAM. A MacBook Pro with 36GB of RAM sees models scored against the full 36GB — not the 8GB that a naive GPU-only check would report.
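The rule is simple enough to state as code. This is assumed logic for illustration, not LLMfit's internals:

```python
# Minimal sketch of the unified-memory rule (assumed logic): on Apple
# Silicon, score models against total system RAM rather than the
# discrete-VRAM figure a GPU-only probe would report.
def usable_vram_gb(total_ram_gb, discrete_vram_gb, apple_silicon):
    return total_ram_gb if apple_silicon else discrete_vram_gb

print(usable_vram_gb(36, 8, apple_silicon=True))   # 36
print(usable_vram_gb(36, 8, apple_silicon=False))  # 8
```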
The speed estimation uses actual memory bandwidth numbers for each chip:
| Hardware | Bandwidth | Efficiency | Estimated 7B Q4 Speed |
|---|---|---|---|
| M4 Max (128GB) | 546 GB/s | 0.55 | ~83 tok/s |
| M4 Pro (48GB) | 273 GB/s | 0.55 | ~41 tok/s |
| M3 (24GB) | 100 GB/s | 0.55 | ~15 tok/s |
| M1 (16GB) | 68 GB/s | 0.55 | ~10 tok/s |
The formula is straightforward: (bandwidth_GB_s / model_size_GB) × 0.55. The 0.55 efficiency factor accounts for kernel overhead and KV-cache reads — validated against published llama.cpp benchmarks.
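Applying the formula to the table's numbers is a one-liner. The ~3.6 GB weight size for a 7B Q4 model is an assumption chosen to be consistent with the table; exact GGUF file sizes vary by quant variant:

```python
# The bandwidth formula above, applied to the table's hardware numbers.
# model_size_gb ≈ 3.6 for a 7B Q4 model is an assumption (actual GGUF
# sizes depend on the specific quant variant).
def est_tokens_per_sec(bandwidth_gb_s, model_size_gb, efficiency=0.55):
    return bandwidth_gb_s / model_size_gb * efficiency

print(round(est_tokens_per_sec(546, 3.6)))  # M4 Max → 83
print(round(est_tokens_per_sec(68, 3.6)))   # M1 → 10
```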
For Mixture-of-Experts models like DeepSeek-V3, LLMfit only counts active parameters toward VRAM. Mixtral 8x7B drops from 23.9GB to roughly 6.6GB through expert offloading.
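A back-of-envelope check of those Mixtral figures, assuming the published parameter counts (~46.7B total, ~12.9B active per token):

```python
# Rough check of the Mixtral 8x7B numbers above. Parameter counts are
# the published Mixtral figures (an assumption about what LLMfit uses);
# bytes/param is backed out of the 23.9 GB total-weights figure.
total_params = 46.7e9
active_params = 12.9e9
bytes_per_param = 23.9e9 / total_params
active_gb = active_params * bytes_per_param / 1e9
print(round(active_gb, 1))  # ≈ 6.6
```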
Installation and Usage
Install (macOS)
brew install llmfit
Or with the quick install script:
curl -fsSL https://llmfit.axjns.dev/install.sh | sh
Interactive Mode (Default)
Just run llmfit — you get a full terminal UI built on ratatui with keyboard navigation, search, filters, and 6 color themes (Dracula, Nord, Solarized, Monokai, Gruvbox, and default).
CLI Mode
llmfit --cli fit
Key subcommands:
| Command | What It Does |
|---|---|
| llmfit system | Show detected hardware specs |
| llmfit fit | Ranked models filtered by fit level |
| llmfit recommend --json | Top picks in JSON (for scripting) |
| llmfit search "qwen" | Filter by model name or provider |
| llmfit info "llama-3.1-8b" | Detailed model breakdown |
| llmfit plan --model "llama-3.1-70b" | Estimate hardware needed |
| llmfit serve | Launch REST API for cluster scheduling |
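The JSON output is what makes the tool scriptable. A hypothetical sketch of consuming it — the payload shape below is an assumption for illustration, not the documented schema; in practice you would capture the command's stdout, e.g. with subprocess.run(["llmfit", "recommend", "--json"], capture_output=True, text=True):

```python
import json

# Hypothetical sketch of scripting against `llmfit recommend --json`.
# The field names ("name", "score") are assumptions, not the documented
# schema; replace the hardcoded payload with the command's real stdout.
payload = ('[{"name": "llama-3.1-8b", "score": 87},'
           ' {"name": "qwen2.5-7b", "score": 84}]')
top = max(json.loads(payload), key=lambda m: m["score"])
print(top["name"])  # llama-3.1-8b
```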
Ollama Integration
LLMfit detects your installed Ollama models and lets you download new ones directly from the TUI. If you run a remote Ollama instance:
OLLAMA_HOST="http://192.168.1.100:11434" llmfit
It also integrates with llama.cpp (direct GGUF downloads) and MLX (Apple Silicon optimized inference).
What the Community Says
LLMfit hit the Hacker News front page with 297 points and 70 comments. The reception was split:
The praise: Users called it "exactly what I needed" for the first-time model selection problem. The Ollama integration and JSON output for automation were highlighted as standout features.

The criticism: The most common complaint was that this should be a website, not a CLI. As one commenter put it: "That's like 4 or 5 fields to fill in on a form. Way less intrusive than installing this thing." Others noted the model database can lag behind new releases.

The accuracy debate: One user reported LLMfit said they couldn't run Qwen 3.5 — while it was already running on their machine. The estimates are calculated, not benchmarked. They're directionally correct but not gospel.

LLMfit vs. ModelFit: Different Approaches
LLMfit is a CLI that auto-detects your hardware and scores models locally. ModelFit is a web app where you select your device and get instant recommendations — no installation required.
| Feature | LLMfit | ModelFit |
|---|---|---|
| Platform | CLI (Rust) | Web app |
| Hardware detection | Automatic | Manual selection |
| Model database | 200+ models | Curated per device |
| Ollama integration | Yes (download from TUI) | Setup guides |
| Use case filtering | Yes (coding, reasoning, etc.) | Yes (by category) |
| Works on mobile | No | Yes |
| No install needed | No | Yes |
They complement each other. Use LLMfit if you want precise, auto-detected recommendations. Use ModelFit if you want quick answers from your phone or before buying a Mac.
Should You Use It?
Yes, if you're new to local LLMs and don't know where to start. LLMfit answers the first question everyone asks — "what can my machine actually run?" Run it once, sort by fit score, and you have a list of models to try.

Yes, if you manage multiple machines or build agent systems that need to select models at runtime. The llmfit serve REST API exposes /api/v1/models/top for exactly this.
Maybe not, if you already know your hardware limits. If you've been running Llama 3.1 70B Q4 on your 96GB Mac Studio for months, LLMfit won't tell you much you don't know.
Keep in mind: The quality scores come from model reputation, not empirical benchmarks. A model scored 85 might outperform one scored 90 on your specific use case. Use the scores as a starting point, not a final answer.
FAQ
Does LLMfit work on Linux and Windows?
Yes. It supports CUDA (NVIDIA), ROCm (AMD), SYCL (Intel Arc), and CPU-only backends. Apple Silicon gets Metal acceleration. Install via brew, scoop (Windows), cargo install llmfit, or the install script.
How accurate are the speed estimates?
They use a memory-bandwidth model validated against llama.cpp benchmarks. Expect roughly 80% accuracy for GPU models. CPU-only and hybrid CPU+GPU modes are less predictable due to cache effects and thermal throttling.
Does it detect my exact GPU model?
For ~80 popular GPU models, yes — with specific bandwidth numbers. Unknown GPUs get a per-backend fallback (CUDA: 220 GB/s, Metal: 160 GB/s, ROCm: 180 GB/s, CPU x86: 70 GB/s).
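That lookup-with-fallback behavior can be sketched directly; the fallback numbers come from the FAQ answer above, while the known-GPU entry is an illustrative example, not LLMfit's actual data:

```python
# Sketch of the per-backend bandwidth fallback described above.
# Fallback figures are from the FAQ; the KNOWN_GPUS entry is an
# illustrative example (RTX 4090 peak bandwidth ≈ 1008 GB/s).
FALLBACK_GB_S = {"cuda": 220, "metal": 160, "rocm": 180, "cpu_x86": 70}
KNOWN_GPUS = {"RTX 4090": 1008}

def bandwidth_for(gpu_name, backend):
    # Recognized GPUs get a specific figure; unknowns fall back per backend.
    return KNOWN_GPUS.get(gpu_name, FALLBACK_GB_S[backend])

print(bandwidth_for("RTX 4090", "cuda"))     # 1008
print(bandwidth_for("SomeNewGPU", "metal"))  # 160
```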
How often is the model database updated?
The database is compiled into the binary, so new models require a new release. The project has shipped 36 releases since February 2026 (roughly one every few days), so updates come fast.
Can I use LLMfit with a remote Ollama server?
Yes. Set OLLAMA_HOST="http://ip:port" before running LLMfit. It will detect models on the remote instance and show them alongside local compatibility scores.
Related
- How to Install Ollama on Mac — Step-by-step setup guide
- MacBook Air vs Pro for Local LLMs — Which MacBook to buy for AI
- Best LLMs for Mac M1/M2/M3/M4 — Our recommendations by chip and RAM
Have questions? Reach out on X/Twitter