We benchmarked 18 local AI models for Mac dictation — here's what won
We put 18 local cleanup LLMs and 5 Whisper models through a latency-and-accuracy benchmark on Apple Silicon. A multimodal model beat every text-only LLM on quality, and every text-only model that was faster on short clips was far less accurate. Here's the data, the failures, and what to actually run.
Saydrop does two AI things, both on your Mac: it transcribes your speech with Whisper, then optionally runs the raw transcript through a small local language model to clean it up — strip the “uh”s and “ähm”s, fix punctuation and capitalisation, tidy the obvious slips. Your voice never leaves the machine.
We shipped with exactly one model for each job. That was the safe choice, but it left users on older or storage-constrained Macs stuck with a default that was bigger and slower than they needed. So we set out to widen the catalog — and to do it honestly, we built a quality scorer first instead of eyeballing speed.
The result was a benchmark of 18 cleanup models and five Whisper transcription models. We expected the tiny text-only models to win on speed and the bigger ones to win on quality, with an obvious trade-off curve in between. The opposite happened. This post is the data.
How we measured it
Everything below was measured on an M1 Pro (10-core, 16 GB unified RAM), on-device, with three timed runs per cell after a single warmup pass.
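As a rough illustration of that protocol, a p50 measurement with one untimed warmup pass and three timed runs might look like the sketch below. The name `p50_latency` and its parameters are hypothetical, not Saydrop's actual harness; `run` stands in for whatever call executes one cleanup or transcription.

```python
import statistics
import time
from typing import Callable


def p50_latency(
    run: Callable[[], object],
    timed_runs: int = 3,
    warmup_runs: int = 1,
) -> float:
    """Return the median wall-clock latency of `run`, in seconds."""
    for _ in range(warmup_runs):
        run()  # untimed: lets weights load and caches fill before measuring

    samples = []
    for _ in range(timed_runs):
        start = time.perf_counter()
        run()
        samples.append(time.perf_counter() - start)

    # With an odd number of samples the median is the middle one.
    return statistics.median(samples)
```

With three timed runs the median is simply the middle sample, so a single slow outlier does not move the reported figure.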
The cleanup harness uses six paired fixtures — a “dirty” dictation transcript and a hand-written “clean” reference — across two languages and three lengths:
| Fixture | Language | Words |
|---|---|---|
| short_en / short_de | English / German | 34 / 28 |
| medium_en / medium_de | English / German | 120 / 102 |
| long_en / long_de | English / German | 257 / 230 |
For each model we measure p50 latency and word error rate (WER): the word-level Levenshtein edit distance between the model’s output and the clean reference, divided by the number of words in the reference, after normalising case and punctuation. We also compute a length ratio and raise a hallucination flag when the output is more than 1.4× the length of the reference, which is a reliable signal that a model has started fabricating text.
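To make the scoring concrete, here is a minimal sketch of a word-level WER and the length-ratio flag. It is not Saydrop's scorer: the function names are made up for illustration, the normalisation (lowercase, drop punctuation) is one plausible reading of "normalised for case and punctuation", and measuring the length ratio in words rather than characters is an assumption.

```python
import re


def normalise(text: str) -> list[str]:
    """Lowercase, drop punctuation, split on whitespace (assumed normalisation)."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()


def word_error_rate(hypothesis: str, reference: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    hyp, ref = normalise(hypothesis), normalise(reference)
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Row-by-row dynamic programme: prev[j] is the edit distance between
    # the reference words seen so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


def length_ratio(hypothesis: str, reference: str) -> float:
    """Output length relative to the reference, counted in words (assumption)."""
    return len(normalise(hypothesis)) / len(normalise(reference))


def is_hallucination(hypothesis: str, reference: str, max_ratio: float = 1.4) -> bool:
    """Flag outputs that run past 1.4x the reference length."""
    return length_ratio(hypothesis, reference) > max_ratio
```

Because the distance is divided by the reference length, WER is not capped at 100%: an output padded with invented sentences accumulates one insertion per extra word, which is how a score above 200% can arise.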
A model earns a place in the picker only if it clears all three gates:
| Criterion | Threshold | Why |
|---|---|---|
| p50 latency (short fixture) | ≤ 400 ms | Below this, cleanup feels instant in the dictation UX |
| Mean WER vs reference | ≤ 25% | Ensures the model improves the text rather than degrading it |
| Hallucination flag | none | Output must not balloon past the reference length |
The short fixture drives the latency gate because a sentence-length dictation is the dominant use pattern; latency on longer inputs is informational. (One caveat on the Whisper numbers: those use TTS-generated audio, so the absolute WER is optimistic versus real human speech — but it is consistent across models, which is what we need for a comparison.)
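The three gates can be written down as a small predicate. This is a sketch only: `CleanupResult` and `failed_gates` are hypothetical names, and applying the latency threshold to the slower of the two short fixtures is an inference from the rejection reasons quoted in the results tables, not something the harness is confirmed to do.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CleanupResult:
    model: str
    p50_short_en_s: float  # seconds
    p50_short_de_s: float  # seconds
    mean_wer: float        # fraction, e.g. 0.09 for 9%
    hallucinated: bool


def failed_gates(
    result: CleanupResult,
    max_latency_s: float = 0.400,
    max_wer: float = 0.25,
) -> list[str]:
    """Return the names of the gates a model fails; an empty list means it clears all three."""
    failures = []
    # Assumption: the gate is checked against the slower short fixture.
    if max(result.p50_short_en_s, result.p50_short_de_s) > max_latency_s:
        failures.append("latency")
    if result.mean_wer > max_wer:
        failures.append("wer")
    if result.hallucinated:
        failures.append("hallucination")
    return failures
```

Clearing the gates is necessary rather than sufficient: a model can pass all three numerically and still be left out after its output is read by hand.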
The results that surprised us
Here is the full cleanup screen. First the text-only candidates running on the lighter mlx-lm loader:
| Model | p50 short EN | p50 short DE | p50 long DE | WER | Hallucinated | Decision |
|---|---|---|---|---|---|---|
| Qwen3-0.6B-4bit | 0.14 s | 0.14 s | 1.02 s | 15% | no | Not added (leaves fillers, weak capitalisation) |
| Qwen3-1.7B-4bit | 0.20 s | 0.22 s | 1.51 s | 16% | no | Removed (dominated — see below) |
| Qwen3-4B-4bit | 0.36 s | 0.38 s | 2.96 s | 9% | no | Added |
| Qwen3-8B-4bit | 0.54 s | 0.66 s | 5.81 s | 7% | no | Rejected — 660 ms > gate |
| Qwen2.5-0.5B-Instruct-4bit | 0.23 s | 0.17 s | 0.82 s | 28% | no | Rejected — 28% WER > gate |
| Qwen2.5-1.5B-Instruct-4bit | 0.21 s | 0.28 s | 1.61 s | 10% | no | Added |
| Qwen2.5-3B-Instruct-4bit | 0.34 s | 0.39 s | 2.82 s | 13% | no | Added |
| Llama-3.2-1B-Instruct-4bit | 0.13 s | 0.15 s | 1.23 s | 39% | no | Rejected — 39% WER > gate |
| Llama-3.2-3B-Instruct-4bit | 0.33 s | 0.40 s | 2.70 s | 11% | no | Added |
| OpenELM-3B-Instruct-4bit | — | — | — | — | — | Unavailable — repo 404 |
| gemma-3-1b-it-4bit | 0.55 s | 0.48 s | 3.02 s | 243% | yes | Rejected — catastrophic |
| gemma-3-4b-it-4bit | 0.42 s | 0.44 s | 2.90 s | 12% | no | Rejected — 440 ms > gate |
And the Gemma 4 family on the multimodal mlx-vlm loader:
| Model | p50 short EN | p50 short DE | p50 long DE | WER | Hallucinated | Decision |
|---|---|---|---|---|---|---|
| gemma-4-e2b-it-4bit | 0.24 s | 0.25 s | 3.19 s | 2% | no | Default (already in catalog) |
| gemma-4-e4b-it-4bit | 0.58 s | 0.56 s | 3.53 s | 6% | no | In catalog |
| gemma-4-e4b-it-8bit | — | — | — | — | — | In catalog (highest precision) |
| gemma-4-e4b-it-OptiQ-4bit | — | — | — | — | — | Failed to load |
| gemma-4-E4B-it-qat-4bit | — | — | — | — | — | Failed to load |
| gemma-4-12B-it-4bit | — | — | — | — | — | Failed to load |
The headline is in those tables if you stare at them long enough: Gemma 4 E2B is the most accurate cleanup model we tested, and nothing faster comes close on quality. It posts 0.24 s on a short clip at 2% WER. The best text-only model that cleared the latency gate, Qwen3-4B, is 0.36 s at 9% WER: slower on short clips and less accurate. The models that beat E2B on raw short-clip latency, such as Qwen3-0.6B (0.14 s) and Llama-3.2-1B (0.13 s), do so at 15% and 39% WER.
That is the counterintuitive part. Gemma 4 E2B is a multimodal checkpoint: it carries machinery for images and audio it never uses in this job. Conventional wisdom says a lean text-only model should beat it on a pure text task. None did: the closest text-only contender lost on both short-clip latency and accuracy. The “recommended” badge in Saydrop’s settings isn’t a cautious default; it is the measured winner.
When good numbers hide a bad model
The reason we built the quality scorer before trusting any of this is sitting in the tables above.
Qwen3-0.6B passes the numeric gate — 0.14 s, 15% WER — and a speed-only benchmark would have happily shipped it. But reading its actual output, it consistently leaves filler words in place and skips capitalisation fixes. WER averages those small, frequent errors into a number that looks acceptable. Manual inspection still matters.
Gemma 3 1B is the cautionary tale. It clocked a 243% WER with the hallucination flag raised: its output ran 2.4× longer than the reference because it fabricated entire sentences that were never spoken. A latency-only benchmark would have filed it as a near miss on the 400 ms gate (0.48 to 0.55 s on short clips). The quality harness called it dangerous, and it is effectively blacklisted now.
The rows marked as failed or unavailable are honest failures of a different kind: the OptiQ, QAT, and 12B Gemma 4 variants wouldn’t load (missing parameters, a missing module), and Apple’s OpenELM-3B has been pulled from Hugging Face entirely. We list them so the next person doesn’t burn an afternoon rediscovering them.
Why chunking always loses on a single GPU
A reasonable optimisation idea for long dictations: split the text into sentence-boundary chunks and clean each one, so latency scales with chunk size rather than total length. We tested it against the two cached Gemma 4 models — short fixtures stayed as one chunk, medium split into two, long into four.
| Model | Fixture | Chunks | Full p50 | Chunked p50 | Overhead |
|---|---|---|---|---|---|
| gemma-4-e2b-it-4bit | short_en | 1 | 0.24 s | 0.24 s | 0% |
| gemma-4-e2b-it-4bit | medium_de | 2 | 0.91 s | 0.96 s | +6% |
| gemma-4-e2b-it-4bit | long_de | 4 | 2.14 s | 2.25 s | +5% |
| gemma-4-e4b-it-4bit | medium_de | 2 | 1.53 s | 1.85 s | +21% |
| gemma-4-e4b-it-4bit | long_en | 4 | 3.00 s | 3.94 s | +31% |
| gemma-4-e4b-it-4bit | long_de | 4 | 3.58 s | 6.00 s | +67% |
Chunking was slower at every text length that actually split into more than one chunk, and the WER came out identical, so there was no quality gain to offset the cost. The reason is structural: each chunk pays the full prompt overhead independently (system prompt, instruction header, KV-cache fill), and that overhead repeats once per chunk with nothing to amortise it against, because the chunks run one after another on a single GPU. On E4B’s four-chunk long German document the overhead compounds into a 67% penalty. E2B is hurt less only because its per-token cost is lower.
Could you run the chunks in parallel? Not usefully — multiple MLX processes contend for the same Metal GPU memory. We rejected chunking for local cleanup.
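For readers who want to reproduce the comparison, the sketch below shows one way to split text at sentence boundaries and to compute the overhead column from a pair of p50 timings. Both helpers are hypothetical: the regex-based splitter is a simplification (it will mis-split abbreviations such as "z.B."), and it is not the chunker used in the benchmark.

```python
import re


def split_into_chunks(text: str, n_chunks: int) -> list[str]:
    """Group sentences into at most n_chunks chunks of near-equal sentence count."""
    # Naive boundary rule: ., ! or ? followed by whitespace ends a sentence.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if not sentences:
        return []

    n = max(1, min(n_chunks, len(sentences)))
    size, extra = divmod(len(sentences), n)

    chunks, start = [], 0
    for i in range(n):
        take = size + (1 if i < extra else 0)  # spread the remainder over the first chunks
        chunks.append(" ".join(sentences[start:start + take]))
        start += take
    return chunks


def overhead_pct(full_s: float, chunked_s: float) -> float:
    """Extra latency of the chunked run relative to the single-pass run, in percent."""
    return (chunked_s - full_s) / full_s * 100
```

Feeding the table's E4B figures for long_en into `overhead_pct(3.00, 3.94)` gives roughly 31%, matching the overhead column for that row.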
The Whisper trap: “turbo” isn’t an upgrade
The transcription side held its own surprise. Five Whisper candidates:
| Model | p50 long DE | WER (German) | Decision |
|---|---|---|---|
| whisper-medium | 2.43 s | ~8% | Recommended default |
| whisper-large-v3-turbo | ~2.0 s | 67% on long DE | Demoted — English-only |
| whisper-large-v3 | 5.49 s | ~7% | In catalog; best quality, too slow for most |
| distil-whisper-large-v3 | fast | 92–99% | Rejected — English-only |
| whisper-small-mlx | fast | moderate | Kept for speed-first / old Macs |
The name “turbo” implies a strict upgrade over medium. For English, fine. For long German audio it posted a 67% word error rate, more than half the words wrong. distil-whisper-large-v3 was worse still on German (92–99%). In our tests both behaved like English-only models, and neither name says so. We switched the recommended default from turbo to whisper-medium, which is slightly slower on long German audio (2.43 s vs ~2.0 s) but far more accurate on the multilingual workload Saydrop actually targets. Turbo stays in the catalog with an explicit “English only” label so it can’t quietly wreck a German dictation.
(One small footgun for fellow MLX users: the Hugging Face repo mlx-community/whisper-small 404s. The one you want is mlx-community/whisper-small-mlx.)
So which model should you actually run?
For almost everyone, the answer is: nothing, because you’re already on it. Saydrop installs with Gemma 4 E2B for cleanup and whisper-medium for transcription selected by default, and that is the pairing that came out of this benchmark with the best balance of speed and accuracy. The alternatives either trade accuracy for a smaller footprint or trade latency for a small accuracy gain. If you just installed Saydrop, you can stop reading here.
The go-to setup → Gemma 4 E2B + whisper-medium (~4.8 GB total). It ships pre-selected and on by default. Don’t change it unless you’re short on disk space or running an older Mac.
The main reason to open the model pickers is a hardware constraint, not a quality gain:
| Your situation | Cleanup | Transcribe | Total download | Notes |
|---|---|---|---|---|
| Most users — the default | Gemma 4 E2B | whisper-medium | ~4.8 GB | Ships pre-selected. Fastest and most accurate — nothing to change |
| Storage constrained | Qwen2.5-1.5B | whisper-medium | ~2.3 GB | Within ~30 ms of E2B on short clips; 10% WER is genuinely good |
| Maximum accuracy | Gemma 4 E4B 8-bit | whisper-large-v3 | ~8.5 GB | For long, careful recordings where you’ll proofread anyway |
| Oldest Mac / minimum footprint | Qwen2.5-1.5B | whisper-small | ~1.3 GB | Fastest load time; quality is fine for casual use |
| Non-Google models only | Qwen3-4B | whisper-medium | ~3.7 GB | Best non-Google option; 9% WER vs E2B’s 2% |
Every alternative is a dropdown in Settings — switch any time, no reinstall, and inference stays fully on-device either way. But the short version stands: the default is the best, so most people should leave it exactly as it ships. For the engineering behind keeping these models hot in memory, see how we run MLX Whisper locally.
The takeaway
The lesson we keep relearning: build the measurement before you trust the intuition. We assumed small text-only models would be the speed champions and were ready to ship one. The benchmark said a multimodal model wins outright, that one “tiny” model fabricates entire sentences, that a clever chunking optimisation only ever loses, and that the model literally named “turbo” is a trap for half our users. None of that was visible from the spec sheets.
If you want dictation that’s been measured this way, download Saydrop here — first-launch onboarding handles the model download and warmup for you.