What's the best local AI model for Mac dictation?

In our benchmark, Gemma 4 E2B was the most accurate cleanup model (2% word error rate) and, at 0.24 s on a short clip, the fastest of the models with single-digit WER. It is Saydrop's default. The only reason to pick something else is download size: it needs 3.3 GB, and the smaller text-only models are viable alternatives on tight storage.

Do these models run entirely on my Mac?

Yes. Saydrop transcribes and cleans up text fully on-device via Apple's MLX framework. Your voice never leaves the machine, and no internet or account is needed for inference — only the one-time model download.

Which model should I pick on an 8 GB or older Mac?

Pair Qwen2.5-1.5B for cleanup (0.8 GB) with whisper-small for transcription, about 1.3 GB total. It loads fast and the quality is good: Qwen2.5-1.5B lands within about 30 ms of the default on short clips, at 10% WER.

Why not split long dictations into chunks to go faster?

We tested it. Sequential chunking was 5–67% slower with identical output quality. Each chunk repays the full prompt overhead, and local MLX inference is single-threaded on the GPU, so there is no parallelism to win back. We rejected it.

Is whisper-large-v3-turbo better than whisper-medium?

Not for non-English speech. Turbo hit 67% word error rate on long German audio in our tests, against roughly 8% for whisper-medium. whisper-medium took about 0.4 s longer on that clip but is far more accurate for multilingual use, so it is the default. Treat turbo as English-only.

How much disk space do the models need?

The default setup (Gemma 4 E2B + whisper-medium) is about 4.8 GB. A minimum-footprint setup (Qwen2.5-1.5B + whisper-small) is about 1.3 GB. You can change either model in Settings at any time.

June 16, 2026 • Patrick Lehmann

We benchmarked 18 local AI models for Mac dictation — here's what won

We put 18 local cleanup LLMs and 5 Whisper models through a latency-and-accuracy benchmark on Apple Silicon. A multimodal model beat every text-only LLM on both speed and quality. Here's the data, the failures, and what to actually run.

engineering benchmark mlx apple-silicon local-ai

Saydrop does two AI things, both on your Mac: it transcribes your speech with Whisper, then optionally runs the raw transcript through a small local language model to clean it up — strip the “uh”s and “ähm”s, fix punctuation and capitalisation, tidy the obvious slips. Your voice never leaves the machine.

We shipped with exactly one model for each job. That was the safe choice, but it left users on older or storage-constrained Macs stuck with a default that was bigger and slower than they needed. So we set out to widen the catalog — and to do it honestly, we built a quality scorer first instead of eyeballing speed.

The result was a benchmark of 18 cleanup models and five Whisper transcription models. We expected the tiny text-only models to win on speed and the bigger ones to win on quality, with an obvious trade-off curve in between. The opposite happened. This post is the data.

How we measured it

Everything below was measured on an M1 Pro (10-core, 16 GB unified RAM), on-device, with three timed runs per cell after a single warmup pass.
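As a rough illustration of that protocol (one warmup pass, three timed runs, median reported), here is a minimal Python sketch. The function name `p50_latency` and the `run` callable are hypothetical stand-ins for whatever the real harness wraps around a model call; this is not Saydrop's actual benchmark code.

```python
import statistics
import time
from typing import Callable


def p50_latency(run: Callable[[], object], warmup: int = 1, timed_runs: int = 3) -> float:
    """Return the median wall-clock time in seconds over `timed_runs` calls."""
    for _ in range(warmup):
        run()  # warmup pass: the result and its timing are discarded
    samples = []
    for _ in range(timed_runs):
        start = time.perf_counter()
        run()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)
```

With three samples the median is simply the middle one, so a single slow outlier does not move the reported number.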

The cleanup harness uses six paired fixtures — a “dirty” dictation transcript and a hand-written “clean” reference — across two languages and three lengths:

Fixture	Language	Words
short_en / short_de	English / German	34 / 28
medium_en / medium_de	English / German	120 / 102
long_en / long_de	English / German	257 / 230

For each model we measure p50 latency and word error rate (WER): the word-level Levenshtein edit distance between the model’s output and the clean reference, divided by the number of words in the reference, after normalising case and punctuation. Because the count is relative to the reference, WER can exceed 100% when a model adds a lot of text. We also compute a length ratio and raise a hallucination flag when the output is more than 1.4× the length of the reference, which is a reliable signal that a model has started fabricating text.
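The scoring described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the harness itself: the normalisation rule (lowercase, strip punctuation) and the choice to measure the length ratio in words rather than characters are guesses, and the names `normalise`, `word_edit_distance`, and `score` are made up for the example.

```python
import re


def normalise(text: str) -> list[str]:
    """Lowercase, drop punctuation, and split into words."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()


def word_edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def score(output: str, reference: str, max_ratio: float = 1.4) -> dict:
    """WER, length ratio, and hallucination flag for one fixture."""
    ref, hyp = normalise(reference), normalise(output)
    if not ref:
        raise ValueError("reference must contain at least one word")
    ratio = len(hyp) / len(ref)  # assumption: ratio is measured in words
    return {
        "wer": word_edit_distance(ref, hyp) / len(ref),
        "length_ratio": ratio,
        "hallucinated": ratio > max_ratio,
    }
```

Dividing by the reference length is what lets a model that invents text score above 100%.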

A model earns a place in the picker only if it clears all three gates:

Criterion	Threshold	Why
p50 latency (short fixture)	≤ 400 ms	Below this, cleanup feels instant in the dictation UX
Mean WER vs reference	≤ 25%	Ensures the model improves the text rather than degrading it
Hallucination flag	none	Output must not balloon past the reference length
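
A minimal Python sketch of the three gates follows. The thresholds come from the table above; the `Result` type and `failed_gates` function are hypothetical, and taking the slower of the two short fixtures as the latency figure is an assumption inferred from the rejection notes in the results tables.

```python
from dataclasses import dataclass

MAX_P50_SHORT_S = 0.40  # latency gate on the short fixture
MAX_MEAN_WER = 0.25     # quality gate


@dataclass(frozen=True)
class Result:
    name: str
    p50_short_en_s: float
    p50_short_de_s: float
    mean_wer: float
    hallucinated: bool


def failed_gates(r: Result) -> list[str]:
    """Return the names of the gates a model fails (empty list means it passes)."""
    failures = []
    # Assumption: the slower of the two short fixtures is the one that counts.
    if max(r.p50_short_en_s, r.p50_short_de_s) > MAX_P50_SHORT_S:
        failures.append("latency")
    if r.mean_wer > MAX_MEAN_WER:
        failures.append("wer")
    if r.hallucinated:
        failures.append("hallucination")
    return failures
```

Fed with figures from the results tables in this post, `failed_gates(Result("Qwen3-8B-4bit", 0.54, 0.66, 0.07, False))` reports only the latency gate, while `Result("Llama-3.2-3B-Instruct-4bit", 0.33, 0.40, 0.11, False)` passes because 0.40 s sits exactly on the threshold.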

The short fixture drives the latency gate because a sentence-length dictation is the dominant use pattern; latency on longer inputs is informational. (One caveat on the Whisper numbers: those use TTS-generated audio, so the absolute WER is optimistic versus real human speech — but it is consistent across models, which is what we need for a comparison.)

The results that surprised us

Here is the full cleanup screen. First the text-only candidates running on the lighter mlx-lm loader:

Model	p50 short EN	p50 short DE	p50 long DE	WER	Hallucinated	Decision
Qwen3-0.6B-4bit	0.14 s	0.14 s	1.02 s	15%	no	Not added (leaves fillers, weak capitalisation)
Qwen3-1.7B-4bit	0.20 s	0.22 s	1.51 s	16%	no	Removed (dominated — see below)
Qwen3-4B-4bit	0.36 s	0.38 s	2.96 s	9%	no	Added
Qwen3-8B-4bit	0.54 s	0.66 s	5.81 s	7%	no	Rejected — 660 ms > gate
Qwen2.5-0.5B-Instruct-4bit	0.23 s	0.17 s	0.82 s	28%	no	Rejected — 28% WER > gate
Qwen2.5-1.5B-Instruct-4bit	0.21 s	0.28 s	1.61 s	10%	no	Added
Qwen2.5-3B-Instruct-4bit	0.34 s	0.39 s	2.82 s	13%	no	Added
Llama-3.2-1B-Instruct-4bit	0.13 s	0.15 s	1.23 s	39%	no	Rejected — 39% WER > gate
Llama-3.2-3B-Instruct-4bit	0.33 s	0.40 s	2.70 s	11%	no	Added
OpenELM-3B-Instruct-4bit	—	—	—	—	—	Unavailable — repo 404
gemma-3-1b-it-4bit	0.55 s	0.48 s	3.02 s	243%	yes	Rejected — catastrophic
gemma-3-4b-it-4bit	0.42 s	0.44 s	2.90 s	12%	no	Rejected — 440 ms > gate

And the Gemma 4 family on the multimodal mlx-vlm loader:

Model	p50 short EN	p50 short DE	p50 long DE	WER	Hallucinated	Decision
gemma-4-e2b-it-4bit	0.24 s	0.25 s	3.19 s	2%	no	Default (already in catalog)
gemma-4-e4b-it-4bit	0.58 s	0.56 s	3.53 s	6%	no	In catalog
gemma-4-e4b-it-8bit	—	—	—	—	—	In catalog (highest precision)
gemma-4-e4b-it-OptiQ-4bit	—	—	—	—	—	Failed to load
gemma-4-E4B-it-qat-4bit	—	—	—	—	—	Failed to load
gemma-4-12B-it-4bit	—	—	—	—	—	Failed to load

The headline is in those tables if you stare at them long enough: Gemma 4 E2B is the most accurate cleanup model we tested and, on short clips, the fastest of the models with single-digit WER. It posts 0.24 s on a short English clip at 2% WER. The best text-only model we added, Qwen3-4B, is 0.36 s at 9% WER: slower on both short fixtures and less accurate overall.

That is the counterintuitive part. Gemma 4 E2B is a multimodal checkpoint — it carries machinery for images and audio it never uses in this job. Conventional wisdom says a lean text-only model should beat it on a pure text task. It didn’t. It lost on both axes. The “recommended” badge in Saydrop’s settings isn’t a cautious default; it is the objective winner.

When good numbers hide a bad model

The reason we built the quality scorer before trusting any of this is sitting in the tables above.

Qwen3-0.6B passes the numeric gate — 0.14 s, 15% WER — and a speed-only benchmark would have happily shipped it. But reading its actual output, it consistently leaves filler words in place and skips capitalisation fixes. WER averages those small, frequent errors into a number that looks acceptable. Manual inspection still matters.
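
One cheap way to complement WER with a targeted check is to count filler words that survive cleanup. The sketch below is only an illustration of that idea: the filler list is hypothetical and far from complete, and nothing here is taken from the actual harness, which relied on reading the output by hand.

```python
import re

# Hypothetical filler list for English and German; extend as needed.
FILLERS = {"uh", "um", "uhm", "äh", "ähm"}


def leftover_fillers(output: str) -> list[str]:
    """Return the filler words still present in a cleaned transcript."""
    words = re.findall(r"\w+", output.lower())
    return [w for w in words if w in FILLERS]
```

A non-empty result on a fixture whose reference contains no fillers points at exactly the kind of small, frequent error that an averaged WER can hide.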

Gemma 3 1B is the cautionary tale. It clocked a 243% WER with the hallucination flag raised: its output ran to about 2.4× the length of the reference because it fabricated entire sentences that were never spoken. A latency-only benchmark would have filed it as merely a little slow (0.55 s on the short English fixture). The quality harness called it dangerous, and it is effectively blacklisted now.

The “—” rows are a different matter. Gemma 4 E4B 8-bit is in the catalog but has no timings in this table. The others are honest failures: the OptiQ, QAT, and 12B Gemma 4 variants wouldn’t load (missing parameters, a missing module), and Apple’s OpenELM-3B has been pulled from Hugging Face entirely. We list them so the next person doesn’t burn an afternoon rediscovering it.

Why chunking always loses on a single GPU

A reasonable optimisation idea for long dictations: split the text into sentence-boundary chunks and clean each one, so latency scales with chunk size rather than total length. We tested it against the two cached Gemma 4 models — short fixtures stayed as one chunk, medium split into two, long into four.

Model	Fixture	Chunks	Full p50	Chunked p50	Overhead
gemma-4-e2b-it-4bit	short_en	1	0.24 s	0.24 s	0%
gemma-4-e2b-it-4bit	medium_de	2	0.91 s	0.96 s	+6%
gemma-4-e2b-it-4bit	long_de	4	2.14 s	2.25 s	+5%
gemma-4-e4b-it-4bit	medium_de	2	1.53 s	1.85 s	+21%
gemma-4-e4b-it-4bit	long_en	4	3.00 s	3.94 s	+31%
gemma-4-e4b-it-4bit	long_de	4	3.58 s	6.00 s	+67%

Chunking was slower on every fixture that split into more than one chunk (the single-chunk short fixture was unchanged), and the WER came out identical, so there is no quality gain to offset the cost. The reason is structural: each chunk pays the full prompt overhead independently (system prompt, instruction header, KV-cache fill), and that overhead repeats once per chunk with nothing to amortise it against, because local MLX inference runs the chunks one after another on the GPU. On E4B’s four-chunk long German document the overhead compounds into a 67% penalty. E2B is hurt less only because its per-token cost is lower.
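
The structural argument can be made concrete with a toy cost model: a fixed per-call overhead plus a per-token cost. This is a deliberate simplification, not a fit to the measurements above, and the parameter values in the usage note are invented purely for illustration.

```python
def full_pass_latency(tokens: int, overhead_s: float, per_token_s: float) -> float:
    """One call over the whole text: the fixed overhead is paid once."""
    return overhead_s + tokens * per_token_s


def chunked_latency(tokens: int, chunks: int, overhead_s: float, per_token_s: float) -> float:
    """Chunks run one after another, so each one pays the fixed overhead again."""
    return chunks * overhead_s + tokens * per_token_s


def chunking_penalty(tokens: int, chunks: int, overhead_s: float, per_token_s: float) -> float:
    """Relative slowdown of sequential chunking versus a single pass."""
    full = full_pass_latency(tokens, overhead_s, per_token_s)
    chunked = chunked_latency(tokens, chunks, overhead_s, per_token_s)
    return (chunked - full) / full
```

With made-up values of 300 tokens, 0.2 s overhead, and 0.01 s per token, four chunks come out about 19% slower than one pass. In this model the penalty can never be negative, which is the point: without parallelism, splitting only adds overhead.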

Could you run the chunks in parallel? Not usefully — multiple MLX processes contend for the same Metal GPU memory. We rejected chunking for local cleanup.

The Whisper trap: “turbo” isn’t an upgrade

The transcription side held its own surprise. Five Whisper candidates:

Model	p50 long DE	WER (German)	Decision
whisper-medium	2.43 s	~8%	Recommended default
whisper-large-v3-turbo	~2.0 s	67% on long DE	Demoted — English-only
whisper-large-v3	5.49 s	~7%	In catalog; best quality, too slow for most
distil-whisper-large-v3	fast	92–99%	Rejected — English-only
whisper-small-mlx	fast	moderate	Kept for speed-first / old Macs

The name “turbo” implies a strict upgrade over medium. For English, fine. For long German audio it posted a 67% word error rate, roughly two word errors for every three words spoken. distil-whisper-large-v3 was worse still on German (92–99%). In practice both behave as English-only models wearing names that don’t say so. We switched the recommended default from turbo to whisper-medium, which costs about 0.4 s on the long German fixture (2.43 s against ~2.0 s) but is far more accurate on the multilingual workload Saydrop actually targets. Turbo stays in the catalog with an explicit “English only” label so it can’t quietly wreck a German dictation.
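
A label like that is easy to enforce in code. The sketch below shows one hypothetical way to model it in Python: the `WhisperEntry` type, the `english_only` flag, and the `usable_models` helper are all invented for illustration, and the flag reflects the benchmark results in this post rather than any official model metadata.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WhisperEntry:
    name: str
    english_only: bool  # set from benchmark results, not from upstream metadata


CATALOG = [
    WhisperEntry("whisper-medium", english_only=False),
    WhisperEntry("whisper-large-v3-turbo", english_only=True),
    WhisperEntry("whisper-large-v3", english_only=False),
    WhisperEntry("whisper-small-mlx", english_only=False),
]


def usable_models(language: str) -> list[str]:
    """Names of catalog entries that are safe for the given language code."""
    return [e.name for e in CATALOG if language == "en" or not e.english_only]
```

Here `usable_models("de")` leaves turbo out, while `usable_models("en")` returns the whole catalog.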

(One small footgun for fellow MLX users: the Hugging Face repo mlx-community/whisper-small 404s. The one you want is mlx-community/whisper-small-mlx.)

So which model should you actually run?

For almost everyone, the answer is: nothing — you’re already on it. Saydrop installs with Gemma 4 E2B for cleanup and whisper-medium for transcription selected by default, and that is precisely the pairing that won this benchmark on both speed and accuracy. There is no faster or more accurate combination to switch to — the recommended default is the optimal one. If you just installed Saydrop, you can stop reading here.

The go-to setup → Gemma 4 E2B + whisper-medium (~4.8 GB total). It ships pre-selected and on by default. Don’t change it unless you’re short on disk space or running an older Mac.

The usual reason to open the model pickers is a hardware constraint, not a quality gain:

Your situation	Cleanup	Transcribe	Total download	Notes
Most users — the default	Gemma 4 E2B	whisper-medium	~4.8 GB	Ships pre-selected. Fastest and most accurate — nothing to change
Storage constrained	Qwen2.5-1.5B	whisper-medium	~2.3 GB	Within ~30 ms of E2B on short clips; 10% WER is good
Maximum accuracy	Gemma 4 E4B 8-bit	whisper-large-v3	~8.5 GB	For long, careful recordings where you’ll proofread anyway
Oldest Mac / minimum footprint	Qwen2.5-1.5B	whisper-small	~1.3 GB	Fastest load time; quality is fine for casual use
Open-weight only	Qwen3-4B	whisper-medium	~3.7 GB	Best non-Google option; 9% WER is the lowest among the text-only models we added
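
If disk space is the only constraint, the table above reduces to a simple lookup. The following Python sketch assumes the approximate download sizes from the table and a hypothetical `pick_setup` helper; it ignores any extra headroom the app may need beyond the download itself.

```python
from typing import Optional

# (cleanup model, transcription model, approximate total download in GB),
# taken from the table above and ordered from most to least preferred.
SETUPS = [
    ("Gemma 4 E2B", "whisper-medium", 4.8),
    ("Qwen2.5-1.5B", "whisper-medium", 2.3),
    ("Qwen2.5-1.5B", "whisper-small", 1.3),
]


def pick_setup(free_gb: float) -> Optional[tuple[str, str]]:
    """Return the most preferred pairing that fits in `free_gb`, or None."""
    for cleanup, transcribe, size_gb in SETUPS:
        if size_gb <= free_gb:
            return cleanup, transcribe
    return None
```

For example, `pick_setup(3.0)` falls through the default and lands on Qwen2.5-1.5B with whisper-medium, and anything below about 1.3 GB returns `None`.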

Every alternative is a dropdown in Settings — switch any time, no reinstall, and inference stays fully on-device either way. But the short version stands: the default is the best, so most people should leave it exactly as it ships. For the engineering behind keeping these models hot in memory, see how we run MLX Whisper locally.

The takeaway

The lesson we keep relearning: build the measurement before you trust the intuition. We assumed small text-only models would be the speed champions and were ready to ship one. The benchmark said a multimodal model wins outright, that one “tiny” model fabricates entire sentences, that a clever chunking optimisation only ever loses, and that the model literally named “turbo” is a trap for half our users. None of that was visible from the spec sheets.

If you want dictation that’s been measured this way, download Saydrop here — first-launch onboarding handles the model download and warmup for you.