Patrick Lehmann

Local vs cloud dictation on your Mac

Local speech-to-text on Mac keeps audio private. Compare accuracy, cost, latency, and privacy tradeoffs between on-device and cloud dictation.

Local dictation on your Mac avoids subscriptions and keeps your voice data private. Cloud dictation trades privacy and cost for guaranteed latest-model access and cross-platform support. Both now match on accuracy, so the choice hinges on your privacy expectations and willingness to run a local model.

This post walks through how each approach actually works, the real costs over time, and whether on-device transcription has finally caught up with cloud.

How cloud dictation works: what your Mac sends up

Cloud dictation routes your audio to a remote server for transcription, typically via HTTPS. When you dictate, the audio is compressed and sent to the provider’s API (services such as OpenAI’s Whisper API, Google Speech-to-Text, or Microsoft Speech Services), and the transcript comes back as text.
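To make "what your Mac sends up" concrete, here is a minimal sketch of the request a dictation client might assemble. The endpoint URL, header layout, and function name are hypothetical placeholders, not any real provider's API, and the request is only built, never sent.

```python
import urllib.request

# Hypothetical endpoint, for illustration only. Real providers each define
# their own URL, authentication scheme, and request body format.
EXAMPLE_ENDPOINT = "https://api.example.com/v1/transcribe"


def build_transcription_request(audio: bytes, api_key: str) -> urllib.request.Request:
    """Package recorded audio as an HTTPS POST without sending it."""
    return urllib.request.Request(
        EXAMPLE_ENDPOINT,
        data=audio,  # the audio bytes themselves are the request body
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "audio/wav",
        },
        method="POST",
    )


request = build_transcription_request(b"RIFF...fake-wav-bytes", api_key="sk-placeholder")
print(request.get_method(), request.full_url, len(request.data), "bytes")
```

The point of the sketch is that the body of the request is your recording: whatever the provider does next, the audio has already left the machine.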

The key question: what happens to your audio after transcription?

Most cloud providers retain audio for 30–90 days for quality assurance and model improvement. OpenAI, Google, and Microsoft all state in their terms that they may use aggregated audio to improve models unless you specifically request deletion or disable telemetry. Retention policies vary, so check your provider’s privacy policy if you dictate sensitive content (medical notes, legal briefs, client conversations).

Network latency adds 200–500 ms on most connections. Your Mac must have active internet; cellular or congested Wi-Fi slows transcription. The provider controls which model version you use — you get updates automatically, but you cannot opt out or downgrade if a new version introduces issues for your accent or domain.

Cost is recurring and usually metered by usage: OpenAI charges $0.02–0.06 per minute depending on the endpoint; Google bills in 15-second increments. For frequent users (say 20–30 minutes of dictation per day), annual cloud cost runs roughly $150–$250 at the low end of that rate range. That cost recurs for as long as you use the service.
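The annual figure is simple arithmetic, and it is sensitive to the per-minute rate. A small sketch, assuming flat per-minute billing and dictation every day of the year (both simplifications; the function name is illustrative):

```python
def annual_cloud_cost(minutes_per_day: float, rate_per_minute: float, days: int = 365) -> float:
    """Yearly spend under flat per-minute billing."""
    return minutes_per_day * rate_per_minute * days


# 30 minutes per day at the low and high ends of the quoted rate range.
low = annual_cloud_cost(30, 0.02)
high = annual_cloud_cost(30, 0.06)
print(f"${low:.0f} to ${high:.0f} per year")
```

At $0.02 per minute, 30 minutes a day comes to about $219 per year; at the top of the rate range the same usage costs about three times as much, so it is worth plugging in your provider's actual rate.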

How on-device dictation works: the model stays local

On-device (or “local”) dictation downloads an open-source model like Whisper to your Mac and runs transcription entirely on your hardware. Audio is captured, processed in RAM, and output as text — nothing leaves your machine.
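As a rough sketch of what a local transcription call can look like, the snippet below uses the mlx-whisper package. The model repository name is an assumption (substitute whichever Whisper build you use), and the `backend` parameter is a hypothetical hook added here so the function can be exercised without the package installed.

```python
from typing import Callable, Optional

# Assumed model repository name; replace with the Whisper build you actually use.
MODEL_REPO = "mlx-community/whisper-large-v3-mlx"


def transcribe_local(audio_path: str, backend: Optional[Callable[..., dict]] = None) -> str:
    """Transcribe an audio file on this machine and return the text."""
    if backend is None:
        # Imported lazily so the sketch can be read and tested without the package.
        # The first real call may download model weights; the audio is not uploaded.
        import mlx_whisper
        backend = mlx_whisper.transcribe
    result = backend(audio_path, path_or_hf_repo=MODEL_REPO)
    return result["text"].strip()
```

On an Apple Silicon Mac with mlx-whisper installed, `transcribe_local("note.wav")` would run the model locally; a dictation app wraps the same idea with microphone capture and text injection.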

The transcription engine is stateless in the sense that no audio or text carries over between dictations. The model itself is loaded on the first dictation; on Apple Silicon it then stays resident in memory while the app is open, so the second and later dictations start fast.
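The "model stays resident" behavior is essentially caching. Here is a minimal sketch using a stand-in loader; the names and the simulated load delay are illustrative, and a real app would load actual model weights instead.

```python
import time
from functools import lru_cache

load_count = 0


@lru_cache(maxsize=1)
def get_model(name: str) -> dict:
    """Stand-in loader: a real app would read model weights from disk here."""
    global load_count
    load_count += 1
    time.sleep(0.05)  # simulate the one-time load cost
    return {"name": name}


def transcribe(audio: bytes, model_name: str = "large-v3") -> str:
    model = get_model(model_name)  # loaded on the first call, reused afterwards
    return f"<{len(audio)} bytes transcribed with {model['name']}>"


transcribe(b"first dictation")
transcribe(b"second dictation")
print("model loads:", load_count)
```

Only the first call pays the load cost; later calls reuse the cached object until the process exits, which mirrors what happens when the app is quit and relaunched.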

There is no network dependency. No internet required (though you may want it for optional cleanup or model updates). You are not subject to API rate limits or provider downtime. If a model version works well for your voice, you can pin it indefinitely — updates are optional.

Setup takes longer upfront: the model is 1.5 GB for Whisper large-v3 (830 MB for the quantized 4-bit version). On first launch, download time is 5–10 minutes on typical broadband. Subsequent launches are instant because the model is cached locally.
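The download estimate follows from file size and connection speed. A quick sketch, assuming decimal gigabytes and a sustained download rate (real downloads fluctuate):

```python
def download_minutes(size_gb: float, mbps: float) -> float:
    """Minutes to download size_gb gigabytes at a sustained mbps megabits per second."""
    bits = size_gb * 1e9 * 8
    seconds = bits / (mbps * 1e6)
    return seconds / 60


for speed in (20, 40, 100):
    full = download_minutes(1.5, speed)
    quantized = download_minutes(0.83, speed)
    print(f"{speed} Mbps: full model {full:.1f} min, 4-bit model {quantized:.1f} min")
```

The 5–10 minute range for the 1.5 GB model corresponds to roughly 20–40 Mbps; on a faster line the wait shrinks accordingly.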

Privacy: what actually leaves your Mac

With cloud dictation, your audio bytes leave your device. Most services anonymize the associated metadata (IP address, timestamp, app name), but audio itself is transmitted. Depending on the provider, transcripts and audio may be logged for audit trails or regulatory compliance.

With local dictation, audio stays on your Mac entirely. No transmission, no logging, no retention. The model runs in-process. Your personal dictionary (custom words for your field) never syncs to a cloud store — it lives on your disk, read on each dictation.
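One simple way a personal dictionary can work without any cloud sync is a local file of corrections applied to each transcript. The sketch below is an assumption about how such a feature could be built, not a description of any particular app; real apps may instead bias the model itself toward your terms. The file name and example entries are made up.

```python
import json
import re
import tempfile
from pathlib import Path


def load_dictionary(path: Path) -> dict[str, str]:
    """Read the user's term corrections from a local JSON file."""
    return json.loads(path.read_text(encoding="utf-8"))


def apply_dictionary(text: str, corrections: dict[str, str]) -> str:
    """Replace each misheard phrase with the preferred spelling (whole words, any case)."""
    for heard, preferred in corrections.items():
        pattern = r"\b" + re.escape(heard) + r"\b"
        # A function replacement avoids backslash handling in the preferred text.
        text = re.sub(pattern, lambda _m, p=preferred: p, text, flags=re.IGNORECASE)
    return text


with tempfile.TemporaryDirectory() as tmp:
    dictionary_file = Path(tmp) / "dictionary.json"
    dictionary_file.write_text(
        json.dumps({"cube control": "kubectl", "post gress": "Postgres"}),
        encoding="utf-8",
    )
    corrections = load_dictionary(dictionary_file)

fixed = apply_dictionary("Run cube control against the post gress pod", corrections)
print(fixed)
```

Everything here is a file read and a string substitution on your own disk, which is why the dictionary never needs to leave the machine.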

The tradeoff: if your Mac itself is compromised (by malware or theft), local transcription offers no protection, because the attacker has the same file-system access you do. But for the common threat model (curious vendors, mass surveillance, accidental data breaches), local transcription is meaningfully more private.

If you use cloud dictation but need privacy, use a disposable email, disable data improvement opt-ins, and request deletion of audio after transcription. Not all providers offer this; check before committing.

Cost over time: one-time vs subscription

Cloud dictation is a recurring expense. At typical usage (20–30 minutes per day), expect $150–$250 annually. Over five years, that adds up to $750–$1,250, and the expense continues for as long as you use the service.

Local dictation is paid upfront and then free. A one-time license costs $39–$80 depending on the app. Model updates are free (Whisper is open-source and released periodically; you download them as you choose). Storage cost is negligible: the model is 1.5 GB, and the app itself is under 200 MB.

Scenario: a writer who dictates 10 hours per month:

  • Cloud: $30/month × 12 = $360/year. Five-year total: $1,800.
  • Local: $49 one-time. Five-year total: $49.

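In this scenario the break-even point falls in the second month: by then cloud has cost $60 against $49 for the license. Lighter users take proportionally longer to break even at metered rates, but the direction is the same.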
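To run the same comparison with your own numbers, here is a small sketch. "Break-even" is taken here to mean the first month in which cumulative cloud spend exceeds the one-time price; that definition, and the function names, are choices made for this example.

```python
import math


def break_even_month(one_time_price: float, monthly_cloud_cost: float) -> int:
    """First month in which cumulative cloud spend exceeds the one-time price."""
    if monthly_cloud_cost <= 0:
        raise ValueError("monthly_cloud_cost must be positive")
    return math.floor(one_time_price / monthly_cloud_cost) + 1


def five_year_totals(one_time_price: float, monthly_cloud_cost: float) -> tuple[float, float]:
    """Return (local total, cloud total) over 60 months."""
    return one_time_price, monthly_cloud_cost * 12 * 5


# The writer scenario from the list above: $49 one-time vs $30 per month.
local_total, cloud_total = five_year_totals(49, 30)
print("break-even month:", break_even_month(49, 30))
print(f"five-year totals: local ${local_total}, cloud ${cloud_total}")
```

The five-year totals reproduce the $49 and $1,800 figures in the scenario; changing the inputs shows how quickly a lighter or heavier user recoups the license.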

Subscription fatigue is real. (West Monroe “State of Subscription Services” reports that 52% of users canceled at least one subscription in the past year; 41% report subscription fatigue.) Local dictation eliminates that friction: you own the software, no renewal reminders, no forced upgrades.

Accuracy and latency: has local caught up?

For years, cloud dictation was noticeably more accurate because cloud providers could run massive models with expensive GPU hardware. Local dictation used smaller, compromised models to fit on consumer hardware.

That gap has closed.

Whisper large-v3 (the open-source model most local apps use) achieves 7.44% word error rate (WER) across real-world test conditions (arxiv.org/html/2510.06961v1, Oct 2025). For comparison, human baseline is 4–6.8% on the same test sets. Cloud services (Whisper API, Google Speech-to-Text) report similar numbers: 2.7% WER on clean laboratory audio, but 8–12% on noisy real-world speech.
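Word error rate is the number of word substitutions, deletions, and insertions needed to turn the transcript into the reference, divided by the number of reference words. A minimal sketch using a standard edit-distance computation; the lowercase-and-split normalization is a simplification, since published benchmarks apply more careful text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from transcript
                curr[j - 1] + 1,     # insertion: extra word in transcript
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


reference = "please send the quarterly report to the finance team today"
hypothesis = "please send the quarterly report to the finance teen today"
print(f"WER: {word_error_rate(reference, hypothesis):.1%}")
```

One wrong word in a ten-word sentence is a 10% WER, which gives a feel for what a single-digit error rate means in practice: roughly one correction every dozen or so words.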

The practical difference is small for typical dictation (emails, notes, chat messages). Accuracy degrades with heavy background noise or non-native accents, where cloud might edge ahead by 2–3 percentage points. If you dictate in quiet environments and are a native English speaker, local and cloud are functionally equivalent.

Latency is where local wins. On Apple Silicon, local transcription of a short utterance is typically sub-second: the text appears almost as soon as you release the hotkey. Cloud transcription adds network round-trip time: 200–500 ms to send audio, 500 ms–2 s for the cloud service to process, 200–500 ms to receive the result. Total: roughly 1–3 seconds of waiting.
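The cloud total is just the sum of the three legs. A short sketch of that latency budget, using the ranges quoted above (the dictionary keys are illustrative labels):

```python
# (best case, worst case) in milliseconds for each leg of a cloud round trip.
CLOUD_LEGS_MS = {
    "upload": (200, 500),
    "processing": (500, 2000),
    "download": (200, 500),
}


def total_range(legs: dict[str, tuple[int, int]]) -> tuple[int, int]:
    """Sum the best-case and worst-case latency across all legs."""
    best = sum(low for low, _ in legs.values())
    worst = sum(high for _, high in legs.values())
    return best, worst


best_ms, worst_ms = total_range(CLOUD_LEGS_MS)
print(f"cloud round trip: {best_ms} ms to {worst_ms} ms")
```

The best case lands just under a second and the worst case at three seconds, so even a fast connection leaves cloud slower than a sub-second local result.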

Latency matters psychologically. Instant text feels like you are typing; waiting for a network response feels like you are dictating to a machine.

When cloud dictation still makes sense

Local transcription is not a universal win. Cloud dictation is the right choice when:

Cross-platform is required. You need iOS, Android, or Windows support. Local models exist for these platforms, but they are typically smaller and less accurate. If you switch devices during the day, cloud is simpler (one account, no per-device setup).

Memory or disk space is constrained. Machines with less than 8 GB of unified memory or less than 5 GB of free storage should use cloud. The 4-bit quantized model brings the local download to 830 MB, but cloud avoids the download entirely.

You need the latest model immediately. Cloud services update their models on their schedule, and you get new versions automatically. Local requires you to manually download a new model when a release comes out. If your domain benefits from the latest architectural improvements (and you are tracking model releases), cloud saves that overhead.

Offline resilience is not a priority. If you are always online and do not dictate in airplanes, remote offices, or transit, network dependency is not a concern.

Always-on deployment. If you are running dictation on a server or in an environment where a persistent local process (for example, a Python process holding the model in memory) is not practical, cloud is usually the only realistic option.

For the majority of Mac users — those who dictate at their desk, want privacy, and have 5 GB free disk space — local is the better default.
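The memory and storage thresholds mentioned above can be turned into a quick self-check. This is a sketch under the article's own rules of thumb (8 GB memory, 5 GB free disk); the function names are illustrative, and memory is passed in by hand because reading it portably is outside the scope of the example.

```python
import shutil

MIN_MEMORY_GB = 8      # rule of thumb from the article
MIN_FREE_DISK_GB = 5   # rule of thumb from the article


def free_disk_gb(path: str = "/") -> float:
    """Free space on the volume containing path, in decimal gigabytes."""
    return shutil.disk_usage(path).free / 1e9


def can_run_local(memory_gb: float, disk_free_gb: float) -> bool:
    """True when both thresholds for comfortable local dictation are met."""
    return memory_gb >= MIN_MEMORY_GB and disk_free_gb >= MIN_FREE_DISK_GB


# Example: a 16 GB machine, checked against the real free space on this volume.
print("local dictation looks feasible:", can_run_local(16, free_disk_gb()))
```

If the check fails on storage alone, the smaller quantized model may still fit; if it fails on memory, cloud is the safer choice.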

Choosing a local dictation app

If you have decided local is right for you, what should you look for?

Model quality. Verify the app uses Whisper large-v3 or newer. Smaller models (small, tiny) are significantly less accurate. Some apps use proprietary models; check independent benchmarks on your language and accent before committing.

Default behavior. The app should transcribe locally by default, not in the cloud. (Some apps offer a “hybrid” mode that falls back to cloud when local transcription is unavailable. That is reasonable, but make sure you can disable cloud entirely if you prefer.)

Cleanup (optional). Transcription is raw: the output can include “um”, “uh”, other filler words, and grammatical awkwardness. Cleanup (grammar polish, punctuation, sentence breaks) is optional but useful. Check whether the app offers local cleanup (via a model like Gemma) or only cloud cleanup. Local cleanup keeps your text private; cloud cleanup sends your draft transcript to a remote service.

Hotkey and injection. The app should offer a global hotkey (push-to-talk or always-listening), then inject the result directly into the focused app (via Cmd+V or API). Avoid apps that require you to copy-paste the result manually.

Customization. Can you pin a model version? Add a personal dictionary (custom terms in your field)? Disable silence rejection if you work in loud environments? Look for these knobs if you have specific needs.

Cost structure. One-time purchase beats subscription. If the app charges per transcription or per month, calculate whether that is cheaper than cloud over your expected usage window.

No single app wins on all dimensions. Prioritize privacy, accuracy on your accent, and hotkey responsiveness — then trade off cost and customization based on your needs.
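The "local by default, cloud only if you allow it" behavior from the checklist can be sketched as a small dispatcher. The function and parameter names are hypothetical, and the two engines are passed in as plain callables so the logic stays independent of any particular app or service.

```python
from typing import Callable, Optional

Transcriber = Callable[[bytes], str]


def transcribe(
    audio: bytes,
    local: Transcriber,
    cloud: Optional[Transcriber] = None,
    allow_cloud: bool = False,  # cloud stays off unless the user opts in
) -> tuple[str, str]:
    """Return (text, engine). Local runs first; cloud is used only when permitted."""
    try:
        return local(audio), "local"
    except RuntimeError:
        if allow_cloud and cloud is not None:
            return cloud(audio), "cloud"
        raise  # cloud disabled: surface the local failure instead of uploading audio


def working_local(audio: bytes) -> str:
    return "hello from the local model"


text, engine = transcribe(b"audio", local=working_local)
print(engine, "->", text)
```

The design choice worth checking in a real app is the default: with `allow_cloud` off, a local failure becomes an error you can see rather than a silent upload.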

FAQ

Does local dictation work as well as cloud dictation?

Yes, for most everyday dictation. OpenAI’s open-source Whisper large-v3 achieves 7.44% word error rate across real-world test sets (arxiv.org/html/2510.06961v1, Oct 2025), competitive with cloud services. In controlled conditions, accuracy is nearly identical. The difference matters most with heavy background noise or non-native accents, where cloud services may still edge ahead.

Will local dictation drain my Mac’s battery?

Usually no more than cloud. On-device inference does use the GPU, but on Apple Silicon utilization stays around 15–25% CPU and 60–80% GPU, and only while a transcription is running. Cloud dictation keeps the network radio active for every utterance, which also costs power on laptops. Local is typically equal or better on battery life.

How much privacy do I actually get with local dictation?

Complete privacy on transcription. Your audio never leaves your Mac. Cloud services typically save audio for a retention period and may use it for model training unless you opt out, so check your provider’s terms. Local transcription means no audio transmission, no logging, no retention.

Can I switch from cloud dictation to local?

Yes. If you have been using macOS’s built-in dictation or a cloud app, switching to local takes minutes: install an app, download the model on first launch, and set it as your default. No data is lost, though you may need to re-enter your personal dictionary and hotkey preferences in the new app.

When does cloud dictation still make sense?

Cloud is better when you need cross-platform support (iOS, Android, Windows), zero local disk space, or guaranteed access to the latest model. Cloud is also lower-maintenance: you do not need to manage model downloads or updates.

What local dictation app should I choose?

Look for one that uses Whisper large-v3 or newer, defaults to local transcription (not cloud), and offers optional cleanup without forcing a subscription. Cross-check accuracy on your accent and domain (medical, legal, tech), and verify it runs on your Mac’s chip.


If you are ready to try local dictation, Saydrop offers a 14-day free trial with local transcription via mlx-whisper large-v3 and default-on local cleanup (via Gemma). After the trial, a one-time CHF 39 license gives you lifetime use on macOS 14+. Learn more about pricing.