Wiring iOS 26's SpeechAnalyzer to a Live Microphone: What the Docs Don't Tell You
TL;DR. iOS 26 replaces SFSpeechRecognizer with SpeechAnalyzer + composable modules. This is a practitioner's walkthrough of wiring SpeechTranscriber to a live AVAudioEngine microphone, with copy-pasteable SwiftUI code and the things the documentation doesn't warn you about: you must convert the audio buffer, the language model downloads on first use (so the first offline run fails), volatile vs. final results need different handling, there is no custom vocabulary, and SpeechAnalyzer isn't on watchOS. The full sample is on GitHub (simplememofast/ios26-speechanalyzer-live-mic, MIT) and it builds and runs on a shipping iOS 26 device.
Who this is for: developers moving from SFSpeechRecognizer, or anyone who tried to follow the WWDC sample and ended up with code that compiles but produces no text.
Why SpeechAnalyzer replaces SFSpeechRecognizer
The old SFSpeechRecognizer bundled session management, language handling, and recognition into one object. iOS 26's SpeechAnalyzer splits that apart: the analyzer is an orchestrator you attach modules to, and you compose only the modules you need. The result is optimized for longer, conversational audio, runs fully on-device for supported locales, and — unlike the old API — does not require the user to first enable dictation or Siri voice input in Settings.
The mental model: Analyzer + Modules
SpeechAnalyzer coordinates the pipeline. For speech-to-text you attach a SpeechTranscriber. A SpeechDetector module for voice-activity detection also exists, but most apps need only the transcriber. Audio goes in as AnalyzerInput values, and each module publishes its results as an AsyncSequence (transcriber.results). There is also a DictationTranscriber for short, keyboard-style dictation. The flow:
```
mic ─► AVAudioEngine.installTap ─► AVAudioConverter ─► AnalyzerInput
                                                            │
                                      SpeechAnalyzer([ SpeechTranscriber ])
                                                            │
                                 for try await result in transcriber.results
                                 result.text (AttributedString) / result.isFinal
```
Minimal working implementation
Permissions first. You need both microphone and speech-recognition authorization, plus NSMicrophoneUsageDescription and NSSpeechRecognitionUsageDescription in Info.plist. Then build the transcriber and analyzer:
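```swift
import AVFoundation
import Speech

// The statements below run inside an async throws method of your session type.
// Normalize the requested locale to one with an on-device model.
guard let locale = await SpeechTranscriber.supportedLocale(equivalentTo: .current) else {
    throw Failure.localeNotSupported
}
// .volatileResults streams partial text WHILE the user is still speaking.
let transcriber = SpeechTranscriber(
    locale: locale,
    transcriptionOptions: [],
    reportingOptions: [.volatileResults],
    attributeOptions: [] // add .audioTimeRange for per-word timing
)
let analyzer = SpeechAnalyzer(modules: [transcriber])
// bestAvailableAudioFormat(compatibleWith:) returns an optional AVAudioFormat.
guard let analyzerFormat = await SpeechAnalyzer.bestAvailableAudioFormat(compatibleWith: [transcriber]) else {
    throw Failure.noCompatibleAudioFormat // hypothetical case; name it to fit your own error type
}
```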
Subscribe to results, then feed an input stream and start the analyzer:
```swift
// Results loop: volatile -> replace the tail; final -> append + clear the tail.
resultsTask = Task {
    for try await result in transcriber.results {
        let piece = String(result.text.characters) // result.text is AttributedString
        if result.isFinal {
            finalizedText += piece
            volatileText = ""
        } else {
            volatileText = piece
        }
    }
}
let (sequence, builder) = AsyncStream<AnalyzerInput>.makeStream()
self.inputBuilder = builder
try await analyzer.start(inputSequence: sequence)
```
Now the microphone tap. This is where most people lose an afternoon:
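```swift
// iOS: the default audio session category is playback-only; use a record-capable one.
let session = AVAudioSession.sharedInstance()
try session.setCategory(.record, mode: .measurement)
try session.setActive(true, options: .notifyOthersOnDeactivation)

let converter = AudioBufferConverter() // capture locals; never touch self in the tap
let input = audioEngine.inputNode
let micFormat = input.outputFormat(forBus: 0)
input.installTap(onBus: 0, bufferSize: 4096, format: micFormat) { buffer, _ in
    // Convert the mic buffer to the format SpeechAnalyzer asked for, THEN yield.
    guard let converted = try? converter.convert(buffer, to: analyzerFormat) else { return }
    builder.yield(AnalyzerInput(buffer: converted))
}
audioEngine.prepare()
try audioEngine.start()
```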
The GitHub repo has the complete, compiling versions of SpeechSession, the AudioBufferConverter, and a SwiftUI view that renders the two-tone live caption.
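The snippets above skip the permission step from the start of this section. Here is a minimal sketch of it. It assumes an iOS 17+ target, since it uses AVAudioApplication, and requestSpeechPermissions is a hypothetical name. It asks for microphone access, then wraps the callback-based speech-recognition authorization in a continuation:

```swift
import AVFoundation
import Speech

/// Hypothetical helper: asks for microphone and speech-recognition access.
/// Returns true only if both were granted.
func requestSpeechPermissions() async -> Bool {
    // Microphone (iOS 17+). Shows the NSMicrophoneUsageDescription prompt once.
    let micGranted = await AVAudioApplication.requestRecordPermission()

    // Speech recognition: bridge the completion-handler API to async/await.
    let speechStatus: SFSpeechRecognizerAuthorizationStatus = await withCheckedContinuation { continuation in
        SFSpeechRecognizer.requestAuthorization { status in
            continuation.resume(returning: status)
        }
    }
    return micGranted && speechStatus == .authorized
}
```

Call it before audioEngine.start(). If it returns false, stop there and point the user to Settings rather than starting a tap that will never deliver audio.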
Getting the model on-device (the download trap)
Transcription is on-device, but the language model is a system-shared asset that may not be installed yet. Check SpeechTranscriber.installedLocales; if your locale is missing, request and install it. The shared asset does not count against your app's bundle size — but a first run with no network can't download it, so handle that state explicitly instead of failing silently.
```swift
let installed = await Set(SpeechTranscriber.installedLocales.map { $0.identifier(.bcp47) })
if !installed.contains(locale.identifier(.bcp47)) {
    if let request = try await AssetInventory.assetInstallationRequest(supporting: [transcriber]) {
        try await request.downloadAndInstall() // request has a .progress you can show
    }
}
```
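One way to surface that .progress and handle the no-network first run explicitly is sketched below. It observes Progress.fractionCompleted through Foundation's key-value observing. ModelInstallState and ensureModelInstalled are hypothetical names. The sketch does not detect "offline" as a special case. It maps any thrown error to a failed state so the UI can offer a retry:

```swift
import Foundation
import Speech

/// Hypothetical UI state for the one-time model download.
enum ModelInstallState: Sendable {
    case ready
    case downloading(fraction: Double)
    case failed(String) // e.g. first launch with no network
}

/// Sketch: make sure the transcriber's model is installed, reporting progress.
func ensureModelInstalled(
    for transcriber: SpeechTranscriber,
    onUpdate: @escaping @Sendable (ModelInstallState) -> Void
) async {
    do {
        // A nil request means there is nothing to download for these modules.
        guard let request = try await AssetInventory.assetInstallationRequest(
            supporting: [transcriber]
        ) else {
            onUpdate(.ready)
            return
        }
        // KVO on the request's Progress; the observation ends when this function returns.
        let observation = request.progress.observe(\.fractionCompleted, options: [.initial]) { progress, _ in
            onUpdate(.downloading(fraction: progress.fractionCompleted))
        }
        defer { observation.invalidate() }
        try await request.downloadAndInstall()
        onUpdate(.ready)
    } catch {
        onUpdate(.failed(error.localizedDescription))
    }
}
```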
Volatile vs. finalized results (the UX hinge)
With reportingOptions: [.volatileResults] you get fast partial results while the user is still speaking; result.isFinal marks committed text. The idiomatic UI shows volatile text dimmed and replaces it when a final arrives, persisting only finals. Because result.text is an AttributedString, adding attributeOptions: [.audioTimeRange] gives you per-word time ranges for highlighting or seeking back into the audio.
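The replace-the-tail logic from the results loop can move into a small value type, which makes it easy to unit test and render. The sketch below uses a hypothetical LiveTranscript type. It works on the plain Strings that String(result.text.characters) produces and leaves styling, such as dimming the volatile part, to the view:

```swift
/// Hypothetical model for a two-tone live caption.
struct LiveTranscript: Equatable {
    /// Committed text; the only part worth persisting.
    private(set) var finalized = ""
    /// Latest partial guess; replaced wholesale by the next result.
    private(set) var volatile = ""

    /// Apply one result from transcriber.results.
    mutating func apply(_ text: String, isFinal: Bool) {
        if isFinal {
            finalized += text
            volatile = ""
        } else {
            volatile = text
        }
    }

    /// Everything the caption shows right now: finalized text plus the volatile tail.
    var display: String { finalized + volatile }
}
```

In the results loop, each iteration becomes transcript.apply(String(result.text.characters), isFinal: result.isFinal), and only transcript.finalized gets saved.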
The gotchas (the part the docs skip)
1. You must convert the audio buffer
AVAudioEngine's input node format (often 48 kHz, hardware-dependent) usually does not match SpeechAnalyzer.bestAvailableAudioFormat(compatibleWith:). Feed a mismatched buffer and you get a clean compile and zero transcription. Run every buffer through AVAudioConverter first. This is the single most common reason "it builds but nothing happens."
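The repo's AudioBufferConverter isn't reproduced here. As an illustration only, a converter with the same convert(_:to:) shape can be built on AVAudioConverter. The class body below is a sketch, not the repo's code. It caches one AVAudioConverter per format pair and sizes the output buffer by the sample-rate ratio. It hands the single input buffer over exactly once and then reports "no data now", so the converter can be reused across tap callbacks:

```swift
import AVFoundation

/// Sketch of a mic-buffer converter (illustrative; not the repo's implementation).
final class AudioBufferConverter {
    enum ConversionError: Error {
        case cannotCreateConverter
        case cannotAllocateBuffer
        case failed(NSError?)
    }

    private var converter: AVAudioConverter?

    func convert(_ buffer: AVAudioPCMBuffer, to format: AVAudioFormat) throws -> AVAudioPCMBuffer {
        // Nothing to do if the mic already delivers the target format.
        if buffer.format == format { return buffer }

        // Rebuild the converter only when the input or output format changes.
        if converter?.inputFormat != buffer.format || converter?.outputFormat != format {
            converter = AVAudioConverter(from: buffer.format, to: format)
            converter?.primeMethod = .none // no priming frames for live audio
        }
        guard let converter else { throw ConversionError.cannotCreateConverter }

        // Output capacity scales with the sample-rate ratio (48 kHz -> 16 kHz is 1/3).
        let ratio = format.sampleRate / buffer.format.sampleRate
        let capacity = AVAudioFrameCount((Double(buffer.frameLength) * ratio).rounded(.up))
        guard let output = AVAudioPCMBuffer(pcmFormat: format, frameCapacity: capacity) else {
            throw ConversionError.cannotAllocateBuffer
        }

        var error: NSError?
        var consumed = false
        let status = converter.convert(to: output, error: &error) { _, inputStatus in
            // Supply the one input buffer, then say "no data for now" (not end of stream).
            if consumed {
                inputStatus.pointee = .noDataNow
                return nil
            }
            consumed = true
            inputStatus.pointee = .haveData
            return buffer
        }
        guard status != .error else { throw ConversionError.failed(error) }
        return output
    }
}
```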
2. Latency: what I measured on shipping iOS 26
The most-cited SpeechAnalyzer latency figure comes from a WWDC25-era developer-forum report: ~14s+ to the first result on an iPhone 16 Pro running iOS 26.0 beta with Xcode beta 5, even after AssetInventory.allocate(locale:) and preheating. If that were representative, the API would be unusable for live capture.
On shipping iOS 26.5 I don't see that. On an iPhone 16e — the binned, non-Pro A18 (6-core CPU, 4-core GPU), i.e. the least powerful A18 device — time to the first volatile result is ~0.3–0.5s, with the model already installed and the locale allocated (a warm start). First-ever launch is different: it includes a one-time model-asset download, so budget for that path separately and show a progress UI.
This is a first-party measurement, not a controlled head-to-head: different device, shipping OS vs beta, timing time-to-first-volatile-result. But the direction is clear, and there's a reason a non-Pro device keeps up — on-device transcription runs primarily on the Neural Engine, which is the same 16-core unit across the whole A18 family (the 16e only gives up one GPU core, which speech doesn't lean on). The likely takeaway: the beta-era latency was a preheat/config/beta issue, not a hardware limit. Measure it on your own device and publish the device + OS + metric alongside the number.
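Published numbers are only comparable if every run is measured the same way. A minimal sketch for that, using a hypothetical LatencyProbe type: create it just before analyzer.start(inputSequence:) and call record(isFinal:) inside the results loop. It keeps only the first volatile and first final timings and measures them with Swift's ContinuousClock:

```swift
/// Hypothetical probe for time-to-first-result. Create it just before
/// analyzer.start(inputSequence:), then call record(isFinal:) for each result.
struct LatencyProbe {
    let start: ContinuousClock.Instant
    private(set) var firstVolatile: Duration?
    private(set) var firstFinal: Duration?

    init(start: ContinuousClock.Instant = ContinuousClock.now) {
        self.start = start
    }

    /// Stores the elapsed time the first time each kind of result arrives.
    mutating func record(isFinal: Bool, at now: ContinuousClock.Instant = ContinuousClock.now) {
        let elapsed = now - start
        if isFinal {
            if firstFinal == nil { firstFinal = elapsed }
        } else if firstVolatile == nil {
            firstVolatile = elapsed
        }
    }
}
```

Log probe.firstVolatile next to the device model, the OS version, and whether it was a warm or first-ever start.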
3. There is no Custom Vocabulary
The old API let you set contextualStrings on a recognition request (SFSpeechRecognitionRequest.contextualStrings) to bias recognition toward known terms. SpeechAnalyzer, as of iOS 26.0, exposes no equivalent. If your domain is full of proper nouns or jargon, budget for that gap now.
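One stopgap while there is no vocabulary biasing is to post-process finalized text for the handful of terms you know the model mishears. This is not equivalent to contextualStrings. It can only repair consistent, predictable errors after the fact. The function name and the correction table below are hypothetical:

```swift
import Foundation

/// Hypothetical post-pass: rewrites known mis-hearings in finalized text.
/// Matches whole words, case-insensitively.
func applyTermCorrections(_ text: String, _ corrections: [String: String]) -> String {
    // Longer phrases first, then alphabetical, so the order is deterministic.
    let ordered = corrections.sorted {
        $0.key.count != $1.key.count ? $0.key.count > $1.key.count : $0.key < $1.key
    }
    var result = text
    for (heard, canonical) in ordered {
        let pattern = "\\b" + NSRegularExpression.escapedPattern(for: heard) + "\\b"
        result = result.replacingOccurrences(
            of: pattern,
            with: NSRegularExpression.escapedTemplate(for: canonical),
            options: [.regularExpression, .caseInsensitive]
        )
    }
    return result
}
```

Run it on finalized text only. Volatile text is replaced a moment later anyway.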
4. watchOS: SpeechAnalyzer isn't there — but voice input still is
SpeechAnalyzer ships on iOS, iPadOS, macOS, visionOS and tvOS 26 — not watchOS. That does not mean "no voice on the Watch." You fall back to the watchOS system dictation UI, which hands you back finished text. You lose the SpeechAnalyzer pipeline (volatile results, time ranges, your own audio tap), but voice capture works:
```swift
// watchOS: no SpeechAnalyzer — the system handles dictation and returns text.
TextFieldLink(prompt: Text("Speak or type")) {
    Image(systemName: "mic.fill")
} onSubmit: { text in
    send(text)
}
```
This is the split we actually ship: SpeechAnalyzer on iPhone, TextFieldLink dictation on the Watch. Most write-ups get this wrong in one of two directions — claiming "no voice on watchOS" or implying SpeechAnalyzer runs everywhere. Neither is true.
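In a multiplatform SwiftUI target, that split can sit behind a compile-time check. The sketch below uses a hypothetical VoiceInputButton view. The non-watchOS branch is a placeholder for wherever your SpeechAnalyzer-backed capture view goes:

```swift
import SwiftUI

/// Hypothetical voice-input entry point shared by the iPhone and Watch targets.
struct VoiceInputButton: View {
    let send: (String) -> Void

    var body: some View {
        #if os(watchOS)
        // Watch: the system dictation UI returns finished text.
        TextFieldLink(prompt: Text("Speak or type")) {
            Image(systemName: "mic.fill")
        } onSubmit: { text in
            send(text)
        }
        #else
        // iPhone: placeholder for the SpeechAnalyzer live-caption view.
        Label("Live caption (SpeechAnalyzer)", systemImage: "waveform")
        #endif
    }
}
```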
5. Gate on availability, and on Swift 6 concurrency
SpeechAnalyzer / SpeechTranscriber are iOS 26+. If your app deploys lower, mark the speech types @available(iOS 26.0, *) and gate the entry point with if #available(iOS 26.0, *). Separately: the mic tap closure runs on a real-time audio thread, so under Swift 6 complete strict concurrency, capture only locals (the continuation, the target format, a fresh converter) and never touch a @MainActor object inside the tap. Done that way it compiles without @unchecked Sendable escape hatches.
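Here is a sketch of the availability gate at the view level. LiveCaptionView and CaptureScreen are hypothetical names. The iOS 26-only view would hold the SpeechAnalyzer code, and the fallback is a plain text field where the keyboard's own dictation still works on older systems:

```swift
import SwiftUI

/// Hypothetical view that would use SpeechAnalyzer internally; iOS 26+ only.
@available(iOS 26.0, *)
struct LiveCaptionView: View {
    var body: some View {
        Text("SpeechAnalyzer live caption goes here")
    }
}

/// Entry point that still compiles for older deployment targets.
struct CaptureScreen: View {
    @State private var draft = ""

    var body: some View {
        if #available(iOS 26.0, *) {
            LiveCaptionView()
        } else {
            // Pre-iOS 26 fallback: keyboard dictation into an ordinary text field.
            TextField("Type or dictate", text: $draft)
        }
    }
}
```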
Migration checklist from SFSpeechRecognizer
| SFSpeechRecognizer (old) | SpeechAnalyzer (iOS 26) |
|---|---|
| One object does session + recognition | SpeechAnalyzer + composable modules |
| request.append(_:) with audio buffers | AnalyzerInput(buffer:) yielded into an AsyncStream |
| shouldReportPartialResults | reportingOptions: [.volatileResults] |
| recognitionTask(with:resultHandler:) or a delegate | for try await result in transcriber.results |
| result.bestTranscription.formattedString | String(result.text.characters) (text is AttributedString) |
| contextualStrings (custom vocab) | no equivalent |
| user enables dictation in Settings | not required |
| works on watchOS | not on watchOS (use system dictation) |
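For reference while migrating, here is roughly the old side of the table as a minimal live SFSpeechRecognizer setup. LegacyLiveRecognizer is a hypothetical wrapper name, and "Obsidian" is a placeholder term. The mic tap is omitted. In the old API, it calls request.append(buffer), and you call request.endAudio() when done:

```swift
import Speech

/// Hypothetical wrapper around the pre-iOS 26 API, for side-by-side comparison.
final class LegacyLiveRecognizer {
    private let recognizer: SFSpeechRecognizer?
    let request = SFSpeechAudioBufferRecognitionRequest()
    private var task: SFSpeechRecognitionTask?

    init(locale: Locale) {
        recognizer = SFSpeechRecognizer(locale: locale) // nil for unsupported locales
    }

    func start(onText: @escaping (String, Bool) -> Void) {
        guard let recognizer, recognizer.isAvailable else { return }
        request.shouldReportPartialResults = true  // new API: reportingOptions: [.volatileResults]
        request.contextualStrings = ["Obsidian"]   // new API: no equivalent
        // New API: for try await result in transcriber.results
        task = recognizer.recognitionTask(with: request) { result, _ in
            guard let result else { return }
            onText(result.bestTranscription.formattedString, result.isFinal)
        }
    }
}
```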
Where this ships
This exact pipeline (minus the email/send parts) powers the on-device voice capture in Simple Memo, a fast iPhone memo app that emails notes to yourself and auto-appends them to Obsidian, with no per-use API cost. The repo strips it down to the speech essentials so you can drop it into your own project.
FAQ
Is SpeechAnalyzer on-device? Does it work offline?
Transcription runs fully on-device. The language model downloads on first use, so the first run needs a network connection; after that it works offline.
Does it work on Apple Watch?
No — SpeechAnalyzer is iOS / iPadOS / macOS / visionOS / tvOS 26, not watchOS. Use the system dictation UI (TextFieldLink) on the Watch.
SpeechAnalyzer vs SFSpeechRecognizer — which should I use?
SpeechAnalyzer for new iOS 26+ work. Keep SFSpeechRecognizer only for older OS support or if you need custom vocabulary.
Why is my first transcription slow?
Early beta builds were slow (the 14s+ reports were on iOS 26.0 beta). On shipping iOS 26.5, an iPhone 16e (the non-Pro A18) reaches the first volatile result in ~0.3–0.5s on a warm start (model installed, locale allocated). Pre-install the model and start the analyzer early. Budget separately for the first-ever launch, which downloads the model once.
Can I add custom vocabulary?
Not as of iOS 26.0 — there is no contextualStrings equivalent.
Sample code: github.com/simplememofast/ios26-speechanalyzer-live-mic (MIT). Corrections welcome via PR. Related: offline-first memo architecture.