What model for voice-to-text?
What model does Voiceflow use in the background for voice transcription? I have a scenario where a caller has to provide their ZIP code, but the transcription really struggles with this, so I was wondering how I could try out different models specifically for voice recognition.
Hi! Voiceflow doesn’t lock you into a single speech-to-text engine – you can pick the model that does best for your use case.
Current real-time STT options (Voice Input settings > Provider) are:
• Cartesia “Ink-Whisper” – an OpenAI Whisper–based model tuned for live calls and many languages.
• AssemblyAI “Universal Streaming”.
• Deepgram “Nova” (Nova-2 / Nova-3 and Nova-3 Medical).
New voice projects start on Cartesia Ink-Whisper, but you can switch providers at any time and adjust parameters like confidence threshold, endpointing, punctuation timeout, etc.
If the engine is mis-hearing numbers such as ZIP codes, try:
1. Switching providers – Deepgram Nova-3 is often strong on numeric strings; AssemblyAI Universal adds turn-detection knobs you can loosen for clearer chunks.
2. Adding the expected 5-digit pattern as a keyword (Voice Input ▶ Keywords) to bias the model toward digit output.
3. Prompting callers to speak the digits separately (“Please say each digit one at a time”).
4. Increasing “On-no-punctuation” / “Minimum turn silence” so the model waits for the full 5-digit utterance before finalising.
You’ll find the provider dropdown and all these tuning knobs in your project’s Voice Input panel (Settings → Voice Input). Try a quick A/B test between models to see which gives the cleanest numeric transcription for your callers.
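To make the A/B test concrete, you can score each provider on the fraction of test-call transcripts that contain a clean 5-digit run. A rough sketch, assuming you collect transcripts per provider yourself (the provider names and sample strings below are hypothetical):

```python
import re

def valid_zip_rate(transcripts: list[str]) -> float:
    """Fraction of transcripts containing exactly one contiguous 5-digit run."""
    hits = sum(1 for t in transcripts if re.fullmatch(r"\D*\d{5}\D*", t))
    return hits / len(transcripts) if transcripts else 0.0

# Hypothetical transcripts gathered from test calls against each provider.
results = {
    "deepgram-nova-3": ["90210", "my zip is 10001", "1 0 0 0 1"],
    "assemblyai-universal": ["nine oh two one oh", "10001"],
}
for provider, transcripts in results.items():
    print(provider, f"{valid_zip_rate(transcripts):.0%}")
```

Run the same set of scripted test calls through each provider and pick whichever scores highest on your real callers' audio.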
Docs: https://docs.voiceflow.com → Voice Input (Speech-to-Text settings)
Hope that helps, and happy testing!