rudisp
3mo ago

what model for voice to text

What model does Voiceflow use in the background for voice transcription? I have a scenario where a caller has to provide their ZIP code, but the transcription really struggles with this, so I was wondering how I could try out different models specifically for the voice recognition.
1 Reply
Tico | Voiceflow Assistant
Hi! Voiceflow doesn’t lock you into a single speech-to-text engine – you can pick the model that does best for your use case. Current real-time STT options (Voice Input settings > Provider) are:

• Cartesia “Ink-Whisper” – an OpenAI Whisper–based model tuned for live calls and many languages.
• AssemblyAI “Universal Streaming”.
• Deepgram “Nova” (Nova-2 / Nova-3 and Nova-3 Medical).

New voice projects start on Cartesia Ink-Whisper, but you can switch at any time and even fine-tune parameters like confidence threshold, endpointing, punctuation timeout, etc.

If the engine is mis-hearing numbers such as ZIP codes, try:

1. Switching providers – Deepgram Nova-3 is often strong on numeric strings; AssemblyAI Universal Streaming adds turn-detection knobs you can loosen for clearer chunks.
2. Adding the expected 5-digit pattern as a keyword (Voice Input ▶ Keywords) to bias the model toward digit output.
3. Prompting callers to speak the digits separately (“Please say each digit one at a time”).
4. Increasing “On no punctuation” / “Minimum turn silence” so the model waits for the full 5-digit utterance before finalising.

You’ll find the provider dropdown and all these tuning knobs in your project’s Voice Input panel (Settings → Voice Input). Try a quick A/B test between models to see which gives the cleanest numeric transcription for your callers.

Docs: https://docs.voiceflow.com → Voice Input (Speech-to-Text settings)

Hope that helps, and happy testing!
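On top of the provider-side tuning, it can also help to normalize whatever the STT engine returns before you validate it. Below is a minimal, generic post-processing sketch in Python – it is not part of Voiceflow or any STT provider’s API, and the function name `normalize_zip` and the digit-word map are purely illustrative. It collapses a transcribed utterance (spoken digit words, spaced digits, or a plain number) into a 5-digit ZIP, so you can re-prompt the caller when the result doesn’t parse.

```python
import re
from typing import Optional

# Map common spoken-digit words (as they tend to appear in STT output) to digits.
# "oh" / "o" are included because callers often say them instead of "zero".
DIGIT_WORDS = {
    "zero": "0", "oh": "0", "o": "0",
    "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}

def normalize_zip(transcript: str) -> Optional[str]:
    """Collapse a transcribed utterance into a 5-digit ZIP code, or return None."""
    digits = []
    # Tokenize into words and single digits; lowercase so "Nine" matches the map.
    for token in re.findall(r"[a-z]+|\d", transcript.lower()):
        if token.isdigit():
            digits.append(token)
        elif token in DIGIT_WORDS:
            digits.append(DIGIT_WORDS[token])
        # Anything else ("my", "zip", "is", "um", ...) is treated as filler and skipped.
    zip_code = "".join(digits)
    return zip_code if re.fullmatch(r"\d{5}", zip_code) else None

# Examples of messy transcriptions STT engines commonly produce:
print(normalize_zip("nine oh two one zero"))    # -> "90210"
print(normalize_zip("My ZIP is 9 0 2 1 0"))     # -> "90210"
print(normalize_zip("ninety something"))        # -> None, so re-prompt the caller
```

You could run something like this in a function or code step after capturing the caller’s response; if it returns None, loop back and ask for the ZIP one digit at a time.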