Best Speech-To-Text Open Source Models for Startups
Whisper vs. Voxtral overview
Speech-to-text and text-to-speech AI models are a hot topic. Companies providing AI models through APIs for both use cases are growing quickly. Still, there are surprisingly few open source models for speech-to-text despite obvious demand and use cases.
Startups such as Granola, WisperFlow, and SpeakApp are building both consumer and B2B tools for meeting recordings, voice notes, lectures and study material, controlling your computer by voice, and many other use cases.
But until recently, there were very few open source speech-to-text (STT) models good enough to run in production and build consumer-facing products on. Whisper by OpenAI was the main go-to option. The first version was released in 2022 (before even the first version of ChatGPT), and the latest update was in 2023. So not much has changed in open source STT in recent years, despite the immense funding going into AI models and startups.
It’s really good news that there is a new open source STT model from one of the big AI research companies, and the only European one among them. Mistral AI has released its own STT model, named Voxtral. The model is available both via API and as open source.
Mistral claims that Voxtral beats Whisper in some use cases and is on par with ElevenLabs’ proprietary Scribe model. Here are the benchmarks from their website.
Through the API, Voxtral costs exactly half as much as Whisper does through OpenAI’s API: 0.3 cents per minute vs. 0.6 cents per minute for Whisper. So if Voxtral’s performance is indeed comparable or even superior to Whisper’s, it’s a no-brainer for companies building something around STT to start using Voxtral.
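To make the price gap concrete, here is a rough back-of-the-envelope sketch. It uses the two per-minute prices quoted above; the monthly volume of 50,000 minutes and the helper name monthly_cost_usd are my own assumptions for illustration, so plug in your own numbers and check current pricing pages before relying on it.

```python
# Per-minute API prices as quoted in this post, in US cents.
VOXTRAL_CENTS_PER_MIN = 0.3
WHISPER_CENTS_PER_MIN = 0.6


def monthly_cost_usd(minutes: float, cents_per_minute: float) -> float:
    """Return the cost in US dollars of transcribing `minutes` of audio."""
    return minutes * cents_per_minute / 100


# Hypothetical workload (an assumption, not a figure from either vendor).
minutes_per_month = 50_000

voxtral = monthly_cost_usd(minutes_per_month, VOXTRAL_CENTS_PER_MIN)
whisper = monthly_cost_usd(minutes_per_month, WHISPER_CENTS_PER_MIN)
savings = whisper - voxtral

print(f"Voxtral: ${voxtral:,.2f} per month")
print(f"Whisper: ${whisper:,.2f} per month")
print(f"Difference: ${savings:,.2f} per month")
```

Because both prices scale linearly with audio minutes, the ratio stays at 2x at any volume; only the absolute savings grow with usage.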
If you are building something in this space, please share in the comments which models you prefer and for which use cases. I am always happy to learn more, as I am building in this space myself.


