Enterprise AI company Cohere recently released an open-source version of Cohere Transcribe, an AI model that can generate text from audio in real time—meant to be useful for meetings, note-taking, and even capturing audio over the whir of a blender, according to co-founder Nick Frosst.
The company’s first foray into speech-to-text transcription currently sits atop open-source AI platform Hugging Face’s leaderboard for speech recognition models, which ranks models based on latency (speed), accuracy, and multilingual performance.
BetaKit spoke with Cohere senior product manager Cassie Cao, who helms the multimodal division, about building the model from scratch.
The following interview has been edited for length and clarity.
Why did Cohere decide to build a speech-to-text model for large companies?
Organizations process massive volumes of unstructured audio data day to day, so we wanted to build a foundation for enterprise speech intelligence that really solves customer needs. We built this model from scratch with real production use cases in mind, to serve the real-world challenges enterprises face in speech transcription.
Does this mean the future of work is people speaking to their computers, and then the computer completing their tasks?
Voice is definitely a big modality for enterprise workflows. It’s a very popular interface. So we wanted to meet what customers want.
There are a few things they’re looking for. One is real-world accuracy. We’re really focused on optimizing accuracy and minimizing word error rate. We’re trying to build a model that is robust across real-world conditions, such as multi-speaker environments and diverse accents.
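Word error rate, the accuracy metric mentioned above, is conventionally computed as the word-level edit distance between a reference transcript and the model’s output, divided by the reference length. A minimal sketch (the example sentences are illustrative, not from Cohere’s evaluations):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count,
    computed via word-level edit distance (Levenshtein)."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# One dropped word out of a six-word reference gives a WER of 1/6.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```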
Of course, speed in real production is an important factor. So we are delivering best-in-class throughput with a high RTFx (inverse real-time factor)—a metric that measures how many seconds of audio a system can process per second of computing time.
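The RTFx metric described above reduces to a simple ratio: audio duration divided by compute time, where values above 1 mean faster than real time. A brief illustration (the numbers are made up, not Cohere’s benchmarks):

```python
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio transcribed
    per second of computing time."""
    return audio_seconds / processing_seconds

# Transcribing a 60-second clip in 0.5 seconds of compute gives an RTFx of 120,
# i.e. 120x faster than real time.
print(rtfx(60.0, 0.5))
```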
How does Cohere’s model compare to a meeting-notes transcription platform like Granola?
Meeting transcription and note taking is definitely one of the target use cases for Cohere Transcribe.
The platform you just mentioned is more model-agnostic; they don’t build their own models from scratch. I think there is a world in which we and a meeting-platform company could potentially collaborate in that sense.
What kind of data did you use to train the model? How is it different from training a text-based LLM?
I probably cannot comment too much on the training data. It is a 2B-parameter, Conformer-based encoder-decoder architecture. [Editor’s note: If you want to dive more into what those words mean, Cohere has more information on how the model works here.]
When we trained it, we had a deliberate focus on minimizing word error rate while keeping production readiness top of mind and accounting for real-world noise. All of these priorities and strategies are reflected in how we chose the model architecture, data mix, and evaluation suite, so that we can deliver optimized accuracy and speed under messy, real-world conditions.
What’s the future of this model? Are you already working on another edition?
We are working on enriching the feature set. We are also very excited to bring this model into North—Cohere’s flagship enterprise workplace AI agent platform—with the integration there too. It’s something we’re actively working towards, even though we’re not putting exact dates on it.
[Editor’s note: After BetaKit originally spoke with Cohere, the company confirmed it has brought that North platform to Innovation, Science and Economic Development (ISED) Canada, the company’s “first major expansion of secure AI infrastructure into civilian government operations.”]

Feature image courtesy CoWomen via Unsplash.
