@Brendan
I apologize for any confusion. Let me clarify: OpenAI has developed models that can understand and generate human language. However, these models are designed to work with text data, not raw audio. Therefore, when you want to interact with OpenAI's models using voice, your speech needs to be converted into text first. This is where a speech-to-text (STT) service like Deepgram comes into play.
The STT service listens to your voice input, converts it into text, and then that text is fed into OpenAI's models for processing. The models can then generate a response based on the text input, which can subsequently be converted back into speech using a text-to-speech (TTS) service if needed.
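To make the flow concrete, here is a minimal sketch of that round trip in Python. It is not a verified recipe: it assumes you have DEEPGRAM_API_KEY and OPENAI_API_KEY set in your environment, and the Deepgram endpoint/response shape, the model names, and the file-writing helper are assumptions that may differ from your setup or SDK version.

```python
# Sketch of the voice round trip: STT (Deepgram) -> LLM (OpenAI) -> TTS (OpenAI).
# Assumes DEEPGRAM_API_KEY and OPENAI_API_KEY are set in the environment.
import os
import requests
from openai import OpenAI

def transcribe(audio_path: str) -> str:
    """Send a recorded audio file to Deepgram's prerecorded endpoint and return the transcript."""
    with open(audio_path, "rb") as f:
        resp = requests.post(
            "https://api.deepgram.com/v1/listen",
            headers={
                "Authorization": f"Token {os.environ['DEEPGRAM_API_KEY']}",
                "Content-Type": "audio/wav",
            },
            data=f.read(),
        )
    resp.raise_for_status()
    return resp.json()["results"]["channels"][0]["alternatives"][0]["transcript"]

def respond(prompt: str) -> str:
    """Feed the transcript to an OpenAI chat model and return its text reply."""
    client = OpenAI()
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name; use whichever chat model you have access to
        messages=[{"role": "user", "content": prompt}],
    )
    return completion.choices[0].message.content

def speak(text: str, out_path: str = "reply.mp3") -> str:
    """Turn the text reply back into audio with OpenAI's TTS endpoint."""
    client = OpenAI()
    speech = client.audio.speech.create(model="tts-1", voice="alloy", input=text)
    speech.write_to_file(out_path)  # exact file-writing helper may vary by SDK version
    return out_path

if __name__ == "__main__":
    transcript = transcribe("question.wav")  # STT: audio -> text
    answer = respond(transcript)             # LLM: text -> text
    audio_file = speak(answer)               # TTS: text -> audio
    print(transcript, answer, audio_file, sep="\n")
```

Each of the three steps is its own network call, which is where the extra latency and the separate STT/TTS charges come from.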
So, in summary, OpenAI's models do not natively process speech; the audio has to be converted to text first. That is why you still need an STT service, which adds some latency to each turn and incurs its own usage costs. If you have any more questions or need further assistance, feel free to ask!