@Brendan
I apologize for any confusion. Let me clarify: OpenAI has developed models that can understand and generate human language. However, these models are designed to work with text data, not raw audio. Therefore, when you want to interact with OpenAI's models using voice, your speech needs to be converted into text first. This is where a speech-to-text (STT) service like Deepgram comes into play.
The STT service listens to your voice input, converts it into text, and then that text is fed into OpenAI's models for processing. The models can then generate a response based on the text input, which can subsequently be converted back into speech using a text-to-speech (TTS) service if needed.
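To make the flow concrete, here is a minimal sketch of that round trip in Python. It is not a verified recipe: it assumes you have DEEPGRAM_API_KEY and OPENAI_API_KEY set in your environment, and the Deepgram endpoint/response shape, the model names, and the file-writing helper are assumptions that may differ from your setup or SDK version.

```python
# Sketch of the voice round trip: STT (Deepgram) -> LLM (OpenAI) -> TTS (OpenAI).
# Assumes DEEPGRAM_API_KEY and OPENAI_API_KEY are set in the environment.
import os
import requests
from openai import OpenAI

def transcribe(audio_path: str) -> str:
    """Send a recorded audio file to Deepgram's prerecorded endpoint and return the transcript."""
    with open(audio_path, "rb") as f:
        resp = requests.post(
            "https://api.deepgram.com/v1/listen",
            headers={
                "Authorization": f"Token {os.environ['DEEPGRAM_API_KEY']}",
                "Content-Type": "audio/wav",
            },
            data=f.read(),
        )
    resp.raise_for_status()
    return resp.json()["results"]["channels"][0]["alternatives"][0]["transcript"]

def respond(prompt: str) -> str:
    """Feed the transcript to an OpenAI chat model and return its text reply."""
    client = OpenAI()
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name; use whichever chat model you have access to
        messages=[{"role": "user", "content": prompt}],
    )
    return completion.choices[0].message.content

def speak(text: str, out_path: str = "reply.mp3") -> str:
    """Turn the text reply back into audio with OpenAI's TTS endpoint."""
    client = OpenAI()
    speech = client.audio.speech.create(model="tts-1", voice="alloy", input=text)
    speech.write_to_file(out_path)  # exact file-writing helper may vary by SDK version
    return out_path

if __name__ == "__main__":
    transcript = transcribe("question.wav")  # STT: audio -> text
    answer = respond(transcript)             # LLM: text -> text
    audio_file = speak(answer)               # TTS: text -> audio
    print(transcript, answer, audio_file, sep="\n")
```

Each of the three steps is its own network call, which is where the extra latency and the separate STT/TTS charges come from.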
So, in summary, OpenAI's models do not natively process speech; the audio has to be converted to text first. That is why you still need an STT service, which adds some latency to each turn and incurs its own usage costs. If you have any more questions or need further assistance, feel free to ask!