Support for Other STT Models
# support
j
I need to obtain cadastral information through conversations, but the nova-2 model consistently misspells the information. I tested my audio samples (which are not in English) with OpenAI Whisper and achieved much better results. I'm looking for a way to integrate OpenAI Whisper into VAPI, or a method to retrieve the raw audio, classify it to isolate the important information, and then pass it to Whisper using function calling.
v
@Jean Reinhold It's great to hear that you've been experimenting with different models to find the best fit for your transcription needs. VAPI currently supports Deepgram's Nova family of models and Talkscriber's Whisper model; OpenAI's Whisper model is not directly supported. However, you can retrieve the raw audio from your conversations and then transcribe it with OpenAI's Whisper in an external process. Here's a general approach you could take:

1. **Capture Raw Audio**: Ensure that your system captures and stores the raw audio from the conversations.
2. **Classify Audio**: Use an audio classification tool or model to identify the parts of the audio that contain cadastral information.
3. **Transcribe with Whisper**: Once you have the classified audio segments, transcribe them with an external script or service running OpenAI's Whisper model (see the sketch after this message).
4. **Integrate Transcriptions**: After obtaining the transcriptions, feed the text back into your system for further processing or storage.

Unfortunately, VAPI does not have a built-in function to pass audio to OpenAI's Whisper or to retrieve raw audio for external processing, so you would need to implement these steps outside of VAPI. If you need assistance integrating them into your workflow, or if you have any other questions, please feel free to ask for more support.
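For step 3, here is a minimal sketch of the external transcription step using the official `openai` Python package. The file path and the language hint are placeholders, not VAPI APIs:

```python
# Minimal sketch: transcribe a captured audio segment with OpenAI Whisper.
# Assumes the segment was already extracted and saved outside of VAPI;
# "cadastral_segment.wav" and the language code are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("cadastral_segment.wav", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-1",   # OpenAI's hosted Whisper model
        file=audio_file,
        language="pt",       # placeholder: set to your audio's ISO-639-1 code
    )

print(transcription.text)
```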
You can continue this conversation by mentioning me in the message.
j
@Vapi is there a way in the API to grab the last n seconds of the user's audio input during the call with the SDK?
v
@Jean Reinhold VAPI does not directly provide a function to extract the last n seconds of audio from user input within the API. However, you can implement this functionality yourself by capturing the audio stream during the call and then processing it to extract the desired segment. Here's a general approach you could take:

1. **Capture Audio Stream**: Use the SDK to capture the audio stream during the call. This might involve setting up an audio recording mechanism that continuously records the conversation.
2. **Buffer Management**: Implement a buffer that keeps only the last n seconds of audio by continuously discarding older data as new data arrives, so the buffer always contains the most recent n seconds (see the sketch after this message).
3. **Audio Retrieval**: When you need the last n seconds, read them from this buffer.

Please note that the specifics of these steps depend on the programming language and SDK you are using. If you need more detailed guidance on setting up the audio capture and buffering, consult the SDK documentation or ask for further support.
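As a rough illustration of steps 2 and 3, here is a minimal rolling-buffer sketch in plain Python. The constants and the class itself are illustrative assumptions, not part of the VAPI SDK:

```python
# Minimal sketch of a rolling buffer that keeps the trailing n seconds
# of raw PCM audio. Assumes 16 kHz, 16-bit mono audio arriving in byte
# chunks from whatever capture hook your setup provides.
import collections

SAMPLE_RATE = 16_000   # samples per second (assumed)
SAMPLE_WIDTH = 2       # bytes per sample: 16-bit PCM (assumed)

class RollingAudioBuffer:
    """Keeps only the most recent `seconds` of raw audio bytes."""

    def __init__(self, seconds: int = 10):
        max_bytes = seconds * SAMPLE_RATE * SAMPLE_WIDTH
        self._buffer = collections.deque(maxlen=max_bytes)

    def feed(self, chunk: bytes) -> None:
        # The maxlen deque silently drops the oldest bytes as new ones arrive.
        self._buffer.extend(chunk)

    def snapshot(self) -> bytes:
        # Read the trailing window, e.g. from inside a function-call handler.
        return bytes(self._buffer)
```

You would call `feed()` from your audio-capture callback and `snapshot()` whenever you want the last n seconds, for example when a function call fires.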
j
@Vapi And what would trigger the audio retrieval? Can I retrieve the buffer after a function call? I want to use the Python SDK.
s
It is already there
use Talkscriber -> Whisper model
j
@Sahil, that is Talkscriber's model; I would like to have OpenAI Whisper.
s
Ah, I see. Can you add this request to https://roadmap.vapi.ai?
j
@Sahil Thanks! Is there a way to get the raw audio in a function call, or perhaps to store it in a buffer and retrieve it in the function call? That way, I could process it directly with any model.
s
Nope.
There isn't.