Vapi transcript and model input issues
# support
m
Hey there, could I please get some feedback from the Vapi team (or community) about an issue we've started encountering? Please feel free to point me to another channel if appropriate. I'm not sure if it's been a recent change, or if we've only just noticed it because we're playing around with a non-English assistant, but it looks like the platform is using the STT tooling as the source of truth for a call's active transcript. This obviously makes sense for the user's input, but it seems to be applied to the whole turn and includes the assistant's output as well. Converting from LLM output to voice and then back to text is introducing compounding errors. This is creating a large number of headaches in our non-English assistant, but I've also noticed it affects our English assistants as well.
An easy example of this issue is the degradation of the transcript / model input itself.

LLM output: 明白了。请您详细描述一下脚痛的情况吗?例如,痛了多久了,痛在哪里? => Understood. Could you please describe your foot pain in detail? For example, how long has it been hurting, and where does it hurt?

Transcript record after TTS -> STT: 明白了,请您详细描述一下交通的情况吗?例如,通了多久了?通在哪里? => Got it, could you please describe the traffic situation in detail? For example, how long has it been open? Where is it?

The above was using ElevenLabs and Deepgram, but we've seen these issues with other combinations, e.g. Azure and Talkscriber.

Another, more interesting example is the transcript being out of sync. In this scenario we have a relatively long opening message in simplified Chinese. The user responds with a short phrase. Vapi then sends us a transcript / model input in the order:
- System prompt
- Short user message
- Assistant intro message

This is never rectified, and even the end-of-call summary is out of order. If the assistant's turn is run through STT, I can only assume this causes a race condition where the user's message is parsed before the assistant's message, which then corrupts the transcript.

Can someone please confirm whether Vapi is using STT on the voice model's output for the call transcript? If so, is there some way we can disable this and make the LLM output the source of truth? If this isn't the cause, I'd be keen to hear people's thoughts on what could be going wrong.
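To put a rough number on the drift in that example, here's a quick character-level comparison of the two strings above using Python's standard-library `difflib` (just a measurement sketch, nothing Vapi-specific):

```python
import difflib

# The two strings from the call above: the LLM's actual output vs. the
# transcript recorded after the TTS -> STT round trip.
llm_output = "明白了。请您详细描述一下脚痛的情况吗?例如,痛了多久了,痛在哪里?"
stt_round_trip = "明白了,请您详细描述一下交通的情况吗?例如,通了多久了?通在哪里?"

# Character-level similarity in [0, 1]; 1.0 would mean a lossless round trip.
ratio = difflib.SequenceMatcher(None, llm_output, stt_round_trip).ratio()
print(f"character-level similarity: {ratio:.2f}")
```

The homophone substitutions (脚痛 "foot pain" -> 交通 "traffic") are exactly the kind of error a second recognition pass introduces, and they change the meaning of the turn completely even though most characters survive.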
v
@Muppet Hello! I'm here to help with any issues you're encountering with VAPI. It's great that you're exploring the capabilities with a non-English assistant, and I understand how important accurate transcription is for maintaining the quality of the service.

The behavior you're describing, where the platform uses speech-to-text (STT) for the entire conversation turn, including the assistant's output, can indeed introduce errors, especially when converting from text to speech and back to text. This is a known challenge in voice technology, particularly with non-English languages where the STT models may not be as robust.

To address this, it's often recommended to handle the assistant's output separately from the user's input. The assistant's output can be managed by directly using the text generated from the language model, rather than relying on a transcription of the spoken output. This can help to avoid the compounding errors you mentioned. For the user's input, improving the accuracy of STT can involve fine-tuning the language models, providing more training data in the target language, or adjusting the acoustic models to better handle the specific accents and dialects.

If you're looking for more detailed feedback or assistance, the VAPI community forums or the official support channels would be the best places to discuss these issues. There, you can share your experiences with other developers and get insights from the VAPI team on best practices and potential solutions. If you have any specific questions or need further assistance with your current setup, please let me know, and I'll do my best to assist you!
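That "use the LLM's text for assistant turns" splice can be sketched like this. Note this is purely illustrative: the message shape below is made up for the sketch and is not Vapi's actual transcript schema.

```python
# Hypothetical sketch: rebuild a model-facing transcript so that assistant
# turns come from the LLM's own generated text, and only user turns come
# from STT. The {"role": ..., "text": ...} shape is invented for this
# example, not Vapi's schema.

def splice_transcript(stt_turns, llm_texts):
    """stt_turns: ordered list of {"role": ..., "text": ...} for the call.
    llm_texts: the assistant's generated texts, in turn order."""
    llm_iter = iter(llm_texts)
    spliced = []
    for turn in stt_turns:
        if turn["role"] == "assistant":
            # Replace the lossy TTS -> STT round trip with the LLM's text.
            spliced.append({"role": "assistant", "text": next(llm_iter)})
        else:
            spliced.append(turn)
    return spliced

stt = [
    {"role": "assistant", "text": "describe the traffic"},   # STT drift
    {"role": "user", "text": "it started two days ago"},
]
fixed = splice_transcript(stt, ["describe your foot pain"])
```

This only helps if the platform exposes (or the caller retains) the raw LLM texts per turn, which is exactly what the original question is asking Vapi to confirm.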
You can continue this conversation by mentioning me in the message.
m
Can you please tag a human to assist
v
Got it. Tagging @User for additional help.
s
Here's the process: LLM generates text -> voice out -> transcribed.
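A toy model of that pipeline, showing why it puts STT errors into the record. All three components are stand-in stubs written for this sketch, not real APIs; the lossy replacement mimics the homophone drift from the call above.

```python
# Toy model of the pipeline: LLM text -> TTS audio -> STT text.
# Every component here is a stand-in stub; the point is only that the
# transcript comes from the final STT pass, not from the LLM's text.

def llm_generate(prompt):
    return "How long has your foot pain lasted?"

def tts_synthesize(text):
    return text.encode("utf-8")          # stand-in for an audio buffer

def stt_transcribe(audio):
    # Stand-in for a lossy recognizer: a homophone-style substitution,
    # mirroring the 脚痛 -> 交通 drift reported above.
    return audio.decode("utf-8").replace("foot pain", "football")

llm_text = llm_generate("user described pain")
transcript = stt_transcribe(tts_synthesize(llm_text))
# transcript no longer matches what the LLM actually said, yet it is
# what ends up in the call record.
```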
Now, for the out-of-order calls, can you share the call ID?
Closing this ticket due to no response.