Issues with Transcript Generation While Using Cust...
# support
a
Hi VAPI Support Team,

I'm reaching out regarding issues we're facing while using the VAPI custom-llm feature to enable voice as a channel for our existing text-based chatbot (which includes its own orchestration layer for handling text queries). While we are successfully generating and streaming output chunks from our backend (via the chat/completions API hosted on our server), we've observed that the transcripts on the VAPI side are often incomplete. Although the full model output appears correctly in the VAPI call logs, the corresponding transcripts tend to get cut off mid-sentence, resulting in poor voice quality and frequent drops.

More specifically, we've encountered two recurring issues:

1. **TTS not triggering or voice input missing.** In some calls, TTS does not get triggered and the voice input query itself does not seem to be generated. Example call: a9cfa29f-52b4-472a-a0c0-6be82f07b9a4
2. **Incomplete assistant response in transcripts.** In other cases, while the full voice input is correctly logged, the assistant's response in the transcript appears truncated. Example call: 440410f2-0acc-4b63-90a4-7acf29d75787

From the logs, we can see the voice output after TTS:

VOICE INPUT: Of course, I'd be happy to help you update your order! To get started,
VOICE INPUT: would you like to use the same phone number you're calling from to look up your order?

But the assistant transcript shows: "Of course, I'd be happy to help you. Update your order. Get started, would you like to use the same phone number..."

We (LevelAI) were seriously considering VAPI as a platform to power our voice integration, but the current behavior is significantly impacting user experience. We'd appreciate your help in investigating these issues and letting us know if there's a way to address them.

Looking forward to your support.

Thanks,
Arshdeep Singh
LevelAI
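For context on the integration described above: the backend streams output chunks in the OpenAI-compatible SSE format that custom-LLM endpoints typically emit. The sketch below (illustrative only; the `id` value and helper names are hypothetical, not VAPI's exact contract) shows how each streamed delta and the stream terminator are framed. A missing or malformed `[DONE]` terminator is one classic cause of a consumer cutting a stream off early.

```python
import json

def sse_chunk(delta_text, model="custom-llm", finish=False):
    """Frame one streamed token/segment as an OpenAI-style
    chat.completion.chunk SSE event (illustrative sketch)."""
    payload = {
        "id": "chatcmpl-demo",  # hypothetical placeholder id
        "object": "chat.completion.chunk",
        "model": model,
        "choices": [{
            "index": 0,
            "delta": {} if finish else {"content": delta_text},
            "finish_reason": "stop" if finish else None,
        }],
    }
    return f"data: {json.dumps(payload)}\n\n"

def sse_done():
    # Terminates the SSE stream; omitting this can leave the
    # consumer waiting and may truncate the final transcript.
    return "data: [DONE]\n\n"
```

Each assistant response is then a sequence of `sse_chunk(...)` frames, one final frame with `finish=True`, and `sse_done()`.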
v
Thank you for sharing these detailed observations about the issues you're experiencing with transcript generation and TTS triggering while using the custom-llm feature for voice integration. It's clear that having incomplete transcripts and inconsistent voice input is significantly affecting your user experience. Your examples and thorough description of the problem are very helpful for understanding the challenges you're facing.
a
@User
k
This may be due to STT/streaming delays or synchronization issues. Introducing a slight delay before TTS, optimizing response streaming, and reducing backend latency may help. If you're still experiencing issues, let us know. Thank you.
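One way to act on the "optimize response streaming" suggestion on the backend side is to buffer streamed tokens and flush only at sentence boundaries, so the TTS engine receives whole utterances rather than arbitrary fragments. This is a hypothetical sketch under that assumption, not a confirmed VAPI setting:

```python
import re

# Matches sentence-ending punctuation (optionally followed by a
# closing quote/bracket) at the end of the buffer.
SENTENCE_END = re.compile(r'[.!?]["\')\]]?\s*$')

class SentenceBuffer:
    """Accumulate streamed token deltas and release complete sentences.

    Illustrative only: flushing at sentence boundaries can reduce
    mid-sentence cutoffs in downstream TTS/transcription."""

    def __init__(self):
        self._buf = ""

    def feed(self, delta):
        """Add a delta; return a full sentence if one just completed."""
        self._buf += delta
        if SENTENCE_END.search(self._buf):
            sentence, self._buf = self._buf, ""
            return sentence.strip()
        return None

    def flush(self):
        """Emit whatever remains when the stream finishes."""
        rest, self._buf = self._buf.strip(), ""
        return rest or None
```

For example, feeding `"Of course, I'd be happy "` then `"to help you update your order! "` yields the complete sentence on the second call, and `flush()` returns any trailing fragment at end of stream.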
a
A certain amount of backend latency (2–3 seconds) is expected on our side, but that shouldn't be a reason for TTS not to work at all, should it? Can you please confirm this, as it is make-or-break for our use case. Also, once the voice query is generated, what causes it to get truncated when converted to a transcript, as in the second example I shared? These are consistent issues in our testing, not one-off or rare scenarios. Could you please share an RCA for why this might be happening (based on our examples), and let us know whether there are any settings or other changes we can make to work around the problem?
k
A backend latency of 2–3 seconds can disrupt real-time TTS and transcription, causing delays or truncation, so minimizing response time, adjusting timeouts, and improving network conditions are key to maintaining a smooth experience.
a
So just to clarify: does this mean that longer-duration chat/completions calls aren't well supported? For example, if we're streaming chunks continuously via SSE every few seconds and the total response generation takes around 6–7 seconds, would that cause consistent issues?
k
Yes, longer-duration chat/completions calls can cause issues with real-time TTS and transcription due to backend latency.
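When the orchestration layer genuinely needs 6–7 seconds end to end, one common mitigation (a sketch under assumptions, not a confirmed VAPI feature) is to emit a short filler sentence immediately so the voice stream stays active while the real answer is generated. Here `generate_answer` is a hypothetical callable standing in for the slow backend call:

```python
def stream_with_filler(generate_answer,
                       filler="One moment while I look that up."):
    """Generator sketch: yield a filler sentence right away, then the
    real answer chunks once the slow backend call returns.

    `generate_answer` is a hypothetical stand-in for the orchestration
    layer's response generator; it may take several seconds."""
    # Immediate chunk keeps TTS/transcription fed during backend latency.
    yield filler
    for chunk in generate_answer():
        yield chunk
```

This keeps the user hearing speech within the first second, even though the substantive answer arrives several seconds later.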