Assistant transcriptions being used in LLM calls
# support
l
Hey! I'm analysing some of my VAPI call logs and I'm noticing that bot messages sent to the LLM seem like they are being transcribed from the TTS-produced audio? Is this the case? It's causing me quite a bit of pain when attempting to produce certain types of behaviour w.r.t. TTS pronunciation.
v
@lucalp Hello! I can assure you that the messages sent to the Large Language Model (LLM) are not transcribed from the Text-to-Speech (TTS) audio. The LLM receives the text input directly as it is provided. If you're experiencing issues with TTS pronunciation, it might be related to the way the text is being formatted or the specific phonetics of the words in question. If you need assistance with improving TTS pronunciation or have any other concerns, please let me know, and I'll do my best to help you out!
You can continue this conversation by mentioning me in the message.
Got it. Tagging @User for additional help.
s
Can you explain your issue with expected vs. actual responses, along with the call ID and timestamp?
l
The issue is that text generated by the LLM is then passed back in the next turn with transcription errors. An example is call 4eb67fd4-3baf-447b-9a8f-a3eb33b7a649, where an entire word is missing in the message timestamped 1724776190900.997. I'm going off the fact that the previous LLM turn didn't generate that text, since it's not what gets synthesised by the TTS.
s
Set modelOutputInMessagesEnabled to true in your assistant config. This flag determines whether the model's output is used in the conversation history rather than the transcription of the assistant's speech.
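A minimal sketch of flipping that flag over the REST API, assuming the standard PATCH /assistant/:id endpoint; the assistant ID here is a placeholder:

```ts
// Hypothetical sketch: update an existing assistant so the model's own
// output, rather than the transcription of the synthesized speech, is
// written into the conversation history.
const VAPI_KEY = process.env.VAPI_API_KEY!;
const ASSISTANT_ID = "your-assistant-id"; // placeholder

async function enableModelOutputInMessages(): Promise<unknown> {
  const res = await fetch(`https://api.vapi.ai/assistant/${ASSISTANT_ID}`, {
    method: "PATCH",
    headers: {
      Authorization: `Bearer ${VAPI_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ modelOutputInMessagesEnabled: true }),
  });
  if (!res.ok) throw new Error(`VAPI returned ${res.status}`);
  return res.json();
}

enableModelOutputInMessages().catch(console.error);
```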
l
Great, thanks, will give this a go. Out of curiosity, what's the rationale for using the transcription over the model output by default?
s
The transcription reflects what actually gets voiced out, which can differ from the LLM-generated response; LLM responses are transformed for voice providers before synthesis.
a
@Shubham Bajaj Which model is used to transcribe the agent's speech? We're seeing lots of errors in the agent's transcribed text, which gives us a wrong picture of call quality from the transcript. Also, is this same transcribed text fed back to the LLM in the next turn, or is it only used to display the transcript in the logs?
@Vapi Can you assign this issue to a team member?
a
The model used to transcribe the agent's speech is whatever is configured as the transcriber. So if your assistant uses 11Labs as the transcriber, 11Labs is transcribing the user speech as well as the agent speech. This is the behavior observed when modelOutputInMessagesEnabled is false. So the answer to your question is yes, it is fed back to the LLM when the setting is false, but won't be fed back if it's true.
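Putting the two pieces together, a sketch of the relevant assistant fields; the transcriber provider and model shown are illustrative placeholders, not a recommendation:

```ts
// Hypothetical sketch of the two settings discussed in this thread.
// Substitute whatever transcriber your assistant actually uses.
const assistantConfig = {
  transcriber: {
    provider: "deepgram", // transcribes user speech, and agent speech too
    model: "nova-2",      // when modelOutputInMessagesEnabled is false
  },
  // false: history contains the transcription of the agent's speech.
  // true:  history contains the model's raw output instead.
  modelOutputInMessagesEnabled: false,
};
```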