Vapi is the Voice AI platform for developers.

Vapi AI

Is the transcript returned directly from the LLM, or is it a re-transcription of the TTS output?
Sometimes the TTS output (Cartesia) speaks the punctuation (it says the word "dot"), and the transcript also contains this word. If the transcript came directly from the LLM, it should not have this token, right?

<@939930793755676753>
The transcript should ideally be a direct output from the LLM before it is processed by the TTS system. If the TTS output is incorrectly speaking punctuation, such as saying the word "dot" instead of pausing for a period, this indicates an issue with the TTS interpretation of the text. The transcript should not include these spoken punctuation errors if it is taken directly from the LLM output.

If you are experiencing this issue where the transcript is including spoken punctuation, it might be a problem with the TTS service or the way the text is being processed before TTS conversion. It would be best to check the settings or contact the TTS provider to resolve this discrepancy. If you need further assistance with this issue, please let me know, and I can help guide you through troubleshooting or provide more information.

You can continue this conversation by mentioning me in the message.

Got it. Tagging <@&1231325148079194222> for additional help.

At what point in the workflow is the transcript returned to the client?
Are the punctuation words an artifact of chunkPlan?

By default it comes from the TTS, but you can change it to the LLM output.

Set `"modelOutputInMessagesEnabled": true` to use the model output in the transcript.
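For anyone following along, here is a minimal sketch of setting that flag on an assistant config before sending it to the API. Only the flag name `modelOutputInMessagesEnabled` comes from this thread. The helper name, the sample config, and the placement of the flag at the top level of the assistant object are assumptions, so check them against the API reference before relying on this.

```python
import json


def with_model_output_in_messages(assistant_config: dict, enabled: bool = True) -> dict:
    """Return a copy of the config with the flag set; the input is not mutated."""
    updated = dict(assistant_config)
    # Assumption: the flag sits at the top level of the assistant config.
    updated["modelOutputInMessagesEnabled"] = enabled
    return updated


# Hypothetical assistant config, used only for illustration.
assistant = {"name": "demo-assistant"}
payload = with_model_output_in_messages(assistant)

# Serialize the payload as it would be sent in a request body.
print(json.dumps(payload, indent=2))
```

Passing `enabled=False` restores the default behaviour described above, where the transcript comes from the TTS side.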

Oh! Why is it false by default? Isn't the LLM output more accurate for conversation history, since it avoids TTS mistakes?

Because it is preferable to act on the transcribed output.

Hmm, the TTS output may be correct, and it's what the user hears.
But if the TTS transcription is wrong, it can throw off future turns even though the user heard the correct speech, right?

The transcription normally won't be wrong.

But it's still unclear why the TTS transcription is preferable to the model output.

a) It's what is actually being voiced out.
b) The model output is transformed (with xyz) before being voiced out.

So, because of these, we prefer the TTS transcription. For the majority of use cases the TTS transcription is the better choice, but in your case it might not be. Feel free to change it.