Issue with custom transcriber
# support
m
Hello, I'm trying to build a custom transcriber. I have a server that receives audio data from Vapi, sends that data to the Google Speech-to-Text API, and returns the transcription in the following format:

{"type": "transcriber-response", "transcription": "transcription here", "channel": "customer"}

I tested having a conversation with multiple turns: I spoke a few times and the assistant replied. It works, but it has one issue. In the Call Log Details, the Transcripts tab shows a single entry that contains all my transcriptions merged into one sentence. Example:

User: hello how are you what's the weather in San Francisco

If I use an off-the-shelf transcriber (like Deepgram), the Call Log Details show up correctly. Example:

User: hello
Assistant: hello
User: how are you
Assistant: I'm just a robot
User: what's the weather in San Francisco

I believe that whenever I speak, the assistant receives the merged transcriptions ("hello how are you what's the weather in San Francisco"), so the issue doesn't just affect the Call Log Details; I think it also affects the conversation itself.
Can anyone help here? @Vapi @User
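For context, this is roughly what the response path described above looks like on the server side. A minimal sketch, assuming a WebSocket-based custom transcriber; forward_transcript and the websocket object are illustrative names, not part of any Vapi SDK:

```python
# Sketch: for every transcript produced by Google Speech-to-Text, send one
# transcriber-response message back to Vapi over the WebSocket that delivers
# the audio. The websocket object is assumed to come from whatever server
# framework handles the connection (e.g. the `websockets` library).
import json

async def forward_transcript(websocket, text: str, channel: str = "customer") -> None:
    """Send a single transcriber-response message to Vapi."""
    await websocket.send(json.dumps({
        "type": "transcriber-response",
        "transcription": text,
        "channel": channel,  # "customer" or "assistant"
    }))
```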
v
@Miguel C Hello! It sounds like you're experiencing an issue where your custom transcriber is concatenating all transcriptions into a single entry rather than separating them by turn. This could be due to how the transcriptions are being processed and sent back to Vapi.

To resolve this, you'll need to ensure that each transcription event sent to Vapi is distinct and corresponds to a single turn of the conversation. This might involve modifying your server logic to:
1. Detect pauses or breaks in the audio stream to determine when one turn ends and another begins.
2. Send a separate transcription event after each turn, rather than accumulating them and sending them all at once.

If you're already doing this and still facing the issue, it could be related to how the events are being handled or formatted when received by Vapi. You may need to review the integration code to ensure it's compatible with Vapi's expected event format. If you need further assistance with the specifics of your implementation, or if you believe there might be an issue with Vapi itself, please provide more details or reach out to the support team for more in-depth troubleshooting.
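One way to implement point 2 with Google's streaming API is to forward only results flagged as final, one message per utterance, instead of accumulating text across turns. A sketch under that assumption; send_to_vapi is a hypothetical helper that writes the JSON message back over the Vapi connection:

```python
# Sketch: iterate over the responses returned by Google's streaming_recognize
# and forward each final result as its own transcriber-response message,
# rather than appending everything into one growing string.
def relay_finals(responses, send_to_vapi):
    """responses: iterator of StreamingRecognizeResponse from streaming_recognize()."""
    for response in responses:
        for result in response.results:
            if result.is_final:
                text = result.alternatives[0].transcript.strip()
                if text:
                    # One distinct event per utterance / turn.
                    send_to_vapi({
                        "type": "transcriber-response",
                        "transcription": text,
                        "channel": "customer",
                    })
```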
m
I'm sending a separate transcription event after each turn. Here are the 3 transcriptions that I sent:

{"type": "transcriber-response", "transcription": "hello", "channel": "customer"}
{"type": "transcriber-response", "transcription": "how are you", "channel": "customer"}
{"type": "transcriber-response", "transcription": "what's the weather in San Francisco", "channel": "customer"}

@User
s
@Miguel C you need to send alternative transcriptions; you're sending all transcriptions with the customer role only. FYI: you might need to enable interim transcription.
Do let me know your thoughts on this, and if you want a breakdown, share the call ID.
m
- What do you mean by sending alternative transcriptions?
- Do I also need to send the "assistant transcriptions"?
- If I send interim transcriptions, how does Vapi know when a transcription is final?
Thanks in advance @Shubham Bajaj
s
@Miguel C you have to decide which transcripts are final before sending them to Vapi, differentiate between the user and assistant transcripts, and send each one to Vapi with the right role assigned.
If you can share the call ID, I can pinpoint exactly what you're doing wrong.
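The role assignment has to happen on the transcriber side. Assuming Vapi streams both legs of the call as interleaved 16-bit stereo PCM, with the customer on the left channel and the assistant on the right (check the start message your server receives; this layout is an assumption here), the audio can be split so each side is transcribed and tagged separately:

```python
# Sketch: split interleaved 16-bit stereo PCM into one mono buffer per role.
# Each mono stream would then feed its own Google streaming session, and the
# finals from each session are forwarded with "channel" set to the matching
# role ("customer" or "assistant"). Left = customer is an assumption.
def split_stereo_chunk(chunk: bytes) -> dict[str, bytes]:
    left, right = bytearray(), bytearray()
    # Frame layout assumed: [L lo, L hi, R lo, R hi] repeated.
    for i in range(0, len(chunk) - 3, 4):
        left += chunk[i:i + 2]
        right += chunk[i + 2:i + 4]
    return {"customer": bytes(left), "assistant": bytes(right)}
```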
m
Understood! I was able to make the custom transcriber work by doing that. Now, Nikhil told me about the modelOutputInMessagesEnabled flag, which I would like to try out, but I believe it's not working.
call_id: e50d7ff7-4058-4e36-b5e1-d1695e8dc03f
s
@Miguel C Looking into it, allow me some time.
m
thanks 🙂
s
@Miguel C What's currently happening is that you're passing only the customer transcript; you need to send both sides' transcripts for each turn, following the conversation flow.
modelOutputInMessagesEnabled is used to set what goes into the conversation messages: the LLM output or the TTS transcription. Check the screenshot for reference. Let me know your comments on this! https://cdn.discordapp.com/attachments/1314214616938840197/1316464635582611556/Screenshot_2024-12-11_at_11.20.09_PM.png?ex=675b24bb&is=6759d33b&hm=f20c5542ba6786af633176a4fcae697a1d7c3c00e9d307dcb6feb92431c92108&
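For anyone trying the same flag: it lives on the assistant itself and can be toggled via the API. A minimal sketch, assuming the standard PATCH /assistant/{id} endpoint and that the property sits at the top level of the assistant object (verify against the current API reference):

```python
# Sketch: enable modelOutputInMessagesEnabled on an existing assistant.
# The assistant ID below is a placeholder; the API key comes from the dashboard.
import os
import requests

VAPI_API_KEY = os.environ["VAPI_API_KEY"]
ASSISTANT_ID = "your-assistant-id"  # placeholder

resp = requests.patch(
    f"https://api.vapi.ai/assistant/{ASSISTANT_ID}",
    headers={"Authorization": f"Bearer {VAPI_API_KEY}"},
    json={"modelOutputInMessagesEnabled": True},
    timeout=30,
)
resp.raise_for_status()
print(resp.json().get("modelOutputInMessagesEnabled"))
```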
m
I thought that by enabling modelOutputInMessagesEnabled, the LLM output would be used to append the assistant messages directly to the conversation, and I would not need to transcribe the assistant audio. Even if I enable modelOutputInMessagesEnabled, do I still need to transcribe the assistant audio and send it to Vapi?
s
modelOutputInMessagesEnabled adds the model output to the message history, but we only support it for 11labs right now since it needs the TTS to provide exact timing.
m
Understood, thanks for helping out! Another question 😅 I'm using the Google Speech-to-Text API. This API sends me speech activity events, such as an event that indicates when the user has stopped talking. Can I send this event to Vapi? As far as I know, Vapi's documentation only mentions that I can send an event with the transcription, but I would like to know if I can send an event to indicate that the user has stopped talking. The goal is to reduce latency.
s
When you send transcriptions back to Vapi using a custom STT, you control how quickly the transcription is sent back to Vapi. Do let me know if anything else is required.
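To make that concrete: there is no separate end-of-speech event to forward; the latency lever is simply sending the final transcript the moment Google marks it final, with no server-side buffering. A sketch of a streaming config along those lines (sample rate, encoding, and language are placeholders for whatever the call actually uses):

```python
# Sketch: stream audio to Google with interim results enabled and forward each
# final result to Vapi immediately (see the relay sketch earlier in the thread).
from google.cloud import speech

def make_streaming_config(sample_rate_hz: int = 16000) -> speech.StreamingRecognitionConfig:
    return speech.StreamingRecognitionConfig(
        config=speech.RecognitionConfig(
            encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
            sample_rate_hertz=sample_rate_hz,
            language_code="en-US",
        ),
        interim_results=True,  # partial hypotheses stream in; forwarding them is a separate choice
    )
```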
r
@Shubham Bajaj is it on the roadmap to add support for modelOutputInMessagesEnabled for other TTS providers like Deepgram?
k
It is already out. You can now enable it.
r
I see. I'm using Vapi TTS and it doesn't seem to work: transcribed messages are still being added to the transcript/messages. Here is the call ID: 3c73fe72-754d-4792-828e-aff641ba3ad9. It would be great if you could help.
k
Hey Raj,
modelOutputInMessagesEnabled: true only works with 11labs and with assistants, so if you're using a squad or any other voice it is ignored.