Custom LLM streaming responses not getting voiced ...
# support
i
I'm running one of the NodeJS samples that have been posted in other support threads for how to do streaming, and I'm struggling to get it to actually trigger TTS correctly. I've confirmed (via curl) that the responses are streaming back when expected - you can also see this via the timestamps in the log below - but they're inconsistently getting voiced. Sometimes one of the three will get voiced, but usually at least two or three of the responses won't get voiced until the end of the call. What am I doing wrong? Call ID: a2098cfd-3f76-435c-b94a-ae8aea6a52e8
```
03:30:53:917 [CHECKPOINT] Model request started
03:30:53:918 [LOG]        Model request started (gpt-4o, custom-llm)
03:31:04:034 [CHECKPOINT] Model sent start token
03:31:04:035 [LOG]        Model output: Let me think.
03:31:04:035 [CHECKPOINT] Model sent first output token
03:31:04:037 [LOG]        Voice input: Let me think.
03:31:09:033 [LOG]        Model output: still thinking.
03:31:09:034 [LOG]        Voice input: still thinking.
03:31:13:830 [LOG]        Model output: still thinking.
03:31:13:832 [LOG]        Voice input: still thinking.
03:31:13:834 [CHECKPOINT] Model sent end token
03:31:14:322 [CHECKPOINT] 11labs: audio received
03:31:14:361 [CHECKPOINT] Assistant speech started
03:31:14:361 [INFO]       Turn Latency: 20446ms (Endpointing 3ms, Model 10118ms, Voice: 10288ms)
03:31:17:269 [CHECKPOINT] Assistant speech ended
```
s
I think you are using a chat completion model example. For that, you need to send the complete message at once. Check out this repository: https://github.com/VapiAI/server-side-example-python-flask/blob/main/app/api/custom_llm.py
i
@Sahil I don't see anything relevant in that example? The streaming in that example is coming entirely from OpenAI. I'm talking about using a hand-crafted streaming response, like another example you've posted before: https://dump.sahilsuman.me/streaming-custom-llm-vapi.txt In fact, the logs and call ID above are literally from that example, where three separate chunks are sent 5 seconds apart. It should be reasonable to expect those to be voiced as they are streamed, and not all clustered at the end of the custom-llm call, right?
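To make the pattern concrete, the shape I'm streaming is roughly this (a minimal sketch, not the exact sample from the link; the OpenAI-style chunk format and the `streamReply` helper name are my own illustration of what a custom-llm endpoint sends back):

```javascript
// Sketch of hand-crafted SSE streaming for a custom-llm endpoint.
// Assumption: chunks follow the OpenAI chat-completions streaming shape.
function sseChunk(content) {
  return 'data: ' + JSON.stringify({
    object: 'chat.completion.chunk',
    choices: [{ index: 0, delta: { content }, finish_reason: null }],
  }) + '\n\n';
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// res is the HTTP response (e.g. from Express); it only needs write()/end().
// The real handler waits 5000 ms between chunks; shortened here.
async function streamReply(res, delayMs = 50) {
  for (const text of ['Let me think. ', 'still thinking. ', 'still thinking. ']) {
    res.write(sseChunk(text));
    await sleep(delayMs);
  }
  res.write('data: [DONE]\n\n'); // end-of-stream sentinel
  res.end();
}
```

The expectation is that each `res.write` reaches the TTS pipeline as it is sent, rather than being batched until `[DONE]`.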
I'm continuing to try various things to get this to work - here's an example where there seems to be enough text to voice (including a punctuation delimiter), but nothing is sent to be voiced - why? https://cdn.discordapp.com/attachments/1251247053598621727/1251403003953680404/image.png?ex=666e735e&is=666d21de&hm=8a1dbd29b4146bc6017bc4368b234e1543e9b10997f8e5260eccbb0a7b139404&
Some learnings so far:
- Streaming doesn't seem to be working AT ALL for `11labs`. However, it DOES seem to be working for `playht`.
- I'm bumping into what I think is input buffering somewhere downstream from my custom-llm, and I've worked around it by emitting an SSE "comment" when I want to "flush" the output through - and it seems to be working! I'm now seeing the messages appear in the VAPI log at the correct timestamps.
- It seems like the default configuration won't actually take advantage of this streaming capability unless you ALSO configure Punctuation Boundaries. Once I set this to `period`, I finally seem to be making some progress.
- It still seems not to voice text input that is too short, even though I have min chars set to 3 and there is a punctuation mark at the end. I'm going to keep experimenting with how to force it to voice these shorter "filler" words, because I know there won't be any more text streamed for a bit.
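The flush workaround in the second bullet is just this (a sketch; the comment text is arbitrary, since SSE lines beginning with `:` are ignored by event-stream consumers):

```javascript
// SSE comments (lines starting with ':') are discarded by event-stream
// parsers, but writing one can push a buffered data chunk through
// intermediaries that wait for more bytes before forwarding.
function writeAndFlush(res, payload) {
  res.write('data: ' + payload + '\n\n'); // the real chunk
  res.write(': flush\n\n');               // comment line; consumer ignores it
}
```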
are there any other special "directives" I can use to influence when TTS happens?
```javascript
res.write('data: [DONE]\n\n');
```
s
Give me time till Monday. I will build an example and share it with you!
i
That would be amazing, thanks!
My last update for now: I was able to get TTS happening "mid-stream" by using `playht` and ensuring that:
- the sentence is long enough
- the sentence ends in a punctuation mark (which is also explicitly listed in the voice config)
- the punctuation mark has a trailing space

I'm trying to get TTS happening at three places over 10 seconds: the beginning, the middle, and the end. So far I'm only able to get it to happen at the beginning and the end, and I'm not sure how else to influence VAPI to engage the TTS service - hopefully the upcoming sample will provide that answer! I actually don't want to use `playht`, though; I want to use `11labs`. If possible, can you get the sample working with `11labs`?
s
I will need some more context about how exactly you are doing things. Let's have a meeting so I can understand your process; then I will be able to help you out in a better way.
@its.jcw Here you go
i
Thanks for the call today @Sahil - let me know what you hear back! Here are the call IDs again if needed:
- 11labs: 1794217e-aa8e-4cac-8939-ddbebf880b37
- playht: f5d92aea-ae82-4d4f-b0a2-12b977a2dc61
s
Will discuss it tonight.
I talked to Nikhil about this, and he told me that different voice providers handle data in different ways. It's not just about the word characters; punctuation and other factors also play a role. He told me that to fix this for 11labs, just add a trailing space " " at the end of the string message. For example: "Hello, How are you? "
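In code, that fix is just something like this (a sketch; the `voiceReady` name and the punctuation set are my own, match them to whatever your voice config lists as punctuation boundaries):

```javascript
// Ensure a chunk ends with a punctuation mark followed by a trailing
// space, which is what reportedly gets 11labs to voice it mid-stream.
function voiceReady(text) {
  const trimmed = text.trim();
  // Assumed punctuation set; align with your configured punctuation boundaries.
  const punctuated = /[.,!?]$/.test(trimmed) ? trimmed : trimmed + '.';
  return punctuated + ' ';
}
```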
i
omg that... works! wow, thanks so much - we'll explore this further