llmRequestDelaySeconds is an awful way to determine...
# support
y
Juggling between quality and latency for a conversational AI application. Initially set the variable at 100 ms but ended up receiving up to 6-7 requests for a single user interaction (where 6-7 phrases were present). Is there any other way to improve this? I heard you're working on a proprietary endpointing technique, and I know Deepgram also offers multiple endpointing techniques more advanced than just counting time. Wondered if Vapi had something along those lines? Thanks!
v
Hi @Youssef_Gep, I'm an AI assistant for Vapi. I'll analyze available sources and get back to you shortly.
**While I generate an answer, here are some sources that might be helpful:**
* [Glossary]
* [Introduction]
* [Orchestration Models]
* [List Assistants]
### Answer
Vapi does indeed use a proprietary endpointing technique that goes beyond simple time-based approaches like `llmRequestDelaySeconds`. Vapi's endpointing system uses a fusion of audio-text models to determine when a user has completed their turn, allowing for more nuanced interaction and reducing unnecessary interruptions or repeated requests. This method considers both the tone and content of the user's speech, improving the accuracy of detecting speech endpoints [(1)]. Additionally, Vapi's system is designed to handle interruptions and turn-taking dynamically, which supports more fluid and natural conversations. This approach is part of Vapi's effort to simulate the flow of natural human conversation, addressing common challenges in voice AI applications [(2)]. For developers looking to adjust the sensitivity and behavior of the endpointing mechanism, Vapi provides customizable settings, although the specifics of these settings would depend on the integration details and the desired responsiveness of the application [(1)].
Got it. Tagging @User for additional help.
s
Can you send me a call_id so that I can see the logs?
y
0bc8e795-fecc-4d0c-a2e8-0bd53a25e225
Let me know if you need something else @Sahil
s
There are three things leading to an increase in your latency:
1. You are using `custom_llm`, which adds more latency.
2. You are using your own Deepgram credentials. We use Deepgram on-premise, which gives much better latency.
3. You are using your own OpenAI key. When you use your own key, we don't utilize intelligent model picking to select the fastest server.

You can verify this by simply removing all the credentials and using our models, then adjusting the `llmRequestDelaySeconds` parameter (see the sketch below). You will not experience a 5-6 second delay.
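For reference, here is a minimal sketch of what "remove the credentials and adjust the parameter" could look like against the Vapi REST API. It assumes the assistant-level field `llmRequestDelaySeconds` discussed in this thread, the `PATCH https://api.vapi.ai/assistant/{id}` endpoint, and illustrative provider/model names; verify the exact field names against the current API reference before relying on it.

```python
# Minimal sketch, not a verified integration: switch an assistant to
# Vapi-managed transcription/model credentials and tune the LLM request delay.
# Field names follow this thread; confirm against the current Vapi API docs.
import os
import requests

VAPI_API_KEY = os.environ["VAPI_API_KEY"]
ASSISTANT_ID = "your-assistant-id"  # placeholder, not the call_id shared above

resp = requests.patch(
    f"https://api.vapi.ai/assistant/{ASSISTANT_ID}",
    headers={"Authorization": f"Bearer {VAPI_API_KEY}"},
    json={
        # Points 2 and 3: use Vapi-managed Deepgram/OpenAI instead of your own
        # credentials (illustrative provider/model values).
        "transcriber": {"provider": "deepgram"},
        "model": {"provider": "openai", "model": "gpt-3.5-turbo"},
        # Wait longer after an apparent end of speech before calling the LLM;
        # the value is something to experiment with (see later in the thread).
        "llmRequestDelaySeconds": 0.4,
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```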
y
Thanks Sahil, will try number 2. As for numbers 1 and 3, our OpenAI LLMs are fine-tuned and we have a RAG pipeline, so I believe points 1 and 3 cannot be implemented, right? Additionally, the issue is not so much latency as that the `llmRequestDelaySeconds` variable set at 0.1 (your default) results in lots of small parts of an interaction being sent progressively to our LLM. Example: instead of sending "Hello. How are you doing?", it sends 2 interactions to our model: first 1) "Hello" and then 2) "Hello. How are you doing?"
s
I asked Nikhil this question, and his exact response was: "We send requests when we think the person is done speaking; it's a misprediction on our side if they continue speaking. From their side, they should make a stateless system and just respond. We'll abort the request if the user continues speaking."
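To illustrate the "stateless system, just respond" advice, here is a rough sketch of a custom-LLM endpoint that treats every incoming request as self-contained. It assumes Vapi forwards an OpenAI-style chat-completions payload to the configured `custom_llm` URL, and `run_rag_pipeline` is a hypothetical stand-in for the fine-tuned-model + RAG call; check the payload shape against your own integration.

```python
# Sketch of a stateless custom-LLM endpoint: no per-call session state, no
# waiting for "the rest" of the utterance. If Vapi aborts the request because
# the user kept speaking, the dropped connection is the only cleanup needed.
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()


def run_rag_pipeline(text: str) -> str:
    # Hypothetical placeholder: swap in your fine-tuned model + retrieval step.
    return f"(answer to: {text})"


@app.post("/chat/completions")
async def chat_completions(request: Request):
    payload = await request.json()
    messages = payload.get("messages", [])
    # Treat each request as complete in itself, even if it only carries a
    # partial transcript of the user's turn.
    user_text = messages[-1]["content"] if messages else ""
    reply = run_rag_pipeline(user_text)
    return JSONResponse(
        {"choices": [{"message": {"role": "assistant", "content": reply}}]}
    )
```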
y
So @Sahil, we switched to Deepgram on your server and increased `llmRequestDelaySeconds` to 500 ms. Still, we have multiple requests being sent to our LLM for intermediate transcriptions. We are agnostic on our end; we process whatever you send us. This is very weird. Surprised no one complained about this before.
Will increase it to 800 ms and see how it does, but honestly, a simple rule of "wait X amount of time after a point before sending" seems strangely simplistic.
s
That might be because most people are either using a fine-tuned model or directly accessing OpenAI models from the dashboard.
y
Alright @Sahil, we implemented all 3 recommendations and are still having the same issue. The question is pretty simple: what configuration of this variable avoids sending the same request 5, 6, or 7 times? I'll give a real-life example:

1) Okay. Okay. Uh, that's that's really helpful. Now from my understanding, is that If the additional 10 percent of high quality dates outweigh the cost of, uh, buying the chemical, the farmers would offer it.

2) Okay. Okay. Uh, that's that's really helpful. Now from my understanding, is that If the additional 10 percent of high quality dates outweigh the cost of, uh, buying the chemical, the farmers would offer it. And in that case, what I could do is I could calculate how much extra revenue on the 10 percent of good deeds generate for them.

3) Okay. Okay. Uh, that's that's really helpful. Now from my understanding, is that If the additional 10 percent of high quality dates outweigh the cost of, uh, buying the chemical, the farmers would offer it. And in that case, what I could do is I could calculate how much extra revenue on the 10 percent of good deeds generate for them. Um, so that is 5 kilogram times 10 minus 2 or, no, uh, rather 10 This is the increase and there will be a corresponding decrease in the low quality dates which would be 1 kilogram times 2, And this would be 50 minus 2 times

https://cdn.discordapp.com/attachments/1242151915433758820/1243515257784107118/image.png?ex=6651c154&is=66506fd4&hm=0a89f5e37aed5940990d083a0e8809d048551113428d43bd1d52732118ae010f&
At this point we really just want to understand whether we are screwing something up or this is a limitation of Vapi's service.
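While this gets sorted out, one way to cope on the receiving side (purely an illustration, not a mechanism Vapi describes in this thread) is to let a newer, longer partial for the same call supersede an older one that is still being answered. A sketch, assuming your endpoint can extract a per-call key and the latest transcript from each payload, with `generate_reply` as a hypothetical stand-in for the model call:

```python
import asyncio

# Illustration only: cancel in-flight work for an earlier, shorter partial
# when a longer transcript for the same call arrives.
inflight: dict[str, asyncio.Task] = {}


async def generate_reply(transcript: str) -> str:
    # Hypothetical placeholder for the fine-tuned model + RAG call.
    await asyncio.sleep(1.0)
    return f"(answer to: {transcript})"


async def handle_request(call_id: str, transcript: str) -> str:
    previous = inflight.get(call_id)
    if previous is not None and not previous.done():
        previous.cancel()  # the shorter partial has been superseded
    task = asyncio.create_task(generate_reply(transcript))
    inflight[call_id] = task
    try:
        return await task
    except asyncio.CancelledError:
        return ""  # a newer partial took over; send nothing for this one
```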
s
LLM Request Delay should be 0.4, and the response delay is okay as it is.
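In config terms, that recommendation maps to something like the following (same caveat as above about verifying the exact field name):

```python
# Final value suggested in this thread: 0.4 s for the LLM request delay,
# with the response delay left unchanged.
patch_body = {
    "llmRequestDelaySeconds": 0.4,
}
```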