GPT-3.5-turbo latency
# support
u
The estimated latency shown for GPT-3.5-turbo is 250ms; even though it's an estimate, it is far lower than the actual latency, which is around 600ms. How was this 250ms estimate calculated? https://cdn.discordapp.com/attachments/1256644177257566351/1256644177371074652/image.png?ex=66818497&is=66803317&hm=02abcd27360db8fee2afc9689209094725639fc002dfa72f73691b3fbf98195e&
v
@0xBLURRYFΛCE The estimated latency for a model like GPT-3.5-turbo can be calculated based on a variety of factors, including the average response time over a large number of API calls under controlled conditions. This estimation might not account for real-world variables such as network latency, server load, or the complexity of the input prompt, which can all contribute to higher actual latencies.

The 250ms estimation would typically be derived from ideal conditions, which serve as a baseline. However, in practice, users may experience different latencies due to the aforementioned factors. If you're consistently experiencing higher latencies, it might be worth looking into optimizing your network connection or checking if there's any additional processing in your system that could be contributing to the delay.

If you need more detailed information on how the latency is calculated or if you're experiencing issues that you believe are related to the API itself, please let me know, and I can assist you further or direct you to support for more in-depth troubleshooting.
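For reference, here is a minimal sketch of how time-to-first-chunk and total completion time could be measured for a streamed GPT-3.5-turbo call. It assumes the v1.x OpenAI Python client; the `measure_latency` helper, the prompt, and the `max_tokens` value are illustrative and not part of the original discussion.

```python
import time
from openai import OpenAI  # assumes the v1.x OpenAI Python client

client = OpenAI()

def measure_latency(prompt: str, max_tokens: int = 30) -> None:
    """Measure time to first streamed chunk vs. total completion time."""
    start = time.perf_counter()
    first_chunk_at = None

    stream = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        stream=True,
    )

    for chunk in stream:
        # Record the moment the first content-bearing chunk arrives.
        if first_chunk_at is None and chunk.choices and chunk.choices[0].delta.content:
            first_chunk_at = time.perf_counter()

    end = time.perf_counter()
    if first_chunk_at is not None:
        print(f"time to first chunk: {(first_chunk_at - start) * 1000:.0f} ms")
    print(f"total completion time: {(end - start) * 1000:.0f} ms")

measure_latency("Say hello in one short sentence.")
```

Comparing the two numbers makes the distinction in this thread concrete: the first-chunk figure corresponds to the ~250ms estimate, while the total completion time is closer to the ~600ms observed in practice.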
Got it. Tagging @User for additional help.
s
It was calculated based on a best-case scenario.
u
@Sahil So probably with max_tokens set to 1, which does not correspond to a real-life scenario since the complete response from ChatGPT needs to be sent to the TTS.
s
Latency doesn't depend only on max_tokens.
It also depends on how far away the server is and how much load is on the server.
u
Obviously @Sahil , but as previously mentioned, the average latency is 600ms for a short sentence (max_tokens at 30) from a server located in the us-west-1 region. Now, the latency of the first chunk is indeed around 250ms, but that is not really indicative since the complete response must be sent to the TTS. I just wanted to make sure I am not missing something in this reasoning.
s
Yeah, you are correct about that. We calculate the latency from the first chunk, and that is valid because we send the data to be spoken to the TTS on a chunk-by-chunk basis, not as the whole LLM response.
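A minimal sketch of that chunk-by-chunk pattern, again assuming the v1.x OpenAI Python client with streaming chat completions; `send_to_tts` is a hypothetical placeholder for whatever TTS hook the actual pipeline uses.

```python
import time
from openai import OpenAI  # assumes the v1.x OpenAI Python client

client = OpenAI()

def send_to_tts(text: str) -> None:
    """Hypothetical TTS hook: a real pipeline would enqueue this text for synthesis."""
    print(f"[TTS] {text!r}")

start = time.perf_counter()
stream = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Give me a short greeting."}],
    max_tokens=30,
    stream=True,
)

received_any = False
for chunk in stream:
    delta = chunk.choices[0].delta.content if chunk.choices else None
    if not delta:
        continue
    if not received_any:
        # The latency reported in the dashboard corresponds to this point:
        # the first chunk is already available to the TTS.
        print(f"first chunk after {(time.perf_counter() - start) * 1000:.0f} ms")
        received_any = True
    send_to_tts(delta)  # speak chunk by chunk instead of waiting for the full response

print(f"full response after {(time.perf_counter() - start) * 1000:.0f} ms")
```

Because each chunk is handed to the TTS as soon as it arrives, the perceived responsiveness is governed by the first-chunk latency rather than by the time needed to receive the complete response.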
u
Thank you for the clarification, @Sahil. It is very clear now.