Hey there! 👋
I wasn't able to find specific information about whether Vapi builds and deploys voice agents separately or runs them in the same runtime engine. However, I can definitely help with how Vapi builds and deploys voice AI agents in general!
According to the [documentation](https://docs.vapi.ai/introduction), Vapi uses a modular architecture to build voice AI agents. At its core, Vapi functions as an orchestration layer over three primary modules:
1. The transcriber (Speech-to-Text)
2. The model (LLM)
3. The voice (Text-to-Speech)
These three modules can be swapped out for any provider of your choosing, including OpenAI, Groq, Deepgram, ElevenLabs, PlayHT, etc. You can even [plug in your own server to act as the LLM](https://docs.vapi.ai/quickstart).
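To make the modular idea concrete, here's a minimal sketch of what an assistant definition with swappable providers could look like. The field names and values are illustrative assumptions, not a verbatim copy of Vapi's API schema, so check the docs above for the authoritative shape:

```typescript
// Illustrative assistant definition: each module is just a provider + model choice.
// Field names are assumptions for the sake of the example, not Vapi's exact schema.
interface AssistantConfig {
  transcriber: { provider: string; model?: string };                  // Speech-to-Text
  model: { provider: string; model: string; systemPrompt?: string };  // LLM
  voice: { provider: string; voiceId: string };                       // Text-to-Speech
}

const assistant: AssistantConfig = {
  transcriber: { provider: "deepgram", model: "nova-2" },             // swap for any STT provider
  model: { provider: "openai", model: "gpt-4o", systemPrompt: "You are a helpful receptionist." },
  voice: { provider: "elevenlabs", voiceId: "YOUR_VOICE_ID" },        // swap for PlayHT, etc.
};
```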
For deployment, Vapi offers several options, including the following (a short SDK sketch follows the list):
- Calls
- Vapi SDKs
- Server URLs
- SIP Telephony
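As a quick deployment example, the Web SDK route typically boils down to creating a client with your public key and starting a call against an assistant. The snippet below is a hedged sketch of the common `@vapi-ai/web` usage pattern; treat the key and assistant ID as placeholders and confirm the current API in the Quickstart.

```typescript
// Hedged sketch of starting a browser call with the Vapi Web SDK.
// Method and event names follow the common usage pattern; verify against the docs.
import Vapi from "@vapi-ai/web";

const vapi = new Vapi("YOUR_PUBLIC_API_KEY"); // placeholder public key

// Start a call against an existing assistant (placeholder ID).
vapi.start("YOUR_ASSISTANT_ID");

// Basic lifecycle hooks for the conversation.
vapi.on("call-start", () => console.log("Call started"));
vapi.on("call-end", () => console.log("Call ended"));
```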
Regarding latency management (which you mentioned Vapi is great at), the [documentation](https://docs.vapi.ai/quickstart) explains that Vapi "optimizes the latency, manages the scaling & streaming, and orchestrates the conversation flow to make it sound human." The system performs each phase in real time (sensitive down to the 50-100 ms level), streaming between every layer, with the goal of keeping the whole voice-to-voice flow under 500-700 ms.
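To see why that 50-100 ms sensitivity matters, here's a back-of-the-envelope sketch. The per-stage numbers are made-up illustrative values, not measured Vapi figures, but they show how quickly a 500-700 ms voice-to-voice budget gets consumed when every stage adds its share:

```typescript
// Illustrative (made-up) per-stage latencies in milliseconds; real values vary by provider.
const stages = {
  endpointing: 150,    // deciding the caller has finished speaking
  transcription: 100,  // streaming STT finalization
  llmFirstToken: 250,  // time to first LLM token
  ttsFirstAudio: 150,  // time to first synthesized audio chunk
};

const total = Object.values(stages).reduce((sum, ms) => sum + ms, 0);
console.log(`Voice-to-voice estimate: ${total} ms`); // 650 ms — only fits the budget because every stage streams
```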
Vapi also runs a suite of specialized models for features like endpointing, interruption detection, noise filtering, and emotion detection to create more natural conversations, as mentioned in the [How Vapi Works documentation](https://docs.vapi.ai/how-vapi-works).
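As a conceptual illustration of one of those features, a naive endpointing rule might simply wait for a fixed stretch of silence before declaring the caller's turn over. The sketch below is not Vapi's model (the docs describe specialized, audio-aware models); it only shows what "endpointing" decides and why a smarter model can respond faster without cutting callers off:

```typescript
// Naive endpointing heuristic: end the user's turn after a fixed silence window.
// This is a conceptual stand-in, not Vapi's actual endpointing model.
function isTurnOver(msSinceLastSpeech: number, silenceThresholdMs = 700): boolean {
  return msSinceLastSpeech >= silenceThresholdMs;
}

// A fixed 700 ms wait adds 700 ms of latency to every response; a learned model
// tries to cut that wait when phrasing and intonation make it clear the caller is done.
console.log(isTurnOver(300)); // false — still within the silence window
console.log(isTurnOver(900)); // true  — treat the turn as finished
```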
Is there anything specific about Vapi's voice agent architecture you'd like to know more about?
Sources:
- [Vapi Documentation: Introduction](https://docs.vapi.ai/introduction)
- [Vapi Documentation: Quickstart](https://docs.vapi.ai/quickstart)
- [Vapi Documentation: How Vapi Works](https://docs.vapi.ai/how-vapi-works)