stream Parameter Not Accepted in Custom LLM Setup
# support
m
I'm trying to use Cerebras with a custom LLM (llama-3.3-70b) and would like the full response to be returned in a single chunk (i.e. not streamed). I attempted to set "stream": false in the model config like this:

"model": {
  "provider": "custom-llm",
  "model": "llama-3.3-70b",
  "url": "{url}/chat/completions",
  "temperature": 0.3,
  "maxTokens": 250,
  "stream": false,
  "messages": [
    { "role": "system", "content": "{systemPrompt}" }
  ]
}

However, I'm getting an error saying that the stream parameter is not accepted. Could someone please clarify how I can disable streaming and send the entire response in one go?
v
Hi there! 👋 I understand you're trying to disable streaming with your Cerebras custom LLM setup and receive the full response in one go. While I don't have specific information about the exact error you're encountering with Cerebras, I can share what I know about Vapi's custom LLM streaming expectations.

According to the [documentation](https://docs.vapi.ai/customization/custom-llm/fine-tuned-openai-models), Vapi is designed to work with streaming completions: "For best results, your custom LLM endpoint should support streaming completions." Vapi specifically expects responses to be streamed in chunks following the OpenAI streaming format.

Here are some alternative approaches you might consider:

1. Check the Custom LLM Server Guide - A [Notion guide](https://vapi.notion.site/Custom-LLM-Server-Guide-14d008c8393f807f86dbf61dcb0767ee?pvs=4) explains the format VAPI expects for custom LLM responses. As one user mentioned, "As long as I return responses following that format, VAPI will be able to process them."
2. Try using webhooks - Instead of using the VAPI module directly, you might consider [using a webhook](https://discord.com/channels/1211482211119796234/1211483291191083018/threads/1334889631333875793) as an alternative approach.
3. Use the `triggerResponseEnabled` flag - This flag [controls whether a system message triggers an LLM response](https://discord.com/channels/1211482211119796234/1211483291191083018/threads/1338730919904673853), which might help you manage the response behavior.
4. Ensure correct response format - Make sure your custom LLM is returning responses in the expected format. Vapi expects each chunk to be in a specific delta format like:
{
  choices: [
    {
      delta: {
        content?: string,
        tool_calls?: Array<...>,
      },
      finish_reason?: string
    }
  ]
}
Would you like me to provide more details on any of these approaches or help troubleshoot further?
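(For illustration only, not an official Vapi example: a minimal sketch of a custom LLM endpoint that streams chunks in that delta format over Server-Sent Events. It assumes an Express server and a hypothetical `generateReply` helper, and that the request body follows the OpenAI chat-completions shape.)

```typescript
import express from "express";

const app = express();
app.use(express.json());

// Hypothetical helper standing in for the real model call
// (e.g. forwarding req.body.messages to Cerebras / llama-3.3-70b).
async function generateReply(messages: unknown): Promise<string> {
  return "Hello! This reply is streamed back in small chunks.";
}

app.post("/chat/completions", async (req, res) => {
  // Vapi calls this endpoint with stream: true and reads
  // Server-Sent Events in the OpenAI chat.completion.chunk format.
  res.setHeader("Content-Type", "text/event-stream");
  res.setHeader("Cache-Control", "no-cache");
  res.setHeader("Connection", "keep-alive");

  const reply = await generateReply(req.body.messages);
  const base = {
    id: "chatcmpl-1",
    object: "chat.completion.chunk",
    created: Math.floor(Date.now() / 1000),
    model: "llama-3.3-70b",
  };

  // Stream the reply one word at a time, each piece as a delta chunk.
  for (const piece of reply.match(/\S+\s*/g) ?? []) {
    const chunk = {
      ...base,
      choices: [{ index: 0, delta: { content: piece }, finish_reason: null }],
    };
    res.write(`data: ${JSON.stringify(chunk)}\n\n`);
  }

  // Final chunk carries finish_reason, then the done sentinel closes the stream.
  const stop = {
    ...base,
    choices: [{ index: 0, delta: {}, finish_reason: "stop" }],
  };
  res.write(`data: ${JSON.stringify(stop)}\n\n`);
  res.write("data: [DONE]\n\n");
  res.end();
});

app.listen(3000);
```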
m
@Shubham Bajaj can you please check this
d
@Kings_big💫 can you help me?
On this one
k
Hey, you need to return the response with streaming set to true; as of now, we only accept streamed responses. If you want to send a single chunk, keep stream set to true on the VAPI side and handle it on your server URL: you can stream a single chunk containing the full content, followed by the done signal. But you cannot set stream to false on our side.
Let me know if you need anything further.
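(A hedged sketch of that workaround, reusing the same assumed Express-style handler as the earlier example: keep stream: true and the SSE framing, but write the whole completion as one content chunk, then a finish_reason chunk and the done signal.)

```typescript
import express from "express";

const app = express();
app.use(express.json());

app.post("/chat/completions", async (req, res) => {
  res.setHeader("Content-Type", "text/event-stream");
  res.setHeader("Cache-Control", "no-cache");

  // Placeholder text; a real server would generate this from req.body.messages.
  const reply = "Here is the whole answer, delivered as a single chunk.";

  const base = {
    id: "chatcmpl-1",
    object: "chat.completion.chunk",
    created: Math.floor(Date.now() / 1000),
    model: "llama-3.3-70b",
  };

  // One chunk carrying the entire content...
  res.write(
    `data: ${JSON.stringify({
      ...base,
      choices: [{ index: 0, delta: { content: reply }, finish_reason: null }],
    })}\n\n`
  );

  // ...then a closing chunk with finish_reason "stop" and the done signal.
  res.write(
    `data: ${JSON.stringify({
      ...base,
      choices: [{ index: 0, delta: {}, finish_reason: "stop" }],
    })}\n\n`
  );
  res.write("data: [DONE]\n\n");
  res.end();
});

app.listen(3000);
```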
m
@Shubham Bajaj When I set stream: true and send the complete response in a single chunk, VAPI doesn't speak the entire sentence; it stops midway.
k
To help track down this issue, could you share:
- The call ID

This would really help us figure out what went wrong!