Differences between Vapi calls vs calling OpenAI A...
# support
b
Hi, I'm working on some voice call solutions at my company implemented with Vapi and had some questions about differences in behavior I'm seeing. My main focus right now is setting up a testing harness to reproduce textual interactions from a given Vapi transcript so that we can triage issues and iterate on prompt and tool definitions more quickly. Currently, I can pull in messages from the Vapi transcript and run them through the OpenAI API. The issue is that I'm seeing very different behavior between the Vapi agent and what OpenAI returns directly. The agent consistently does the wrong thing when interacting through Vapi over the phone, but consistently does the correct thing when testing directly against OpenAI. I'm wondering if anyone could help explain the differences in behavior we're seeing. Also, do you have any recommendations for setting up these kinds of testing harnesses to maximize the chance of parity?
For reference, I'm testing with GPT-4.1 and a temperature of 0.
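For context, here's a stripped-down sketch of what the replay harness does. It's simplified, and the role/content field names on the transcript side are assumptions about how our call log export is shaped, not something Vapi guarantees:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def replay(transcript_messages, system_prompt, tools=None):
    """Replay a Vapi transcript turn by turn and collect the model's replies."""
    messages = [{"role": "system", "content": system_prompt}]
    replies = []
    for msg in transcript_messages:
        messages.append({"role": msg["role"], "content": msg["content"]})
        if msg["role"] != "user":
            continue  # only ask the model to respond after user turns
        kwargs = {"model": "gpt-4.1", "temperature": 0, "messages": messages}
        if tools:
            kwargs["tools"] = tools  # same tool definitions as the Vapi assistant
        response = client.chat.completions.create(**kwargs)
        # Compare this reply against the assistant turn that follows in the real transcript.
        replies.append(response.choices[0].message)
    return replies
```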
k
These differences often stem from transcription inaccuracies, prompt delivery differences, model configuration mismatches, and audio timing issues. To improve consistency, standardize model settings, use the Vapi Test Suite, review transcripts for errors, and structure prompts carefully.
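For example, you could pull the raw call record and diff the model settings against your harness before replaying anything. This is only a rough sketch: it assumes the GET /call/:id endpoint and illustrative keys on the response (assistant, messages), which you should verify against your own call logs:

```python
import os
import requests

def fetch_call(call_id):
    """Fetch the raw call record so settings and transcript can be compared."""
    resp = requests.get(
        f"https://api.vapi.ai/call/{call_id}",
        headers={"Authorization": f"Bearer {os.environ['VAPI_API_KEY']}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

call = fetch_call("YOUR_CALL_ID")
# The keys below are assumptions about the response shape; check your own logs.
print(call.get("assistant", {}).get("model"))  # model and temperature Vapi actually used
print(call.get("messages"))                    # transcript turns to replay
```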
b
If I'm using the messages directly from the Vapi call log, shouldn't that be consistent?
The model settings are also the same between Vapi and the test harness.
k
It should be consistent.
b
I've seen in the general chat that 4.1 may not actually be working right now and may be falling back to some other model. That may be my actual issue. Is there any way to tell if that's happening?
k
Currently the devs are working on resolving the issue.
b
I appreciate the responses. I'll keep an eye out for an update in the chat. Is there currently a way to see when Vapi uses a fallback model?