Problems with voice transcription and unexpected l...
# support
j
Call IDs:
- f1e8ae95-322f-4d67-b1ae-6a08d5301121
- a76bcbc5-54b1-4038-be35-e18abbdb5c2d
- 51061192-4aab-415a-9bd9-2ed4f09095f5

1. In multiple calls, the assistant pronounces the user's phone number correctly, but the transcription (voice to text) captures it incorrectly, which causes errors when functions are executed.
2. When repeating or confirming phone numbers, the assistant's voice switches to what appears to be Portuguese, even though the entire workflow is set to Spanish.

I appreciate your help in resolving these issues. I look forward to your response or recommendations for improving system performance.
v
Hi there! 👋 I understand you're experiencing some frustrating issues with your voice AI system. While I don't have access to the specific call recordings from the IDs you shared, I can definitely help with both the phone number transcription issues and the unexpected language changes you're experiencing.

## Improving Voice-to-Text Accuracy for Phone Numbers

For the phone number transcription issues, here are some adjustments you can make:

1. **Adjust Transcription Endpointing Settings**: Increase the `onNumberSeconds` parameter to give the system more time to process number sequences accurately:
```json
{
  "startSpeakingPlan": {
    "transcriptionEndpointingPlan": {
      "onNumberSeconds": 1.0  // Increase from default 0.5s
    }
  }
}
```
[Source](https://discord.com/channels/1211482211119796234/1211483291191083018/threads/1342583376707850250)

2. **Adjust the `numberToDigitsCutoff` Setting**: This determines when numbers are spoken as words or as digits. Setting it higher than the numbers you expect ensures they're converted to words:
```json
{
  "numberToDigitsCutoff": 100000  // For numbers like "12345" to be spoken as words
}
```
[Source](https://discord.com/channels/1211482211119796234/1211483291191083018/threads/1342349738032500736)

3. **Consider DTMF as Alternative**: For critical number capture, enabling keypad input (DTMF) can be more accurate than voice transcription [Source](https://discord.com/channels/1211482211119796234/1211483291191083018/threads/1341108728115757138).

## Preventing Unexpected Language Changes

For the language switching issue (Spanish to Portuguese):

1. **Explicitly Set the Language**: Use the `language` property in your voice configuration:
```json
{
  "voice": {
    "provider": "azure",
    "voiceId": "es-ES-ElviraNeural",
    "language": "es-ES"
  }
}
```
[Source](https://docs.vapi.ai/changelog/2024/11/24)
2. **Ensure Voice Model Language Consistency**: Make sure the voice model's language matches your desired language (Spanish) [Source](https://discord.com/channels/1211482211119796234/1211483291191083018/threads/1270786139355090995).
3. **Configure Voice Fallback Plan**: If using fallback voices, ensure they also use Spanish to prevent unexpected language switches [Source](https://docs.vapi.ai/changelog/2024/11/24).
4. **Disable Smart Endpointing**: Set `smartEndpointingEnabled` to `false`, as it only works for English transcriptions [Source](https://discord.com/channels/1211482211119796234/1211483291191083018/threads/1334385998423130174).
5. **Configure chunkPlan for Spanish**: Ensure `chunkPlan.enabled` is set to `true` to improve sentence boundary detection for Spanish responses [Source](https://discord.com/channels/1211482211119796234/1211483291191083018/threads/1343543475383046185).

## General Performance Improvements

1. **Adjust Voice Recognition Settings**: Fine-tune wait time and sensitivity settings to improve understanding [Source](https://discord.com/channels/1211482211119796234/1211483291191083018/threads/1333067714729607228).
2. **Optimize Prompt Engineering**: Structure prompts effectively for voice interactions and add clear error handling [Source](https://docs.vapi.ai/assistants).
3. **Test and Iterate**: Use Voice Testing to automate evaluation of your AI voice agents [Source](https://docs.vapi.ai/test/voice-testing).

Would you like me to elaborate on any specific aspect of these recommendations or help with implementing any particular solution?

Sources:
- [Discord Thread](https://discord.com/channels/1211482211119796234/1211483291191083018/threads/1342583376707850250)
- [Discord Thread](https://discord.com/channels/1211482211119796234/1211483291191083018/threads/1342349738032500736)
- [Discord Thread](https://discord.com/channels/1211482211119796234/1211483291191083018/threads/1341108728115757138)
j
Thanks. Why does the Neha voice from Vapi change language when saying numbers?
@User Hi
k
Check your multilingual settings. Make sure the transcriber and the corresponding language settings are configured to match the desired language for both text input and speech output. Choosing a voice that aligns with the intended language can also help keep pronunciation consistent.
s
To help track down this issue, could you share for each call ID:
- When exactly this happened (the timestamp)
- What response you expected to get
- What response you actually got instead

This would really help us figure out what went wrong!
I know you have already shared most of the information, but sharing the details above will help me investigate without any confusion or assumptions.
j
CALL-ID: f1e8ae95-322f-4d67-b1ae-6a08d5301121

- 4:15 - 4:27 -> The user gives their cell phone number and the transcription is fine.
- 4:28 - 4:39 -> The assistant says the correct cell phone number, but the transcription omits a digit (4).
  - Expected response - 304(4)126258: Tres cero cuatro CUATRO uno dos seis dos, cinco ocho.
  - Actual response - 304126258: Tres cero cuatro. Uno dos seis dos cinco ocho.
- 4:43 - 5:05 -> Since the number was not transcribed correctly, the function returns an error because the required 10 digits are missing (even though the number was correctly confirmed to the user). The response is then re-entered and the error corrected.
CALL-ID: a76bcbc5-54b1-4038-be35-e18abbdb5c2d

- 3:19 - 3:30 -> The user says the document number and the transcription misses two digits.
  - User response - 986(6)(9)538: Cédula de ciudadanía, nueve ocho seis SEIS NUEVE cinco tres ocho
  - User transcription - 986538: Cédula de ciudadanía, nueve ocho seis cinco tres ocho
- 3:30 - 3:42 -> The assistant says the document number and the transcription misses a digit.
  - AI response - (9)86538: Tipo de documento, cédula de ciudadanía, número de documento. NUEVE Ocho seis cinco tres ocho. Es correcto.
  - AI transcription - 86538: Tipo de documento, cédula de ciudadanía, número de documento. Ocho seis cinco tres ocho. Es correcto.
- 4:03 - 4:14 -> The assistant says the user's number correctly, but the transcription misses a digit.
  - AI response - 9866953(8): Tipo di documento, cédula de ciudadanía, número de documento, nueve ocho seis seis nueve cinco tres OCHO, es correcto?
  - AI transcription - 9866953: Tipo di documento, cédula de ciudadanía, número de documento, nueve ocho seis seis nueve cinco tres correcto?
- 4:52 - 5:05 -> The user says the phone number, but the transcriber misses a digit.
  - User response - 30441262(5)8: Tres cero cuatro cuatro uno dos seis dos CINCO ocho.
  - User transcription - 304412628: Tres cero cuatro cuatro uno dos seis dos ocho.
- 5:41 - 5:51 -> The assistant says the phone number, but the transcriber misses a digit.
  - AI response - 304412625(8): Número de celular. Tres cero cuatro cuatro uno dos seis dos cinco OCHO ¿Es correcto?
  - AI transcription - 304412625: Número de celular. Tres cero cuatro cuatro uno dos seis dos cinco ¿Es correcto?
- 5:55 -> In this call, although the transcriber did not capture the complete numbers, the assistant did send the correct number to the appointment creation function.
CALL-ID: 51061192-4aab-415a-9bd9-2ed4f09095f5

- 3:31 - 3:45 -> The assistant correctly repeats the user's document number, but the transcriber misses three digits in different positions.
  - AI response - 986(6)(9)5(3)7: Claro, tipo de documento. Cédula de ciudadanía, número de documento, nueve ocho, seis SEIS NUEVE cinco. TRES Siete. ¿Es correcto?
  - AI transcription - 98657: Claro, tipo de documento. Cédula de ciudadanía, número de documento, nueve ocho, seis cinco. Siete. ¿Es correcto?
NOTES:

1. I added the numbers as digits to make the errors easier to follow; the missing digits are shown in parentheses.
2. You can hear the voice changing to a Portuguese accent at most of the moments listed above. I opted to switch to an Azure voice, which doesn't present this issue.
3. After posting this ticket, I tested the two best Deepgram models for Latin American Spanish: Nova 2 (es-419) and Nova 3 (multi). Both continued to present issues. I chose to keep the Nova 3 (multi) model with the Confidence Threshold set to 0.23; it still works with difficulty.
4. I appreciate any help that lets me lower the failure rate when capturing numbers, as these are critical processes within my workflow. The DTMF keypad option is not something we want to implement.

Thanks in advance for the help. I'm excited about what we are creating and will create with VAPI.
s
Looking into it
j
Checking in to see if you've had a chance to review the issue. I’d really appreciate any updates or input.
@Vapi @Shubham Bajaj
s
@juliointheworld Hey, sorry I was out of office so couldn't reply to you earlier. I'll take a look at all of your call IDs now
@juliointheworld Your callers do say every digit, and the assistant does pronounce every digit, but the STT (speech-to-text) engine often collapses two identical digits that come back-to-back (“44”→“4”, “66”→“6”, etc.). Because your downstream function expects exactly 10 digits, any missing digit produces validation errors and a painful correction loop.
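To make the failure mode concrete, here is a minimal Python sketch of a pre-check you could run on a transcript before calling the downstream function. It is not part of Vapi: the helper names `transcript_to_digits` and `has_expected_length` are hypothetical, and it assumes callers read numbers digit by digit in Spanish, so it will miscount if words like "uno" or "dos" appear elsewhere in the sentence.

```python
import re
import unicodedata

# Spanish digit words mapped to digits.
# Assumption: numbers are read one digit at a time ("tres cero cuatro ...").
SPANISH_DIGITS = {
    "cero": "0", "uno": "1", "dos": "2", "tres": "3", "cuatro": "4",
    "cinco": "5", "seis": "6", "siete": "7", "ocho": "8", "nueve": "9",
}


def transcript_to_digits(transcript: str) -> str:
    """Collect every digit word or numeral in the transcript, in order."""
    # Lowercase and strip accents so "CUATRO" and "cuatro" are treated alike.
    normalized = unicodedata.normalize("NFKD", transcript.lower())
    normalized = "".join(ch for ch in normalized if not unicodedata.combining(ch))
    digits = []
    for token in re.findall(r"[a-z]+|\d", normalized):
        if token.isdigit():
            digits.append(token)
        elif token in SPANISH_DIGITS:
            digits.append(SPANISH_DIGITS[token])
    return "".join(digits)


def has_expected_length(digits: str, expected: int = 10) -> bool:
    """True when the digit string has exactly the required number of digits."""
    return len(digits) == expected


# Transcript from call f1e8ae95 (one "cuatro" was dropped by the STT engine).
heard = transcript_to_digits("Tres cero cuatro. Uno dos seis dos cinco ocho.")
if not has_expected_length(heard):
    print(f"Only {len(heard)} digits captured ({heard}); ask the caller to repeat.")
```

A check like this lets the assistant ask for a repeat right away instead of sending a 9-digit number to the function and waiting for it to fail.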
What you can do here:

- **SSML Parsing for 11Labs**: Enable SSML parsing for 11Labs and the assistant, then pass digits to the 11Labs voice provider with a break (pause) time between digits.
- **Number Chunking**: Instruct the assistant to break two-digit groups into individual digits during output generation.
- **Automatic Self-Critique**: Implement an automatic self-critique that checks the tool call response. If an error is detected, the assistant should apologize, re-spell the digits slowly, confirm with the user, and then retrigger the tool.
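As a rough illustration of the first two suggestions, the Python sketch below splits a number into individual digits and puts an SSML `<break>` tag between them. The helper name `digits_to_ssml` and the 0.3 s default pause are assumptions for illustration only, and whether the pause is honored depends on SSML parsing being enabled for your voice provider.

```python
def digits_to_ssml(number: str, pause_s: float = 0.3) -> str:
    """Return the digits of `number` one by one, separated by SSML breaks.

    Non-digit characters (spaces, dashes, parentheses) are ignored, so
    "304-412-6258" and "3044126258" produce the same output.
    """
    digits = [ch for ch in number if ch.isdigit()]
    pause = f' <break time="{pause_s:.1f}s"/> '
    return pause.join(digits)


# Example: a short pause after every digit, including repeated ones ("4 4").
print(digits_to_ssml("3044126258"))
```

Spacing every digit the same way means repeated digits such as "44" are spoken as two clearly separated sounds, which is the case the transcriber was collapsing.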