Japanese support
# support
b
I'd like to better understand which aspects of the Orchestration layer are expected to work with Japanese, and how well:
- Endpointing: "Based on both the user's tone and what they're saying, it decides how long to pause before hitting the LLM." Does it understand Japanese for this purpose?
- Interruptions (barge-in): Does it work with Japanese interruption words?
- Background Voice Filtering: Is it trained on any Japanese-language voices?
- Backchanneling: Can it avoid interruptions when the user backchannels in Japanese?
- Emotion Detection: Is it based on language or tone of voice (i.e. would it work with Japanese)?
- Filler Injection: Can it add Japanese filler words?
@User
s
@Ben
- Endpointing: You can use the Deepgram endpointing model for this.
- Interruptions: Yes, it handles interruptions.
- Background Voice Filtering: It focuses on isolating the user's voice from background noise, prioritizing the user's speech signal.
- Backchanneling: By default it is generated in English, but it will be voiced in Japanese if your voice is set to Japanese.
- Emotion Detection: It works on the text model, but is not very effective.
- Filler Injection: This setting determines whether fillers are injected into the model output before it is sent to the voice provider (see the sketch below).
Do let me know if you have more questions.
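For reference, here's a minimal sketch of how those pieces might be set on an assistant via the API. The transcriber/model/voice shape follows the public assistant config, but the specific toggle names (`backchannelingEnabled`, `backgroundDenoisingEnabled`, `fillerInjectionEnabled`) and the `"ja"` language code are assumptions based on this thread, so please verify them against the current API reference.

```ts
// Rough sketch of a Japanese-leaning assistant config sent to the Vapi API.
// The orchestration toggle names below are assumptions from this thread --
// check the current API reference before relying on them.
const assistant = {
  name: "japanese-support-agent",
  transcriber: {
    provider: "deepgram",   // Deepgram handles endpointing, per the answer above
    language: "ja",         // assumed language code for Japanese transcription
  },
  model: {
    provider: "openai",
    model: "gpt-4o",
    messages: [{ role: "system", content: "日本語で応答してください。" }],
  },
  voice: {
    provider: "azure",
    voiceId: "ja-JP-NanamiNeural", // a Japanese voice, so backchannels/fillers are voiced in Japanese
    fillerInjectionEnabled: true,  // assumed flag: inject fillers before text-to-speech
  },
  backchannelingEnabled: true,      // assumed flag
  backgroundDenoisingEnabled: true, // assumed flag
};

const res = await fetch("https://api.vapi.ai/assistant", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.VAPI_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify(assistant),
});
console.log(await res.json());
```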
b
OK, thank you @User! So just to clarify:
- Endpointing, interruptions, and background noise filtering are all fine with Japanese.
- Backchanneling: if the user says a Japanese equivalent of "uh-huh", e.g. そうだね, it would NOT recognise that, since it only recognises English. But the reverse (the agent backchanneling the user) could produce そうだね, since the English "uh-huh" is translated into Japanese.
- Emotion Detection: it does work on Japanese text in principle, but it isn't trained for that, so it's not very effective.
- Filler Injection: does it know Japanese filler words ("umm..." in Japanese would be e.g. あのう…) so that it can achieve this? Could you explain a little more about how it works? For example, is it taking the text from the LLM and using another LLM to inject fillers into the output before it is sent to text-to-speech? In that case it would depend on that post-processing LLM's support for Japanese, I suppose.

The other related question I had was about latency in Japan. I can't see any options in Vapi to specify the region; I guess Vapi uses a mixture of regions based on the individual APIs selected by users, so it would be hard to say which region(s) it's currently based in? I ask because, for example, Azure's text-to-speech can in principle be hosted in a Japanese region for lower latency, but with Vapi is it right that we would not be able to use that?
s
Regarding regions: as of now, you need to use your own keys and specify the required region for STT through the provider credentials section only (see the sketch below).
- Backchanneling: I meant that it is generated in English but can be spoken in the voice's language, depending on the voice you use.
- Emotion Detection: It works on text regardless of the language.
- Filler Injection: This setting determines whether fillers are injected into the model output before it is passed to the voice provider.
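To illustrate the region point, here is a rough sketch of registering your own Azure key with a Japan region. The endpoint and payload shape (provider slug, `region` field) are assumptions for illustration only; in practice this is done through the dashboard's provider credentials section, and the actual fields may differ.

```ts
// Rough sketch: bring your own Azure Speech key pinned to a Japan region so
// speech calls are served closer to your users. Payload shape is assumed --
// use the dashboard's provider credentials section as the source of truth.
const credential = {
  provider: "azure",                      // assumed provider slug
  apiKey: process.env.AZURE_SPEECH_KEY,   // your own Azure key
  region: "japaneast",                    // Azure region close to Japanese users
};

const res = await fetch("https://api.vapi.ai/credential", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.VAPI_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify(credential),
});
console.log(await res.json());
```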