OK, thank you @User!
So just to clarify:
- Endpointing, Interruptions and Background noise filtering are all fine with Japanese
- Backchanneling: if the user says a Japanese equivalent of "uh-huh", e.g. そうだね, it would NOT recognise that, since it only recognises English; but vice versa (the agent backchanneling the user) it could respond with そうだね, because it translates the English "uh-huh" into Japanese
- Emotion Detection: it does work on Japanese text in principle, but it wasn't trained for Japanese, so it's not very effective
- Filler Injection: does it know Japanese filler words (an "umm..." in Japanese would be e.g. あのう。。。) well enough to achieve this? Could you explain a little more about how it works, as that would help me understand - i.e. is it taking the text from the LLM and using another LLM to inject filler into the output before it gets sent to text-to-speech? In which case it would depend on that post-processing LLM's support for Japanese, I suppose. Rough sketch of what I have in mind just below.
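To be concrete, this is roughly the pipeline I'm imagining - purely a hypothetical sketch, none of these function names are Vapi's actual API, they're just placeholders to show where Japanese support would matter:

```python
# Hypothetical sketch of the filler-injection flow I'm imagining.
# All function names are placeholders, NOT Vapi's real API.

def main_llm_reply(user_utterance: str) -> str:
    """Placeholder for the assistant's main LLM response."""
    return "はい、ご予約は明日の午後3時で承りました。"  # example Japanese reply

def inject_fillers(text: str, language: str) -> str:
    """Placeholder for a post-processing step that adds natural filler words.

    My question: if this step is done by a second LLM, does that model know
    Japanese fillers such as あのう… / えっと…, or only English ones (umm, uh)?
    """
    if language == "ja":
        return "えっと、" + text  # naive illustration, not a real implementation
    return "Umm, " + text

def text_to_speech(text: str) -> bytes:
    """Placeholder for the TTS provider (e.g. Azure) that voices the final text."""
    return text.encode("utf-8")  # stand-in for audio bytes

# The flow I'm guessing at: LLM reply -> filler injection -> TTS
reply = main_llm_reply("明日の午後3時に予約できますか？")
audio = text_to_speech(inject_fillers(reply, language="ja"))
```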
The other related question I had was about latency in Japan: I can't see any option in Vapi to specify the region. I guess Vapi uses a mixture of regions depending on the individual APIs each user selects, so it would be hard to say which region(s) it's currently running in?
I ask because, for example, I think Azure's text-to-speech can in principle be hosted in a Japanese region for lower latency, but am I right that with Vapi we would not be able to make use of that?
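For reference, this is what I mean about Azure: calling their Speech SDK directly, you can pin the resource to a Japanese region yourself. A minimal sketch, assuming a Speech resource created in japaneast and `azure-cognitiveservices-speech` installed (the key is a placeholder) - I'm just wondering whether anything equivalent is possible when the TTS call goes through Vapi:

```python
# Minimal sketch of Azure TTS pinned to the Japan East region, outside of Vapi.
# Assumes a Speech resource in japaneast; the subscription key is a placeholder.
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(
    subscription="YOUR_SPEECH_KEY",  # placeholder key
    region="japaneast",              # Japanese region, to keep TTS round-trips short
)
speech_config.speech_synthesis_voice_name = "ja-JP-NanamiNeural"  # a Japanese neural voice

synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
result = synthesizer.speak_text_async("こんにちは、ご予約を承ります。").get()
print(result.reason)  # e.g. ResultReason.SynthesizingAudioCompleted
```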