Here's a recommendation: when generating static audio files for playback, use a 16 kHz sample rate with 64 kbps encoding. That matches the G.722 codec, so the Jambonz transcoder that VAPI uses won't have to do anything funky to the audio.
Telephony in general is rather behind the times: the standard is 8 kHz, 64 kbps mono. The files you created are most probably 44.1 kHz, 128 kbps stereo.
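If it helps, here is a rough sketch of how you could re-encode an existing file to 16 kHz, 64 kbps mono from Python. It assumes `ffmpeg` is installed and on your PATH; the function names and file names are made up for illustration, and this is just one way to do it, not something VAPI or Jambonz prescribes.

```python
import subprocess


def build_ffmpeg_command(src: str, dst: str) -> list[str]:
    """Build an ffmpeg command that re-encodes src as 16 kHz, 64 kbps mono."""
    return [
        "ffmpeg",
        "-y",            # overwrite dst if it already exists
        "-i", src,       # input file (e.g. a 44.1 kHz stereo MP3)
        "-ar", "16000",  # resample to 16 kHz
        "-ac", "1",      # downmix to a single (mono) channel
        "-b:a", "64k",   # request an audio bitrate of 64 kbps
        dst,             # output file; the extension selects the container
    ]


def convert(src: str, dst: str) -> None:
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_ffmpeg_command(src, dst), check=True)
```

You would call it as something like `convert("greeting.mp3", "greeting_16k.mp3")`. Note that the bitrate flag only matters for compressed outputs such as MP3; for an uncompressed WAV the bitrate follows from the sample rate and bit depth.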