UPDATE 29th April: Today my YouTube app is behaving slightly differently, using different algorithms for Cardboard and non-Cardboard modes (mono vs stereo video), but with both using the W, X, Y & Z channels. I’ll retest soon and update this post accordingly when I get a chance, as yesterday the audio was exactly the same in both modes, the only difference being the Android (Ambisonic) vs non-Android (W/Y stereo) behaviour detailed below.
So, I’ve been experimenting with YouTube’s Ambisonic to Binaural VR videos. They work, sound spacious, and head tracking also functions (although there seems to be some lag compared to the video, at least on my Sony Z3), but I thought I’d have a dig around and test how they’re implementing it, to see what compromises have been made for mobile devices (as the localisation could be sharper…).
Cut to the chase – YouTube are using short, anechoic Head Related Transfer Functions that also assume the head is symmetrical. This means the Ambisonics to Binaural algorithm boils down to just four short Finite Impulse Response (FIR) filters that need convolving in real time with the B-Format channels (W, X, Y & Z in Furse-Malham/SoundField notation – I know YouTube uses ambiX, but I’m sticking with this for now!). These optimisations are likely needed to make the algorithm run on a wider range of mobile phones.
So, how do I know this? I put a test signal (a log sine wave sweep) on each of the B-Format channels in turn and recorded back the stereo output, allowing me to measure a left and right response for each of the four channels individually. I carried this out with the phone both facing front and rotated to +90 degrees, to check that the rotation algorithm was working. Below are the Head Related Impulse Responses (HRIRs) I got back (click for higher res) – these will also contain any filtering etc. from my phone and computer, but they have turned out pretty well considering I had to hold the phone still in the correct position!
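For anyone wanting to reproduce the measurement, here’s a minimal sketch of an exponential (log) sine sweep and its inverse filter (Farina’s swept-sine method). The function name and parameters are my own, purely illustrative – nothing to do with YouTube’s code:

```python
import numpy as np

def log_sweep(f1=20.0, f2=20000.0, T=10.0, fs=48000):
    """Exponential sine sweep and its inverse filter (Farina method).

    Convolving a recorded response with the inverse filter
    deconvolves the sweep, leaving the impulse response."""
    t = np.arange(int(T * fs)) / fs
    R = np.log(f2 / f1)
    sweep = np.sin(2 * np.pi * f1 * T / R * (np.exp(t * R / T) - 1.0))
    # Time-reverse and apply a decaying envelope to compensate for the
    # sweep's pink (1/f) energy distribution.
    inverse = sweep[::-1] * np.exp(-t * R / T)
    return sweep, inverse
```

Playing the sweep on one B-Format channel at a time and deconvolving the recorded stereo output gives you that channel’s left- and right-ear HRIR.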
The fact that the left and right HRIRs are identical (or polarity inverted) shows that they’ve used the symmetrical head assumption, and the X and Y channels swapping between the 0 and 90 degree facings shows the rotation being carried out on the Ambisonic channel signals. Once you’ve got these HRIRs, generating the left and right headphone signals on the phone is (where ⊛ indicates convolution):

Left = W ⊛ hW + X ⊛ hX + Y ⊛ hY + Z ⊛ hZ
Right = W ⊛ hW + X ⊛ hX – Y ⊛ hY + Z ⊛ hZ

However, this is likely to be carried out in the frequency domain where it’s more efficient.
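As a sketch (not YouTube’s actual code), the symmetrical-head rendering only needs four time-domain convolutions, since left and right share everything except the polarity of the Y contribution:

```python
import numpy as np

def ambi_to_binaural(W, X, Y, Z, hW, hX, hY, hZ):
    """Render first-order B-Format to binaural with a symmetrical head.

    Left and right share the same W, X and Z filtered signals; only
    the Y (left-right) contribution flips polarity for the right ear."""
    common = np.convolve(W, hW) + np.convolve(X, hX) + np.convolve(Z, hZ)
    side = np.convolve(Y, hY)
    left = common + side
    right = common - side
    return left, right
```

In practice you’d do this block-wise in the frequency domain (e.g. `scipy.signal.fftconvolve` or overlap-add), as the post notes, but the signal flow is the same.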
Below is also the frequency response of the four Ambisonic HRTFs where you can see YouTube cutting off the response at around 16.4kHz (again, click for higher resolution).
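Plots like these just need the magnitude spectrum of each measured HRIR. A small helper (again, my own illustrative code) for getting a frequency axis and response in dB:

```python
import numpy as np

def magnitude_db(hrir, fs=48000, nfft=1024):
    """Return frequency axis (Hz) and magnitude response (dB) of an HRIR."""
    H = np.fft.rfft(hrir, nfft)
    freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)
    # Small offset avoids log10(0) for zero bins.
    return freqs, 20.0 * np.log10(np.abs(H) + 1e-12)
```

Plotting the result for each of the four filters is where the ~16.4kHz cut-off shows up.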
Once the W, X, Y & Z filters were obtained, a problem emerged: although I sent the log sine sweeps every 12 seconds (for a 10 second sweep), because I was recording in the analogue domain (no clock sync between phone and computer) small clock differences caused the filters to be slightly misaligned. This is most easily seen when a source is simulated at 90 degrees with respect to the listener’s head, as this should exhibit the greatest amplitude difference between the ears once the Ambisonics to Binaural algorithm is applied. The best alignment was achieved when an extra 2 samples (2/48000 of a second) of delay was added for each 12 second ‘jump’. The frequency plots for a front-facing head with a source at +90 degrees, and a +90 degree facing head with a source at 0 degrees, are shown below (so in the first plot the left ear should be loudest, and in the second the right ear should be loudest). I’ve also trimmed the HRIRs to 512 samples and windowed the responses using a Hanning window. The actual clock difference is unlikely to be an exact multiple of 1 sample, so this method isn’t quite ideal!
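The segment extraction with the per-repeat fudge factor looks something like this (names and structure are mine, just to illustrate the idea of nudging each successive block by a couple of samples):

```python
import numpy as np

def extract_sweeps(recording, fs=48000, n=4, gap=12.0, length=10.0, drift=2):
    """Cut the n recorded sweeps out of one long capture.

    Each successive block (one per B-Format channel, `gap` seconds
    apart) is offset by an extra `drift` samples to compensate for
    the clock difference between phone and computer."""
    L = int(length * fs)
    segments = []
    for k in range(n):
        start = int(k * gap * fs) + k * drift
        segments.append(recording[start:start + L])
    return segments
```

Each deconvolved HRIR was then trimmed to 512 samples and multiplied by `np.hanning(512)` before taking the frequency responses above. A better approach would be to estimate and resample out the fractional clock drift rather than shifting by whole samples.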
These plots should be identical, but measurement inconsistencies can be seen between the far-ear responses to the source (as these are lower in level, they’re more prone to error), and by inspection it seems the head at +90 degrees is the slightly better capture at this point. Also, remember that the response above 16.4kHz is not worth worrying about, as YouTube filters out frequencies above this value.
More to follow…
Ambisonics To Stereo
An aside on stereo. I’ve not put any plots up yet, but it seems like the Ambisonics to Stereo algorithm used (on non-Android YouTube) is simply:

Left = W + Y
Right = W – Y
Google should really look into using UHJ for their Ambisonics to Stereo conversion. For an example of the difference, listen to the audio on these two videos: the first is Ambisonics to UHJ, the second uses YouTube’s Ambisonics to Stereo algorithm detailed above (to carry out this test, DO NOT use the Android YouTube app!).
Wiggins, B., Paterson-Stephens, I., Schillebeeckx, P. (2001) The analysis of multi-channel sound reproduction algorithms using HRTF data. 19th International AES Surround Sound Convention, Germany, p. 111-123.
Wiggins, B. (2004) An Investigation into the Real-time Manipulation and Control of Three-dimensional Sound Fields. PhD thesis, University of Derby, Derby, UK. p. 103.
McKeag, A., McGrath, D. (1996) Sound Field Format to Binaural Decoder with Head-Tracking. 6th Australian Regional Convention of the AES, Melbourne, Australia, 10 – 12 September. Preprint 4302.
McKeag, A., McGrath, D.S. (1997) Using Auralisation Techniques to Render 5.1 Surround To Binaural and Playback. 102nd AES Convention, Munich, Germany, 22 – 25 March. Preprint 4458.
Noisternig, M. et al. (2003) A 3D Ambisonic Based Binaural Sound Reproduction System. Proceedings of the 24th International Conference on Multichannel Audio, Banff, Canada.
Leitner et al. (2000) Multi-Channel Sound Reproduction System for Binaural Signals – The Ambisonic Approach. Proceedings of the COST G-6 Conference on Digital Audio Effects (DAFx-00), Verona, Italy, December, p. 277 – 280.