YouTube 360 VR Ambisonics Teardown!

UPDATE : 4th May 2016 – I’ve added a video using the measured filters. This will be useful for auditioning the mixes before uploading them to YouTube.

So, I’ve been experimenting with YouTube’s Ambisonic to Binaural VR videos. They work, sound spacious and head tracking also functions (although there seems to be some lag, compared to the video – at least on my Sony Z3), but I thought I’d have a dig around and test how they’re implementing it to see what compromises they’ve made for mobile devices (as the localisation could be sharper…)

Cut to the chase – YouTube are using short, anechoic Head Related Transfer Functions that also assume that the head is symmetrical. Doing this means you can boil down the Ambisonics to Binaural algorithm to just four short Finite Impulse Response Filters that need convolving in real-time with the B-Format channels (W, X, Y & Z in Furse Malham/SoundField notation – I know YouTube uses ambiX, but I’m sticking with this for now!). These optimisations are likely needed to make the algorithm work on more mobile phones.

So, how do I know this? I put a test signal (log sine wave sweep) on each of the B-Format channels in turn and recorded the resulting stereo output, which let me measure a left and right response for each of the four channels individually. I did this with the phone both facing front and rotated to +90 degrees, to check the rotation algorithm was working. Below are the Head Related Impulse Responses (HRIRs) I got back – these will also contain any filtering etc. from my phone and computer, but have turned out pretty well considering I had to hold the phone still in the correct position!
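For anyone wanting to repeat the measurement, the sweep-and-deconvolve step can be sketched in Python with NumPy roughly as below. This is an illustrative sketch rather than the exact processing used for the plots here: the function names, the 20 Hz to 20 kHz range in the usage note and the choice of inverse filter (a time-reversed sweep with a decaying envelope) are all assumptions.

```python
import numpy as np

def log_sweep(f1, f2, duration, fs):
    """Exponential (log) sine sweep from f1 to f2 Hz lasting `duration` seconds."""
    t = np.arange(int(duration * fs)) / fs
    rate = np.log(f2 / f1)
    phase = 2 * np.pi * f1 * duration / rate * (np.exp(t * rate / duration) - 1)
    return np.sin(phase)

def inverse_filter(sweep, f1, f2, duration, fs):
    """Time-reversed sweep with a decaying envelope.

    The envelope compensates for the sweep spending longer at low frequencies.
    """
    t = np.arange(len(sweep)) / fs
    rate = np.log(f2 / f1)
    return sweep[::-1] * np.exp(-t * rate / duration)

def measure_ir(recorded, inv):
    """Convolve the recording with the inverse filter using FFTs.

    The impulse response appears delayed by len(inv) - 1 samples and is not
    normalised in level.
    """
    n = len(recorded) + len(inv) - 1
    nfft = 1 << (n - 1).bit_length()  # next power of two, avoids circular wrap
    spectrum = np.fft.rfft(recorded, nfft) * np.fft.rfft(inv, nfft)
    return np.fft.irfft(spectrum, nfft)[:n]
```

In use, one sweep would go on W only (then X, Y and Z in turn), with `measure_ir` applied to the recorded left and right channels separately, for example after `sweep = log_sweep(20, 20000, 10, 48000)`.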

The fact that the left and right HRIRs are identical (or polarity inverted) shows that they’ve used the symmetrical head assumption, and the X and Y channels swapping between facing 0 and 90 degrees shows the rotation being carried out on the Ambisonic channel signals. Once you’ve got these HRIRs, the Left and Right headphone signals are generated in the phone as follows (where the multiply with a circle indicates convolution). However, this is likely to be carried out in the frequency domain, where it’s more efficient.

$L=W\otimes W_{hrir}+X\otimes X_{hrir}+Y\otimes Y_{hrir}+Z\otimes Z_{hrir}$
$R=W\otimes W_{hrir}+X\otimes X_{hrir}-Y\otimes Y_{hrir}+Z\otimes Z_{hrir}$
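The two equations, plus the yaw rotation applied to the B-Format signals beforehand, can be sketched in Python with NumPy as below. This is my own illustration, not YouTube’s code: the function names are made up, the rotation sign convention (positive yaw meaning the head turns to the left) is an assumption chosen to match the measurements above, and plain time-domain `np.convolve` is used for clarity rather than the frequency-domain convolution a phone would more likely use.

```python
import numpy as np

def rotate_yaw(w, x, y, z, head_yaw_rad):
    """Counter-rotate the sound field for a head turned left by head_yaw_rad.

    W and Z are unaffected by rotation about the vertical axis.
    """
    c, s = np.cos(head_yaw_rad), np.sin(head_yaw_rad)
    return w, c * x + s * y, -s * x + c * y, z

def decode_binaural(w, x, y, z, w_hrir, x_hrir, y_hrir, z_hrir):
    """Symmetric-head decode: four convolutions give both ear signals.

    All four HRIRs are assumed to have the same length.
    """
    # W, X and Z contribute identically to both ears.
    common = np.convolve(w, w_hrir) + np.convolve(x, x_hrir) + np.convolve(z, z_hrir)
    # Y (left/right) is the only term that changes sign between the ears.
    side = np.convolve(y, y_hrir)
    return common + side, common - side
```

With a head yaw of +90 degrees, a source encoded at the front (X only) ends up entirely in the Y channel with negative sign, i.e. to the listener’s right, which matches the X and Y swap seen in the measured responses.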

Below is the frequency response of the four Ambisonic HRTFs, where you can see YouTube cutting off the response at around 16.4kHz.

I sent a 10 second log sine sweep every 12 seconds but, because I was recording in the analogue domain (no clock sync between phone and computer), clock differences left the extracted W, X, Y & Z filters slightly misaligned. This is most easily seen by simulating a source at 90 degrees with respect to the listener’s head, as this should exhibit the greatest amplitude difference between the ears once the Ambisonics to Binaural algorithm is implemented. The greatest difference was achieved when an extra 2 samples (2/48000 of a second) of delay was added to each 12 second ‘jump’. The frequency plots for a front facing head and a source at +90 degrees, and a +90 degrees facing head and a source at 0 degrees, are shown below (so in the first plot the left ear should be loudest, and in the 2nd plot the right ear should be loudest). I’ve also trimmed the HRIRs to 512 samples and windowed the responses using a Hann window. It is likely that the actual clock difference isn’t a whole number of samples, so this method isn’t quite ideal!
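As a rough sketch of that alignment and trimming step in Python with NumPy: the helper names are hypothetical, and the window shape is an assumption (only the falling half of a Hann window is applied here, so the start of the response is left almost untouched; the exact window placement used for the plots may differ).

```python
import numpy as np

FS = 48000
SWEEP_PERIOD = 12 * FS   # one sweep was sent every 12 seconds
DRIFT_PER_PERIOD = 2     # extra samples per 12 second jump, found by trial

def segment_start(first_start, k):
    """Start index of the k-th sweep's response in the recording, allowing for drift."""
    return first_start + k * (SWEEP_PERIOD + DRIFT_PER_PERIOD)

def trim_and_window(ir, length=512):
    """Keep the first `length` samples and fade them out.

    The fade is the falling half of a Hann window: it starts just below 1
    and reaches 0 at the last sample. Shorter inputs are zero-padded.
    """
    out = np.zeros(length)
    n = min(length, len(ir))
    out[:n] = np.asarray(ir, dtype=float)[:n]
    fade = np.hanning(2 * length)[length:]
    return out * fade
```

Two samples in every 576000 corresponds to a clock mismatch of roughly 3.5 parts per million, which is why a whole-sample correction can only ever be approximate.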

These plots should be identical, but measurement inconsistencies can be noted between the far ear responses to the source (as these are lower level, they’re more prone to error), and by inspection it seems like the head at +90 degrees is a slightly better capture at this point. Also, remember that the response above 16.4kHz is not worth worrying about, as YouTube filters out frequencies above this value.

More to follow…

Ambisonics To Stereo

An aside on stereo. I’ve not put any plots up yet, but it seems like the Ambisonics to Stereo algorithm used (on non-Android YouTube) is simply:

$L=W+Y$
$R=W-Y$
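To get a feel for what this does to a panned source, here is a small Python sketch using NumPy. It assumes Furse-Malham weighting (W carries the source at a gain of 1/√2 and Y at the sine of the azimuth, for a source in the horizontal plane), in keeping with the notation used in this post; the function names are hypothetical.

```python
import numpy as np

def bformat_to_stereo(w, y):
    """The simple decode above: L = W + Y, R = W - Y."""
    w = np.asarray(w, dtype=float)
    y = np.asarray(y, dtype=float)
    return w + y, w - y

def stereo_gains(azimuth_deg):
    """Left/right gains for a horizontal source (positive azimuth = left).

    Assumes Furse-Malham encoding: W = 1/sqrt(2), Y = sin(azimuth).
    """
    az = np.radians(azimuth_deg)
    left, right = bformat_to_stereo(1 / np.sqrt(2), np.sin(az))
    return float(left), float(right)
```

Under these assumptions a source at +90 degrees gives gains of about 1.71 on the left and -0.29 on the right, so the far channel carries a quieter, polarity-inverted copy of the source instead of falling silent.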

Google should really look into using UHJ for their Ambisonics to Stereo conversion…for an example of the difference, listen to the audio on these two videos. The first one is Ambisonics to UHJ, the second will be YouTube’s Ambisonics->Stereo algorithm detailed above (to carry out this test DO NOT use the Android YouTube app!)

UHJ : https://youtu.be/d2rrsjt44rs
W/Y Stereo : https://youtu.be/i3TjIiaKDVU

How do they Sound?

Here’s a video using the extracted filters in Reaper to convert the 1st order Ambisonic audio to binaural.

Wider Reading:

Wiggins, B., Paterson-Stephens, I., Schillebeeckx, P. (2001) The analysis of multi-channel sound reproduction algorithms using HRTF data. 19th International AES Surround Sound Convention, Germany, p. 111-123.

Wiggins, B. (2004) An Investigation into the Real-time Manipulation and Control of Three-dimensional Sound Fields. PhD thesis, University of Derby, Derby, UK. p. 103

McKeag, A., McGrath, D. (1996) Sound Field Format to Binaural Decoder with Head-Tracking. 6th Australian Regional Convention of the AES, Melbourne, Australia. 10 – 12 September. Preprint 4302.

McKeag, A., McGrath, D.S. (1997) Using Auralisation Techniques to Render 5.1 Surround To Binaural and Playback. 102nd AES Convention in Munich, Germany, 22 – 25 March. Preprint 4458.

Noisternig, M. et al. (2003) A 3D Ambisonic Based Binaural Sound Reproduction System. Proceedings of the 24th International Conference on Multichannel Audio, Banff, Canada.

Leitner et al. (2000) Multi-Channel Sound Reproduction system for Binaural signals – The Ambisonic Approach. Proceedings of the COST G-6 Conference on Digital Audio Effects (DAFX-00), Verona, Italy, December, p. 277 – 280.

4 Replies to “YouTube 360 VR Ambisonics Teardown!”

Angelo Farina says:

July 27th, 2016 at 11:02 pm

Hi Bruce,
You may find it useful to access the original HRTF set used by Google for their 1st-order Ambisonics decoder.
You can access to them here:
https://github.com/google/spatial-media/tree/master/support/hrtfs/cube
As you will see, the decoder is based on 8 binaural IRs, corresponding to 8 virtual loudspeakers located at the vertices of a cube. No loudspeakers at ear height…
The repository also contains the coefficients employed for computing the speaker feeds from the ambiX signals:
0.125 0.216495 0.21653 -0.216495
0.125 -0.216495 0.21653 -0.216495
0.125 0.216495 -0.21653 -0.216495
0.125 -0.216495 -0.21653 -0.216495
0.125 0.216495 0.21653 0.216495
0.125 -0.216495 0.21653 0.216495
0.125 0.216495 -0.21653 0.216495
0.125 -0.216495 -0.21653 0.216495
Hence the virtual microphone feeding each virtual loudspeaker is some sort of Hypercardioid, as fixing to 1 the gain in frontal direction, the rear negative lobe has a gain of – 0.5.
It also appears that the directivity is frequency independent.
Unfortunately, only the decoding coefficients for 1st order are provided. It would be nice to see what are the 3rd-order coefficients employed in Jump Inspector.

You will also see that the 4 IRs are sampled at 48 kHz, 16 bits, and that they are 512 samples long.
Bruce Wiggins says:

July 28th, 2016 at 11:05 am

Thanks, Angelo. Yes, they released them not long after I’d measured Google’s B-format IRs, but haven’t had a chance to look at their implementation. I’m hoping to take a look soon 🙂
JJ Wiesler says:

March 20th, 2017 at 7:55 pm

Do you know if Jump Inspector uses the same ambi decoder?
Bruce Wiggins says:

March 20th, 2017 at 8:22 pm

I haven’t checked, I’m afraid. Logically, it should (as an offline checker)……