YouTube, Ambisonics and VR

Introduction

So, last week Google enabled head (phone!) tracked positional audio on 360 degree videos.  Ambisonics is now one of the de facto standards for VR audio.  This is a big moment!  I’ve been playing a little with some of the command line tools needed to get this to work, and also with using Google PhotoSphere pics as the video since, currently, I don’t have access to a proper 360 degree camera.  You’ll end up with something like this:

https://youtu.be/VX_gOGFgt14

So first, the details.

Make an Ambisonic Mix

Google are using Ambisonics as the way of storing and steering the audio.  Ambisonics is a useful format for VR/gaming because, once encoded, the scene can still be manipulated/rotated without needing access to all the audio sources/tracks individually.  This is great for applications that need to steer the audio render in real time.  The Ambisonic signals can then be decoded to loudspeakers or, in this case, to head related transfer functions (HRTFs) instead, resulting in signals ready for headphones.  Now, instead of rotating the speakers (HRTFs) when the head moves, the Ambisonic scene is rotated instead, which is a much simpler transform.  One early paper on this technique is McKeag and McGrath (1996) and I also looked at it in my PhD thesis (page 103 of Wiggins 2004).

YouTube currently uses 1st Order Ambisonics (4 channels) with the ACN channel sequence and the SN3D normalisation scheme – the combination known as ambiX.  WARNING: THIS ISN’T WHAT ANY MICs OR MOST PLUG-INS OUTPUT!  However, it is a more extendable format (channel sequence wise).  A simple, but necessary, channel remapping and gain change MUST be done before saving the 4 channel B-Format bus.  I’ve made a JS (Reaper) plug-in that does this.  It can be downloaded below (it’ll work up to 3rd Order, 16 channels, but YouTube only accepts 1st order…for now 😉).  As always, other people’s tools are also available to do this, but as a learning exercise (so I can teach it at Derby!) I’ve made my own. (How to install a JS Effect)

WigAmbiRemap JS Effect

So, at this point, let’s pretend you’ve made/recorded an Ambisonic mix (perhaps after watching these videos).  You then need to put the plug-in above on the Total B-Format Bus track (or, ideally, one that is routed from it, so it doesn’t mess up your speaker decode!).  You can then set the plug-in as shown below and it’ll turn your Furse-Malham/SoundField B-Format into ambiX B-Format.

AmbiSignalRemap
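For reference, the remap itself is tiny.  Here’s a minimal Python/NumPy sketch of the same first-order conversion the plug-in performs – FuMa WXYZ in, ambiX (ACN channel sequence, SN3D normalisation) out.  The file names and the soundfile library are just choices for this example, not anything the plug-in itself uses:

# Minimal sketch: first-order FuMa B-Format (W,X,Y,Z) to ambiX (ACN/SN3D)
import numpy as np
import soundfile as sf

fuma, sr = sf.read("BFormat_FuMa.wav")    # shape: (samples, 4) = W, X, Y, Z

ambix = np.empty_like(fuma)
ambix[:, 0] = fuma[:, 0] * np.sqrt(2.0)   # ACN 0 (W): undo FuMa's -3 dB on W
ambix[:, 1] = fuma[:, 2]                  # ACN 1 (Y)
ambix[:, 2] = fuma[:, 3]                  # ACN 2 (Z)
ambix[:, 3] = fuma[:, 1]                  # ACN 3 (X)

sf.write("ambiX.wav", ambix, sr, subtype="PCM_16")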

This should now be rendered/saved as a 4-channel WAVE file (again, see the videos on the teaching page if you’re not sure how to do this – I’ll make a new video covering all of this soon!).

Generate a 360 Photo

Ok, so what about a stereoscopic, 360 degree video?  Well, I don’t have a camera for these, so instead, I’ve used a jpeg made by the excellent Google Camera app on Android (a PhotoSphere).  Note that, just as this feature has become really useful, Google has taken Google Camera off the Play Store!!!!  It can still be found on APK mirror sites, though (see here, for example, but my phone is stuck on V2.5.052 – I think V3+ is only for Nexus devices, hence why I can no longer see it on the Play Store).

So, let’s assume we now have the following:

  1. An ambiX 4 channel B-Format Wave File (ambiX.wav)
  2. A high resolution photosphere jpg (PS.jpg)

The PhotoSphere jpeg is an equirectangular mapped 360 degree image.  This is handy, as it’s the exact format that YouTube wants.  However, for it to work, you may need to reduce the resolution of the image so that it’s a maximum of 3840 x 2160 (my images are higher res than that, but YouTube wouldn’t process them!).  Details on video formats/tips can be found at this Google help page.
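If your PhotoSphere is too big, anything that can resample a jpeg will do the job.  As a rough example, here’s a small Python sketch using the Pillow library (just my choice for illustration – the file names are placeholders):

# Rough sketch: shrink an equirectangular jpeg so it fits within 3840 x 2160
from PIL import Image

img = Image.open("PS.jpg")
scale = min(3840 / img.width, 2160 / img.height, 1.0)
if scale < 1.0:
    new_size = (int(img.width * scale), int(img.height * scale))
    img = img.resize(new_size, Image.LANCZOS)
img.save("PS_small.jpg", quality=95)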

Now, you’re gonna need a few more tools.

  1. FFMPEG – we need this to glue our video (well, picture!) to our audio
  2. Python
  3. The Google Spatial Media Python tools to then tag this video as being spherical AND Ambisonic.
  4. Note – I used homebrew to install both Python and FFMPEG (with a load of libraries enabled) on my Mac – see the command below.
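On a Mac with homebrew already installed, grabbing the first two is roughly as simple as the line below (the exact options/libraries you enable will depend on what you need):

brew install python ffmpeg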

Make the Video

So, now we’ve got our jpeg (say PS.jpg) and our 4 channel wave file (say ambiX.wav) in the same directory.  To glue the jpg (as a video) and the wave file together into a .mov container (the best combination I’ve found as yet!), type this from the command line/terminal, replacing the filenames with yours:

ffmpeg -loop 1 -i PS.jpg -i ambiX.wav -map 1:a -map 0:v -c:a copy -channel_layout 4.0 -c:v libx264 -b:v 40000k -bufsize 40000k -shortest PSandAmbiX.mov

NOTE: the above used to say … -channel_layout quad … however, that’s not correct and is no longer accepted by YouTube.  Thanks Dillon for the tip…

This will (in order): loop the jpg input, take the wave file as the second input, map the audio from the second input and the video from the first, copy the audio as-is with a 4.0 channel layout, encode the video using the libx264 library, set the video bitrate and how often to calculate the average bitrate, and make the output as long as the shorter of the two inputs (the jpg loops for ever!).  Phew!
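If you want to sanity-check what ended up in the container before uploading, ffprobe (which comes with FFMPEG) will list the streams – you should see one h264 video stream and one 4 channel pcm audio stream:

ffprobe PSandAmbiX.mov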

Tag the Video

This video now needs tagging as a spherical/Ambisonic video.  Again, from the command line/terminal, start the Google Spatial Media tools GUI:

cd <location of your spatial media tools folder>
python gui.py

Then, select the options as shown below and save the new video file (it’ll automatically add _injected to your file name) as a separate file:

Google Spatial Media Tools GUI
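(If you’d rather stay at the command line, the same repository also has a command-line entry point – I believe the injection looks something like the line below, but do check python spatialmedia -h for the exact flags in your version.)

python spatialmedia -i --spatial-audio PSandAmbiX.mov PSandAmbiX_injected.mov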

Now, you should have a video file called PSandAmbiX_injected.mov.  Your video is now ready to be uploaded to YouTube.  Note that the audio won’t react to head movement straight away – it can take several hours for the Google servers to process the audio, and the video may also look lower res than it’ll eventually become while YouTube processes it too!

Happy VR’ing!

Acknowledgements

Special thanks to Dillon Cower and Paul Roberts over at Google, and especially Albert Leusink and the Spatial Audio in VR Facebook group, for tips and info on, in particular, FFMPEG usage and video container formats.

References/Links

27 Replies to “YouTube, Ambisonics and VR”

  1. Hi Bruce,

    Maybe, to avoid confusion between “B-Format orders” and “B-Format channel ordering”, a word other than “ordering” would be preferable? I’m using “sequence”, but maybe there’s something better.

  2. Hi Bruce,

    Thanks so much for this information and the JS plugin. Is there any reason not to render straight out of Reaper, if you have an appropriate video file?

    Thanks,
    Ian

  3. No reason at all, except that I’ve not tested it yet. I’m planning on using it to glue static JPEGs together to make a simple video soon, so I’ll let you know 😉

  4. Hi there!

    Gonna keep it short: YouTube wasn’t accepting my mp4 with a multichannel ambiX .wav – “Error processing”.

    So I decided to stick to their delivery guidelines verbatim, MP4 video and AAC 4.0 audio. I ffmpeg’d the WAV to AAC and compared it in Reaper, only to find that the channel order had moved – specifically, channels 3 and 4 were swapped. After some digging around, it would appear that FFMPEG’s AAC encoder tries to guess what the input order is from the channel count, then rearranges the channels to its own standard. Since 4.0 audio isn’t really a mainstream thing, I presume it was trying to turn the 4.0 Ambi mix into a 5.1.

    Long story short, I managed to fix it somehow, through blind luck, with the following options in FFMPEG:

    -c:a aac -b:a 384k -map_channel 0.0.0 -map_channel 0.0.1 -map_channel 0.0.3 -map_channel 0.0.2 -channel_layout quad

    Would be great if you could try it out sometime, I’m only two days into Ambi and one day into FFMPEG so I may be missing the mark completely.

  5. The best way is to use a .mov container, which includes the H264 equirectangular video and the 4-channel ambiX wav (48 kHz, 16 bits).
    An .mp4 container cannot use plain uncompressed wav audio.
    AAC is quite bad for Ambisonics, as the phases are not guaranteed.
    Furthermore, if the .mov file is named properly (with a filename ending in .360.mono.mov if the video is 2D-monoscopic, and .360.mov if the video is 3D-stereoscopic in up-down format), then you can preview it locally on your Android phone with a Google Cardboard viewer, before uploading it to YouTube, using the free Jump Inspector app from Google.
    Please note that Jump Inspector already also supports 3rd order Ambisonics (16 channels).
    You can download some of my recordings for comparing 1st and 3rd orders:
    http://www.angelofarina.it/Public/Jump-Videos/

  6. Hi Bruce,
    Thanks for the info! Is your WigAmbiRemap JS Effect only available for Mac/iOS? I run Reaper on Windows 10 and I’m not sure what to do with the WigAmbiRemap file. It’s not a .dll like the rest of the plugins. Where does it go, and how do I use this file? Could you please shed some light on this for me? I’ve downloaded and installed your other plugins, but this Remap doesn’t seem to be a .dll.

    Thanks,

  7. Hello Bruce,
    Thanks for your nice tips. I’m quite into Ambisonic mixing and it works great. But do you have a process to re-sync my 4ch mix to a VR video so I can publish on YouTube? I could not get it to work with FFMPEG – I guess I’m writing the wrong command.
    Thanks !
    Hope to meet you one day and talk about this research. 🙂

  8. What command are you currently using? For video-type stuff I use something like:
    ffmpeg -i JumpVRTestAmbiX.wav -i 001.360.mono.mov -map 0:a -map 1:v -c:a copy -channel_layout quad -c:v copy 001test5.mov
    if the video just needs muxing.

  9. Thanks Bruce !
    I really appreciate that you answered my question 🙂
    Unfortunately I’m still blocked. Sorry. I’m investigating….
    Have a great day !
    Rom.

  10. Hello Bruce,

    First of all, thanks a lot for your detailed blog! I could create and upload a 360 degree video in a couple of minutes.

    I have a doubt here. Any inputs would be greatly appreciated. I have 4 omni-directional inputs to my Ambisonics encoder, so I get 4 B-format outputs. How do I combine the 4 B-format signals into a single B-format signal before sending it to my AAC encoder?

    Thanking you in advance.

    BR.

  11. You just need to sum the corresponding B-format channels of the four scenes (if I understand your question correctly). However, I’d encode to wav rather than AAC if I were you.

    Cheers

    Bruce

  12. Thanks a lot for the quick reply, Bruce. Appreciate it.

    So do I add all Ws, Xs, Ys and Zs? But if these channels have sufficient amplitudes then they would saturate on adding, right? Am I missing something here?

    Right now I’m encoding it to wav, but the plan is to move to AAC to get some compression. Any reason to stick with wav going forward?

    BR.

  13. Can’t thank you enough for both of your replies and your time, Bruce.

    Sorry to bother you, just keen on understanding this further. Will I lose the benefit of having 4 microphones if I only use addition? Is there any way to exploit the availability of 4 omni-directional microphones?
    I also checked out the Reaper with JS plugin blog. In the 4-input/4-output case, which is similar to my setup, I could see settings for the spacing between microphones and the size of the microphones. I think I need to add such complexities to get a good quality directional effect. Any pointers on this? I just want to include this when I record audio for my 360 degree videos. Is there any reference implementation or documentation to help me understand and include such settings in my current B-format converter, which BTW takes omni-directional signals and applies SN3D scaling? Currently, I use a Matlab implementation of a B-format converter to generate ambiX.wav and then follow the rest of the steps mentioned in your blog.

    To sum up my query – how do I use my 4 omni-directional setup to generate better quality directional audio, using the microphone spacing and size information? I think what I’m trying to achieve is a tetrahedral setup using omni-directional microphones.

    BR.

  14. Hello Bruce,

    Sorry if my previous query was vague or something. Please let me know if you need more details to help me with my query.

    Thanks & Regards,
    Mahantesh

  15. Hello Bruce,

    I’ve been working with your Reaper projects, and I am trying to figure something out. You have implemented an 8-channel decoder for 8 speakers. For YouTube, I have changed your decoder to a binaural decoder which outputs 2 channels for the left and right signals (which, of course, include WXYZ). I am wondering if my master track (or the decoder track that I am rendering) needs to be 8 channels, or is 4 channels enough? (My thiervs ambix binaural decoder also takes 8 channels into account – I am confused here; why would it set the master to an 8-channel track, taking only YouTube into consideration?)

  16. Ah, ok. So you’re trying to generate B-format from your omnidirectional microphone signals, not just pan them into a scene. This is more complex! Basically, take the difference and time integrate. The spacing will affect where the first null occurs in the frequency spectrum, and the level of the LF. Closer spacing means a higher-frequency first null (which is good), but it also means a lower level and more noise at LF. You’ll have to look through the literature for more details!
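    Very roughly, one axis of that idea looks something like the Python sketch below – two omni capsules spaced d metres apart, difference, then a leaky integration standing in for a proper LF-compensated integrator. The scaling and leak factor are hand-waved; it’s the principle rather than a calibrated encoder:

    import numpy as np

    def omni_pair_to_x(p_front, p_back, d, fs, c=343.0, leak=0.999):
        # difference of the two capsules approximates the pressure gradient along the axis
        grad = (p_front - p_back) * (c / (d * fs))
        x = np.zeros_like(grad)
        acc = 0.0
        for n, g in enumerate(grad):   # crude leaky integrator to undo the +6 dB/octave tilt
            acc = leak * acc + g
            x[n] = acc
        return x

    # W can be roughly approximated as the average of the two capsules: 0.5 * (p_front + p_back)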

  17. For tracks that contain 1st order B-format, you’ll need 4 channels per track. For tracks that decode (or transcode) this to another format, you need a channel count that is the number of channels at the output of that process (but with a minimum of 4 assuming it’s taking in the B-format input).

  18. Thanks a lot, Bruce, for your reply.

    So, it looks like I have to go for Beamforming or something like that and have lobes facing in particular directions. Yeah, it’s going to be complex for me at this stage. I think I’ll drop this idea.

    It took me some time to understand what you meant by panning into a scene using omni-directional microphones. This has created new interest in me. How can I achieve this with my setup of 4 omni-directional microphones? I feel just adding all 4 WXYZs may not help. I did a small experiment to see how my setup behaves with 2 microphones and a moving source: I generated WXYZs from each microphone and added them, with a slowly moving source around the microphones. But I couldn’t distinctly get the feeling of a moving source when I uploaded it to YouTube and heard it back. Am I missing something here?

    I could see that Ambisonics panning happens on the decoder side, based on the loudspeaker setup, in a couple of Ambisonics tools (I may be wrong). So my question is – can we do panning on the encoder side as well, to get the feel of a moving source? If yes, how do I achieve it?

    I’m new to 3D audio, so please bear with me 🙂

    BR.

  19. Hello Bruce,

    Adding to my previous post, I tried encoding with Reaper for a signal with moving source content, but I get the same results as with my Matlab encoder: I do not hear the moving source distinctly. Could it be because of the 1st Order Ambisonics and binaural rendering of YouTube?

    Is there any way I can make my YouTube videos have this effect of a moving source? Sorry for bugging you with so many posts 🙂

    BR.

  20. Hello Bruce,

    Maybe I was moving in the wrong direction trying to encode a single omni-directional microphone for my moving source experiment.

    I re-recorded using 2 omni mics, but by adding the 2 sets of WXYZs, the effect is lost. Is a tetrahedral mic the only solution here to capture moving sources? It is really frustrating having spent 2 weeks on this. My YouTube videos sound very bad without directional cues. I feel I’m missing something very basic. Any help would be highly appreciated.

    BR.

  21. Hello Bruce,

    Finally it works now! For some reason both mics were picking up the same angles and hence the stereo effect was lost.

    Thanks & Regards,
    Mahantesh
