
Why Nvidia's ML Model Is a Game-Changer for Teleporting to Live Events With VR

Nvidia recently published a machine learning model that paves the way for people to effectively “teleport” themselves into the crowd of any live event. Here’s a quick overview of why we’re excited about this technology.

It’s widely agreed that virtual reality (VR), coupled with fast internet and video cameras, can give people the real-time feeling of being in a different location anywhere in the world. For example, by putting on a headset you could move, dance, and sing along in the crowd at a Taylor Swift concert in New York while standing in your house in Sydney.

The problem, however, is that there isn’t a scalable and non-intrusive solution for the video camera(s) at the target location. One approach would be a dedicated camera that mimics the head movements of each virtual attendee (with very low latency to avoid motion sickness). The low-latency tracking could be achieved with some wide-lens/360 video workarounds, but you’d still need a camera for every single person, which doesn’t scale, and a swarm of moving camera robots would be distracting at a live music show.

Another option would be several strategically placed static video cameras in the target location, with the video frames from all of them combined to generate a 3D scene in near real-time. But the number of cameras required could be prohibitive, and parts of the environment would still be occluded because, from any camera’s point of view, someone is always standing behind someone else. (Generating the 3D scene in real-time is also not computationally trivial, as our friends at DroneDeploy keep reminding us.)

Why not simply use a 360 video? 

360 videos are great, but if you’re used to the freedom of moving through and interacting with 3D worlds in VR, the passivity of 360 videos can quickly make you feel trapped. It’s almost like being taken to a forest teeming with wildlife, then tying yourself to a tree upon arrival. The key, then, is having a 3D world and being able to move around and explore any part of it.

A recent machine learning model developed by Nvidia makes it easier to create 3D worlds that can be freely explored. The model, called Instant NeRF, can render 3D scenes almost instantly from a handful of images: capture 10 images of a person in a room, feed them into the model, and it returns a 3D scene. VR experiences are all built from 3D scenes – millions of 3D points carefully combined to represent places such as a street in Paris. Essentially, the Instant NeRF model consumes a few real-world images and generates a virtual-world replica that can be explored from every angle and 3D point.
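To make the idea a little more concrete, here’s a toy sketch of the volume-rendering step at the heart of NeRF-style models like Instant NeRF. In the real model, a neural network answers “how dense and what colour is the scene at this point?”; below, that network is replaced with a hard-coded sphere so the compositing math runs on its own. All values and names are illustrative, not Nvidia’s actual implementation.

```python
import numpy as np

def scene(points):
    """Stand-in for the trained network: a fuzzy orange sphere of
    radius 1 at the origin. Returns (density, rgb) per sample point."""
    dist = np.linalg.norm(points, axis=-1)
    density = np.where(dist < 1.0, 5.0, 0.0)          # opaque inside the sphere
    rgb = np.tile([1.0, 0.5, 0.0], (len(points), 1))  # orange everywhere
    return density, rgb

def render_ray(origin, direction, near=0.0, far=4.0, n_samples=64):
    """Alpha-composite samples along one camera ray (the NeRF rendering integral)."""
    t = np.linspace(near, far, n_samples)
    points = origin + t[:, None] * direction
    density, rgb = scene(points)
    delta = (far - near) / n_samples
    alpha = 1.0 - np.exp(-density * delta)            # opacity of each segment
    # transmittance: chance the ray reaches each sample without being blocked
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = trans * alpha
    return (weights[:, None] * rgb).sum(axis=0)       # final pixel colour

# A ray from z = -3 looking straight at the sphere.
pixel = render_ray(np.array([0.0, 0.0, -3.0]), np.array([0.0, 0.0, 1.0]))
print(pixel)  # roughly the sphere's orange, scaled by total ray opacity
```

Training a NeRF amounts to running this rendering for many rays, comparing the resulting pixels against the captured photos, and adjusting the network; Instant NeRF’s contribution is making that loop fast enough to feel instant.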

This means we need significantly fewer video cameras at the Taylor Swift concert and can render the complete 3D scene along with the dancing crowd in real-time. People entering the scene in VR can walk or teleport anywhere in the crowd, even onto the stage, and experience the show as if they were right there in the moment. Mapping the spatial audio of the crowd and Taylor at the show would also need some engineering, but is largely possible with existing technology.
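The spatial-audio part of that engineering is relatively well understood. As a hedged sketch (real engines use HRTFs and room acoustics; the function name, positions, and reference distance below are all made up for illustration), per-ear gains can be derived from inverse-distance attenuation plus constant-power panning:

```python
import math

def spatial_gains(listener_pos, listener_facing_deg, source_pos, ref_dist=1.0):
    """Illustrative left/right ear gains for a 2D listener and sound source."""
    dx = source_pos[0] - listener_pos[0]
    dy = source_pos[1] - listener_pos[1]
    dist = math.hypot(dx, dy)
    attenuation = ref_dist / max(dist, ref_dist)  # inverse-distance falloff
    # angle of the source relative to where the listener is facing
    angle = math.atan2(dy, dx) - math.radians(listener_facing_deg)
    p = math.sin(angle)               # crude pan: -1 fully right, +1 fully left
    theta = (p + 1.0) * math.pi / 4.0
    # constant-power pan: left^2 + right^2 stays constant as the source moves
    left = attenuation * math.sin(theta)
    right = attenuation * math.cos(theta)
    return left, right

# Listener facing the stage (90 degrees = +y), stage 10 m ahead:
left, right = spatial_gains((0.0, 0.0), 90.0, (0.0, 10.0))
# gains come out equal (source is dead centre) and quiet (10 m away)
```

As a virtual attendee walks through the crowd, re-running this per sound source each frame is what keeps Taylor’s voice anchored to the stage and the crowd noise all around.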

Experiencing a live music show in person is very different from watching it on YouTube, and the technology outlined above would make the in-person experience far more affordable and accessible. One drawback is that real-world attendees cannot see or hear you, making it a one-way experience.

For other live experiences, however, this drawback is a huge bonus. Take football matches: spectators always want to be as close as possible to the game, the players, the sweat. So much so that they’re willing to pay over $14k for front-row seats. What if you could give them a seat right next to a player on the field? Without interfering with the game, VR spectators could stand behind Messi as he takes a shot at goal.

Eventually, when AR glasses and lenses become a reality, real-world attendees will be able to see and hear the virtual attendees. You’d be able to join your friends at the AfrikaBurn festival from your apartment in New York and explore, party, and laugh with them. Add smart clothing with haptic feedback (the sense of touch) and, in every sense of the word, we’d be able to teleport around the world.


Jos van der Westhuizen
