Thanos (Josh Brolin). Photo: Film Frame. © Marvel Studios 2018
Masquerade 2.0 is the latest version of Digital Domain's in-house facial capture system. Version 2.0 was designed from the ground up to bring feature-film quality to new and often real-time applications such as next-generation games, episodic television, virtual production and advertising. At its core, Masquerade is the same technology Digital Domain used to create Thanos in Avengers: Infinity War, but it has been expanded with machine learning to process faces much faster in production environments.
Masquerade v1 was born out of Digital Domain's work on the Avengers films, as a way to use helmet cameras to capture an accurate representation of an actor's facial performance. After Avengers, the team looked back and tried to figure out what should come next. Digital Domain had done extensive testing and development for three or four months before Thanos, and Thanos was hailed as one of the strongest digital characters to date, but the full Digital Domain film pipeline was labor-intensive and time-consuming. See our Greenscreen story here.
Digital Domain has offices in Los Angeles, Vancouver, Montreal, Beijing, Shanghai, Shenzhen, Hong Kong, Taipei and Hyderabad. For the past quarter century, Digital Domain has been a leader in visual effects, expanding globally into digital humans, virtual production, pre-visualization and virtual reality. It has also created imagery for commercials, game cinematics and an impressive list of film projects.
We spoke to Darren Hendler, Director of the Digital Humans Group at Digital Domain, and Dr. Doug Roble, Senior Director of Software R&D, both isolating due to COVID, and both in California. Digital Domain has already started applying the new technology to several next-generation game projects, creating over 50 hours of in-game material 10 times faster than with their previous methods, and they're not done yet.
Unlike methods that take more than a year to complete, Masquerade 2.0 can deliver a photorealistic 3D character and dozens of hours of performance in just a few months. Using machine learning, Masquerade was trained to pinpoint expressive details, down to the wrinkles on an actor's face, without restricting movement on set. That freedom has helped Digital Domain bring emotive characters to a number of high-end projects, including a recent recreation of Dr. Martin Luther King Jr., which we covered here.
Masquerade 2.0 drives one of the highest quality facial capture systems in the world, developed over two years. Masquerade 2.0 now targets tiny facial movements, capturing more accurate lip shapes and subtler motion around the eyes than version 1.0. With this new data, Digital Domain can bring more emotional performances, and the intricacies of human facial dynamics, to their digital characters.
"Masquerade 2.0 is powerful enough for a feature film, but accessible enough for games and other digital character applications, overcoming the limitations that held us back in the past," explains Hendler. "With the new tweaks, Masquerade can deliver a one-to-one version of an actor's performance almost right out of the gate, so we can keep a lot of detail that we used to have to throw away. This has completely revolutionized our delivery process and allows us to turn around assets faster than anyone else. In the past, 50 hours of footage meant 600,000 hours of artist time. Today Masquerade reduces that by 95%."
Beyond the time savings, the new Masquerade is still a marker-based system, but a far more accurate one. The other major change is how the team tracks the markers. "On that first Avengers film, we used Masquerade with marker trackers, but all the face markers were tracked manually or semi-manually," Hendler points out. "So there was a lot of work involved. With the new Masquerade 2.0 we really looked at how we could make the actor's face much more accurate, and how we could almost completely automate the whole process." Masquerade 2.0 became the next generation of the system born at Digital Domain. The team is doing a lot of game productions right now, and a single project can consist of 30 to 50 hours of articulated face capture. "That was just unimaginable with our previous system, and it's all data that nobody touches," adds Hendler. Under the old system this would have been far too much performance data to manage and keep on schedule; now the process is almost completely automated, and only between 2% and 5% of the shots come back for moderate or small artist-led changes.
How it works (the simple version)
The process begins with the actor in a seated rig, providing a set of high-resolution 4D data along with very high-resolution static scans. This provides a training set of data on how the actor's face moves.
The actor later puts on a vertical stereo helmet camera rig (HMC) with markers on their face. Digital Domain has a clever machine learning system, a GAN (Generative Adversarial Network), that works out what is and is not a marker. GANs are useful for telling, for example, what is a marker and what is the iris of the eye. With some other clever AI tools that use the earlier training data, the system labels and tracks these markers, and is now very fast and very robust. "In addition, our computer vision and other systems ensure that these predicted markers move exactly as they should, thanks to a very accurate camera model," says Hendler. "We also have mechanisms to work out whether a particular marker is really where it should be, or if something funny is going on. And if I removed this marker, what would I fill its data with?" In short, there are quite a few different modules in Masquerade 2.0, all aimed at identifying, accurately predicting and tracking all of the different markers as the actor performs.
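To make the idea concrete, here is a toy sketch of the detection step that precedes all of this: finding candidate dots in an IR-style frame and returning one centroid per blob. This is a hypothetical numpy illustration, not Digital Domain's implementation, which uses a GAN to decide which candidates are real markers.

```python
# Toy marker detection: threshold a synthetic frame, flood-fill each
# bright blob, and return its centroid. Purely illustrative.
import numpy as np

def detect_markers(image, threshold=0.5):
    """Return one (row, col) centroid per connected bright blob."""
    mask = image > threshold
    visited = np.zeros_like(mask, dtype=bool)
    centroids = []
    h, w = mask.shape
    for y in range(h):
        for x in range(w):
            if mask[y, x] and not visited[y, x]:
                # Flood-fill one connected component (4-neighbourhood).
                stack, pixels = [(y, x)], []
                visited[y, x] = True
                while stack:
                    cy, cx = stack.pop()
                    pixels.append((cy, cx))
                    for ny, nx in ((cy-1, cx), (cy+1, cx), (cy, cx-1), (cy, cx+1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not visited[ny, nx]:
                            visited[ny, nx] = True
                            stack.append((ny, nx))
                centroids.append(np.array(pixels, dtype=float).mean(axis=0))
    return centroids

# Synthetic 16x16 frame with two 2x2 "markers".
frame = np.zeros((16, 16))
frame[2:4, 3:5] = 1.0
frame[10:12, 8:10] = 1.0
found = detect_markers(frame)
```

In a real system, each candidate blob would then be classified (marker vs. eye highlight, pore, etc.) before being handed to the tracker.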
The next goal of Masquerade 2.0 is to create a moving mesh that exactly and accurately matches the images of the actor from both HMC views. After Avengers, the team re-examined its pipeline and questioned all the data and results. "We sat and looked at how the lip curls came through, and when they didn't come through properly we looked at why they weren't working. And so there is this whole process of creating a moving mesh that fits the actor exactly," emphasizes Hendler, adding, "From there we have a whole bunch of different mechanisms that allow us to get more art-directable, artist-friendly setups – to vary things according to the needs of different customers."
The output of the Masquerade 2.0 process is a moving mesh of the face, with fully animated eyes, which the team did not capture in the original Masquerade. During the process this mesh is also solved onto a kind of facial rig. In addition, the team retains any differences that may exist between the source and the moving 3D mesh. These deltas, or differences, are then used to determine more accurately what the actor's performance is doing.
Digital Domain serves a number of different customers, each of whom may need different deliverables. Some customers just want the raw 3D/4D meshes; others want the performance solved onto their own rigs. For still others, explains Hendler, "we generate and give them a complete facial rig with which we can reproduce the quality of the performance fairly accurately, but which is also optimized to run in a real-time engine." This rig is a proprietary Digital Domain UE4 facial rig. This can sometimes work better, as a client's rig may not have the right shapes: the Masquerade system gives very accurate results, but the client's rig cannot reproduce those facial expressions due to missing combination shapes and other limitations.
For their own work, Masquerade can feed either the high-resolution film pipeline or the real-time UE4 pipeline of the Digital Doug project shown at SIGGRAPH 2019. The two main differences between the Digital Doug project and the earlier MEET MIKE UE4 project are that Masquerade uses 4D training data and applies complex machine learning to reach advanced levels of animation.
Machine learning background
In machine learning there are, broadly, two types of artificial neural network: those used for pattern recognition are typically convolutional neural networks (CNNs), and those used for temporal (time-based) problems are recurrent neural networks (RNNs). Reading a picture and identifying that a cat is in the shot, for example, is a CNN problem. Understanding the voice commands to your Alexa or Siri is usually an RNN problem. A key difference is that a CNN has no memory of the last picture it viewed. An LSTM (Long Short-Term Memory) RNN, in particular, builds on what it has learned from the previous image in the sequence under consideration. This might give the impression that Digital Domain solves its Masquerade pipeline with RNN deep learning, but it does not. Both are AI tools and both are advanced machine learning approaches, meaning they use training data to build the solutions they provide. However, as Doug Roble jokes, "Training recurrent neural networks is not an easy task."
RNNs differ from conventional "feed-forward" neural networks. Feed-forward neural networks were the first and simplest type of artificial neural network and, as the name suggests, have no feedback connections: information only moves forward through the network. This is not to be confused with what they do internally during training, feeding optimizations back within the network to reduce error, in a technique known as back propagation. This approach of carefully finding a path toward fewer and fewer errors is called gradient descent. In general, teaching a network to perform better and better is a fairly subtle problem that requires additional techniques. "We use a lot of temporal information in combination with the statistical model to support simpler feed-forward networks, because it works better if we know what happened in the previous frame," explains Roble.
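The gradient descent idea can be shown on the simplest possible feed-forward model: fit a line by repeatedly stepping the parameters against the gradient of the squared error. This is a minimal numpy sketch of the optimisation concept only, nothing like the scale of the networks discussed here.

```python
# Minimal gradient descent: fit y = w*x + b to clean data by stepping
# the parameters "downhill" along the gradient of the mean squared error.
import numpy as np

x = np.linspace(-1.0, 1.0, 50)
y = 3.0 * x + 0.5          # ground truth: w = 3.0, b = 0.5

w, b, lr = 0.0, 0.0, 0.1   # initial guess and learning rate
for _ in range(500):
    err = (w * x + b) - y
    # Gradients of the mean squared error w.r.t. w and b.
    grad_w = 2.0 * (err * x).mean()
    grad_b = 2.0 * err.mean()
    w -= lr * grad_w       # each step reduces the error a little
    b -= lr * grad_b
```

After 500 steps the parameters have converged to the true values; in a deep network the same principle applies, just with millions of parameters and the gradients propagated back through many layers.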
Much of what can normally be achieved with an RNN can be done using CNNs in combination with an additional memory aspect. "We don't do quite this, but it's similar to what we do, and we have that kind of timing component in this (deep learning) stuff." Roble is leaving the details for his team to explain in the future, as they want to publish formal SIGGRAPH papers on their specific implementations.
Marker tracking mechanism
The marker tracking mechanism is a good example of this. The new automated tracking mechanism in Masquerade doesn't work with old-fashioned "Harris corner" pattern matching or with per-frame CNNs alone. Some time-based data is used, but not as an RNN. "We have a lot of statistical data on the face – how the face moves, the slopes and speeds of different things on the face. All of this is used in our prediction of markers, and especially in predicting when markers are incorrect. We have to make sure that these markers are always statistically correct in terms of movement," explains Hendler. Mathematical error is not the only thing the tracking mechanism has to account for. Actors can put a hand on their face and partially block the camera. In these situations the system needs to keep a plausible idea of where all of the trackers are, even though the cameras cannot see that part of the actor's face. The system has to statistically fill in the missing data and predict the hidden part of the face, so that the rest of the face stays correct and the markers can be picked up again when the hand moves away.
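A heavily simplified sketch of that occlusion handling: when a marker measurement disappears, predict its position from its recent velocity, and gate incoming measurements against the prediction so implausible jumps are rejected. The statistics here are hypothetical; Masquerade's actual motion models are far richer.

```python
# Constant-velocity marker prediction with a simple distance gate.
# Illustrative only; assumes the marker is observed in the first frame.
import numpy as np

def track(observations, gate=2.0):
    """observations: one (x, y) tuple or None (occluded) per frame."""
    out, prev, vel = [], None, np.zeros(2)
    for obs in observations:
        if prev is None:                    # first frame: take as-is
            est = np.array(obs, dtype=float)
        else:
            predicted = prev + vel          # constant-velocity guess
            if obs is None:
                est = predicted             # occluded: use prediction
            else:
                obs = np.array(obs, dtype=float)
                # Gate: reject measurements far from the prediction.
                est = obs if np.linalg.norm(obs - predicted) <= gate else predicted
            vel = est - prev
        out.append(est)
        prev = est
    return out

# Marker moving 1 px/frame in x, hidden for two frames, then visible again.
frames = [(0, 0), (1, 0), None, None, (4, 0)]
path = track(frames)
```

The predicted positions bridge the occluded frames smoothly, and the real measurement is re-accepted because it lands near the prediction.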
Digital Domain typically uses 150 markers on the face. That number is intended to provide enough data even when some markers are obscured. Digital Domain doesn't necessarily have to take this marker-based approach, says Hendler. "It's just that every time we look at markerless approaches, the attention to detail needed in the framing and the focus, and the lack of any real tolerance to helmet displacement, make it very difficult to come up with a robust system, which is why we keep coming back to markers on the face." The markers don't have to be placed identically from day to day on set; the system adapts to daily variations in how the markers are applied to the actor's face.
Digital Domain does a lot of its face capture work in IR. The IR cameras have a shallow depth of field, and the very high-contrast dots provide precise information that skin-pore detail alone would not reliably provide. The cameras run at 2K resolution. The real-time live Digital Doug demo at SIGGRAPH used the Masquerade Live system to stream the RGB camera data. With the normal Masquerade system, the data is recorded on an on-board computer, with a WiFi live stream for monitoring on set.
The raw Masquerade output is compared not only to the training data, which may come from a seated Medusa session or a DI4D training session, but also to much higher-fidelity ICT Light Stage scans or another very accurate photogrammetry session. For example, an actor may have 12 Light Stage expression scans done in addition to the recorded training data. These are static but extremely accurate. Masquerade feeds all of this into the process and, using Digital Domain's machine learning approaches, fills in data that is present neither in the Disney Research Studios Medusa captures nor in the static USC ICT Light Stage scans when solving a newly recorded performance of an actor.
"We have a process through which we regenerate a set of shapes for the actor, for our actor rig," explains Hendler. "But we also use this delta data to change and adapt those shapes. And so these shapes come closer and closer to the actor's likeness." Understandably, there will always be some delta between the two, some mathematical difference, and Masquerade has several ways of retaining that delta, or error, and using it correctly later in the process.
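The delta idea itself is simple and can be sketched in a few lines: the residual between the tracked mesh and the rig's best fit is stored per vertex and layered back on later, so detail the rig cannot express is not thrown away. The meshes below are tiny invented (N, 3) vertex arrays, purely for illustration.

```python
# Per-vertex delta preservation between a tracked mesh and a rig fit.
import numpy as np

tracked = np.array([[0.0, 1.00, 0.0],   # what the capture solved to
                    [1.0, 2.10, 0.0],
                    [2.0, 0.95, 0.0]])
rig_fit = np.array([[0.0, 1.00, 0.0],   # closest pose the rig can reach
                    [1.0, 2.00, 0.0],
                    [2.0, 1.00, 0.0]])

delta = tracked - rig_fit               # residual detail the rig missed

# Later in the pipeline the stored delta is added back on top of the
# rig pose, recovering the full captured detail.
final = rig_fit + delta
```

Storing the delta separately also means the rig pose can be edited or retargeted first, with the captured detail re-applied afterwards.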
A huge self-correcting feedback system
Masquerade works in a similar way to before, but with more automation and much higher speed, improving at every single step while keeping basically the same approach. Masquerade is a very modular system. "The cool thing about Masquerade is that each of these modules learns to get better," explains Doug Roble. "There's this module that keeps track of the markers and, based on what it knows about the shapes, predicts where the markers should be. And it learns based on what it sees and the corrections in the process. Then it gets smarter. It's basically figuring out where those markers should be, which improves the tracking, which then feeds into the shape prediction."
This makes Masquerade, mechanically, one huge feedback loop that keeps getting better. "And the artists can put those modules together and say, okay, let's spend a little more time doing this one thing better – and then the effects get carried through the entire pipeline," he adds. All of this makes the Masquerade process sound like it works much like a neural network, and that is what makes Digital Domain's work so difficult and impressive. Just as with actual neural networks, the theory of self-improvement is one thing, but when a system descends a gradient to reduce error via back propagation, it can drift off to infinity or collapse to zero. Massive pipelines that share data to reduce errors, and thereby become smarter, are incredibly difficult to build, and even harder to get working reliably without human intervention and constant monitoring. "One of the cool things about Masquerade is that when we try to make this stuff 'better', we have a lot of different ways to measure losses. For example, the system also uses optical flow. The optical flow has some kind of error in it, but if we use it in conjunction with actual OpenCV tracking of the markers and then overlay the optical flow, we can support the OpenCV marker tracking," explains Roble. "And then we have multiple head-mounted cameras, so we can begin 3D reconstructions – and we can get 3D losses there too." Layering all these different loss functions over each of these different components, Roble explains, "makes it easier and easier to do the back propagation, since we are folding in these loss functions that help things along the way, and that was one of the most important insights the developers of the system came up with. It uses so much information from computer vision and doesn't just try to do pure machine learning to solve the problem."
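A hedged sketch of what "layering loss functions" can look like in code: a marker reprojection error per camera plus an optical-flow consistency term, combined with weights into one scalar to minimise. The terms, weights and data below are invented for illustration, not Digital Domain's actual losses.

```python
# Combining several measurement losses into one objective.
import numpy as np

def reprojection_loss(predicted_2d, observed_2d):
    """Mean squared pixel error between predicted and observed markers."""
    return ((predicted_2d - observed_2d) ** 2).mean()

def flow_loss(positions_t, positions_t1, flow):
    """How far marker motion disagrees with the optical-flow field."""
    return (((positions_t1 - positions_t) - flow) ** 2).mean()

# Tiny made-up data: 3 markers seen by 2 head-mounted cameras,
# plus one frame-to-frame motion sample.
pred_cam = [np.array([[10., 10.], [20., 5.], [30., 8.]]),
            np.array([[11., 12.], [21., 6.], [29., 9.]])]
obs_cam  = [np.array([[10., 11.], [20., 5.], [31., 8.]]),
            np.array([[11., 12.], [22., 6.], [29., 10.]])]
p_t  = np.array([[10., 10.], [20., 5.], [30., 8.]])
p_t1 = np.array([[10.5, 10.], [20., 5.5], [30., 8.]])
flow = np.array([[0.5, 0.], [0., 0.5], [0., 0.]])

total = (1.0 * sum(reprojection_loss(p, o) for p, o in zip(pred_cam, obs_cam))
         + 0.5 * flow_loss(p_t, p_t1, flow))
```

Each term constrains the solve differently, so an error that one measurement misses (a slipped marker, a bad flow vector) tends to be caught by another.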
Another innovation Digital Domain is working on builds on FACS, the Facial Action Coding System. The new system is called shape propagation. A 3D version of an actor's head can be fed into the system, and Digital Domain can then simulate a full set of FACS shapes – 1,500 different shapes based on average FACS shapes. "They are all different expressions, but they are not that person's expressions. It has a smile, but it is not that person's smile," says Hendler. In this way, Digital Domain creates a series of shapes for an actor's face by transfer, without having to do a FACS scanning session. Digital Domain could then solve the 4D data coming from Masquerade, with the actor wearing the HMC, onto these shapes, but they would not be the actor's unique expressions – not their exact smile. So the team automatically refines all of these FACS shapes as part of the new shape propagation, to ensure the system comes as close as possible to the actor's actual performance, which in turn updates and improves the whole of Masquerade. "It all comes back to this modular system, where any part of the module can refine itself to adapt to what it has," says Roble. "If, say, you were in Australia without a special 4D system or anything like that, we could put an HMC on you – put a few dots on your face – and make a plausible version of you just from what we recorded. As we gather more information about the location of the vertices and the shapes of that person, we can tighten them up as well. We're going to find out what your face is doing just by watching what your face is doing."
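The core of shape propagation can be illustrated as a delta transfer: a generic FACS expression, stored as an offset from an average neutral, is added to the new actor's neutral to manufacture a plausible starting shape, then pulled toward observed performance data. All meshes and blend weights here are invented toy values, not Digital Domain's pipeline.

```python
# Toy blendshape delta transfer plus refinement toward observation.
import numpy as np

actor_neutral = np.array([[0., 0., 0.],
                          [1., 0., 0.],
                          [0., 1., 0.]])

# Generic "smile" delta measured on an average face, relative to the
# average neutral (invented values).
generic_smile_delta = np.array([[0.0, 0.1, 0.0],
                                [0.1, 0.1, 0.0],
                                [0.0, 0.0, 0.0]])

# Transferred shape: a plausible smile, but not *this* actor's smile.
actor_smile_guess = actor_neutral + generic_smile_delta

# As 4D performance data from the HMC accumulates, each shape can be
# blended toward the actor's true expression (a made-up target here).
observed_smile = np.array([[0.0, 0.12, 0.0],
                           [1.1, 0.1, 0.0],
                           [0.0, 1.0, 0.0]])
refined = actor_smile_guess + 0.5 * (observed_smile - actor_smile_guess)
```

Each refinement step moves the generic shape closer to the actor's real expression, which is the "tighten them up as well" behaviour Roble describes.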
As Greenscreen has discussed in detail, neural representations for face reconstruction and synthesis are an active research area, and Digital Domain is cagey about discussing the subject. "We are definitely involved in neural rendering and hybrid situations and their possible application in feature films. I just don't think we're ready to talk about it yet," joked Hendler. Digital Domain is already working with its clients on new approaches, modified filming processes, special custom actor captures and much more. This is because "the hybrid approach already exists and is being used in production," says Hendler. He would not go into any details.
"Digital Domain's innovations are responsible for some of the most memorable images of the past 27 years," said Daniel Seah, executive director and chief executive officer of Digital Domain. Digital Domain's VFX team has brought art and technology to films like Titanic, The Curious Case of Benjamin Button, and the blockbusters Ready Player One, Avengers: Infinity War and Avengers: Endgame. "Masquerade 2.0 will not only make digital humans more accessible to our customers, but also more realistic and engaging for audiences, which will raise the reputation and credibility of our industry around the world," he added.
"Our innovation strategy at Digital Domain is about improving our internal pipelines and serving our customers. Delivering believable, human-driven performances is critical not only to our feature clients, but also to our gaming, episodic and streaming, experience and advertising clients," said John Fragomeni, Global VFX President. "As audiences increasingly expect near-flawless imagery across all content, we are constantly striving to develop solutions that are qualitatively superior, faster and adaptable to any form of entertainment."
Both Darren Hendler and Doug Roble will be speaking at SIGGRAPH Asia as part of the panel: Digital Humans Are Back! Creating and Using Credible Avatars in the Age of COVID – Part 2.
Date and time (SG time, GMT+8): December 11, 11 a.m. – 12:30 p.m.
They will be joined by Christophe Hery (Facebook), Hao Li (Pinscreen) and Mike Seymour from Greenscreen. This is a virtual event. For more information, see sa2020.siggraph.org