This week at SIGGRAPH's Real-Time Live (RTL), Volumetric Human Teleportation, or Monoport, received the juried “Best in Show” award. (It tied with Interactive Style Transfer to Live Video Streams.) The Monoport team consisted of USC members Ruilong Li, Zeng Huang, Kyle Olszewski, Yuliang Xiu and Shunsuke Saito, as well as former USC professor Hao Li, who now works full-time at Pinscreen, a company he founded to develop digital characters. Not only was Pinscreen part of that successful presentation, but it also had a second entry at RTL that focused on real-time digital humans.
Volumetric human teleportation SIGGRAPH 2020 RTL
The winning presentation enables the capture and display of people in real time. The system isolates the person, samples their texture, and uses machine learning (ML) to create a 3D person. This digital version can be viewed from any angle, including from behind, a view the program has never seen. This is done with deep learning and builds on the team's previous work, PIFu: Pixel-Aligned Implicit Function for High-Resolution Clothed Human Digitization (see below), which was first shown at ICCV 2019 but now runs in real time.
The program does not key the target person out of the background, nor does it use conventional voxel representations or require multiple cameras, since it does not use photogrammetry. It uses ML to segment the person, even in loose and detailed clothing, and then infers a result. As the name suggests, it uses a pixel-aligned implicit function to align the 2D image with the corresponding 3D object. It is an end-to-end deep learning method that can handle very complicated shapes such as hairstyles and clothing, along with their variations and deformations, and digitize them in a uniform way. The system has so far only been shown with a standard webcam, a consumer RGB USB camera, but nothing prevents future work from using a higher-resolution camera to reach much higher quality. Similarly, the background wall shown in the video was a flat color, but this is not a key aspect of the solution; the background is not significant. The lighting does not have to be professional or special either. No polarized light sources or special filters are used.
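To make the idea of a pixel-aligned implicit function more concrete, here is a minimal sketch in plain Python. It is not the team's network: the feature map and the tiny MLP weights are random stand-ins for what a trained image encoder and classifier would provide, the orthographic projection is an assumption, and every function name is hypothetical. It only shows the query pattern: project a 3D point into the image, sample the image feature at that pixel, and feed the feature plus the point's depth to a function that returns an inside/outside probability.

```python
import math
import random


def project_orthographic(point, width, height):
    """Map a 3D point in [-1, 1]^3 to pixel coordinates plus its depth.

    Assumption: a simple orthographic camera looking down the z axis,
    with +y pointing up in 3D and image rows growing downward.
    """
    x, y, z = point
    px = (x + 1.0) * 0.5 * (width - 1)
    py = (1.0 - y) * 0.5 * (height - 1)
    return px, py, z


def bilinear_sample(feature_map, px, py):
    """Bilinearly interpolate feature_map[row][col] (a list of channels)."""
    height, width = len(feature_map), len(feature_map[0])
    px = min(max(px, 0.0), width - 1.0)
    py = min(max(py, 0.0), height - 1.0)
    x0, y0 = int(px), int(py)
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    tx, ty = px - x0, py - y0
    sampled = []
    for c in range(len(feature_map[0][0])):
        top = (1 - tx) * feature_map[y0][x0][c] + tx * feature_map[y0][x1][c]
        bottom = (1 - tx) * feature_map[y1][x0][c] + tx * feature_map[y1][x1][c]
        sampled.append((1 - ty) * top + ty * bottom)
    return sampled


def make_toy_mlp(in_dim, hidden=8, seed=0):
    """Return an untrained one-hidden-layer MLP (random stand-in weights)."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(hidden)]
    w2 = [rng.uniform(-1, 1) for _ in range(hidden)]

    def mlp(values):
        hidden_act = [max(0.0, sum(w * v for w, v in zip(row, values))) for row in w1]
        logit = sum(w * h for w, h in zip(w2, hidden_act))
        return 1.0 / (1.0 + math.exp(-logit))  # sigmoid: probability in (0, 1)

    return mlp


def occupancy(point, feature_map, mlp):
    """Implicit function query: is this 3D point inside the surface?"""
    px, py, depth = project_orthographic(point, len(feature_map[0]), len(feature_map))
    pixel_feature = bilinear_sample(feature_map, px, py)
    return mlp(pixel_feature + [depth])


if __name__ == "__main__":
    rng = random.Random(1)
    # Stand-in for encoder output: a 4x4 feature map with 2 channels.
    features = [[[rng.random(), rng.random()] for _ in range(4)] for _ in range(4)]
    toy_mlp = make_toy_mlp(in_dim=3)
    print(occupancy((0.1, -0.2, 0.3), features, toy_mlp))
```

In a trained system, a surface would typically be extracted by evaluating such a function over many 3D points and keeping the boundary where the probability crosses 0.5. Because the shape is stored in the function rather than in a voxel grid or a body template, this query pattern is consistent with the article's point that no conventional voxel representation is needed.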
Zeng Huang's live PhD defense.
At a time when immersive technologies and sensor-based systems are becoming more common, and in a COVID-19 world that is rushing to Zoom and video conferencing, this virtual 3D presence caught the imagination of the SIGGRAPH jury. Monoport was a central part of Zeng Huang's Ph.D. defense. Rather than presenting his final submission to his advisors and reviewers in a normal Zoom session, Huang "Monoported" himself into his own PowerPoint slide deck and presented to USC as a Monoport volumetric digital human.
Video conferencing with a single camera remains the most common approach to face-to-face communication over long distances, despite recent advances in virtual and augmented reality, as well as 3D displays, which enable far more immersive and compelling interactions. While Monoport could easily have been showcased as a VR implementation at RTL, the team decided to use a mobile device rather than a VR headset to show the interactivity and the ability to move the camera in real time. Both approaches are valid, but the non-VR solution highlights the convenience and ease of use of the core Monoport system.
The system is robust against changing light conditions
Successfully reconstructing not only a person's geometry but also their texture from the perspective of a single camera is a significant challenge due to depth ambiguity, changing topology, and severe occlusions. To address these challenges, the team took a data-driven approach using deep, high-capacity neural networks. The approach is so successful that two people can be captured from the same live-action feed, as was shown during RTL. Central to the success of the process is the nature of deep learning: training can take some time, but the trained software is very fast to execute. The training data the team used was a combination of a library of photogrammetrically captured figures and synthetic data that the team created from it. Because some of the library figures were rigged, the team was able to produce additional figures both by varying the original poses and by applying different lighting to them.
One of the most impressive aspects of the system is how well it handles people carrying extra items such as backpacks, or removing clothing such as jackets or hoodies, without the system falling over. “The beauty of deep learning is that it implicitly defines what it is based on what you feed it with,” explains Li. “For example, in a traditional computer vision method or a model-based approach, you basically define what the human body is, usually via a parametric template body, which is a naked body with no clothes on.” Monoport does not work from a base human model, so it is faster, more accurate, and takes up less storage space.
Below you can see the original video for PIFu: Pixel-Aligned Implicit Function for High-Resolution Clothed Human Digitization (ICCV 2019).
High resolution version: PIFuHD
A higher-resolution, non-real-time version, PIFuHD, was also developed as part of an internship at Facebook Reality Labs / Facebook AI and was recently presented by Shunsuke Saito at CVPR 2020. However, this higher-resolution version is not only non-real-time; it also focuses solely on geometry and does not provide the aligned textures.
AI-synthesized avatars: From real-time deepfakes to the photoreal AI virtual assistant
The other presentation that Pinscreen had at RTL showed “deepfakes” in real time and the work that Pinscreen has done in creating a consistent digital human. That includes the basic AI engine that powers the character, in this case a digital Frances based on Li's own wife. Pinscreen developed the entire digital agent pipeline. “The only third-party components that we use are speech recognition, which uses the Google API, and speech synthesis, which uses Amazon Polly,” explains Li.
All of Digital Frances's processing was done in the cloud and was completely live. During the actual RTL presentation, Hao Li spoke to his digital wife. Since the digital agent does not have a script, her interactions and responses are never the same twice. Unlike in the rehearsals, Digital Frances suggested during the show that her real husband might need to get a regular job! “During rehearsals, she always asked different things, but it was really different during the show,” Li comments. “It was the weirdest questions she asked. She asked things like, ‘Oh, you should get a job, how old are you?’ And I said, ‘Man, why is she asking me these things?’ But then I thought, you know, it was kind of good, because if it were perfect, it might not have looked real or unscripted.”
In an upcoming Greenscreen story, we will speak to Frances directly.
The other part of the demo was a deep learning-based facial synthesis technology for real-time, photorealistic AI avatars. With this technology, a user can create their own 3D face model and transform themselves into the face of an actor, athlete, politician, musician, or any other person. (Fxguide has covered this technology before.)