The COVID-19 pandemic has driven companies to accelerate advances in video conferencing and in the use of AI and machine learning. Two companies at the forefront of these advances are Pinscreen and NVIDIA. Both use advanced machine learning and generative adversarial networks (GANs) to push what is possible, but they approach the problem from very different angles.
At SIGGRAPH, Pinscreen presented two advances in real-time graphics at Real-Time Live (RTL). Its work on scanning virtual bodies with a single camera, Monoport, was jointly awarded Best of Show (see our story here). The second demo was of its fully digital agent, Frances, who was interviewed live during the event.
At Greenscreen, we really wanted to try a video chat with Digital Frances ourselves, but we thought it would be more fun if Digital Michael interviewed Digital Frances. In the exclusive video below, we speak to Pinscreen founder Hao Li, and a neural Digital Michael interviews Digital Frances.
Digital Michael is a UE4 avatar rendered in real time. A GAN-generated face, driven by the UE4 face, is then composited onto it. Digital Frances is a fully virtual assistant that answers questions and is also rendered in real time in UE4, including her simulated digital hair and her spontaneous, unscripted responses.
Pinscreen uses its own technology, based on its paGAN software, to power the UE4 characters and avatars. Its AI technology is independent of NVIDIA's new Maxine software (below), but it runs on NVIDIA GPUs.
NVIDIA has shown new AI breakthroughs as part of NVIDIA Maxine, a suite of new software designed to improve video conferencing.
Maxine brings together a number of new AI SDKs and innovations around video conferencing. One example is a cloud-native video streaming AI SDK that drastically reduces bandwidth usage while also allowing faces to be re-animated, gaze to be corrected and characters to be animated for immersive and engaging meetings.
Face re-animation: gaze correction
One of the biggest problems with video conferencing is that you look at the screen, not the camera, so no one makes eye contact. By streaming only neural network data and audio from the speaker in order to reduce bandwidth, NVIDIA can simulate much higher quality video conferencing and also solve the eyeline problem. It does this by inferring where you should be looking and using neural rendering to recreate your face, instead of just streaming video. This AI video compression reduces bandwidth by up to 90%, to as little as one tenth of what H.264 requires.
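To put the "up to 90%" figure in concrete terms, the arithmetic can be sketched in a few lines of Python. This is only an illustration: the 1000 kbps baseline used in the usage note is a hypothetical H.264 bitrate chosen for round numbers, not a figure from NVIDIA, and the function names are made up for this example.

```python
def reduced_bitrate_kbps(baseline_kbps: float, reduction: float = 0.90) -> float:
    """Bitrate left after cutting `reduction` (a fraction, 0..1) from a baseline.

    The default of 0.90 mirrors the "up to 90%" claim; real savings will vary.
    """
    if not 0.0 <= reduction <= 1.0:
        raise ValueError("reduction must be between 0 and 1")
    return baseline_kbps * (1.0 - reduction)


def call_data_megabytes(bitrate_kbps: float, minutes: float) -> float:
    """Total data for a call, using 1 kbps = 1000 bits/s and 1 MB = 1,000,000 bytes."""
    bits = bitrate_kbps * 1000 * minutes * 60
    return bits / 8 / 1_000_000


# Hypothetical baseline: an H.264 stream at 1000 kbps (assumed, not measured).
baseline = 1000.0
compressed = reduced_bitrate_kbps(baseline)
hour_before = call_data_megabytes(baseline, 60)
hour_after = call_data_megabytes(compressed, 60)
```

Under that assumed baseline, a one-hour call would drop from 450 MB to roughly 45 MB, which is the "one tenth" in the claim above.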
With new AI research, NVIDIA can identify key points of each person's face on a video call and then use those points, together with a still image, to re-animate that person's face on the other side of the call using GANs. These key points can be used for face alignment, which rotates faces so that people appear to face each other during a call, as well as gaze correction, which simulates eye contact even when a person's camera is not aligned with their screen. NVIDIA hopes that developers will also add features that allow callers to choose their own avatars, realistically animated by their voice and emotional tone in real time.
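The idea of using key points to rotate a face can be illustrated with a very small sketch. This is not NVIDIA's method, which relies on neural rendering. It is a plain 2D geometric stand-in that levels a set of face key points so that the line between the eyes is horizontal. The point layout and the function names are assumptions made for the example.

```python
import math

Point = tuple[float, float]


def roll_angle(left_eye: Point, right_eye: Point) -> float:
    """Angle (radians) of the line from the left eye to the right eye."""
    return math.atan2(right_eye[1] - left_eye[1], right_eye[0] - left_eye[0])


def rotate_points(points: list[Point], angle: float, center: Point) -> list[Point]:
    """Rotate 2D points by `angle` radians around `center`."""
    c, s = math.cos(angle), math.sin(angle)
    cx, cy = center
    rotated = []
    for x, y in points:
        tx, ty = x - cx, y - cy
        rotated.append((cx + c * tx - s * ty, cy + s * tx + c * ty))
    return rotated


def level_face(points: list[Point], left_eye_idx: int, right_eye_idx: int) -> list[Point]:
    """Rotate all key points about the midpoint of the eyes so the eye line is level."""
    left_eye, right_eye = points[left_eye_idx], points[right_eye_idx]
    center = ((left_eye[0] + right_eye[0]) / 2, (left_eye[1] + right_eye[1]) / 2)
    # Rotating by the negative of the measured roll cancels the tilt.
    return rotate_points(points, -roll_angle(left_eye, right_eye), center)


# Hypothetical key points: left eye, right eye, and one extra point on a tilted face.
tilted = [(0.0, 0.0), (2.0, 2.0), (1.0, 3.0)]
leveled = level_face(tilted, left_eye_idx=0, right_eye_idx=1)
```

After the call, the two eye points share the same vertical coordinate and the distances between points are unchanged. A real system would work with many more key points and in 3D, but the principle of deriving a transform from a few tracked points is the same.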
This work by NVIDIA is part of the larger Maxine program.
NVIDIA Maxine is a fully accelerated platform for developers to create and deploy AI-powered capabilities in video conferencing services, using state-of-the-art models that run in the cloud. Maxine includes the latest innovations from NVIDIA research, such as real-time automatic translation, face alignment, gaze correction and face relighting, as well as features like super resolution, noise reduction, subtitles and virtual assistants. These functions are fully accelerated on NVIDIA GPUs to run in real-time video streaming applications in the cloud. Because Maxine-based applications run in the cloud, every user can enjoy the same functionality on any device, including computers, tablets and phones. And because NVIDIA Maxine is cloud-native, applications can be deployed as microservices that scale to hundreds of thousands of streams in a Kubernetes environment.
Maxine-based applications can use NVIDIA Jarvis, a fully accelerated conversational AI framework with state-of-the-art models optimized for real-time performance. With Jarvis, developers can integrate virtual assistants that take notes, set action items and answer questions in human-like voices. Additional conversational AI services such as translation, subtitles and transcription are also possible.