Bringing Telepresence to Every Desk

Abstract

In this paper, we work to bring telepresence to every desktop. Unlike commercial systems, personal 3D video conferencing systems must render high-quality videos while remaining financially and computationally viable for the average consumer. To this end, we introduce a capture and rendering system that requires only 4 consumer-grade RGBD cameras and synthesizes high-quality free-viewpoint videos of users as well as their environments.

Experimental results show that our system renders high-quality free-viewpoint videos without using object templates or heavy pre-processing. While not real-time, our system is fast and does not require per-video optimizations. Moreover, our system is robust to complex hand gestures and clothing, and it can generalize to new users. This work provides a strong basis for further optimization, and it will help bring telepresence to every desk in the near future. The code and dataset will be made available.

Overview

Our system uses 4 RGBD cameras to render high-resolution (1280x960) free-viewpoint videos. We propose Multi-Layer Point Cloud (MPC), a new volumetric representation for RGBD inputs that enables more efficient and accurate reconstruction than conventional novel-view depth sweep volumes. We further improve stability and accuracy with a temporal renderer and Spatial Skip Connections. Please refer to the paper for more details.
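
To make the idea of a layered point representation for RGBD inputs more concrete, here is a minimal sketch. It is not the MPC implementation from the paper. It only assumes a pinhole camera model, depth in metres with 0 marking missing readings, and uniform depth bins in the target view. The function names (unproject_depth, layer_points) and all parameter values are hypothetical and chosen for illustration.

```python
import numpy as np

def unproject_depth(depth, K, cam_to_world):
    """Back-project a depth map (H, W) to world-space points (N, 3)."""
    h, w = depth.shape
    # Pixel grid: u indexes columns (x), v indexes rows (y).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = depth > 0  # assumption: zero depth marks a missing reading
    z = depth[valid]
    x = (u[valid] - K[0, 2]) * z / K[0, 0]
    y = (v[valid] - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)  # homogeneous (N, 4)
    return (pts_cam @ cam_to_world.T)[:, :3]

def layer_points(points, world_to_target, num_layers, near, far):
    """Split points into num_layers uniform depth bins seen from the target view."""
    pts_h = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    z = (pts_h @ world_to_target.T)[:, 2]
    keep = (z >= near) & (z < far)
    idx = ((z[keep] - near) / (far - near) * num_layers).astype(int)
    idx = np.minimum(idx, num_layers - 1)  # guard against rounding at the far plane
    kept = points[keep]
    return [kept[idx == i] for i in range(num_layers)]

# Toy input: one 4x4 depth map at 1.5 m with a single missing pixel.
K = np.array([[500.0, 0.0, 2.0], [0.0, 500.0, 2.0], [0.0, 0.0, 1.0]])
depth = np.full((4, 4), 1.5)
depth[0, 0] = 0.0
pts = unproject_depth(depth, K, np.eye(4))
layers = layer_points(pts, np.eye(4), num_layers=4, near=1.0, far=3.0)
```

With several cameras, the points from each unproject_depth call could be concatenated before layering. How the layers are then fused and rendered is specific to the method and is described in the paper.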

Quantitative Results

Free-Viewpoint Video Test Results

Target User (in training data), New Clothing (not in training data)

New User (not in training data)

New & Target Users

Comparisons

Comparisons with Microsoft VirtualCube, t-NeRF+DSNeRF, DynamicNeRF+DSNeRF, ENeRF, and ENeRF+Depth. Recent approaches suffer from sparse viewpoints with wide baselines.