Scene Representation Transformer: Geometry-Free Novel View Synthesis Through Set-Latent Scene Representations

A recent study proposes a method for interactive novel view synthesis. In this task, given a few RGB images of a previously unseen scene, novel views of the same scene are synthesized at interactive rates and without expensive per-scene processing.


The researchers employ an encoder-decoder model built on transformers to learn a scalable implicit scene representation, replacing explicit geometric operations with learned attention mechanisms. The proposed model generalizes more strongly and is more efficient than previous models, since global information is processed once per scene rather than once (or, for volumetric methods, hundreds of times) per rendered pixel.
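The efficiency argument above hinges on the decoder's use of attention instead of geometric projection: each query ray simply cross-attends into the fixed set-latent scene representation. The following is a minimal numpy sketch of that idea, not the paper's implementation; all dimensions, weight matrices, and the single-head formulation are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(query, latents, w_q, w_k, w_v):
    """Single-head cross-attention: one query-ray token attends into
    the set-latent scene representation (no geometric projection)."""
    q = query @ w_q                                  # (d_k,)
    k = latents @ w_k                                # (n_tokens, d_k)
    v = latents @ w_v                                # (n_tokens, d_v)
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))   # (n_tokens,) weights
    return attn @ v                                  # (d_v,) ray feature

# Illustrative sizes; the real model uses learned, much larger weights.
rng = np.random.default_rng(0)
d_latent, d_query, d_k, d_v, n_tokens = 64, 32, 16, 16, 128
latents = rng.normal(size=(n_tokens, d_latent))  # set-latent scene rep.
ray = rng.normal(size=d_query)                   # encoded query ray
w_q = rng.normal(size=(d_query, d_k))
w_k = rng.normal(size=(d_latent, d_k))
w_v = rng.normal(size=(d_latent, d_v))

feature = cross_attend(ray, latents, w_q, w_k, w_v)
print(feature.shape)  # (16,)
```

The key property is that `latents` is computed once per scene by the encoder; rendering a new view only repeats the cheap `cross_attend` step per ray.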

Evaluation of the model on several datasets shows that it is scalable to complex scenes, robust to noisy camera poses, and efficient in interactive applications. It can be useful for the virtual exploration of urban spaces and other AR/VR applications.

A classical problem in computer vision is to infer a 3D scene representation from a few images that can be used to render novel views at interactive rates. Previous work focuses on reconstructing pre-defined 3D representations, e.g. textured meshes, or implicit representations, e.g. radiance fields, and often requires input images with precise camera poses and long processing times for each novel scene.
In this work, we propose the Scene Representation Transformer (SRT), a method that processes posed or unposed RGB images of a new area, infers a “set-latent scene representation”, and synthesizes novel views, all in a single feed-forward pass. To calculate the scene representation, we propose a generalization of the Vision Transformer to sets of images, enabling global information integration, and hence 3D reasoning. An efficient decoder transformer parameterizes the light field by attending into the scene representation to render novel views. Learning is supervised end-to-end by minimizing a novel-view reconstruction error.
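The end-to-end supervision described above reduces to a plain reconstruction loss on rendered pixels, with each query ray embedded before decoding. The sketch below illustrates one common choice for such embeddings (Fourier features of ray origin and direction) and the mean-squared reconstruction error; the encoding scheme and all sizes here are assumptions for illustration, not the paper's exact design.

```python
import numpy as np

def fourier_encode(x, n_freqs=4):
    """Fourier positional encoding: sin/cos of the input at octave
    frequencies, a common way to embed ray origins and directions."""
    freqs = 2.0 ** np.arange(n_freqs) * np.pi        # (n_freqs,)
    angles = x[..., None] * freqs                    # (..., dim, n_freqs)
    return np.concatenate([np.sin(angles), np.cos(angles)],
                          axis=-1).reshape(*x.shape[:-1], -1)

def reconstruction_loss(pred_rgb, target_rgb):
    """End-to-end training signal: mean squared novel-view pixel error."""
    return float(np.mean((pred_rgb - target_rgb) ** 2))

# One query ray: origin and (unnormalized) viewing direction.
origin = np.array([0.0, 0.0, 2.0])
direction = np.array([0.1, -0.2, -1.0])
query = np.concatenate([fourier_encode(origin), fourier_encode(direction)])
print(query.shape)  # (48,): 3 dims x 4 freqs x {sin, cos} x 2 vectors
```

Because the loss is an ordinary pixel-wise error, no ground-truth geometry or depth is needed, which is what makes the method "geometry-free".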
We show that this method outperforms recent baselines in terms of PSNR and speed on synthetic datasets, including a new dataset created for the paper. Further, we demonstrate that SRT scales to support interactive visualization and semantic segmentation of real-world outdoor environments using Street View imagery.

Research paper: Sajjadi, M. S. M., et al., “Scene Representation Transformer: Geometry-Free Novel View Synthesis Through Set-Latent Scene Representations”, 2021. Link: