LargeSpatialModel (LSM)

Description

Large Spatial Model (LSM) is a computer vision approach that reconstructs and understands 3D structure from a small number of unposed images. Traditional methods for dense reconstruction typically rely on complex mappings between different data representations, which leads to long processing times and considerable engineering complexity.

LSM, by contrast, processes unposed RGB images directly into semantic radiance fields in real time. Because it estimates geometry, appearance, and semantics simultaneously in a single feed-forward pass, LSM can synthesize versatile label maps and support language-based interaction at novel views. It does this by integrating global geometry through pixel-aligned point maps and by improving spatial attribute regression through local context aggregation with multi-scale fusion.
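To make the term "pixel-aligned point map" concrete, the sketch below builds one from a depth map: the result has the same height and width as the image, and each pixel stores a 3D point. This is only an illustration of the data layout. LSM regresses its point maps with a Transformer and does not require known camera parameters, whereas this toy version assumes a pinhole camera with hypothetical intrinsics `fx`, `fy`, `cx`, `cy`, and the function name `depth_to_point_map` is invented for this example.

```python
def depth_to_point_map(depth, fx, fy, cx, cy):
    """Unproject a depth map into a pixel-aligned point map.

    `depth` is a list of rows (H x W). The result keeps the same H x W layout,
    with an (x, y, z) camera-space point stored at every pixel.
    Assumption: a pinhole camera with intrinsics fx, fy, cx, cy.
    """
    point_map = []
    for v, row in enumerate(depth):
        out_row = []
        for u, z in enumerate(row):
            x = (u - cx) * z / fx  # horizontal offset from the principal point
            y = (v - cy) * z / fy  # vertical offset from the principal point
            out_row.append((x, y, z))
        point_map.append(out_row)
    return point_map


# Toy 2x3 depth map with hypothetical intrinsics.
depth = [[2.0, 2.0, 2.0],
         [4.0, 4.0, 4.0]]
pm = depth_to_point_map(depth, fx=2.0, fy=2.0, cx=1.0, cy=0.5)
```

Because the map is pixel-aligned, any per-pixel attribute (color, semantic feature) can be attached to its 3D point by simple indexing with the same `(v, u)` coordinates.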

One of the key innovations of LSM is lifting the features of a pre-trained 2D language-based segmentation model into a 3D-consistent semantic feature field. This allows natural language-driven scene manipulation and enables supervised end-to-end learning through an efficient decoder that parameterizes semantic anisotropic Gaussians.
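As a rough illustration of how a language query could select parts of a scene, the sketch below compares a text embedding with a semantic feature stored on each Gaussian and keeps those above a similarity threshold. Everything here is an assumption made for the example: the feature vectors are toy values, the function names and the `threshold` parameter are hypothetical, and in practice the text embedding would come from the text encoder paired with the 2D segmentation model rather than being hard-coded.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def query_gaussians(features, text_embedding, threshold=0.8):
    """Return indices of Gaussians whose semantic feature matches the query.

    `threshold` is a hypothetical cut-off chosen for this toy example.
    """
    return [i for i, f in enumerate(features)
            if cosine(f, text_embedding) >= threshold]


# Toy per-Gaussian semantic features (3-D here; real features are much larger).
features = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
]
text_embedding = [1.0, 0.0, 0.0]  # stand-in for an encoded text prompt
selected = query_gaussians(features, text_embedding)
```

The selected indices could then drive an edit such as recoloring or removing the matching Gaussians before rendering a novel view.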

Comprehensive experiments on various tasks demonstrate that LSM unifies multiple 3D vision tasks directly from unposed images, achieving real-time semantic 3D reconstruction for the first time. The model uses a generic Transformer to regress pixel-aligned point maps, predicts point-based scene parameters from them, and is trained by minimizing a loss computed against ground-truth data.

Method Overview

LSM takes the input images and regresses pixel-aligned point maps with a generic Transformer. A second Transformer then predicts point-based scene parameters, using local context aggregation and hierarchical fusion. The model lifts pre-trained 2D features into a consistent 3D feature field, which enables supervised end-to-end learning for real-time semantic 3D reconstruction.
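The point-based scene parameters describe anisotropic Gaussians. The text above does not spell out the parameterization, so the sketch below assumes the convention common in 3D Gaussian Splatting: each Gaussian has a rotation (unit quaternion) and three per-axis scales, and its covariance is built as R S Sᵀ Rᵀ, which is symmetric and positive semi-definite by construction. The function names are hypothetical and this is not LSM's actual decoder.

```python
import math


def quat_to_rotation(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = q
    n = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / n, x / n, y / n, z / n  # normalize to a unit quaternion
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


def gaussian_covariance(q, scale):
    """Build the covariance Sigma = R S S^T R^T from a rotation and scales."""
    R = quat_to_rotation(q)
    # M = R @ diag(scale): scale each column of R by the matching axis scale.
    M = [[R[i][j] * scale[j] for j in range(3)] for i in range(3)]
    # Sigma = M @ M^T
    return [[sum(M[i][k] * M[j][k] for k in range(3)) for j in range(3)]
            for i in range(3)]


# Identity rotation: the covariance is diagonal with the squared scales.
sigma_axis = gaussian_covariance((1.0, 0.0, 0.0, 0.0), (1.0, 2.0, 3.0))

# 90-degree rotation about z: the x and y extents swap.
half = math.sqrt(0.5)
sigma_rot = gaussian_covariance((half, 0.0, 0.0, half), (1.0, 2.0, 3.0))
```

Predicting a rotation and scales instead of the six free covariance entries means any network output maps to a valid Gaussian, which is one reason this factorization is popular for feed-forward regression.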

Conclusion

Large Spatial Model offers a streamlined approach to 3D reconstruction from unposed images. By integrating global geometry, local context aggregation, and language-driven scene manipulation, LSM achieves real-time semantic 3D reconstruction and unifies multiple 3D vision tasks in a single feed-forward pass.

Visit largespatialmodel.github.io