
Introduction to 3D Gaussian Splatting

Project Mentors: Ilke Demir, Sainan Liu and Alexey Soupikov

Among novel view synthesis methods, Neural Radiance Fields, or NeRFs, revolutionized implicit 3D scene representations by optimizing a Multi-Layer Perceptron (or MLP) using volumetric ray-marching [1]. While the continuity in these methods helps the optimization procedure, achieving high visual quality requires neural networks that are costly to train and render.

3D Gaussian Splatting for Real-Time Radiance Field Rendering

A recent work [2] addresses this issue by proposing a 3D Gaussian representation that achieves equal or better quality than previous implicit radiance field approaches. The method builds on three main components: (1) differentiable 3D Gaussian splatting initialized from Structure-from-Motion (SfM) points, (2) optimization with adaptive density control of the 3D Gaussians, and (3) a fast differentiable rasterizer for real-time rendering. It demonstrates state-of-the-art visual quality and real-time rendering on several established datasets.

1. Differentiable 3D Gaussian Splatting

As with previous NeRF-like methods, the 3D Gaussian splatting method takes as input a set of images of a static scene together with the corresponding cameras calibrated by Structure-from-Motion (SfM) [3]. SfM produces a sparse point cloud without normals, which is used to initialize a set of 3D Gaussians. Following Zwicker et al. [4], the Gaussians are defined as

\(G(x)=e^{-\frac{1}{2}(x)^T \Sigma^{-1}(x)}, \) (1)

where \(\Sigma\) is a full 3D covariance matrix defined in world space [4] and the Gaussian is centered at the point (mean) \(\mu\). During blending, this Gaussian is multiplied by the per-Gaussian opacity \(\alpha\).
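To make Eq. (1) concrete, here is a minimal NumPy sketch that evaluates an anisotropic 3D Gaussian at a point; the function and variable names are ours for illustration, and the Gaussian is written explicitly around its mean \(\mu\).

```python
import numpy as np

def gaussian_3d(x, mu, sigma):
    """Unnormalized 3D Gaussian G(x) = exp(-0.5 (x - mu)^T Sigma^{-1} (x - mu))."""
    d = x - mu
    return np.exp(-0.5 * d @ np.linalg.inv(sigma) @ d)

# Illustrative example: an axis-aligned ellipsoidal Gaussian at the origin.
mu = np.array([0.0, 0.0, 0.0])
sigma = np.diag([0.04, 0.01, 0.09])
print(gaussian_3d(np.array([0.1, 0.0, 0.0]), mu, sigma))
```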

Since the 3D Gaussians must be projected to 2D for rendering, we follow Zwicker et al. [4] and define the projection to image space. Given a viewing transformation \(W\), the covariance matrix \(\Sigma^{\prime}\) in camera coordinates can be written as

\(\Sigma^{\prime}=J W \Sigma W^T J^T, \) (2)

where \(J\) is the Jacobian of the affine approximation of the projective transformation.
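As a rough illustration of Eq. (2), the sketch below projects a 3D covariance into a 2D screen-space covariance. It assumes \(W\) is the 3x3 rotation part of the view matrix, \(t\) is the Gaussian mean in camera coordinates, and \(f_x, f_y\) are focal lengths; the Jacobian follows the local affine approximation of Zwicker et al. [4], and all names are illustrative rather than the paper's implementation.

```python
import numpy as np

def project_covariance(sigma, W, t, fx, fy):
    """Project a 3D covariance to a 2D screen-space covariance (Eq. 2)."""
    # Jacobian of the affine approximation of the projective transformation,
    # evaluated at the Gaussian mean t = (t_x, t_y, t_z) in camera space.
    J = np.array([
        [fx / t[2], 0.0,       -fx * t[0] / t[2] ** 2],
        [0.0,       fy / t[2], -fy * t[1] / t[2] ** 2],
        [0.0,       0.0,        0.0],
    ])
    sigma_prime = J @ W @ sigma @ W.T @ J.T
    return sigma_prime[:2, :2]  # keep the 2D (screen-space) part
```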

The covariance matrix \(\Sigma\) of a 3D Gaussian can be interpreted as describing the configuration of an ellipsoid. Given a scaling matrix \(S\) and a rotation matrix \(R\), the corresponding \(\Sigma\) is

\(\Sigma=R S S^T R^T.\) (3)

This representation of anisotropic covariance allows the optimization of 3D Gaussians to adapt to the geometry of different shapes in captured scenes, resulting in a fairly compact representation.
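A small sketch of Eq. (3), assuming the rotation is already given as a matrix (the original method stores a quaternion and a 3D scale vector per Gaussian and optimizes those directly, which keeps \(\Sigma\) a valid covariance throughout optimization):

```python
import numpy as np

def covariance_from_scaling_rotation(s, R):
    """Assemble Sigma = R S S^T R^T from a scale vector s and rotation matrix R."""
    S = np.diag(s)                # scaling matrix
    return R @ S @ S.T @ R.T      # positive semi-definite by construction
```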

2. Optimization with Adaptive Density Control of 3D Gaussians

2.1 Optimization of Gaussians

The optimization step creates a dense set of 3D Gaussians representing the scene for free-view synthesis. In addition to the position \(p\), opacity \(\alpha\), and covariance \(\Sigma\), the spherical harmonics (SH) coefficients representing the color \(c\) of each Gaussian are also optimized to correctly capture the view-dependent appearance of the scene. The optimization of these parameters is interleaved with steps that control the density of the Gaussians to better represent the scene.

The optimization is performed through successive iterations of rendering and comparing the resulting image against the training views of the static scene. The loss function combines an \(\mathcal{L}_1\) term with a D-SSIM term:

\(\mathcal{L}=(1-\lambda) \mathcal{L}_1+\lambda \mathcal{L}_{\text {D-SSIM }}. \) (4)
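A minimal PyTorch-style sketch of Eq. (4); `ssim_fn` is a placeholder for any differentiable SSIM implementation (the official code ships its own), and \(\lambda = 0.2\) follows the paper.

```python
import torch

def total_loss(rendered, gt, ssim_fn, lam=0.2):
    """Eq. (4): (1 - lambda) * L1 + lambda * D-SSIM."""
    l1 = torch.abs(rendered - gt).mean()   # L1 term
    d_ssim = 1.0 - ssim_fn(rendered, gt)   # D-SSIM term
    return (1.0 - lam) * l1 + lam * d_ssim
```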

2.2 Adaptive Control of Gaussians

The optimization step needs to be able to create geometry, and also to destroy or move geometry when needed. The adaptive control provides a scheme to densify the Gaussians: after an optimization warm-up, densification is performed every 100 iterations. In addition, any Gaussian that is essentially transparent, with \(\alpha < \epsilon_\alpha\), and therefore contributes little to the representation, is removed.

Figure 1 demonstrates the densification procedure for the adaptive control of Gaussians.

Simply put, when small-scale geometry is not sufficiently covered (under-reconstruction), the densification procedure clones the corresponding Gaussian. In the opposite case, when small-scale geometry is represented by one large splat (over-reconstruction), the Gaussian is split in two. Optimization then continues with the new set of Gaussians.
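The sketch below summarizes this control loop in schematic Python; the attribute and helper names (`view_space_grad`, `clone_towards_gradient`, `split`) are illustrative stand-ins, not the method's actual API.

```python
def densify_and_prune(gaussians, grad_threshold, scale_threshold, alpha_min):
    """Schematic adaptive density control: prune, clone, or split Gaussians."""
    for g in list(gaussians):
        if g.alpha < alpha_min:
            # Essentially transparent Gaussians contribute little: remove them.
            gaussians.remove(g)
        elif g.view_space_grad > grad_threshold:
            if g.scale.max() < scale_threshold:
                # Under-reconstruction: clone the small Gaussian toward the gradient.
                gaussians.append(g.clone_towards_gradient())
            else:
                # Over-reconstruction: replace one large splat by two smaller ones.
                gaussians.extend(g.split(num=2))
                gaussians.remove(g)
```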

3. Fast Differentiable Rasterizer for Gaussians

Tiling and camera view frustum. The method first splits the screen into \(16\times16\) tiles and then culls the 3D Gaussians against the camera view frustum and each tile. Only the Gaussians whose 99% confidence interval intersects the view frustum are kept. In addition, Gaussians at extreme positions, such as those with means close to the near plane or far outside the view frustum, are rejected.

Sorting. Each Gaussian is instantiated according to the number of tiles it overlaps. Each instance is assigned a key that combines its view-space depth and tile ID, and the keys are sorted with a single fast GPU Radix sort.

Rasterization. Finally, a list for each tile is produced using the first and last depth-sorted entries that splat to that tile. Rasterization is then performed in parallel, with one thread block launched per tile. The threads first load the Gaussians of the tile into shared memory and then, for a given pixel, accumulate color and \(\alpha\) values by traversing the list front to back.

This procedure enables fast overall rendering and fast sorting, allowing approximate \(\alpha\)-blending while avoiding hard limits on the number of splats that can receive gradients.
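To illustrate the sorting and blending order, here is a CPU-side sketch: a 64-bit key packs the tile ID into the high bits and the view-space depth bits into the low bits, so a single sort yields per-tile front-to-back lists; per-pixel blending then accumulates color while the transmittance decays. The real implementation is a CUDA kernel, and the structures below (e.g. `alpha_at`, `color`) are illustrative.

```python
import struct

def make_key(tile_id, depth):
    """64-bit sort key: tile ID in the upper 32 bits, float depth bits below."""
    depth_bits = struct.unpack("<I", struct.pack("<f", depth))[0]
    return (tile_id << 32) | depth_bits

def blend_tile_pixel(sorted_gaussians, pixel):
    """Front-to-back alpha blending for one pixel, given depth-sorted Gaussians."""
    color, transmittance = 0.0, 1.0
    for g in sorted_gaussians:
        a = g.alpha_at(pixel)              # opacity weighted by the 2D Gaussian
        color += transmittance * a * g.color
        transmittance *= (1.0 - a)
        if transmittance < 1e-4:           # early termination once nearly opaque
            break
    return color
```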

Our Results

We test the 3D Gaussian splatting method on the recently proposed Neu3D dataset [5]. Note that the original 3D Gaussian splatting work is limited to static scenes, while the Neu3D dataset provides dynamic scenes over time. Therefore, we use the initial frame from each camera view of the flaming salmon scene as input, which corresponds to 19 camera views in total. We first execute the SfM method, in this case COLMAP, on these frames to obtain the camera poses as discussed in Sec. 1. Next, we optimize the set of 3D Gaussians for 30K iterations.

Quantitative Evaluation: Metrics and Runtime

After 30K iterations of training, we report the metrics computed for the 3D Gaussian splatting method. The \(L_1\) loss corresponds to the average absolute difference between the training images and the rendered images, as in Sec. 2.1 (Eq. 4). PSNR, or Peak Signal-to-Noise Ratio, is the ratio between the maximum possible power of a signal and the power of the corrupting noise that affects the fidelity of its representation, measured in decibels. The memory row shows how much storage space the resulting representation (PLY file) occupies.

L1 loss [30K]: 0.0074
PSNR [30K]: 37.8426 dB
Memory [PLY]: 162.5 MB

Table 1. Metrics for 3D Gaussian splatting on the flaming salmon scene of Neu3D.
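For reference, the PSNR reported in Table 1 relates directly to the mean squared error between rendered and ground-truth images; a minimal sketch, assuming pixel values in [0, 1]:

```python
import numpy as np

def psnr(rendered, gt, max_val=1.0):
    """Peak Signal-to-Noise Ratio in decibels."""
    mse = np.mean((rendered - gt) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)
```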

We additionally report the runtime for the first two components, i.e., executing the SfM method to initialize the Gaussians and optimizing the Gaussians to obtain the resulting scene representation. The third component, rasterization, is performed in real time.

COLMAP – Elapsed time: 0.013 minutes
Training splats [30K]: 10 minutes

Table 2. Runtime for 3D Gaussian splatting on the flaming salmon scene of Neu3D.

Qualitative Evaluation

We demonstrate the results from the model trained on the initial camera frames of the flaming salmon scene of Neu3D. Figure 2 shows the real-time rendering results from the viewer as the camera moves through the scene.

Figure 2 shows the results of real-time rendering in the SIBR viewer.

Conclusion and Future Work

3D Gaussian splatting is the first approach that truly allows real-time, high-quality radiance field rendering while requiring training times competitive with the fastest previous methods. One of its greatest limitations is memory consumption: GPU memory usage can rise to ~20 GB during training, and the resulting representation incurs high storage costs. Second, the scenes that 3D Gaussian splatting handles are static and do not incorporate time as an additional dimension. Recent work addresses these limitations through quantization and through extensions of the Gaussian representation to model both space and time, which we will discuss in the next post.

References

[1] Mildenhall, B., Srinivasan, P. P., Tancik, M., Barron, J. T., Ramamoorthi, R., and Ng, R. NeRF: Representing scenes as neural radiance fields for view synthesis. ECCV 2020.

[2] Kerbl, B., Kopanas, G., LeimkĆ¼hler, T., & Drettakis, G. 3D Gaussian Splatting for Real-Time Radiance Field Rendering. ACM Trans. Graph. (SIGGRAPH) 2023.

[3] Snavely, N., Seitz, S. M., and Szeliski, R. Photo tourism: exploring photo collections in 3D. ACM Trans. Graph. (SIGGRAPH) 2006.

[4] Zwicker, M., Pfister, H., Van Baar, J., and Gross, M. EWA volume splatting. In Proceedings Visualization, VIS '01. IEEE, 2001.

[5] Li, T., Slavcheva, M., Zollhoefer, M., Green, S., Lassner, C., Kim, C., Schmidt, T., Lovegrove, S., Goesele, M., Newcombe, R., et al. Neural 3D video synthesis from multi-view video. CVPR 2022.