In the world of computer vision, Simultaneous Localization and Mapping (SLAM) is a critical technique used to create a map of an environment while simultaneously keeping track of a device’s position within it. Traditional SLAM methods rely heavily on geometric models and point cloud data to represent the environment. However, with the advent of deep learning and neural implicit representations, new approaches are revolutionizing SLAM.
One such approach is NICER-SLAM, which stands for Neural Implicit Scene Encoding for RGB SLAM. NICER-SLAM merges the power of Neural Radiance Fields (NeRFs) and RGB camera inputs to construct accurate 3D maps and deliver more robust tracking performance. This blog will take you through the key concepts, architecture, and applications of NICER-SLAM.
What is NICER-SLAM?
NICER-SLAM is a learning-based approach to RGB SLAM, which bypasses traditional point cloud generation and instead encodes the environment using implicit neural networks. Unlike conventional SLAM systems that rely on a feature-based pipeline, NICER-SLAM optimizes a neural network to represent the scene as a continuous function over space. This implicit representation of the scene significantly enhances the accuracy of mapping and tracking tasks using just RGB data.
Why Neural Implicit Representations?
Neural implicit representations, such as Neural Radiance Fields (NeRFs), have recently gained traction due to their ability to represent 3D scenes in a high-resolution, continuous manner without needing explicit 3D models like meshes or voxels. In the context of SLAM, this capability is highly advantageous because it allows for:
Compact scene encoding: Neural networks compress the scene data into a small, continuous model.
Smooth interpolation: Scenes can be smoothly reconstructed at any viewpoint, avoiding the discretization issues that arise in traditional 3D SLAM techniques.
Less reliance on depth sensors: NICER-SLAM performs SLAM operations with RGB cameras, reducing hardware complexity.
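To make the idea of an implicit representation concrete, here is a minimal sketch (not NICER-SLAM's actual code) of a scene stored as a continuous function: a signed-distance function for a unit sphere that can be queried at any point in space, with no mesh or voxel grid.

```python
import numpy as np

def scene(points):
    """A toy implicit scene: the signed distance to a unit sphere at the
    origin. Negative inside the surface, positive outside. Because the
    scene is a function, it is defined at ANY continuous point -- there
    is no mesh or voxel grid to discretize."""
    return np.linalg.norm(points, axis=-1) - 1.0

# query the same scene at arbitrary, off-grid locations
inside = scene(np.array([[0.1, 0.0, 0.2]]))   # negative: inside the sphere
outside = scene(np.array([[2.0, 0.0, 0.0]]))  # positive: outside the sphere
```

A neural implicit representation replaces this hand-written function with a trained network, but the interface – continuous coordinates in, scene properties out – is the same.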
How Does NICER-SLAM Work?
NICER-SLAM consists of a two-stage architecture: scene representation and pose estimation. Below is a breakdown of the key components of its pipeline:
1. Scene Representation via Neural Fields
In NICER-SLAM, the scene is encoded using Neural Radiance Fields (NeRFs), a popular method for implicit scene representation. NeRFs represent scenes as a continuous volumetric field, where every point in space is associated with a radiance value (the color) and density (how much light the point blocks).
NeRF-based Scene Model: NICER-SLAM trains a neural network to predict the color and density of points in 3D space, given a viewpoint.
Optimization: The network is optimized by minimizing the photometric error between the actual image captured by the camera and the rendered image generated by the model. This allows for reconstructing a high-fidelity 3D model from RGB data.
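As an illustration of the rendering step, here is a small NumPy sketch of NeRF-style volume rendering along a single ray; the function and sample values are hypothetical, not taken from NICER-SLAM's implementation.

```python
import numpy as np

def composite_ray(colors, densities, deltas):
    """Alpha-composite color/density samples along one ray, NeRF-style.

    colors:    (N, 3) RGB values predicted at each sample point
    densities: (N,)   non-negative volume densities
    deltas:    (N,)   distances between consecutive samples
    """
    alphas = 1.0 - np.exp(-densities * deltas)       # opacity of each sample
    trans = np.cumprod(1.0 - alphas + 1e-10)         # light left AFTER each sample
    trans = np.concatenate([[1.0], trans[:-1]])      # light reaching each sample
    weights = trans * alphas
    return (weights[:, None] * colors).sum(axis=0)   # rendered pixel color

# a transparent sample in front of a dense red one: the pixel comes out red
colors = np.array([[0.0, 0.0, 1.0],   # blue, but fully transparent
                   [1.0, 0.0, 0.0]])  # red, nearly opaque
densities = np.array([0.0, 50.0])
deltas = np.array([0.1, 0.1])
pixel = composite_ray(colors, densities, deltas)
```

Rendering many such rays yields a full image; the photometric error is then simply the difference between that image and the camera frame.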
2. Pose Estimation and Tracking
To achieve localization, NICER-SLAM tracks the camera’s pose by continuously adjusting the position of the camera with respect to the environment. It employs a learning-based pose estimation network, which uses the encoded scene and the camera images to predict accurate camera poses.
Pose Optimization: The camera pose is iteratively refined by minimizing the error between the projected 3D model and the observed RGB frames, ensuring precise tracking even in challenging environments.
Differentiable Rendering: The system uses differentiable rendering to compute the gradients that guide the optimization of the scene representation and pose estimation together.
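To see what minimizing photometric error looks like in code, here is a deliberately tiny 1-D sketch (my own illustration, not the NICER-SLAM pipeline): the camera pose is a single shift parameter, the "renderer" is a sine image, and finite differences stand in for the gradients that differentiable rendering would supply.

```python
import numpy as np

def photometric_error(pose, observed, render):
    """Mean squared difference between the rendered and observed image."""
    return np.mean((render(pose) - observed) ** 2)

def refine_pose(pose, observed, render, lr=0.5, steps=200):
    """Iteratively refine a 1-D pose by gradient descent on the
    photometric error; central finite differences approximate the
    gradient that differentiable rendering would provide analytically."""
    eps = 1e-4
    for _ in range(steps):
        grad = (photometric_error(pose + eps, observed, render)
                - photometric_error(pose - eps, observed, render)) / (2 * eps)
        pose -= lr * grad
    return pose

# toy "renderer": a 1-D image whose content shifts with the camera pose
xs = np.linspace(0.0, 2.0 * np.pi, 100)
render = lambda pose: np.sin(xs + pose)
observed = render(0.5)                    # image taken at ground-truth pose 0.5
est = refine_pose(0.0, observed, render)  # recover the pose from images alone
```

Starting from pose 0.0, the descent recovers the ground-truth pose of 0.5 from the images alone, which is the essence of photometric pose refinement.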
Architecture Overview
NICER-SLAM architecture
The architecture of NICER-SLAM can be divided into three primary sections:
Neural Implicit Encoder: The neural network that processes RGB inputs to encode the scene into a continuous neural field.
NeRF Scene Representation: Neural Radiance Fields that store the scene in a compact and implicit form.
Pose Estimator: A learning-based network that estimates the camera pose by leveraging both scene geometry and RGB image information.
Rendering Results
Below is an example of the rendering results obtained using NICER-SLAM. These results showcase the system’s ability to create detailed, high-resolution maps of the environment with smooth surfaces and accurate color reproduction.
Rendering result on the self-captured outdoor dataset
Applications of NICER-SLAM
The innovations brought by NICER-SLAM open up possibilities in several fields:
Autonomous Robotics: Robots can perform high-precision navigation and mapping in unknown environments without relying on depth sensors.
Augmented Reality (AR): NICER-SLAM can enable smoother and more accurate scene reconstructions for AR applications, improving the visual fidelity and interaction with virtual objects.
3D Reconstruction: NICER-SLAM’s ability to create dense and continuous 3D maps makes it a strong candidate for tasks in 3D scanning and modeling.
References:
Zhu, Z., et al. (2024). NICER-SLAM: Neural Implicit Scene Encoding for RGB SLAM. Proceedings of the International Conference on 3D Vision (3DV).
Mildenhall, B., et al. (2020). NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. Proceedings of the European Conference on Computer Vision (ECCV).
Zhu, Z., et al. (2021). Learning-Based SLAM: Progress, Applications, and Future Directions. IEEE Robotics and Automation Magazine.
Klein, G., & Murray, D. (2007). Parallel Tracking and Mapping for Small AR Workspaces. Proceedings of the IEEE/ACM International Symposium on Mixed and Augmented Reality (ISMAR).
This is the second part of our two-part post series, prepared by Krishna and me. Krishna presented an overview of SLAM systems in a very intuitive and engaging way. In this part, I explore the future of SLAM systems in endoscopy and how our team plans to shape it.
Collaborators: Krishna Chebolu
Introduction
What about the future of SLAM endoscopy systems? To get insight into where research is heading, we must first discuss the challenges posed by the task of localising an agent and mapping such a difficult environment, as well as the weaknesses of current systems.
On one hand, the environment inside the human body, coupled with data/device heterogeneity and image quality issues, significantly hinders the performance of endoscopy SLAM systems [1], [2] due to:
1) Texture scarceness and scale ambiguity
2) Illumination variation
3) Bodies (foreign or not), fluids and their movement (e.g., mucus, mucosal movement)
4) Underlying scene dynamics (e.g., imminent corruption of frames with severe artefacts, large organ motion and surface drifts)
5) Data heterogeneity (e.g., population diversity, rare or inconspicuous disease cases, variability in disease appearances from one organ to the other, endoscope variability)
6) Differences between device manufacturers
7) Input of experts being required for their reliable development
8) The organ preparation process
9) Additional imaging quality issues (e.g., free/non-uniform hand motions and organ movements, different image modalities)
10) Real-time performance (the speed and accuracy trade-off)
Current research on endoscopic SLAM systems mainly focuses on the first three of the aforementioned challenges: state-of-the-art pipelines focus on understanding depth despite the lack of texture, as well as on handling lighting changes and foreign bodies like mucus, which can be reflective or move and thus skew the mapping reconstruction.
Images 1, 2, 3: The images above showcase the three main problems that skew tissue structure understanding and hinder the mapping performance of SLAM systems in endoscopy: (1) foreign bodies that are reflective, (2) lighting variations and (3) lack of texture. Image credits: [3], [3], [4].
On the other hand, we must pinpoint where the weaknesses of such systems lie. The three main modules of AI endoscopy systems that operate on image data are Simultaneous Localization and Mapping (SLAM), Depth Estimation (DE) and Visual Odometry (VO), with the last two being submodules of broader SLAM systems. SLAM is a computational method that enables a device to map its environment while simultaneously determining its own position within that map; this is often achieved via VO, a technique that estimates the camera’s position and trajectory by examining changes across a series of images. Depth estimation is the process of determining the distance between the camera and the objects in its view by analysing visual information from one or more images; it is crucial for SLAM to accurately map the environment in three dimensions and understand its surroundings more effectively. Attempting to use general-purpose SLAM systems on endoscopy data clearly shows that DE and map reconstruction underperform, while localisation/VO is sufficiently captured. This conclusion was reached based on initial experiments; however, further investigation is warranted.
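To make the role of depth estimation in mapping concrete, here is a generic pinhole-camera sketch (not code from any of the systems discussed) showing how a per-pixel depth map is lifted into the 3D point cloud a SLAM system maps with; the intrinsics fx, fy, cx, cy are illustrative values.

```python
import numpy as np

def backproject(depth, fx, fy, cx, cy):
    """Lift a depth map (H, W) into a camera-frame point cloud (H*W, 3)
    using the pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

# a flat wall 2 m in front of a toy 4x4-pixel camera
depth = np.full((4, 4), 2.0)
cloud = backproject(depth, fx=2.0, fy=2.0, cx=1.5, cy=1.5)
```

Errors in the estimated depth propagate directly into this cloud, which is why weak DE degrades the reconstructed map even when localisation/VO is accurate.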
Though the challenges and system weaknesses that current research aims to address are critical aspects of the models’ usability and performance, there is still a wide gap between the curated settings under which these models perform and real-world clinical settings. Clinical applications are still uncommon, due to the lack of holistic and representative datasets, in conjunction with limited participation of clinical experts. This leads to models that lack generalisability; widely used supervised techniques are data-voracious and require many human annotations, which, apart from being scarce, are often imperfect or overfitted to predominant samples in cohorts. Novel deep learning methods should be steered towards training on diverse endoscopic datasets, the introduction of explainability of results and the interpretability of models, all of which are required to accelerate this field. Finally, suitable evaluation metrics (i.e. generalisability assessments and robustness tests) should be defined to determine the strength of developed methods with regard to clinical translation.
For a future of advanced and applicable AI endoscopy systems, the directions are clear, as discussed in [1]:
1) Endoscopy-specific solutions must be developed, rather than just applying pipelines from the computer vision field
2) Robustness and generalisation evaluation metrics of the developed solutions must be defined to set the standard to assess and compare model performance
3) Practicability, compactness and real-time effectiveness should also be quantified
4) More challenging problems should be explored (subtle lesions instead of apparent lesions)
5) The developed models should be able to adapt to datasets produced in different clinics, using different endoscopes, in the context of varying manifestations of diseases
6) Multi-modal and multi-scale integration of data should be incorporated in these systems
7) Clinical validation is necessary to steadily integrate these systems in the clinical process
Method
But how do we envision the future of SLAM endoscopy systems?
Our team aims to directly address the issues of texture scarceness, illumination variation and the handling of foreign bodies, while indirectly combating some of the remaining challenges. Building upon state-of-the-art SLAM systems, which already handle localisation/VO sufficiently, we aim to further enhance their mapping process by integrating a state-of-the-art endoscopy monocular depth estimation pipeline [3] and by developing a module to understand lighting variations in the context of endoscopic image analysis. This module will have a corrective nature, automatically adjusting the lighting in the captured images to ensure that the visuals are clear and consistent. Potentially, it could also enhance image quality by adjusting brightness, contrast and other image parameters in real time, standardising the images across different frames of the endoscopy video. As the module’s task is to improve the visibility and consistency of image features, it would consequently also support the depth estimation process by providing clearer cues and contrast for accurate depth calculations by the endoscopy monocular depth estimation pipeline. Thus, the module would ensure a more consistent and refined input to the SLAM model, rather than raw endoscopy data, which suffer from inconsistencies and heterogeneities never seen before by the model. With the aforementioned integrations we aim to develop a specialised SLAM endoscopy system and test it in the context of clinical colonoscopy [5]. Ideally, the plan is to first train and test our pipeline on a curated dataset to assess its performance under controlled settings; it would then be of great interest to adjust each part of the pipeline to make it perform on real-world clinical data or across multiple datasets.
This will provide us with the opportunity to see where a state-of-the-art SLAM endoscopy system stands in the context of real-world applicability and help quantify and address the issues explored in the previous section.
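A first-cut version of the corrective lighting module described above could be as simple as per-frame brightness/contrast standardisation; the sketch below is an assumption-laden illustration (the target statistics are chosen arbitrarily), not our final design.

```python
import numpy as np

def normalize_frame(frame, target_mean=0.5, target_std=0.2):
    """Standardise a grayscale frame (values in [0, 1]) to a common
    brightness/contrast so consecutive endoscopy frames look consistent."""
    f = frame.astype(np.float64)
    std = f.std()
    if std < 1e-8:                      # flat frame: only shift brightness
        return np.clip(f - f.mean() + target_mean, 0.0, 1.0)
    out = (f - f.mean()) / std * target_std + target_mean
    return np.clip(out, 0.0, 1.0)

# two views of the same ramp pattern under different illumination
dark = np.linspace(0.0, 0.2, 100).reshape(10, 10)
bright = np.linspace(0.5, 0.9, 100).reshape(10, 10)
a, b = normalize_frame(dark), normalize_frame(bright)
```

Two frames of the same content under different illumination come out nearly identical after standardisation, which is exactly the consistency a downstream depth estimator and SLAM front end benefit from.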
Image 4: State-of-the-art clinical mesh reconstruction using the endoscopy monocular depth estimation pipeline [5].
Image 5: The endoscopy monocular depth estimation pipeline also achieves state-of-the-art depth estimation on endoscopy videos.
Colonoscopy Data
The type of endoscopy procedure we choose to develop our pipeline for is colonoscopy: a medical procedure that uses a flexible fibre-optic instrument equipped with a camera and light (a colonoscope) to examine the interior of the colon and rectum. More specifically, we select to work with the Colonoscopy 3D Video Dataset (C3VD) [5]. This dataset is significant because it provides high-quality ground truth data, obtained using a high-definition clinical colonoscope and high-fidelity colon models, creating a benchmark for computer vision pipelines. The study introduced a novel multimodal 2D-3D registration technique to register optical video sequences with ground truth rendered views of a known 3D model.
Video 1: C3VD dataset: Data from the colonoscopy camera (left) and depth estimation (right) extracted by a Generative Adversarial Network (GAN). Video credits: [5]
Conclusion
SLAM systems are the state of the art for localisation and mapping, and endoscopy is the gold-standard procedure for many hollow organs. Combining the two, we get a powerful medical tool that can not only improve patient care, but also be life-defining in some cases. Its use cases can be prognostic, diagnostic, monitoring and even therapeutic, ranging from, but not limited to: disease surveillance, inflammation monitoring, early cancer detection, tumour characterisation, resection procedures, minimally invasive treatment interventions and therapeutic response monitoring. With the development of SLAM endoscopy systems, the endoscopy surgeon has acquired a visual overview of various environments inside the human body that would otherwise be impossible to obtain. Because endoscopy is highly operator-dependent, with grim clinical outcomes in some disease cases, reliable and accurate automated guidance is imperative. Thus, in recent years, there has been a significant increase in the publication of endoscopic imaging-based methods within the fields of computer-aided detection (CADe), computer-aided diagnosis (CADx) and computer-assisted surgery (CAS). In the future, designed methods must be more generalisable to unseen noisy data, patient population variability and variable disease appearances, answering the multi-faceted challenges that the latest models fail to address under actual clinical settings.
This post concludes part 2 of What it Takes to Get a SLAM Dunk.
Image 6: Michael Jordan (considered by me and many as the G.O.A.T.) performing his most famous dunk. Image credits: ScienceABC
References
[1] Ali, S. Where do we stand in AI for endoscopic image analysis? Deciphering gaps and future directions. npj Digit. Med. 5, 184 (2022). https://doi.org/10.1038/s41746-022-00733-3
[2] Ali, S., Zhou, F., Braden, B. et al. An objective comparison of detection and segmentation algorithms for artefacts in clinical endoscopy. Sci. Rep. 10, 2748 (2020). https://doi.org/10.1038/s41598-020-59413-5
[3] Paruchuri, A., Ehrenstein, S., Wang, S., Fried, I., Pizer, S. M., Niethammer, M., & Sengupta, R. Leveraging near-field lighting for monocular depth estimation from endoscopy videos. In Proceedings of the European Conference on Computer Vision (ECCV), (2024). https://doi.org/10.48550/arXiv.2403.17915
[5] Bobrow, T. L., Golhar, M., Vijayan, R., Akshintala, V. S., Garcia, J. R., & Durr, N. J. Colonoscopy 3D video dataset with paired depth from 2D-3D registration. Medical Image Analysis, 102956 (2023). https://doi.org/10.48550/arXiv.2206.08903
In this two-part post series, Nicolas and I dive deeper into SLAM systems – our project’s focus for the past two weeks. In this part, I introduce and cover the evolution of SLAM systems. In the next part, Nicolas harnesses our interest by discussing the future. By the end of both parts, we should be able to give you an overview of What it Takes to Get a SLAM Dunk.
Collaborators: Nicolas Pigadas
Introduction
Simultaneous Localization and Mapping (SLAM) systems have become a standard in various technological fields, from autonomous robotics to augmented reality. However, in recent years, this technology has found a particularly unique application in medical imaging – in endoscopic videos. But what is SLAM?
Figure 1: A sample image using SLAM reconstruction from SG News Desk.
SLAM systems were conceptualized in robotics and computer vision for navigation purposes. Before SLAM, the fields employed more elementary methods:
Localization: the process of determining a device’s position and orientation within an environment, typically relative to a known map or reference.
Mapping: the process of creating a representation of an environment, typically in the form of a 2D or 3D map. This was done using grid-based and feature-based methods.
You may be thinking, Krishna, you just described SLAM systems, it sounds like. You are right, but the localizing and mapping were separate processes. So a robot would go through the pains of the Heisenberg principle, i.e., the robot would either localize or map – the or is exclusionary.
It was fairly obvious, yet still daunting, what the next step in research would be. Before we SLAM dunk our basketball, we must do a few lay-ups and free throws first.
Precursors to SLAM
Here are some inspirations that contributed to the development of SLAM:
Probabilistic robotics: The introduction of probabilistic approaches, such as Bayesian filtering, allowed robots to estimate their position and map the environment with a degree of uncertainty, paving the way for more integrated systems.
Kalman filtering: a mathematical technique for estimating the state of a dynamic system. It allowed for continuous estimation of a robot’s position and remained robust to noisy sensor data.
Cognitive Mapping in Animals: Research in cognitive science and animal navigation provided theoretical inspiration, particularly the idea that animals build mental maps of their environment while simultaneously keeping track of their location.
Figure 3: Spatial behavior and cognitive mapping of mice with aging. Image from Nature.
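As a flavour of the Kalman-filtering idea above, here is a minimal 1-D sketch (the constants are illustrative, not from any particular robot): noisy position readings are fused into an estimate whose uncertainty shrinks with each measurement.

```python
def kalman_1d(measurements, x0=0.0, p0=1.0, q=0.01, r=0.5):
    """Minimal 1-D Kalman filter for a (nearly) stationary robot.

    x0/p0: initial state estimate and its variance
    q:     process noise variance (how much the state may drift)
    r:     measurement noise variance (how noisy the sensor is)
    """
    x, p = x0, p0
    for z in measurements:
        p = p + q                # predict: uncertainty grows over time
        k = p / (p + r)          # Kalman gain: how much to trust the reading
        x = x + k * (z - x)      # update: move estimate toward the reading
        p = (1.0 - k) * p        # update: uncertainty shrinks
    return x, p

# noisy readings of a robot actually sitting at position 1.0
est, var = kalman_1d([1.2, 0.9, 1.1, 0.95, 1.05])
```

EKF-SLAM, discussed later, extends this same predict/update loop to the joint state of robot pose and map landmarks.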
SLAM Dunk – A Culmination (some real Vince Carter stuff)
Finally, many researchers agreed that the separation of localizing and mapping was ineffective, and great efforts went into their integration: SLAM was developed. The goal was to enable systems to explore and understand an unknown environment autonomously; to do so, they needed to localize and map the environment simultaneously, with each task informing and improving the other.
With its unique ability to localize and map, researchers found uses for SLAM in any sensory device. Some of SLAM’s earlier uses, in the late 80s and early 90s, were sensor-based, with data inputted from range finders, sonar, and LIDAR. It is good to note that the algorithms were computationally intensive – and still are.
As technology evolved, a vision-based SLAM emerged. This shift was inspired by the human visual system, which navigates the world primarily through sight, enabling more natural and flexible mapping techniques.
Key Milestones
With the latest iterations of SLAM being exponentially better than the originals, it is important to recognize the journey. Here are some notable SLAM systems:
EKF-SLAM (Extended Kalman Filter SLAM): One of the earliest and most influential SLAM algorithms, EKF-SLAM, laid the foundation for probabilistic approaches to SLAM, allowing for more accurate mapping and localization.
FastSLAM: Introduced in the early 2000s, FastSLAM utilized particle filters, making it more efficient and scalable. This development was crucial in enabling real-time SLAM applications.
Visual SLAM: The transition to vision-based SLAM in the mid-2000s opened new possibilities for the technology. Visual SLAM systems, such as PTAM (Parallel Tracking and Mapping), enabled more detailed and accurate mapping using standard cameras, a significant step toward broader applications.
Figure 4: Left LSD-SLAM, right ORB-SLAM. Image found in fzheng.me
From Robotics to Endoscopy (Medical Vision)
As SLAM technology matured, researchers explored its potential beyond traditional robotics. Medical imaging, particularly endoscopy, presented a fantastic opportunity for SLAM. Endoscopy is a medical procedure involving a flexible tube with a camera to visualize the body’s interior, often within complex and dynamic environments like the gastrointestinal tract.
It is fairly trivial why SLAM could be applied to endoscopic and endoscopy-like procedures to gain insights and make more medically informed decisions. Early work focused on using visual SLAM to navigate the gastrointestinal tract, where the narrow and deformable environment presented significant challenges.
One of the first successful implementations involved using SLAM to reconstruct 3D maps of the colon during colonoscopy procedures. This approach improved navigation accuracy and provided valuable information for diagnosing conditions like polyps or tumors.
Researchers also explored the integration of SLAM with other technologies, such as optical coherence tomography (OCT) and ultrasound, to enhance the quality of the maps and provide additional layers of information. These efforts laid the groundwork for more advanced SLAM systems capable of handling the complexities of real-time endoscopic navigation.
Figure 6: Visual of Optical Coherence Tomography from News-Medical.
Endoscopy SLAMs – What Our Group Looked At
As a part of our study, we looked at some presently used and state-of-the-art SLAM systems. Below are the three that various members of our team attempted:
NICER-SLAM (RGB): a dense RGB SLAM system that simultaneously optimizes for camera poses and a hierarchical neural implicit map representation, which also allows for high-quality novel view synthesis.
ORB-SLAM3 (RGB): (there are also ORB-SLAM and ORB-SLAM2) ORB-SLAM3 is the first real-time SLAM library able to perform Visual, Visual-Inertial, and Multi-Map SLAM with monocular, stereo, and RGB-D cameras, using pin-hole and fisheye lens models. In all sensor configurations, ORB-SLAM3 is as robust as the best systems available in the literature and significantly more accurate.
DROID-SLAM (RGB): a deep learning-based SLAM system. DROID-SLAM consists of recurrent iterative updates of camera pose and pixel-wise depth through a Dense Bundle Adjustment layer.
Some other SLAM systems that our team would have loved to try our hand at are:
Gaussian Splatting SLAM: first application of 3D Gaussian Splatting in monocular SLAM, the most fundamental but the hardest setup for Visual SLAM.
GlORIE-SLAM: Globally Optimized RGB-only Implicit Encoding Point Cloud SLAM. This system uses a deformable point cloud as the scene representation and achieves lower trajectory error and higher rendering accuracy compared to competitive approaches.
This concludes part 1 of What it Takes to Get a SLAM Dunk. This post should have given you a gentle, but robust-enough introduction to SLAM systems. Vince Carter might even approve.
Figure 9: An homage to Vince Carter, arguably the greatest dunk-er ever. Image from Bleacher Report.