RobotForge
Published·~13 min

Modern SLAM: learned features and Gaussian splatting

SuperPoint, DROID-SLAM, Gaussian splats — the deep-learning wave reshaping the SLAM landscape. What's new, what classical methods still beat, and where the production frontier is in 2026.

by RobotForge
#slam #deep-learning #modern

For 30 years SLAM was geometry: features + matching + optimization, all hand-crafted. Since 2018 deep learning has been eating piece after piece of the pipeline. Learned features challenge ORB. Learned matchers replace raw descriptor distance. End-to-end learned SLAM systems exist. And Gaussian splatting brought differentiable, photorealistic mapping to the field. Here's what's actually in production in 2026.

[Figure: keypoint detections, classical (ORB) vs. learned (SuperPoint).] Classical detectors fire on geometric corners only; learned detectors find repeatable points across textures and edges as well, giving more correspondences in featureless or wide-baseline scenes.

The four lines of attack

  1. Replace components: keep the classical pipeline; swap in learned features, matchers, depth estimators.
  2. End-to-end SLAM: train a neural network that ingests video and outputs trajectory + map.
  3. Neural fields for mapping: replace point clouds with NeRF or Gaussian splat representations.
  4. Foundation-model SLAM: leverage VLMs and diffusion models as priors over scene structure.

1. Component-level replacements

SuperPoint (2018) and SuperGlue (2020)

SuperPoint: a CNN that detects keypoints + computes descriptors in one forward pass. Significantly more repeatable than ORB across viewpoint, lighting, and weather changes.

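SuperGlue: a graph-neural-network matcher. Given the keypoints and descriptors from two images (typically SuperPoint's), it predicts a partial assignment: each keypoint is matched to at most one keypoint in the other image, or left unmatched. Robust to occlusion, viewpoint change, low texture.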

Practical impact: SuperPoint + SuperGlue beats ORB + brute-force matching by ~30 percentage points on hard wide-baseline benchmarks. Used in modern SfM toolchains (for example hloc, which feeds learned matches into COLMAP) and several SLAM systems.

Cost: 5–10× slower than ORB; in practice needs a GPU to run in real time.
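For reference, the "ORB + brute-force" baseline that the learned matchers are measured against is nearest-neighbour search on Hamming distance. A minimal NumPy sketch, assuming descriptors are stored as packed uint8 rows (32 bytes each, as ORB's 256-bit descriptors are). The function names and the `max_dist` threshold are illustrative, not taken from any particular library:

```python
import numpy as np


def hamming_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise Hamming distances between packed binary descriptors.

    a: (N, B) uint8, b: (M, B) uint8 -> (N, M) distances in bits.
    """
    xor = a[:, None, :] ^ b[None, :, :]            # (N, M, B) differing bytes
    return np.unpackbits(xor, axis=2).sum(axis=2)  # count differing bits


def mutual_nn_match(a, b, max_dist=64):
    """Return (i, j) pairs that are each other's nearest neighbour."""
    d = hamming_matrix(a, b)
    nn_ab = d.argmin(axis=1)  # best match in b for every descriptor in a
    nn_ba = d.argmin(axis=0)  # best match in a for every descriptor in b
    return [(i, int(j)) for i, j in enumerate(nn_ab)
            if nn_ba[j] == i and d[i, j] <= max_dist]


# Synthetic check: shuffle the descriptors and flip one bit in each.
rng = np.random.default_rng(0)
desc_a = rng.integers(0, 256, size=(50, 32), dtype=np.uint8)
perm = rng.permutation(50)
desc_b = desc_a[perm].copy()
desc_b[:, 0] ^= 1
matches = mutual_nn_match(desc_a, desc_b)
```

Each descriptor is matched in isolation here. A learned matcher instead reasons over all keypoints in both images jointly, which is where the wide-baseline gains come from.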

LightGlue (2023)

SuperGlue-style matching with adaptive computation: easy pairs exit after a few layers, hard pairs get the full network. ~3× faster than SuperGlue with similar accuracy. The new "default learned matcher."
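The control flow behind adaptive computation can be sketched in a few lines. This is a toy illustration, not LightGlue's actual code: `layers` and `confidence` are hypothetical callables standing in for the network's layers and its per-keypoint confidence head, and both thresholds are made-up values.

```python
def adaptive_depth(state, layers, confidence,
                   exit_fraction=0.95, conf_threshold=0.9):
    """Run layers until enough keypoints are confident, then stop early."""
    depth = 0
    for depth, layer in enumerate(layers, start=1):
        state = layer(state)
        scores = confidence(state)
        confident = sum(s >= conf_threshold for s in scores) / len(scores)
        if confident >= exit_fraction:
            break  # easy pair: skip the remaining layers
    return state, depth


# Toy model: the state is a list of per-point certainties, and every layer
# moves each certainty halfway towards 1.0.
def toy_layer(state):
    return [s + 0.5 * (1.0 - s) for s in state]


toy_layers = [toy_layer] * 9
_, easy_depth = adaptive_depth([0.85] * 100, toy_layers, lambda s: s)
_, hard_depth = adaptive_depth([0.10] * 100, toy_layers, lambda s: s)
```

In this toy, the easy pair exits after one layer and the hard pair needs four. The saving comes from never running the layers that remain.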

Learned depth (MiDaS, Depth Anything, ZoeDepth)

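Single-image depth estimation. Useful as a prior for monocular SLAM (which is otherwise scale-ambiguous). MiDaS and Depth Anything predict relative depth by default, which constrains scene shape but has to be aligned to a metric reference; ZoeDepth predicts metric depth directly, which gives an approximate scale anchor. None of them replaces stereo or LiDAR.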
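One common way to use a relative depth map as a prior is to fit a scale and shift against a handful of depths you trust (triangulated landmarks, a few LiDAR returns). A minimal least-squares sketch follows. The function name is hypothetical, and whether the fit belongs in depth or inverse-depth space depends on what the model outputs, which this sketch leaves to the caller:

```python
import numpy as np


def fit_scale_shift(pred, ref):
    """Least-squares (s, t) such that s * pred + t approximates ref."""
    pred = np.asarray(pred, dtype=float)
    ref = np.asarray(ref, dtype=float)
    A = np.stack([pred, np.ones_like(pred)], axis=1)  # columns: [pred, 1]
    (s, t), *_ = np.linalg.lstsq(A, ref, rcond=None)
    return float(s), float(t)


# Synthetic example: the "network" output is off by scale 2.5 and shift 0.3.
rng = np.random.default_rng(0)
relative = rng.uniform(0.1, 1.0, size=40)
metric = 2.5 * relative + 0.3
scale, shift = fit_scale_shift(relative, metric)
```

With real data the reference depths are noisy and contain outliers, so a robust variant (RANSAC or a robust loss around the same fit) is usually the safer choice.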

2. End-to-end neural SLAM

DROID-SLAM (2021)

One model takes a video stream and outputs camera poses + dense depth maps. Internally: a recurrent network iteratively refines dense correspondences, and a differentiable bundle-adjustment layer turns them into pose and depth updates. Trained end-to-end on synthetic data.
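The core of any bundle-adjustment step, learned or classical, is a Gauss-Newton update on reprojection error. The sketch below is a deliberately reduced version: a pinhole camera with known rotation, solving only for translation, with an assumed focal length. It is not DROID-SLAM's layer, which solves for full poses and per-pixel depth. Because every operation is plain linear algebra, the same update can be written in an autodiff framework and differentiated through.

```python
import numpy as np


def project(points, f=500.0):
    """Pinhole projection of (N, 3) camera-frame points to (N, 2) pixels."""
    return f * points[:, :2] / points[:, 2:3]


def gauss_newton_translation(points, obs, t0, f=500.0, iters=10):
    """Estimate camera translation t so that project(points + t) matches obs."""
    t = np.asarray(t0, dtype=float).copy()
    for _ in range(iters):
        p = points + t
        r = (project(p, f) - obs).ravel()   # residuals [u0, v0, u1, v1, ...]
        x, y, z = p[:, 0], p[:, 1], p[:, 2]
        J = np.zeros((2 * len(p), 3))       # d(residual) / d(t)
        J[0::2, 0] = f / z
        J[0::2, 2] = -f * x / z**2
        J[1::2, 1] = f / z
        J[1::2, 2] = -f * y / z**2
        t -= np.linalg.solve(J.T @ J, J.T @ r)  # normal equations
    return t


rng = np.random.default_rng(0)
pts = rng.uniform([-2, -2, 4], [2, 2, 8], size=(20, 3))
t_true = np.array([0.3, -0.2, 0.5])
t_hat = gauss_newton_translation(pts, project(pts + t_true), np.zeros(3))
```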

Strengths: dense reconstruction; works in textureless scenes; good loop closure.

Weaknesses: heavy compute (RTX 3090+); struggles to hold real-time frame rates; not yet robust enough to replace ORB-SLAM3 in production.

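NICE-SLAM (2022), Co-SLAM (2023)

Dense RGB-D SLAM with a neural-field map. The map is a multi-resolution feature grid (NICE) or a hash grid (Co), and the camera is tracked by optimizing its pose against depth and color rendered from that map. Reconstruction is dense; NICE-SLAM runs well below camera frame rate, while Co-SLAM reports rates close to real time on a desktop GPU.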

3. Gaussian splatting for SLAM

3D Gaussian Splatting (Kerbl et al., 2023) represents a scene as up to millions of 3D Gaussians, each with position, covariance, color, and opacity. The parameters are fit by gradient descent on a multi-view photometric loss. The result is an explicit, differentiable representation that renders extremely fast (real-time at 30+ FPS) and looks photorealistic.
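Two pieces of that description are easy to make concrete. Each Gaussian's covariance is stored as a scale vector and a rotation quaternion and assembled as Sigma = R S S^T R^T, which keeps it positive semi-definite during optimization. A pixel's color is the front-to-back alpha blend of the depth-sorted splats covering it. A minimal NumPy sketch with illustrative function names, which leaves out the projection to screen space and the spherical-harmonic color model:

```python
import numpy as np


def quat_to_rot(q):
    """Rotation matrix from a (w, x, y, z) quaternion (normalized here)."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ])


def covariance(scale, quat):
    """Sigma = R S S^T R^T: symmetric and positive semi-definite by construction."""
    R = quat_to_rot(np.asarray(quat, dtype=float))
    S = np.diag(np.asarray(scale, dtype=float))
    return R @ S @ S.T @ R.T


def composite(colors, alphas):
    """Front-to-back alpha blending of depth-sorted splats for one pixel."""
    out = np.zeros(3)
    transmittance = 1.0  # fraction of light not yet absorbed
    for c, a in zip(colors, alphas):
        out += transmittance * a * np.asarray(c, dtype=float)
        transmittance *= 1.0 - a
    return out


sigma = covariance([0.1, 0.2, 0.3], [0.9, 0.1, 0.2, 0.3])
pixel = composite([[1, 0, 0], [0, 1, 0]], [0.5, 0.5])  # red in front of green
```

Here `pixel` comes out as half red plus a quarter green: the second splat only contributes through the 50% of light the first one lets through.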

SLAM systems using Gaussian splats:

  • SplaTAM (2024): tracks the camera by minimizing rendering loss against the current Gaussian map. Updates the map by adding new Gaussians from each frame.
  • MonoGS (2024): monocular Gaussian-splat SLAM; also accepts RGB-D input.
  • RTG-SLAM (2024): real-time variant; faster updates but lower quality.
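"Tracking by minimizing rendering loss" is easier to see in a toy setting. The sketch below is a 1D analogue and not SplaTAM's implementation: the map is three fixed 1D Gaussians, the camera pose is a single offset `t`, and the pose is recovered by Gauss-Newton on the photometric residual. The real systems optimize a full 6-DoF pose with a gradient-based optimizer through a differentiable rasterizer. All constants here are made up for illustration.

```python
import numpy as np

CENTERS = np.array([2.0, 5.0, 7.5])   # the "map": three 1D Gaussians
SIGMAS = np.array([0.8, 1.0, 0.6])
AMPS = np.array([1.0, 0.7, 0.9])
PIXELS = np.linspace(0.0, 10.0, 200)  # 1D image plane


def render(t):
    """Return (image, d_image/dt) for a camera offset t."""
    u = PIXELS[:, None] + t - CENTERS[None, :]  # (pixels, gaussians)
    g = AMPS * np.exp(-0.5 * (u / SIGMAS) ** 2)
    return g.sum(axis=1), (-u / SIGMAS**2 * g).sum(axis=1)


def track(observed, t0=0.0, iters=10):
    """Estimate the offset whose rendering best matches the observed image."""
    t = t0
    for _ in range(iters):
        image, jac = render(t)
        residual = image - observed
        t -= (jac @ residual) / (jac @ jac)  # 1D Gauss-Newton step
    return t


observed, _ = render(0.4)  # frame captured at the true offset 0.4
t_est = track(observed)    # start from the previous pose, t = 0
```

The same property that makes this work also limits it: the loss only has a useful gradient when the initial pose is close enough for the rendered and observed images to overlap, which is why these systems track frame to frame from the last known pose.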

What this buys: dense, photorealistic maps usable for visualization, novel-view synthesis, and downstream perception. The map IS the rendering.

What it doesn't buy yet: the speed of classical SLAM. It is memory-heavy, and it doesn't yet replace LiDAR odometry (LOAM and its descendants) for autonomous-driving-scale environments.

4. Foundation models in SLAM

Mostly exploratory work as of 2024–25. Examples:

  • VLM-driven place recognition: use a vision-language model to embed images; loop-closure detection via the embedding distance.
  • Diffusion priors for mapping: when the scene is partially observed, a diffusion model fills in plausible geometry.
  • LLM-driven semantic SLAM: caption regions of the map; query with natural language.
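The place-recognition idea in the first bullet reduces to a nearest-neighbour search over image embeddings. A minimal sketch, assuming some encoder has already produced one embedding per keyframe. The function name, the `min_gap` exclusion window, and the similarity threshold are all illustrative choices:

```python
import numpy as np


def loop_candidates(embeddings, query_idx, min_gap=30, threshold=0.9):
    """Earlier keyframes whose embedding is close to the query's.

    Frames within min_gap of the query are skipped so that temporal
    neighbours do not count as loop closures.
    """
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = e[: max(query_idx - min_gap, 0)] @ e[query_idx]  # cosine similarity
    hits = np.flatnonzero(sims >= threshold)
    order = hits[np.argsort(-sims[hits])]                   # best first
    return [(int(i), float(sims[i])) for i in order]


# Synthetic example: keyframe 90 revisits the place seen at keyframe 10.
rng = np.random.default_rng(1)
emb = rng.normal(size=(100, 64))
emb[90] = emb[10] + 0.01 * rng.normal(size=64)
candidates = loop_candidates(emb, 90)
```

A candidate found this way is only a hypothesis. Production systems still verify it geometrically (feature matching plus a pose check) before adding a loop-closure constraint.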

Production-ready in 2026? Mostly no. Promising on benchmarks; not yet robust enough to replace classical pipelines.

What classical methods still win on

  • Speed: ORB-SLAM3 / LIO-SAM run real-time on a single CPU. Most learned SLAM needs a GPU.
  • Memory: classical maps are kilobytes per square meter; Gaussian-splat maps are megabytes per square meter.
  • Robustness in known regimes: feature-based SLAM has well-characterized failure modes; neural SLAM can fail in surprising ways.
  • Auditability: when classical SLAM goes wrong, you can trace which feature mismatched. Neural SLAM gives you a black box.
  • Edge deployment: classical SLAM runs on a Jetson Nano; neural needs an Orin or better.
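The memory gap is straightforward arithmetic. The sketch below assumes the original 3DGS parameterization in float32 (position, scale, rotation quaternion, opacity, and degree-3 spherical-harmonic color) and, for the sparse map, a landmark stored as a 3D point plus one 32-byte ORB descriptor. The element counts in the example are illustrative, not measurements, and SLAM-oriented splat variants often store fewer parameters per Gaussian.

```python
# position (3) + scale (3) + quaternion (4) + opacity (1) + SH colour (16 x 3)
FLOATS_PER_GAUSSIAN = 3 + 3 + 4 + 1 + 48
BYTES_PER_FLOAT = 4
BYTES_PER_LANDMARK = 3 * BYTES_PER_FLOAT + 32  # xyz + one ORB descriptor


def splat_map_megabytes(num_gaussians):
    return num_gaussians * FLOATS_PER_GAUSSIAN * BYTES_PER_FLOAT / 1e6


def sparse_map_megabytes(num_landmarks):
    return num_landmarks * BYTES_PER_LANDMARK / 1e6


splat_mb = splat_map_megabytes(1_000_000)   # 236.0
sparse_mb = sparse_map_megabytes(100_000)   # 4.4
```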

What learned methods win on

  • Featureless / hard scenes: white walls, uniform ground, fog. Classical features fail; learned features find subtle patterns.
  • Wide-baseline matching: revisits from very different angles. SuperGlue/LightGlue beat ORB+brute-force handily.
  • Photorealistic rendering: Gaussian splats produce maps you can fly through visually. Classical SLAM produces a sparse point cloud.
  • Long-term changes: learned descriptors generalize better across day/night, summer/winter.

The production hybrid

Most 2026 production SLAM stacks combine:

  1. Classical front-end: ORB or LiDAR features, fast feature extraction.
  2. Learned matcher (LightGlue, fed the learned descriptors it was trained on, such as SuperPoint): when the scene is hard.
  3. Classical bundle adjustment: the math is well-understood; converges fast.
  4. Optional Gaussian-splat layer: for visualization or downstream tasks needing dense maps.
  5. Optional learned loop-closure: NetVLAD or recent vision-language embeddings.

Each component is the best tool for its specific role. Pure-classical and pure-neural systems both lose to thoughtful hybrids.
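The "learned matcher only when the scene is hard" policy can be as simple as a fallback on match count. A sketch with hypothetical names throughout: `fast` and `learned` are callables you would wrap around your own ORB matcher and your own SuperPoint + LightGlue pipeline, and `min_matches` is a tuning parameter, not a recommended value.

```python
from typing import Callable, List, Tuple

Match = Tuple[int, int]
Matcher = Callable[[object, object], List[Match]]


def match_with_fallback(frame_a, frame_b, fast: Matcher, learned: Matcher,
                        min_matches: int = 50):
    """Try the cheap matcher first; pay for the learned one only if it fails."""
    matches = fast(frame_a, frame_b)
    if len(matches) >= min_matches:
        return matches, "fast"
    return learned(frame_a, frame_b), "learned"


# Stub matchers standing in for the real front-ends.
def stub_fast(a, b):
    return [(i, i) for i in range(10 if "hard" in (a, b) else 200)]


def stub_learned(a, b):
    return [(i, i) for i in range(120)]


_, easy_route = match_with_fallback("easy", "easy", stub_fast, stub_learned)
_, hard_route = match_with_fallback("easy", "hard", stub_fast, stub_learned)
```

Counting inliers after geometric verification is a stricter trigger than counting raw matches, at the cost of running the verification twice on hard frames.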

The compute trajectory

Learned SLAM's compute requirement keeps dropping (better models, faster GPUs). 2018 SuperPoint needed a desktop GPU. 2026 LightGlue runs on Jetson Orin. The frontier of "what's deployable" expands every year.

By ~2028, end-to-end neural SLAM at edge-deployable rates is plausible. By then "classical SLAM" will be a heritage technology like Kalman filters — still useful, no longer the cutting edge.

Where to start

  • Run SuperPoint/LightGlue as a drop-in replacement in your existing visual SLAM pipeline. Compare accuracy on hard scenes.
  • Try SplaTAM on TUM RGB-D. Watch it produce photorealistic maps from RGB-D streams.
  • Read DROID-SLAM's paper. Understand why differentiable bundle adjustment is interesting (and what it costs).
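For the accuracy comparison in the first suggestion, the usual metric is absolute trajectory error (ATE): rigidly align the estimated positions to ground truth, then take the RMSE. A minimal NumPy sketch using Kabsch alignment, assuming the two trajectories are already time-associated (N, 3) arrays. Monocular systems additionally need a scale factor in the alignment, which this sketch omits.

```python
import numpy as np


def align_rigid(est, gt):
    """Rotation R and translation t minimizing ||R @ est_i + t - gt_i|| (Kabsch)."""
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    H = (est - mu_e).T @ (gt - mu_g)         # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, mu_g - R @ mu_e


def ate_rmse(est, gt):
    """Absolute trajectory error after rigid alignment."""
    R, t = align_rigid(est, gt)
    aligned = est @ R.T + t
    return float(np.sqrt(np.mean(np.sum((aligned - gt) ** 2, axis=1))))


# Synthetic check: the estimate is ground truth in a different frame, plus noise.
rng = np.random.default_rng(0)
gt_traj = np.cumsum(rng.normal(scale=0.1, size=(100, 3)), axis=0)
c, s = np.cos(0.5), np.sin(0.5)
R0 = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
est_clean = gt_traj @ R0.T + np.array([1.0, -2.0, 0.5])
est_noisy = est_clean + rng.normal(scale=0.02, size=est_clean.shape)
err_clean = ate_rmse(est_clean, gt_traj)
err_noisy = ate_rmse(est_noisy, gt_traj)
```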

Exercise

On a YouTube video of an indoor walk, run COLMAP (classical SfM) and compare with a Gaussian-splat reconstruction. The classical version produces a sparse point cloud that you can use for navigation. The splat version produces a 3D model you can fly through. Different outputs, different uses.

Next

GPS, RTK, and outdoor state estimation — when you don't need SLAM because the satellites can tell you where you are.
