GALA: Guided Attention with Language Alignment for Open Vocabulary Gaussian Splatting

* Equal contribution. ¹Technical University of Munich, ²Universitat Politècnica de Catalunya, ³Google, ⁴Munich Center for Machine Learning, ⁵University of Tübingen, ⁶ETH Zurich, ⁷Visualais

3DV 2025

Abstract

We introduce GALA, a novel framework for open-vocabulary 3D scene understanding with 3D Gaussian Splatting (3DGS). GALA distills a scene-specific 3D instance feature field via self-supervised contrastive learning. To extend this to generalized language feature fields, we introduce the core contribution of GALA: a cross-attention module with two learnable codebooks that encode view-independent semantic embeddings. This design ensures intra-instance feature similarity and supports seamless 2D and 3D open-vocabulary queries. It also reduces memory consumption by avoiding per-Gaussian high-dimensional feature learning. Extensive experiments on real-world datasets demonstrate GALA’s remarkable open-vocabulary performance in both 2D and 3D.
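The codebook idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the split into a key codebook and a value codebook, the scaled dot-product attention, the cosine-similarity query, and all names and dimensions (`codebook_cross_attention`, `relevancy`, 16-D instance features, 512-D language features, 64 codebook entries) are assumptions made for illustration only. The sketch shows how low-dimensional per-Gaussian instance features could attend over shared codebooks to produce language-aligned features, so that no high-dimensional feature has to be stored per Gaussian.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def codebook_cross_attention(instance_feats, key_codebook, value_codebook):
    """Map instance features (N, d) to language features (N, D).

    Hypothetical layout: key_codebook is (M, d), value_codebook is (M, D).
    Each Gaussian's instance feature acts as the attention query.
    """
    d = instance_feats.shape[-1]
    scores = instance_feats @ key_codebook.T / np.sqrt(d)  # (N, M)
    weights = softmax(scores, axis=-1)                     # rows sum to 1
    return weights @ value_codebook, weights               # (N, D), (N, M)

def relevancy(language_feats, text_embedding):
    # Cosine similarity between each language feature and one text embedding.
    a = language_feats / np.linalg.norm(language_feats, axis=-1, keepdims=True)
    b = text_embedding / np.linalg.norm(text_embedding)
    return a @ b

# Toy data standing in for learned parameters (all sizes are assumptions).
rng = np.random.default_rng(0)
num_gaussians, inst_dim, lang_dim, num_entries = 1000, 16, 512, 64
instance_feats = rng.normal(size=(num_gaussians, inst_dim))
instance_feats[1] = instance_feats[0]  # two Gaussians of the same instance
key_codebook = rng.normal(size=(num_entries, inst_dim))
value_codebook = rng.normal(size=(num_entries, lang_dim))

language_feats, weights = codebook_cross_attention(
    instance_feats, key_codebook, value_codebook
)

# A random vector stands in for a text encoder output.
text_embedding = rng.normal(size=lang_dim)
scores = relevancy(language_feats, text_embedding)

# Parameter count: compact features plus codebooks vs. dense per-Gaussian features.
compact_params = instance_feats.size + key_codebook.size + value_codebook.size
dense_params = num_gaussians * lang_dim
```

In this toy setup, Gaussians that share an instance feature receive identical language features, which mirrors the intra-instance similarity the abstract describes, and the compact parameter count stays well below the dense per-Gaussian alternative. Thresholding `scores` would give a simple 3D query mask under the same assumptions.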

GALA

GALA teaser

2D Open-Vocabulary Query

3D Open-Vocabulary Segmentation

BibTeX

@InProceedings{alegret2025gala,
  author    = {Elena Alegret* and Kunyi Li* and Sen Wang and Siyun Liang and Michael Niemeyer and Stefano Gasperini and Nassir Navab and Federico Tombari},
  title     = {GALA: Guided Attention with Language Alignment for Open Vocabulary Gaussian Splatting},
  booktitle = {International Conference on 3D Vision (3DV)},
  year      = {2026},
}