GALA: Guided Attention with Language Alignment for Open Vocabulary Gaussian Splatting

* Equal Contribution, ¹Technical University of Munich, ²Universitat Politecnica de Catalunya, ³Google, ⁴Munich Center for Machine Learning, ⁵University of Tubingen, ⁶ETH Zurich, ⁷Visualais

Abstract

We introduce GALA, a novel framework for open-vocabulary 3D scene understanding with 3D Gaussian Splatting (3DGS). GALA distills a scene-specific 3D instance feature field via self-supervised contrastive learning. To extend this to generalized language feature fields, we introduce the core contribution of GALA: a cross-attention module with two learnable codebooks that encode view-independent semantic embeddings. This design not only ensures intra-instance feature similarity but also supports seamless 2D and 3D open-vocabulary queries, and it reduces memory consumption by avoiding per-Gaussian high-dimensional feature learning. Extensive experiments on real-world datasets demonstrate GALA’s strong open-vocabulary performance in both 2D and 3D.
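The codebook idea in the abstract can be illustrated with a small NumPy sketch. This is only one plausible reading, not the authors' implementation: the abstract does not specify how the two codebooks are used, so the split into a key codebook and a value codebook, the function name `codebook_cross_attention`, and all sizes (`N`, `d`, `K`, `D`) are assumptions made for illustration. The sketch shows low-dimensional per-Gaussian instance features attending over a shared codebook to produce high-dimensional language features, and compares the number of stored values against storing a high-dimensional feature on every Gaussian.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def codebook_cross_attention(instance_feats, key_codebook, value_codebook):
    """Map low-dimensional instance features to language features.

    Hypothetical layout (an assumption, not taken from the paper):
      instance_feats: (N, d) one low-dimensional feature per Gaussian
      key_codebook:   (K, d) learnable keys matched against the queries
      value_codebook: (K, D) learnable language-space embeddings
    Returns the (N, D) language features and the (N, K) attention weights.
    """
    d = instance_feats.shape[-1]
    scores = instance_feats @ key_codebook.T / np.sqrt(d)  # (N, K)
    attn = softmax(scores, axis=-1)                        # each row sums to 1
    return attn @ value_codebook, attn                     # (N, D), (N, K)


# Illustrative sizes only; the real dimensions are not given in the abstract.
N, d, K, D = 1000, 16, 64, 512
rng = np.random.default_rng(0)
instance_feats = rng.normal(size=(N, d))
key_codebook = rng.normal(size=(K, d))
value_codebook = rng.normal(size=(K, D))

# Make Gaussians 0 and 1 share the same instance feature, as if they
# belonged to the same object.
instance_feats[1] = instance_feats[0]

lang_feats, attn = codebook_cross_attention(
    instance_feats, key_codebook, value_codebook
)

# Stored values: a D-dim feature on every Gaussian versus a d-dim feature
# per Gaussian plus the two shared codebooks.
per_gaussian_storage = N * D
codebook_storage = N * d + K * d + K * D
print(lang_feats.shape, per_gaussian_storage, codebook_storage)
```

Under these assumptions, two Gaussians with the same instance feature receive exactly the same language feature, because the codebooks are shared and do not depend on the viewpoint. The storage saving grows with the number of Gaussians, since only the low-dimensional feature is kept per Gaussian.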

BibTeX

@InProceedings{alegret2025gala,
  author    = {Elena Alegret* and Kunyi Li* and Sen Wang and Siyun Liang and Michael Niemeyer and Stefano Gasperini and Nassir Navab and Federico Tombari},
  title     = {GALA: Guided Attention with Language Alignment for Open Vocabulary Gaussian Splatting},
  booktitle = {International Conference on 3D Vision (3DV)},
  year      = {2026},
}

3DV 2025
