We introduce GALA, a novel framework for open-vocabulary 3D scene understanding with 3D Gaussian Splatting (3DGS). GALA distills a scene-specific 3D instance feature field via self-supervised contrastive learning. To extend this to generalized language feature fields, we introduce the core contribution of GALA: a cross-attention module with two learnable codebooks that encode view-independent semantic embeddings. This design not only ensures intra-instance feature similarity but also supports seamless 2D and 3D open-vocabulary queries. It also reduces memory consumption by avoiding per-Gaussian high-dimensional feature learning. Extensive experiments on real-world datasets demonstrate GALA’s remarkable open-vocabulary performance in both 2D and 3D.
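To make the codebook idea concrete, the following is a minimal sketch of cross-attention from a low-dimensional per-Gaussian instance feature to a small shared codebook. It is not the authors' implementation. The abstract does not say how the two codebooks are used, so treating one as attention keys and the other as language-space values is an assumption, as are all names and sizes (`codebook_attention`, `INSTANCE_DIM`, `LANGUAGE_DIM`, `NUM_ENTRIES`). The random values stand in for parameters that would be learned.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def codebook_attention(query, keys, values):
    """Scaled dot-product attention from one query vector to a codebook.

    query:  low-dimensional instance feature of a single Gaussian
    keys:   codebook entries in the instance-feature space (assumed role)
    values: codebook entries in the language-feature space (assumed role)
    """
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    out_dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(out_dim)]
    return output, weights


# Hypothetical sizes, chosen only for illustration.
INSTANCE_DIM, LANGUAGE_DIM, NUM_ENTRIES = 8, 32, 4

random.seed(0)
key_codebook = [[random.gauss(0, 1) for _ in range(INSTANCE_DIM)]
                for _ in range(NUM_ENTRIES)]
value_codebook = [[random.gauss(0, 1) for _ in range(LANGUAGE_DIM)]
                  for _ in range(NUM_ENTRIES)]

# Each Gaussian stores only a short instance feature; the
# high-dimensional language feature is recovered through the codebooks.
gaussian_feature = [random.gauss(0, 1) for _ in range(INSTANCE_DIM)]
language_feature, weights = codebook_attention(
    gaussian_feature, key_codebook, value_codebook
)
```

In this sketch, Gaussians with identical instance features map to identical language features, which mirrors the intra-instance similarity described above. Per-Gaussian storage stays at `INSTANCE_DIM` values rather than `LANGUAGE_DIM`, since the high-dimensional vectors live only in the shared codebook.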
@InProceedings{alegret2025gala,
author = {Elena Alegret* and Kunyi Li* and Sen Wang and Siyun Liang and Michael Niemeyer and Stefano Gasperini and Nassir Navab and Federico Tombari},
title = {GALA: Guided Attention with Language Alignment for Open Vocabulary Gaussian Splatting},
booktitle = {International Conference on 3D Vision (3DV)},
year = {2026},
}