Leveraging Geospatial Representation Learning via Multimodal Contrastive Learning
3.3.1 Geoinformatics: how’s AI doing these days?

Thursday, Oct 9, 2025
10:10 AM - 10:30 AM | Europe/Berlin
INTERGEO Conference | Transparenz 2 (no translation)
English
About

Spatial data are characterised by intrinsic properties such as autocorrelation, scale dependence, and high heterogeneity. These characteristics require explicit consideration when developing data-driven models. With the growing availability of spatial data sets, such as those derived from remote sensing, natural images, or textual sources with a spatial reference, not only the volume but also the variety of potentially combinable information increases. Manually labelling such data for specific tasks remains resource-intensive. Self-supervised learning offers a viable alternative: it enables suitable representations to be generated automatically from large volumes of unlabelled data and, much like a foundation model, serves as a basis for adaptable, cross-domain downstream tasks.

Approaches such as SatCLIP, GeoCLIP, and CSP have demonstrated the contrastive coupling of two modalities, image data and geographic position, within a unified embedding space. Although these two-modality models reduce the need for labelling, their representational capacity remains constrained: they capture only part of the information available at a location.
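To make the idea of a position modality concrete, the sketch below maps raw (lat, lon) coordinates to a fixed sinusoidal feature vector. This is a deliberately simplified, non-learned stand-in for the learned positional encoders used by models like SatCLIP or GeoCLIP (which employ, e.g., spherical-harmonic or Fourier-feature encoders); the function name and frequency scheme here are illustrative assumptions, not taken from any of those models.

```python
import numpy as np

def fourier_loc_features(latlon, num_freqs=4):
    """Toy location encoder: map (lat, lon) pairs in degrees to a
    sinusoidal feature vector. A simplified, non-learned stand-in for
    the positional encoders in SatCLIP/GeoCLIP-style models."""
    coords = np.radians(np.asarray(latlon, dtype=float))  # shape (N, 2)
    freqs = 2.0 ** np.arange(num_freqs)                   # 1, 2, 4, 8
    ang = coords[:, :, None] * freqs                      # shape (N, 2, F)
    feats = np.concatenate([np.sin(ang), np.cos(ang)], axis=-1)
    return feats.reshape(len(coords), -1)                 # shape (N, 4*F)
```

A real position encoder would be trainable and globally well-behaved on the sphere; the point here is only that coordinates become a dense vector that can live in the same embedding space as image or text features.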

This study explores the integration of multiple modalities into a shared embedding space through contrastive learning, using position as the binding element. The modalities comprise texts from Wikipedia, multispectral Sentinel-2 imagery, and geolocations. To this end, three architectures are developed: (1) an adapted version of ImageBind, (2) cyclic training with a fixed position encoder, and (3) a modified NT-Xent loss that fuses data pairs from different modalities.
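For reference, the standard symmetric NT-Xent (normalised-temperature cross-entropy) objective between two batches of paired embeddings can be sketched as below; the talk's modification for fusing pairs across more than two modalities is not specified in the abstract, so this shows only the unmodified two-modality base case, with an assumed temperature of 0.1.

```python
import numpy as np

def nt_xent(z_a, z_b, tau=0.1):
    """Symmetric NT-Xent loss between two batches of embeddings of
    shape (N, d), e.g. location vs. image. Row i of z_a and row i of
    z_b form the positive pair; all other rows act as negatives."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / tau  # (N, N) temperature-scaled cosine sims

    def ce(l):
        # cross-entropy with the diagonal entries as the correct class
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_p = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_p))

    # average over both matching directions (a -> b and b -> a)
    return 0.5 * (ce(logits) + ce(logits.T))
```

Aligned pairs (identical or correlated embeddings on the diagonal) yield a lower loss than randomly matched batches, which is exactly the signal that pulls corresponding modalities together in the shared space.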

Preliminary training with the text dataset reveals that the generated representations exhibit properties distinct from those of image-focused models. The evaluation is based on classification- and regression-based downstream tasks, in order to systematically analyse the representation quality, generalisation ability, and information content of multimodal spatial embeddings.

Speakers

Jonathan Hecht

Master's Student

Moderators

Ribana Roscher

Professor of Data Science for Crop Systems