About
Spatial data are characterised by intrinsic properties such as autocorrelation, scale dependence, and high heterogeneity, characteristics that require explicit consideration when developing data-driven models. With the growing availability of spatial data sets, such as those derived from remote sensing, natural images, or textual sources with a spatial reference, not only does the volume of data increase but so does the variety of potentially combinable information. Manual labelling of such data for specific tasks remains resource intensive. Self-supervised learning offers a viable approach in this context: it enables the automatic generation of suitable representations from large volumes of unlabelled data and, much like foundation models, serves as a basis for adaptable, cross-domain downstream tasks.
Approaches such as SatCLIP, GeoCLIP, and CSP have demonstrated the contrastive coupling of two modalities, namely image data and geographic position, within a unified embedding space. Although these bimodal models reduce the need for labelling, their representational capacity remains limited, as they capture only part of the information available at a location.
This study explores the integration of multiple modalities into a shared embedding space through contrastive learning, using position as the binding element. The modalities comprise texts from Wikipedia, multispectral Sentinel-2 imagery, and geolocations. To this end, three model architectures are developed: (1) an adapted version of ImageBind, (2) cyclic training with a fixed position encoder, and (3) a modified NT-Xent loss that fuses data pairs from different modalities; a loss sketch follows below.
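As an illustration of how position can act as the binding anchor, the following is a minimal sketch of a standard NT-Xent-style contrastive objective between location embeddings and another modality. It assumes a PyTorch setup; the encoder names and the way modality pairs are combined are illustrative placeholders, not the exact modified loss used in this study.

```python
# Minimal NT-Xent sketch binding a modality (e.g. Sentinel-2 imagery or
# Wikipedia text) to location embeddings. Encoder names are hypothetical.
import torch
import torch.nn.functional as F


def nt_xent(anchor: torch.Tensor, positive: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """NT-Xent loss between two aligned embedding batches of shape (N, D).

    Row i of `anchor` and row i of `positive` form a positive pair;
    all other rows in the batch serve as in-batch negatives.
    """
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    logits = anchor @ positive.T / temperature              # (N, N) cosine similarities
    targets = torch.arange(anchor.size(0), device=anchor.device)
    # Symmetric cross-entropy: anchor -> positive and positive -> anchor.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))


# Illustrative usage with hypothetical encoders; position is the shared anchor.
# loc_emb = position_encoder(coordinates)       # e.g. kept fixed, as in variant (2)
# img_emb = image_encoder(sentinel2_patches)
# txt_emb = text_encoder(wikipedia_tokens)
# loss = nt_xent(loc_emb, img_emb) + nt_xent(loc_emb, txt_emb)
```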
Preliminary training with the text dataset reveals that the resulting representations exhibit properties distinct from those of image-focused models. The evaluation relies on classification- and regression-based downstream tasks to systematically analyse the representation quality, generalisation ability, and information content of multimodal spatial embeddings.
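To make the evaluation protocol concrete, the sketch below fits simple linear probes (classification and regression) on frozen embeddings; the data arrays and metrics are placeholders and do not refer to the specific downstream datasets used in this study.

```python
# Illustrative linear-probe evaluation on frozen embeddings; `embeddings`,
# `labels`, and `targets` are placeholder arrays.
import numpy as np
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.metrics import accuracy_score, r2_score
from sklearn.model_selection import train_test_split


def probe_classification(embeddings: np.ndarray, labels: np.ndarray) -> float:
    """Accuracy of a logistic-regression probe on frozen embeddings."""
    X_tr, X_te, y_tr, y_te = train_test_split(embeddings, labels, test_size=0.2, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))


def probe_regression(embeddings: np.ndarray, targets: np.ndarray) -> float:
    """R^2 score of a ridge-regression probe on frozen embeddings."""
    X_tr, X_te, y_tr, y_te = train_test_split(embeddings, targets, test_size=0.2, random_state=0)
    reg = Ridge().fit(X_tr, y_tr)
    return r2_score(y_te, reg.predict(X_te))
```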

