
Over the last years, indeed decades, spatial data have been growing steadily, ranging from geotagged photos and remote sensing data to texts such as Wikipedia articles. At the same time, machine learning is gaining importance in everyday life as well as in geoinformatics. This raises the question of whether and how these heterogeneous data, with their differing information content, can be used by machine learning methods. One existing approach is the idea of location encoding. After motivating the relevance of the topic, the pitch will present ways of combining different data sources within a single location encoder via self-supervised learning, and discuss whether these approaches are promising for the future.

Spatial data are characterised by intrinsic properties such as autocorrelation, scale dependence, and high heterogeneity, which require explicit consideration in the development of data-driven models. With the growing availability of spatial data sets, such as those derived from remote sensing, natural images, or textual sources with a spatial reference, not only the volume of data increases but also the variety of potentially combinable information. Manually labelling such data for specific tasks remains resource-intensive. Self-supervised learning offers a viable alternative: it enables suitable representations to be learned automatically from large volumes of unlabelled data and, much like a foundation model, provides a basis for adaptable, cross-domain downstream tasks.
Approaches such as SatCLIP, GeoCLIP, and CSP have demonstrated the contrastive coupling of two modalities, namely image data and geographic location, within a shared embedding space. Although these bimodal models reduce the need for labelled data, their representational capacity remains limited, as they capture only part of the information available about a location.
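The bimodal coupling used by these models can be summarised as a symmetric InfoNCE objective over co-located pairs. The following is a minimal sketch in the spirit of CLIP-style training; the function name, embedding dimensions, and temperature value are illustrative assumptions, not the original implementations.

```python
import torch
import torch.nn.functional as F

def bimodal_contrastive_loss(img_emb: torch.Tensor,
                             loc_emb: torch.Tensor,
                             temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss aligning image and location embeddings.

    img_emb, loc_emb: (batch, dim) embeddings of co-located pairs;
    row i of each tensor refers to the same location.
    """
    img_emb = F.normalize(img_emb, dim=-1)
    loc_emb = F.normalize(loc_emb, dim=-1)
    # Cosine-similarity logits between every image/location combination.
    logits = img_emb @ loc_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matching pairs sit on the diagonal; both directions are penalised.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```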
This study explores the integration of multiple modalities into a shared embedding space through contrastive learning, using location as the binding element. The modalities comprise texts from Wikipedia, multispectral Sentinel-2 imagery, and geolocations. To this end, three model architectures are developed: (1) an adapted version of ImageBind, (2) cyclic training with a fixed position encoder, and (3) a modified NT-Xent loss that fuses data pairs from different modalities.
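For variant (3), one plausible reading of an NT-Xent loss modified to fuse pairs across modalities is to pool embeddings from all encoders into a single batch and treat every pair that shares a location as a positive, as in supervised contrastive learning. The sketch below illustrates that reading under stated assumptions; all names are hypothetical and this is not the study's implementation.

```python
import torch
import torch.nn.functional as F

def multimodal_nt_xent(embeddings: torch.Tensor,
                       location_ids: torch.Tensor,
                       temperature: float = 0.1) -> torch.Tensor:
    """NT-Xent over a batch mixing text, image, and location embeddings.

    embeddings:   (n, dim) stacked embeddings from all modality encoders.
    location_ids: (n,) id of the geolocation each embedding belongs to;
                  embeddings with the same id count as positives.
    """
    z = F.normalize(embeddings, dim=-1)
    sim = z @ z.t() / temperature
    n = sim.size(0)
    eye = torch.eye(n, dtype=torch.bool, device=sim.device)
    sim = sim.masked_fill(eye, float('-inf'))  # exclude self-similarity
    # Positives: same location, any modality, excluding the anchor itself.
    pos_mask = (location_ids.unsqueeze(0) == location_ids.unsqueeze(1)) & ~eye
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # Average log-likelihood over all positives of each anchor that has one.
    pos_counts = pos_mask.sum(dim=1)
    valid = pos_counts > 0
    loss = -log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1)[valid]
    return (loss / pos_counts[valid]).mean()
```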
Preliminary training on the text dataset shows that the resulting representations exhibit properties distinct from those of image-focused models. The evaluation relies on classification- and regression-based downstream tasks in order to systematically analyse the representation quality, generalisation ability, and information content of multimodal spatial embeddings.
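A common way to run such downstream evaluations is a linear probe on frozen embeddings, sketched below. The train/test split, linear models, and metrics are illustrative assumptions, not the study's actual protocol.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.metrics import accuracy_score, r2_score
from sklearn.model_selection import train_test_split

def linear_probe(embeddings: np.ndarray, labels: np.ndarray,
                 task: str = "classification") -> float:
    """Fit a linear model on frozen embeddings and return a simple score."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        embeddings, labels, test_size=0.2, random_state=0)
    if task == "classification":
        model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        return accuracy_score(y_te, model.predict(X_te))
    # Regression-based downstream task, e.g. predicting a continuous
    # geographic attribute from the embedding.
    model = Ridge().fit(X_tr, y_tr)
    return r2_score(y_te, model.predict(X_te))
```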