Towards Learning Cross-Modal Perception-Trace Models
Achim Rettinger, Viktoria Bogdanova, Philipp Niemann

TL;DR
This paper investigates human perception in multi-modal documents, develops a perception-trace model inspired by eye tracking data, and demonstrates its potential to enhance embedding quality across modalities.
Contribution
It introduces CMPM, a novel perception-trace model based on human eye tracking data, to improve multi-modal embeddings beyond traditional heuristics.
Findings
Perception-based models capture multi-modality and layout information.
CMPM improves basic skip-gram embeddings.
Human-inspired perception models have high potential for embedding enhancement.
Abstract
Representation learning is a key element of state-of-the-art deep learning approaches. It enables raw data to be transformed into structured vector space embeddings. Such embeddings are able to capture the distributional semantics of their context, e.g. through word windows on natural language sentences, graph walks on knowledge graphs or convolutions on images. So far, this context is manually defined, resulting in heuristics which are solely optimized for computational performance on certain tasks like link-prediction. However, such heuristic models of context are fundamentally different from how humans capture information. For instance, when reading a multi-modal webpage (i) humans do not perceive all parts of a document equally: some words and parts of images are skipped, others are revisited several times, which makes the perception trace highly non-sequential; (ii) humans construct meaning…
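The contrast the abstract draws between a manually defined word window and a non-sequential perception trace can be illustrated with a minimal sketch. This is not the CMPM model from the paper: the function names `window_pairs` and `trace_pairs`, the example sentence, and the fixation order in `trace` are all hypothetical, chosen only to show how skip-gram training pairs change when context follows viewing order instead of text order.

```python
from typing import List, Tuple


def window_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Build skip-gram (center, context) pairs from a fixed symmetric window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def trace_pairs(
    tokens: List[str], trace: List[int], window: int = 2
) -> List[Tuple[str, str]]:
    """Build pairs over the order in which items were fixated.

    `trace` is a hypothetical list of indices into `tokens`; items may be
    skipped entirely or revisited, so the sequence is non-sequential.
    """
    viewed = [tokens[k] for k in trace]
    return window_pairs(viewed, window)


tokens = ["the", "red", "car", "drives", "fast"]
# Assumed fixation order: "car", "the", "car", "fast" ("red", "drives" skipped).
trace = [2, 0, 2, 4]

sequential = window_pairs(tokens, window=1)
perceived = trace_pairs(tokens, trace, window=1)
```

With a window of 1, the sequential context never pairs "car" with "fast", while the trace-based context does, and skipped words such as "red" contribute no pairs at all. A learned perception-trace model would have to predict such a trace instead of taking it as given.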
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Advanced Graph Neural Networks
