Learning Object Semantic Similarity with Self-Supervision
Arthur Aubret, Timothy Schaumlöffel, Gemma Roig, Jochen Triesch

TL;DR
This paper presents a bio-inspired neural network model that learns semantic object relationships from visual and linguistic co-occurrence data, mirroring human-like understanding of object contexts and categories.
Contribution
It introduces a novel approach combining temporal and visuo-language alignment to learn semantic structures from raw visual and language input without supervision.
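As a rough illustration of how temporal alignment and visuo-language alignment might be combined into one training objective, the sketch below uses an InfoNCE-style contrastive loss in NumPy. The choice of InfoNCE, the equal weighting of the two terms, and the names `info_nce`, `combined_alignment_loss`, `z_t`, `z_next`, and `z_text` are all assumptions made for illustration; this summary does not state the loss the authors actually use.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE-style loss: row i of `anchors` should match row i of `positives`.

    All other rows in the batch serve as negatives. This is an assumed
    stand-in objective, not necessarily the one used in the paper.
    """
    a = l2_normalize(anchors)
    p = l2_normalize(positives)
    logits = a @ p.T / temperature                       # (N, N) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # subtract row max for stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                  # average over matching pairs


def combined_alignment_loss(z_t, z_next, z_text, temperature=0.1):
    """Sum of a temporal term and a visuo-language term (hypothetical weighting).

    z_t    : embeddings of the current visual input
    z_next : embeddings of the temporally following visual input
    z_text : embeddings of the co-occurring language input
    """
    temporal = info_nce(z_t, z_next, temperature)
    visuo_language = info_nce(z_t, z_text, temperature)
    return temporal + visuo_language
```

Under these assumptions, the loss is low when each visual embedding is closer to its temporal successor and to its paired language embedding than to the other items in the batch, which is one way co-occurrence statistics can shape a representation.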
Findings
The model clusters objects by context in its higher layers.
Its lower layers reflect object identity and category.
Temporal and visuo-language alignment are effective learning strategies.
Abstract
Humans judge the similarity of two objects not just based on their visual appearance but also based on their semantic relatedness. However, it remains unclear how humans learn about semantic relationships between objects and categories. One important source of semantic knowledge is that semantically related objects frequently co-occur in the same context. For instance, forks and plates are perceived as similar, at least in part, because they are often experienced together in a "kitchen" or "eating" context. Here, we investigate whether a bio-inspired learning principle exploiting such co-occurrence statistics suffices to learn a semantically structured object representation de novo from raw visual or combined visual and linguistic input. To this end, we simulate temporal sequences of visual experience by binding together short video clips of real-world scenes showing objects in…
Taxonomy
Topics: Topic Modeling · Intelligent Tutoring Systems and Adaptive Learning · Advanced Data Processing Techniques
