TL;DR
WorldComp2D introduces a lightweight, explicitly structured latent space framework for efficient spatio-semantic reasoning, demonstrated through facial landmark localization with reduced computational costs.
Contribution
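WorldComp2D explicitly structures latent space geometry according to object identity and spatial proximity, rather than relying on implicit latent structures paired with dense feature maps or task-specific heads, which improves computational efficiency over such methods.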
Findings
Reduces parameter count by up to 4.0× and FLOPs by up to 2.2×.
Maintains real-time CPU performance.
Demonstrates effectiveness in facial landmark localization.
Abstract
Learning latent representations that capture both semantic and spatial information is central to efficient spatio-semantic reasoning. However, many existing approaches rely on implicit latent structures combined with dense feature maps or task-specific heads, limiting computational efficiency and flexibility. We propose WorldComp2D, a novel lightweight representation learning framework that explicitly structures latent space geometry according to object identity and spatial proximity using multiscale local receptive fields. This framework consists of (i) a proximity-dependent encoder that maps a given observation into a spatio-semantic latent space and (ii) a localizer that infers the coordinates of objects in the input from the resulting spatio-semantic representation. Using facial landmark localization as a proof-of-concept, we show that, compared to SoTA lightweight models,…
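To make the phrase "multiscale local receptive fields" concrete, here is a minimal pure-Python sketch of one way to read it: square windows of several sizes are cropped around the same location and each is summarized into one number. The function names, the radii, the zero padding, and the mean pooling are all illustrative assumptions, not details from the paper; in particular, the mean is only a stand-in for the learned proximity-dependent encoder, and no localizer is modeled.

```python
from typing import List, Sequence, Tuple

Image = List[List[float]]


def crop_receptive_field(image: Image, center: Tuple[int, int], radius: int) -> Image:
    """Return the (2*radius+1)-wide square window around `center`.

    Positions that fall outside the image are zero-padded (an assumption).
    """
    rows, cols = len(image), len(image[0])
    cy, cx = center
    window = []
    for y in range(cy - radius, cy + radius + 1):
        row = []
        for x in range(cx - radius, cx + radius + 1):
            inside = 0 <= y < rows and 0 <= x < cols
            row.append(image[y][x] if inside else 0.0)
        window.append(row)
    return window


def multiscale_receptive_fields(
    image: Image, center: Tuple[int, int], radii: Sequence[int] = (1, 2, 4)
) -> List[Image]:
    """Crop one window per radius, all centered on the same location."""
    return [crop_receptive_field(image, center, r) for r in radii]


def toy_descriptor(
    image: Image, center: Tuple[int, int], radii: Sequence[int] = (1, 2, 4)
) -> List[float]:
    """Mean-pool each window; a hypothetical placeholder for a learned encoder."""
    descriptor = []
    for window in multiscale_receptive_fields(image, center, radii):
        values = [v for row in window for v in row]
        descriptor.append(sum(values) / len(values))
    return descriptor


if __name__ == "__main__":
    # 5x5 image with a single bright pixel in the middle.
    image = [[0.0] * 5 for _ in range(5)]
    image[2][2] = 1.0
    print(toy_descriptor(image, (2, 2), radii=(1, 2)))
```

In this toy setup the descriptor value shrinks as the window grows, because the single bright pixel is averaged over more positions. That gives a rough intuition for how responses at several scales could carry proximity information, though the actual encoder may work quite differently.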
