Describe Anything Anywhere At Any Moment
Nicolas Gorlo, Lukas Schmid, Luca Carlone

TL;DR
DAAAM is a real-time, large-scale 4D scene understanding framework that builds hierarchical scene graphs with detailed semantic descriptions to support spatio-temporal reasoning in computer vision and robotics.
Contribution
The paper presents DAAAM, a novel spatio-temporal memory system that combines an optimization-based frontend with hierarchical scene graphs for real-time, detailed 4D scene understanding.
Findings
Achieves state-of-the-art accuracy on NaVQA and SG3D benchmarks.
Improves question accuracy by 53.6% on OC-NaVQA.
Maintains real-time performance while providing detailed semantic descriptions.
Abstract
Computer vision and robotics applications ranging from augmented reality to robot autonomy in large-scale environments require spatio-temporal memory frameworks that capture both geometric structure for accurate language-grounding as well as semantic detail. Existing methods face a tradeoff, where producing rich open-vocabulary descriptions comes at the expense of real-time performance when these descriptions have to be grounded in 3D. To address these challenges, we propose Describe Anything, Anywhere, at Any Moment (DAAAM), a novel spatio-temporal memory framework for large-scale and real-time 4D scene understanding. DAAAM introduces a novel optimization-based frontend to infer detailed semantic descriptions from localized captioning models, such as the Describe Anything Model (DAM), leveraging batch processing to speed up inference by an order of magnitude for online processing. It…
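To make the idea of a spatio-temporal memory concrete, here is a minimal Python sketch of a hierarchical scene graph whose object nodes carry a text description and observation timestamps. It is an illustration only, not the DAAAM implementation: the `Node` and `SceneGraph` classes, the layer names, and the `objects_seen_between` query are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    """One node of a toy hierarchical scene graph (hypothetical layout)."""
    node_id: str
    layer: str                      # e.g. "room" or "object" (assumed layer names)
    description: str = ""           # open-vocabulary text attached to the node
    position: Optional[Tuple[float, float, float]] = None
    seen_at: List[float] = field(default_factory=list)   # observation times in seconds
    children: List[str] = field(default_factory=list)    # ids of child nodes


class SceneGraph:
    def __init__(self) -> None:
        self.nodes: Dict[str, Node] = {}

    def add(self, node: Node, parent: Optional[str] = None) -> None:
        """Insert a node and, if a parent id is given, link it as a child."""
        self.nodes[node.node_id] = node
        if parent is not None:
            self.nodes[parent].children.append(node.node_id)

    def objects_seen_between(self, t0: float, t1: float) -> List[Node]:
        """Return object nodes with at least one observation in [t0, t1]."""
        return [
            n for n in self.nodes.values()
            if n.layer == "object" and any(t0 <= t <= t1 for t in n.seen_at)
        ]


graph = SceneGraph()
graph.add(Node("room_1", "room", "a small kitchen"))
graph.add(Node("obj_1", "object", "red mug on the counter", (1.0, 2.0, 0.9), [12.5]),
          parent="room_1")
graph.add(Node("obj_2", "object", "blue chair by the window", (3.0, 0.5, 0.0), [40.0]),
          parent="room_1")

# Which described objects were observed between t = 10 s and t = 20 s?
hits = graph.objects_seen_between(10.0, 20.0)
print([n.description for n in hits])
```

Storing timestamps and positions next to each description is what lets a single structure answer both "where" and "when" questions; how DAAAM actually organizes and queries its graph is described in the paper itself.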
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Robotics and Sensor-Based Localization
