What's in the Box? Reasoning about Unseen Objects from Multimodal Cues

Lance Ying; Daniel Xu; Alicia Zhang; Katherine M. Collins; Max H. Siegel; Joshua B. Tenenbaum

arXiv:2506.14212 · cs.AI · June 18, 2025

TL;DR

This paper introduces a neurosymbolic model that uses neural networks to parse multimodal cues and Bayesian inference to reason about unseen objects. In a novel object guessing game, its judgments correlate strongly with those of human participants.

Contribution

The work presents a new neurosymbolic approach that effectively integrates multimodal information for reasoning about unseen objects, outperforming unimodal and baseline models.

Findings

1. Model correlates strongly with human judgments
2. Unimodal ablated models perform poorly
3. Large neural models show less accurate reasoning

Abstract

People regularly make inferences about objects in the world that they cannot see by flexibly integrating information from multiple sources: auditory and visual cues, language, and our prior beliefs and knowledge about the scene. How are we able to so flexibly integrate many sources of information to make sense of the world around us, even if we have no direct knowledge? In this work, we propose a neurosymbolic model that uses neural networks to parse open-ended multimodal inputs and then applies a Bayesian model to integrate different sources of information to evaluate different hypotheses. We evaluate our model with a novel object guessing game called "What's in the Box?" where humans and models watch a video clip of an experimenter shaking boxes and then try to guess the objects inside the boxes. Through a human experiment, we show that our model correlates strongly with human…
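
The Bayesian integration step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a small discrete set of hypotheses about the box contents and treats the audio and visual cues as conditionally independent given the hypothesis, which is a simplification chosen here for illustration. The function name `posterior`, the object names, and all probability values are hypothetical; in the paper's setting, the per-cue likelihoods would come from neural networks that parse the video and audio.

```python
import math


def posterior(prior, *cue_likelihoods):
    """Combine a prior with per-cue likelihoods via Bayes' rule.

    Assumes cues are conditionally independent given the hypothesis and
    that every probability is strictly positive (log(0) is undefined).
    """
    log_scores = {}
    for hypothesis, p in prior.items():
        score = math.log(p)
        for likelihood in cue_likelihoods:
            score += math.log(likelihood[hypothesis])
        log_scores[hypothesis] = score

    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_scores.values())
    unnormalized = {h: math.exp(s - top) for h, s in log_scores.items()}
    total = sum(unnormalized.values())
    return {h: v / total for h, v in unnormalized.items()}


# Hypothetical values: P(cue | object in the box) for each modality.
prior = {"coins": 1 / 3, "pillow": 1 / 3, "marbles": 1 / 3}
audio_likelihood = {"coins": 0.60, "pillow": 0.05, "marbles": 0.35}
visual_likelihood = {"coins": 0.30, "pillow": 0.20, "marbles": 0.50}

post = posterior(prior, audio_likelihood, visual_likelihood)
for hypothesis, p in sorted(post.items(), key=lambda kv: -kv[1]):
    print(f"{hypothesis}: {p:.3f}")
```

Passing only one likelihood table to `posterior` gives a unimodal variant of the same computation, which is one simple way to see how dropping a cue changes the ranking of hypotheses.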

Taxonomy

Topics: Natural Language Processing Techniques · Semantic Web and Ontologies · Speech and dialogue systems

Methods: Contrastive Language-Image Pre-training