MARPLE: A Benchmark for Long-Horizon Inference
Emily Jin, Zhuoyi Huang, Jan-Philipp Fränken, Weiyu Liu, Hannah Cha, Erik Brockbank, Sarah Wu, Ruohan Zhang, Jiajun Wu, Tobias Gerstenberg

TL;DR
MARPLE is a new benchmark designed to evaluate AI's ability to perform long-horizon, multimodal inference in complex, simulated environments, highlighting current models' limitations compared to human reasoning.
Contribution
The paper introduces MARPLE, a benchmark for testing long-horizon, multimodal inference in AI models, and analyzes both model performance and how models use the available evidence.
Findings
Humans outperform AI models in long-horizon inference tasks.
Traditional inference models are less robust than humans and perform worse.
All modes of evidence (visual, language, auditory) contribute to inference performance.
Abstract
Reconstructing past events requires reasoning across long time horizons. To figure out what happened, we need to use our prior knowledge about the world and human behavior and draw inferences from various sources of evidence including visual, language, and auditory cues. We introduce MARPLE, a benchmark for evaluating long-horizon inference capabilities using multi-modal evidence. Our benchmark features agents interacting with simulated households, supporting vision, language, and auditory stimuli, as well as procedurally generated environments and agent behaviors. Inspired by classic "whodunit" stories, we ask AI models and human participants to infer which agent caused a change in the environment based on a step-by-step replay of what actually happened. The goal is to correctly identify the culprit as early as possible. Our findings show that human participants outperform both…
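The task setup in the abstract (update a belief about the culprit as the replay unfolds, and identify the culprit as early as possible) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the function names, the two-suspect setup, the likelihood values, and the "earliest stable correct step" score are hypothetical and are not MARPLE's actual API, data, or evaluation metric.

```python
from typing import List, Optional, Sequence


def update_posterior(prior: Sequence[float], likelihoods: Sequence[float]) -> List[float]:
    """One Bayes step: weight each suspect's prior by the evidence likelihood, then normalize."""
    unnormalized = [p * l for p, l in zip(prior, likelihoods)]
    total = sum(unnormalized)
    if total == 0:
        raise ValueError("evidence has zero probability under every suspect")
    return [u / total for u in unnormalized]


def earliest_stable_correct_step(posteriors: Sequence[Sequence[float]], culprit: int) -> Optional[int]:
    """Return the first step from which the top-ranked suspect is the culprit
    at every remaining step, or None if the final guess is wrong."""
    earliest = None
    for step, dist in enumerate(posteriors):
        guess = max(range(len(dist)), key=dist.__getitem__)  # ties go to the lowest index
        if guess == culprit:
            if earliest is None:
                earliest = step
        else:
            earliest = None  # a wrong guess resets the streak
    return earliest


# Hypothetical per-step likelihoods of the observed evidence under suspects 0 and 1.
evidence = [(0.5, 0.5), (0.6, 0.4), (0.3, 0.7), (0.2, 0.8)]

belief = [0.5, 0.5]  # uniform prior over the two suspects
trajectory = []
for likelihoods in evidence:
    belief = update_posterior(belief, likelihoods)
    trajectory.append(belief)

print(earliest_stable_correct_step(trajectory, culprit=1))
```

Under this toy score, a lower step index means the culprit was identified earlier in the replay; a model that only settles on the right agent at the final step would score worse than one that commits correctly midway.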
Taxonomy
Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Generative Adversarial Networks and Image Synthesis
Methods: Attention Is All You Need · Dense Connections · Adam · Linear Layer · Residual Connection · Position-Wise Feed-Forward Layer · Label Smoothing · Dropout · Byte Pair Encoding · Absolute Position Encodings
