MARPLE: A Benchmark for Long-Horizon Inference

Emily Jin; Zhuoyi Huang; Jan-Philipp Fränken; Weiyu Liu; Hannah Cha; Erik Brockbank; Sarah Wu; Ruohan Zhang; Jiajun Wu; Tobias Gerstenberg

arXiv:2410.01926 · cs.LG · October 4, 2024

TL;DR

MARPLE is a new benchmark designed to evaluate AI's ability to perform long-horizon, multimodal inference in complex, simulated environments, highlighting current models' limitations compared to human reasoning.

Contribution

The paper introduces MARPLE, a benchmark for testing long-horizon, multimodal inference in AI, together with an analysis of model performance and of how models use the available evidence.

Findings

01 Humans outperform AI models on long-horizon inference tasks.

02 Traditional inference models are less robust than humans and perform worse.

03 All modes of evidence (visual, language, auditory) contribute to inference performance.

Abstract

Reconstructing past events requires reasoning across long time horizons. To figure out what happened, we need to use our prior knowledge about the world and human behavior and draw inferences from various sources of evidence including visual, language, and auditory cues. We introduce MARPLE, a benchmark for evaluating long-horizon inference capabilities using multi-modal evidence. Our benchmark features agents interacting with simulated households, supporting vision, language, and auditory stimuli, as well as procedurally generated environments and agent behaviors. Inspired by classic "whodunit" stories, we ask AI models and human participants to infer which agent caused a change in the environment based on a step-by-step replay of what actually happened. The goal is to correctly identify the culprit as early as possible. Our findings show that human participants outperform both…
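The stated goal, identifying the culprit "as early as possible", can be illustrated with a small scoring sketch. This is not the benchmark's own metric or code: the function name, the two-agent setup, the per-step probability input, and the 0.5 threshold are all assumptions made for illustration. It returns the first replay step from which a model's guess is correct and stays correct through the end.

```python
from typing import Optional, Sequence


def earliest_stable_correct_step(
    probs_agent_a: Sequence[float],
    culprit_is_a: bool,
    threshold: float = 0.5,
) -> Optional[int]:
    """Hypothetical score: first step from which the guess stays correct.

    probs_agent_a[t] is an assumed per-step probability that agent A is
    the culprit after seeing the replay up to step t. Returns None if the
    guess is wrong at the final step.
    """
    earliest: Optional[int] = None
    for step, p in enumerate(probs_agent_a):
        predicted_a = p > threshold
        if predicted_a == culprit_is_a:
            # Start (or continue) a run of correct guesses.
            if earliest is None:
                earliest = step
        else:
            # A wrong guess resets the run.
            earliest = None
    return earliest


# Illustrative input only; these numbers are made up.
example = [0.5, 0.4, 0.6, 0.7, 0.9]
print(earliest_stable_correct_step(example, culprit_is_a=True))
```

Under this sketch a lower step index means the culprit was identified earlier; a stricter variant could require a higher confidence threshold before counting a guess as correct.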

Code & Models

Repositories

marple-benchmark/marple · PyTorch · Official

Videos

MARPLE: A Benchmark for Long-Horizon Inference · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Generative Adversarial Networks and Image Synthesis

Methods: Attention Is All You Need · Dense Connections · Adam · Linear Layer · Residual Connection · Position-Wise Feed-Forward Layer · Label Smoothing · Dropout · Byte Pair Encoding · Absolute Position Encodings