# CLUTRR: A Diagnostic Benchmark for Inductive Reasoning from Text

**Authors:** Koustuv Sinha, Shagun Sodhani, Jin Dong, Joelle Pineau, William L. Hamilton

arXiv: 1908.06177 · 2019-09-05

## TL;DR

CLUTRR is a diagnostic benchmark that tests whether natural language understanding systems can generalize systematically and robustly when inferring kinship relations between characters in short stories. BERT and MAC fall well short of a graph neural network that works directly with symbolic inputs.

## Contribution

The paper introduces CLUTRR, a novel benchmark for assessing systematic generalization and robustness in NLU models through kinship reasoning tasks.

## Key findings

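- A graph neural network that works directly with symbolic inputs outperforms BERT and MAC in both generalization and robustness.
- State-of-the-art NLU models such as BERT and MAC show a substantial performance gap relative to the graph-based model.
- CLUTRR measures systematic generalization by evaluating on held-out combinations of logical rules, and measures robustness by adding curated noise facts.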

## Abstract

The recent success of natural language understanding (NLU) systems has been troubled by results highlighting the failure of these models to generalize in a systematic and robust way. In this work, we introduce a diagnostic benchmark suite, named CLUTRR, to clarify some key issues related to the robustness and systematicity of NLU systems. Motivated by classic work on inductive logic programming, CLUTRR requires that an NLU system infer kinship relations between characters in short stories. Successful performance on this task requires both extracting relationships between entities, as well as inferring the logical rules governing these relationships. CLUTRR allows us to precisely measure a model's ability for systematic generalization by evaluating on held-out combinations of logical rules, and it allows us to evaluate a model's robustness by adding curated noise facts. Our empirical results highlight a substantial performance gap between state-of-the-art NLU models (e.g., BERT and MAC) and a graph neural network model that works directly with symbolic inputs, with the graph-based model exhibiting both stronger generalization and greater robustness.
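To make the task concrete, the sketch below shows the kind of reasoning the abstract describes: composing kinship facts along a chain of characters while ignoring an irrelevant noise fact. It is a toy illustration under stated assumptions, not the authors' data generator or any of the evaluated models. The `RULES` table, the `(head, relation, tail)` fact encoding, the character names, and the helpers `find_path` and `infer_relation` are all hypothetical.

```python
# Facts are (head, relation, tail), read as "tail is the <relation> of head".
# Toy composition rules (an illustrative assumption, not the benchmark's rule set):
# (r1, r2) -> r means "the r2 of someone's r1 is that person's r".
RULES = {
    ("father", "father"): "grandfather",
    ("father", "mother"): "grandmother",
    ("mother", "father"): "grandfather",
    ("mother", "mother"): "grandmother",
    ("father", "brother"): "uncle",
    ("mother", "brother"): "uncle",
    ("father", "sister"): "aunt",
    ("mother", "sister"): "aunt",
}


def find_path(facts, source, target):
    """Return the relations along a directed path from source to target, or None."""
    graph = {}
    for head, rel, tail in facts:
        graph.setdefault(head, []).append((rel, tail))
    stack = [(source, [])]
    visited = {source}
    while stack:
        node, rels = stack.pop()
        if node == target:
            return rels
        for rel, nxt in graph.get(node, []):
            if nxt not in visited:
                visited.add(nxt)
                stack.append((nxt, rels + [rel]))
    return None


def infer_relation(facts, source, target):
    """Fold the composition rules along the path; None if no path or no rule applies."""
    rels = find_path(facts, source, target)
    if not rels:
        return None
    current = rels[0]
    for nxt in rels[1:]:
        current = RULES.get((current, nxt))
        if current is None:
            return None
    return current


# Story: Bob is Alice's father; Carol is Bob's mother.
story = [("Alice", "father", "Bob"), ("Bob", "mother", "Carol")]
# Noise fact: Dan is Bob's brother, which is irrelevant to the Alice/Carol query.
noise = [("Bob", "brother", "Dan")]

print(infer_relation(story + noise, "Alice", "Carol"))
```

In this framing, a held-out combination of rules would correspond to a test chain whose sequence of relations never appeared during training, and robustness would correspond to answering correctly when facts like the one in `noise` are added.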

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1908.06177/full.md

## Figures

14 figures with captions in the complete paper: https://tomesphere.com/paper/1908.06177/full.md

## References

42 references — full list in the complete paper: https://tomesphere.com/paper/1908.06177/full.md

---
Source: https://tomesphere.com/paper/1908.06177