# The Rarity Blind Spot: A Framework for Evaluating Statistical Reasoning in LLMs

**Authors:** Seiji Maekawa, Hayate Iso, Nikita Bhutani

arXiv: 2509.00245 · 2025-10-02

## TL;DR

This paper introduces a task, Distinctive Feature Mining (DFM), and a benchmark framework, DiFBench, for evaluating whether large language models can identify features that are rare across a document collection. Results on ten LLMs expose limitations in their statistical reasoning.

## Contribution

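The paper defines the Distinctive Feature Mining (DFM) task and presents DiFBench, a configurable benchmark creation framework with controllable parameters such as document set size and distinctiveness threshold, enabling systematic evaluation of LLMs' rarity detection and statistical reasoning.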

## Key findings

- There is a significant performance gap between general-purpose and reasoning-enhanced models across the ten LLMs evaluated.
- The performance of all models degrades substantially as task complexity and document count increase.
- A common failure mode is misidentifying frequent features as distinctive.

## Abstract

Effective decision-making often relies on identifying what makes each candidate distinctive. While existing benchmarks for LLMs emphasize retrieving or summarizing information relevant to a given query, they do not evaluate a model's ability to identify globally distinctive features across a set of documents. We introduce Distinctive Feature Mining (DFM), a new task that challenges models to analyze a small-to-medium collection (10-40 documents) and surface features that are rare in the global context (e.g., appearing in less than 10% of documents). This setting mirrors real-world scenarios such as candidate selection or product differentiation, where statistical reasoning, not retrieval, is key. To enable systematic evaluation of this capability, we present DiFBench, a configurable benchmark creation framework with controllable parameters such as document set size and distinctiveness thresholds. Using DiFBench, we perform a large-scale assessment of distinctive feature mining across ten state-of-the-art LLMs. Our findings reveal a significant performance gap between general-purpose and reasoning-enhanced models. All models, however, substantially degrade as the task complexity and document count increase. We also find that a common failure mode is misidentifying frequent features as distinctive. These insights reveal core limitations in contemporary LLMs' abilities to perform fine-grained, statistical reasoning and rarity detection.
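The rarity criterion described in the abstract (a feature counts as distinctive when it appears in less than a threshold fraction of the documents, e.g. 10%) can be illustrated with a minimal sketch. This is not the paper's implementation or the DiFBench pipeline: it assumes each document has already been reduced to a set of discrete feature labels, and the function names `document_frequencies` and `distinctive_features` are hypothetical.

```python
from collections import Counter


def document_frequencies(docs):
    """Count, for each feature, how many documents contain it."""
    counts = Counter()
    for features in docs:
        counts.update(set(features))  # set() so a feature counts once per document
    return counts


def distinctive_features(docs, threshold=0.10):
    """Return, per document, the features whose document frequency is below threshold.

    Assumption: `docs` is a list of feature collections, one per document,
    and `threshold` is a fraction of the collection size (0.10 = 10%).
    """
    n = len(docs)
    if n == 0:
        return []
    counts = document_frequencies(docs)
    rare = {f for f, c in counts.items() if c / n < threshold}
    return [sorted(set(features) & rare) for features in docs]


# Toy collection of 20 documents, each already reduced to a feature set.
docs = [{"python"} for _ in range(20)]
docs[0] |= {"rust", "go"}  # "rust" appears in 1 of 20 documents
docs[1] |= {"go"}          # "go" appears in 2 of 20 documents

result = distinctive_features(docs)
print(result[0])  # ['rust']
```

In this toy collection, "go" appears in exactly 10% of the documents and so does not pass the strict "less than 10%" cutoff, while "python" appears everywhere. The hard part for an LLM is that it must estimate these global counts implicitly from raw text, which is where reporting a frequent feature as distinctive can occur.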

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2509.00245/full.md

## Figures

17 figures with captions in the complete paper: https://tomesphere.com/paper/2509.00245/full.md

## References

31 references — full list in the complete paper: https://tomesphere.com/paper/2509.00245/full.md

---
Source: https://tomesphere.com/paper/2509.00245