Kinship Data Benchmark for Multi-hop Reasoning
Tianda Sun, Dimitar Kazakov

TL;DR
KinshipQA is a new benchmark that evaluates large language models' ability to perform multi-hop reasoning over culturally specific genealogical data, revealing differences in reasoning skills across models and cultures.
Contribution
We introduce KinshipQA, a benchmark built on a generative pipeline that creates large-scale, culture-specific genealogical data for evaluating multi-hop reasoning in LLMs.
Findings
Models show varied performance across cultural contexts.
KinshipQA exposes systematic reasoning differences among models.
The benchmark enables controlled variation of task difficulty, cultural assumptions, and relational depth.
Abstract
Large language models (LLMs) are increasingly evaluated on their ability to perform multi-hop reasoning, i.e., to combine multiple pieces of information into a coherent inference. We introduce KinshipQA, a benchmark designed to probe this capability through reasoning over kinship relations. The central contribution of our work is a generative pipeline that produces, on demand, large-scale, realistic, and culture-specific genealogical data: collections of interconnected family trees that satisfy explicit marriage constraints associated with different kinship systems. This allows task difficulty, cultural assumptions, and relational depth to be systematically controlled and varied. From these genealogies, we derive textual inference tasks that require reasoning over implicit relational chains. We evaluate the resulting benchmark using six state-of-the-art LLMs, spanning both open-source…
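To make the idea of "reasoning over implicit relational chains" concrete, here is a minimal Python sketch of how a multi-hop kinship question might be derived from single-hop parent facts. It is an illustration only, not the authors' pipeline: the toy family, the `PARENTS` mapping, and the helpers `facts`, `follow`, and `make_question` are all hypothetical, and the culture-specific marriage constraints described in the abstract are not modeled.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy genealogy: child -> (father, mother). Names are illustrative only.
PARENTS: Dict[str, Tuple[str, str]] = {
    "Cara": ("Adam", "Beth"),
    "Dan": ("Adam", "Beth"),
    "Finn": ("Evan", "Cara"),
}


def facts(parents: Dict[str, Tuple[str, str]]) -> List[str]:
    """Render every single-hop parent link as a plain-text sentence."""
    sentences = []
    for child, (father, mother) in sorted(parents.items()):
        sentences.append(f"{father} is the father of {child}.")
        sentences.append(f"{mother} is the mother of {child}.")
    return sentences


def follow(person: str, path: List[str],
           parents: Dict[str, Tuple[str, str]]) -> Optional[str]:
    """Follow a chain of 'father'/'mother' hops; return None if a link is missing."""
    current = person
    for hop in path:
        pair = parents.get(current)
        if pair is None:
            return None
        current = pair[0] if hop == "father" else pair[1]
    return current


def make_question(person: str, path: List[str],
                  parents: Dict[str, Tuple[str, str]]) -> Optional[Tuple[str, str]]:
    """Build a (question, answer) pair whose answer needs len(path) hops."""
    answer = follow(person, path, parents)
    if answer is None:
        return None
    # The last hop taken is the outermost relation in the English phrasing.
    chain = " of the ".join(reversed(path))
    return f"Who is the {chain} of {person}?", answer


if __name__ == "__main__":
    print("\n".join(facts(PARENTS)))
    print(make_question("Finn", ["mother", "father"], PARENTS))
```

In this sketch the number of hops is simply the length of `path`, which is one plausible way to vary relational depth; the answer is never stated directly in the facts and has to be inferred by chaining them.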
Taxonomy
Topics: Topic Modeling · Language and cultural evolution · Advanced Graph Neural Networks
