DateLogicQA: Benchmarking Temporal Biases in Large Language Models
Gagan Bhatia, MingZe Tang, Cristina Mahanta, Madiha Kazi

TL;DR
DateLogicQA is a 190-question benchmark for evaluating how large language models reason about dates across diverse formats and temporal contexts, and for exposing the biases that affect that reasoning. It reveals key challenges in handling temporal data.
Contribution
The paper introduces DateLogicQA, a new benchmark for temporal reasoning, together with a Semantic Integrity Metric for assessing tokenization quality and an analysis of two temporal biases in LLMs: Representation-Level Bias and Logical-Level Bias.
Findings
LLMs exhibit significant biases at both the representation level (embeddings) and the logical level (reasoning outputs).
Temporal reasoning remains a key challenge for current LLMs.
The Semantic Integrity Metric effectively assesses tokenization quality.
Abstract
This paper introduces DateLogicQA, a benchmark with 190 questions covering diverse date formats, temporal contexts, and reasoning types. We propose the Semantic Integrity Metric to assess tokenization quality and analyse two biases: Representation-Level Bias, affecting embeddings, and Logical-Level Bias, influencing reasoning outputs. Our findings provide a comprehensive evaluation of LLMs' capabilities and limitations in temporal reasoning, highlighting key challenges in handling temporal data accurately.
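To make "diverse date formats" concrete, the sketch below renders a single calendar date in several common surface forms using Python's standard library. The format list and the `render_date` helper are assumptions chosen for illustration; they are not the benchmark's actual format set, and the code does not implement the Semantic Integrity Metric.

```python
from datetime import date

# Hypothetical set of surface formats, for illustration only.
# The benchmark's own format list is not reproduced here.
FORMATS = {
    "iso": "%Y-%m-%d",        # year-month-day with hyphens
    "us_slash": "%m/%d/%Y",   # month first
    "eu_slash": "%d/%m/%Y",   # day first
    "compact": "%Y%m%d",      # no separators
    "long": "%B %d, %Y",      # full month name (locale-dependent)
}


def render_date(d: date) -> dict[str, str]:
    """Return the same date rendered in each illustrative format."""
    return {name: d.strftime(fmt) for name, fmt in FORMATS.items()}


if __name__ == "__main__":
    for name, text in render_date(date(2023, 3, 4)).items():
        print(f"{name:>9}: {text}")
```

Every string produced refers to the same day, yet the month-first and day-first renderings differ only in field order, so a reader (or a model) without context cannot tell which convention is in use. This is the kind of variation a date-format benchmark has to cover.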
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Computational and Text Analysis Methods
