DateLogicQA: Benchmarking Temporal Biases in Large Language Models

Gagan Bhatia; MingZe Tang; Cristina Mahanta; Madiha Kazi

arXiv:2412.13377 · cs.CL · May 20, 2025

TL;DR

DateLogicQA is a comprehensive benchmark designed to evaluate large language models' abilities and biases in temporal reasoning across diverse date formats and contexts, revealing key challenges in handling temporal data.

Contribution

The paper introduces DateLogicQA, a new benchmark with a novel Semantic Integrity Metric and analysis of temporal biases in LLMs, advancing evaluation methods in temporal reasoning.

Findings

1. LLMs exhibit significant biases at representation and logical levels.
2. Temporal reasoning remains a key challenge for current LLMs.
3. The Semantic Integrity Metric effectively assesses tokenization quality.

Abstract

This paper introduces DateLogicQA, a benchmark with 190 questions covering diverse date formats, temporal contexts, and reasoning types. We propose the Semantic Integrity Metric to assess tokenization quality and analyse two biases: Representation-Level Bias, affecting embeddings, and Logical-Level Bias, influencing reasoning outputs. Our findings provide a comprehensive evaluation of LLMs' capabilities and limitations in temporal reasoning, highlighting key challenges in handling temporal data accurately.
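To make "diverse date formats" concrete, here is a minimal sketch of how one calendar date can be written in several surface forms. The format list, the function names, and the ambiguity check are assumptions chosen for illustration; they are not taken from the benchmark and do not implement the Semantic Integrity Metric, which the abstract names but does not define.

```python
from datetime import date
from typing import Dict

# Hypothetical format list for illustration only; not the benchmark's actual set.
FORMATS: Dict[str, str] = {
    "iso": "%Y-%m-%d",
    "us_slash": "%m/%d/%Y",
    "eu_slash": "%d/%m/%Y",
    "compact": "%Y%m%d",
}


def render_variants(d: date) -> Dict[str, str]:
    """Render one calendar date in every format listed in FORMATS."""
    return {name: d.strftime(fmt) for name, fmt in FORMATS.items()}


def is_slash_ambiguous(d: date) -> bool:
    """Return True when the slash form can be read as two different dates.

    That happens when the day is 12 or less (so it could be a month)
    and the day differs from the month (so the two readings differ).
    """
    return d.day <= 12 and d.day != d.month


variants = render_variants(date(2025, 5, 20))
for name, text in variants.items():
    print(f"{name:>9}: {text}")
```

A model that reasons correctly about one of these strings should, in principle, give the same answer for the others, and that is the kind of consistency a date-format benchmark can probe. Dates such as 4 March 2025 are the harder case, because `04/03/2025` and `03/04/2025` are both valid readings.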

Code & Models

Repositories

gagan3012/eais-temporal-bias (official)

Videos

DateLogicQA: Benchmarking Temporal Biases in Large Language Models (Underline)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Computational and Text Analysis Methods