On Generalization across Measurement Systems: LLMs Entail More Test-Time Compute for Underrepresented Cultures

Minh Duc Bui; Kyung Eun Park; Goran Glavaš; Fabian David Schmidt; Katharina von der Wense

arXiv:2506.02591·cs.CL·June 4, 2025

TL;DR

This paper investigates how large language models handle different measurement systems across cultures, revealing they often default to dominant systems and require more computation to accurately reason about underrepresented ones, raising accessibility concerns.

Contribution

The study evaluates seven open-source LLMs on newly compiled datasets covering diverse measurement systems, and highlights the additional test-time compute needed when reasoning about systems used by underrepresented cultures.

Findings

01. LLMs default to the most common measurement system in training data.

02. Performance varies significantly across different measurement systems.

03. Reasoning methods like chain-of-thought increase test-time compute, especially for underrepresented systems.

Abstract

Measurement systems (e.g., currencies) differ across cultures, but the conversions between them are well defined so that humans can state facts using any measurement system of their choice. Being available to users from diverse cultural backgrounds, large language models (LLMs) should also be able to provide accurate information irrespective of the measurement system at hand. Using newly compiled datasets we test if this is the case for seven open-source LLMs, addressing three key research questions: (RQ1) What is the default system used by LLMs for each type of measurement? (RQ2) Do LLMs' answers and their accuracy vary across different measurement systems? (RQ3) Can LLMs mitigate potential challenges w.r.t. underrepresented systems via reasoning? Our findings show that LLMs default to the measurement system predominantly used in the data. Additionally, we observe considerable…
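The abstract's premise is that conversions between measurement systems are well defined, so the same fact stated in two systems can be compared mechanically. The following is a minimal sketch of such a consistency check, not the paper's evaluation code. The names `TO_SI`, `DIMENSION`, `convert`, and `answers_agree`, and the 1% tolerance, are assumptions made for illustration. Length and mass are used instead of currencies because their conversion factors are fixed by definition.

```python
import math

# Exact factors to SI base units (metre, kilogram), fixed by international definition.
# The set of units here is an illustrative assumption, not the paper's dataset.
TO_SI = {
    "km": 1000.0,
    "mile": 1609.344,
    "kg": 1.0,
    "lb": 0.45359237,
}
DIMENSION = {"km": "length", "mile": "length", "kg": "mass", "lb": "mass"}


def convert(value: float, src: str, dst: str) -> float:
    """Convert `value` from unit `src` to unit `dst` via the SI base unit."""
    if DIMENSION[src] != DIMENSION[dst]:
        raise ValueError(f"cannot convert {src} to {dst}: different dimensions")
    return value * TO_SI[src] / TO_SI[dst]


def answers_agree(value_a: float, unit_a: str,
                  value_b: float, unit_b: str,
                  rel_tol: float = 0.01) -> bool:
    """Return True if two answers describe the same quantity within `rel_tol`.

    The tolerance is a hypothetical choice to absorb rounding in stated facts.
    """
    value_b_in_a = convert(value_b, unit_b, unit_a)
    return math.isclose(value_a, value_b_in_a, rel_tol=rel_tol)


if __name__ == "__main__":
    # Hypothetical model answers to "How long is a marathon?" in two systems.
    print(answers_agree(42.195, "km", 26.219, "mile"))
    # A model that repeats the same number under a different unit is inconsistent.
    print(answers_agree(42.195, "km", 42.195, "mile"))
```

A check of this kind only tests whether answers given in different systems are mutually consistent; it says nothing about which system a model defaults to or how much reasoning it needed to produce each answer.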

Taxonomy

Topics: Neural Networks and Applications