Aligning Large Language Model Behavior with Human Citation Preferences

Kenichiro Ando; Tatsuya Harada

arXiv:2602.05205 · cs.CL · February 6, 2026

Open Access · 3 Reviews

TL;DR

This paper investigates which kinds of content large language models tend to cite, compares those tendencies with human citation preferences, and proposes calibration methods to bring model citation behavior closer to human behavior.

Contribution

It characterizes LLM citation behaviors across different content types and introduces calibration techniques to better align model citations with human preferences.

Findings

01 Humans most frequently seek citations for medical texts.

02 Models tend to overcite explicitly marked sources by up to 27%.

03 Models underselect sentences containing numbers and sentences containing personal names for citation.

Abstract

Most services built on powerful large-scale language models (LLMs) add citations to their output to enhance credibility. Recent research has paid increasing attention to the question of what reference documents to link to outputs. However, how LLMs recognize cite-worthiness and how this process should be controlled remains underexplored. In this study, we focus on what kinds of content LLMs currently tend to cite and how well that behavior aligns with human preferences. We construct a dataset to characterize the relationship between human citation preferences and LLM behavior. Web-derived texts are categorized into eight citation-motivation types, and pairwise citation preferences are exhaustively evaluated across all type combinations to capture fine-grained contrasts. Our results show that humans most frequently seek citations for medical text, and stronger models display a similar…
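The exhaustive pairwise design described in the abstract can be illustrated with a minimal sketch. The Python below enumerates every unordered pairing of eight types and computes a simple human-model agreement rate over pairwise choices. The type labels, the function names, the option to include same-type pairs, and the agreement metric are all assumptions made for illustration; the abstract excerpt does not specify them, and this is not the authors' implementation.

```python
from itertools import combinations

# Hypothetical placeholder labels: the excerpt does not name the eight types.
TYPES = [f"type_{i}" for i in range(1, 9)]


def type_pairs(types, include_same_type=False):
    """Return every unordered pair of types to compare.

    Whether same-type pairs are part of the study is an assumption,
    so it is exposed as a flag rather than hard-coded.
    """
    pairs = list(combinations(types, 2))
    if include_same_type:
        pairs += [(t, t) for t in types]
    return pairs


def agreement_rate(human_choices, model_choices):
    """Fraction of comparisons where the model picks the same sentence as the human.

    Each element is a label such as "A" or "B" naming the sentence judged
    to need a citation more in one pairwise comparison.
    """
    if len(human_choices) != len(model_choices):
        raise ValueError("choice lists must have the same length")
    if not human_choices:
        raise ValueError("at least one comparison is required")
    matches = sum(h == m for h, m in zip(human_choices, model_choices))
    return matches / len(human_choices)


if __name__ == "__main__":
    print(len(type_pairs(TYPES)))                          # cross-type pairs only
    print(len(type_pairs(TYPES, include_same_type=True)))  # plus same-type pairs
    print(agreement_rate(["A", "B", "A", "B"], ["A", "B", "B", "B"]))
```

With eight types there are 28 cross-type pairings, or 36 if same-type pairs are also compared. An agreement rate of this kind is only one possible way to summarize alignment between human and model choices.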

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

**1. Novel and Timely Research Direction** This paper focuses on an under-explored question: what content in LLM outputs deserves citations. Existing research centers on RAG retrieval or citation validation, not cite-worthiness itself. The closest prior work (CiteWorth ACL 2021, Redi et al. 2019) is limited to narrow domains and excludes LLM behavior. The paper's contributions fill this gap: (1) first use of preference learning for cite-worthiness; (2) cross-category comparison (8 types); (3) anal

Weaknesses

The paper claims to study 'alignment between LLM and human citation preferences,' but its dataset has critical limitations that undermine this core goal:

1. **Single-domain bias** All 6,000 sentences are sourced from Wikipedia, a specialized text type with unique editorial standards (e.g., prioritizing verifiability over practical utility). This differs fundamentally from ordinary users’ citation needs—for example, a user seeking 'insomnia medication advice' has distinct expectations vs. re

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. The paper is centered on the question “given two statements, which one most needs a citation?”, which isolates cite-worthiness as a preference judgment. This is a novel task and differs from most prior work on attribution, which tends to assume you already know a claim needs support and then focuses on finding or attaching the right source.
2. The dataset curation is thoughtful and high quality. The use of Wikipedia inline templates is a very clever design, and this is accompanied by a large scale human ann

Weaknesses

My main concern is how useful this alignment goal itself is. The supervision signal is strictly relative (“which of two sentences needs a citation more?”). Real assistants, however, must make absolute, independent decisions about each span (“does this claim require a citation at all?”). Because the dataset never captures ‘both’, ‘neither’, or graded severity, it’s unclear whether models trained on this signal will learn a properly calibrated trigger for citation in deployment. Therefore even the

Reviewer 03 · Rating 2 · Confidence 4

Strengths

- The paper clearly articulates the problem it aims to solve, namely the gap between LLM citation behavior and human "cite-worthiness" preferences, and outlines a straightforward methodology to address it.
- The conclusions are supported. The paper evaluates 11 different models to identify the problem broadly, then trains and optimizes 5 open-source models to test its proposed solution.
- A core contribution is the large, human-annotated preference dataset. The successful results from DPO (Direct Preferenc

Weaknesses

- The writing of the paper needs to be further crafted and polished. In lines 205 and 214, there are missing appendix references. Redundancy is another issue; for example, "fine-tuning harms alignment" is brought up repeatedly. I strongly suggest the authors organise the logical flow and go through the paper carefully.
- In the data collection phase, the description of the noise filtering process is insufficient. Additional information is required regarding "which noise filtering method was

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Topic Modeling · Text Readability and Simplification