Calibrating LLMs with Preference Optimization on Thought Trees for Generating Rationale in Science Question Scoring
Jiazheng Li, Hainiu Xu, Zhaoyue Sun, Yuxiang Zhou, David West, Cesare Aloisi, Yulan He

TL;DR
This paper introduces a novel framework that uses thought trees and preference optimization to generate faithful rationales for science question scoring, achieving performance comparable to black-box classifiers and improving assessment accuracy.
Contribution
It presents a new method combining thought trees and synthetic data for calibrating LLMs, enhancing rationale quality and scoring accuracy.
Findings
38% improvement in assessment performance (QWK score) compared to prior work
Higher-quality rationales verified by humans and LLMs
Effective use of synthetic preference data from thought trees
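QWK (quadratic weighted kappa) measures agreement between predicted and reference scores, penalising larger disagreements more heavily. A minimal pure-Python sketch follows, assuming integer labels 0..num_labels-1; the function name and label encoding are illustrative and not taken from the paper.

```python
def quadratic_weighted_kappa(y_true, y_pred, num_labels):
    """Quadratic weighted kappa for integer labels in [0, num_labels)."""
    if num_labels < 2:
        raise ValueError("num_labels must be at least 2")
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")

    n = num_labels
    total = len(y_true)

    # Observed confusion matrix: rows are reference scores, columns are predictions.
    observed = [[0] * n for _ in range(n)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1

    # Marginal histograms used to build the chance-agreement matrix.
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2
            expected = hist_true[i] * hist_pred[j] / total
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected

    # Both raters constant on the same label: treat as perfect agreement.
    if weighted_expected == 0:
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


perfect = quadratic_weighted_kappa([0, 1, 2, 3], [0, 1, 2, 3], num_labels=4)   # 1.0
opposite = quadratic_weighted_kappa([0, 0, 1, 1], [1, 1, 0, 0], num_labels=2)  # -1.0
```

A value of 1 indicates perfect agreement, 0 indicates chance-level agreement, and negative values indicate agreement worse than chance.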
Abstract
Generating rationales that justify scoring decisions has been a promising way to facilitate explainability in automated scoring systems. However, existing methods do not match the accuracy of classifier-based methods. Plus, the generated rationales often contain hallucinated information. To address these issues, we propose a novel framework capable of generating more faithful rationales and, more importantly, matching performance with classifier-based black-box scoring systems. We first mimic the human assessment process by querying Large Language Models (LLMs) to generate a thought tree. We then summarise intermediate assessment decisions from each thought tree path for creating synthetic rationale data and rationale preference data. Finally, we utilise the generated synthetic data to calibrate LLMs through a two-step training process: supervised fine-tuning and preference…
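The step of turning thought-tree paths into rationale preference data can be sketched as follows. The abstract states only that intermediate decisions along each path are summarised into synthetic rationales and preference data; the pairing rule used here (a rationale whose path score matches the reference score is preferred over one whose score does not) and all names (`TreePath`, `build_preference_pairs`, the `prompt`/`chosen`/`rejected` keys) are assumptions for illustration, not the authors' implementation.

```python
from dataclasses import dataclass
from itertools import product


@dataclass
class TreePath:
    """One root-to-leaf path of a thought tree (hypothetical structure)."""
    rationale: str  # summary of the intermediate assessment decisions
    score: int      # score the path arrives at


def build_preference_pairs(prompt, paths, gold_score):
    """Pair rationales that reach the reference score with those that do not.

    Assumption: agreement with the reference score decides which rationale
    is preferred; the paper may use a different criterion.
    """
    chosen = [p for p in paths if p.score == gold_score]
    rejected = [p for p in paths if p.score != gold_score]
    return [
        {"prompt": prompt, "chosen": c.rationale, "rejected": r.rationale}
        for c, r in product(chosen, rejected)
    ]


paths = [
    TreePath("Mentions both key elements; awards 2 points.", 2),
    TreePath("Misses the second key element; awards 1 point.", 1),
    TreePath("Identifies both elements with evidence; awards 2 points.", 2),
]
pairs = build_preference_pairs("Score this student answer.", paths, gold_score=2)
```

Under this assumption, the preferred rationales alone could serve as supervised fine-tuning targets, while the pairs feed the preference optimization step.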
Code & Models
- 🤗 jiazhengli/deberta-v3-large-Rationale-to-Score (model · 2 dl · ♡ 1)
- 🤗 jiazhengli/Meta-Llama-3-8B-QLoRA-Assessment-Rationale-sft (model · 2 dl)
- 🤗 jiazhengli/Meta-Llama-3-8B-QLoRA-Assessment-Rationale-dpo (model · 3 dl · ♡ 1)
- 🤗 jiazhengli/Mixtral-8x7B-Instruct-v0.1-QLoRA-Assessment-Rationale-sft (model · 2 dl)
- 🤗 jiazhengli/Mixtral-8x7B-Instruct-v0.1-QLoRA-Assessment-Rationale-dpo (model · 6 dl)
- 🤗 MachoMaheen/devdock4bit (model)
- 🤗 sicer/arc-agi-legacy (model)
- 🤗 JilinHu/llemma_7b_3epoch_r32_e5_RQ1 (model · 1 dl)
- 🤗 Xin-Rui/LLAMA-Fac-NEW-A800 (model · ♡ 1)
- 🤗 Linksome/lmf (model)
Taxonomy
Topics: Advanced Text Analysis Techniques · Educational Technology and Assessment
