RLearner-LLM: Balancing Logical Grounding and Fluency in Large Language Models via Hybrid Direct Preference Optimization

Qiming Bao; Juho Leinonen; Paul Denny; Michael J. Witbrock

arXiv:2605.04539·cs.CL·May 13, 2026

RLearner-LLM: Balancing Logical Grounding and Fluency in Large Language Models via Hybrid Direct Preference Optimization

Qiming Bao, Juho Leinonen, Paul Denny, Michael J. Witbrock

PDF

TL;DR

RLearner-LLM introduces a hybrid preference optimization method that improves logical correctness in large language models while maintaining fluency, reducing reliance on human annotations.

Contribution

It combines NLI signals with verifier scores in a new pipeline, overcoming the limitations of preference signals in knowledge-intensive tasks.

Findings

01

Up to 6x NLI improvement over SFT models.

02

Consistent answer-coverage gains across domains.

03

Faster inference with smaller models without losing alignment quality.

Abstract

Direct Preference Optimization (DPO), the efficient alternative to PPO-based RLHF, falls short on knowledge-intensive generation: standard preference signals from human annotators or LLM judges exhibit a systematic verbosity bias that rewards fluency over logical correctness. This blindspot leaves a logical alignment gap -- SFT models reach NLI entailment of only 0.05-0.22 despite producing fluent text. We propose RLearner-LLM with Hybrid-DPO: an automated preference pipeline that fuses a DeBERTa-v3 NLI signal with a verifier LLM score, removing human annotation while overcoming the "alignment tax" of single-signal optimization. Evaluated across five academic domains (Biology, Medicine, Law) with three base architectures (LLaMA-2-13B, Qwen3-8B, Gemma 4 E4B-it), RLearner-LLM yields up to 6x NLI improvement over SFT, with NLI gains in 11 of 15 cells and consistent answer-coverage gains.…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.