Sphinx: Benchmarking and Modeling for LLM-Driven Pull Request Review

Daoan Zhang; Shuo Zhang; Zijian Jin; Jiebo Luo; Shengyu Fu; Elsie Nallipogu

arXiv:2601.04252 · cs.SE · January 9, 2026

Open Access

TL;DR

Sphinx is a unified framework for LLM-based pull request review. It generates context-rich review comments, evaluates review quality with a checklist-based benchmark, and trains models with rule-based reward optimization to improve accuracy and coverage.

Contribution

The paper introduces Sphinx, a novel unified framework combining data generation, structured evaluation, and reward-based training for improved LLM-driven PR review.

Findings

01. Models trained with Sphinx achieve up to 40% higher checklist coverage.

02. Sphinx's evaluation benchmark moves beyond BLEU to assess review quality.

03. Sphinx-trained models reach state-of-the-art performance in review completeness and precision.

Abstract

Pull request (PR) review is essential for ensuring software quality, yet automating this task remains challenging due to noisy supervision, limited contextual understanding, and inadequate evaluation metrics. We present Sphinx, a unified framework for LLM-based PR review that addresses these limitations through three key components: (1) a structured data generation pipeline that produces context-rich, semantically grounded review comments by comparing pseudo-modified and merged code; (2) a checklist-based evaluation benchmark that assesses review quality based on structured coverage of actionable verification points, moving beyond surface-level metrics like BLEU; and (3) Checklist Reward Policy Optimization (CRPO), a novel training paradigm that uses rule-based, interpretable rewards to align model behavior with real-world review practices. Extensive experiments show that models trained…
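To make the idea of "structured coverage of actionable verification points" concrete, here is a minimal sketch of a checklist coverage score. It is an illustration under stated assumptions, not the paper's implementation: the `ChecklistItem` structure, the keyword-matching rule in `is_covered`, and the example checklist are all hypothetical, and the abstract does not say how Sphinx decides whether a review covers an item.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ChecklistItem:
    """One actionable verification point (hypothetical structure)."""
    description: str
    # Assumption: simple keyword matching stands in for whatever
    # matching procedure the actual benchmark uses.
    keywords: tuple[str, ...]


def is_covered(item: ChecklistItem, review: str) -> bool:
    """Treat an item as covered if any of its keywords appears in the review."""
    text = review.lower()
    return any(keyword.lower() in text for keyword in item.keywords)


def checklist_coverage(items: list[ChecklistItem], review: str) -> float:
    """Return the fraction of checklist items the review covers, in [0, 1]."""
    if not items:
        return 0.0
    covered = sum(is_covered(item, review) for item in items)
    return covered / len(items)


# Hypothetical checklist and review text, for illustration only.
checklist = [
    ChecklistItem("Null input is handled", ("null", "none check")),
    ChecklistItem("Unit test added for new branch", ("unit test", "test case")),
    ChecklistItem("Error message is logged", ("log",)),
]
review = (
    "Please add a None check before dereferencing, "
    "and a unit test for the new branch."
)
score = checklist_coverage(checklist, review)
print(f"coverage = {score:.2f}")
```

Unlike BLEU, a score of this shape does not depend on word overlap with a reference comment; it only asks which verification points the review addresses. A rule-based quantity like this is also the kind of interpretable signal that could serve as a training reward, though how CRPO actually defines its reward is not stated in the abstract.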

Taxonomy

Topics: Software Engineering Research · Software Engineering Techniques and Practices · Software Testing and Debugging Techniques