Scaling Unverifiable Rewards: A Case Study on Visual Insights
Shuyu Gan, James Mooney, Pan Hao, Renxiang Wang, Mingyi Hong, Qianwen Wang, Dongyeop Kang

TL;DR
This paper introduces Selective TTS, a process-based refinement framework for multi-stage LLM pipelines that improves output quality and stability when final rewards are unverifiable, demonstrated through visual data insights.
Contribution
We propose Selective TTS, a novel multi-stage refinement method that distributes compute and prunes low-quality branches to stabilize and enhance LLM-based insights.
Findings
Increased insight quality scores from 61.64 to 65.86.
Reduced variance in output quality.
Aligned judge model with human experts (Kendall's τ=0.55).
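For readers unfamiliar with the agreement statistic quoted above, the following is a minimal sketch of Kendall's τ in its simplest form (τ-a, which ignores ties). The summary does not say which variant the authors used or how ties were handled, so that choice is an assumption here. The `human` and `judge` rankings are made-up illustrative values, not data from the paper.

```python
from itertools import combinations


def kendall_tau_a(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1   # both raters order this pair the same way
        elif sign < 0:
            discordant += 1   # the raters disagree on this pair
        # sign == 0 means a tie; tau-a counts it as neither
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Illustrative rankings only (hypothetical, not from the paper).
human = [1, 2, 3, 4, 5]
judge = [2, 1, 3, 5, 4]
print(kendall_tau_a(human, judge))  # 0.6 (8 concordant, 2 discordant, 10 pairs)
```

A value of 1 means the two rankings agree on every pair, -1 means they disagree on every pair, and 0 means no net agreement.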
Abstract
Large Language Model (LLM) agents can increasingly automate complex reasoning through Test-Time Scaling (TTS), iterative refinement guided by reward signals. However, many real-world tasks involve multi-stage pipelines whose final outcomes lack verifiable rewards or sufficient data to train robust reward models, making judge-based refinement prone to accumulating error across stages. We propose Selective TTS, a process-based refinement framework that scales inference across the different stages of a multi-agent pipeline, rather than through repeated refinement over time as in prior work. By distributing compute across stages and pruning low-quality branches early using process-specific judges, Selective TTS mitigates judge drift and stabilizes refinement. Grounded in the data science pipeline, we build an end-to-end multi-agent pipeline for generating visually insightful charts and report of given…
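The idea of spreading compute across stages and pruning early with per-stage judges can be sketched roughly as below. This is an illustration under stated assumptions, not the authors' implementation: the `Stage` type, the `expand` and `judge` callables, and the "keep the top `keep` candidates" pruning rule are all hypothetical, since the abstract does not specify how branches are generated or selected.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Stage:
    """One pipeline stage (hypothetical structure, not from the paper)."""
    name: str
    expand: Callable[[str], List[str]]  # branch -> candidate outputs for this stage
    judge: Callable[[str], float]       # stage-specific judge; higher is better
    keep: int                           # number of branches that survive pruning


def selective_refine(seed: str, stages: List[Stage]) -> List[str]:
    """Expand every surviving branch at each stage, then prune by the stage judge."""
    branches = [seed]
    for stage in stages:
        # Spend compute at this stage: every surviving branch yields candidates.
        candidates = [c for b in branches for c in stage.expand(b)]
        # Prune early: only the best-scoring candidates reach the next stage.
        candidates.sort(key=stage.judge, reverse=True)
        branches = candidates[: stage.keep]
    return branches


# Toy stand-ins: real stages would call an LLM generator and an LLM judge.
def toy_expand(branch: str) -> List[str]:
    return [branch + "a", branch + "bb", branch + "ccc"]


stages = [
    Stage("draft", toy_expand, judge=len, keep=2),
    Stage("refine", toy_expand, judge=len, keep=1),
]
print(selective_refine("", stages))  # ['cccccc']
```

In this sketch the second stage only ever sees the two best drafts, so low-scoring branches stop consuming compute after the first judge. A threshold-based or sampling-based pruning rule would slot into the same loop.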
Taxonomy
Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Topic Modeling
