Learning to Choose or Choosing to Learn: Best-of-N vs. Supervised Fine-Tuning for Bit String Generation
Seamus Somerstep, Vinod Raman, Unique Subedi, Yuekai Sun

TL;DR
This paper compares supervised fine-tuning and Best-of-N (BoN) for adapting large language models to bit string generation, analyzing how their theoretical rates of convergence differ depending on whether the learning setting is realizable.
Contribution
It provides a theoretical comparison of two standard adaptation methods, revealing conditions where each method outperforms the other.
Findings
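When the learning setting is realizable, supervised fine-tuning outperforms BoN: its rate of convergence has a better dependence on the response length.
When realizability fails, BoN can come out ahead. Depending on the failure mode, it enjoys either a better rate of convergence in n or a rate with better dependence on the response length.
In both regimes, the comparison turns on how each method's rate of convergence depends on the response length.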
Abstract
Using the bit string generation problem as a case study, we theoretically compare two standard methods for adapting large language models to new tasks. The first, referred to as supervised fine-tuning, involves training a new next token predictor on good generations. The second method, Best-of-N, trains a reward model to select good responses from a collection generated by an unaltered base model. If the learning setting is realizable, we find that supervised fine-tuning outperforms BoN through a better dependence on the response length in its rate of convergence. If realizability fails, then depending on the failure mode, BoN can enjoy a better rate of convergence in either n or a rate of convergence with better dependence on the response length.
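To make the Best-of-N procedure described in the abstract concrete, here is a minimal Python sketch. It is an illustration only, not the paper's construction: the uniform base model, the ones-counting reward (standing in for a trained reward model), and all function and parameter names are assumptions chosen for the example.

```python
import random


def base_model_sample(length, rng):
    """Hypothetical base model: emits each bit uniformly at random."""
    return [rng.randint(0, 1) for _ in range(length)]


def toy_reward(bits):
    """Hypothetical stand-in for a learned reward model: counts the ones."""
    return sum(bits)


def best_of_n(num_candidates, length, reward, rng):
    """Draw candidates from the unaltered base model and keep the top-scoring one."""
    candidates = [base_model_sample(length, rng) for _ in range(num_candidates)]
    return max(candidates, key=reward)


rng = random.Random(0)
response = best_of_n(num_candidates=16, length=8, reward=toy_reward, rng=rng)
print(response, toy_reward(response))
```

The base model is never updated here; only the selection step uses the reward. Supervised fine-tuning would instead change the sampler itself by training a new next-token predictor on good generations.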
