The Chicken and Egg Dilemma: Co-optimizing Data and Model Configurations for LLMs
Zhiliang Chen, Alfred Wei Lun Leong, Shao Yong Ong, Apivich Hemachandra, Gregory Kang Ruey Lau, Chuan-Sheng Foo, Zhengyuan Liu, Nancy F. Chen, Bryan Kian Hsiang Low

TL;DR
This paper introduces JoBS, a method that co-optimizes data and model configurations for training large language models by combining a scaling-law-inspired performance predictor with Bayesian optimization, and reports better results than existing multi-fidelity BO baselines.
Contribution
JoBS is the first approach to jointly optimize data and model configurations for LLMs using a scaling-law-inspired predictor within a Bayesian optimization framework.
Findings
JoBS outperforms existing multi-fidelity BO baselines.
JoBS achieves better optimization results across diverse LLM tasks.
The method effectively allocates budget between predictor learning and optimization.
Abstract
Co-optimizing data and model configurations for training LLMs presents a classic chicken-and-egg dilemma: The best training data configuration (e.g., data mixture) for a downstream task depends on the chosen model configuration (e.g., model architecture), and vice versa. However, jointly optimizing both data and model configurations is often deemed intractable, and existing methods focus on either data or model optimization without considering their interaction. We introduce JoBS, an approach that uses a scaling-law-inspired performance predictor to aid Bayesian optimization (BO) in jointly optimizing LLM training data and model configurations efficiently. JoBS allocates a portion of the optimization budget to learn an LLM performance predictor that predicts how promising a training configuration is from a small number of training steps. The remaining budget is used to perform BO…
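The two-phase budget split described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the authors' implementation: the toy objective `final_loss`, the simulated learning curve in `train`, the step counts `T_FULL` and `T_SHORT`, and the function names are all hypothetical. The sketch also swaps in a simple power-law curve fit for the paper's learned predictor, and a plain RBF-kernel Gaussian process with a lower-confidence-bound rule for the paper's BO machinery.

```python
import numpy as np

rng = np.random.default_rng(0)
T_FULL, T_SHORT = 1000, 50  # hypothetical full / short training lengths (steps)

def final_loss(x):
    """Toy stand-in for converged loss; x = (data mixture ratio, scaled model setting)."""
    mix, rank = x
    best_mix = 0.3 + 0.4 * rank  # the best mixture shifts with the model configuration
    return 1.0 + (mix - best_mix) ** 2 + 0.5 * (rank - 0.6) ** 2

def train(x, steps):
    """Simulated learning curve L(t) = L_inf + 2 * t^-0.4 plus small noise."""
    t = np.arange(1, steps + 1, dtype=float)
    return final_loss(x) + 2.0 * t ** -0.4 + rng.normal(0.0, 0.005, steps)

def extrapolate(curve, b, horizon):
    """Least-squares fit of L(t) = L_inf + a * t^-b on a prefix, evaluated at `horizon`."""
    t = np.arange(1, len(curve) + 1, dtype=float)
    A = np.stack([np.ones_like(t), t ** -b], axis=1)
    (l_inf, a), *_ = np.linalg.lstsq(A, curve, rcond=None)
    return l_inf + a * horizon ** -b

def fit_predictor(n_full):
    """Phase 1: spend part of the budget on full runs to choose the exponent b."""
    curves = [train(x, T_FULL) for x in rng.uniform(0, 1, size=(n_full, 2))]
    grid = np.linspace(0.1, 1.0, 19)
    errs = [np.mean([(extrapolate(c[:T_SHORT], b, T_FULL) - c[-1]) ** 2 for c in curves])
            for b in grid]
    return float(grid[int(np.argmin(errs))])

def rbf(A, B, ls=0.3):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls ** 2)

def gp_posterior(X, y, Xq, noise=1e-2):
    """GP posterior on standardized targets; `noise` absorbs predictor error."""
    mu0, s = y.mean(), y.std() + 1e-9
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(Xq, X)
    mean = Ks @ np.linalg.solve(K, (y - mu0) / s)
    var = 1.0 - np.einsum("ij,ji->i", Ks, np.linalg.solve(K, Ks.T))
    return mean, np.sqrt(np.clip(var, 1e-12, None))

def jobs_sketch(n_full=3, n_init=5, n_bo=25):
    b = fit_predictor(n_full)
    # Phase 2: BO where each evaluation is a short run extrapolated to T_FULL.
    X = rng.uniform(0, 1, size=(n_init, 2))
    y = np.array([extrapolate(train(x, T_SHORT), b, T_FULL) for x in X])
    for _ in range(n_bo):
        cand = rng.uniform(0, 1, size=(512, 2))
        mean, std = gp_posterior(X, y, cand)
        x_next = cand[np.argmin(mean - std)]  # lower confidence bound (minimizing loss)
        y_next = extrapolate(train(x_next, T_SHORT), b, T_FULL)
        X, y = np.vstack([X, x_next]), np.append(y, y_next)
    best = X[np.argmin(y)]
    return b, best, float(final_loss(best))

if __name__ == "__main__":
    b, best, loss = jobs_sketch()
    print(f"fitted exponent b={b:.2f}, best config={best}, toy final loss={loss:.3f}")
```

In this toy setting the optimum sits at a mixture ratio of 0.54 and a scaled model setting of 0.6, with loss 1.0. Because the best mixture moves with the model setting, tuning one coordinate while the other is held fixed at a poor value lands on a worse configuration, which is the interaction the abstract describes.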
Peer Reviews
Decision · Submitted to ICLR 2026
**Problem formulation**: Explicitly formulating the interdependence between data mixture ratios and LoRA training configurations as a joint optimization problem is novel. Figure 2b demonstrates that optimal data mixtures vary across LoRA configurations, which is non-intuitive and well-illustrated. **Technical approach**: Combining BO with a scaling law predictor is sound. Theorem 4.1 shows prediction noise is handled as observation noise. Deep kernels for heteroskedastic modeling and continuous
**Severely limited scope**: Only PEFT (specifically LoRA) is evaluated, not full fine-tuning or other PEFT methods (prefix tuning, adapters). The abstract and title should make the scope clear. The claim that "JoBS can also be adapted for LLM pretraining" is unsupported speculation with limited evidence. **Questionable significance**: If this problem is important, why does no prior work address it? The baseline comparisons require running data mixture optimization then model training configurat
The paper focuses on an important question by explicitly formulating the chicken-and-egg dilemma between training data and model configuration in LLM fine-tuning as a joint black-box optimization problem. It proposes JoBS, which combines Bayesian Optimization with a neural performance scaling-law predictor to efficiently explore the configuration space. It models LLM performance as a Gaussian process, uses a predictor to extrapolate final results from short training runs, and provides a theoreti
1. The paper references the term $\gamma_T$ in the main theorem, but does not explicitly define it within the text.
2. All experiments are conducted on relatively small LLMs (up to 8B parameters). It remains unclear whether the proposed method scales to larger backbone models, where optimization dynamics may differ.
3. The compared baselines optimize data and model configurations separately. It would strengthen the empirical evidence to include or discuss any existing methods (if any) that at
The paper designs its method using BO techniques flexibly and appropriately. The core idea of using a performance predictor to amortize the cost of BO evaluations is well-motivated and practical. The writing is easy to follow, and the overall narrative is clear. The empirical results look good and strong, showing consistent improvements over a wide range of baselines, including independent and alternating optimization schemes, across multiple tasks and models. The "interaction improvement" claim
The problem formulation relies on a fixed training time budget. However, training time is highly sensitive to the implementation (e.g., specific frameworks for PEFT or inference). It is questionable whether using time as the primary budget is a robust choice, as opposed to a more implementation-agnostic budget such as total tokens, training steps, or FLOPs. The motivation in Section 3, particularly Figure 2, is a key pillar of the paper. However, I am wondering if the p
Taxonomy
Topics: Machine Learning and Data Classification · Domain Adaptation and Few-Shot Learning · Recommender Systems and Techniques
