The Effects of Data Split Strategies on the Offline Experiments for CTR Prediction
Ramazan Tarik Turksoy, Beyza Turkmen

TL;DR
This paper investigates how different data split strategies, especially temporal splits, affect the accuracy of offline CTR prediction evaluations, highlighting the importance of realistic data partitioning for model assessment.
Contribution
The paper systematically compares random and temporal data splits in offline CTR prediction evaluation and argues that realistic data partitioning is essential for trustworthy model assessment.
Findings
Temporal splits better reflect real-world scenarios.
Data split strategy significantly impacts offline evaluation results.
Random splits may overestimate model performance.
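The difference between the two strategies can be illustrated with a minimal sketch. It assumes a synthetic interaction log in which each record carries a `timestamp` and a `clicked` field. These field names, the helper functions, and the 80/20 ratio are all hypothetical choices for illustration, not the authors' experimental protocol.

```python
import random
from typing import Any, Dict, List, Tuple

Record = Dict[str, Any]


def random_split(records: List[Record], test_fraction: float = 0.2,
                 seed: int = 0) -> Tuple[List[Record], List[Record]]:
    """Shuffle the log and cut it, ignoring when each interaction happened."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) - round(len(shuffled) * test_fraction)
    return shuffled[:cut], shuffled[cut:]


def temporal_split(records: List[Record],
                   test_fraction: float = 0.2) -> Tuple[List[Record], List[Record]]:
    """Sort by time and hold out the most recent interactions as the test set."""
    ordered = sorted(records, key=lambda r: r["timestamp"])
    cut = len(ordered) - round(len(ordered) * test_fraction)
    return ordered[:cut], ordered[cut:]


def leaks_future(train: List[Record], test: List[Record]) -> bool:
    """True if some training record is newer than the earliest test record."""
    newest_train = max(r["timestamp"] for r in train)
    oldest_test = min(r["timestamp"] for r in test)
    return newest_train > oldest_test


# Synthetic log (assumption): 100 impressions with increasing timestamps.
log = [{"timestamp": t, "clicked": t % 3 == 0} for t in range(100)]

rand_train, rand_test = random_split(log)
temp_train, temp_test = temporal_split(log)

print("random split leaks future data:  ", leaks_future(rand_train, rand_test))
print("temporal split leaks future data:", leaks_future(temp_train, temp_test))
```

In the random split, the model trains on interactions that occur after some of the test interactions, which a deployed model could never do. That kind of leakage is one plausible way a random split can overestimate performance; the temporal split avoids it by construction.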
Abstract
Click-through rate (CTR) prediction is a crucial task in online advertising to recommend products that users are likely to be interested in. To identify the best-performing models, rigorous model evaluation is necessary. Offline experimentation plays a significant role in selecting models for live user-item interactions, despite the value of online experimentation like A/B testing, which has its own limitations and risks. Often, the correlation between offline performance metrics and actual online model performance is inadequate. One main reason for this discrepancy is the common practice of using random splits to create training, validation, and test datasets in CTR prediction. In contrast, real-world CTR prediction follows a temporal order. Therefore, the methodology used in offline evaluation, particularly the data splitting strategy, is crucial. This study aims to address the…
Taxonomy
Topics: Advanced X-ray and CT Imaging · Fault Detection and Control Systems
