Find the Leak, Fix the Split: Cluster-Based Method to Prevent Leakage in Video-Derived Datasets
Noam Glazner, Noam Tsfaty, Sharon Shalev, Avishai Weizman

TL;DR
This paper introduces a cluster-based frame selection strategy to prevent information leakage in video-derived datasets, ensuring more representative and balanced dataset partitions for improved model evaluation.
Contribution
It presents a clustering approach that groups visually similar frames before the dataset is split into training, validation, and test sets, reducing leakage and making video-based machine learning evaluations more reliable.
Findings
Reduces information leakage in video datasets
Creates more balanced and representative dataset splits
Improves reliability of model evaluation on video data
Abstract
We propose a cluster-based frame selection strategy to mitigate information leakage in video-derived frame datasets. By grouping visually similar frames before splitting into training, validation, and test sets, the method produces more representative, balanced, and reliable dataset partitions.
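The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the summary does not say which features or clustering algorithm the paper uses, so the greedy distance-threshold clustering, the function names `cluster_frames` and `split_by_cluster`, the `threshold` parameter, and the 70/15/15 ratios are all assumptions made for illustration. The point it shows is that whole clusters, not individual frames, are assigned to a split, so near-duplicate frames cannot end up on both sides of the train/test boundary.

```python
import math
import random
from collections import defaultdict


def cluster_frames(features, threshold):
    """Greedy leader clustering (an illustrative stand-in, not the paper's method).

    A frame joins the first cluster whose leader lies within `threshold`
    (Euclidean distance); otherwise it starts a new cluster.
    """
    leaders = []
    labels = []
    for vec in features:
        for cid, leader in enumerate(leaders):
            if math.dist(vec, leader) <= threshold:
                labels.append(cid)
                break
        else:
            leaders.append(vec)
            labels.append(len(leaders) - 1)
    return labels


def split_by_cluster(labels, ratios=(0.7, 0.15, 0.15), seed=0):
    """Assign whole clusters to train/val/test so no cluster spans two splits."""
    clusters = defaultdict(list)
    for idx, cid in enumerate(labels):
        clusters[cid].append(idx)

    # Shuffle for randomness, then place the largest clusters first
    # (the sort is stable, so ties keep their shuffled order).
    order = sorted(clusters)
    random.Random(seed).shuffle(order)
    order.sort(key=lambda cid: -len(clusters[cid]))

    names = ("train", "val", "test")
    targets = {name: ratio * len(labels) for name, ratio in zip(names, ratios)}
    splits = {name: [] for name in names}
    for cid in order:
        # Send the cluster to the split that is furthest below its target size.
        dest = max(names, key=lambda name: targets[name] - len(splits[name]))
        splits[dest].extend(clusters[cid])
    return splits


# Hypothetical 2-D frame features: three groups of near-duplicate frames.
features = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0), (10.0, 0.1)]
labels = cluster_frames(features, threshold=0.5)
splits = split_by_cluster(labels)
```

With very few clusters the realized split sizes can drift far from the requested ratios, and a split may even stay empty. That is the price of keeping clusters intact, and it matters less as the number of clusters grows.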
Taxonomy
Topics: Security and Verification in Computing · Advanced Malware Detection Techniques · User Authentication and Security Systems
