LaFiCMIL: Rethinking Large File Classification from the Perspective of Correlated Multiple Instance Learning
Tiezhu Sun, Weiguo Pian, Nadia Daoudi, Kevin Allix, Tegawendé F. Bissyandé, Jacques Klein

TL;DR
LaFiCMIL is a novel large file classification method leveraging correlated multiple instance learning, enabling efficient, high-performance classification of lengthy documents on a single GPU, surpassing existing models in benchmarks.
Contribution
Introduces LaFiCMIL, a new approach for large file classification that efficiently scales BERT to nearly 20,000 tokens on a single GPU, with state-of-the-art results.
Findings
Sets new state-of-the-art results across seven benchmark datasets.
Scales BERT to handle nearly 20,000 tokens on a single GPU.
Operates efficiently for binary, multi-class, and multi-label tasks.
Abstract
Transformer-based models have significantly advanced natural language processing, in particular the performance in text classification tasks. Nevertheless, these models face challenges in processing large files, primarily due to their input constraints, which are generally restricted to hundreds or thousands of tokens. Attempts to address this issue in existing models usually consist in extracting only a fraction of the essential information from lengthy inputs, while often incurring high computational costs due to their complex architectures. In this work, we address the challenge of classifying large files from the perspective of correlated multiple instance learning. We introduce LaFiCMIL, a method specifically designed for large file classification. LaFiCMIL is optimized for efficient operation on a single GPU, making it a versatile solution for binary, multi-class, and multi-label…
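To make the multiple-instance framing concrete, the sketch below shows how one long file can be viewed as a "bag" of fixed-length chunks, each short enough for a standard BERT encoder. It is an illustration only, not the paper's implementation: the helper name `split_into_bag` and the 512-token chunk length (the usual BERT input limit) are assumptions, and the step where correlations between chunks are modeled is not shown.

```python
from typing import List

# Assumption: 512 is the usual BERT input limit; the paper's actual
# chunk length is not stated in this summary.
BERT_MAX_LEN = 512


def split_into_bag(token_ids: List[int], chunk_len: int = BERT_MAX_LEN) -> List[List[int]]:
    """Split one long token sequence into consecutive chunks.

    Each chunk plays the role of an 'instance'; the list of chunks is the 'bag'
    that receives a single file-level label. (Hypothetical helper.)
    """
    if chunk_len <= 0:
        raise ValueError("chunk_len must be positive")
    return [token_ids[i:i + chunk_len] for i in range(0, len(token_ids), chunk_len)]


# A file of 20,000 tokens becomes a bag of 40 instances:
# 39 full chunks of 512 tokens plus one final chunk of 32 tokens.
bag = split_into_bag(list(range(20_000)))
num_instances = len(bag)
last_chunk_len = len(bag[-1])
```

Under these assumptions, a file near the 20,000-token scale mentioned above yields about 40 instances, which gives a sense of the bag sizes a single-GPU classifier would need to aggregate.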
Taxonomy
Topics: Digital and Cyber Forensics · Advanced Malware Detection Techniques · Software Engineering Research
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dropout · WordPiece · Attention Dropout · Residual Connection · Weight Decay
