Moving Stuff Around: A study on efficiency of moving documents into memory for Neural IR models
Arthur Câmara, Claudia Hauff

TL;DR
This paper investigates how different data handling strategies between disk, main memory, and VRAM affect the training efficiency of neural IR models on multiple GPUs, revealing that streaming data from disk can be more scalable and faster than loading all data into memory.
Contribution
The study compares three data management approaches for IR datasets in neural training, highlighting the scalability and performance benefits of disk streaming over in-memory loading.
Findings
Streaming data from disk can outperform in-memory loading for large datasets.
Memory optimization techniques like memory pinning and RAMDISK reduce training time.
In-memory loading is not feasible for setups with many GPUs due to memory constraints.
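The contrast between the two loading strategies can be illustrated with a minimal sketch. It is not the authors' implementation: it uses only the Python standard library, assumes a hypothetical one-document-per-line text file, and every name in it (`DiskStreamedCollection`, `build_offset_index`, `load_in_memory`, `write_toy_collection`, `demo`) is made up for illustration. The in-memory variant holds every document in RAM, while the streamed variant keeps only one byte offset per document and reads the text from disk on demand.

```python
import os
import tempfile


def build_offset_index(path):
    """Scan the file once and record the byte offset where each line starts."""
    offsets = []
    with open(path, "rb") as f:
        while True:
            pos = f.tell()
            line = f.readline()
            if not line:
                break
            offsets.append(pos)
    return offsets


class DiskStreamedCollection:
    """Hypothetical collection that fetches each document by seeking on disk."""

    def __init__(self, path):
        self.path = path
        self.offsets = build_offset_index(path)  # one integer per document
        self._file = open(path, "rb")

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, idx):
        # Jump to the start of the requested document and read a single line.
        self._file.seek(self.offsets[idx])
        return self._file.readline().decode("utf-8").rstrip("\n")

    def close(self):
        self._file.close()


def load_in_memory(path):
    """Baseline: read every document into a Python list held in main memory."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def write_toy_collection(path, n_docs):
    """Write a small synthetic collection, one document per line."""
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for i in range(n_docs):
            f.write(f"doc {i}: some passage text\n")


def demo(n_docs=100):
    """Build a toy collection and check that both strategies agree."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        write_toy_collection(path, n_docs)
        in_memory = load_in_memory(path)
        streamed = DiskStreamedCollection(path)
        try:
            same = all(streamed[i] == in_memory[i] for i in range(n_docs))
            return len(streamed), same, streamed[n_docs - 1]
        finally:
            streamed.close()
    finally:
        os.remove(path)
```

In a PyTorch setting, a class with `__len__` and `__getitem__` like the one above could plausibly sit behind a map-style dataset, with each worker process opening its own file handle. That wiring is left out here so the sketch stays dependency-free.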
Abstract
When training neural rankers using Large Language Models, a practitioner is expected to use multiple GPUs to reduce training time. By using more devices, deep learning frameworks such as PyTorch let the user drastically increase the available VRAM pool, making larger batches possible during training and therefore shrinking training time. At the same time, one of the most critical processes, and one that is generally overlooked when running data-hungry models, is how data is managed between disk, main memory and VRAM. Most open source research implementations overlook this memory hierarchy and instead load all documents from disk into main memory, then let the framework (e.g., PyTorch) handle moving data into VRAM. Therefore, with the increasing sizes of datasets dedicated to IR research, a natural question arises: is this the optimal solution for…
Taxonomy
Topics: Topic Modeling · Machine Learning and Data Classification · Advanced Neural Network Applications
