A Deep Dive into VirusTotal: Characterizing and Clustering a Massive File Feed
Kevin van Liebergen (1), Juan Caballero (1), Platon Kotzias (2), Chris Gates (2) ((1) IMDEA Software Institute, (2) Norton Research Group)

TL;DR
This paper analyzes 328 million VirusTotal reports to characterize the file feed, compare it with other telemetry, and develop scalable clustering methods for threat hunting, revealing insights into malware diversity and detection.
Contribution
It provides an in-depth analysis of VirusTotal's file feed and introduces scalable clustering approaches for large-scale threat hunting.
Findings
HAC-T and FVG produce high precision clusters
FVG scales to 235 million samples in 15 hours
Clusters help identify potentially malicious samples labeled as benign
Abstract
Online scanners analyze user-submitted files with a large number of security tools and provide access to the analysis results. As the most popular online scanner, VirusTotal (VT) is often used for determining if samples are malicious, labeling samples with their family, hunting for new threats, and collecting malware samples. We analyze 328M VT reports for 235M samples collected for one year through the VT file feed. We use the reports to characterize the VT file feed in depth and compare it with the telemetry of a large security vendor. We answer questions such as How diverse is the feed? Does it allow building malware datasets for different filetypes? How fresh are the samples it provides? What is the distribution of malware families it sees? Does that distribution really represent malware on user devices? We then explore how to perform threat hunting at scale by investigating…
Taxonomy
Topics: Network Security and Intrusion Detection · Advanced Malware Detection Techniques · Internet Traffic Analysis and Secure E-voting
