Clid: Identifying TLS Clients With Unsupervised Learning on Domain Names
Ihyun Nam, Gerry Wan

TL;DR
Clid is an unsupervised learning tool that identifies TLS clients by clustering domain names from SNI fields, providing broad client insights without relying on outdated rule-based databases.
Contribution
This paper introduces Clid, a novel unsupervised clustering approach using Bayesian optimization to identify TLS clients based on domain name associations.
Findings
Clid successfully identifies strongly associated domain names for at least 60% of clients.
Clid outperforms rule-based methods in dynamic network environments.
Clid can adapt to large-scale TLS datasets with millions of handshakes.
Abstract
In this paper, we introduce Clid, a Transport Layer Security (TLS) client identification tool based on unsupervised learning on domain names in the server name indication (SNI) field. Clid aims to provide some information on a wide range of clients, even though it may not identify a definitive characteristic of every client. This differs from many existing rule-based client identification tools, which rely on hardcoded databases to identify granular characteristics of a few clients. These tools can often identify only a small number of clients in a real-world network as their databases grow outdated, which motivates an alternative approach like Clid. For this research, we use some 345 million anonymized TLS handshakes collected from a large university campus network. From each handshake, we create a TCP fingerprint that…
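To make the idea of associating SNI domain names with client fingerprints concrete, here is a minimal sketch. It is not Clid's clustering or its Bayesian optimization step, which this summary does not specify. It only computes a simple co-occurrence share: for each domain, the fraction of its handshakes that come from its most frequent fingerprint. The input format (pairs of fingerprint and domain), the function name `domain_associations`, and the `min_share` threshold are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def domain_associations(handshakes, min_share=0.8):
    """Map each fingerprint to the domains seen mostly with that fingerprint.

    handshakes: iterable of (fingerprint, sni_domain) pairs (hypothetical format).
    min_share:  hypothetical threshold on the fraction of a domain's handshakes
                that must come from a single fingerprint.
    """
    # domain -> Counter of how often each fingerprint requested it
    per_domain = defaultdict(Counter)
    for fingerprint, domain in handshakes:
        # Normalize case and a trailing dot so equivalent names are merged.
        per_domain[domain.lower().rstrip(".")][fingerprint] += 1

    associated = defaultdict(dict)
    for domain, counts in per_domain.items():
        total = sum(counts.values())
        fingerprint, hits = counts.most_common(1)[0]
        share = hits / total
        if share >= min_share:
            associated[fingerprint][domain] = share
    return dict(associated)


# Toy data with made-up fingerprints and reserved example domains.
sample = [
    ("fp_a", "updates.example.com"),
    ("fp_a", "updates.example.com"),
    ("fp_a", "cdn.example.net"),
    ("fp_b", "cdn.example.net"),
    ("fp_b", "api.example.org"),
]
result = domain_associations(sample)
```

In this toy data, `cdn.example.net` is requested equally by both fingerprints, so it falls below the threshold and is attached to neither, while the other two domains are each tied to a single fingerprint. A real pipeline would need a far more robust notion of association than a fixed threshold.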
Taxonomy
Topics: Text Readability and Simplification
