Impact of HTTP Cookie Violations in Web Archives
Sawood Alam, Plinio Vargas, Michele C. Weigle, and Michael L. Nelson

TL;DR
This paper examines how HTTP cookies can bias the content captured in archival crawls and how ignoring them at replay time leads to cookie violations. It proposes ways for crawlers and replay systems to handle cookies that improve archive fidelity.
Contribution
It proposes that crawlers store cookies with a short expiration time and that replay systems match on Vary header values in addition to URIs, in order to reduce cookie violations in web archives.
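The crawler-side idea can be illustrated with a minimal sketch, assuming a crawler that caps every cookie's lifetime regardless of what the server requests. The class name `ShortLivedCookieJar`, the `max_age` cap of 60 seconds, and the method names are hypothetical and not taken from the paper, which does not specify a concrete lifetime or API here.

```python
import time
from dataclasses import dataclass, field


@dataclass
class ShortLivedCookieJar:
    """Hypothetical crawler cookie store that caps every cookie's lifetime."""

    max_age: float = 60.0  # assumed cap in seconds; not a value from the paper
    _store: dict = field(default_factory=dict)

    def set(self, domain, name, value, server_max_age=None, now=None):
        """Store a cookie, shortening its lifetime to at most max_age."""
        now = time.time() if now is None else now
        if server_max_age is None:
            ttl = self.max_age  # session cookies also get the short cap
        else:
            ttl = min(server_max_age, self.max_age)
        self._store[(domain, name)] = (value, now + ttl)

    def cookies_for(self, domain, now=None):
        """Return unexpired cookies for a domain, dropping expired ones."""
        now = time.time() if now is None else now
        self._store = {k: v for k, v in self._store.items() if v[1] > now}
        return {
            name: value
            for (d, name), (value, _expiry) in self._store.items()
            if d == domain
        }


jar = ShortLivedCookieJar(max_age=60)
# The server asks for a one-day cookie; the jar keeps it for 60 seconds only.
jar.set("example.com", "lang", "fr", server_max_age=86400, now=0)
fresh = jar.cookies_for("example.com", now=30)
stale = jar.cookies_for("example.com", now=61)
```

Under this assumption a cookie set early in a crawl stops influencing requests made later in the same crawl, which is one way to limit how far cookie-dependent content spreads across captures.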
Findings
Cookies with short expiration reduce bias
Considering Vary headers improves replay accuracy
Proposed methods decrease defaced mementos
Abstract
Certain HTTP Cookies on certain sites can be a source of content bias in archival crawls. Accommodating Cookies at crawl time but not utilizing them at replay time may cause cookie violations, resulting in defaced composite mementos that never existed on the live web. To address these issues, we propose that crawlers store Cookies with a short expiration time and that archival replay systems account for values in the Vary header along with URIs.
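As an illustration of the replay-side proposal, the following is a minimal sketch of a lookup that keys captures on the URI plus the values of the request headers named in the response's Vary header. The names `vary_key` and `ReplayIndex` are hypothetical, and the sketch makes the simplifying assumption that each URI has a single recorded Vary header; it is not the authors' implementation or the behaviour of any particular replay system.

```python
def vary_key(uri, vary_header, request_headers):
    """Build a lookup key from the URI and the request headers named in Vary."""
    names = sorted(h.strip().lower() for h in vary_header.split(",") if h.strip())
    req = {k.lower(): v for k, v in request_headers.items()}
    return (uri, tuple((n, req.get(n, "")) for n in names))


class ReplayIndex:
    """Hypothetical capture index that distinguishes captures by Vary values."""

    def __init__(self):
        self._by_key = {}
        self._vary = {}  # uri -> Vary header recorded at crawl time (assumed unique)

    def add(self, uri, vary_header, request_headers, body):
        """Record a capture together with the request headers that produced it."""
        self._vary[uri] = vary_header
        self._by_key[vary_key(uri, vary_header, request_headers)] = body

    def lookup(self, uri, request_headers):
        """Return the capture whose Vary-relevant headers match, else None."""
        vary_header = self._vary.get(uri)
        if vary_header is None:
            return None
        return self._by_key.get(vary_key(uri, vary_header, request_headers))


index = ReplayIndex()
index.add("https://example.com/", "Cookie", {"Cookie": "lang=fr"}, "Bonjour")
index.add("https://example.com/", "Cookie", {"Cookie": "lang=en"}, "Hello")

french = index.lookup("https://example.com/", {"cookie": "lang=fr"})
english = index.lookup("https://example.com/", {"Cookie": "lang=en"})
missing = index.lookup("https://example.com/", {"Cookie": "lang=de"})
```

Matching on the URI alone would return whichever capture happened to be stored for the page, so a page and its embedded resources could come from different cookie states. Keying on the Vary values as well keeps a replayed request from being served a capture made under a different cookie.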
Taxonomy
Topics: Web Application Security Vulnerabilities · Web Data Mining and Analysis · Advanced Malware Detection Techniques
