Impact of HTTP Cookie Violations in Web Archives

Sawood Alam, Plinio Vargas, Michele C. Weigle, and Michael L. Nelson

arXiv:1906.07141·cs.DL·June 18, 2019·1 cites

Open Access

TL;DR

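This paper examines how HTTP cookies can bias the content captured in archival crawls, and how honoring cookies at crawl time but ignoring them at replay time causes cookie violations. It proposes ways for crawlers and replay systems to handle cookies so that archived pages stay faithful to what existed on the live web.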

Contribution

It proposes that crawlers store cookies with a short expiration time and that archival replay systems account for the values named in the Vary header, along with URIs, to reduce cookie violations in web archives.

Findings

01

Cookies with short expiration reduce bias

02

Considering Vary headers improves replay accuracy

03

Proposed methods decrease defaced mementos

Abstract

Certain HTTP Cookies on certain sites can be a source of content bias in archival crawls. Accommodating Cookies at crawl time, but not utilizing them at replay time may cause cookie violations, resulting in defaced composite mementos that never existed on the live web. To address these issues, we propose that crawlers store Cookies with short expiration time and archival replay systems account for values in the Vary header along with URIs.
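
The replay-side proposal, matching on Vary header values in addition to the URI, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Capture` record, the `matches` and `lookup` functions, and the example URI and cookie values are all hypothetical, and a real replay system would index WARC records rather than an in-memory list. The only assumption taken from HTTP itself is that the `Vary` response header lists the request header names that influenced the response.

```python
from dataclasses import dataclass


@dataclass
class Capture:
    """One archived response (hypothetical record shape, not a WARC parser)."""
    uri: str
    request_headers: dict   # headers the crawler sent, e.g. {"Cookie": "lang=fr"}
    response_headers: dict  # headers the server returned, e.g. {"Vary": "Cookie"}
    body: str


def header_value(headers, name):
    """Case-insensitive header lookup; returns "" when the header is absent."""
    for key, value in headers.items():
        if key.lower() == name.lower():
            return value.strip()
    return ""


def vary_fields(response_headers):
    """Return the lowercased request header names listed in Vary."""
    raw = header_value(response_headers, "Vary")
    return [f.strip().lower() for f in raw.split(",") if f.strip()]


def matches(capture, uri, request_headers):
    """True if the capture fits the URI and every header named in its Vary."""
    if capture.uri != uri:
        return False
    fields = vary_fields(capture.response_headers)
    if "*" in fields:
        # Under HTTP caching rules, "Vary: *" never matches a later request.
        return False
    return all(
        header_value(capture.request_headers, f) == header_value(request_headers, f)
        for f in fields
    )


def lookup(captures, uri, request_headers):
    """Return the first capture that matches, or None if there is none."""
    for capture in captures:
        if matches(capture, uri, request_headers):
            return capture
    return None


# Two captures of the same URI that differ only in the Cookie sent at crawl time.
captures = [
    Capture("https://example.com/", {"Cookie": "lang=fr"}, {"Vary": "Cookie"}, "Bonjour"),
    Capture("https://example.com/", {"Cookie": "lang=en"}, {"Vary": "Cookie"}, "Hello"),
]
hit = lookup(captures, "https://example.com/", {"Cookie": "lang=en"})
miss = lookup(captures, "https://example.com/", {})
```

A URI-only lookup would return the French capture for every request. Under this sketch, a replay request that carries `lang=en` gets the English capture, and a request with no cookie gets no match instead of a response that was served under a different cookie.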

Taxonomy

Topics: Web Application Security Vulnerabilities · Web Data Mining and Analysis · Advanced Malware Detection Techniques