Adapting the Hypercube Model to Archive Deferred Representations and Their Descendants

Justin F. Brunelle; Michele C. Weigle; Michael L. Nelson

arXiv:1601.05142 · cs.DL · January 21, 2016 · 2 cites

TL;DR

This paper presents a method to improve web archiving by discovering and crawling deferred representations and their descendants, addressing challenges posed by client-side technologies like JavaScript.

Contribution

It adapts the Hypercube model to effectively identify and archive client-side generated web page states and their embedded resources.

Findings

01. Average of 38.5 descendants per seed URI

02. 70.9% of descendants reached via onclick events

03. Added 15.6 times more embedded resources than Heritrix

Abstract

The web is today's primary publication medium, making web archiving an important activity for historical and analytical purposes. Web pages are increasingly interactive, resulting in pages that are increasingly difficult to archive. Client-side technologies (e.g., JavaScript) enable interactions that can potentially change the client-side state of a representation. We refer to representations that load embedded resources via JavaScript as deferred representations. It is difficult to archive all of the resources in deferred representations and the result is archives with web pages that are either incomplete or that erroneously load embedded resources from the live web. We propose a method of discovering and crawling deferred representations and their descendants (representation states that are only reachable through client-side events). We adapt the Dincturk et al. Hypercube model to…
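To make the notion of a descendant concrete, the following is a minimal sketch and not the authors' implementation. It assumes a page's client-side states can be written down as a small transition table, with a hypothetical dictionary `TOY_PAGE` standing in for a real browser, and it walks that table breadth-first to collect the states that are reachable only by firing client-side events. The state names, the selectors, and the `max_depth` cutoff are all illustrative assumptions.

```python
from collections import deque

# Hypothetical toy page: each state maps (event, element) pairs to the state
# reached after firing that event. A real crawler would drive a headless
# browser instead of reading a dictionary.
TOY_PAGE = {
    "s0": {("onclick", "#next"): "s1", ("onmouseover", "#menu"): "s2"},
    "s1": {("onclick", "#next"): "s3", ("onclick", "#prev"): "s0"},
    "s2": {},
    "s3": {},
}


def discover_descendants(page, root, max_depth=2):
    """Breadth-first search for states reachable from `root` via events.

    Returns a dict mapping each descendant state to its depth, i.e. the
    smallest number of events needed to reach it from the root.
    """
    depth_of = {root: 0}
    queue = deque([root])
    while queue:
        state = queue.popleft()
        if depth_of[state] >= max_depth:
            continue  # assumed cutoff: do not fire events beyond max_depth
        for (_event, _element), target in page[state].items():
            if target not in depth_of:  # skip states already seen
                depth_of[target] = depth_of[state] + 1
                queue.append(target)
    del depth_of[root]  # the root is the seed state, not a descendant
    return depth_of


descendants = discover_descendants(TOY_PAGE, "s0")
```

In this toy table, `s1` and `s2` are one event away from the seed state and `s3` is two events away. Tracking visited states keeps the `#prev` transition from looping back to `s0` indefinitely.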

Code & Models

Repositories

N0taN3rd/Squidwarc

Taxonomy

Topics: Web Data Mining and Analysis · Software Testing and Debugging Techniques · Software Engineering Research