Adapting the Hypercube Model to Archive Deferred Representations and Their Descendants
Justin F. Brunelle, Michele C. Weigle, Michael L. Nelson

TL;DR
This paper presents a method to improve web archiving by discovering and crawling deferred representations and their descendants, addressing challenges posed by client-side technologies like JavaScript.
Contribution
It adapts the Hypercube model to effectively identify and archive client-side generated web page states and their embedded resources.
Findings
Average of 38.5 descendants per seed URI
70.9% of descendants reached via onclick events
Added 15.6 times more embedded resources than Heritrix
Abstract
The web is today's primary publication medium, making web archiving an important activity for historical and analytical purposes. Web pages are increasingly interactive, resulting in pages that are increasingly difficult to archive. Client-side technologies (e.g., JavaScript) enable interactions that can potentially change the client-side state of a representation. We refer to representations that load embedded resources via JavaScript as deferred representations. It is difficult to archive all of the resources in deferred representations and the result is archives with web pages that are either incomplete or that erroneously load embedded resources from the live web. We propose a method of discovering and crawling deferred representations and their descendants (representation states that are only reachable through client-side events). We adapt the Dincturk et al. Hypercube model to…
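The idea of descendants (states reachable only by firing client-side events) can be illustrated with a small sketch. The code below is not the Hypercube model or the authors' crawler; it is a plain breadth-first walk over a hand-written toy page. The state names, event labels, resource names, and the `discover_descendants` helper are all hypothetical and chosen only for illustration.

```python
from collections import deque

# Hypothetical toy page: each state maps client-side events to the state
# they lead to. None of these names come from the paper.
TRANSITIONS = {
    "s0": {"onclick#next": "s1", "onmouseover#menu": "s2"},
    "s1": {"onclick#next": "s3"},
    "s2": {},
    "s3": {},
}

# Hypothetical embedded resources loaded by each state.
RESOURCES = {
    "s0": {"a.png"},
    "s1": {"a.png", "b.png"},
    "s2": {"a.png", "menu.css"},
    "s3": {"a.png", "c.png"},
}


def discover_descendants(root, max_depth=2):
    """Breadth-first walk over event-reachable states, starting at root.

    Returns the descendants found (state, triggering event, depth) and the
    union of embedded resources seen across the root and its descendants.
    """
    seen = {root}
    resources = set(RESOURCES[root])
    queue = deque([(root, 0)])
    descendants = []
    while queue:
        state, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not fire events beyond the depth limit
        for event, target in sorted(TRANSITIONS[state].items()):
            if target in seen:
                continue  # this state was already reached another way
            seen.add(target)
            descendants.append((target, event, depth + 1))
            resources |= RESOURCES[target]
            queue.append((target, depth + 1))
    return descendants, resources


if __name__ == "__main__":
    found, frontier = discover_descendants("s0")
    for state, event, depth in found:
        print(f"{state} reached via {event} at depth {depth}")
    print(sorted(frontier))
```

In this toy setup, a crawler that only dereferences the seed would see just the resources of `s0`, while walking the event-reachable states also surfaces the resources that only descendants load.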
Taxonomy
Topics: Web Data Mining and Analysis · Software Testing and Debugging Techniques · Software Engineering Research
