Enriching Existing Test Collections with OXPath

Philipp Schaer; Mandy Neumann

arXiv:1706.06836·cs.IR·September 13, 2017

TL;DR

This paper introduces a lightweight method using OXPath for efficiently enriching test collections with web data, demonstrated on GIRT4-XT, facilitating easier extension and creation of test collections for information retrieval evaluation.

Contribution

The paper presents a novel, simple approach employing OXPath to harvest web data for enriching test collections, reducing technical barriers compared to traditional methods.

Findings

01 Successfully extended GIRT4 with additional metadata fields

02 Method applicable to various scenarios for creating or expanding test collections

03 Enables reuse of collections for diverse evaluation purposes

Table 1: Overview of the included fields of the original GIRT4 corpus, the available SOLIS data from the Sowiport portal, and the combined GIRT4-XT corpus. Three different states are marked in the table: – = field data not available; ∘ = available in unstructured form; ∙ = available in structured form.

Corpus	id	author	editor	title	source	issn	isbn	pubyear	keywords	classification	abstract	full text	method	location	publisher	pages	language	country
GIRT4	∙	∙	–	∘	∘	–	–	∙	∙	∙	∙	–	∙	–	–	–	∙	∙
SOLIS	∙	∙	∙	∘	∙	∙	∙	∙	∙	∙	∙	∘	∙	∘	∘	∙	∙	∙
GIRT4-XT	∙	∙	∙	∘	∘	∙	∙	∙	∙	∙	∙	–	∙	∘	∘	∙	∙	∙

TH Köln (University of Applied Sciences), Cologne, Germany

(March 3, 2024)

Abstract

Extending TREC-style test collections by incorporating external resources is a time-consuming and challenging task. Making use of freely available web data requires technical skills to work with APIs or to create a web scraping program specifically tailored to the task at hand. We present a light-weight alternative that employs the web data extraction language OXPath to harvest data from web resources and add it to an existing test collection. We demonstrate this by creating an extended version of GIRT4 called GIRT4-XT with additional metadata fields harvested via OXPath from the social sciences portal Sowiport. This allows the re-use of this collection for other evaluation purposes like bibliometrics-enhanced retrieval. The demonstrated method can be applied to a variety of similar scenarios and is not limited to extending existing collections but can also be used to create completely new ones with little effort.

Keywords:

Test collections · Metadata enrichment · GIRT · OXPath · Harvesting of metadata · Scholarly retrieval

1 Introduction

Building TREC-style test collections for information retrieval evaluation is a costly activity. It involves at least three main tasks: (1) setting up an appropriate set of documents, (2) generating a list of topics (50 or even more, as suggested by Voorhees [9]), and (3) obtaining relevance assessments (most of the time by employing domain experts or search specialists as assessors). Taken together, these three tasks make the generation of a new test collection, or the redesign and extension of an existing one, a time-consuming and challenging undertaking. Generating a completely new test collection is the most complex scenario. Therefore, we focus on the enrichment of existing test collections, especially the set of documents. This allows the reuse of documents, topics, and relevance assessments while enabling the old test collection to be reused in other evaluation contexts like scholarly search or bibliometrics-enhanced retrieval [6].

Previous projects like EFIREval (https://sites.google.com/site/ekanoulas/grants/EFIREval) already focused on adapting test collections to new environments or incorporated richer information about the different retrieval scenarios and searchers’ activities, but only a few tried to augment the document collection, which is why we focus on this desideratum.

Research question. How can we enrich parts of existing test collections, like the document collection, by incorporating external resources like digital libraries or other freely available web data sets with as little effort as possible?

Approach. We propose a light-weight method for extending and augmenting the document sets in test collections by employing the web extraction language OXPath. This language, which is derived from XPath, is capable of extracting huge sets of information from large web corpora. It is used by scholarly literature portals like dblp to build up their data sets.

Contributions. We show the feasibility of our approach by extending the GIRT4 collection that was used in the Domain-Specific Track of CLEF with freely available data from the social sciences portal Sowiport (http://sowiport.gesis.org). After harvesting the additional data, we created an extended collection called GIRT4-XT that augments the original GIRT4 documents with additional attributes like ISSN codes. This way, the rather old test collection that was initially used for cross-lingual and domain-specific retrieval evaluations can be used for other evaluation purposes like bibliometrics-enhanced retrieval.

2 Related Work

Re-using existing test collections for other purposes is, in general, not a new idea. Berendsen et al. [2] used the previously mentioned GIRT collection to build a so-called pseudo test collection, i.e. one that is generated automatically. The relatively sparse data of the GIRT collection (the only content-bearing metadata being the title and a rather short abstract) comes with a rich set of annotations (see Table 1). These annotations were used to generate pseudo topics and relevance assessments. This pseudo test collection provided training material for learning-to-rank methods.

A similar approach was used by Roy, Ray, and Mitra [8] who used the CiteSeerX collection to generate a test collection for citation recommendation services. They extracted the textual part of a citation context to form a query. The cited references were taken to be the relevant documents for that query. This way 2,826 queries were obtained but most queries (contexts) have only one relevant citation, making this test collection rather sparse.

Larsen and Lioma [5] described different strategies to generate a scholarly IDEAL test collection. While they came up with some new ideas and strategies for gathering and curating a document collection, they rely on manually crafted topics and relevance assessments to complete the test collection. As they outline, the scholars that are the sources of topics and relevance assessments are notoriously busy, hard to engage, and unlikely to be replaced by crowdsourcing. They named INEX as a role model of community effort in collecting relevance judgments from its participants and encouraged others to follow that road.

The reuse of document and test collections is common practice by adding new topics and relevance assessments or by transferring them to new application domains (e.g. from IR evaluation to recommender systems). Both approaches most often rely on manual work and judgments. Another approach for building up test collections was presented in the Social Book Search [4] track of CLEF. They built their task on top of the INEX Amazon/LibraryThing collection [1] and enriched it with content from forum discussions on the LibraryThing website to extract topics and relevance assessments. This is a rather technical methodology to obtain this crucial part of a test collection which involved the generation of custom web crawlers for this single purpose.

3 Materials and Methods

As suggested by some of the related work (e.g. Social Book Search), test collections can be created or enhanced with freely available web data. But web pages are meant to be displayed to a human user, as opposed to APIs that provide a means for software applications to gather the structured data that makes up the content of those web pages. Thus for compiling a corpus from web data, one would have to either have access to such an API, or work directly with the human-oriented HTML interface. The former would definitely require some programming/scripting skills, while the latter would either require extensive programming skills for scraping the web page content, or a lot of human effort to collect the desired information manually. In the past, several attempts have been made to ease the process of acquiring web data for non-technical users, by providing web data extraction tools.

OXPath is an open-source language focusing on deep web crawling that takes a declarative approach to the problem [3]. Based on the XML query language XPath, it enables the simulation of user interaction with a web page and the extraction of information in the course of these interactions. To achieve this, OXPath extends the capabilities of XPath with five new elements: (1) actions like clicking and form filling, (2) interactions with the visual appearance of a page, (3) means of identifying nodes by multiple relations, (4) extraction markers to yield hierarchical records of sought-after information, and (5) the Kleene star to enable navigation of paginated content. With these means, it is possible to craft an expression to harvest a lot of data with just a few lines of code.
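To illustrate the XPath foundation that OXPath builds on, the following minimal Python sketch selects nodes from a small, well-formed page with the limited XPath subset of the standard library. The page snippet, the class names, and the function name are hypothetical and not taken from Sowiport. Plain XPath only selects nodes in a static document; the actions, extraction markers, and pagination described above are what OXPath adds on top and are not shown here.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed result page (not the real Sowiport markup).
PAGE = """
<html><body>
  <div class="record">
    <h2 class="title">Sample record A</h2>
    <span class="editor">Editor One</span>
    <span class="editor">Editor Two</span>
  </div>
  <div class="record">
    <h2 class="title">Sample record B</h2>
    <span class="editor">Editor Three</span>
  </div>
</body></html>
"""


def extract_records(page):
    """Select record nodes via XPath and build one nested dict per record."""
    root = ET.fromstring(page.strip())
    records = []
    for node in root.findall(".//div[@class='record']"):
        title = node.findtext("h2[@class='title']")
        # A record may list several editors; each one is kept separately.
        editors = [e.text for e in node.findall("span[@class='editor']")]
        records.append({"title": title, "editors": editors})
    return records


if __name__ == "__main__":
    for record in extract_records(PAGE):
        print(record)
```

Each resulting dictionary plays the role of one hierarchical record, with the title as a single value and the editors as a repeated field.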

OXPath can be used e.g. for harvesting bibliographic metadata for digital libraries like dblp, as presented by Michels et al. [7]. In contrast to other tools made for extracting bulk data from web pages, OXPath proves to be particularly memory-efficient as shown by Furche et al. [3].

Regarding document sets in test collections, OXPath can also be used to extract additional information from such digital libraries to extend the test collection with new attributes. Taking the social sciences portal Sowiport as an example, we created a light-weight OXPath wrapper that is able to harvest targeted information from a specific set of records and save the extracted data in a hierarchically structured form. Listing 1 demonstrates a sample OXPath wrapper that is able to interact with the Sowiport web page (note that we replaced all German terms from the Sowiport portal with English equivalents in this listing). It narrows down the list of presented items to those from a specific database (in this case the social science literature database SOLIS, which GIRT4 is based on) and navigates through the result list in a loop (lines 3–5). By clicking the title of each record element (line 7), the element’s detail view is opened, where additional data can be found. For example, in lines 8–10 the editor field is located in the page and each listed editor is extracted separately. In a similar vein, the acquisition id (“Acquis. id”) is extracted from a different location on the same page (lines 11–13). The extracted data is hierarchical in nature and can be serialized, e.g., in XML or CSV format for further processing.
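The two serialization targets mentioned here can be sketched in Python. This is a minimal illustration under stated assumptions, not the output format of the OXPath tool itself: the field names (`acquisition_id`, `editors`), the element names, and the sample values are hypothetical. It shows that XML keeps the hierarchy (one element per editor), while CSV has to flatten repeated fields into a single cell.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical harvested records; field names and values are illustrative only.
RECORDS = [
    {"acquisition_id": "00000001", "editors": ["Editor One", "Editor Two"]},
    {"acquisition_id": "00000002", "editors": []},
]


def to_xml(records):
    """Serialize records hierarchically: one <editor> element per editor."""
    root = ET.Element("results")
    for rec in records:
        item = ET.SubElement(root, "record")
        ET.SubElement(item, "acquisition_id").text = rec["acquisition_id"]
        for name in rec["editors"]:
            ET.SubElement(item, "editor").text = name
    return ET.tostring(root, encoding="unicode")


def to_csv(records, sep="; "):
    """Serialize records as flat rows; repeated editors share one cell."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["acquisition_id", "editors"])
    for rec in records:
        writer.writerow([rec["acquisition_id"], sep.join(rec["editors"])])
    return buf.getvalue()


if __name__ == "__main__":
    print(to_xml(RECORDS))
    print(to_csv(RECORDS))
```

Which format is preferable depends on the downstream processing: XML preserves multi-valued fields without a separator convention, whereas CSV is easier to load into tabular tools.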

4 Results

By harvesting additional data from the SOLIS database in Sowiport using a relatively simple declarative expression, we were able to extend the original GIRT4 data with additional information, such as ISSN/ISBN codes or editor, publisher, and location information (see Table 1 for an overview). The items from the GIRT collection were matched with the harvested data via their id, which was present both in the harvested SOLIS data (acquisition id) and in the GIRT4 data set (DOCID without the GIRT prefix).
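The id-based matching can be sketched as a simple dictionary join. This is a minimal sketch, not the code used for GIRT4-XT: the DOCID pattern (a `GIRT-` prefix followed by a two-letter language code), the field names, and the sample values are assumptions made for illustration only.

```python
import re


def strip_girt_prefix(docid):
    """Drop the collection prefix from a DOCID (prefix pattern is assumed)."""
    return re.sub(r"^GIRT-[A-Z]{2}", "", docid)


def enrich(girt_docs, solis_by_id):
    """Add harvested fields to GIRT documents that share an id.

    Existing GIRT fields are never overwritten; documents without a
    harvested counterpart are passed through unchanged and reported.
    """
    enriched, unmatched = [], []
    for doc in girt_docs:
        merged = dict(doc)
        extra = solis_by_id.get(strip_girt_prefix(doc["DOCID"]))
        if extra is None:
            unmatched.append(doc["DOCID"])
        else:
            for field, value in extra.items():
                merged.setdefault(field, value)
        enriched.append(merged)
    return enriched, unmatched


if __name__ == "__main__":
    # Hypothetical sample data.
    girt = [
        {"DOCID": "GIRT-DE19900100001", "TITLE": "Sample title A"},
        {"DOCID": "GIRT-DE19900100002", "TITLE": "Sample title B"},
    ]
    solis = {"19900100001": {"issn": "0000-0000", "editor": ["Editor One"]}}
    docs, missing = enrich(girt, solis)
    print(docs)
    print(missing)
```

Keeping the list of unmatched ids makes it easy to check afterwards which documents, such as the project descriptions, received no additional fields.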

Of a total of 151,319 documents in GIRT4, we extended 135,214 documents with data from SOLIS/Sowiport. Note that only the documents on social science literature were extended, while the social science project descriptions also included in GIRT were ignored. The new test collection is called GIRT4-XT and includes a total of six new metadata fields that were not included in the original data set (editor, ISSN, ISBN, location, publisher, and page numbers). Some of the SOLIS records include links to full texts, but as most of them are behind publisher paywalls we were not able to extract them.
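As a quick check of the proportions implied by these two counts, the following short Python sketch computes the share of extended documents and the number of documents left unchanged. Only the two counts come from the text; the variable and function names are illustrative.

```python
TOTAL_GIRT4_DOCS = 151_319  # documents in GIRT4, as reported in the text
EXTENDED_DOCS = 135_214     # documents extended with SOLIS/Sowiport data


def coverage(extended, total):
    """Return the fraction of documents that received additional fields."""
    return extended / total


share = coverage(EXTENDED_DOCS, TOTAL_GIRT4_DOCS)
not_extended = TOTAL_GIRT4_DOCS - EXTENDED_DOCS

if __name__ == "__main__":
    print(f"extended: {share:.1%}, not extended: {not_extended}")
```

Roughly nine out of ten GIRT4 documents were therefore extended; the text does not state how the remaining documents split between project descriptions and other unmatched records.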

5 Discussion and Conclusion

We showed how to extend and enrich existing information retrieval test collections by harvesting freely available metadata from digital library systems with the web extraction language OXPath. This method allows us to reuse existing test collections (especially their topics and relevance assessments) in different domains by adding new metadata to the existing documents in the collection.

We demonstrated the feasibility of the process by extending GIRT4 with additional document annotations like editor names, ISSN codes of the related journal, or page numbers. This way, new kinds of experiments are possible, like those discussed in the bibliometrics-enhanced IR community, but the proposed methods and techniques are not limited to this domain. Another use case for our test collection enrichment strategy might be the TREC Genomics Track test collections (http://skynet.ohsu.edu/trec-gen/). As suggested by Larsen and Lioma [5], these collections can be augmented by references extracted from PubMed, a scenario more than suitable for OXPath.

The proposed approach heavily relies on the usage of OXPath, as it is an easy-to-learn, light-weight, and all-in-one rapid development technology for gathering additional (meta-)data from web resources like digital libraries. Although the advantages outweigh the disadvantages, we would like to point out some shortcomings of OXPath that have to be considered. First of all, OXPath is not tuned for speed, which results in rather moderate processing times. Internally, the whole web page has to be rendered and processed to allow a human-comparable extraction mechanism. When processing many hundreds of thousands of web pages, the harvesting process can take many days. There are ways to distribute the whole process over parallel threads, but this is not a built-in feature. Another point is that there are relatively few tools to support the development process (we developed an extension for the text editor Atom ourselves, see https://atom.io/packages/language-oxpath). In spite of these limitations, OXPath is still a powerful and useful tool for harvesting semi-structured data from web resources.
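One way to distribute the harvesting over parallel threads is to split the record ids into chunks and process each chunk with a separate worker. The following Python sketch shows only this orchestration pattern under stated assumptions: `harvest_chunk` is a hypothetical placeholder that stands in for running one OXPath wrapper over a subset of records, and the chunk size and worker count are arbitrary. It is not a description of how the GIRT4-XT harvest was actually run.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def harvest_chunk(record_ids):
    """Hypothetical placeholder for one harvesting run over a subset of ids.

    A real implementation would invoke an OXPath wrapper here and parse
    its output; this stub only echoes the ids as empty records.
    """
    return [{"acquisition_id": rid} for rid in record_ids]


def harvest_parallel(record_ids, chunk_size=1000, workers=4):
    """Run harvest_chunk on each chunk in a thread pool, preserving order."""
    chunks = chunked(record_ids, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(harvest_chunk, chunks)
        return [record for chunk in results for record in chunk]


if __name__ == "__main__":
    ids = [str(i) for i in range(10)]
    print(harvest_parallel(ids, chunk_size=3, workers=2))
```

When harvesting a live portal, the number of workers should be chosen with the load on the target server in mind, since every worker renders full web pages.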

In the future, we want to employ OXPath not only for the enhancement of existing test collections, but also for the creation of completely new ones, where all the necessary data would be extracted from web resources. One of our role models for this is the Social Book Search collection.

Acknowledgements

This work was supported by Deutsche Forschungsgemeinschaft (DFG), grant no. SCHA 1961/1-2.

Bibliography

[1] Beckers, T., Fuhr, N., Pharo, N., Nordlie, R., Fachry, K.N.: Overview and results of the INEX 2009 Interactive Track. In: 14th European Conference on Research and Advanced Technology for Digital Libraries (ECDL 2010) (2010)
[2] Berendsen, R., Tsagkias, M., de Rijke, M., Meij, E.: Generating pseudo test collections for learning to rank scientific articles. In: Information Access Evaluation. Multilinguality, Multimodality, and Visual Analytics, pp. 42–53. Springer (2012), http://link.springer.com/chapter/10.1007/978-3-642-33247-0_6
[3] Furche, T., Gottlob, G., Grasso, G., Schallhart, C., Sellers, A.: OXPath: A language for scalable data extraction, automation, and crawling on the deep web. The VLDB Journal 22(1), 47–72 (Feb 2013), http://dx.doi.org/10.1007/s00778-012-0286-6
[4] Koolen, M., Kazai, G., Preminger, M., Doucet, A.: Overview of the INEX 2013 Social Book Search Track. In: Information Access Evaluation meets Multilinguality, Multimodality, and Visualization - Fourth International Conference of the Cross-Language Evaluation Forum, CLEF 2013. 26 pages. Valencia, Spain (Sep 2013), https://hal.archives-ouvertes.fr/hal-01073644
[5] Larsen, B., Lioma, C.: On the need for and provision for an “ideal” scholarly information retrieval test collection. In: Proc. of the 3rd Workshop on Bibliometric-enhanced Information Retrieval (BIR 2016), pp. 73–81 (2016), http://ceur-ws.org/Vol-1567/paper8.pdf
[6] Mayr, P., Scharnhorst, A., Larsen, B., Schaer, P., Mutschke, P.: Bibliometric-enhanced Information Retrieval, pp. 798–801. Springer International Publishing (2014)
[7] Michels, C., Fayzrakhmanov, R.R., Ley, M., Sallinger, E., Schenkel, R.: OXPath-based data acquisition for dblp. In: JCDL ’17: Proceedings of the 17th ACM/IEEE-CS on Joint Conference on Digital Libraries. pp. 319–320. ACM, New York, NY, USA (2017), to appear.
[8] Roy, D., Ray, K., Mitra, M.: From a scholarly big dataset to a test collection for bibliographic citation recommendation. In: Workshops at the Thirtieth AAAI Conference on Artificial Intelligence (2016), http://www.aaai.org/ocs/index.php/WS/AAAIW16/paper/view/12635