# Unlocking Social Media and User Generated Content as a Data Source for Knowledge Management

**Authors:** James Meneghello, Nik Thompson, Kevin Lee, Kok Wai Wong, Bilal Abu-Salih

arXiv: 1907.11934 · 2019-09-04

## TL;DR

This paper presents novel methods for automatically collecting and unlocking user-generated content from social media, enhancing data accessibility for knowledge management and analytical applications.

## Contribution

The paper introduces a method for automatically navigating pagination systems, a method for collecting social data without a priori knowledge of the site, and a publicly shared testbed that reflects the current state of internet sites.

## Key findings

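- The new algorithm increases the amount of UGC data made accessible for collection.
- Pagination systems can be navigated automatically to expose UGC, evaluated using browser emulation integrated with dynamic data collection.
- The new algorithm is benchmarked alongside existing data extraction techniques.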

## Abstract

The pervasiveness of Social Media and user-generated content (UGC) has triggered an exponential increase in global data volumes. However, due to collection and extraction challenges, data in many feeds, embedded comments, reviews and testimonials are inaccessible as a generic data source. This paper incorporates a Knowledge Management framework as a paradigm for knowledge management and data value extraction. This framework embodies solutions to unlock the potential of UGC as a rich, real-time data source for analytical applications. The contributions described in this paper are threefold. Firstly, a method for automatically navigating pagination systems to expose UGC for collection is presented. This is evaluated using browser emulation integrated with dynamic data collection. Secondly, a new method for collecting social data without any a priori knowledge of the sites is introduced. Finally, a new testbed is developed to reflect the current state of internet sites and shared publicly to encourage future research. The discussion benchmarks the new algorithm alongside existing data extraction techniques and provides evidence of the increased amount of UGC data made accessible by the new algorithm.
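
The abstract describes pagination navigation only at a high level. As a rough illustration of the general idea, and not the authors' algorithm, the sketch below finds a candidate "next page" URL in static HTML with a simple heuristic: it accepts links marked `rel="next"` or anchors whose label contains a cue word. The cue list, the `NextLinkFinder` class, and the `find_next_url` helper are all hypothetical names introduced here. The paper's evaluation uses browser emulation, which this standard-library sketch does not attempt.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical cue tokens for a "next page" control; not taken from the paper.
NEXT_CUES = {"next", "older", "»", "›"}


class NextLinkFinder(HTMLParser):
    """Collects hrefs of <a>/<link> elements that look like a 'next page' control."""

    def __init__(self):
        super().__init__()
        self.candidates = []    # matching hrefs, in document order
        self._href = None       # href of the <a> currently open, if any
        self._rel_next = False  # whether that <a> carries rel="next"
        self._text = []         # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attrs = dict(attrs)
        href = attrs.get("href")
        if not href:
            return
        rel = (attrs.get("rel") or "").lower().split()
        if tag == "link":
            # <link rel="next"> has no label, so rel is the only signal.
            if "next" in rel:
                self.candidates.append(href)
            return
        self._href = href
        self._rel_next = "next" in rel
        self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            tokens = "".join(self._text).lower().split()
            if self._rel_next or any(t in NEXT_CUES for t in tokens):
                self.candidates.append(self._href)
            self._href = None


def find_next_url(html, base_url):
    """Return the absolute URL of the first 'next page' candidate, or None."""
    finder = NextLinkFinder()
    finder.feed(html)
    finder.close()
    if not finder.candidates:
        return None
    return urljoin(base_url, finder.candidates[0])


if __name__ == "__main__":
    page = (
        '<div class="comments"><p>Nice!</p>'
        '<a href="/post/7?page=1">Prev</a> '
        '<a href="/post/7?page=3">Next »</a></div>'
    )
    print(find_next_url(page, "https://example.com/post/7?page=2"))
```

A collector built on this would call `find_next_url` repeatedly, fetching each returned URL until it gets `None` or revisits a page. Pagination driven by scripts or "load more" buttons would need a browser-emulation layer instead.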

---
Source: https://tomesphere.com/paper/1907.11934