# Multi-class Twitter Data Categorization and Geocoding with a Novel Computing Framework

**Authors:** Sakib Mahmud Khan, Mashrur Chowdhury, Linh B. Ngo, Amy Apon

arXiv: 1905.02916 · 2019-08-30

## TL;DR

This paper presents a computing framework that combines L-LDA with SVM to classify and geocode transportation-related Twitter data, reaching over 98.3% classification accuracy in a case study of New York City and its surrounding areas.

## Contribution

The study introduces a new analytical framework that integrates L-LDA with SVM for improved transportation data classification from Twitter.

## Key findings

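- The SVM classifier achieves more than 85% accuracy in identifying transportation-related tweets from structured data.
- The L-LDA-incorporated SVM classifies transportation-related tweets with over 98.3% accuracy, significantly higher than standalone L-LDA or SVM.
- The framework geocodes event locations using string similarity and sorts tweets into five sub-classes: incident, congestion, construction, special events, and other events.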

## Abstract

This study details progress in transportation data analysis with a novel computing framework, in keeping with the continuous evolution of computing technology. The framework combines a Labelled Latent Dirichlet Allocation (L-LDA)-incorporated Support Vector Machine (SVM) classifier with a supporting computing strategy, applied to publicly available Twitter data, to determine transportation-related events and provide reliable information to travelers. The analytical approach analyzes tweets using text classification and geocodes locations based on string similarity. A case study conducted for New York City and its surrounding areas demonstrates the feasibility of the analytical approach. Approximately 700,010 tweets are analyzed to extract relevant transportation-related information for one week. The SVM classifier achieves more than 85% accuracy in identifying transportation-related tweets from structured data. To further categorize the transportation-related tweets into sub-classes (incident, congestion, construction, special events, and other events), three supervised classifiers are used: L-LDA, SVM, and L-LDA-incorporated SVM. Findings from this study demonstrate that the analytical framework, which uses the L-LDA-incorporated SVM, can classify roadway transportation-related data from Twitter with over 98.3% accuracy, which is significantly higher than the accuracies achieved by standalone L-LDA and SVM.
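The abstract mentions geocoding locations based on string similarity but does not give the procedure. The sketch below is one plausible reading, not the authors' implementation: it slides a window over the tweet's words and compares each window with the entries of a small gazetteer using the standard-library `difflib.SequenceMatcher`. The gazetteer contents, the approximate coordinates, the `geocode` function, and the 0.85 threshold are all assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher

# Hypothetical gazetteer (assumption): place name -> approximate (lat, lon).
GAZETTEER = {
    "brooklyn bridge": (40.7061, -73.9969),
    "lincoln tunnel": (40.7625, -74.0110),
    "george washington bridge": (40.8517, -73.9527),
}


def similarity(a, b):
    """Return a similarity ratio in [0, 1] between two strings."""
    return SequenceMatcher(None, a, b).ratio()


def geocode(tweet, gazetteer=GAZETTEER, threshold=0.85):
    """Return (place, (lat, lon), score) for the best fuzzy match, or None.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    # Lowercase and keep only alphanumeric tokens so punctuation is ignored.
    tokens = re.findall(r"[a-z0-9]+", tweet.lower())
    best_name, best_score = None, 0.0
    for name in gazetteer:
        width = len(name.split())
        # Compare every window of `width` consecutive words with the place name.
        for start in range(len(tokens) - width + 1):
            window = " ".join(tokens[start:start + width])
            score = similarity(window, name)
            if score > best_score:
                best_name, best_score = name, score
    if best_name is not None and best_score >= threshold:
        return best_name, gazetteer[best_name], best_score
    return None


if __name__ == "__main__":
    # "Brookyln" is misspelled on purpose to exercise the fuzzy match.
    print(geocode("Accident on the Brookyln Bridge, expect delays"))
    print(geocode("Lovely weather today in the park"))
```

Fuzzy matching of this kind tolerates the misspellings common in tweets; the threshold trades missed locations against false matches and would need tuning on labeled data.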

---
Source: https://tomesphere.com/paper/1905.02916