# How Much Data is Enough? A Statistical Approach with Case Study on Longitudinal Driving Behavior

**Authors:** Wenshuo Wang, Chang Liu, Ding Zhao

arXiv: 1706.07637 · 2017-06-26

## TL;DR

This paper presents a statistical framework for estimating how much naturalistic driving data (NDD) is needed to model driver behavior accurately, so that data collection is neither insufficient nor wasteful.

## Contribution

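It introduces an assessment method that fits Gaussian kernel density estimates to driving data, measures the similarity between densities built from differing amounts of data with Kullback-Leibler divergence, and uses a max-minimum approach to compute the appropriate amount of data.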
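
For reference, the two building blocks named above have standard textbook definitions, given here as a sketch. The symbols are assumptions of this summary rather than the paper's notation: $x_1, \dots, x_n$ are observations of a single driving feature, $h > 0$ is the kernel bandwidth, and $\hat{f}_n$, $\hat{f}_m$ are density estimates built from $n$ and $m$ samples.

$$
\hat{f}_n(x) = \frac{1}{n h} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right), \qquad K(u) = \frac{1}{\sqrt{2\pi}}\, e^{-u^2/2}
$$

$$
D_{\mathrm{KL}}\!\left(\hat{f}_n \,\|\, \hat{f}_m\right) = \int \hat{f}_n(x)\, \log \frac{\hat{f}_n(x)}{\hat{f}_m(x)}\, dx
$$

The divergence is non-negative and equals zero only when the two densities coincide almost everywhere, which is why a small value can be read as "adding more data no longer changes the estimated density much".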

## Key findings

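- The method estimates the amount of naturalistic driving data needed to capture most features of driver behavior.
- Applied to car-following data from the University of Michigan Safety Pilot Model Deployment (SPMD) program, the approach yields an amount of data sufficient to capture most features of normal car-following behavior.
- The resulting amount is consistent with the experiment settings reported in much of the driver behavior literature.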

## Abstract

Big data has shown its uniquely powerful ability to reveal, model, and understand driver behaviors. The amount of data affects the experiment cost and conclusions in the analysis. Insufficient data may lead to inaccurate models, while excessive data wastes resources. For projects that cost millions of dollars, it is critical to determine the right amount of data needed. However, how to decide the appropriate amount has not been fully studied in the realm of driver behaviors. This paper systematically investigates this issue to estimate how much naturalistic driving data (NDD) is needed for understanding driver behaviors from a statistical point of view. A general assessment method is proposed using a Gaussian kernel density estimation to capture the underlying characteristics of driver behaviors. We then apply the Kullback-Leibler divergence method to measure the similarity between density functions with differing amounts of NDD. A max-minimum approach is used to compute the appropriate amount of NDD. To validate our proposed method, we investigated the car-following case using NDD collected from the University of Michigan Safety Pilot Model Deployment (SPMD) program. We demonstrate that, from a statistical perspective, the proposed approach can provide an appropriate amount of NDD capable of capturing most features of normal car-following behavior, which is consistent with the experiment settings in much of the literature.
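
The pipeline described in the abstract can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation, and several pieces are assumptions made only for illustration: the function names `gaussian_kde`, `kl_divergence`, and `sufficient_amount` are hypothetical, the bandwidth uses Silverman's rule of thumb, the density from each subset is compared against the density from all available data, and a simple threshold on the divergence stands in for the max-minimum approach, whose details are not given in the abstract. The data are synthetic, not SPMD data.

```python
import math
import random
import statistics


def gaussian_kde(samples, grid, bandwidth=None):
    """Evaluate a 1-D Gaussian kernel density estimate at each grid point."""
    n = len(samples)
    if bandwidth is None:
        # Silverman's rule of thumb (an assumption; the paper may choose differently).
        bandwidth = 1.06 * statistics.stdev(samples) * n ** (-1 / 5)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
        for x in grid
    ]


def kl_divergence(p, q, dx, eps=1e-12):
    """Approximate D_KL(p || q) by a Riemann sum on an evenly spaced grid."""
    zp, zq = sum(p) * dx, sum(q) * dx  # renormalise both densities on the grid
    total = 0.0
    for pi, qi in zip(p, q):
        pi, qi = pi / zp, qi / zq
        if pi > eps:  # terms with p close to 0 contribute nothing
            total += pi * math.log(pi / max(qi, eps))
    return total * dx


def sufficient_amount(data, sizes, grid, threshold):
    """Return the first size whose KDE is within `threshold` of the full-data KDE.

    The threshold rule is a simplified stand-in for the paper's max-minimum approach.
    """
    dx = grid[1] - grid[0]
    reference = gaussian_kde(data, grid)
    divergences = {
        n: kl_divergence(gaussian_kde(data[:n], grid), reference, dx) for n in sizes
    }
    chosen = next((n for n in sizes if divergences[n] <= threshold), None)
    return chosen, divergences


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic stand-in for one driving feature; not real naturalistic driving data.
    data = [rng.gauss(0.0, 1.0) for _ in range(2000)]
    grid = [-5.0 + 0.1 * i for i in range(101)]
    chosen, divergences = sufficient_amount(data, [50, 200, 800, 2000], grid, 0.01)
    print("chosen size:", chosen)
    for n, d in divergences.items():
        print(n, round(d, 5))
```

In this sketch the divergence for the largest size is zero by construction, because the subset and the reference are the same data. In practice the interesting quantity is how quickly the divergence falls as the subset grows, and the threshold (0.01 here) is an arbitrary illustrative value.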

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1706.07637/full.md

## Figures

30 figures with captions in the complete paper: https://tomesphere.com/paper/1706.07637/full.md

## References

74 references — full list in the complete paper: https://tomesphere.com/paper/1706.07637/full.md

---
Source: https://tomesphere.com/paper/1706.07637