Stochastic Sampling Simulation for Pedestrian Trajectory Prediction

Cyrus Anderson; Xiaoxiao Du; Ram Vasudevan; Matthew Johnson-Roberson

arXiv:1903.01860·cs.RO·February 27, 2020

TL;DR

This paper introduces a stochastic sampling simulation method to generate synthetic pedestrian trajectories, enabling effective training of deep neural networks for pedestrian prediction without extensive real-world data collection.

Contribution

The novel simulation approach produces realistic synthetic data that, when used to train deep learning models, outperforms models trained on real annotated data.

Findings

01 Synthetic trajectories improve prediction accuracy
02 Training with synthetic data surpasses real data training
03 Simulation reduces need for costly data annotation

Tables

Table I: Summary statistics for each scene.

Dataset	$\mu_{p}$	$\sigma_{p}$	$\sigma_{s}$
ETH	6.15	4.46	0.35
Hotel	5.60	3.41	0.15
Zara	7.36	3.95	0.25
Univ	26.77	20.31	0.27

Table II: Prediction performance across all datasets and methods (best in bold and second best underlined). The lower the errors, the better the performance. All errors are reported in meters. The $\mu$ rows give the mean error across all datasets for each of the ADE, MDE, and FDE evaluation metrics. Each remaining row corresponds to results on a test dataset. For example, the first row reports the ADE values when testing on the ETH dataset while trained on the other three datasets.

Metric	Dataset	Real 20%	Real 100%	Real+Synth-Large 20%	Real+Synth-Large 100%	Synth-Equal 20%	Synth-Equal 100%	Synth-Large 20%	Synth-Large 100%
ADE	ETH	0.95	0.82	0.82	0.77	0.96	0.80	0.79	0.75
ADE	Hotel	0.83	0.63	0.48	0.64	0.70	0.57	0.43	0.43
ADE	Zara	0.96	0.39	0.36	0.39	0.68	0.37	0.30	0.35
ADE	Univ	0.72	0.55	0.37	0.37	0.46	0.38	0.38	0.38
ADE	$\mu$	0.86	0.60	0.51	0.54	0.70	0.53	0.47	0.48
MDE	ETH	0.56	0.51	0.45	0.38	0.55	0.42	0.45	0.40
MDE	Hotel	0.47	0.17	0.15	0.12	0.17	0.16	0.13	0.12
MDE	Zara	0.44	0.10	0.12	0.09	0.14	0.11	0.13	0.11
MDE	Univ	0.37	0.21	0.18	0.15	0.13	0.15	0.20	0.17
MDE	$\mu$	0.46	0.25	0.23	0.19	0.25	0.21	0.23	0.20
FDE	ETH	1.79	1.61	1.65	1.55	1.76	1.56	1.59	1.50
FDE	Hotel	1.55	1.29	0.99	1.33	1.25	1.10	0.84	0.86
FDE	Zara	1.77	0.81	0.75	0.84	1.27	0.75	0.62	0.72
FDE	Univ	1.31	1.06	0.78	0.77	0.91	0.77	0.78	0.78
FDE	$\mu$	1.60	1.19	1.04	1.12	1.30	1.05	0.96	0.96

Equations

$$s_{tk}=\frac{||\mathbf{x}_{(t+1)k}-\mathbf{x}_{tk}||}{\Delta t},$$

$$\mathrm{ADE}=\frac{1}{N_{p}\times T}\sum_{i=1}^{N_{p}}\sum_{t=1}^{T}\mathbb{E}\left[||\mathbf{y}_{ti}-\mathbf{x}_{ti}||\right],$$

$$\mathrm{MDE}=\frac{1}{N_{p}\times T}\sum_{i=1}^{N_{p}}\sum_{t=1}^{T}\min_{j}\left\{||\mathbf{y}_{ti}^{(j)}-\mathbf{x}_{ti}||\right\}.$$

$$\mathrm{FDE}=\frac{1}{N_{p}}\sum_{i=1}^{N_{p}}\mathbb{E}\left[||\mathbf{y}_{Ti}-\mathbf{x}_{Ti}||\right].$$

Code & Models

Repositories

umautobots/sim_traj

Full text

Cyrus Anderson¹, Xiaoxiao Du², Ram Vasudevan³, and Matthew Johnson-Roberson²

This work was supported by a grant from Ford Motor Company via the Ford-UM Alliance under award N022884.

¹ C. Anderson is with the Robotics Institute, University of Michigan, Ann Arbor, MI 48109 USA.
² X. Du and M. Johnson-Roberson are with the Department of Naval Architecture and Marine Engineering, University of Michigan, Ann Arbor, MI 48109 USA.
³ R. Vasudevan is with the Department of Mechanical Engineering, University of Michigan, Ann Arbor, MI 48109 USA.

Abstract

Urban environments pose a significant challenge for autonomous vehicles (AVs) as they must safely navigate while in close proximity to many pedestrians. It is crucial for the AV to correctly understand and predict the future trajectories of pedestrians to avoid collision and plan a safe path. Deep neural networks (DNNs) have shown promising results in accurately predicting pedestrian trajectories, relying on large amounts of annotated real-world data to learn pedestrian behavior. However, collecting and annotating these large real-world pedestrian datasets is costly in both time and labor. This paper describes a novel method using a stochastic sampling-based simulation to train DNNs for pedestrian trajectory prediction with social interaction. Our simulation method can generate vast amounts of automatically annotated, realistic synthetic pedestrian trajectories based on small amounts of real annotation. We then use these synthetic trajectories to train an off-the-shelf, state-of-the-art deep learning approach, Social GAN (Generative Adversarial Network), to perform pedestrian trajectory prediction. Social GAN trained only on synthetic trajectories achieves better prediction results than the same network trained on human-annotated real-world data. Our work demonstrates the effectiveness and potential of using simulation as a substitute for human annotation efforts to train high-performing prediction algorithms such as DNNs.

Index Terms:

Deep Learning in Robotics and Automation, Simulation and Animation

I Introduction

In crowded urban environments, mobile robots, such as autonomous vehicles (AVs) and social robots, must navigate safely and efficiently while in close proximity to many pedestrians on the road. To avoid collision and ensure a smooth ride, it is crucial for the robots to accurately predict where a nearby pedestrian may move to next. However, the motion and actions of each pedestrian may depend on the behavior of others, which makes it difficult to forecast [1, 2, 3, 4]. Our goal is to predict all possible future locations and trajectories of pedestrians with probability estimates accounting for social interactions.

Classical approaches to forecasting pedestrian trajectories include Kalman filters, Gaussian Processes [5], and inverse optimal control [6], which estimate a model for each pedestrian based on past behavior to forecast the future. These approaches have traditionally focused on predicting single pedestrians without considering the social interactions between different pedestrians. These earlier results have been improved upon by extending the frameworks with hand-crafted features to model social interactions [7].

More recently, deep neural networks (DNNs) have been used to successfully make long-term pedestrian trajectory predictions that account for social interactions [2, 8, 4, 9, 10]. Instead of using hand-crafted features, they rely on vast amounts of annotated trajectories to learn social interactions directly from the data. However, it is often difficult, if not impossible, to obtain such large datasets with accurate annotation without resorting to either an enormous manual labeling effort [11, 12] or heavily constructed experiments with instrumented participants [13], both of which are expensive and prone to error. Therefore, it would be valuable to develop a training scheme that requires very small amounts of labeled real-world data, yet still produces satisfactory prediction results on test datasets.

To address this problem, we propose the use of automatically-annotated, realistic pedestrian simulations to train deep neural networks for 2D (top-down view) trajectory prediction. Generating synthetic data for training DNNs has shown promising results for various applications, including image classification [14], object detection [15], and pose estimation [16]. However, to the best of our knowledge, our work is the first that proposes techniques for synthesizing pedestrian trajectories.

Figure 1 shows an overview of our system. We develop a novel stochastic sampler that can generate tens of thousands of realistic pedestrian trajectories based on limited real-world annotations. We then use these samples to train state-of-the-art DNNs to predict pedestrian trajectories. In this paper, we use Social GAN [4] as our main prediction network, which provides state-of-the-art predictions and takes into account the interactions of all pedestrians in the scene. We envision our method being used to train high-performing prediction algorithms (such as DNNs) when there is little annotated data, or where the costs of collecting and labeling real data are high.

Our main contributions include (1) a novel nonparametric model of pedestrians, (2) a method to sample realistic pedestrian trajectories from this model, and (3) experiments training on these sampled trajectories and predicting on benchmark pedestrian datasets, such as the ETH dataset [17, 11] and the UCY dataset [18]. We demonstrate that Social GAN trained on the realistic sampled pedestrian trajectories alone achieves performance surpassing that achieved by training on real-world human-annotated data.

The paper is organized as follows. Section II describes related work in data augmentation and simulation as well as related work in trajectory prediction networks. Section III describes our proposed simulation model and synthetic trajectory generation process. Section IV presents pedestrian prediction results on benchmark datasets. Section V concludes this work.

II Related Work

In this section, we first describe DNN-based pedestrian trajectory prediction methods and our motivation for using the Social GAN. Then, we describe related work in synthetic data generation for training DNNs, such as using physics simulations and domain randomization. Finally, we describe related work in data augmentation and methods for generating synthetic datasets based on real data annotations.

II-A DNN-Based Pedestrian Trajectory Prediction Methods

In recent literature, DNN-based methods, particularly methods based on Long Short-Term Memory (LSTM) networks, have shown successful results in pedestrian trajectory prediction applications [2, 19, 20, 4, 21, 22, 23]. As our method does not consider scene geometry, we refrain from using SS-LSTM [21] or scene-LSTM [23], which incorporate scene information. Most recently, a convolutional neural network (CNN)-based trajectory prediction approach was proposed [10]. This approach uses highly parallelizable convolutional layers to handle temporal dependencies and predict future pedestrian positions. However, it only handles individual trajectory information and does not consider social interactions. Therefore, in this paper, we select Social GAN [4] as our main prediction network: it provides state-of-the-art prediction results and takes into account the interactions of all pedestrians in the scene.

The Social GAN uses a pooling mechanism together with a Generative Adversarial Network (GAN) to learn the social interactions of pedestrians and produces probabilistic trajectory outcomes. The current Social GAN is a purely data-driven approach [4] and benefits from large quantities of annotated trajectory data. However, accurate annotations for large datasets are often difficult, expensive, or impossible to obtain [11, 12, 13]. In this paper, we develop a stochastic sampling-based simulation system to automatically generate large amounts of annotated, simulated yet realistic pedestrian trajectories to use as training data for a Social GAN, and aim to produce prediction results comparable to or better than those of a Social GAN trained on real data.

II-B Physics Simulators and Domain Randomization

Physics simulators and engines can generate new data without the aid of existing data. Some studies have focused on crafting synthetic data as similar to real data as possible. In [15] and [24], for example, the Grand Theft Auto (GTA) game engine was utilized to produce automatically-annotated, photo-realistic images for object detection and semantic segmentation. In [25], the OpenRAVE simulator [26] was used to generate a large-scale database for grasps. This simulated grasps dataset was then used to train a DNN for binary classification (stable or unstable grasps) tasks.

In contrast, domain randomization (DR) methods [27, 28] were used to “bridge the gap” between simulation and reality. Domain randomization methods focus on bringing variability to the simulation, typically by varying global parameters (e.g., camera pose, shape and number of objects, texture, lighting) and adding noise to the simulated data without much regard to photo-realism. For example, the images synthesized in [29] contain car models and added geometric objects rendered over random background images. The aim here is to encourage DNNs to learn features invariant to the different kinds of added noise and to rendering artifacts, and to avoid over-fitting. Note that neither physics simulations nor DR methods rely on existing real datasets or annotations.

II-C Data Augmentation and (Real-)Data-Driven Synthesis

Data Augmentation (DA) is another approach to enlarge and enhance training datasets by performing a variety of transformations, typically on images [14]. Data augmentation methods usually seek to increase the amount of training data by generating new artificial data from existing real data while preserving label information. Common forms of DA include image translation, horizontal flips/reflections, crops, and perturbations to color (intensity) values [14]. These methods have been widely used in image classification [14] and object detection [30]. Other applications include acoustic modeling [31] and natural language processing [32]. So far, we are not aware of any standard label-preserving transforms designed specifically for non-image-based pedestrian trajectories, due in part to the need to account for interactions among pedestrians and scene geometry.

In addition to image transformation, several methods have been proposed to transfer the style between real and synthetic data, adopting the GAN framework [33, 34, 35]. Rogez and Schmid [16] proposed a synthesis engine to augment existing real images with manual 2D pose labels into 3D poses using 3D Motion Capture (MoCap) data. In a way, these methods were trying to automatically learn the relationship, or transformation, between real and synthetic datasets instead of performing predefined transformation as DA usually does.

Our work is most similar to [16] in that we also use labeled real datasets to generate a new, larger synthetic dataset for training. Unlike [16], which synthesizes images for pose estimation, we simulate realistic pedestrian trajectories. We also perform a variety of perturbations to generate our synthetic data, inspired by DA methods. We show that, with our proposed method, the DNN trained on synthetic data outperforms the same DNN trained on real data, even when the synthetic data is generated from a small amount of real annotations.

III Stochastic Sampling-Based Simulation

In this paper, our overall task is to predict future pedestrian trajectories given each pedestrian’s previous positions considering social interactions. To do so, we aim to generate large amounts of synthetic pedestrian trajectories for training a Social GAN. A limited amount of real data is given to our simulation system so we can generate realistic trajectories based on how pedestrians actually walk from observed real datasets. In this section, we define the notations and pre-computation steps used in our method and then describe our simulation method in detail.

III-A Notations and Pre-computation

Let $\mathbf{x}_{tk}$ denote the 2D position (top-down view) at time $t$ for the $k^{th}$ pedestrian, $\mathbf{x}_{tk}\in\mathbb{R}^{2}$. Since we use a small, real dataset to generate our simulation dataset, we denote $\mathcal{D_{R}}$ as the given real dataset and $\mathcal{D_{S}}$ as the generated simulation dataset. In our system, $\left|\mathcal{D_{S}}\right|\gg\left|\mathcal{D_{R}}\right|$, where $\left|\mathcal{D}\right|$ refers to the size of the dataset $\mathcal{D}$. Denote $\mathcal{N}$ as the Gaussian/normal distribution and denote $\mathcal{U}$ as the uniform distribution.

We denote the total number of unique pedestrians in the real dataset $\mathcal{D_{R}}$ by $K$ . At each frame (timestep) in $\mathcal{D_{R}}$ , we record the number of pedestrians in the scene as $K_{p}$ . $K$ is a known constant for $\mathcal{D_{R}}$ , while $K_{p}$ may change from frame to frame as pedestrians are entering and exiting the scene. From $K_{p}$ , we can compute the average number of pedestrians in a frame as $\mu_{p}$ and the variance of the number of pedestrians in the scene as $\sigma_{p}^{2}$ .

From $\mathcal{D_{R}}$ , we can also compute the walking speed for each pedestrian, following

$$s_{tk}=\frac{||\mathbf{x}_{(t+1)k}-\mathbf{x}_{tk}||}{\Delta t},$$

where $\mathbf{x}_{tk}$ denotes the 2D position at each timestep $t$ for pedestrian $k=1,...,K$ , $\Delta t$ is the difference in time between two frames/timesteps (fixed), and $||\cdot||$ denotes Euclidean distance. Figure 2 shows a simple illustration for computing the speed for two pedestrians. Note that a pedestrian in $\mathcal{D_{R}}$ may appear in sequences of varying lengths (due to entering and exiting the scene or due to available data). We denote the sequence length (length of observed timesteps) as $T_{k}$ for pedestrian $k$ in $\mathcal{D_{R}}$ . Also, note that the speed $s_{tk}$ can vary in each step for real pedestrians.

Given the speed for each pedestrian at each timestep, we can compute the average walking speed for the $k^{th}$ pedestrian as $\bar{s}_{k}$. In our method, we assume that all persons walk with the same speed variation between timesteps, and we compute $\sigma_{s}^{2}$ as the pooled variance across all $s_{tk},\forall t,k$. Note that $\bar{s}_{k}$ differs for each pedestrian, while $\sigma_{s}^{2}$ is the same for all pedestrians.

The above summary statistics reflect how pedestrians walk in the real dataset and are used later to generate synthetic trajectories in our sampling scheme.
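As a minimal sketch of this pre-computation, the statistics could be gathered as follows. The function names and data layout are hypothetical, and two details are assumptions not fixed by the text: $\sigma_{p}$ is taken as the population standard deviation of the per-frame counts, and the pooled variance uses deviations from each pedestrian's own mean speed divided by the total degrees of freedom.

```python
import math
from statistics import mean, pstdev


def step_speeds(track, dt):
    """Per-step speeds s_tk = ||x_(t+1)k - x_tk|| / dt for one pedestrian."""
    return [math.dist(a, b) / dt for a, b in zip(track, track[1:])]


def summary_statistics(tracks, counts_per_frame, dt=0.4):
    """tracks: one list of (x, y) positions per real pedestrian.
    counts_per_frame: number of pedestrians K_p seen in each frame."""
    mu_p = mean(counts_per_frame)
    sigma_p = pstdev(counts_per_frame)  # assumption: population std
    per_ped = [step_speeds(tr, dt) for tr in tracks if len(tr) > 1]
    mean_speeds = [mean(sp) for sp in per_ped]
    # Assumed pooled variance: squared deviations from each pedestrian's
    # own mean speed, divided by the summed degrees of freedom.
    ss = sum((s - m) ** 2 for sp, m in zip(per_ped, mean_speeds) for s in sp)
    dof = sum(len(sp) - 1 for sp in per_ped)
    sigma_s = math.sqrt(ss / dof) if dof > 0 else 0.0
    return mu_p, sigma_p, mean_speeds, sigma_s
```

The returned `mean_speeds` list plays the role of the pool $\{\bar{s}_{k}\}_{k=1}^{K}$, and `sigma_s` is shared by all pedestrians.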

III-B Sampling the Number of Pedestrians and Walking Speed

In our simulation, we use stochastic sampling to generate realistic pedestrian trajectories. In this section, we describe the method to sample the number of pedestrians and walking speeds for the simulated dataset.

Let $n_{p}$ denote the number of pedestrians in a frame in the simulated dataset. In simulating a single set of pedestrians, we sample $n_{p}$ based on statistics from the given real dataset. We already obtained the average number of pedestrians in a frame $\mu_{p}$ and the variance of the number of pedestrians in the scene $\sigma_{p}^{2}$ in Section III-A. We assume the number of pedestrians at each time follows a normal distribution $\mathcal{N}(\mu_{p},\sigma_{p}^{2})$ left-truncated at zero, which we denote by $\mathcal{N}(\mu_{p},\sigma_{p}^{2},0)$. We can then sample $n_{p}\sim\mathcal{N}(\mu_{p},\sigma_{p}^{2},0)$.

Regarding walking speed, we model each pedestrian as walking at a desired constant speed. Denote $s^{(i)}$ as the speed of the $i^{th}$ sampled pedestrian. We use the superscript to distinguish between the index of the stochastically sampled and real datasets. For each $i$ , we first uniformly sample an average speed value, $\bar{s}^{(i)}$ , from the pool of average speeds from real pedestrians, denoted as $\bar{s}^{(i)}\sim\mathcal{U}(\{\bar{s}_{k}\}_{k=1}^{K})$ . The variance of speed is assumed to be the same as real pedestrians, $\sigma_{s}^{2}$ . Then, we can sample $s^{(i)}$ based on a truncated normal distribution, $\mathcal{N}(\bar{s}^{(i)},\sigma_{s}^{2},0)$ .
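The two sampling steps above can be sketched with simple rejection sampling. This is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, the left-truncated normal is drawn by rejecting negative values, and rounding $n_{p}$ to an integer is an assumption since the text does not say how a real-valued draw becomes a count.

```python
import random


def truncated_normal(mu, sigma, lower=0.0, rng=random):
    """Draw from N(mu, sigma^2) left-truncated at `lower` by rejection."""
    if sigma == 0:
        # Zero variance collapses the distribution onto its mean.
        return max(mu, lower)
    while True:
        x = rng.gauss(mu, sigma)
        if x >= lower:
            return x


def sample_pedestrian_count(mu_p, sigma_p, rng=random):
    """n_p ~ N(mu_p, sigma_p^2, 0); rounding is an assumption of this sketch."""
    return round(truncated_normal(mu_p, sigma_p, 0.0, rng))


def sample_speed(mean_speeds, sigma_s, rng=random):
    """Pick a real pedestrian's mean speed uniformly, then perturb it."""
    s_bar = rng.choice(mean_speeds)
    return truncated_normal(s_bar, sigma_s, 0.0, rng)
```

For example, `sample_pedestrian_count(6.15, 4.46)` uses the ETH values from Table I. Passing `sigma = 0` reproduces sampling "without variance".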

III-C Pedestrian Trajectory Sampling

Based on the number of pedestrians and walking speeds sampled above, we can determine the actual paths of the pedestrians. We generate the pedestrian trajectories by assigning the sampled speeds to these paths.

We represent the real dataset as a collection of trajectories, i.e., $\mathcal{D_{R}}=\{f_{k}\}_{k=1}^{K}$ , where $f_{k}$ is the trajectory for the $k^{th}$ pedestrian. The trajectory $f_{k}=\{\mathbf{x}_{tk}|t=t_{k},...,T_{k}\}$ , for pedestrian $k$ present in the scene from time $t_{k}$ to $T_{k}$ .

For each sampled pedestrian $i$ , we first uniformly sample a trajectory, ${f}^{(i)}$ , from the pool of all real pedestrian paths, denoted as $f^{(i)}\sim\mathcal{U}(\{f_{k}\}_{k=1}^{K})$ . Then, we apply the following three types of perturbations to the $f^{(i)}$ :

• Translation by an amount $\Delta\mathbf{x}\sim\mathcal{U}([-r,r]\times[-r,r])$, where $r$ is the user-defined displacement in each axis of the 2D plane.

• Reversal with probability $p_{r}$: we reverse the start and ending locations as well as all the waypoints in between.

• Truncation by a random number of steps.

Figure 3 illustrates the perturbations. The pedestrian then follows the path $g$, a piecewise linear spline fit [36] to the perturbed $f^{(i)}$. All synthetic pedestrians have a fixed $N+1$ timesteps, and we denote a sampled set of trajectories as the set $\mathcal{X^{S}}=\{\mathbf{x}_{li}|i=1,...,n_{p},l=1,...,N+1\}$, for $N+1$ timesteps and $n_{p}$ pedestrians in the scene.
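The perturbation and path-following steps can be illustrated with a short sketch. Several details here are assumptions, labeled in the comments, because the text leaves them open: truncation drops a random number of leading waypoints, the path is linearly interpolated between waypoints in place of the spline fit of [36], and a pedestrian that reaches the end of its path stays there. The function names `perturb` and `walk` are hypothetical.

```python
import bisect
import math
import random


def perturb(path, r, p_reverse, rng=random):
    """Translate, possibly reverse, and truncate a list of (x, y) waypoints."""
    dx, dy = rng.uniform(-r, r), rng.uniform(-r, r)
    out = [(x + dx, y + dy) for x, y in path]
    if rng.random() < p_reverse:
        out.reverse()
    # Assumed truncation rule: drop a random number of leading waypoints
    # while keeping at least two so a path segment remains.
    start = rng.randint(0, max(len(out) - 2, 0))
    return out[start:]


def walk(path, speed, dt, n_steps):
    """Positions at n_steps + 1 timesteps along a piecewise linear path,
    moving at constant speed. Requires at least two waypoints."""
    cum = [0.0]  # cumulative arc length at each waypoint
    for a, b in zip(path, path[1:]):
        cum.append(cum[-1] + math.dist(a, b))
    positions = []
    for step in range(n_steps + 1):
        # Assumption: the pedestrian stops once the path is exhausted.
        d = min(step * speed * dt, cum[-1])
        k = min(max(bisect.bisect_right(cum, d) - 1, 0), len(path) - 2)
        seg_len = cum[k + 1] - cum[k]
        w = (d - cum[k]) / seg_len if seg_len > 0 else 0.0
        (x0, y0), (x1, y1) = path[k], path[k + 1]
        positions.append((x0 + w * (x1 - x0), y0 + w * (y1 - y0)))
    return positions
```

Calling `walk(perturb(f, r, p_r), s, 0.4, N)` for each of the $n_{p}$ sampled pedestrians would give one set $\mathcal{X^{S}}$ with $N+1$ positions per pedestrian.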

We run Algorithm 1 for a user-defined $M$ number of times to generate the entire large-scale simulated trajectory dataset. In practice we see that $M>20$ produces datasets yielding competitive models, with better performance for larger $M$ .

IV Experiments

To measure the effectiveness of the proposed method, we first generated synthetic datasets from the sampling-based simulation method described above. We employ Social GAN [4], a state-of-the-art deep learning trajectory prediction network architecture, to perform a prediction based on our simulated data. We evaluate our method on two widely used pedestrian trajectory datasets. The ETH dataset [17, 11] contains over 850 labeled frames of data in each of two distinct scenes, ETH and Hotel. The UCY dataset [18] also has two scenes, Zara and University, each with over 1500 labeled frames. Using leave-one-out cross validation, we evaluated the Social GAN’s prediction outputs on each scene, having trained on data from the other three scenes.
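The leave-one-out protocol over the four scenes can be written down in a few lines. This is only an illustrative sketch; the scene labels follow Table I and the helper name is hypothetical.

```python
SCENES = ["ETH", "Hotel", "Zara", "Univ"]


def leave_one_out(scenes):
    """Return (train_scenes, test_scene) pairs, holding out each scene once."""
    return [([s for s in scenes if s != held_out], held_out)
            for held_out in scenes]


splits = leave_one_out(SCENES)
for train, test in splits:
    print(f"train on {train}, test on {test}")
```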

IV-A Baselines

Since the focus of this paper is synthetic trajectory dataset generation, we compare Social GAN prediction results when trained on the following four datasets:

• Real: Train on all the available frames of real data in each scene (approximately 5,000 frames in total).

• Synth-Large: Train on a large synthetic dataset. We sampled simulated trajectories 500 times ($M=500$) using Algorithm 1 with $N=20$ timesteps for each scene. When sampling from the University dataset, we sampled $M=100$ times due to the large number of pedestrians in the scene. This yields over 20,000 simulated, labeled frames in every train-test cross validation split.

• Synth-Equal: Train on a synthetic dataset that has the same size as the real dataset (which is much smaller than Synth-Large). We sampled simulated trajectories such that the number of frames of simulated pedestrian trajectories is equal to that of the real data for each scene.

• Real + Synth-Large: Train on the combined data from Real and Synth-Large. This evaluates the effect of including real data in training.

The Synth-Large dataset has over 15k more frames in each cross validation split than 100% of the real data (“Real-100%”). In addition, we examined the effect of using an even smaller number of labeled frames of real data. We randomly selected 20% of the real data from each scene (around 1.2k frames) and used this to train the Social GAN. We also generated a separate set of synthetic datasets based only on the 20% real data and report those prediction results as well. These results are reported under the “20%” columns in Table II.
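The frame counts quoted for Synth-Large can be sanity-checked with a small calculation. This is a sketch under one strong assumption that the text does not state: that each run of Algorithm 1 contributes $N+1$ labeled frames. The helper name is hypothetical.

```python
# Number of sampler runs M per source scene, as given for Synth-Large.
RUNS = {"ETH": 500, "Hotel": 500, "Zara": 500, "Univ": 100}


def synth_large_frames(test_scene, n_steps=20, runs=RUNS):
    """Frames in one cross validation split, assuming N + 1 frames per run."""
    return sum(m * (n_steps + 1)
               for scene, m in runs.items() if scene != test_scene)


for scene in RUNS:
    print(scene, synth_large_frames(scene))
```

Under this assumption, a split whose training scenes include University has 23,100 frames and the split that tests on University has 31,500, both consistent with "over 20,000".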

IV-B Evaluation Metrics

Since Social GAN makes probabilistic predictions, we treat the predicted position for the $i^{th}$ pedestrian at timestep $t$ as a random variable $\mathbf{y}_{ti}$ . To thoroughly sample the predictive distribution for $\mathbf{y}_{ti}$ , we made 100 probabilistic predictions of the pedestrian’s possible locations. We denote the ground truth position as $\mathbf{x}_{ti}$ . Similar to prior work [2, 4], we used the following error metrics:

IV-B1 Average Displacement Error (ADE)

Expected distance between the ground truth (GT) pedestrian location and the probabilistic prediction. We estimate this for the dataset by averaging across all $N_{p}$ pedestrians in the dataset and all predicted timesteps ( $t=1,...,T$ ) as

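$$\mathrm{ADE}=\frac{1}{N_{p}\times T}\sum_{i=1}^{N_{p}}\sum_{t=1}^{T}\mathbb{E}\left[||\mathbf{y}_{ti}-\mathbf{x}_{ti}||\right],$$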

where $\mathbb{E}[\cdot]$ refers to the expected value.

IV-B2 Minimum Displacement Error (MDE)

Minimum distance between the GT pedestrian location and our predictions, averaged across pedestrians and timesteps. Denoting the $j$ th probabilistic prediction by $\mathbf{y}_{ti}^{(j)}$ , the MDE is given by

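$$\mathrm{MDE}=\frac{1}{N_{p}\times T}\sum_{i=1}^{N_{p}}\sum_{t=1}^{T}\min_{j}\left\{||\mathbf{y}_{ti}^{(j)}-\mathbf{x}_{ti}||\right\}.$$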

IV-B3 Final Displacement Error (FDE)

Expected distance between the GT pedestrian location at the final time step $T$ and the predicted final position. This is averaged across pedestrians, written as

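$$\mathrm{FDE}=\frac{1}{N_{p}}\sum_{i=1}^{N_{p}}\mathbb{E}\left[||\mathbf{y}_{Ti}-\mathbf{x}_{Ti}||\right].$$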

The ADE provides a measure of spread in the predictions, in that a model producing a large spread will necessarily have a large ADE. Unless the model makes predictions near the pedestrian, a small spread in predictions will not ensure a low ADE. The MDE reflects recall in the “best case”, measuring the closest prediction to the pedestrian. The FDE is equivalent to ADE measured only at the final timestep. Ideally, we want the ADE, MDE, and FDE to all have low values to show that all the predictions are close to each pedestrian GT location.
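A minimal sketch of how the three metrics could be computed from sampled predictions follows. The nested-list layout and the function name are assumptions for illustration, and the expectation $\mathbb{E}[\cdot]$ is estimated by the mean over the drawn samples.

```python
import math


def displacement_metrics(preds, truth):
    """preds[i][j][t]: (x, y) of sample j for pedestrian i at timestep t.
    truth[i][t]: ground-truth (x, y). Returns (ADE, MDE, FDE)."""
    n_p, horizon = len(truth), len(truth[0])
    ade = mde = fde = 0.0
    for samples, gt in zip(preds, truth):
        for t in range(horizon):
            dists = [math.dist(s[t], gt[t]) for s in samples]
            expected = sum(dists) / len(dists)  # sample estimate of E[.]
            ade += expected
            mde += min(dists)                   # closest sample at (t, i)
            if t == horizon - 1:
                fde += expected                 # final timestep only
    return ade / (n_p * horizon), mde / (n_p * horizon), fde / n_p


# One pedestrian, two samples, two timesteps.
truth = [[(0.0, 0.0), (0.0, 0.0)]]
preds = [[[(1.0, 0.0), (3.0, 0.0)], [(0.0, 0.0), (0.0, 1.0)]]]
ade, mde, fde = displacement_metrics(preds, truth)
print(ade, mde, fde)  # 1.25 0.5 2.0
```

In this toy case the per-timestep distances are {1, 0} and {3, 1}, so the ADE averages 0.5 and 2.0, the MDE averages 0 and 1, and the FDE is the mean distance at the last timestep.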

IV-C Training Parameters

The Social GAN network architecture is trained with a learning rate of 0.001 and batch sizes of 64 for 200 epochs, following the training procedure in [4] for each experiment. In alignment with this, we use a timestep of $\Delta t=0.4$ s when sampling pedestrian trajectories. Predictions are made by observing pedestrian trajectories for 8 timesteps (3.2 s) and making predictions for the next 8 timesteps ( $T=8$ ).
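To make the observation and prediction horizons concrete, the sketch below cuts one sampled trajectory into 8-step observation and 8-step prediction windows. The stride-one sliding window and the helper name are assumptions of this illustration, not a description of the Social GAN data loader.

```python
import math

OBS_LEN, PRED_LEN, DT = 8, 8, 0.4  # timesteps, timesteps, seconds


def sliding_windows(track, obs_len=OBS_LEN, pred_len=PRED_LEN):
    """Split a track into (observed, future) pairs with a stride of one step."""
    total = obs_len + pred_len
    return [(track[i:i + obs_len], track[i + obs_len:i + total])
            for i in range(len(track) - total + 1)]


# A synthetic pedestrian with N + 1 = 21 positions (N = 20).
track = [(0.5 * t, 0.0) for t in range(21)]
windows = sliding_windows(track)
print(len(windows), OBS_LEN * DT)
```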

For each scene we separately calculate the mean and standard deviation for the number of pedestrians at each timestep ($\mu_{p}$ and $\sigma_{p}$), and the standard deviation about their desired speeds ($\sigma_{s}$). These values are also used when sampling from the smaller amounts of real data, as these can be reliably estimated with much less effort than that needed to build a dataset. The summary statistics calculated for each scene are given in Table I. Note that the University scene (“Univ” row) has a much larger mean and variance in the number of pedestrians than the other scenes.

IV-D Comparison of Prediction Performance

Table II shows the ADE, MDE, and FDE comparison results across all cross validation datasets, predicted using Social GAN trained on data from the various dataset generation methods. Training on a large amount of sampled data in “Synth-Large” achieves the best ADE and FDE on average, producing lower prediction errors than training on 100% real data. Figure 5 shows a qualitative comparison of these two models. This lower error holds for “Synth-Large-20%” as well, where we sampled from only 20% of the real data to make the synthetic dataset. We can attribute part of this high performance to the increased amount of sampled data used for training. “Synth-Large” outperforms “Synth-Equal”, where the only difference is the amount of sampled data (15k more frames in “Synth-Large”). The realistic variations contained in the additional labeled frames allow for learning a better representation of the true distribution of pedestrian trajectories. This performance difference is especially pronounced in the ADE. We observe a similar trend when training on 100% versus 20% of the real data: using more training data improves performance.

When adding the real dataset to the sampled dataset (in “Real+Synth-Large”), we do not observe strictly better performance than training on “Synth-Large” alone. While the ADE has increased slightly when adding the real dataset, the MDE has decreased. The lack of large performance gains makes intuitive sense, since the synthetic data is sampled from the real data and the pedestrian statistics from the real data are largely contained in these stochastically sampled trajectories.

In the next section, we will show that adding real data promotes the expression of uncertainty in the DNN, which aids in lowering the MDE. We also show that the small decreases in MDE depend more on the variations in the sampled data than the amount of data sampled through an ablation study.

IV-E Ablation Study

In our ablation study, we removed the dataset fitting terms $\sigma_{s}$ and $\sigma_{p}$ from the sampler to see their effect on the performance. Upon removing these terms, we sampled from their respective densities without variance, which is equivalent to setting $\sigma_{s}$ or $\sigma_{p}$ to zero.

We sampled large datasets from 100% of the real data following the same procedure as for generating Synth-Large, once with $\sigma_{s}=0$ and again with both $\sigma_{s}=0,\sigma_{p}=0$ . We compared these “reduced models” to the Synth-Large results with full pedestrian statistics. These reduced models both attain an average ADE of 0.47 m and FDE of 0.95 m, compared to 0.48 m and 0.96 m of the full model. On the other hand, both have higher MDE. Removing $\sigma_{s}$ from the sampler increases MDE from 0.20 m to 0.22 m; further removing $\sigma_{p}$ increases this to 0.23 m.

We also report an additional quantile-based metric to evaluate their performances. Recall that for each pedestrian at each predicted timestep, Social GAN produced 100 probabilistic predictions. This quantile-based metric is defined by sorting the predictions by their distance to the ground truth pedestrian locations and calculating the average distance for each quantile. Figure 4 shows the quantile-based distance metric across all training datasets. Ideally, we want the curve to have low distance value across all quantiles. Naturally, the higher the quantile value, the higher the distance (since we sorted the distances in ascending order). The further the curve is shifted towards the top-left, the better the performance. Since we have 100 predictions, the “quantile” here is equivalent to “percentile”.
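One reading of this quantile-based metric can be sketched as follows: for every pedestrian and timestep, sort the sample distances in ascending order, then average the distance at each rank over all pedestrians and timesteps. The data layout and function name are assumptions for illustration.

```python
import math


def quantile_curve(preds, truth):
    """preds[i][j][t]: (x, y) of sample j for pedestrian i at timestep t.
    truth[i][t]: ground-truth (x, y).
    Returns the mean distance at each rank, closest sample first."""
    n_samples = len(preds[0])
    totals = [0.0] * n_samples
    count = 0
    for samples, gt in zip(preds, truth):
        for t in range(len(gt)):
            dists = sorted(math.dist(s[t], gt[t]) for s in samples)
            for rank, d in enumerate(dists):
                totals[rank] += d
            count += 1
    return [total / count for total in totals]


truth = [[(0.0, 0.0), (0.0, 0.0)]]
preds = [[[(1.0, 0.0), (3.0, 0.0)], [(0.0, 0.0), (0.0, 1.0)]]]
curve = quantile_curve(preds, truth)
print(curve)  # [0.5, 2.0]
```

With 100 samples per prediction, the list has 100 entries, one per percentile, and its first entry is the mean minimum distance, which matches the MDE.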

The lowest point on each curve is equivalent to the MDE, since it represents the closest (minimum) distance. The full sampler (with non-zero $\sigma_{s}$ and $\sigma_{p}$ values) performs best below the 50th percentile (quantile less than 0.5 on the plot). The model with no $\sigma_{s}$, on the other hand, outperforms the model with no $\sigma_{s},\sigma_{p}$ at the minimum distance as well as in the middle quantiles. All three synthetic-based models outperform the model trained on real data by a large margin across all percentiles.

The gentler slope for the real model in Figure 4 reflects a greater spread in distances to the ground truth locations as well as uncertainty in the probabilistic predictions. Expressing this uncertainty is important for safety-critical applications such as autonomous driving, since we would like to avoid any predicted location where a pedestrian may potentially be. Training on the sampled data without real-pedestrian statistics (no $\sigma_{s}$ and $\sigma_{p}$) reduces this expression of uncertainty and results in steeper slopes. Including real-pedestrian statistics and calibrating the sampler to the real data before sampling can help recover this uncertainty, as shown in Figure 4 by the less steep slopes for the full model, where $\sigma_{s}$ and $\sigma_{p}$ were added.

V Conclusion

In this work, we presented a novel stochastic sampling method for simulating realistic pedestrian trajectories. We developed a model to extract the number of pedestrians and their walking speeds from a small real dataset, and used this information to sample synthetic pedestrian trajectories. We trained a Social GAN on the sampled datasets and evaluated the prediction results on a variety of benchmark datasets of pedestrian trajectories. We show improved prediction performance when training on large amounts of synthetic data generated by the proposed sampling scheme, compared with training on real datasets. We also performed an ablation study on the effect of the pedestrian statistics and show that our extracted pedestrian parameters can represent how pedestrians walk in real datasets and allow the DNN to more accurately model the true distribution of pedestrian trajectories.

Future directions include extending the sampling method to incorporate scene geometry, and training a DNN that utilizes the scene information such as [23] on the synthetic datasets. Sampling from the space of interactions, such as sampling the outcomes of pedestrian yielding, is another direction.

Bibliography

[1] M. Luber, J. A. Stork, G. D. Tipaldi, and K. O. Arras, “People tracking with human motion predictions from social forces,” in IEEE Int. Conf. Robot. Autom., 2010, pp. 464–469.
[2] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese, “Social LSTM: Human trajectory prediction in crowded spaces,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2016, pp. 961–971.
[3] Y. Luo, P. Cai, A. Bera, D. Hsu, W. S. Lee, and D. Manocha, “PORCA: Modeling and planning for autonomous driving among many pedestrians,” IEEE Robot. Autom. Lett., vol. 3, no. 4, pp. 3418–3425, 2018.
[4] A. Gupta, J. Johnson, L. Fei-Fei, S. Savarese, and A. Alahi, “Social GAN: Socially acceptable trajectories with generative adversarial networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 2255–2264.
[5] D. Ellis, E. Sommerlade, and I. Reid, “Modelling pedestrian trajectory patterns with Gaussian processes,” in IEEE Int. Conf. Comput. Vis. Workshops, 2009, pp. 1229–1234.
[6] K. M. Kitani, B. D. Ziebart, J. A. Bagnell, and M. Hebert, “Activity forecasting,” in Proc. Eur. Conf. Comput. Vis. Springer, 2012, pp. 201–214.
[7] P. Trautman and A. Krause, “Unfreezing the robot: Navigation in dense, interacting crowds,” in IEEE/RSJ Int. Conf. Intell. Robot. Syst., 2010, pp. 797–803.
[8] S. Yi, H. Li, and X. Wang, “Pedestrian behavior understanding and prediction with deep neural networks,” in Eur. Conf. Comput. Vis. Springer, 2016, pp. 263–279.