Traffic Surveillance Camera Calibration by 3D Model Bounding Box Alignment for Accurate Vehicle Speed Measurement

Jakub Sochor; Roman Juránek; Adam Herout

arXiv:1702.06451·cs.CV·June 2, 2017

TL;DR

This paper introduces an automatic traffic camera calibration method using 3D vehicle models and bounding box alignment, significantly improving speed measurement accuracy over previous techniques.

Contribution

The paper presents a novel automatic scene scale inference approach based on 3D model bounding box matching, which is viewpoint-independent and enhances calibration precision.

Findings

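1. Vanishing-point calibration error reduced by 50 % (mean distance ratio error from 0.18 to 0.09) compared to the previous state-of-the-art method.

2. Mean speed measurement error reduced by 86 % relative to the previous automatic calibration method (7.98 km/h to 1.10 km/h) and by 19 % relative to manual calibration (1.35 km/h to 1.10 km/h).

3. Qualitative results on real surveillance cameras in various places and under different lighting conditions (night, dawn, day).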

Tables

Table 1: Errors of distance measurement ratios (see Section 5.1 for details). For each calibration method, the first row contains absolute errors; the second row contains relative errors in percent.

system	error	mean	median	99 %
Edgelets (ours)	absolute	0.09	0.04	0.49
Edgelets (ours)	relative [%]	6.45	3.38	39.08
ITS15	absolute	0.18	0.05	1.36
ITS15	relative [%]	11.74	5.25	61.03
ManualCalib	absolute	0.02	0.01	0.15
ManualCalib	relative [%]	1.80	1.26	10.98

Table 2: Distance measurement errors on the road plane for different calibrations. Only distances towards the first vanishing point (red in Figure 8) were used for this evaluation. For each calibration method, the first row contains absolute errors in meters; the second row contains relative errors in percent.

system	error	mean	median	99 %
Edgelets + BBScale + reg (ours)	absolute [m]	0.26	0.17	1.08
Edgelets + BBScale + reg (ours)	relative [%]	2.33	2.06	5.49
ITS15 + BMVC14	absolute [m]	1.23	0.81	5.40
ITS15 + BMVC14	relative [%]	9.62	10.65	21.07

Edgelets + ManualScale (ours)	absolute [m]	0.10	0.06	0.57
Edgelets + ManualScale (ours)	relative [%]	0.98	0.62	4.46
ITS15 + ManualScale	absolute [m]	0.25	0.14	1.54
ITS15 + ManualScale	relative [%]	2.11	1.66	8.07
ManualCalib + ManualScale	absolute [m]	0.10	0.08	0.32
ManualCalib + ManualScale	relative [%]	1.08	0.65	3.59

Table 3: Distance measurement errors on the road plane for different calibrations. Each segment of the table represents a different level of supervision in the calibration. For each calibration method, the first row contains absolute errors in meters; the second row contains relative errors in percent.

system	error	mean	median	99 %
Edgelets + BBScale + reg (ours)	absolute [m]	0.34	0.18	2.29
Edgelets + BBScale + reg (ours)	relative [%]	3.47	2.28	30.49
ITS15 + BMVC14	absolute [m]	1.17	0.72	5.82
ITS15 + BMVC14	relative [%]	9.79	9.00	55.89

Edgelets + ManualScale (ours)	absolute [m]	0.24	0.10	2.60
Edgelets + ManualScale (ours)	relative [%]	2.66	1.00	34.75
ITS15 + ManualScale	absolute [m]	0.57	0.20	5.43
ITS15 + ManualScale	relative [%]	5.84	2.07	52.19
ManualCalib + ManualScale	absolute [m]	0.07	0.04	0.30
ManualCalib + ManualScale	relative [%]	0.84	0.50	3.47

Table 4: Evaluation of speed measurement errors; all systems differ only in the calibration and scale inference, with the same tracking of vehicles. Each segment represents one level of supervision in the calibration (automatic, known ground truth distances on road, known ground truth speeds). For each calibration method, the first row contains absolute errors in km/h; the second row contains relative errors in percent.

system	error	mean	median	99 %
Edgelets + BBScale + reg (ours)	absolute [km/h]	1.10	0.97	3.05
Edgelets + BBScale + reg (ours)	relative [%]	1.39	1.22	4.13
ITS15 + BMVC14	absolute [km/h]	7.98	8.18	18.58
ITS15 + BMVC14	relative [%]	10.15	11.45	19.22

Edgelets + ManualScale (ours)	absolute [km/h]	1.04	0.83	3.48
Edgelets + ManualScale (ours)	relative [%]	1.31	1.04	4.61
ITS15 + ManualScale	absolute [km/h]	1.44	1.17	5.43
ITS15 + ManualScale	relative [%]	1.76	1.50	6.16
ManualCalib + ManualScale	absolute [km/h]	1.35	0.95	4.84
ManualCalib + ManualScale	relative [%]	1.64	1.18	5.40

Edgelets + SpeedScale (ours)	absolute [km/h]	0.52	0.35	2.57
Edgelets + SpeedScale (ours)	relative [%]	0.66	0.44	3.71
ITS15 + SpeedScale	absolute [km/h]	0.80	0.57	3.70
ITS15 + SpeedScale	relative [%]	0.99	0.72	4.68
ManualCalib + SpeedScale	absolute [km/h]	0.56	0.38	2.73
ManualCalib + SpeedScale	relative [%]	0.71	0.48	3.63

Table 5: Analysis of the influence of different aspects of the 3D car models used. It shows that it is best to use both models. The second segment of the table also shows that it is useful to use scale correction regression as described in Section 4.3. For each 3D model combination, the first row contains absolute errors in km/h; the second row contains relative errors in percent.

system	error	mean	median	99 %
Sedan	absolute [km/h]	2.39	1.74	8.67
Sedan	relative [%]	2.82	2.14	7.74
Combi	absolute [km/h]	2.03	1.72	6.51
Combi	relative [%]	2.48	2.14	5.94
Combi + Sedan	absolute [km/h]	1.38	0.99	5.18
Combi + Sedan	relative [%]	1.70	1.23	4.94

Sedan + reg	absolute [km/h]	2.43	2.49	7.26
Sedan + reg	relative [%]	2.97	3.17	6.56
Combi + reg	absolute [km/h]	1.03	0.82	3.29
Combi + reg	relative [%]	1.33	1.04	4.49
Combi + Sedan + reg	absolute [km/h]	1.10	0.97	3.05
Combi + Sedan + reg	relative [%]	1.39	1.22	4.13

Table 6: Evaluation of differences between the vehicle detection and tracking proposed by Dubská et al. [2014] and our detection and tracking method. FPPM denotes the number of False Positives Per Minute, recall was computed as mean recall across all videos, and speed error denotes mean speed measurement error.

method	FPPM	recall	speed error [km/h]
Dubská et al. [2014]	9.77	0.885	1.46
ours	1.91	0.863	1.21

Equations

Symbols defined by the camera model equations: $f$, $\mathbf{\overline{u}}$, $\mathbf{\overline{v}}$, $\mathbf{\overline{w}}$, $\mathbf{n}$, $\bm{\rho}$, $\mathbf{\overline{p}}$, $\mathbf{P}$.

$$\mathbf{X}_{i}=\left[\begin{array}{cc}w_{1}(m_{1}-x_{i})&w_{1}(n_{1}-y_{i})\\ w_{2}(m_{2}-x_{i})&w_{2}(n_{2}-y_{i})\\ \vdots&\vdots\\ w_{k}(m_{k}-x_{i})&w_{k}(n_{k}-y_{i})\end{array}\right]$$

Decomposition term $\mathbf{W}_{i}\bm{\Sigma}_{i}^{2}\mathbf{W}_{i}^{T}$, with factors $\mathbf{W}_{i}$ and $\bm{\Sigma}_{i}$.

$$\lambda_{ij}=\frac{l_{t_{i}}}{\|\mathbf{F}_{ij}-\mathbf{R}_{ij}\|}$$

$$\lambda^{*}=\arg\max_{\lambda}\;p\left(\lambda\,|\,(\lambda_{ij},m_{ij})\right)$$

$$v=\operatorname*{median}_{i=1\dots N-\tau}\left(\frac{\lambda^{*}_{reg}\,\|\mathbf{P}_{i+\tau}-\mathbf{P}_{i}\|}{t_{i+\tau}-t_{i}}\right)$$

$$v^{*}=\arg\min_{v}\sum_{(p_{1},p_{2},d)\in D_{2}}\left|\lambda\|\mathbf{P}_{1}-\mathbf{P}_{2}\|-d\right|$$

$$\lambda=E\left[\frac{d_{i}}{\|\mathbf{P}_{i,1}-\mathbf{P}_{i,2}\|}\right]$$

$$\lambda=E\left[\frac{\hat{v}_{i}}{v_{i}}\right]$$


Full text

Traffic Surveillance Camera Calibration by 3D Model Bounding Box Alignment for Accurate Vehicle Speed Measurement

Jakub Sochor, Roman Juránek, Adam Herout

Brno University of Technology, Faculty of Information Technology,

Centre of Excellence IT4Innovations, Božetěchova 2, 612 66 Brno, Czech Republic

Abstract

In this paper, we focus on fully automatic traffic surveillance camera calibration, which we use for speed measurement of passing vehicles. We improve over a recent state-of-the-art camera calibration method for traffic surveillance based on two detected vanishing points. More importantly, we propose a novel automatic scene scale inference method. The method is based on matching bounding boxes of rendered 3D models of vehicles with detected bounding boxes in the image. The proposed method can be used from arbitrary viewpoints, since it has no constraints on camera placement. We evaluate our method on the recent comprehensive dataset for speed measurement BrnoCompSpeed. Experiments show that our automatic camera calibration method by detection of two vanishing points reduces error by 50 % (mean distance ratio error reduced from 0.18 to 0.09) compared to the previous state-of-the-art method. We also show that our scene scale inference method is more precise, outperforming both the state-of-the-art automatic calibration method for speed measurement (error reduced by 86 %, from 7.98 km/h to 1.10 km/h) and manual calibration (error reduced by 19 %, from 1.35 km/h to 1.10 km/h). We also present qualitative results of the proposed automatic camera calibration method on video sequences obtained from real surveillance cameras in various places and under different lighting conditions (night, dawn, day).

Keywords: speed measurement, camera calibration, fully automatic, traffic surveillance, bounding box alignment, vanishing point detection

Journal: Computer Vision and Image Understanding

1 Introduction

Surveillance systems pose specific requirements on camera calibration. Their cameras are typically placed in hard-to-reach locations and their optics are focused at longer distances, which makes common pattern-based calibration approaches (such as the classical method of Zhang [2000]) unusable. That is why many solutions place markers in the observed scene and/or measure existing geometric features [Sina et al., 2013, Do et al., 2015, You and Zheng, 2016, Luvizon et al., 2016]. These approaches are laborious and inconvenient both in terms of camera setup (manually clicking on the measured features in the image) and in terms of physically visiting the scene and measuring the distances.

In this paper, we focus on precise and, at the same time, fully automatic traffic surveillance camera calibration, including the scene scale, for speed measurement. The proposed speed measurement method needs to deal with significant viewpoint variation, different zoom factors, various roads, and varying traffic densities. For the method to be applicable to large-scale deployment, it must run fully automatically, without the need to stop traffic for installation or for calibration measurements.

Our solution uses camera calibration obtained from two detected vanishing points and builds on our previous work [Dubská et al., 2014, 2015]. However, this calibration procedure only allows reconstruction of the rotation matrix and the intrinsic parameters from the vanishing points, and it is still necessary to obtain the scene scale. We propose to detect vehicles on the road by Faster-RCNN [Ren et al., 2015], classify them into a few common fine-grained types by a CNN [Krizhevsky et al., 2012], and use bounding boxes of 3D models for the known classes to align the detected vehicles. The vanishing point-based calibration allows for full reconstruction of the viewpoint on the vehicle, and the only free parameter in the alignment is therefore the scene scale. Figure 1 shows an example of the 3D model and the aligned images. Our experiments show that our method (mean speed measurement error 1.10 km/h) significantly outperforms the existing automatic camera calibration method of Dubská et al. [2014] (mean error 7.98 km/h, i.e. an error reduction of 86 %) and also calibration obtained from manual measurements on the road (mean error 1.35 km/h, i.e. an error reduction of 19 %). This is important because in previous approaches, automation always compromised accuracy, forcing a trade-off on the system developer. Our work shows that fully automatic calibration methods may produce better results than manual calibration (which was performed thoroughly and according to state-of-the-art approaches).

Existing solutions for traffic surveillance camera calibration [Dailey et al., 2000, Schoepflin and Dailey, 2003, Cathey and Dailey, 2005, Grammatikopoulos et al., 2005, He and Yung, 2007b, Maduro et al., 2008, Sina et al., 2013, Nurhadiyatna et al., 2013, Dubská et al., 2014, Lan et al., 2014, Luvizon et al., 2014, Dubská et al., 2015, Do et al., 2015, Luvizon et al., 2016, You and Zheng, 2016] (see Section 2 for detailed analysis) usually have limitations for real world applications. They are either limited to some viewpoints (zero pan, second vanishing point at infinity), or they require some per-installed-camera manual work. To our knowledge, there is only one work [Dubská et al., 2014] which does not have these limitations, and therefore we compare our results with this solution. For a brief description of the method, see Section 2; a more comprehensive review can be found in a recent dataset paper BrnoCompSpeed by Sochor et al. [2016b].

The key contributions of this paper are:

1. An improved camera calibration method by detection of two vanishing points. The camera calibration error is reduced by 50 % (mean distance ratio error from 0.18 to 0.09).

2. A novel method for scene scale inference, which significantly outperforms automatic traffic camera calibration methods (error reduced by 86 %, from 7.98 km/h to 1.10 km/h) and also manual calibration (error reduced by 19 %, from 1.35 km/h to 1.10 km/h) in automatic speed measurement from a monocular camera.

3. Results show that when used for the speed measurement task, the automatic (zero human input) method can perform better than the laborious manual calibration, which is generally considered accurate and treated as the ground truth. This finding can be important also in other fields beyond traffic surveillance.

2 Related Work

The camera calibration algorithm (obtaining intrinsic and extrinsic parameters of the surveillance camera) is critical for vehicle speed measurement by a single monocular camera, as it directly influences the measurement accuracy. A very recent comprehensive review of traffic surveillance calibration methods is available [Sochor et al., 2016b]; we refer to it for details and include only a brief description of the methods here.

Several methods [He and Yung, 2007b, Cathey and Dailey, 2005, Grammatikopoulos et al., 2005] are based on the detection of vanishing points as an intersection of road markings (lane dividing lines). Other methods [Dubská et al., 2014, 2015, Schoepflin and Dailey, 2003, Dailey et al., 2000] use vehicle motion to calibrate the camera. There is also a set of methods which use some form of manually measured dimensions on the road plane [Maduro et al., 2008, Nurhadiyatna et al., 2013, Sina et al., 2013, Luvizon et al., 2014, 2016, Do et al., 2015, Lan et al., 2014].

An important attribute of calibration methods is whether they are able to work automatically without any manual per-camera calibration input. Only two methods [Dailey et al., 2000, Dubská et al., 2014] are fully automatic, and both of them use mean vehicle dimensions for camera calibration. Another important requirement for real-world deployment is whether the camera can be placed in an arbitrary position above the road, which is not true for some methods, as they assume zero pan or impose other constraints.

Regarding fine-grained vehicle classification, there are several approaches. The first one is based on detected parts of vehicles [Krause et al., 2015, Simon and Rodner, 2015, Fang et al., 2016], another approach is based on bilinear pooling [Lin et al., 2015, Gao et al., 2016]. There is also an approach based on Convolutional Neural Networks (CNN) and input modification [Sochor et al., 2016a]. For object detection, it is possible to use boosted cascades [Dollár et al., 2014], HOG detectors [Dalal and Triggs, 2005], or Deformable Parts Models (DPMs) [Felzenszwalb et al., 2010]. There are also recent advances in object detection based on CNNs [Girshick et al., 2014, Ren et al., 2015, Liu et al., 2016].

Several authors deal with alignment of 3D models and vehicles and use this technique for gathering data in the context of traffic surveillance. Lin et al. [2014] propose to jointly optimize 3D model fitting and fine-grained classification, and Hsiao et al. [2014] align edges formulated as an Active Shape Model [Cootes et al., 1995, Li et al., 2009]. Krause et al. [2013] propose the use of synthetic data to train geometry and viewpoint classifiers for 3D model and 2D image alignment. Prokaj and Medioni [2009] use detected SIFT features [Lowe, 1999] to align 3D vehicle models and the vehicle's observation. They use the alignment mainly to overcome vehicle appearance variation under different viewpoints. However, in our case, as the precise viewpoint on the vehicle is known (Section 4.3), such alignment does not have to be performed. Hence, we adopt a simpler and more efficient method based on 2D bounding boxes, which simplifies the procedure considerably without sacrificing accuracy.

When it comes to camera calibration in general, various approaches exist. The widely used method by Zhang [2000] uses a calibration checkerboard to obtain intrinsic and extrinsic camera parameters (relative to the checkerboard). Liu et al. [2012] use controlled panning or tilting with stereo matching to calibrate the camera. Correspondences of lines and points are used by Chaperon et al. [2011]. Yu et al. [2009] focus on automatic camera calibration for tennis videos from detected tennis court lines.

3 Traffic Camera Model

The main goal of camera calibration in the application of speed measurement is to be able to measure distances on the road plane between two arbitrary points in meters (or other distance units). Therefore, we focus only on a camera model which enables the measurement of the distance between two points on the road plane.

For convenience and better comparison of the methods, we adopt the traffic camera model and notation proposed in previous papers [Dubská et al., 2014, 2015]; however, to make the paper self-contained, we briefly describe the model and notation. For the intrinsic parameters of our camera model, we assume zero pixel skew and the principal point $\mathbf{c}$ at the center of the image. The method also assumes the road section to be flat and straight; the experiments reported in the previous work, as well as our own experiments, show that this requirement is not very strict, because most roads that are not sharply curved locally meet this assumption for practical purposes.

Homogeneous 2D image coordinates are denoted by bold lowercase letters, $\mathbf{p}=[p_{x},p_{y},1]^{T}$. The corresponding points on the image plane in 3D, $\mathbf{\overline{p}}=[p_{x},p_{y},f]^{T}$, where $f$ is the focal length, are denoted by bold lowercase letters with an overline. Finally, other 3D points (on the road plane) are denoted by bold capital letters, $\mathbf{P}=[P_{x},P_{y},P_{z}]^{T}$.

Figure 2 shows the camera model and its notation. For convenience, we assume that the origin of the image coordinate system is at the center of the image; therefore, the principal point $\mathbf{c}$ has 2D homogeneous coordinates $[0,0,1]^{T}$ (the 3D coordinates of the center of camera projection are $[0,0,0]^{T}$). As shown in the figure, the road plane is denoted by $\bm{\rho}$. We encode the vanishing points in the following way. The first one (in the direction of vehicle flow) is referenced as $\mathbf{u}$; the second vanishing point (whose direction is perpendicular to the first one and parallel to the road plane) is denoted by $\mathbf{v}$; and the third one (direction perpendicular to the road plane) is $\mathbf{w}$.

Using the first two vanishing points $\mathbf{u}$, $\mathbf{v}$ and the principal point $\mathbf{c}$, it is possible to compute the focal length $f$, the third vanishing point $\mathbf{w}$, the unit normal vector $\mathbf{n}$ of the road plane, and the road plane $\bm{\rho}$ itself. However, the road plane is computed only up to scale (as it is not possible to recover the distance to the road plane from the vanishing points alone), and therefore we add an arbitrary value $\delta=1$ as the constant term in Equation (6).

[TABLE]

With known road plane $\bm{\rho}$, it is possible to compute 3D coordinates $\mathbf{P}=[P_{x},P_{y},P_{z}]^{T}$ of an arbitrary point $\mathbf{p}=[p_{x},p_{y},1]^{T}$ by projecting it onto the road plane using the following equations:

[TABLE]

It is possible to measure distances on the road plane directly with 3D coordinates $\mathbf{P}$ ; however, as the road plane is shifted to a predefined distance by a constant term, the distance $\|\mathbf{P}_{1}-\mathbf{P}_{2}\|$ between points $\mathbf{P_{1}}$ and $\mathbf{P_{2}}$ is not directly expressed in meters (or other real-world units of distance). Therefore, it is necessary to introduce another calibration parameter, referred to as the scene scale $\lambda$ , which converts the distance $\|\mathbf{P}_{1}-\mathbf{P}_{2}\|$ from pseudo-units on the road plane to meters by scaling the distance to $\lambda\|\mathbf{P}_{1}-\mathbf{P}_{2}\|$ .
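The camera model of this section can be sketched in a few lines of Python. This is a minimal illustration, not a transcription of the paper's numbered equations: it assumes the standard relations for two orthogonal vanishing points given relative to the principal point ($f^{2}=-\mathbf{u}\cdot\mathbf{v}$, $\mathbf{\overline{w}}=\mathbf{\overline{u}}\times\mathbf{\overline{v}}$, $\bm{\rho}=[\mathbf{n}^{T},\delta]^{T}$) and intersects the viewing ray through $\mathbf{\overline{p}}$ with the plane. The function names are hypothetical.

```python
import numpy as np

def road_plane_from_vps(u, v, delta=1.0):
    """Focal length, third vanishing direction, unit normal and road plane
    from two orthogonal vanishing points (coordinates relative to the
    principal point). Assumed relation: u_bar . v_bar = 0."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    f_sq = -np.dot(u, v)
    if f_sq <= 0:
        raise ValueError("vanishing points give an imaginary focal length")
    f = np.sqrt(f_sq)
    u_bar = np.array([u[0], u[1], f])
    v_bar = np.array([v[0], v[1], f])
    w_bar = np.cross(u_bar, v_bar)        # direction perpendicular to the road
    n = w_bar / np.linalg.norm(w_bar)     # unit normal of the road plane
    rho = np.append(n, delta)             # plane n . X + delta = 0 (up to scale)
    return f, w_bar, n, rho

def project_to_road(p, f, rho):
    """Intersect the ray through image point p = (p_x, p_y) with the plane rho."""
    p_bar = np.array([p[0], p[1], f], dtype=float)
    t = -rho[3] / np.dot(rho[:3], p_bar)
    return t * p_bar

def road_distance(p1, p2, f, rho, scale):
    """Distance between two image points on the road plane, scaled by lambda."""
    diff = project_to_road(p1, f, rho) - project_to_road(p2, f, rho)
    return scale * np.linalg.norm(diff)
```

With `scale` set to the scene scale $\lambda$, `road_distance` returns $\lambda\|\mathbf{P}_{1}-\mathbf{P}_{2}\|$; with `scale=1.0` the result is in the pseudo-units of the road plane.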

Under the assumptions of zero pixel skew and a principal point at the center of the image, the calibration method has to compute two vanishing points ($\mathbf{u}$ and $\mathbf{v}$ in our case) together with the scene scale $\lambda$, yielding 5 degrees of freedom (two image coordinates for each vanishing point plus the scale). Methods to convert these camera parameters to the standard intrinsic and extrinsic camera model $\mathbf{K}\,[\mathbf{R}\;\mathbf{T}]$ have been discussed in several papers [Zhang et al., 2013, Fung et al., 2003, Zheng and Peng, 2014]; we refer the reader to them.

4 Camera Calibration and Vehicle Tracking

We adopted the calibration method of Dubská et al. [2014], which gives the image coordinates of the vanishing points and scene scale information. We improved the method with more precise detection of the vanishing points, and we infer the scene scale by using 3D models of frequently passing cars.

Our method measures the speed of passing cars detected by Faster-RCNN [Ren et al., 2015] and tracked by a combination of background subtraction and a Kalman filter [Kalman, 1960] assisted by the detector. This method, more sophisticated than the previous method [Dubská et al., 2014], gives fewer false positives and a comparable recall rate. In the case of very dense flow, when vehicles overlap each other in the camera image (which occurs only rarely in real conditions), our method would miss some of the cars, as we target free-flow conditions. In the following text, we describe the components of the method in detail, and evaluate it in Section 5.

4.1 Vanishing Point Estimation from Edgelets

We adopted the algorithm proposed by Dubská et al. [2015] (based on the detection of two orthogonal vanishing points) for the detection of the first vanishing point and propose to use a similar algorithm for detecting the second vanishing point. However, we improved the detection of the second vanishing point by using edgelets instead of image gradients used in the previous paper [Dubská et al., 2015]. This change, although subtle, improves the calibration and speed measurement considerably, as the results in Section 5.3 show.

We start with the detection of vanishing points from which the camera rotation with respect to the road can be estimated. The first vanishing point $\mathbf{u}$ is estimated from the movement of the vehicles by a form of cascaded Hough Transform [Dubská et al., 2015] of lines formed by tracking points of interest on the moving vehicles. This is a more stable approach than finding the closest point to the lines in an algebraic way, because it is more robust to tracking noise and it is not influenced by vehicles that change lane (and therefore, the vanishing point of their movement is different from the rest of the vehicles). Similarly to Dubská et al. [2015], we use the Min-eigenvalue point detector [Shi and Tomasi, 1994] and the KLT tracker [Tomasi and Kanade, 1991].

To detect the second vanishing point $\mathbf{v}$, we use edges on passing vehicles, as many lines formed by these edges pass through $\mathbf{v}$. This step relies heavily on the correct estimation of the orientation of the edges. The angle can be easily computed from gradients, but angles close to $k\pi/2$ are almost impossible to recover accurately on small neighborhoods. We therefore estimate edge orientation from a larger neighborhood by analysis of the shape of the image gradient magnitude (edgelets). The detection process is shown in Figure 3.

Edgelets are detected by the following algorithm. Given an image $\mathbf{I}$ , first, we find seed points $\mathbf{s}_{i}$ as local maxima of gradient magnitude of the image $\mathbf{E}=\|\nabla{\mathbf{I}}\|$ , keeping only the strong ones with magnitudes above a threshold. From the $9\times{}9$ neighborhood of each seed point $\mathbf{s}_{i}=[x_{i},y_{i},1]^{T}$ , matrix $\mathbf{X}_{i}$ is formed:

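$$\mathbf{X}_{i}=\left[\begin{array}{cc}w_{1}(m_{1}-x_{i})&w_{1}(n_{1}-y_{i})\\ w_{2}(m_{2}-x_{i})&w_{2}(n_{2}-y_{i})\\ \vdots&\vdots\\ w_{k}(m_{k}-x_{i})&w_{k}(n_{k}-y_{i})\end{array}\right]$$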

where $[m_{k},n_{k},1]^{T}$ are coordinates of the neighboring pixels ($k=1\dots 81$) and $w_{k}$ is their gradient magnitude from $\mathbf{E}$; for a $9\times 9$ neighborhood, $\mathbf{X}_{i}$ therefore has size $81\times 2$. Then, singular vectors and values of $\mathbf{X}_{i}$ can be computed as:

[TABLE]

Vectors $\mathbf{a}_{1}$ and $\mathbf{a}_{2}$ are the right singular vectors of $\mathbf{X}_{i}$ (equivalently, the eigenvectors of $\mathbf{X}_{i}^{T}\mathbf{X}_{i}$), while $\lambda_{1}$ and $\lambda_{2}$ denote the corresponding singular values. Edge orientation is then the first singular column vector $\mathbf{d}_{i}=\mathbf{a}_{1}$ from (15) and the edge quality is the ratio of singular values $q_{i}=\frac{\lambda_{1}}{\lambda_{2}}$ from (18). Each edgelet is then represented as a triplet $\mathcal{E}_{i}=\left(\mathbf{s}_{i},\mathbf{d}_{i},q_{i}\right)$.
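The orientation and quality of a single edgelet can be sketched with NumPy as follows. This is an illustrative sketch, assuming the gradient magnitude patch around the seed point is already available as a square array; the function name and the small constant `eps` (added to avoid division by zero for perfectly straight edges) are assumptions, not part of the described method.

```python
import numpy as np

def edgelet_from_patch(grad_mag_patch, eps=1e-12):
    """Orientation d_i and quality q_i of an edgelet from the gradient
    magnitude patch (e.g. 9x9) centered on the seed point."""
    patch = np.asarray(grad_mag_patch, dtype=float)
    h, w = patch.shape
    ys, xs = np.mgrid[0:h, 0:w]
    dx = (xs - w // 2).ravel()            # m_k - x_i
    dy = (ys - h // 2).ravel()            # n_k - y_i
    wk = patch.ravel()                    # gradient magnitudes w_k
    X = np.column_stack((wk * dx, wk * dy))   # 81 x 2 for a 9x9 patch
    _, s, vt = np.linalg.svd(X, full_matrices=False)
    direction = vt[0]                     # first right singular vector
    quality = s[0] / (s[1] + eps)         # ratio of singular values
    return direction, quality
```

The sign of `direction` is arbitrary, as with any singular vector; only the line it spans is meaningful for the accumulation step.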

We gather the edgelets from the input video (see Figure 4), keeping only the strong ones which do not coincide with the already estimated $\mathbf{u}$ , and accumulate them to the Diamond Space accumulator [Dubská and Herout, 2013]. The position of the global maximum in the accumulator is taken as the second vanishing point $\mathbf{v}$ . It should be noted that in this step, additional filtering can be applied – e.g. masking the Diamond Space to find only plausible solutions (i.e. avoid imaginary focal length from Equation (1)), or to find solutions within a certain range of focal lengths or horizon inclinations (when known in advance). This may improve the robustness of the second vanishing point estimation.

4.2 Vehicle Detection and Tracking

During speed measurement, passing cars are detected in each frame by the Faster-RCNN (FRCN) detector [Ren et al., 2015], although any other detector can be used as well (e.g. ACF, LDCF [Dollár et al., 2014]). We trained the detector on the COD20K dataset [Juránek et al., 2015], which contains approximately 20 k car instances for training, taken from views of a surveillance nature. The detection rate of the detector is 96 % with 0.02 false positive detections per image on the test part of the COD20K dataset. The detector yields only coarse information about the locations of cars in the image (bounding boxes are not precisely aligned). We use a simple heuristic to remove detections that would lead to imprecise tracking and ultimately to wrong speed estimation: those that are slightly occluded by other detections and that are farther from the camera. Therefore, we track only cars that are fully visible.

For the tracking itself, we use a simple background model that builds a background reference image by moving average. In the foreground image, compact blobs are detected and the FRCN detections are used to group those blobs that correspond to one car. From each group of blobs, the convex hull and its 2D bounding box are extracted. Finally, we track the 2D bounding box of the convex hull using a Kalman filter to get the movement of the car. For an example, see Figure 5.
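The moving-average background model can be illustrated as follows. This is a minimal sketch on grayscale frames stored as floating-point arrays; the learning rate `alpha` and the difference threshold are assumed values chosen for illustration, and blob grouping and Kalman filtering are left out.

```python
import numpy as np

def update_background(background, frame, alpha=0.05):
    """Exponential moving average of the frames (alpha is an assumed rate)."""
    return (1.0 - alpha) * background + alpha * frame

def foreground_mask(background, frame, threshold=25.0):
    """Pixels that differ from the background by more than the threshold."""
    return np.abs(frame - background) > threshold
```

In a processing loop, `foreground_mask` would be evaluated for each new frame before `update_background` is called, so that a passing car is segmented against a reference image it has not yet influenced.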

For each tracked car, we extract a reference point for speed measurement. The convex hull is used to construct the 3D bounding box [Dubská et al., 2014], and we take the center of the bottom-front edge as the reference point, which is located in the ground/road plane. Each track is represented by a sequence of bounding boxes and reference points, both constructed from the convex hull. Our method inherits all the advantages and limitations of similar approaches based on the extraction of the vehicle's foreground mask. We rely on the extractor to do its job properly, and we can take advantage of works dealing with issues such as lighting and weather (for example, contour extractors such as Yang et al. [2016], or semantic segmentation methods such as Long et al. [2015]). In Section 5.6, we show a number of examples of real-world surveillance cameras under bad conditions, where the calibration algorithm nonetheless works well.

4.3 Scale Inference using 3D Model Bounding Box Alignment

The previous state-of-the-art automatic method for scale inference in traffic surveillance by Dubská et al. [2014] used three-dimensional bounding boxes built around the vehicle and mean dimensions of vehicles to compute the scale. This approach has two main drawbacks. The obvious one is the use of mean vehicle dimensions. The more important one is less obvious: the constructed bounding box is too tight around the vehicle, and the tightness is largely influenced by the particular viewpoint direction. This causes systematic errors in the calibration depending on the camera location with respect to the road, leading to high sensitivity to viewpoint change.

We propose a different approach to scale inference, which overcomes the mentioned imprecisions. We use fine-grained types of vehicles (i.e. make, model, variant, model year). For a few common types (two in our experiments), we obtained 3D models, which are rendered into the image and aligned to the real observed vehicles in order to obtain the proper scale.

As it is necessary to know the precise vehicle classes (up to model year) for our scale inference method, we used the BoxCars dataset [Sochor et al., 2016a] and we also collected some other training data from videos related to papers by Dubská et al. [2014, 2015]. Vehicles are classified only into a few of the most common fine-grained vehicle types on roads in the area, plus one class for all other vehicles. The full training dataset contained $\sim$23 k tracks and $\sim$92 k images of vehicles. We used a CNN [Krizhevsky et al., 2012] for the classification itself. The classification accuracy on the validation set ($\sim$7 k images) was $0.97$. As the CNN classifies only single instances of vehicles, we use the mean probability over all of the detections belonging to one vehicle track to improve the recognition rates.
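The track-level decision can be sketched as averaging the per-detection class probabilities and taking the most probable class. The function name and the array layout (one row per detection, one column per class) are assumptions made for this illustration.

```python
import numpy as np

def classify_track(per_detection_probs):
    """Mean class probability over all detections of one track, and the
    index of the most probable class."""
    probs = np.asarray(per_detection_probs, dtype=float)
    mean_probs = probs.mean(axis=0)
    return int(np.argmax(mean_probs)), mean_probs
```

A single misclassified detection, such as the first row in `[[0.6, 0.4], [0.2, 0.8], [0.3, 0.7]]`, is outvoted by the remaining detections of the same track.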

For each vehicle, we also build a 3D bounding box around it [Dubská et al., 2014] to obtain the center $\mathbf{b}$ of the vehicle's base in image coordinates. To obtain the viewpoint vector $\bm{\phi}$, we first compute the rotation matrix $\mathbf{R}$, which has columns equal to the normalized $\mathbf{\overline{u}}$, $\mathbf{\overline{v}}$, and $\mathbf{\overline{w}}$. It is then possible to compute the 3D viewpoint vector as $\bm{\phi}=-\mathbf{R}^{T}\mathbf{\overline{b}}$. The minus sign is necessary as we need the viewpoint vector going from the vehicle to the camera, not the opposite one.
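The viewpoint vector computation can be sketched as below. The sketch assumes vanishing points and the base center given relative to the principal point, and it assumes the standard focal length relation $f^{2}=-\mathbf{u}\cdot\mathbf{v}$ for orthogonal vanishing points; the function name is hypothetical.

```python
import numpy as np

def viewpoint_vector(u, v, b):
    """phi = -R^T b_bar, where the columns of R are the normalized
    u_bar, v_bar and w_bar = u_bar x v_bar."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    f = np.sqrt(-np.dot(u, v))            # assumed relation for orthogonal VPs
    u_bar = np.array([u[0], u[1], f])
    v_bar = np.array([v[0], v[1], f])
    w_bar = np.cross(u_bar, v_bar)
    R = np.column_stack([x / np.linalg.norm(x) for x in (u_bar, v_bar, w_bar)])
    b_bar = np.array([b[0], b[1], f], dtype=float)
    return -R.T @ b_bar                   # points from the vehicle to the camera
```

Because $\mathbf{R}$ is orthonormal, the result has the same length as $\mathbf{\overline{b}}$ and only its direction matters for rendering.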

Once the viewpoint vector to the vehicle, the vehicle’s class, and its position on the screen are determined, we render the appropriate 3D model given the parameters. Examples of the two used 3D models are shown in Figure 6. The only open variable is the scale of the vehicle to be rendered (i.e. the distance between the vehicle and the camera). Therefore, we render images of the vehicle in multiple different scales and match the bounding boxes of the rendered vehicles with the bounding box detected in the video by using the Intersection-over-Union (IoU) metric. Examples of such matches can be found in Figure 7. The figure also shows two interesting points related to the vehicle in red: points on the base of the 3D models representing the front $\mathbf{f}$ and rear $\mathbf{r}$ of the vehicle. Finally, for all vehicle instances $i$ and scales $j$ , these points are projected on the road plane, yielding $\mathbf{F}_{ij}$ and $\mathbf{R}_{ij}$ . They are used to compute the scale $\lambda_{ij}$ (Eq. (19), where $l_{t_{i}}$ is the real-world length of the type $t_{i}$ ). For all considered combinations of $i$ and $j$ , the IoU matching metric $m_{ij}$ is computed.

$$\lambda_{ij}=\frac{l_{t_{i}}}{\left\lVert\mathbf{F}_{ij}-\mathbf{R}_{ij}\right\rVert}\qquad(19)$$
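The matching step can be illustrated with a short sketch. It assumes axis-aligned boxes given as `(x_min, y_min, x_max, y_max)` and takes the scale as the ratio of the real-world vehicle length to the distance between the projected front and rear points, which is how the surrounding text describes it. The function names and the `rendered` data layout are hypothetical, and the rendering itself is not shown.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-Union of two boxes (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def scale_from_endpoints(front, rear, real_length):
    """Scale as real-world length divided by the front-to-rear distance on the road plane."""
    diff = np.asarray(front, dtype=float) - np.asarray(rear, dtype=float)
    return real_length / np.linalg.norm(diff)

def match_scales(detected_box, rendered, real_length):
    """Return (scale, IoU) pairs for one vehicle.

    rendered is a list of (box, front, rear) tuples, one per rendered scale.
    """
    return [(scale_from_endpoints(f, r, real_length), iou(detected_box, box))
            for box, f, r in rendered]

example_iou = iou((0, 0, 2, 2), (1, 1, 3, 3))
example_scale = scale_from_endpoints([0, 0, 0], [3, 4, 0], 4.5)
```

Only the bounding boxes and the two projected points per rendered scale are needed, which matches the storage argument made later in this section.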

To obtain the final camera’s scale $\lambda^{*}$ , all the scales $\lambda_{ij}$ are taken into account together with metrics $m_{ij}$ . We consider only cases with $m_{ij}$ larger than a predefined threshold (we used 0.85 in our experiments) to eliminate poor matches. Finally, we compute $\lambda^{*}$ according to Equation (20). The probability $p\left(\lambda\,|\,(\lambda_{ij},m_{ij})\right)$ is computed by kernel density estimation with a discretized space:

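$$\lambda^{*}=\operatorname*{arg\,max}_{\lambda}\;p\left(\lambda\,|\,(\lambda_{ij},m_{ij})\right)\qquad(20)$$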
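A minimal sketch of this aggregation step is given below. The IoU threshold of 0.85 comes from the text; the Gaussian kernel, the bandwidth heuristic, the grid size, and the use of $m_{ij}$ as kernel weights are assumptions made purely for illustration, since the exact density model is not spelled out here.

```python
import numpy as np

def estimate_scale(scales, ious, iou_threshold=0.85, bandwidth=None, grid_size=2000):
    """Pick the scale with the highest estimated density on a discretized grid."""
    scales = np.asarray(scales, dtype=float)
    ious = np.asarray(ious, dtype=float)
    keep = ious > iou_threshold              # discard poor matches
    s, w = scales[keep], ious[keep]
    if s.size == 0:
        raise ValueError("no match exceeds the IoU threshold")
    if bandwidth is None:
        bandwidth = 0.01 * np.median(s)      # assumed heuristic, not from the paper
    grid = np.linspace(s.min(), s.max(), grid_size)
    # Weighted sum of Gaussian kernels evaluated at every grid position.
    z = (grid[:, None] - s[None, :]) / bandwidth
    density = (w[None, :] * np.exp(-0.5 * z ** 2)).sum(axis=1)
    return float(grid[np.argmax(density)])

# Hypothetical data: a cluster around 1.0, one outlier, and one poor match.
lam = estimate_scale([1.0, 1.01, 0.99, 1.0, 2.0, 0.5],
                     [0.90, 0.90, 0.90, 0.95, 0.90, 0.50])
```

Taking the mode of the density rather than the mean keeps isolated outliers, such as the value 2.0 above, from shifting the estimate.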

In order to further improve the scale inference, we use several training videos from the BrnoCompSpeed dataset [Sochor et al., 2016b]. We train the scale-correcting linear regression $\lambda^{*}_{reg}=\alpha\lambda^{*}+\beta$ , using manually obtained scales as the ground truth. Even though this step is not necessary, it improves the scale acquisition further by correcting the imprecise geometry of the obtained 3D models.
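The regression $\lambda^{*}_{reg}=\alpha\lambda^{*}+\beta$ can be fitted by ordinary least squares, for instance with `numpy.polyfit`. The sketch below uses synthetic numbers that are not taken from the dataset, and the function names are hypothetical.

```python
import numpy as np

def fit_scale_correction(estimated_scales, manual_scales):
    """Fit alpha and beta so that alpha * estimated + beta approximates the manual scale."""
    alpha, beta = np.polyfit(np.asarray(estimated_scales, dtype=float),
                             np.asarray(manual_scales, dtype=float), deg=1)
    return float(alpha), float(beta)

def correct_scale(scale, alpha, beta):
    """Apply the fitted linear correction to a newly estimated scale."""
    return alpha * scale + beta

# Synthetic training pairs, one per training video (illustrative values only).
estimated = np.array([0.010, 0.012, 0.015, 0.020])
manual = 1.1 * estimated + 0.001
alpha, beta = fit_scale_correction(estimated, manual)
corrected = correct_scale(0.018, alpha, beta)
```

Because only two parameters are fitted, a handful of training videos is enough to determine them.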

We also experimented with an alignment metric based on matching of edges on the rendered and detected vehicles (based on distance transform). However, the speed measurement did not improve further. The biggest problem with this method is that most of the edges on vehicles are blurry and therefore not detected at all. In contrast, the vehicle detector [Ren et al., 2015] is able to detect the vehicles properly and, in most cases, accurately. Also, the proposed algorithm using just the bounding boxes is much more efficient in terms of storage (it is possible to store just the bounding boxes, not the images) and computation.

4.4 Speed Measurement of Tracked Cars

The speed measurement itself follows the methodology proposed by Sochor et al. [2016b]. Given a tracked car with reference points $\mathbf{p}_{i}$ and timestamps $t_{i}$ for each reference point, where $i=1\dots N$ , the speed $v$ is calculated from Equation (21) by projecting the reference points $\mathbf{p}_{i}$ to points $\mathbf{P}_{i}$ on the ground plane (see Equation (8)).

[TABLE]

The speed is computed as the median value of speeds between consecutive time positions. However, for stability of the measurement, it is better not to use the next frame, but the time position several video frames apart. This is controlled by the constant $\tau$ , and for all our experiments, we use $\tau=5$ (the time difference is usually $0.2\,\mathrm{s}$ ).
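The median-based speed computation can be sketched as follows. The sketch assumes that the ground-plane points multiplied by the scale are in metres, that timestamps are in seconds, and that the scale enters as a simple multiplicative factor on the distance; the function name and the synthetic track are hypothetical.

```python
import numpy as np

def measure_speed_kmh(points, timestamps, scale=1.0, tau=5):
    """Median speed over pairs of track positions that are tau frames apart."""
    P = np.asarray(points, dtype=float)
    t = np.asarray(timestamps, dtype=float)
    if len(P) <= tau:
        raise ValueError("track is shorter than tau + 1 positions")
    dist = scale * np.linalg.norm(P[tau:] - P[:-tau], axis=1)  # metres
    dt = t[tau:] - t[:-tau]                                    # seconds
    return float(np.median(dist / dt)) * 3.6                   # m/s to km/h

# Synthetic track: 1 m per frame at 25 frames per second.
frames = np.arange(20)
points = np.column_stack([frames * 1.0, np.zeros(20), np.zeros(20)])
speed = measure_speed_kmh(points, frames / 25.0)
```

At 25 frames per second, $\tau=5$ corresponds to the $0.2\,\mathrm{s}$ time difference mentioned above, and the synthetic track moves at 25 m/s, i.e. 90 km/h.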

5 Experiments and Results

To evaluate our proposed methods for camera calibration and scene scale inference, we use the very recent BrnoCompSpeed dataset [Sochor et al., 2016b] which contains over 20 k vehicles with precise ground truth speed from multiple locations. The dataset also contains markers on the road with known dimensions between them. For an example of such road markers, see Figure 8. The ground truth distances can be used for either calibration or evaluation of distance measurements on the road plane. It is also possible to evaluate the accuracy of vanishing point estimation by using the markings [Sochor et al., 2016b]. In the following text we will refer to various methods for camera calibration which are defined as:

1. ITS15 – Automatic camera calibration method as described by Dubská et al. [2015]. A brief outline of the method is in Sections 2 and 4.1.

2. Edgelets – Camera calibration method proposed in this paper, Section 4.1.

3. ManualCalib – We use known distances (Figure 8) on the road for manual calibration of the camera. In agreement with the previous papers [Cathey and Dailey, 2005, Grammatikopoulos et al., 2005, He and Yung, 2007a], we use the intersection of lane dividing lines (blue dashed lines in Figure 8) for estimation of the first vanishing point $\mathbf{u}$ . As there are usually more than just two lane dividing lines, we use least squares minimization to obtain the intersection of multiple lines. Formally, given lines $\mathbf{l}_{i}$ with normalized normal vectors, we compute the vanishing point $\mathbf{u}$ by solving $\mathbf{A}\mathbf{u}=-\mathbf{b}$ in a least squares manner, where rows of $\mathbf{A}$ contain transposed normal vectors of the lines, and rows of $\mathbf{b}$ contain constant terms of the lines.

The second vanishing point $\mathbf{v}$ can be obtained in the same manner (as the intersection of yellow dashed lines in Figure 8, since they are perpendicular to the vehicle flow on the road). However, we found out that it is more accurate and robust to use the intersection only as a first guess, and then use measured distances on the road to optimize the vanishing point position using Equation (22).

[TABLE]

where set $\mathcal{D}_{2}$ contains image endpoints and distances measured on the road towards the second vanishing point (green line segments in Figure 8) and scale $\lambda$ is computed for the given vanishing points $\mathbf{u},\mathbf{v}$ by Equation (23). It should be noted that the computation of 3D coordinates $\mathbf{P}_{i}$ of image point $\mathbf{p}_{i}$ depends on the vanishing points (see Equation (8) for details). The optimization itself is done by grid search (we loop over discretized feasible positions of $\mathbf{v}$ corresponding to reasonable focal lengths and evaluate the optimization objective (22)).

The usage of standard manual methods based on calibration patterns (e.g. checkerboards) proposed by Zhang [2000] is impractical, as it would require a large checkerboard (more than $10\,\mathrm{m}^{2}$ ) placed on the road.
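The least-squares intersection used for the first vanishing point in ManualCalib can be sketched with `numpy.linalg.lstsq`. Each line is represented as a unit normal $\mathbf{n}$ and a constant term $c$ with $\mathbf{n}\cdot\mathbf{x}+c=0$ ; the helper names and the example lines are hypothetical.

```python
import numpy as np

def line_through(p, q):
    """Return (n, c) with unit normal n such that n . x + c = 0 on the line through p and q."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    d = q - p
    n = np.array([-d[1], d[0]])
    n /= np.linalg.norm(n)
    return n, -float(n @ p)

def vanishing_point(lines):
    """Solve A u = -b in the least-squares sense for the common intersection point."""
    A = np.array([n for n, _ in lines])   # rows are the transposed unit normals
    b = np.array([c for _, c in lines])   # constant terms of the lines
    u, *_ = np.linalg.lstsq(A, -b, rcond=None)
    return u

# Three hypothetical lane lines that all pass through the point (3, 4).
lines = [line_through((0, 0), (3, 4)),
         line_through((6, 0), (3, 4)),
         line_through((3, -5), (3, 4))]
u = vanishing_point(lines)
```

With noisy annotations the lines no longer meet in one point, and the solution is the point minimizing the sum of squared distances to all lines, because the normals have unit length.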

We also define method names for different approaches for scale inference:

1. BMVC14 – Scale inference method proposed by Dubská et al. [2014]. A brief outline of the method is in Section 2.

2. BBScale + reg – Our method for scale calibration using bounding box matching (Section 4.3) with scale correction regression.

3. ManualScale – Scale computed from manually measured distances between markers towards the first vanishing point on the road. The scale is computed as the mean value of Equation (23) from a set of endpoints and distances $(\mathbf{p}_{i,1},\mathbf{p}_{i,2},d_{i})$ towards the first vanishing point (red line segments in Figure 8).

[TABLE]

4. SpeedScale – Scale is computed from ground truth speed measurements and minimizes the speed measurement error for the given camera calibration. It can be understood as the lower error bound for the given camera calibration method. The scale is computed as the mean value of Equation (24), where the set $\mathcal{M}$ contains pairs of ground truth speed $\hat{v}_{i}$ and measured speed $v_{i}$ . It is assumed that scale $\lambda=1$ was used for computation of speeds $v_{i}$ .

[TABLE]
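The two reference scales, ManualScale and SpeedScale, can be illustrated by the sketch below. It assumes that each per-item term is a simple ratio (known distance over measured road-plane distance, and ground truth speed over speed measured with $\lambda=1$ ) and that the final scale is the mean of these ratios; this reading is inferred from the descriptions above, and the function names and numbers are hypothetical.

```python
import numpy as np

def manual_scale(points_1, points_2, distances):
    """Mean ratio of known distances to unscaled road-plane distances."""
    P1 = np.asarray(points_1, dtype=float)
    P2 = np.asarray(points_2, dtype=float)
    measured = np.linalg.norm(P1 - P2, axis=1)
    return float(np.mean(np.asarray(distances, dtype=float) / measured))

def speed_scale(ground_truth_speeds, measured_speeds):
    """Mean ratio of ground truth speeds to speeds measured with scale 1."""
    gt = np.asarray(ground_truth_speeds, dtype=float)
    measured = np.asarray(measured_speeds, dtype=float)
    return float(np.mean(gt / measured))

# Illustrative values only.
lam_manual = manual_scale([[0, 0], [0, 0]], [[3, 4], [6, 8]], [10.0, 20.0])
lam_speed = speed_scale([50.0, 100.0], [25.0, 50.0])
```

Both functions return a single multiplicative factor that converts unscaled road-plane measurements to real-world units.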

If not stated otherwise, the evaluation was done on BrnoCompSpeed – Split C (contains more than 10 k vehicle tracks for evaluation), because our method requires parameter tuning for the scale correction regression and split C provides a sufficient amount of data for training and testing. For each metric, we report the mean, median, and 99th percentile error in both absolute units ( $err=|\hat{r}-r|$ ) and relative units ( $err=|\hat{r}-r|/\hat{r}\cdot 100\%$ ), where $\hat{r}$ denotes the ground truth measurement, and $r$ represents the measured value.
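The reported statistics follow directly from the two error definitions above. A minimal sketch, with a hypothetical function name and made-up measurements:

```python
import numpy as np

def error_summary(ground_truth, measured):
    """Mean, median, and 99th percentile of absolute and relative errors."""
    gt = np.asarray(ground_truth, dtype=float)
    m = np.asarray(measured, dtype=float)
    abs_err = np.abs(gt - m)
    rel_err = abs_err / gt * 100.0  # in percent of the ground truth value

    def stats(e):
        return {"mean": float(e.mean()),
                "median": float(np.median(e)),
                "p99": float(np.percentile(e, 99))}

    return {"absolute": stats(abs_err), "relative": stats(rel_err)}

# Made-up speeds in km/h, for illustration only.
summary = error_summary([100.0, 50.0, 80.0], [98.0, 51.0, 80.0])
```

Here the absolute errors are 2, 1, and 0 km/h, and the relative errors are 2 %, 2 %, and 0 %.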

5.1 Evaluation of Vanishing Point Estimation – Camera Calibration Error

To evaluate the camera calibration itself (the obtained vanishing points), we follow the evaluation metric proposed with the BrnoCompSpeed dataset [Sochor et al., 2016b]. The evaluation measures the difference between ratios of distances between markings towards the first vanishing point (red lines in Figure 8) and the distances between markers towards the second vanishing point (green lines in Figure 8). As the ratio does not depend on scale, this metric considers only the camera calibration in the form of two detected vanishing points.

Since we do not require any parameter tuning for the camera calibration method, we report the results on all videos in the BrnoCompSpeed dataset (including the extra session0). The results (reported in Table 1) show that our automatic calibration method Edgelets reduces the mean error of the ITS15 calibration method to almost half. It should be noted that the same distances that were used to obtain the manual calibration were evaluated by the calibration error metric based on distance ratios; this gives the manual calibration an unfair advantage in the comparison.

The significant improvement of our method is caused by more precise acquisition of $\mathbf{v}$ ; position of $\mathbf{u}$ stays the same for our method as for the ITS15 calibration method. There are two reasons why vanishing points play an important role. The first one is that the vanishing points are directly used for estimating the focal length; the second one is that they are used for computation of the viewpoint on the vehicle for scale estimation. Therefore, if the viewpoint is computed imprecisely, the alignment of the rendered 3D model is also imprecise.

5.2 Evaluation of Distance Measurement in the Road Plane

The next step is to evaluate the camera calibration together with the obtained scale. We use manual annotations of distances on the road plane which are directed towards the first or the second vanishing point, respectively (red and green in Figure 8).

We first evaluated the distance measurement only towards the first vanishing point, as it is the direction in which the vehicles are going and it is more important for speed measurement. The results are shown in Table 2 for different combinations of calibrations and scale estimations. Two observations follow. First, our fully automatic method for camera calibration (Edgelets) and scale inference (BBScale + reg) significantly outperforms the previous automatic method ITS15 + BMVC14. Second, when we use our automatically computed calibration and scale obtained with manual annotations, we achieve almost the same results as ManualCalib + ManualScale, which required much more manual effort than our automatic system.

When we evaluate the same metric with all the distances, the results are similar (see Table 3). Again, our method significantly outperforms the previous automatic method. Considering the calibrations with manually obtained scale, our system has a slightly higher error than the manual calibration. However, this is caused by the fact that the manual calibration is optimized directly to the evaluation metric by Equation (22) and thus gets an unfair and unrealistic advantage.

To summarize the distance measurement results: our method significantly outperforms previous automatic state-of-the-art for speed measurement – the mean error for distance measurement in the direction of vehicles’ flow (which is important for speed measurement) was reduced by $79\,\%$ (1.23 m to 0.26 m).

5.3 Evaluation of Speed Measurement

The most important part of the evaluation is the speed measurement itself. We used the same vehicle detection and tracking system (see Section 5) in all experiments so that the results for different calibrations and scales are directly comparable.

We show both quantitative results in the form of Table 4 and plots with cumulative error histograms in Figure 9. The table and the figures are divided into several parts where we compare similar levels of supervision.

The first level of supervision is fully automatic; in the second level, known ground truth dimensions on the road plane are used. In the third and final level of supervision, we use known ground truth speeds to form the lower error bound for different calibration methods.

Regarding the first level of supervision, our system Edgelets + BBScale + reg significantly outperforms the previous automatic method ITS15 + BMVC14, reducing the mean speed measurement error by $86\,\%$ (7.98 km/h to 1.10 km/h). Another important fact is that our fully automatic method for camera calibration and scale inference also outperforms manual calibration and scale inference, reducing the mean error by $19\,\%$ (1.35 km/h to 1.10 km/h). This improvement is important because in previous approaches, automation always compromised accuracy, forcing the system developer to trade off between them. Our work shows that fully automatic calibration methods may produce better results than manual calibration.
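The quoted percentages are relative reductions of the mean error and can be checked with a few lines of arithmetic; the helper name is hypothetical, and the error values are the ones reported in Sections 5.2 and 5.3.

```python
def relative_reduction(before, after):
    """Relative error reduction in percent."""
    return (before - after) / before * 100.0

# Mean speed error, previous automatic method vs. ours (km/h): about 86 %.
speed_vs_automatic = relative_reduction(7.98, 1.10)
# Mean speed error, manual calibration vs. ours (km/h): about 19 %.
speed_vs_manual = relative_reduction(1.35, 1.10)
# Mean distance error towards the first vanishing point (m): about 79 %.
distance_vs_automatic = relative_reduction(1.23, 0.26)
```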

When it comes to the second and third level of supervision, the results follow the same trend, with our calibration outperforming all the others (manual and automatic). Manual calibration is better on the calibration metric (Section 5.1) and distance measurement (Section 5.2), while our method outperforms it at the speed measurement task. This is because manual calibration uses the same data that are then used for the evaluation of the calibration metric and distance measurement. The achieved accuracy is very close to meeting the standards for speed measurement accuracy required for enforcement (typically $3\,\%$ in many European countries). The accuracy is definitely comparable to measurements achievable by radars [Sochor et al., 2016b], while being considerably cheaper, more flexible, and passive.

5.4 Sensitivity to Selection of the 3D Model

We also evaluated how using different 3D models of vehicles influences the speed measurement results. The results are shown in Table 5 and Figure 9 (bottom right). We tested several combinations of used vehicles: use of only one of the models (Combi, Sedan) or both of them together (Combi + Sedan), forming the first segment of the table. It shows that using both models significantly improves the results, as the errors in geometry of the 3D models cancel out. We consider that using only a few (as few as two) fine-grained models is beneficial because it is not necessary to obtain more 3D models and training data for fine-grained recognition. The experiments show that having two models is sufficient for obtaining usable results; using more than two models in practice would follow the same principles and could increase the robustness further.

The second segment of the table shows the performance of the system with scale correction regression to overcome the inaccuracies of the 3D models. The results show that for the Combi model, the error significantly decreases. However, for the Sedan model, the results stay more or less the same. This apparent paradox is caused by the smaller amount of training data for the Sedan model, as no Sedan vehicle was detected in some training videos. The results also show that if we use both models, the performance drop without the regression is not that significant (1.10 km/h to 1.38 km/h), and it is therefore possible to use the scale inference without the scale correction regression.

5.5 Vehicle Detection and Tracking Evaluation

Since we use a different vehicle detection and tracking method from Dubská et al. [2014], we also evaluate this part of the solution. We compare the methods on all videos of BrnoCompSpeed (including extra session0) with exactly the same calibration (ManualCalib + ManualScale) to isolate the influence of vehicle detection and tracking.

We report the number of False Positives Per Minute and mean recall in vehicle counting. The results can be found in Table 6, and as the table shows, our method considerably reduces the number of false positives with essentially the same recall.

A tracked vehicle is matched to the ground truth if it passes through the correct lane and the time difference of pass through the measurement line (yellow line in Figure 8 which is closest to the camera) compared to the ground truth is less than $0.2\,\mathrm{s}$ . This threshold is used by Sochor et al. [2016b] to correctly match the vehicles, as a higher threshold could lead to mismatches between the detected track and ground truth.
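The matching rule and the two counting metrics can be sketched as follows. The lane check and the $0.2\,\mathrm{s}$ threshold come from the text; the greedy nearest-in-time assignment, the function names, and the example data are assumptions made for illustration only, and a non-empty ground truth is assumed.

```python
def match_tracks(detections, ground_truth, max_dt=0.2):
    """Match detected passes to ground truth passes.

    Both inputs are lists of (lane, pass_time_seconds). Returns a list of
    (detection_index, ground_truth_index) pairs.
    """
    matches, used = [], set()
    for di, (lane, t) in enumerate(detections):
        best, best_dt = None, max_dt
        for gi, (gt_lane, gt_t) in enumerate(ground_truth):
            dt = abs(t - gt_t)
            # Same lane, not matched yet, and closer in time than any earlier candidate.
            if gi not in used and gt_lane == lane and dt < best_dt:
                best, best_dt = gi, dt
        if best is not None:
            used.add(best)
            matches.append((di, best))
    return matches

def counting_metrics(detections, ground_truth, duration_minutes, max_dt=0.2):
    """Return (recall, false positives per minute)."""
    matches = match_tracks(detections, ground_truth, max_dt)
    recall = len(matches) / len(ground_truth)
    false_positives_per_minute = (len(detections) - len(matches)) / duration_minutes
    return recall, false_positives_per_minute

# Hypothetical two-minute video with three ground truth passes.
detections = [(1, 10.05), (2, 12.0), (1, 30.0)]
ground_truth = [(1, 10.0), (2, 12.15), (1, 20.0)]
matches = match_tracks(detections, ground_truth)
recall, fppm = counting_metrics(detections, ground_truth, duration_minutes=2.0)
```

In this example the third detection has no ground truth pass within the threshold and counts as a false positive, while the third ground truth vehicle is missed.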

As we use the same calibration, we can also compare directly the speed measurement error which is influenced (with the same calibration) only by the tracking. As the table shows, our tracking method yields slightly reduced speed measurement error for the same scale and camera calibration.

For the tracking and speed measurement, we use the point at the front of the vehicle on the road plane (using the 3D bounding box), which is geometrically correct, as the point is on the road plane. We evaluated how the choice of the tracking point influences the measurement error, comparing to a naive solution which takes the center of the bottom edge of the 2D bounding box for the tracking, and we found out that the difference to the correct solution was negligible.

5.6 Camera Calibration on Real Surveillance Cameras

The automatic calibration from vehicle movement can be justifiably suspected of requiring idealized conditions and of being sensitive to bad lighting, etc. In order to verify the usability of our camera calibration method in real-world conditions, we obtained data from surveillance cameras in production use at 9 different locations. The videos were captured in both day and night conditions. The data are of rather poor quality ( $704\times 576\,\mathrm{px}$ or $704\times 288\,\mathrm{px}$ ) with 6 frames per second and a mean length of 40 s. As the ground truth calibration is not available for the data, we report only qualitative results in the form of an equilateral grid projected on the road plane. Despite the challenging character of the sequences (poor video quality and lighting conditions), we were able to correctly detect the vanishing points, as can be seen in Figure 10 on a few examples, and thus find the camera parameters and orientation, which is important in many real-world surveillance applications (e.g. estimation of vehicle viewpoints or image rectification).

6 Conclusions

We propose a fully automatic method for traffic surveillance camera calibration. It does not have any constraints on camera placement and does not require any manual input whatsoever. The results show that our system decreases the mean speed measurement error by $86\,\%$ (7.98 km/h to 1.10 km/h) compared to the previous automatic state-of-the-art method and by $19\,\%$ (1.35 km/h to 1.10 km/h) compared to the manual calibration method. This improvement is important, as in the previous approaches, automation always compromised accuracy, forcing the system developer to trade off between them. Our work shows that fully automatic calibration methods may produce better results than manual calibration. This result can be important beyond the field of traffic surveillance, since different forms of manual camera calibration are often considered the “ground truth”, but our work shows that automatic calibration from statistics of repeated inaccurate measurements can be more precise, despite requiring no user input. Our method removes the necessity of per-camera setting or calibration, but it still requires some human annotations per coarse geographic region (e.g. European Union or the USA) and per time period when the car models get vastly replaced (e.g. per decade).

In the experiments, we also showed that our method is able to calibrate real world traffic surveillance cameras and our proposed method for vehicle detection and tracking significantly reduces the number of false positives compared to the previous method. In future work, we would like to simplify the system and remove the necessity to render the vehicles by approximation of the bounding box size with a function parametrized by viewpoint and image location.

Acknowledgments

This work was supported by The Ministry of Education, Youth and Sports of the Czech Republic from the National Programme of Sustainability (NPU II); project IT4Innovations excellence in science – LQ1602.

We would also like to thank the company CAMEA for providing us with data from industrial surveillance cameras.

References

Cathey and Dailey [2005]

Cathey F, Dailey D.

A novel technique to dynamically measure vehicle speed using uncalibrated roadway cameras.

In: Intelligent Vehicles Symposium. 2005. p. 777–82.

Chaperon et al. [2011]

Chaperon T, Droulez J, Thibault G.

Reliable camera pose and calibration from a small set of point and line correspondences: A probabilistic approach.

Computer Vision and Image Understanding 2011;115(5):576–85.

Special issue on 3D Imaging and Modelling.

Cootes et al. [1995]

Cootes TF, Taylor CJ, Cooper DH, Graham J.

Active shape models-their training and application.

Computer Vision and Image Understanding 1995;61(1):38–59.

Dailey et al. [2000]

Dailey D, Cathey F, Pumrin S.

An algorithm to estimate mean traffic speed using uncalibrated cameras.

IEEE Transactions on Intelligent Transportation Systems 2000;1(2):98–107.

Dalal and Triggs [2005]

Dalal N, Triggs B.

Histograms of oriented gradients for human detection.

In: Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on. IEEE; volume 1; 2005. p. 886–93.

Do et al. [2015]

Do VH, Nghiem LH, Thi NP, Ngoc NP.

A simple camera calibration method for vehicle velocity estimation.

In: Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON), 2015 12th International Conference on. 2015. p. 1–5.

Dollár et al. [2014]

Dollár P, Appel R, Belongie S, Perona P.

Fast feature pyramids for object detection.

IEEE Transactions on Pattern Analysis and Machine Intelligence 2014;36(8):1532–45.

Dubská et al. [2014]

Dubská M, Sochor J, Herout A.

Automatic camera calibration for traffic understanding.

In: BMVC. 2014.

Dubská and Herout [2013]

Dubská M, Herout A.

Real projective plane mapping for detection of orthogonal vanishing points.

In: Proceedings of the British Machine Vision Conference. BMVA Press; 2013.

Dubská et al. [2015]

Dubská M, Herout A, Juranek R, Sochor J.

Fully automatic roadside camera calibration for traffic surveillance.

Intelligent Transportation Systems, IEEE Transactions on 2015;16(3):1162–71.

Fang et al. [2016]

Fang J, Zhou Y, Yu Y, Du S.

Fine-grained vehicle model recognition using a coarse-to-fine convolutional neural network architecture.

IEEE Transactions on Intelligent Transportation Systems 2016;PP(99):1–11.

Felzenszwalb et al. [2010]

Felzenszwalb P, Girshick R, McAllester D, Ramanan D.

Object detection with discriminatively trained part-based models.

IEEE Transactions on Pattern Analysis and Machine Intelligence 2010;32(9):1627–45.

Fung et al. [2003]

Fung GSK, Yung NHC, Pang GKH.

Camera calibration from road lane markings.

Optical Engineering 2003;42(10):2967–77.

Gao et al. [2016]

Gao Y, Beijbom O, Zhang N, Darrell T.

Compact bilinear pooling.

In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016.

Girshick et al. [2014]

Girshick R, Donahue J, Darrell T, Malik J.

Rich feature hierarchies for accurate object detection and semantic segmentation.

In: Computer Vision and Pattern Recognition. 2014.

Grammatikopoulos et al. [2005]

Grammatikopoulos L, Karras G, Petsa E.

Automatic estimation of vehicle speed from uncalibrated video sequences.

In: Proceedings of International Symposium on Modern Technologies, Education and Professional Practice in Geodesy and Related Fields. 2005. p. 332–8.

He and Yung [2007a]

He X, Yung N.

New method for overcoming ill-conditioning in vanishing-point-based camera calibration.

Optical Engineering 2007a;46(3).

He and Yung [2007b]

He XC, Yung NHC.

A novel algorithm for estimating vehicle speed from two consecutive images.

In: IEEE Workshop on Applications of Computer Vision, WACV. 2007b.

Hsiao et al. [2014]

Hsiao E, Sinha S, Ramnath K, Baker S, Zitnick L, Szeliski R.

Car make and model recognition using 3D curve alignment.

In: IEEE WACV. 2014.

Juránek et al. [2015]

Juránek R, Herout A, Dubská M, Zemčík P.

Real-time pose estimation piggybacked on object detection.

In: The IEEE International Conference on Computer Vision (ICCV). 2015.

Kalman [1960]

Kalman RE.

A new approach to linear filtering and prediction problems.

Transactions of the ASME – Journal of Basic Engineering 1960;(82 (Series D)):35–45.

Krause et al. [2015]

Krause J, Jin H, Yang J, Fei-Fei L.

Fine-grained recognition without part annotations.

In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2015.

Krause et al. [2013]

Krause J, Stark M, Deng J, Fei-Fei L.

3D object representations for fine-grained categorization.

In: ICCV Workshop 3dRR-13. 2013.

Krizhevsky et al. [2012]

Krizhevsky A, Sutskever I, Hinton GE.

Imagenet classification with deep convolutional neural networks.

In: Pereira F, Burges C, Bottou L, Weinberger K, editors. Advances in Neural Information Processing Systems 25. Curran Associates, Inc.; 2012. p. 1097–105.

Lan et al. [2014]

Lan J, Li J, Hu G, Ran B, Wang L.

Vehicle speed measurement based on gray constraint optical flow algorithm.

Optik - International Journal for Light and Electron Optics 2014;125(1):289–95.

Li et al. [2009]

Li C, Gatenby C, Wang L, Gore JC.

A robust parametric method for bias field estimation and segmentation of mr images.

In: Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE; 2009. p. 218–23.

Lin et al. [2015]

Lin TY, RoyChowdhury A, Maji S.

Bilinear cnn models for fine-grained visual recognition.

In: International Conference on Computer Vision (ICCV). 2015.

Lin et al. [2014]

Lin YL, Morariu VI, Hsu W, Davis LS.

Jointly optimizing 3D model fitting and fine-grained classification.

In: ECCV. 2014.

Liu et al. [2012]

Liu J, Kanazawa A, Jacobs D, Belhumeur P.

Dog breed classification using part localization.

In: Fitzgibbon A, Lazebnik S, Perona P, Sato Y, Schmid C, editors. ECCV 2012. Springer Berlin Heidelberg; volume 7572 of Lecture Notes in Computer Science; 2012. p. 172–85.

Liu et al. [2016]

Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu CY, Berg AC.

SSD: Single shot multibox detector.

To appear.

Long et al. [2015]

Long J, Shelhamer E, Darrell T.

Fully convolutional networks for semantic segmentation.

In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2015.

Lowe [1999]

Lowe DG.

Object recognition from local scale-invariant features.

In: Computer vision, 1999. The proceedings of the seventh IEEE international conference on. Ieee; volume 2; 1999. p. 1150–7.

Luvizon et al. [2014]

Luvizon D, Nassu B, Minetto R.

Vehicle speed estimation by license plate detection and tracking.

In: Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. 2014. p. 6563–7.

Luvizon et al. [2016]

Luvizon DC, Nassu BT, Minetto R.

A video-based system for vehicle speed measurement in urban roadways.

IEEE Transactions on Intelligent Transportation Systems 2016;PP(99):1–12.

Maduro et al. [2008]

Maduro C, Batista K, Peixoto P, Batista J.

Estimation of vehicle velocity and traffic intensity using rectified images.

In: Image Processing, 2008. ICIP 2008. 15th IEEE International Conference on. 2008. p. 777–80.

Nurhadiyatna et al. [2013]

Nurhadiyatna A, Hardjono B, Wibisono A, Sina I, Jatmiko W, Ma’sum M, Mursanto P.

Improved vehicle speed estimation using gaussian mixture model and hole filling algorithm.

In: Advanced Computer Science and Information Systems (ICACSIS), 2013 International Conference on. 2013. p. 451–6.

Prokaj and Medioni [2009]

Prokaj J, Medioni G.

3-D model based vehicle recognition.

In: IEEE WACV. 2009.

Ren et al. [2015]

Ren S, He K, Girshick R, Sun J.

Faster R-CNN: Towards real-time object detection with region proposal networks.

In: Advances in Neural Information Processing Systems (NIPS). 2015.

Schoepflin and Dailey [2003]

Schoepflin T, Dailey D.

Dynamic camera calibration of roadside traffic management cameras for vehicle speed estimation.

Intelligent Transportation Systems, IEEE Transactions on 2003;4(2):90–8.

Shi and Tomasi [1994]

Shi J, Tomasi C.

Good features to track.

In: 1994 Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. 1994. p. 593–600.

Simon and Rodner [2015]

Simon M, Rodner E.

Neural activation constellations: Unsupervised part model discovery with convolutional networks.

In: International Conference on Computer Vision (ICCV). 2015.

Sina et al. [2013]

Sina I, Wibisono A, Nurhadiyatna A, Hardjono B, Jatmiko W, Mursanto P.

Vehicle counting and speed measurement using headlight detection.

In: Advanced Computer Science and Information Systems (ICACSIS), 2013 International Conference on. 2013. p. 149–54.

Sochor et al. [2016a]

Sochor J, Herout A, Havel J.

BoxCars: 3D boxes as CNN input for improved fine-grained vehicle recognition.

In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016a.

Sochor et al. [2016b]

Sochor J, Juranek R, Spanhel J, Marsik L, Siroky A, Herout A, Zemcik P.

BrnoCompSpeed: Review of traffic camera calibration and a comprehensive dataset for monocular speed measurement.

Intelligent Transportation Systems (under review), IEEE Transactions on 2016b;.

Tomasi and Kanade [1991]

Tomasi C, Kanade T.

Detection and Tracking of Point Features.

Technical Report; International Journal of Computer Vision; 1991.

Yang et al. [2016]

Yang J, Price B, Cohen S, Lee H, Yang MH.

Object contour detection with a fully convolutional encoder-decoder network.

In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016.

You and Zheng [2016]

You X, Zheng Y.

An accurate and practical calibration method for roadside camera using two vanishing points.

Neurocomputing 2016;.

Yu et al. [2009]

Yu X, Jiang N, Cheong LF, Leong HW, Yan X.

Automatic camera calibration of broadcast tennis video with applications to 3D virtual content insertion and ball detection and tracking.

Computer Vision and Image Understanding 2009;113(5):643–52.

Computer Vision Based Analysis in Sport Environments.

Zhang [2000]

Zhang Z.

A flexible new technique for camera calibration.

IEEE Transactions on Pattern Analysis and Machine Intelligence 2000;22(11):1330–4.

Zhang et al. [2013]

Zhang Z, Tan T, Huang K, Wang Y.

Practical camera calibration from moving objects for traffic scene surveillance.

IEEE Transactions on Circuits and Systems for Video Technology 2013;23(3):518–33.

Zheng and Peng [2014]

Zheng Y, Peng S.

A practical roadside camera calibration method based on least squares optimization.

IEEE Transactions on Intelligent Transportation Systems 2014;15:831–43.
