CED: Color Event Camera Dataset

Cedric Scheerlinck; Henri Rebecq; Timo Stoffregen; Nick Barnes; Robert Mahony; Davide Scaramuzza

arXiv:1904.10772·cs.CV·April 25, 2019


TL;DR

This paper introduces the first comprehensive color event camera dataset (CED) with diverse scenes, and extends simulation tools to support color event data, facilitating research in event-based vision and image reconstruction.

Contribution

The paper presents the first color event camera dataset (CED), extends the ESIM simulator for color events, and evaluates state-of-the-art image reconstruction methods for color event streams.

Findings

1. CED enables new research in color event-based vision.

2. Color event data can be effectively reconstructed into HDR color videos.

3. The evaluation highlights strengths and limitations of current reconstruction methods.



Tables

Table 1: Bias settings used for the Color-DAVIS346.

Bias	Indoors (Coarse)	Indoors (Fine)	Outdoors (Coarse)	Outdoors (Fine)
DiffBn	4	39	4	39
OFFBn	4	0	4	0
ONBn	6	200	6	200
PrBp	2	58	3	0
PrSFBp	1	33	1	33
RefrBp	4	25	4	25

Table 2: Types of scenes in our Color Event Camera Dataset.

Type	# Seq	Length (mins)	Lux	Description	Possible Applications
Simple	16	5	80 - 1e3	Simple camera motions looking at simple objects and scenes with vibrant colors such as fruit, blocks and posters.	Image reconstruction
Indoors	15	5	0.8 - 1e3	Natural indoor scenes including office, kitchen, rooms and corridors.	Object detection
People	27	10	400	Common actions and gestures such as sitting, waving, jumping, air guitar.	Action recognition
Driving	12	28	200 - 1e5	Footage from front windshield of car driving around country, suburban and city landscapes. Features tunnels, traffic lights, vehicles and pedestrians during the day in sunny conditions.	Segmentation, Optical flow
Calibration	14	2	80 - 1e5	ColorChecker and density step target: indoors, outdoors, with and without infrared filter.	Color calibration
Simulated	-	-	-	Color ESIM (adapted from [1]). Simulator can be used to generate unlimited sequences with ground truth depth, ego-motion, optical flow and more.	Optical flow, SLAM, Image reconstruction


Full text


Cedric Scheerlinck †*

Henri Rebecq ‡*

Timo Stoffregen §

Nick Barnes †

Robert Mahony †

Davide Scaramuzza ‡

Abstract

Event cameras are novel, bio-inspired visual sensors, whose pixels output asynchronous and independent timestamped spikes at local intensity changes, called ‘events’. Event cameras offer advantages over conventional frame-based cameras in terms of latency, high dynamic range (HDR) and temporal resolution. Until recently, event cameras have been limited to outputting events in the intensity channel; however, recent advances have resulted in the development of color event cameras, such as the Color-DAVIS346. In this work, we present and release the first Color Event Camera Dataset (CED), containing 50 minutes of footage with both color frames and events. CED features a wide variety of indoor and outdoor scenes, which we hope will help drive forward event-based vision research. We also present an extension of the event camera simulator ESIM [1] that enables simulation of color events. Finally, we present an evaluation of three state-of-the-art image reconstruction methods that can be used to convert the Color-DAVIS346 into a continuous-time, HDR, color video camera to visualise the event stream, and for use in downstream vision applications.

* Equal contribution.

† Australian National University, Canberra, ACT, Australia.

‡ Dept. Informatics, Univ. of Zurich and Dept. Neuroinformatics, Univ. of Zurich and ETH Zurich.

§ Monash University, Melbourne, VIC, Australia.

Website: http://rpg.ifi.uzh.ch/CED

1 Introduction

Since their recent addition to the computer vision community [2], event cameras have challenged conventional thinking about how to solve computer vision problems. Instead of producing global-shutter images at a fixed frame-rate as in conventional cameras, event cameras have pixels that operate independently and asynchronously. When the brightness change at a given pixel exceeds a threshold, that pixel emits an event containing its $(x,y)$ address, timestamp and polarity. Event cameras offer several advantages: they sample at the rate of scene dynamics without having to wait for an external shutter cycle, and the output is data-driven and non-redundant. This means that event cameras have extremely low latency, low power consumption and bandwidth requirements, high dynamic range and suffer essentially no motion blur. The temporal resolution of current event cameras is on the order of microseconds.
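The event-generation rule described above can be sketched for a single pixel as follows. This is an idealised illustration only: the function name `generate_events`, the sampled-brightness input and the default threshold value are assumptions made for the example, and a real sensor timestamps each threshold crossing asynchronously rather than at sample times.

```python
from typing import List, Tuple

Event = Tuple[int, int, float, int]  # (x, y, timestamp, polarity)


def generate_events(x: int, y: int,
                    samples: List[Tuple[float, float]],
                    threshold: float = 0.25) -> List[Event]:
    """Idealised event generation for one pixel.

    samples: time-ordered (timestamp, brightness) pairs for this pixel.
    threshold: hypothetical contrast threshold (not a value from the paper).
    """
    events: List[Event] = []
    if not samples:
        return events
    reference = samples[0][1]  # brightness at the last emitted event
    for t, brightness in samples[1:]:
        # Emit one event per threshold crossing since the last event.
        while abs(brightness - reference) >= threshold:
            polarity = 1 if brightness > reference else -1
            reference += polarity * threshold
            # Simplification: every crossing gets the sample timestamp.
            events.append((x, y, t, polarity))
    return events
```

For example, a pixel whose brightness rises by a little more than two thresholds and then falls back would emit two positive events followed by negative ones, while a pixel with constant brightness emits nothing, which is the sense in which the output is data-driven and non-redundant.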

Since their introduction, event cameras have spawned a flurry of research. They have been used in feature detection and tracking [3, 4, 5, 6], depth estimation [7, 8, 9, 10], stereo [11, 12, 13, 14], optical flow [15, 16, 17, 18], image reconstruction [19, 20, 21, 22, 23, 24, 25], localization [26, 27, 28, 29], SLAM [30, 31, 32], visual-inertial odometry [33, 34, 35, 36], pattern recognition [37, 38, 39, 40], and more. In response to the growing needs of the community, several important event-based vision datasets have been released, directed at popular topics such as SLAM [28], optical flow [41, 42] and recognition [43, 37]. Event camera datasets enable better benchmarking and reproducibility, and grant researchers access to high quality event data in a range of environments without necessarily having to acquire an expensive event camera.

While existing datasets are limited to monochrome events, event camera technology has since advanced to allow color events and frames [44], which opens the door to a new generation of color event processing.

The addition of color information to event-based vision has the potential to improve performance of many tasks, such as segmentation [45] and recognition, where it is known that color is an important source of visual information [46]. Early works have shown promising results using prototype color event cameras [47, 48, 49], or a mirrored-rig with three monochrome cameras and three color filters [45]; however, to date there are no publicly available color event datasets. Further, the wider research community has limited access to color event cameras, hindering progress in color event vision research.

We present the first Color Event Camera Dataset (Fig. 1) that aims to spur research into color event vision by providing the community with high quality color event data, alongside color frames from the Color-DAVIS346 [44]. The Color-DAVIS346 (Fig. 2) is the latest color event camera, built upon the popular line of DAVIS cameras on which many existing datasets and much existing research are based. Rather than directing our focus at a specific target application, we aim to cater for general purpose vision research by including a diverse range of scenes (simple objects, indoor/outdoor scenes, people), lighting conditions (daylight, indoor light, low-light), camera motions (linear, 6-DOF motion) and dynamics. While we do not provide ground truth labels for any specific task (e.g. optical flow estimation, object detection, etc.), we provide color images from the sensor that are naturally synchronized and registered to events. These images may be used to generate proxy labels for any task of interest (using either conventional computer vision, or manual annotation) that can be transferred to the events.

To visually unveil the color information contained in color events, we evaluate and compare three state-of-the-art event-based image reconstruction methods [22, 24, 50] on our Color Event Camera Dataset. Image reconstruction is an active field of event-based vision research [19, 20, 21, 22, 23, 24, 6, 50] that allows visualisation of the event stream, and enables application of decades of computer vision research and expertise on event data, which in its raw form is inaccessible to powerful tools such as convolutional neural networks. Further, event reconstructed images have the potential to retain desirable qualities of event cameras, such as high dynamic range, high temporal resolution and immunity to motion blur.

Contributions:

1. We present CED: Color Event Camera Dataset containing 50 minutes of both color events and frames in a wide range of natural scenes with static and dynamic objects, and covering a variety of camera-motions from simple translations and rotations to unconstrained 6-DOF motions.

2. We release a color event camera simulator, based on ESIM [1].

3. We present color video reconstructions from a color event camera, comparing three state-of-the-art reconstruction methods. Video reconstruction provides a natural way to visualize the event stream and enable image-based processing on events.

2 Related Works

Many event-based vision datasets have been published since the introduction of the DVS [2]. Most of these datasets were recorded using a DAVIS [51] event camera or similar and have a particular use-case in mind, such as image reconstruction [24], recognition [43, 37, 52], optical flow [42, 53, 21], driving/SLAM [41, 29, 26]. The dataset perhaps most similar to ours is the Event-Camera Dataset and Simulator [28]. All of the above datasets are limited to monochrome temporal contrast or gray-level events. Our Color Event Camera Dataset (CED) doesn’t have a particular use-case in mind and aims simply to cover a wide range of scenarios and motions that can be used in a broad swathe of research topics.

The need for publicly available datasets of arbitrary event data is partly driven by the fact that event cameras are scarce and expensive hardware acquisitions. For this reason several event camera simulators have been developed in previous years, the most sophisticated of which is the ESIM [1]. While ESIM provides high quality, realistic event data and ground-truth from a free moving simulated camera in an arbitrary 3D modeled environment, it does not support color events. Nor does (to our knowledge) any other contemporary, publicly available event simulator. We propose an extension of ESIM to simulate color events and make it publicly available.

Thus far there have been few works that use color events. One notable exception is Marcireau et al. [45], who perform color segmentation on color events. However, in this work the authors felt compelled to build their own color event camera using a complex array of beam splitting mirrors and filters to channel light into three separate event cameras. Further, this setup did not allow capturing color frames, which had to be instead reconstructed from the event streams of the three sensors. Our dataset hopes to save future researchers this kind of effort.

The C-DAVIS [49] was one of the first color event cameras, based on the DAVIS [51] with VGA resolution color (RGBW) frames and QVGA monochrome events. The SDAVIS192 [48] had improved sensitivity over the DAVIS, able to output color (RGBW) events and frames at 188 $\times$ 192 pixel resolution. Moeys et al. [47] used the SDAVIS192 to demonstrate color image reconstruction from events using 1) naïve integration and 2) Poisson integration [54] of a gradient field based on the surface of active events [15]. The Color-DAVIS346 [44] is the latest color event camera at the time of writing, and outputs color (RGBG) events and frames at 346 $\times$ 260 resolution.

3 CED: Color Event Camera Dataset

The Color-DAVIS346 [44] consists of an 8 $\times$ 6 mm CMOS chip patterned with RGBG filters (Fig. 3), able to output color events and standard frames at 346 $\times$ 260 pixel resolution. Table 1 displays the camera bias settings used (based on the defaults provided in the DAVIS ROS driver, https://github.com/uzh-rpg/rpg_dvs_ros). Events generated by the DAVIS are reported with microsecond timestamp precision. We provide time-stamped, raw frames from the DAVIS, as well as color frames obtained via demosaicing [55]. To minimize motion blur in the DAVIS frames, we use fixed exposure fine-tuned for each indoor sequence. We use auto-exposure for outdoor sequences since it is bright enough to drive exposure time down. No infrared filter is used unless otherwise specified. We provide binary (rosbag) files containing synchronized and time-stamped events, raw images and color images.

The Color Event Camera Dataset (Fig. 4) contains 50 minutes of footage consisting of 100k color DAVIS frames and over one billion color events. The sequences cover a wide variety of scenes that showcase some of the key properties of the technology, namely high dynamic range, high temporal resolution and immunity to motion-blur. We include five categories (Table 2): Simple, Indoors, People, Driving and Calibration. Simple contains sequences in favorable conditions, i.e. well-lit, moderate camera motions, where the DAVIS frame is typically sharp and well-exposed. Indoors contains challenging conditions such as low-light, fast camera motion, as well as natural indoor office scenes. People consists of pre-determined actions such as sitting, waving, dancing with both static and dynamic camera. Driving is filmed through the windshield of a car in sunny conditions and contains a range of environments including highways, tunnels, city and country. Calibration shows a ColorChecker and density step target in various lighting conditions including fluorescent, low-light, outdoors, with and without an infrared filter.
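As a rough sanity check on these figures, the sketch below derives average rates from the quoted totals and confirms that the per-category lengths in Table 2 add up to the stated duration. The event count is treated as a lower bound ("over one billion"), the frame count is the rounded figure from the text, and the variable names are illustrative only.

```python
# Totals quoted in the text.
total_minutes = 50
num_frames = 100_000          # "100k color DAVIS frames" (rounded)
num_events = 1_000_000_000    # lower bound: "over one billion color events"

# Length (mins) per category, copied from Table 2.
category_minutes = {
    "Simple": 5,
    "Indoors": 5,
    "People": 10,
    "Driving": 28,
    "Calibration": 2,
}

seconds = total_minutes * 60
mean_frame_rate = num_frames / seconds   # average frames per second
mean_event_rate = num_events / seconds   # average events per second (lower bound)

print(f"category lengths sum to {sum(category_minutes.values())} min")
print(f"mean frame rate ~ {mean_frame_rate:.1f} fps")
print(f"mean event rate >= {mean_event_rate:,.0f} events/s")
```

These are averages over the whole dataset, roughly 33 frames per second and at least a few hundred thousand events per second; instantaneous event rates depend strongly on scene dynamics and will differ between sequences.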

Color Event Simulator. In addition to the real event datasets, we extended the event camera simulator ESIM [1] to allow simulation of color events (https://github.com/uzh-rpg/rpg_esim). Our extension operates on the ground-truth color (RGB) frames generated by the rendering engine, and simulates a color filter array (specifically, an RGBG Bayer pattern, as in the DAVIS346 used for this dataset). The simulated Bayered frames are then processed by the event simulation code in ESIM, thus producing color events in the same way as the DAVIS346. ESIM can readily provide multiple ground truth modalities, such as color frames, depth maps, optical flow maps, camera poses and camera velocities. Our extension is compatible with all the rendering engines already bundled with ESIM, including a photorealistic rendering engine. Figure 5 shows an example of color event data and ground truth modalities simulated by our extension of ESIM.
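The color filter array step can be illustrated with a minimal sketch that turns an RGB frame into a single-channel Bayered frame. This is not the ESIM implementation: the function name `bayer_mosaic` and the 2 $\times$ 2 tile layout below are assumptions for illustration (the text specifies only an RGBG pattern; the actual sensor layout is the one shown in Fig. 3).

```python
from typing import List, Tuple

RGB = Tuple[float, float, float]

# ASSUMPTION: hypothetical 2x2 tile (rows top to bottom, columns left to right).
# The real Color-DAVIS346 arrangement may order these sites differently.
BAYER_TILE = (("R", "G"),
              ("G", "B"))
CHANNEL_INDEX = {"R": 0, "G": 1, "B": 2}


def bayer_mosaic(frame: List[List[RGB]]) -> List[List[float]]:
    """Keep, at each pixel, only the color sample its filter would pass.

    frame[y][x] is an (r, g, b) tuple; the result is a single-channel
    image of the same size, suitable as input to a monochrome event
    simulator operating pixel by pixel.
    """
    return [
        [pixel[CHANNEL_INDEX[BAYER_TILE[y % 2][x % 2]]]
         for x, pixel in enumerate(row)]
        for y, row in enumerate(frame)
    ]
```

Because each pixel of the mosaiced frame carries one color sample, any per-pixel event model applied to consecutive mosaiced frames yields events whose color is determined by the pixel address alone.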

4 Color Video Reconstruction

Image reconstruction from events serves two primary functions: 1) as a way to visualise events and 2) for use in downstream vision applications e.g. object detection.

4.1 Method

We evaluate and compare three state-of-the-art event-based image reconstruction methods on our Color Event Camera Dataset. While these methods were originally designed for monochrome events, we found that with minimal modification all three were able to produce convincing color reconstructions. While “ground-truth” color DAVIS frames were available, only color events were used as input to each method.

1. Manifold Regularisation (MR) (https://github.com/VLOGroup/dvs-reconstruction). Reinbacher et al. [22] use integration with spatio-temporal smoothing to recover image frames from events. They use the surface of active events [15] to define a manifold that guides regularisation. We use default parameters provided by the authors; the integration window length is set to $1,000$ events.

2. High-pass Filter (HF) (https://github.com/cedric-scheerlinck/dvs_image_reconstruction). Scheerlinck et al. [24] show that a lightweight, asynchronous complementary filter can be used to obtain a continuous-time video from events and frames. If desired, the frame input to the filter can be set to zero, resulting in a simple high-pass filter that produces reasonable results from only events. Since each pixel is treated independently without spatial smoothing, the Bayer pattern is preserved, and demosaicing [56] can be used to recover an RGB image at any point in time. We use a gain of 0.06 for both cutoff_frequency and cutoff_frequency_per_event_component. As a final post-processing step, we apply a 5 $\times$ 5 bilateral filter with spatial_filter_sigma set to 1.0 for each output reconstruction.

3. E2VID Neural Network (E2VID). Rebecq et al. [50] show that a recurrent neural network trained on a large amount of event data simulated with ESIM [1] can generate high quality video reconstructions from event data only. E2VID converts the stream of events into a sequence of “event tensors”, each consisting of a fixed batch of events represented as a 3D spatio-temporal voxel grid. The sequence of event tensors is passed to a recurrent UNet that outputs a sequence of reconstructed image frames.
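To make the "event tensor" idea in item 3 concrete, the following sketch bins a batch of events into a spatio-temporal voxel grid. It is a simplified illustration and not the E2VID code: the function name `events_to_voxel_grid`, the nearest-bin temporal assignment and the nested-list representation are all assumptions made to keep the example self-contained.

```python
from typing import List, Sequence, Tuple

Event = Tuple[int, int, float, int]  # (x, y, timestamp, polarity in {-1, +1})


def events_to_voxel_grid(events: Sequence[Event], num_bins: int,
                         height: int, width: int) -> List[List[List[float]]]:
    """Accumulate event polarities into a grid indexed as [bin][y][x].

    Assumes `events` is sorted by timestamp. Each event is added to the
    temporal bin it falls into (a simplification; other schemes spread
    an event over neighbouring bins).
    """
    grid = [[[0.0] * width for _ in range(height)] for _ in range(num_bins)]
    if not events:
        return grid
    t_start = events[0][2]
    duration = max(events[-1][2] - t_start, 1e-9)  # guard against zero span
    for x, y, t, polarity in events:
        # Map the timestamp to a bin; the last event lands in the last bin.
        b = min(int((t - t_start) / duration * num_bins), num_bins - 1)
        grid[b][y][x] += polarity
    return grid
```

A grid of this kind has a fixed size regardless of how long the batch lasted in time, which is what allows a sequence of such tensors to be fed to a network with a fixed input shape.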

Manifold regularisation (MR) and E2VID utilize spatial smoothing, which destroys the Bayer pattern if applied directly to events. For both of these methods, we found that color images can still be obtained by reconstructing each of the four Bayer channels (red, two green, and blue) independently (at quarter resolution), then upsampling to the original resolution using bicubic interpolation. Because of the Bayer pattern, the four different (upsampled) color channels will not be exactly aligned. Therefore, we shift each color channel by one pixel horizontally and/or vertically so that all four color channels are geometrically aligned. We fuse both green channels (after alignment) by simply taking the mean. In contrast, the High-pass filter (HF) treats each pixel independently and does not perform spatial smoothing. Thus, it can be applied directly to events, then converted to color using demosaicing [56].
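The per-channel strategy for MR and E2VID can be sketched as follows: events are routed to one of four quarter-resolution streams according to their pixel address, each stream is reconstructed on its own, and the two green results are averaged. The tile layout, the names `split_by_channel` and `fuse_greens`, and the labels `G1`/`G2` are hypothetical; upsampling and the one-pixel alignment shift are omitted here for brevity.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Event = Tuple[int, int, float, int]  # (x, y, timestamp, polarity)

# ASSUMPTION: illustrative 2x2 tile; G1 and G2 label the two green sites.
BAYER_TILE = (("R", "G1"),
              ("G2", "B"))


def split_by_channel(events: Sequence[Event]) -> Dict[str, List[Event]]:
    """Route each event to its color channel at quarter resolution."""
    streams: Dict[str, List[Event]] = defaultdict(list)
    for x, y, t, polarity in events:
        channel = BAYER_TILE[y % 2][x % 2]
        # Halving both coordinates gives the address in the sub-image,
        # which has a quarter of the original pixel count.
        streams[channel].append((x // 2, y // 2, t, polarity))
    return dict(streams)


def fuse_greens(g1: List[List[float]], g2: List[List[float]]) -> List[List[float]]:
    """Mean of two (already aligned) green reconstructions."""
    return [[(a + b) / 2 for a, b in zip(row1, row2)]
            for row1, row2 in zip(g1, g2)]
```

Each stream returned by `split_by_channel` looks like a monochrome event stream from a smaller sensor, so a reconstruction method designed for monochrome events can be applied to it unchanged.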

4.2 Results

Figure 6 shows reconstruction results of all three methods: Manifold regularisation (MR), High-pass filter (HF) and events-to-video neural network (E2VID), alongside DAVIS frames from the Color-DAVIS346. HF and E2VID preserve color well and qualitatively match the DAVIS frame. We encourage the reader to watch the accompanying video, which conveys our results better than still images.

Figure 7 displays edge cases such as high-speed, HDR etc. that highlight strengths and weaknesses of each reconstruction method and the DAVIS frames:

Initialisation (first row). Both MR and HF are initialised at zero and rely on integration of events to build a consistent image over time. Thus, they are prone to producing edge-like images, particularly within the first few milliseconds after initialisation, until enough events ‘fill in’ the missing information. In contrast, E2VID is good at filling in gaps and can hallucinate color accurately in places with no events.

Fast Motion (second row). HF is a temporal high-pass filter, and is sensitive to temporal components in the input signal, such as frequency and speed. Thus, the quality of the reconstruction can be adversely affected by extremely fast (or slow) motions. In addition, fast motions tend to generate noise in the event stream that is accumulated without discrimination by the integrator in HF. MR and E2VID are good at rejecting noise from fast motion and showcase the attractive properties of event cameras for challenging scenarios.

Sharpness (third row). MR and E2VID rely on spatial smoothing to filter out noise from the event stream, which can degrade sharpness of fine details. For color reconstruction, the spatial smoothing property of these two methods destroys the Bayer pattern, requiring each color to be reconstructed independently (at quarter resolution), then upscaled back to the original resolution, further losing fine details. In contrast, HF requires no spatial smoothing, so a raw intensity reconstruction at full resolution is possible, since the Bayer pattern is preserved. A demosaicing algorithm [56] can be used to convert the raw output to color without loss of resolution, resulting in a sharper reconstruction.

Memory (fourth row). The “memory” (i.e. the time span over which information in the event data can be propagated) varies between the three methods. For HF, the size of the temporal receptive field (memory) is explicitly encoded through the cutoff frequency parameter. Hence, the duration across which information can be propagated can be set to an arbitrarily long time, at the expense of integrating more noise, and creating “bleeding” patterns following moving objects. By contrast, MR and E2VID have an implicit memory, whose size can vary with the number of events used in each integration window (MR), or event tensor (E2VID). However, we observe that the memory of MR and E2VID is notably smaller than that of HF, which is particularly visible in the driving sequence (fourth row of Fig. 7), where HF is able to reconstruct slow moving objects, e.g. the clouds or the distant buildings, in contrast to MR and E2VID.
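The link between cutoff frequency and memory can be illustrated with a per-pixel sketch of an event-only high-pass filter: the state decays exponentially between events and jumps by a fixed amount at each event. This is a simplified illustration in the spirit of HF, not the authors' released implementation; the class name `HighPassPixel`, the contrast threshold value and the units (cutoff in 1/s, time in seconds) are assumptions.

```python
import math
from typing import Optional


class HighPassPixel:
    """Event-only high-pass filter state for a single pixel (sketch)."""

    def __init__(self, cutoff_frequency: float,
                 contrast_threshold: float = 0.1) -> None:
        self.alpha = cutoff_frequency   # larger value = shorter memory
        self.c = contrast_threshold     # hypothetical per-event step
        self.value = 0.0                # initialised at zero, as in the text
        self.last_t: Optional[float] = None

    def decay_to(self, t: float) -> float:
        """Apply exponential decay from the last update time up to t."""
        if self.last_t is not None:
            self.value *= math.exp(-self.alpha * (t - self.last_t))
        self.last_t = t
        return self.value

    def on_event(self, t: float, polarity: int) -> float:
        """Decay the state to time t, then add the event contribution."""
        self.decay_to(t)
        self.value += polarity * self.c
        return self.value
```

In this sketch the contribution of an event shrinks by a factor of $e$ after `1 / cutoff_frequency` seconds, so lowering the cutoff lengthens the memory; with a cutoff of zero the filter becomes a pure integrator that never forgets, which also means it never forgets noise.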

HDR (fifth row). Since the APS is limited to a uniform exposure duration for all pixels, the DAVIS frame has low dynamic range compared to events. Thus, when bright regions (e.g. a window) are well exposed, dark regions are often underexposed, and vice versa. Reconstructions from MR, HF and E2VID all showcase the high dynamic range property of events, i.e. both dark and bright regions are clear.

Low light (sixth row). Low lighting is a challenge for conventional cameras because the exposure duration must be increased to avoid underexposure, leading to motion blur. While the DAVIS frame is motion blurred, MR, HF and E2VID demonstrate immunity to motion blur, even in challenging low lighting conditions.

4.3 Application of Reconstructions

While many computer vision algorithms work on grayscale images, it is well established that incorporating color information can significantly boost performance for the task at hand [57]. This is because color images contain more information about the scene than grayscale images, which can only encode structural information. This is particularly true in recognition tasks, where color can be an important visual cue. Figure 8 shows one example where color improves object detection performance. We apply YOLO [58] to E2VID images reconstructed from both grayscale and color events and observe that color offers qualitative improvement. While image reconstructions can be used directly for the task at hand, they may also be used to generate proxy labels (e.g. segmentation, optical flow, recognition) that can be transferred to events.

5 Conclusion

We present the first Color Event Camera Dataset, containing both frames and events across a diverse range of scenes, motions and lighting conditions. We release an open source color event camera simulator based on ESIM [1]. We show how three state-of-the-art event-based image reconstruction methods can be adapted for color video reconstruction, and compare strengths/weaknesses of each method. We hope that our Color Event Camera Dataset and simulator will inspire future work with color events, which we believe is the next step for event-based vision.

Acknowledgements

We would like to thank Prof. Tobi Delbruck and the Sensors group at the Institute of Neuroinformatics (ETH & University of Zurich), and iniVation for providing the camera. This work was supported by (i) the Australian Government Research Training Program Scholarship, (ii) the Australian Research Council through the “Australian Centre of Excellence for Robotic Vision” under Grant CE140100016, (iii) the Swiss Government Excellence Scholarship, (iv) the Swiss National Center of Competence Research Robotics (NCCR), (v) Qualcomm (through the Qualcomm Innovation Fellowship Award 2018), and (vi) the SNSF-ERC Starting Grant.

Bibliography

[1] H. Rebecq, D. Gehrig, and D. Scaramuzza, “ESIM: an open event camera simulator,” in Conf. on Robotics Learning (CoRL), 2018.
[2] P. Lichtsteiner, C. Posch, and T. Delbruck, “A 128 $\times$ 128 120 dB 15 $\mu$s latency asynchronous temporal contrast vision sensor,” IEEE J. Solid-State Circuits, vol. 43, no. 2, pp. 566–576, 2008.
[3] D. Gehrig, H. Rebecq, G. Gallego, and D. Scaramuzza, “Asynchronous, photometric feature tracking using events and frames,” in Eur. Conf. Comput. Vis. (ECCV), 2018.
[4] F. Barranco, C. Fermuller, and E. Ros, “Real-time clustering and multi-target tracking using event-based sensors,” in IEEE/RSJ Int. Conf. Intell. Robot. Syst. (IROS), 2018.
[5] I. Alzugaray and M. Chli, “Asynchronous corner detection and tracking for event cameras in real time,” IEEE Robot. Autom. Lett., vol. 3, pp. 3177–3184, Oct. 2018.
[6] C. Scheerlinck, N. Barnes, and R. Mahony, “Asynchronous spatial image convolutions for event cameras,” IEEE Robot. Autom. Lett., vol. 4, pp. 816–822, Apr. 2019.
[7] H. Rebecq, G. Gallego, and D. Scaramuzza, “EMVS: Event-based multi-view stereo,” in British Mach. Vis. Conf. (BMVC), 2016.
[8] H. Rebecq, G. Gallego, E. Mueggler, and D. Scaramuzza, “EMVS: Event-based multi-view stereo—3D reconstruction with an event camera in real-time,” Int. J. Comput. Vis., vol. 126, pp. 1394–1414, Dec. 2018.