Real Time 3D Indoor Human Image Capturing Based on FMCW Radar
Hanqing Guo, Nan Zhang, Wenjun Shi, Saeed AlQarni, Shaoen Wu

TL;DR
This paper presents a real-time, privacy-preserving 3D indoor human imaging system using FMCW radar, capable of capturing clear human images and activities without wearable sensors or privacy concerns.
Contribution
It introduces a novel FMCW radar-based method with data preprocessing, signal processing, and deep learning filtering to achieve real-time 3D human imaging in indoor environments.
Findings
Effective background static reflection removal
High-quality 3D human image reconstruction
Real-time recognition of human activities
Taxonomy
Topics: Indoor and Outdoor Localization Technologies · Advanced SAR Imaging Techniques · Microwave Imaging and Scattering Analysis
HICFR: Real Time 3D Indoor Human Image Capturing Based on FMCW Radar
Hanqing Guo, Nan Zhang, Saeed AlQarni, Shaoen Wu
Ball State University
Muncie, IN USA
{hguo, nzhang, saalqarni, swu}@bsu.edu
Abstract
Most smart systems, such as smart home and smart health systems, respond to humans' locations and activities. However, traditional solutions either require wearable sensors or risk leaking privacy. This work proposes an ambient radar solution that is real-time, privacy-preserving, and robust to dark surroundings. In this solution, we use a low-power Frequency-Modulated Continuous Wave (FMCW) radar array to capture the reflected signals and construct 3D image frames from them. The solution comprises a data preprocessing mechanism that removes static background reflections, a signal processing mechanism that transforms the received complex radar signals into a matrix containing spatial information, and a deep learning scheme that filters out broken frames caused by the rough surface of the human body. The solution has been extensively evaluated in a research area and captures real-time human images in which specific activities are recognizable. Our results show that the captured indoor images are clear enough to be recognized frame by frame when compared with camera-recorded video.
I Introduction
Indoor human image capturing is of the utmost importance to many intelligent devices and systems. For example, robots need real-time human images to plan and change their routes, and smart health systems need human images to recognize activities and raise an alert when children or elderly people fall. However, most human image capturing solutions are based on cameras, which makes users concerned about privacy leakage[1, 2]. Hence, human image capturing without computer vision technology has become a very popular research topic.
While traditional camera-based solutions raise privacy issues, wearable sensors have been devised to collect and process human motion data in many smart home scenarios. However, wearable devices are inconvenient because users have to remember to put the sensors on when they wake up and to take them off when typing, washing hands, or exercising strenuously[3]. Worse, while the sensors are taken off, the capture process stalls, so the results are unreliable during that time. It is therefore highly desirable to design a passive, non-invasive, and real-time indoor human image capturing solution that protects privacy and remains convenient. Using radar or Radio Frequency (RF) technology to capture indoor human activities has two further benefits: an RF solution can "see" humans in dark conditions, and RF signals can sense human activities through walls.
In this paper, we propose HICFR, a scheme for Human Image Capturing based on FMCW Radar sensors. It uses an FMCW radar to sense the environment and converts the raw signals in real time into 3D human images that contain the spatial location of the target. It has the following highlights:
It combines an antenna array with FMCW radar to send and receive directional beams that sense the 3D environment. The frequency range is 3.3 GHz to 10 GHz, and it is a very low-power system, with an average transmit power below -40 dBm/MHz.
It has a data cleaning feature: the proposed calibration algorithm records the response of the static environment and removes it from the raw signals, so that only useful human motion responses are retained, which makes the visualization results clearer.
It uses a deep learning algorithm to process the raw 3D images. The model is trained to recognize whether the current frame is caused by irregular reflections and, if so, to filter it out so that the real-time captured stream is continuous.
Its design leverages readily available low-cost devices and achieves reliable performance in real-time applications.
In the rest of this paper, Section II reviews the literature related to our work. Section III gives an overview of the system structure, including the system platform, the signal processing chain, and the removal of reflections from the surroundings. Section IV presents the technical details of combining FMCW radar with an antenna array to collect a 3D array that represents the power at specific spatial voxels. Section V presents a novel solution that uses deep learning to recognize and remove bad reflection frames, and Section VI evaluates the performance and feasibility of the whole system.
II Related work
Related work in this field includes traditional sensor solutions, computer vision solutions, and RF solutions. All of these have recently been investigated for many smart home applications.
II-A Sensors and Computer Vision solutions
Accelerometers and gyroscopes[4] are the most common sensors used to collect human motion data. In addition, the inertial measurement unit (IMU), which combines an accelerometer and a gyroscope, is widely used in wearable devices[5]. These sensors measure the linear acceleration, rotation angle, and angular velocity of a target, so motion data can be collected from the person who wears them. Based on the raw data collected by these sensors, researchers have proposed various algorithms to recognize human activities[6, 7, 8, 9]. Zhang et al. designed physical features based on the physical parameters of human motion and then identified the features most critical to human activities in order to improve recognition accuracy[6, 7]. Years later, Attal et al. investigated k-Nearest Neighbor (k-NN), Support Vector Machine (SVM), and Gaussian Mixture Model (GMM) classification techniques for processing the raw data and identified the best scheme for recognizing human activities[8]. Recently, researchers from the UK explored how to use deep, convolutional, and recurrent models to detect human activities[9]. In the meantime, many algorithms based on computer vision techniques have been proposed[10, 11, 12, 13, 14]. Sung et al. used RGB-D images to detect and track human motion; the images and their depth information were processed to capture the human[10, 13]. Jalal's team proposed a solution that uses translation- and scaling-invariant features extracted from depth videos to recognize human logging activity[11]. More recently, Kinect has been widely used to collect human motion data because of its abundant APIs; researchers have trained machine learning models to segment Kinect real-time video and capture coarse human outlines and motions[12, 14].
II-B RF Solutions
Radar sensors and RF devices are usually used for military or wireless communication purposes. However, they have recently been considered for smart home applications because they keep data confidential and their performance does not depend on lighting conditions. Recent research is based either on FMCW radar [15, 16, 17, 18] or on off-the-shelf devices [19, 20, 21]. Adib et al., a research group at the Massachusetts Institute of Technology (MIT), designed a MIMO antenna sensor with the FMCW technique to detect human movement [15] and even to capture the human figure through a wall [17, 16]. Off-the-shelf devices such as ultrasonic sensors and Walabot [22] have also been investigated for recognizing human activities [19, 20, 21]. Avrahami et al. proposed a human activity recognition scheme based on 2D heat maps generated by Walabot, while Zhu et al. [20] applied traditional signal processing algorithms to filter and cluster raw data and thereby recognize human actions. Both achieve accuracy higher than 80% in their research tasks.
III System Design Overview
To enable real-time human image capturing with ambient radar in a smart home scenario, we propose the Human Image Capturing based on FMCW Radar sensors (HICFR) system. HICFR scans the reflections of its 3D surroundings with FMCW chirps and a 2D antenna array: the FMCW chirps are used to compute the direct distance from the detected object to the receive antenna, and the 2D antenna array is used to identify spatial directions. HICFR emits FMCW chirps to scan the 3D volume of its surroundings, and the received signals are then processed to remove fixed reflections from the environment. After that, we calculate the reflection power of every scanned voxel and construct 3D images from these powers. Finally, a Deep Neural Network (DNN) based filtering algorithm addresses the problem of non-continuous reflection frames, so that real-time human activities are captured.
III-A Sensing Platform
Since HICFR needs to emit FMCW chirps and collect the received signals with a 2D antenna array, an off-the-shelf radar sensor called Walabot [22] meets our requirements. Walabot is compact and low-cost, with a board size of […] and an average power below -41 dBm/MHz. The FMCW chirps emitted by Walabot span 3.3 GHz to 10 GHz, which is sufficient to measure direct distances within a range of 10 meters based on the slope of the FMCW chirp. It also contains 18 pairs of antennas, arranged as a 2D antenna array.
Figure 1 shows the internal antenna array of Walabot. Walabot emits FMCW chirps to scan $\phi$ in the horizontal direction and $\theta$ in the vertical direction. It then sends the raw signals to HICFR over a USB port for further processing. The area scanned by Walabot is shown in figure 2:
Here $\theta$ is the elevation angle, which detects the height of the human, and $\phi$ is the wide angle, which captures the width of the human. $R$ is the distance the FMCW signal travels from the transmit antenna to the human's head; it is the hypotenuse of a triangle whose angle is $\theta$, and as this hypotenuse rotates through $\phi$ degrees, the sector swept by the triangle is the scan range. In our case, $\theta$ ranges from […] to […] and $\phi$ ranges from […] to […]. The direct travel distance $R$ can be calculated from the FMCW properties with equation (1) and figure 3:
$$R = \frac{c\,\Delta t}{2} = \frac{c\,\Delta f}{2k} \tag{1}$$
Here $\Delta t$ is the time the signal takes to travel from the transmit antenna to the object and reflect back to the receive antenna, and $\Delta f$ is the frequency difference between the transmitted and received signals. $k$ is the slope of the transmitted (or echo) frequency chirp, and $c$ is the speed of light. To simplify the description, equation (1) and figure 3 do not consider the Doppler frequency shift.
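As a quick numeric illustration of the FMCW range relation, the sketch below converts a beat frequency into a one-way distance. It assumes that the travel time is a round trip, so the distance is half of the path length. The swept bandwidth matches the 3.3 GHz to 10 GHz range quoted above, while the chirp duration and the function name `range_from_beat` are assumptions made only for this example.

```python
C = 299_792_458.0  # speed of light in m/s

def range_from_beat(delta_f_hz: float, slope_hz_per_s: float) -> float:
    """One-way distance for a given beat frequency and chirp slope."""
    round_trip_time = delta_f_hz / slope_hz_per_s
    return C * round_trip_time / 2.0

# Assumed chirp: 6.7 GHz swept in 1 ms (the duration is hypothetical).
slope = 6.7e9 / 1e-3
# A reflector 3 m away delays the echo by 2 * 3 / C seconds.
beat = slope * (2 * 3.0 / C)
print(range_from_beat(beat, slope))
```

The steeper the chirp slope, the larger the beat frequency for the same distance, which is why a wide swept bandwidth helps separate nearby reflectors.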
III-B System Flowgraph
The complete HICFR system contains three phases: 1) data collection and calibration, 2) coarse visualization, and 3) fine visualization. As shown in figure 4, an environment sensing period runs ahead of detection: HICFR first emits FMCW chirps and records the static background reflections, and when it starts the object detection task, its 2D antenna array collects the raw signals. The second phase converts signals to images. Since HICFR scans its 3D surroundings with the parameters $R$ (distance), $\theta$ (elevation angle) and $\phi$ (wide angle), the raw received signals are processed to represent the power at every spatial point with a different $R$, $\theta$ and $\phi$, namely every voxel in the scanned area. HICFR then subtracts the recorded background reflection power, removing the fixed reflections of the environment to obtain the pure power information of the objects that changed. The result is a 3D matrix with dimension $N_R \times N_\theta \times N_\phi$, where each $N$ can be computed by equation (2).
$$N_x = \frac{D_x}{\delta_x}, \quad x \in \{R, \theta, \phi\} \tag{2}$$
In the equation above, $D_x$ is the detection range of parameter $x$, and $\delta_x$ is the designated sampling interval of that parameter. Next, we propose a novel solution to achieve coarse-to-fine visualization. Because the human body acts as an uneven reflector rather than a scatterer, some signals reflect directly back to the antenna array, while others are deflected from the normal path or even away from the receive antennas. In this case, the constructed 3D images may contain ambiguous results. The third phase addresses this issue with a machine learning algorithm. We collect a dataset of regular-reflection and deflected-reflection images, and then train a Deep Neural Network (DNN) containing convolutional, pooling, and linear layers to distinguish them. The trained DNN is placed in the main program loop, where it eliminates ambiguous frames from the real-time stream.
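To make the size of the power matrix concrete, the sketch below computes the number of samples along each scan axis as the detection range divided by the sampling interval. The scan settings and the helper name `axis_samples` are hypothetical and chosen only for illustration; they are not the values used in the paper.

```python
def axis_samples(lo: float, hi: float, step: float) -> int:
    """Number of sampling intervals that fit in the detection range [lo, hi]."""
    return int(round((hi - lo) / step))

# Hypothetical scan settings (not taken from the paper): (min, max, interval)
arena = {
    "R": (10.0, 300.0, 10.0),     # distance in cm
    "theta": (-20.0, 20.0, 5.0),  # elevation angle in degrees
    "phi": (-45.0, 45.0, 5.0),    # wide angle in degrees
}
shape = tuple(axis_samples(*arena[name]) for name in ("R", "theta", "phi"))
print(shape)
```

Halving any sampling interval doubles the voxel count along that axis, so the intervals trade image resolution against per-frame processing time.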
As shown above, the key challenges and main contributions of HICFR are to 1) compute the reflection power of every voxel at distance $R$ and angles $\theta$ and $\phi$ based on the complex signals received by the antenna array, 2) construct 2D/3D images from the resulting 3D power matrix, and 3) address the ambiguous images caused by signal deflection with a Deep Neural Network and achieve a real-time filtering scheme.
IV Calibration and Visualization
In this section, we dive into the technical details of HICFR. Since the Walabot antenna array collects RF signals, which are complex signals, they can be represented by amplitude and phase as follows:
$$s_t = A_t\, e^{-j 2\pi \frac{d}{\lambda}} \tag{3}$$
Here $s_t$ is the signal received at moment $t$, $A_t$ is the amplitude of the signal at time $t$, $d$ is the travel distance of the signal, and $\lambda$ is the signal's wavelength. Since the received phase is a linear function of the travel distance, $2\pi d/\lambda$ is the phase accumulated by the signal when it reaches the receive antenna at moment $t$.
Revisiting equation 3: because the receiver is an antenna array, the received signal needs a more detailed form that specifies which receive antenna received it. We denote it $s_{n,t}$, where $n$ is the antenna number and $s_{n,t}$ is the signal received by antenna $n$ at moment $t$.
Another parameter that needs to be clarified is the travel distance $d$. Since the human body is a surface rather than a point, it reflects signals from different directions to all antennas. The signal received by one antenna at moment $t$ contains the reflections of more than one point, so $d$ varies across multiple reflection points. Figure 5(a) shows the antenna array scanning a human body: his left hand reflects to the antenna array along the blue dotted line, and his right hand along the red dotted line. Based on the above description, equation 4 is formulated as follows:
$$s_{n,t} = \sum_{p=1}^{P} A_{p,t}\, e^{-j 2\pi \frac{d_{p,n}}{\lambda}} \tag{4}$$
Suppose $p$ indexes the points on the detected object; then $P$ is the number of points being scanned, $A_{p,t}$ is the amplitude of the reflection from point $p$ at time $t$, $d_{p,n}$ is the distance the signal travels from point $p$ to antenna $n$, and $\lambda$ is the signal wavelength.
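The superposition described above can be checked numerically. The sketch below sums the reflections of several points at a single antenna, with each reflection modeled as an amplitude times a phase that depends on its travel distance. The wavelength, the distances, and the function name `received_signal` are assumptions chosen for the example.

```python
import numpy as np

WAVELENGTH = 0.05  # metres; assumed value (roughly a 6 GHz carrier)

def received_signal(amplitudes, distances, wavelength=WAVELENGTH):
    """Sum of per-point reflections: amplitude times a distance-dependent phase."""
    amplitudes = np.asarray(amplitudes, dtype=float)
    distances = np.asarray(distances, dtype=float)
    return np.sum(amplitudes * np.exp(-2j * np.pi * distances / wavelength))

# Two reflection points (say, left and right hand) seen by one antenna.
one_point = received_signal([1.0], [2.000])
two_points = received_signal([1.0, 1.0], [2.000, 2.025])
print(abs(one_point), abs(two_points))
```

The two paths differ by half a wavelength, so the reflections nearly cancel at this antenna. A single sample therefore cannot separate the points, which is why the array and the chirp are both needed to resolve direction and distance.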
IV-A Compute Voxel Power
Power of Direction: Based on equations 3 and 4, the problem can be stated as follows: given the signals $s_{n,t}$ received by antenna $n$ at moment $t$, compute the reflection power of every scanned point. Both the angle and the distance of a point are reflected in the phase of the received signal. More specifically, the power at a specific angle can be inferred from the properties of the antenna array, while the power at a specific distance can be calculated from the FMCW properties. Revisiting figure 5(a) and redrawing the antenna array panel as a plane figure, the antennas receive the reflection from a point $p$, and the arrival direction of the beam is $\theta$, as shown in both figure 5(b) and figure 2. $\theta$ also denotes the angle between each antenna and $p$, and $l$ is the distance between two adjacent antennas. The power of direction $\theta$ can thus be expressed as in equation 5:
$$P(\theta) = \left| \sum_{n=1}^{N} s_{n,t}\, e^{j 2\pi \frac{n\, l \cos\theta}{\lambda}} \right| \tag{5}$$
Here $N$ is the number of antennas in the dimension. The signal from $p$ travels a different distance to each antenna, and the difference can be represented by $n\,l\cos\theta$, depicted in light blue. The phase change at antenna $n$ is thus $2\pi n\,l\cos\theta/\lambda$, where $\lambda$ is the signal wavelength.
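The direction power can be sketched for a simulated uniform linear array. The example assumes a far-field reflector, half-wavelength antenna spacing, and an arrival angle measured from the array axis; all numeric values and the name `direction_power` are illustrative assumptions. Antennas are indexed from 0 here, which only changes a global phase.

```python
import numpy as np

def direction_power(samples, spacing, wavelength, thetas):
    """|sum_n s_n * exp(j*2*pi*n*l*cos(theta)/lambda)| for each candidate theta."""
    n = np.arange(len(samples))
    steering = np.exp(2j * np.pi * np.outer(np.cos(thetas), n) * spacing / wavelength)
    return np.abs(steering @ samples)

wavelength, spacing, num_antennas = 0.05, 0.025, 8   # assumed values
true_theta = np.deg2rad(60.0)
n = np.arange(num_antennas)
# Far-field echo: antenna n sees a path that differs by n*l*cos(theta).
samples = np.exp(-2j * np.pi * n * spacing * np.cos(true_theta) / wavelength)

thetas = np.deg2rad(np.arange(0.0, 181.0, 1.0))
power = direction_power(samples, spacing, wavelength, thetas)
print(np.rad2deg(thetas[np.argmax(power)]))
```

The power peaks where the compensating phase undoes the per-antenna path difference, so all antenna contributions add coherently at the true arrival angle.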
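Power of Distance: The travel distance of a signal is also related to the direct distance from point $p$ to antenna $n$. Frequency Modulated Continuous Wave measures reflection depth by calculating the frequency shift between the transmitted and received chirps. Equation 1 shows this FMCW property. We define $k$ as the slope of the frequency chirp versus time, where $k$ is equal to $\Delta f / \Delta t$ in figure 3. The power of distance can therefore be calculated from the phase change, as shown in equation 6:
$$P(d, \theta) = \left| \sum_{n=1}^{N} \sum_{t=1}^{T} s_{n,t}\, e^{j 2\pi \frac{k d}{c} t}\, e^{j 2\pi \frac{n\, l \cos\theta}{\lambda}} \right| \tag{6}$$
where $d$ is the distance the signal travels from point $p$, $T$ is the duration of each chirp, $l$ is the spacing between adjacent antennas, $\lambda$ is the signal wavelength, and $c$ is the speed of light. Because $\Delta f = k\,\Delta t$ and $\Delta t = d/c$, the phase change at time $t$ is $2\pi k d t / c$, and the power $P(d, \theta)$ is a summation over the duration $T$ and the total number of antennas $N$.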
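The distance term can be illustrated on its own with a single antenna. The sketch below simulates the beat signal of one reflector and scans candidate travel distances, summing over the samples of one chirp. The chirp slope, sampling rate, target distance, and the name `distance_power` are assumptions for the example, and the distance is the total path travelled by the signal.

```python
import numpy as np

C = 299_792_458.0          # speed of light, m/s
SLOPE = 6.7e9 / 1e-3       # chirp slope k in Hz/s (assumed 6.7 GHz over 1 ms)
times = np.arange(1000) * 1e-6   # 1000 samples across the assumed 1 ms chirp

def distance_power(samples, times, candidates):
    """|sum_t s_t * exp(j*2*pi*k*d*t/c)| for each candidate travel distance d."""
    phase = 2j * np.pi * SLOPE * np.outer(candidates, times) / C
    return np.abs(np.exp(phase) @ samples)

true_travel = 6.0   # metres travelled by the signal (assumed target)
samples = np.exp(-2j * np.pi * SLOPE * true_travel * times / C)

candidates = np.arange(0.0, 12.0, 0.01)
power = distance_power(samples, times, candidates)
print(candidates[np.argmax(power)])
```

Combining this sum over time with the sum over antennas gives a power value for every pair of distance and angle, which is the 2D case of the voxel power.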
Power of Voxel: Since HICFR scans 3D surroundings, reconsider the situation shown in figure 5(b), where the reflection point lies in the same plane as the antennas. A point in a 3D volume, however, needs three parameters to locate it, either $(R, \theta, \phi)$ in a spherical coordinate system or $(x, y, z)$ in a Cartesian coordinate system. We choose the spherical coordinate system because the power along $\theta$ and $\phi$ can be calculated with our 2D antenna array. Figure 6 shows how it works:
The 2D antenna array is on panel, where blocks is antenna. and are distance between two antenna in two dimensions. panel is the dimension drawn in figure 5(b), and is in equation 5, while is elevation angle from panel to . In 3D figure, is mapping to pannel as with , and it is mapping to panel as with , where is wide angle from to axis. Thus the distance change at panel for each antenna is as blue line, that change at panel is as light blue line shows.
[TABLE]
Since distance change represents phase change of signals, then we can calculate power of any voxel by equation 7. is the signal received by receive antenna from transmit antenna at time .
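A common way to write such a voxel power for a planar array is sketched below. It is a sketch under assumed conventions and not necessarily the exact form of equation 7: the array is assumed to lie in the $xy$-plane with antenna spacings $l_x$ and $l_y$ (hypothetical names), $\theta$ is measured from the array normal, $\phi$ is measured from the $x$-axis, $d$ is the signal travel distance associated with the voxel, $k$ is the chirp slope, and $s_{n,m,t}$ is the sample at time $t$ for array element $(n, m)$.

$$P(d, \theta, \phi) = \left| \sum_{m=1}^{M} \sum_{n=1}^{N} \sum_{t=1}^{T} s_{n,m,t}\, e^{j 2\pi \frac{k d}{c} t}\, e^{j \frac{2\pi}{\lambda} \sin\theta \,(n\, l_x \cos\phi + m\, l_y \sin\phi)} \right|$$

The first exponential compensates the distance-dependent beat frequency, and the second compensates the path difference of each element along the two array dimensions, so the terms add coherently only for a reflector located in the voxel being evaluated.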
IV-B Construct 3D Image
Remove Background Reflection: To get rid of reflections from the environment, such as desks or walls, HICFR runs a sensing process, called calibration, before it captures humans. Since the background is static and its reflection power is fixed, calibration senses, calculates, and records the background reflection power of every voxel. Afterwards, when HICFR starts the human image capturing task, it subtracts the static background reflection power from the real-time reflection power. We need to make sure that no human enters the lab during the calibration period.
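The calibration step can be sketched with synthetic power matrices. Averaging several empty-room frames and clipping negative residue to zero are assumptions of this example rather than details confirmed by the text, and the names `calibrate` and `remove_background` are hypothetical.

```python
import numpy as np

def calibrate(frames):
    """Average several empty-room power frames into one background estimate."""
    return np.mean(np.stack(frames), axis=0)

def remove_background(frame, background):
    """Subtract the static background power and clip negative residue to zero."""
    return np.clip(frame - background, 0.0, None)

rng = np.random.default_rng(0)
static = rng.uniform(0.0, 1.0, size=(4, 3, 5))      # walls, desks, ...
background = calibrate([static, static, static])     # empty-room recordings
live = static.copy()
live[2, 1, 3] += 5.0                                 # a person appears in one voxel
foreground = remove_background(live, background)
print(np.argwhere(foreground > 1e-9))
```

Only the voxel whose power changed after calibration survives the subtraction, which is what keeps furniture and walls out of the final image.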
Construct 2D/3D Image: Once HICFR has calculated the power of every voxel and removed the background reflection power, it obtains a 3D matrix $P$ with dimension $(N_R \times N_\theta \times N_\phi)$, where each $N$ can be obtained from equation 2. A 2D image can be built from $(R, \theta)$, $(R, \phi)$, or even $(\theta, \phi)$. To give the 2D image a clear meaning, we choose to construct it from the distance $R$ and the wide angle $\phi$. First, we find the highest power in $P$. Suppose the highest reflection power comes from the point at $(R_0, \theta_0, \phi_0)$; then $P(R_0, \theta_0, \phi_0)$ is the highest value in $P$, and $P(R, \theta_0, \phi)$ is a 2D array because the parameter $\theta$ is fixed at $\theta_0$. We then draw a 2D heatmap based on $P(R, \theta_0, \phi)$, where the color shows the intensity of the reflection power: the darker the color, the higher the reflection power at $(R, \phi)$. Figure 7(a) shows a 2D image capturing scenario and its corresponding heatmap.
As can be seen from figure 7(a), $\phi$ ranges from […] to […], where $\phi$ is the angle from the dashed blue line to the human. Here the dashed blue line is the baseline through the middle of Walabot, so $\phi$ is the wide angle from the baseline to the object. While the 2D image depicts only the highest-power layer at the fixed $\theta_0$, the 3D image shows more information about the width, height, and location of the object. Figure 7(b) shows how the 2D images are assembled into 3D images. HICFR uses the marching cubes algorithm to draw the vertices and faces of the stacked images, and then uses a normalized filter to remove low-power points. In the 3D view, it is clear that the human is at a shorter direct distance from the radar and that the human is taller than the chairs.
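Selecting the 2D layer described above amounts to locating the global maximum of the power matrix and slicing at its elevation index. The sketch below assumes the axis order (distance, elevation angle, wide angle); the function name `strongest_slice` and the toy data are illustrative only.

```python
import numpy as np

def strongest_slice(power):
    """Return the (R, phi) slice at the theta index holding the global maximum."""
    r0, theta0, phi0 = np.unravel_index(np.argmax(power), power.shape)
    return power[:, theta0, :], (r0, theta0, phi0)

power = np.zeros((6, 4, 5))          # assumed axis order: (R, theta, phi)
power[2, 1, 3] = 9.0                 # strongest reflection
power[4, 1, 0] = 2.0                 # weaker reflection in the same theta layer
image, peak = strongest_slice(power)
print(image.shape, tuple(int(i) for i in peak))
```

For the 3D view, one possible implementation of the surface extraction step is `skimage.measure.marching_cubes` from scikit-image applied to the full power matrix, although the text does not say which implementation HICFR uses.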
V Filter Reflection
Another challenge of real-time human image capturing is signal deviation. The human body is not a planar surface, and when a human moves, the surface of the body deforms strongly. As a result, when our antenna array transmits signals and scans the human body, only the signals that hit the body close to the surface normal are reflected back toward the antennas. Other signals may be deflected along other routes before they return to the receiver, which makes our antennas "misunderstand" the real distance and angle of the object. This scenario is shown in figure 9:
In this case, the distance between the human's chest and leg is not large. However, signals transmitted from the antenna array travel to the human's leg and deviate from their incoming route, so the receive antenna gets signals from […], where […]. The antenna thus "misunderstands" the position of the human's leg, assigning it a wrong distance and wrong angles, which results in a deformed 3D shape. To address this issue, we design a Deep Neural Network that recognizes whether the current 3D figure is deformed and removes deformed figures from the image capture stream.
DNN Recognition: We use transfer learning to solve this problem more efficiently. Because this is an image processing problem, a suitable Deep Neural Network should have convolutional layers to reduce the number of parameters and the amount of computation. We therefore choose ResNet18 to classify our 3D images. Our contributions are to 1) collect regular and ambiguous images as a training dataset, 2) change the network structure of ResNet18 so that the DNN converges faster, and 3) load the trained DNN parameters in real time and handle the recognition task in the main stream.
We collect the training dataset from real human activities: while one person walks around in the lab, we construct 3D images and concatenate them into 3D videos. We then manually classify the frames into two categories: regular frames and ambiguous frames. Figure 8 shows samples from the dataset: figure 8(a) shows regular 3D reflection power, and figure 8(b) shows ambiguous images. As can be seen from the dataset, the regular frames show the 3D position of the human very clearly, while the ambiguous frames always "misunderstand" the location of some part of the human body.
Change ResNet18 Structure: The first step in applying transfer learning is to change the last Fully Connected (FC) layer. In the standard ResNet18, the last FC layer has dimension $512 \times 1000$, which means that the input to the FC layer has $512$ features and the output has $1000$ features. The $1000$ output features are usually fed into a softmax function to be classified into $1000$ categories. In our design, we have only $2$ categories: regular and ambiguous. We therefore change the dimension of the last FC layer to $512 \times 2$. The second change to the original ResNet18 concerns the pooling layer before the FC layer. ResNet18 uses an average pooling layer to compress each feature map to $1 \times 1$, but average pooling sometimes cannot extract good features because it takes all values into account and outputs their average. Our dataset images have strong edges, and max pooling extracts the most important, or extreme, features. We therefore replace the pooling layer with a max pooling layer of the same size and compare the convergence of the two.
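The difference between the two pooling choices can be seen on a toy feature map. The sketch below is not the network itself: it only compares global average pooling with global max pooling on a synthetic map that contains one thin, strong edge response. The map size and function names are assumptions for illustration.

```python
import numpy as np

def global_avg_pool(feature_map):
    """Compress an (H, W) feature map to one value by averaging."""
    return float(feature_map.mean())

def global_max_pool(feature_map):
    """Compress an (H, W) feature map to one value by keeping the maximum."""
    return float(feature_map.max())

# A 7x7 map with one thin, strong edge response and zeros elsewhere.
edge_map = np.zeros((7, 7))
edge_map[:, 3] = 1.0
print(global_avg_pool(edge_map), global_max_pool(edge_map))
```

Averaging dilutes the edge response to one seventh of its strength, while max pooling keeps it at full strength. If the network is built with torchvision's `resnet18` (an assumption, since the text does not name the framework), the two structural changes would typically correspond to assigning `torch.nn.Linear(512, 2)` to `model.fc` and `torch.nn.AdaptiveMaxPool2d((1, 1))` to `model.avgpool`.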
Train DNN: The DNN is trained with mini-batch strategy to make it converge more smoothly. We use as loss function shown in equation 8. Where is output of DNN, whose dimension is , and is labels for one minibatch data with dimension . We use SGD optimizer to update parameters with and , and a is applied to adjust learning rate with and . Then we compare running loss and accuracy of each iterations in figure 11. Note that running loss and accuracy will be cleared after one epoch.
[TABLE]
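The loss function and the hyperparameter values are not specified in the text above, so the following sketch makes two explicit assumptions: a cross-entropy loss over the two output classes, which is the usual choice for this kind of classifier, and a step schedule that multiplies the learning rate by a fixed factor every few epochs. All numeric values and the names `cross_entropy` and `step_lr` are hypothetical.

```python
import math

def cross_entropy(logits, label):
    """-log(softmax(logits)[label]) for one sample, computed stably."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]

def step_lr(base_lr, epoch, step_size, gamma):
    """Learning rate after multiplying by gamma once every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)

# Two classes: index 0 = regular frame, index 1 = ambiguous frame.
batch_logits = [[2.0, 0.0], [0.0, 2.0]]
batch_labels = [0, 0]
losses = [cross_entropy(x, y) for x, y in zip(batch_logits, batch_labels)]
print(sum(losses) / len(losses))
print([step_lr(0.001, e, step_size=7, gamma=0.1) for e in (0, 7, 14)])
```

The first sample is classified correctly and contributes a small loss, while the second is misclassified and dominates the mini-batch average, which is what drives the parameter update.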
Figure 11(a) shows the performance of the original ResNet18, and figure 11(b) shows the result of our DNN. It is clear that our DNN converges faster and fluctuates less strongly than the original ResNet18.
VI Performance
The whole process of HICFR is shown in figure 10. The first row records the real human motions, the second row shows the results before filtering, and the third row shows the final result of human detection. At the very beginning, the human is standing to the right of the radar at a wide angle of […], and the cubes in rows 2 and 3 stand around […] and […]. As the human moves closer to the radar from frame […], our captured images show the wide angle and direct distance decreasing gradually. As the human moves away from the radar, the wide angle and direct distance increase. During this time, the frame before the last frame is a "bad frame", so our DNN detects and recognizes the "misunderstanding" and holds the previous frame in place of the current one. More experiments were designed to test whether the captured stream can be used to recognize human activities such as walking, running, jumping, and falling. Our results show that all activity streams captured by HICFR can be easily recognized by humans, with an accuracy of more than 90%.
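The frame-holding behavior described above can be sketched as a small stream filter. The classifier is replaced by a stand-in predicate, and frames are plain strings rather than 3D images; the function name `hold_last_good` and the choice to drop leading bad frames (when there is nothing to hold yet) are assumptions of this example.

```python
def hold_last_good(frames, is_ambiguous):
    """Replace every frame flagged as ambiguous by the most recent good frame."""
    output, last_good = [], None
    for frame in frames:
        if not is_ambiguous(frame):
            last_good = frame
        if last_good is not None:
            output.append(last_good)   # good frame, or the held previous one
    return output

# Stand-in classifier: frames are labelled strings here instead of 3D images.
stream = ["f1", "f2", "bad", "f4", "bad", "bad", "f7"]
print(hold_last_good(stream, lambda f: f == "bad"))
```

In a real-time loop, the predicate would be the trained DNN applied to each newly constructed 3D frame, so the displayed stream stays continuous even when individual frames are rejected.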
VII Conclusion
In summary, we propose a real-time 3D human image capturing scheme based on radar. This solution not only localizes the human position precisely in the stream but also protects the human's privacy. The different human activities captured by our radar system can be easily recognized by human eyes. Our future work will focus on designing a Recurrent Neural Network (RNN) to recognize human activities from our visualization results in real time.
References
- [1] Daphne Townsend, Frank Knoefel, and Rafik Goubran. Privacy versus autonomy: a tradeoff model for smart home monitoring technologies. In Engineering in Medicine and Biology Society, EMBC, 2011 Annual International Conference of the IEEE, pages 4749–4752. IEEE, 2011.
- [2] J Sathish Kumar and Dhiren R Patel. A survey on internet of things: Security and privacy issues. International Journal of Computer Applications, 90(11), 2014.
- [3] JA Stankovic, Q Cao, T Doan, L Fang, Z He, R Kiran, S Lin, S Son, R Stoleru, and A Wood. Wireless sensor networks for in-home healthcare: Potential and challenges. In High confidence medical device software and systems (HCMDSS) workshop, volume 2005, 2005.
- [4] Subhas Chandra Mukhopadhyay. Wearable sensors for human activity monitoring: A review. IEEE Sensors Journal, 15(3):1321–1330, 2015.
- [5] Norhafizan Ahmad, Raja Ariffin Raja Ghazilla, Nazirah M Khairi, and Vijayabaskar Kasi. Reviews on various inertial measurement unit (IMU) sensor applications. International Journal of Signal Processing Systems, 1(2):256–262, 2013.
- [6] Mi Zhang and Alexander A Sawchuk. A feature selection-based framework for human activity recognition using wearable multimodal sensors. In Proceedings of the 6th International Conference on Body Area Networks, pages 92–98. ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering), 2011.
- [7] Mi Zhang and Alexander A Sawchuk. Motion primitive-based human activity recognition using a bag-of-features approach. In Proceedings of the 2nd ACM SIGHIT International Health Informatics Symposium, pages 631–640. ACM, 2012.
- [8] Ferhat Attal, Samer Mohammed, Mariam Dedabrishvili, Faicel Chamroukhi, Latifa Oukhellou, and Yacine Amirat. Physical human activity recognition using wearable sensors. Sensors, 15(12):31314–31338, 2015.
