Towards Digital Retina in Smart Cities: A Model Generation, Utilization and Communication Paradigm
Yihang Lou, Ling-Yu Duan, Yong Luo, Ziqian Chen, Tongliang Liu, Shiqi Wang, Wen Gao

TL;DR
This paper proposes a comprehensive paradigm for digital retina systems in smart cities, integrating model reuse, prediction, and communication strategies to enhance AI-driven visual data processing and analysis.
Contribution
It introduces an integrated framework for model generation, utilization, and communication, addressing challenges in large-scale visual data analysis in smart cities.
Findings
Enhanced feasibility of digital retina with deep learning model reuse
Improved processing efficiency for large-scale visual data
Demonstrated effectiveness through experimental validation
Table 2: mAP (%) on Duke when reusing two source models under different scales, with the baseline and single-model reuse results for reference.

| Scale | MSMT+CUHK | Market+CUHK | MSMT+Market |
|---|---|---|---|
| 1 | 37.91 | 39.19 | 39.63 |
| 2 | 37.42 | 39.23 | 40.16 |
| 4 | 39.37 | 39.43 | 38.70 |
| 5 | 39.39 | 39.22 | 38.96 |
| 8 | 39.33 | 38.49 | 39.18 |
| 16 | 38.37 | 38.20 | 38.84 |

| Baseline | MSMT only | CUHK only | Market only |
|---|---|---|---|
| 31.53 | 38.52 | 36.98 | 38.09 |
Table 3: Results of multi-model reuse on Duke (%), compared with state-of-the-art methods.

| Model | mAP | Rank-1 |
|---|---|---|
| Triplet [15] | 34.31 | 54.30 |
| PCB [12] | 36.62 | 57.05 |
| DefenseTriplet [18] | 35.96 | 55.97 |
| AlignedReID [19] | 35.35 | 55.38 |
| AlignedReID+Mutual Learning [19] | 36.60 | 55.48 |
| Softmax Baseline | 31.53 | 49.55 |
| +MSMT | 38.52 | 58.61 |
| +CUHK | 36.98 | 58.34 |
| +Market | 38.09 | 57.40 |
| +MSMT+CUHK | 39.93 | 59.87 |
| +Market+CUHK | 39.22 | 60.18 |
| +MSMT+Market | 40.16 | 59.87 |
| +MSMT+CUHK+Market | 41.24 | 61.04 |
Table 4: mAP (%) under decreasing compression bits. The DoM columns compress the difference between consecutive model versions; the Model columns compress each model on its own.

| Compression bits | DoM(V1-V0) | DoM(V2-V1) | DoM(V3-V2) | Model-V1 | Model-V2 | Model-V3 |
|---|---|---|---|---|---|---|
| original | 39.42 | 40.27 | 41.24 | 39.42 | 40.27 | 41.24 |
| 7 | 39.42 | 40.27 | 41.24 | 39.42 | 40.27 | 41.24 |
| 6 | 39.42 | 40.27 | 41.25 | 39.41 | 40.26 | 41.23 |
| 5 | 39.44 | 40.25 | 41.25 | 39.36 | 40.19 | 41.17 |
| 4 | 39.36 | 39.76 | 40.73 | 38.54 | 39.02 | 40.06 |
| 3 | 33.64 | 36.48 | 39.85 | 0.16 | 0.16 | 0.16 |
Abstract
The digital retina in smart cities is to select what the City Eye tells the City Brain, and convert the acquired visual data from front-end visual sensors to features in an intelligent sensing manner. By deploying deep learning and/or handcrafted models in front-end devices, the compact features can be extracted and subsequently delivered to back-end cloud for search and advanced analytics. In this context, we propose a model generation, utilization, and communication paradigm, aiming to address a set of unique challenges for better artificial intelligence services in smart cities. In particular, we present an integrated multiple deep learning models reuse and prediction strategy, which greatly increases the feasibility of the digital retina in processing and analyzing the large-scale visual data in smart cities. The promise of the proposed paradigm is demonstrated through a set of experiments.
**Index Terms:** Digital retina, model reuse, model communication, model compression
1 Introduction
The retina is a crucial and indispensable component of our visual system: it converts light signals into neuronal representations and acts as a filter that conveys only the required, meaningful visual information to the brain [1][2]. In the retina, the rod and cone cells are responsible for vision at low and high light levels, respectively, and the cones also support color vision. The photosensitive retinal ganglion cells extract complex features to complete the perception process. As such, the retina not only perceives visual information but also works as a highly efficient visual data processing engine in the central nervous system. Analogous to the biological retina, as illustrated in Fig. 1, the digital retina [3], which intelligently senses and processes the visual data in the City Brain, aims to revolutionize the artificial vision system of smart cities.
In particular, the core operation of the digital retina is feature extraction, which relies on deep learning and/or handcrafted models in the front-end visual sensors to directly extract and compress features, such that the compact features can be efficiently sent to the central server for visual analysis. To facilitate this process, the model, which is usually learned at the central server by leveraging the visual data, is the core component of the City Brain, and the learned models are subsequently delivered to the front-end devices for feature extraction and compression. As such, model generation, utilization and communication are essential in establishing the digital retina, especially because the collected visual data vary widely in location, time and ambient environment. However, how effective models can be feasibly generated by leveraging the massive existing models from different domains has not been fully explored.
In this paper, we focus on an integrated solution for effective model generation and efficient transmission that exploits cross-domain and inter-model relationships for the construction of the digital retina. The transfer of learning has been studied for decades from the psychological [4] and educational [5] perspectives, and it has been widely recognized that learning from prior experience, or knowledge transfer, can help human learning and improve performance. This motivates us to design a novel model transfer/reuse module that is particularly suitable for the proposed digital retina system. Besides, the energy cost of information transmission between neurons in the human brain is usually very low due to the selective and adaptive regulation mechanism [6]. This inspires us to propose a low-transmission-cost strategy that can deliver learning models incrementally and adaptively.
Furthermore, with the constantly generated data in smart cities, the models undergo generation, updating, and distribution. In particular, for the same or similar tasks, there exist high correlations among different models, even when they are trained on data from different domains. In view of this, we aim to make efficient use of the models that operate the artificial intelligence applications in the City Brain from two perspectives.
- The existing models can be reused for model generation, even when the source and target models are in different domains.
- These models can be efficiently delivered for better utilization and management.
In this work, we aim to explore more efficient and effective model utilization, management and distribution methodologies in the context of digital retina. The main contributions of this paper are summarized as follows:
- We propose a novel model generation, utilization and communication paradigm towards digital retina to better construct the artificial vision system in smart cities.
- We explore an efficient model reuse strategy to learn more discriminative and domain adaptive models.
- We develop a scalable model communication scheme by removing the inter-model redundancy to deliver the newly generated models at low transmission cost.
2 Related Work
Model Reuse. Model reuse aims to explore the reusability of existing models to facilitate model training in a target domain with limited resources and/or labeled data. Recently, a fixed model reuse (FMR) method was proposed in [7], which assumes that a set of additional features is provided for each training sample. The method therefore amounts to feature reuse, and the further information contained in the deep models is ignored. Besides, only one type of source feature can be handled in [7]. A modal consistency multi-model reuse strategy was presented in [8], but it assumes that there are multiple modalities (e.g., features) in each domain and cannot be used when there is only one modality. In addition, it only utilizes the labels of different models. Consequently, the abundant information contained in the models is ignored, and the different source models must be present in the testing phase. A bag of experts approach was proposed in [9] to directly reuse the output features or labels of some expert source models, and it hence suffers from the same drawbacks as [7] or [8]. In this work, we propose a novel and general framework that can appropriately tackle these issues and make full use of the knowledge contained in multiple existing source models for training the target model, which greatly facilitates model generation based on the models in the City Brain.
Deep Neural Network Communication. Deep neural network transmission aims to utilize and deliver the knowledge concentrated in the network model to facilitate different intelligent applications. In [10], model compression is formulated from the perspective of transmission. As such, the redundancy among different models can be further exploited to facilitate many applications in front-end visual sensors. It is also shown that such a scheme can be elegantly combined with existing compression methods to form an integrated compression and communication framework.
3 Model Generation, Utilization and Communication Paradigm
In this section, we describe model generation, utilization and transmission in the digital retina system of smart cities. As illustrated in Fig. 2, we propose an integrated solution with edge computing, which serves as an intermediate layer between the visual sensors and the central cloud. Edge computing is effective in offloading computation from the central server and caching the data from the front-end visual sensors. Accordingly, multiple requirements for model generation and communication arise from different perspectives, which can be summarized as follows:
- Model reuse between edge ends. The edge nodes can utilize and reuse the models from the other edge nodes to perform particular tasks. However, it is widely acknowledged that there is severe domain bias in the data captured by widely deployed visual sensors. In real-world applications, the front-end visual sensors deployed in different locations may deal with data in different domains. Regarding a specific task, multiple models, each trained in a particular domain, can be utilized to generate a more discriminative model. For example, the trained models at several edge nodes can be transmitted to a particular edge node for multi-model reuse.
- Model transmission from edge to front-end. For the front-end, the models deployed from the edge nodes are often updated. In such a scenario, there exist high correlations among a series of updated models. To deliver the newly generated models economically, removing the redundancy between deep learning models can greatly reduce the transmission cost of the to-be-deployed models.
4 Multiple Model Reuse and Prediction
4.1 Multi-Model Reuse
In real-world applications, a domain can be characterized by target characteristics, geospatial information and capture conditions. There usually exists domain bias in the data distributions. Due to this domain gap, a model trained in one specific domain usually cannot generalize well to other domains. In smart artificial vision systems, there often exist similar models for the same task, which motivates us to leverage multiple existing models to obtain a domain-specific one. Moreover, the acquired and cached data could also be leveraged, especially from a wide range of local visual sensors. Given these considerations, we propose a novel multi-model reuse strategy to improve the performance, especially for front-end visual sensors in a particular domain.
Given one target domain and several source domains, we suppose there are only a few labeled samples in the target domain. In each source domain, a deep learning model has already been trained using abundant labeled data. The ultimate goal of multiple model reuse is to learn a model in the target domain using the pre-trained source models and the limited labeled data in the target domain.
The architecture of the proposed multi-model reuse framework is shown in Fig. 3. In particular, we make the mild assumption that the pre-trained source models and the target model are Convolutional Neural Networks (CNNs). To conduct reliable model reuse, we also assume that there are large amounts of unlabeled data in the target domain. Each (labeled or unlabeled) input is then mapped to a target hidden layer representation and to multiple source hidden layer representations, one per source model, all taken at a given layer. Each representation is in general a high-order tensor. To reuse the multiple source models, the different source representations are mapped into a common subspace, and the target representation is enforced to be close to this common representation. This is achieved by employing the multi-view learning strategy [11], i.e., using the target representation to reconstruct each source representation. As such, the target representation can be regarded as a meta-embedding of the different source layer representations. In this way, all the features in the source domains are utilized to improve the hidden layer representation, leading to a better model compared to the one trained using only the limited labeled information in the target domain. The objective function is given by,
[TABLE]
In this objective, the minimization is over the set of all parameters of the target learning task. The objective combines an empirical loss, which involves an activation function and can be any loss that is adopted in deep learning, with a regularization term that enables model reuse. A general formulation of the regularization term can be given by,
[TABLE]
Here, each source model is associated with transformation matrices, one per tensor mode, which are applied through the mode-wise tensor-matrix product, and with a weight that reflects the importance of that source model. When the hidden layer representation is a vector, i.e., a first-order tensor, the regularization term becomes,
[TABLE]
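For orientation, the objective and the two forms of the regularizer described in this subsection can be sketched as follows. All notation here is assumed for illustration and is not taken from the paper: $f(\cdot;\Theta)$ is the target network, $\sigma$ the activation function, $L$ the loss, $\lambda$ the trade-off hyper-parameter, $N_l$ and $N$ the numbers of labeled and of all (labeled plus unlabeled) target samples, $h_T^{l}$ and $h_{S_m}^{l}$ the target and $m$-th source representations at layer $l$, $U_{m,p}$ the transformation matrices, and $\alpha_m$ the source weights. The constraint that the weights are non-negative and sum to one is likewise an assumption.

```latex
% Assumed form of the training objective: empirical loss plus reuse regularizer.
\min_{\Theta}\;
  \frac{1}{N_l}\sum_{n=1}^{N_l} L\big(\sigma(f(x_n;\Theta)),\, y_n\big)
  \;+\; \lambda\,\frac{1}{N}\sum_{n=1}^{N}
  R\big(h_T^{l}(x_n),\, \{h_{S_m}^{l}(x_n)\}_{m=1}^{M}\big)

% Assumed general (tensor) regularizer: the target representation,
% transformed mode by mode, reconstructs each source representation.
R = \sum_{m=1}^{M} \alpha_m
    \big\| h_{S_m}^{l} - h_T^{l} \times_1 U_{m,1} \times_2 \cdots \times_P U_{m,P} \big\|_F^2,
\qquad \alpha_m \ge 0,\quad \sum_{m=1}^{M} \alpha_m = 1

% Vector case (first-order tensor, P = 1).
R = \sum_{m=1}^{M} \alpha_m \big\| h_{S_m}^{l} - U_m\, h_T^{l} \big\|_2^2
```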
The additional parameters introduced by the transformation matrices would increase the risk of over-fitting. In practice, this is not an issue since the transformation matrices can be learned using the large amounts of unlabeled data.
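As a concrete illustration of the vector case, the regularizer can be computed as a weighted sum of reconstruction errors. This is a minimal NumPy sketch under assumed conventions, not the authors' implementation: the function name `reuse_regularizer`, the argument layout, and the toy values are all hypothetical, and the weights are simply assumed to sum to one.

```python
import numpy as np

def reuse_regularizer(h_target, h_sources, transforms, weights):
    """Weighted reconstruction error between target and source representations.

    h_target:   (d_t,) target hidden layer vector
    h_sources:  list of (d_m,) source hidden layer vectors
    transforms: list of (d_m, d_t) matrices mapping the target vector
                into each source feature space (hypothetical layout)
    weights:    list of non-negative importances, assumed to sum to one
    """
    total = 0.0
    for h_s, U, alpha in zip(h_sources, transforms, weights):
        residual = h_s - U @ h_target      # how badly the target reconstructs this source
        total += alpha * float(residual @ residual)
    return total

# Toy example with two source models of different feature sizes.
h_t = np.array([1.0, 2.0])
U1 = np.eye(2)
U2 = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
h_s1 = np.array([1.0, 2.0])        # reconstructed exactly by U1 @ h_t
h_s2 = np.array([1.0, 2.0, 4.0])   # U2 @ h_t = [1, 2, 3], so the squared error is 1
penalty = reuse_regularizer(h_t, [h_s1, h_s2], [U1, U2], [0.5, 0.5])
```

Because the penalty needs no labels, it can be evaluated on unlabeled target samples as well, which is what allows the extra transformation parameters to be fitted without relying on the few labeled samples.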
4.2 Theoretical Analysis
For simplicity, we treat the deep learning models as end-to-end learning algorithms. Specifically, we consider a prediction function for the target model and one prediction function for each source model. To measure the accuracy of a model, we introduce its expected risk. Then we have the following theorem, which shows that the proposed model can benefit from the source models with a theoretical guarantee on the expected risk (for simplicity, the least square loss is adopted).
Theorem 1
Assume that the input-output maps of the target and source models are linear and that the feature space is bounded. Then, when adopting the least square loss, we have
[TABLE]
The bound involves the expected risk of the target model trained using only the labeled samples and the risk of each transformed source model w.r.t. the data in the target domain. We leave the proof to the supplementary material due to the limited space (the supplementary material is available at https://github.com/PKU-IMRE/Retina). Since the source models are well trained (e.g., using abundant labeled data), we believe that there exists (or we can learn) some transformation such that the risk of the transformed source model is small if the source tasks are related to the target task. When the remaining quantities in the bound are determined appropriately, the corresponding term can be small, so the expected risk of the target model is guaranteed to be low. We do not have such a guarantee without the designed regularization term: the expected risk of the target model could then be high, since the model is prone to over-fitting with limited labeled data in the target domain.
4.3 Model Prediction
Based on the proposed methodology, the generated models deployed on front-end visual sensors will often be updated. In this context, model communication acts as an essential component in the City Brain. In particular, with the philosophy of model reuse, it is painless to obtain new models frequently, so efficient communication of these models is highly desired. Hence, we investigate economic model communication based on the difference of models (DoM) between an existing model (e.g., one available at both the sender and the receiver) and the to-be-transmitted model. We compute the DoM between the prediction model and the to-be-compressed model by taking the difference of each pair of corresponding weights, layer by layer. Given the prediction model and the to-be-compressed model, the DoM between them can be computed as follows,
[TABLE]
where the two indices signify the layer index and the weight index in each layer, respectively. Subsequently, the weight differences are quantized according to scalar quantization,
[TABLE]
Here, one parameter is the amplification of the weights and another determines the degree of quantization; a higher value of the latter means fewer coding bits. The quantizer uses the round operation. At the receiver end, the quantized difference is de-quantized to recover the weight difference as follows,
[TABLE]
Then model compensation is performed to recover the model that is desired to be transmitted, i.e.,
[TABLE]
In addition, the recovered source models from model prediction are also allowed to facilitate the model training in the target domain through the multi-model reuse method.
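The difference, quantization, de-quantization and compensation steps above can be sketched end to end as follows. This is a minimal NumPy illustration under stated assumptions rather than the paper's exact scheme: a plain uniform scalar quantizer with a single step size `step` stands in for the amplification and quantization-degree parameters, and the function names `dom_encode` and `dom_decode` are hypothetical.

```python
import numpy as np

def dom_encode(pred_weights, new_weights, step):
    """Sender side: quantize the layer-by-layer weight differences (DoM)."""
    return [np.round((w_new - w_pred) / step).astype(np.int32)
            for w_pred, w_new in zip(pred_weights, new_weights)]

def dom_decode(pred_weights, codes, step):
    """Receiver side: de-quantize the differences and add them to the prediction model."""
    return [w_pred + q.astype(np.float64) * step
            for w_pred, q in zip(pred_weights, codes)]

# Toy two-layer "model": the new version is a small update of the one
# already held by both sender and receiver.
rng = np.random.default_rng(0)
pred = [rng.normal(size=(4, 3)), rng.normal(size=(3,))]
new = [w + 0.01 * rng.normal(size=w.shape) for w in pred]

step = 2.0 ** -8
codes = dom_encode(pred, new, step)          # small integers, cheap to entropy-code
recovered = dom_decode(pred, codes, step)

# With rounding, each recovered weight is within half a step of the true weight.
max_err = max(float(np.max(np.abs(r - n))) for r, n in zip(recovered, new))
```

Only `codes` needs to be transmitted; a larger `step` gives coarser codes and fewer bits at the cost of a larger reconstruction error.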
5 Experimental Results
5.1 Experimental setup
We conduct experiments on person ReID [12], a typical task in smart city applications, which aims to find the images in a database that show the same person as a query image. The reason for choosing this task is that the varying capture conditions and data collection scenarios result in severe domain bias in the data distribution. Four different person ReID datasets, Duke [13], Market1501 [14], MSMT17 [15] and CUHK03 [16], are used in the experiments, and their details are shown in Table 1. In all experiments, the reused models are trained on the MSMT17, CUHK03 and Market1501 datasets, and the target model is trained and tested on the Duke dataset.
Network Architecture. We adopt the ResNet50 network [17] as our base network, and use softmax loss as supervision.
Evaluation Metrics. The person ReID task is regarded as a retrieval task. The mean Average Precision (mAP) and Top-1 accuracy (Rank-1) are used for evaluation.
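For readers unfamiliar with these retrieval metrics, the following NumPy sketch computes Rank-1 accuracy and mAP from a query-by-gallery distance matrix. It is a simplified illustration: the function name `rank1_and_map` is hypothetical, and the camera-based filtering of gallery images that standard ReID protocols apply is deliberately omitted.

```python
import numpy as np

def rank1_and_map(dist, query_ids, gallery_ids):
    """Return (Rank-1 accuracy, mAP) for a query-by-gallery distance matrix."""
    query_ids = np.asarray(query_ids)
    gallery_ids = np.asarray(gallery_ids)
    rank1_hits, average_precisions = [], []
    for q in range(dist.shape[0]):
        order = np.argsort(dist[q])                  # closest gallery images first
        matches = gallery_ids[order] == query_ids[q]
        if not matches.any():
            continue                                 # skip queries without a true match
        rank1_hits.append(float(matches[0]))
        hit_ranks = np.flatnonzero(matches) + 1      # 1-based ranks of the true matches
        precisions = np.arange(1, len(hit_ranks) + 1) / hit_ranks
        average_precisions.append(precisions.mean())
    return float(np.mean(rank1_hits)), float(np.mean(average_precisions))

# Two queries against a gallery of three images.
dist = np.array([[0.1, 0.5, 0.3],    # query of person 7: both matches ranked first
                 [0.2, 0.4, 0.9]])   # query of person 9: its match is ranked second
rank1, mean_ap = rank1_and_map(dist, query_ids=[7, 9], gallery_ids=[7, 9, 7])
```

In this toy case the first query has an average precision of 1 and the second of 0.5, so Rank-1 is 0.5 and mAP is 0.75.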
Unlabeled Data Usage. To perform the unlabeled data experiments, we split the above training set into two parts, i.e., a labeled part and an unlabeled part. In the unlabeled part, the label information is not used in the experiments. For example, 30% Duke means that only 30% of the samples are used as labeled samples and the rest serve as unlabeled samples during training.
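A split of this kind can be sketched in a few lines. The example below assumes a simple random per-sample split, which may differ from the protocol actually used; the helper `split_labeled_unlabeled` and the file names are hypothetical.

```python
import random

def split_labeled_unlabeled(samples, labeled_fraction, seed=0):
    """Shuffle (image, label) pairs and hide the labels of all but a fraction of them."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_labeled = round(labeled_fraction * len(shuffled))
    labeled = shuffled[:n_labeled]
    unlabeled = [image for image, _ in shuffled[n_labeled:]]   # labels are dropped
    return labeled, unlabeled

# "30% Duke"-style split on a toy list of ten samples.
samples = [(f"img_{i}.jpg", i % 5) for i in range(10)]
labeled, unlabeled = split_labeled_unlabeled(samples, labeled_fraction=0.3)
```

The labeled part feeds the supervised loss, while both parts can feed the label-free reuse regularizer.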
5.2 Results of Model Reuse
Single Model Reuse. We first demonstrate the performance of single model reuse in Fig. 4(a)(b). The baseline model is trained with the labeled training part of Duke dataset. Three different models trained on MSMT17, CUHK03 and Market1501 are used for model reuse and the target model is trained and tested on the Duke dataset. It can be observed that additional model reuse significantly boosts the performance of ReID model over the baseline. Moreover, with different settings on the percentage of unlabeled data, the model trained by reusing strategy can consistently outperform the baseline.
Hyper-parameter analysis. A trade-off hyper-parameter balances the empirical loss and the regularization term for reuse in the training objective. Properly choosing its value can improve the performance. To investigate the sensitivity of the model w.r.t. this hyper-parameter, we vary it and show the performances in Fig. 4. Our model remains fairly stable across a wide range of values. Besides, in Table 2, we present the performances of reusing two models on different scales. It can be observed that the overall performances under different scales are close. As such, we use a single fixed scale in the following experiments.
Multi-Model Reuse. The results of multi-model reuse are shown in Table 3. Compared to the baseline models, reusing additional models achieves better performance in both mAP and Rank-1. Moreover, incremental performance gains are consistently achieved by increasing the number of reused models. With three reused models, we achieve 41.2% mAP, which significantly outperforms the baseline at 31.5% mAP. We also compare with state-of-the-art methods such as PCB [12], which has superior performance over our adopted softmax baseline. With incremental multi-model reuse, our baseline model is significantly improved. When incorporating three models into the model reuse, we outperform PCB [12] by about 5% mAP and 4% Rank-1. The models trained with the multi-model reuse strategy are available at https://github.com/PKU-IMRE/Retina.
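The quoted gains can be checked directly against Table 3. The short sketch below only redoes that arithmetic; the dictionary `table3` and the helper `gain` are illustrative names, and the numbers are copied from the table.

```python
# (mAP, Rank-1) in percent, copied from Table 3.
table3 = {
    "Softmax Baseline": (31.53, 49.55),
    "PCB": (36.62, 57.05),
    "+MSMT+CUHK+Market": (41.24, 61.04),
}

def gain(model, reference):
    """Absolute improvement of `model` over `reference`, in percentage points."""
    (map_a, r1_a), (map_b, r1_b) = table3[model], table3[reference]
    return round(map_a - map_b, 2), round(r1_a - r1_b, 2)

gain_over_baseline = gain("+MSMT+CUHK+Market", "Softmax Baseline")
gain_over_pcb = gain("+MSMT+CUHK+Market", "PCB")
```

Reusing three models thus adds 9.71 mAP and 11.49 Rank-1 points over the softmax baseline, and 4.62 mAP and 3.99 Rank-1 points over PCB.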
5.3 Model Prediction
The enhanced model produced by multi-model reuse is deployed to the front-end by the incremental model updating strategy. We present the performance in Table 4 as the number of compression bits decreases. Fewer compression bits mean coarser quantization; of the two other quantization parameters, one is set to 12 and the other to 0.3. Moreover, Model-V1/V2/V3 represent different versions of the same model with incrementally better performance. Given the same to-be-compressed models, higher compression ratios and better performance can be achieved with the DoM, for example when the compression bits decrease from 5 to 3. For a better understanding of Table 4, we plot how the compression ratio changes under different quantization levels with and without DoM in Fig. 5. The DoM strategy significantly outperforms the simple single model compression scheme.
The model prediction is able to consistently improve the performance by using the information shared between models. When compression bits = 3, the DoM strategy still maintains reasonable performance (39.42 → 33.64), while the single model compression strategy collapses (39.42 → 0.16). The reason is that DoM compresses the differences between models, while single model compression is applied to the whole model. Thus, the DoM is more suitable for delivering incremental information under constrained transmission environments.
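The intuition behind this gap can be reproduced on toy data: at the same bit budget, quantizing a small difference spans a much narrower value range than quantizing the full weights, so each quantization step is finer. The sketch below uses random vectors in place of real ReID models and a generic range-based uniform quantizer (`quantize_dequantize` is a hypothetical helper), so it illustrates the effect rather than the paper's results.

```python
import numpy as np

def quantize_dequantize(values, bits):
    """Uniform scalar quantization of `values` to 2**bits levels over their own range."""
    lo, hi = float(values.min()), float(values.max())
    step = (hi - lo) / (2 ** bits - 1)
    if step == 0.0:
        return values.copy()                 # constant input needs no quantization
    codes = np.round((values - lo) / step)
    return lo + codes * step

rng = np.random.default_rng(0)
old_model = rng.normal(size=10_000)                       # version already at the receiver
new_model = old_model + 0.01 * rng.normal(size=10_000)    # small incremental update

bits = 3
direct = quantize_dequantize(new_model, bits)                           # single model compression
via_dom = old_model + quantize_dequantize(new_model - old_model, bits)  # DoM-style compression

direct_err = float(np.mean(np.abs(direct - new_model)))
dom_err = float(np.mean(np.abs(via_dom - new_model)))
```

With only 3 bits, the directly quantized weights carry a far larger error than the weights recovered from the quantized difference, which mirrors the collapse of single model compression in Table 4.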
6 Conclusion
In this work, we show the potential of model generation, utilization and communication in constructing the digital retina for artificial intelligence applications in smart cities. To demonstrate the benefits of adopting model reuse and prediction, we use the challenging problem of person ReID to reveal that properly reusing models is effective in dealing with the data collected from a wide range of visual sensors. In the future, we will systematically integrate model generation, utilization, communication and standardization to establish an intelligent, economic and efficient digital retina in smart cities.
Acknowledgement:
This work was supported by the National Natural Science Foundation of China under Grant 61661146005 and Grant U1611461, and in part by the National Basic Research Program of China under Grant 2015CB351806, and in part by Australian Research Council Project DE-1901014738, and in part by Hong Kong RGC Early Career Scheme under Grant 9048122 (CityU 21211018).
References
- [1] H. Wässle, “Parallel processing in the mammalian retina,” Nature Reviews Neuroscience 5, 747, 2004.
- [2] L. Bao and J. Kai et al., “Artificial shape perception retina network based on tunable memristive neurons,” Scientific Reports 8, 13727, 2018.
- [3] W. Gao and Y. Tian, “Digital retina: revolutionizing camera systems for the smart city,” Science China Information Science, vol. 48, no. 8, pp. 1076–1082, 2018.
- [4] R. Woodworth and E. Thorndike, “The influence of improvement in one mental function upon the efficiency of other functions,” Psychological Review 8, 247, 1901.
- [5] National Research Council, How people learn: Brain, mind, experience, and school: Expanded edition, National Academies Press, 2000.
- [6] P. Sterling and S. Laughlin, Principles of neural design, MIT Press, 2015.
- [7] Y. Yang and D. Zhan et al., “Deep learning for fixed model reuse,” in Proc. AAAI, 2017.
- [8] Y. Xiang and Z. De et al., “Modal consistency based pre-trained multi-model reuse,” in Proc. IJCAI, 2017.
