
Student: Hao-Hsiang Chuang (莊皓翔)
Thesis title: An Encoder-decoder Network for Crowd Counting Based on Multi-scale Attention Mechanism (基於多尺度注意機制之編碼解碼器人群計數網路)
Advisor: Chang-Hong Lin (林昌鴻)
Oral defense committee: Yuan-Hsiang Lin (林淵翔), Chin-Hsien Wu (吳晋賢), Wei-Mei Chen (陳維美)
Degree: Master's
Department: Department of Electronic and Computer Engineering, College of Electrical Engineering and Computer Science
Year of publication: 2022
Academic year of graduation: 110 (ROC calendar; academic year 2021-2022)
Language: Chinese
Number of pages: 72
Keywords: Crowd counting, Density estimation, Attention mechanism, Skip-connection, Multi-scale attention
    Crowd counting is a challenging computer vision task that is widely used in applications such as video surveillance and public safety. As camera resolution and the complexity of crowd images increase, accurately predicting crowd density and crowd count has become an important problem. In recent years, density estimation methods based on convolutional neural networks (CNN-based density estimation) have been able to estimate the number of people in dense scenes effectively and have demonstrated excellent accuracy. In this thesis, we propose an Encoder-Decoder Multi-Scale Attention Network for crowd counting, which adopts the U-Net architecture [1] as its backbone network together with an attention mechanism. The attention mechanism and the skip-connections adjust the weights of the feature maps while preserving features from different scales. We train and test the proposed network on three datasets recently used for crowd counting: ShanghaiTech Part_A and Part_B [2] and UCF-QNRF [3]. The quantitative results show that our network achieves lower MAE and RMSE than the compared methods on these datasets (ShanghaiTech Part_A MAE/RMSE: 60.0/104.9; Part_B MAE/RMSE: 7.8/13.8; UCF-QNRF MAE/RMSE: 98.6/179.7). In addition, the qualitative results show that, because multi-scale attention is added to the network, it effectively prevents outliers from appearing in the predicted density maps.
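    The abstract reports results as MAE/RMSE over crowd counts obtained by density estimation. The sketch below illustrates those two ideas in plain Python under stated assumptions; it is not the thesis's implementation, and all function names are hypothetical. It assumes a ground-truth density map built by placing one Gaussian per annotated head with a fixed sigma (the kernel setting actually used in the thesis is not given in this record), takes the crowd count as the sum of the density map, and computes MAE and RMSE over per-image counts.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (x, y) head position in pixel coordinates


def gaussian_density_map(points: Sequence[Point], height: int, width: int,
                         sigma: float = 4.0) -> List[List[float]]:
    """Hypothetical helper: one 2-D Gaussian per annotated head, renormalized over the grid.

    Assumption: a fixed sigma for every head. Renormalizing each kernel over the
    image (instead of truncating it at the border) makes every head contribute
    exactly 1, so the map sums to the number of annotated people.
    Points are assumed to lie inside the image.
    """
    density = [[0.0] * width for _ in range(height)]
    for px, py in points:
        kernel = [[math.exp(-((x - px) ** 2 + (y - py) ** 2) / (2.0 * sigma ** 2))
                   for x in range(width)]
                  for y in range(height)]
        total = sum(sum(row) for row in kernel)
        for y in range(height):
            for x in range(width):
                density[y][x] += kernel[y][x] / total
    return density


def estimated_count(density: Sequence[Sequence[float]]) -> float:
    """The crowd count is the integral (sum) of the density map."""
    return sum(sum(row) for row in density)


def mae_rmse(pred_counts: Sequence[float],
             gt_counts: Sequence[float]) -> Tuple[float, float]:
    """Mean absolute error and root mean squared error over per-image counts."""
    if not pred_counts or len(pred_counts) != len(gt_counts):
        raise ValueError("need two non-empty sequences of equal length")
    errors = [p - g for p, g in zip(pred_counts, gt_counts)]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return mae, rmse


if __name__ == "__main__":
    # Toy example: three annotated heads on a 16x16 image (hypothetical data).
    gt_map = gaussian_density_map([(3, 4), (10, 10), (12, 5)], height=16, width=16, sigma=2.0)
    print(round(estimated_count(gt_map), 6))  # 3.0

    # Hypothetical predicted vs. ground-truth counts for three test images.
    mae, rmse = mae_rmse([10.0, 22.0, 5.0], [12, 20, 5])
    print(f"MAE={mae:.3f}, RMSE={rmse:.3f}")  # MAE=1.333, RMSE=1.633
```

    Renormalizing each kernel keeps the count exact at image borders; geometry-adaptive kernels, as used with MCNN [2], are a common alternative when head sizes vary strongly. Note also that many crowd-counting papers label the square-rooted error "MSE" even though it is the RMSE computed here.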

    ABSTRACT (CHINESE)
    ABSTRACT
    ACKNOWLEDGEMENTS
    LIST OF CONTENTS
    LIST OF FIGURES
    LIST OF TABLES
    CHAPTER 1 INTRODUCTION
        1.1 Motivation
        1.2 Contributions
        1.3 Thesis Organization
    CHAPTER 2 RELATED WORKS
        2.1 Detection-based Approaches
        2.2 Density Estimation-based Approaches
        2.3 CNN-based Approaches
    CHAPTER 3 PROPOSED METHODS
        3.1 Data Augmentation
            3.1.1 Crop Strategy
            3.1.2 Brightness and Contrast Adjustments
            3.1.3 Saturation Adjustment
        3.2 Network Architecture
            3.2.1 The Overall Model
            3.2.2 The Backbone Network
            3.2.3 Multi-scale Attention Model
            3.2.4 Density Map Generator
        3.3 Loss Functions
            3.3.1 MAE & RMSE Losses
            3.3.2 Attention Loss
            3.3.3 SSIM Loss
        3.4 Training Setting
            3.4.1 Optimizer
            3.4.2 Learning Rate Decay
    CHAPTER 4 EXPERIMENTAL RESULTS
        4.1 Experimental Environment
        4.2 Crowd Dataset
            4.2.1 ShanghaiTech Dataset
            4.2.2 UCF-QNRF Dataset
        4.3 Evaluation Metrics
        4.4 Visualization Results
            4.4.1 Feature Maps
            4.4.2 Qualitative Evaluation
            4.4.3 Quantitative Evaluation
        4.5 Ablation Study
    CHAPTER 5 CONCLUSIONS AND FUTURE WORKS
        5.1 Conclusions
        5.2 Future Works
    REFERENCES

    [1] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional Networks for Biomedical Image Segmentation," in Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2015, pp. 234-241.
    [2] Y. Zhang, D. Zhou, S. Chen, S. Gao, and Y. Ma, "Single-Image Crowd Counting via Multi-Column Convolutional Neural Network," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 589-597.
    [3] H. Idrees, M. Tayyab, K. Athrey, D. Zhang, S. Al-Maadeed, N. Rajpoot, and M. Shah, "Composition Loss for Counting, Density Map Estimation and Localization in Dense Crowds," in European Conference on Computer Vision (ECCV), 2018, pp. 532-546.
    [4] H. Song, X. Liu, X. Zhang, and J. Hu, "Real-Time Monitoring for Crowd Counting Using Video Surveillance and GIS," in International Conference on Remote Sensing, Environment and Transportation Engineering (ICRSETE), 2012, pp. 1-4.
    [5] A. Albiol and J. Silla, "Statistical Video Analysis for Crowds Counting," in IEEE International Conference on Image Processing (ICIP), 2009, pp. 2569-2572.
    [6] A. Dubey, Akshdeep, and S. Rane, "Implementation of An Intelligent Traffic Control System and Real Time Traffic Statistics Broadcasting," in International Conference of Electronics, Communication and Aerospace Technology (ICECA), 2017, vol. 2, pp. 33-37.
    [7] U. E. Prakash, K. T. Vishnupriya, A. Thankappan, and A. A. Balakrishnan, "Density Based Traffic Control System Using Image Processing," in International Conference on Emerging Trends and Innovations In Engineering And Technological Research (ICETIETR), 2018, pp. 1-4.
    [8] W. Xiu, B. Zhang, Z. Gao, W. Liang, W. Qi, and X. Peng, "The Research and Realization of Public Safety Orientated Panoramic Video Hotspot Interaction Technique," in International Conference on Electronics Information and Emergency Communication (ICEIEC), 2019, pp. 479-482.
    [9] L. Qian, Y. Fu, and T. Liu, "An Efficient Model Compression Method for CNN Based Object Detection," in International Conference on Software Engineering and Service Science (ICSESS), 2018, pp. 766-769.
    [10] P. Fang and Y. Shi, "Small Object Detection Using Context Information Fusion in Faster R-CNN," in International Conference on Computer and Communications (ICCC), 2018, pp. 1537-1540.
    [11] W. Zhang, S. Wang, S. Thachan, J. Chen, and Y. Qian, "Deconv R-CNN for Small Object Detection on Remote Sensing Images," in International Geoscience and Remote Sensing Symposium (IGARSS), 2018, pp. 2483-2486.
    [12] J. Tang and G. Wen, "Object Recognition via Classifier Interaction with Multiple Features," in International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC), 2016, vol. 2, pp. 337-340.
    [13] X. Yang, K. Fu, H. Sun, X. Sun, M. Yan, W. Diao, and Z. Guo, "Object Detection with Head Direction in Remote Sensing Images Based on Rotational Region CNN," in IEEE International Geoscience and Remote Sensing Symposium (IGARSS), 2018, pp. 2507-2510.
    [14] F. J. Shen, J. H. Chen, W. Y. Wang, D. L. Tsai, L. C. Shen, and C. T. Tseng, "A CNN-Based Human Head Detection Algorithm Implemented on Edge AI Chip," in International Conference on System Science and Engineering (ICSSE), 2020, pp. 1-5.
    [15] C. Lin, X. Gu, K. Ping, F. Li, and M. He, "A Human Head Detection Method Based on Center Point Estimation for Crowded Scene," in International Computer Conference on Wavelet Active Media Technology and Information Processing (ICCWAMTIP), 2020, pp. 105-108.
    [16] A. C. Tsai, Y. Y. Ou, W. C. Wu, and J. F. Wang, "Occlusion Resistant Face Detection and Recognition System," in International Conference on Orange Technology (ICOT), 2020, pp. 1-4.
    [17] Y. Zhou, H. Ni, F. Ren, and X. Kang, "Face and Gender Recognition System Based on Convolutional Neural Networks," in IEEE International Conference on Mechatronics and Automation (ICMA), 2019, pp. 1091-1095.
    [18] Y. Li, X. Zhang, and D. Chen, "CSRNet: Dilated Convolutional Neural Networks for Understanding the Highly Congested Scenes," in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 1091-1100.
    [19] V. A. Sindagi and V. M. Patel, "Generating High-Quality Crowd Density Maps Using Contextual Pyramid CNNs," in IEEE International Conference on Computer Vision (ICCV), 2017, pp. 1879-1888.
    [20] Z. Shi, L. Zhang, Y. Liu, X. Cao, Y. Ye, M. Cheng, and G. Zheng, "Crowd Counting with Deep Negative Correlation Learning," in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 5382-5390.
    [21] X. Liu, J. van de Weijer, and A. D. Bagdanov, "Leveraging Unlabeled Data for Crowd Counting by Learning to Rank," in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 7661-7669.
    [22] K. Simonyan and A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition," arXiv preprint arXiv:1409.1556, 2015.
    [23] X. Zheng and T. Chen, "Segmentation of High Spatial Resolution Remote Sensing Image based On U-Net Convolutional Networks," in IEEE International Geoscience and Remote Sensing Symposium (IGARSS), 2020, pp. 2571-2574.
    [24] B. Zhao, J. Soraghan, G. D. Caterina, and D. Grose, "Segmentation of Head and Neck Tumours Using Modified U-net," in European Signal Processing Conference (EUSIPCO), 2019, pp. 1-4.
    [25] R. H. M. Rafi, B. Tang, Q. Du, and N. H. Younan, "Attention-based Domain Adaptation for Hyperspectral Image Classification," in IEEE International Geoscience and Remote Sensing Symposium (IGARSS), 2019, pp. 67-70.
    [26] J. Babaud, A. P. Witkin, M. Baudin, and R. O. Duda, "Uniqueness of the Gaussian Kernel for Scale-Space Filtering," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 8, no. 1, 1986, pp. 26-33.
    [27] X. Wu, G. Liang, K. K. Lee, and Y. Xu, "Crowd Density Estimation Using Texture Analysis and Learning," in IEEE International Conference on Robotics and Biomimetics, 2006, pp. 214-219.
    [28] S. An, W. Liu, and S. Venkatesh, "Face Recognition Using Kernel Ridge Regression," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2007, pp. 1-7.
    [29] X. Zhang and L. Zhang, "Real Time Crowd Counting with Human Detection and Human Tracking," in International Conference on Neural Information Processing (ICONIP), 2014, pp. 1-8.
    [30] V. B. Subburaman, A. Descamps, and C. Carincotte, "Counting People in The Crowd Using a Generic Head Detector," in IEEE Ninth International Conference on Advanced Video and Signal-Based Surveillance, 2012, pp. 470-475.
    [31] V. Lempitsky and A. Zisserman, "Learning To Count Objects in Images," in Neural Information Processing Systems (NIPS), vol. 1, 2010, pp. 1324-1332.
    [32] V. Q. Pham, T. Kozakaya, O. Yamaguchi, and R. Okada, "COUNT Forest: CO-Voting Uncertain Number of Targets Using Random Forest for Crowd Density Estimation," in IEEE International Conference on Computer Vision (ICCV), 2015, pp. 3253-3261.
    [33] K. Chen, C. C. Loy, S. Gong, and T. Xiang, "Feature Mining for Localised Crowd Counting," in British Machine Vision Conference (BMVC), 2012, pp. 21.1-21.11.
    [34] G. Terejanu, P. Singla, T. Singh, and P. D. Scott, "Adaptive Gaussian Sum Filter for Nonlinear Bayesian Estimation," IEEE Transactions on Automatic Control, vol. 56, no. 9, 2011, pp. 2151-2156.
    [35] C. Wang, H. Zhang, L. Yang, S. Liu, and X. Cao, "Deep People Counting in Extremely Dense Crowds," in ACM international conference on Multimedia, 2015, pp. 1299-1302.
    [36] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," Communications of the ACM, vol. 60, no. 6, 2017, pp. 84-90.
    [37] M. Fu, P. Xu, X. Li, Q. Liu, M. Ye, and C. Zhu, "Fast Crowd Density Estimation with Convolutional Neural Networks," in Engineering Applications of Artificial Intelligence, vol. 43, 2015, pp. 81-88.
    [38] S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo, "CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features," in IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 6022-6031.
    [39] E. Walach and L. Wolf, "Learning to Count with CNN Boosting," in European Conference on Computer Vision (ECCV), vol. 9906, 2016, pp. 660-676.
    [40] X. Liu, J. van de Weijer, and A. D. Bagdanov, "Exploiting Unlabeled Data in CNNs by Self-supervised Learning to Rank," in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, 2019, pp. 1862-1878.
    [41] X. Jiang, Z. Xiao, B. Zhang, X. Zhen, X. Cao, D. Doermann, and L. Shao, "Crowd Counting and Density Estimation by Trellis Encoder-Decoder Networks," in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 6126-6135.
    [42] J. Gao, Q. Wang, and X. Li, "PCC Net: Perspective Crowd Counting via Spatial Convolutional Network," IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 10, 2020, pp. 3486-3498.
    [43] M. Oh, P. Olsen, and K. N. Ramamurthy, "Crowd Counting with Decomposed Uncertainty," in AAAI Conference on Artificial Intelligence, vol. 34, 2020, pp. 11799-11806.
    [44] H. T. Søgaard and H. J. Olsen, "Determination of Crop Rows by Image Analysis without Segmentation," in Computers and Electronics in Agriculture, vol. 38, no. 2, 2003, pp. 141-158.
    [45] G. Krawczyk, R. Mantiuk, D. Zdrojewska, and H. P. Seidel, "Brightness Adjustment for HDR and Tone Mapped Images," in Pacific Conference on Computer Graphics and Applications, 2007, pp. 373-381.
    [46] M. Zhou, K. Jin, S. Wang, J. Ye, and D. Qian, "Color Retinal Image Enhancement Based on Luminosity and Contrast Adjustment," in IEEE Transactions on Biomedical Engineering, vol. 65, no. 3, 2018, pp. 521-527.
    [47] S. Wang, W. Cho, J. Jang, M. A. Abidi, and J. Paik, "Contrast-dependent Saturation Adjustment for Outdoor Image Enhancement," in Journal of the Optical Society of America A, vol. 34, no. 1, 2017, pp. 7-17.
    [48] A. Lotfi and A. Benyettou, "Over-fitting Avoidance in Probabilistic Neural Networks," in World Congress on Information Technology and Computer Applications (WCITCA), 2015, pp. 1-6.
    [49] G. Cao, L. Huang, H. Tian, H. Xianglin, Y. Wang, and R. Zhi, "Contrast Enhancement of Brightness-distorted Images by Improved Adaptive Gamma Correction," in Computers & Electrical Engineering, vol. 66, 2018, pp. 569-582.
    [50] N. Liu, Y. Long, C. Zou, Q. Niu, L. Pan, and H. Wu, "ADCrowdNet: An Attention-injective Deformable Convolutional Network for Crowd Understanding," in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 3220-3229.
    [51] K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770-778.
    [52] X. Pan, H. Mo, Z. Zhou, and W. Wu, "Attention Guided Region Division for Crowd Counting," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 2568-2572.
    [53] S. Zhang, H. Fu, Y. Yan, Y. Zhang, Q. Wu, M. Yang, M. Tan, and Y. Xu, "Attention Guided Network for Retinal Image Segmentation," in Medical Image Computing and Computer Assisted Intervention (MICCAI), 2019, pp. 797-805.
    [54] J. Gao, Q. Wang, and Y. Yuan, "SCAR: Spatial-/Channel-wise Attention Regression Networks for Crowd Counting," in Neurocomputing, vol. 363, 2019, pp. 1-8.
    [55] A. Sagar, "DMSANet: Dual Multi Scale Attention Network," arXiv preprint arXiv:2106.08382, 2021.
    [56] V. A. Sindagi and V. M. Patel, "Inverse Attention Guided Deep Crowd Counting Network," in IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), 2019, pp. 1-8.
    [57] V. A. Sindagi and V. M. Patel, "HA-CCN: Hierarchical Attention-Based Crowd Counting Network," in IEEE Transactions on Image Processing, vol. 29, 2020, pp. 323-335.
    [58] Z. Duan, Y. Xie, and J. Deng, "HAGN: Hierarchical Attention Guided Network for Crowd Counting," IEEE Access, vol. 8, 2020, pp. 36376-36385.
    [59] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and F.-F. Li, "ImageNet: A Large-scale Hierarchical Image Database," in IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248-255.
    [60] P. Hurtik and N. Madrid, "Bilinear Interpolation over Fuzzified Images: Enlargement," in IEEE International Conference on Fuzzy Systems, 2015, pp. 1-8.
    [61] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang, "Residual Attention Network for Image Classification," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 3156-3164.
    [62] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image Quality Assessment: from Error Visibility to Structural Similarity," IEEE Transactions on Image Processing, vol. 13, no. 4, 2004, pp. 600-612.
    [63] Z. Zhang and M. R. Sabuncu, "Generalized Cross Entropy Loss for Training Deep Neural Networks with Noisy Labels," in Neural Information Processing Systems (NIPS), 2018, pp. 8792-8802.
    [64] D. P. Kingma and J. Ba, "Adam: A Method for Stochastic Optimization," in International Conference on Learning Representations (ICLR), vol. 3, no. 11, 2015, pp. 1-15.
    [65] S. Ruder, "An Overview of Gradient Descent Optimization Algorithms," arXiv preprint arXiv:1609.04747, 2016.
    [66] J. Duchi, E. Hazan, and Y. Singer, "Adaptive Subgradient Methods for Online Learning and Stochastic Optimization," in Journal of Machine Learning Research, vol. 12, 2011, pp. 2121-2159.
    [67] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, "PyTorch: An Imperative Style, High-Performance Deep Learning Library," in Neural Information Processing Systems (NIPS), 2019.
    [68] C. Liu, X. Weng, and Y. Mu, "Recurrent Attentive Zooming for Joint Crowd Counting and Precise Localization," in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 1217-1226.
    [69] J. Wan, Q. Wang, and A. B. Chan, "Kernel-based Density Map Generation for Dense Object Counting," in IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020 (Early Access), pp. 1-14.
    [70] D. Kang, Z. Ma, and A. B. Chan, "Beyond Counting: Comparisons of Density Maps for Crowd Analysis Tasks—Counting, Detection, and Tracking," IEEE Transactions on Circuits and Systems for Video Technology, vol. 29, no. 5, 2019, pp. 1408-1422.
    [71] J. Liu, C. Gao, D. Meng, and A. G. Hauptmann, "DecideNet: Counting Varying Density Crowds Through Attention Guided Detection and Density Estimation," in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 5197-5206.
    [72] H. Y. Yao, W. G. Wan, and X. Li, "Mask Guided GAN for Density Estimation and Crowd Counting," IEEE Access, vol. 8, 2020, pp. 31432-31443.

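    Full-text release date: 2027/01/21 (on-campus network)
    Full-text release date: 2027/01/21 (off-campus network)
    Full-text release date: 2027/01/21 (National Central Library: Taiwan Theses and Dissertations System)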