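Graduate student: Hao-Hsiang Chuang (莊皓翔)
Thesis title: An Encoder-decoder Network for Crowd Counting Based on Multi-scale Attention Mechanism (基於多尺度注意機制之編碼解碼器人群計數網路)
Advisor: Chang-Hong Lin (林昌鴻)
Oral defense committee: Yuan-Hsiang Lin (林淵翔), Chin-Hsien Wu, Wei-Mei Chen
Degree: Master's
Department: Department of Electronic and Computer Engineering, College of Electrical Engineering and Computer Science (電資學院 - 電子工程系)
Year of publication: 2022
Academic year of graduation: 110 (ROC calendar)
Language: Chinese
Number of pages: 72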
Keywords (Chinese): 人群計數, 密度估計, 注意力機制, 跳躍連接, 多尺度注意力
Keywords (English): Crowd counting, Density estimation, Attention mechanism, Skip-connection, Multi-scale attention
  • Crowd counting is a challenging computer vision task that is widely used in applications such as video surveillance and public safety. As camera resolution and the complexity of crowd images increase, accurately predicting crowd density and crowd count has become an important problem. In recent years, density estimation methods based on convolutional neural networks (CNNs) have proven effective at estimating the number of people in dense scenes and have demonstrated excellent accuracy. In this thesis, we propose an encoder-decoder Multi-Scale Attention Network for crowd counting, which adopts the U-Net architecture [1] as a backbone network with an attention mechanism. The attention mechanism and the skip-connections adjust the weights of the feature maps while preserving features at different scales. We train and test the proposed network on three recently used crowd counting datasets: ShanghaiTech Part_A and Part_B [2] and UCF-QNRF [3]. The quantitative results show that our network achieves lower MAE and RMSE than the other methods compared (ShanghaiTech Part_A MAE/RMSE: 60.0/104.9, Part_B MAE/RMSE: 7.8/13.8, and UCF-QNRF MAE/RMSE: 98.6/179.7). In addition, the qualitative results show that, because multi-scale attention is added to the network, it effectively prevents outliers from appearing in the predicted density maps.
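
    The MAE and RMSE figures quoted in the abstract can be illustrated with a minimal sketch. It assumes the usual crowd counting convention, in which the predicted count for an image is the sum of its density map and both metrics are computed over per-image counts; the thesis defines its own metrics in Section 4.3, which may differ in detail. The function names and the toy numbers are hypothetical and are not taken from the thesis.

```python
import math


def count_from_density(density_map):
    """Estimate the head count as the sum of all density-map values (assumed convention)."""
    return sum(sum(row) for row in density_map)


def mae_rmse(true_counts, pred_counts):
    """Return (MAE, RMSE) over per-image crowd counts."""
    if not true_counts or len(true_counts) != len(pred_counts):
        raise ValueError("count lists must be non-empty and of equal length")
    errors = [pred - true for true, pred in zip(true_counts, pred_counts)]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return mae, rmse


# Toy data (hypothetical): one tiny density map plus two plain predicted counts.
toy_density = [[0.25, 0.75], [0.5, 0.5]]  # values sum to 2.0
true_counts = [2, 10, 5]
pred_counts = [count_from_density(toy_density), 13.0, 1.0]

mae, rmse = mae_rmse(true_counts, pred_counts)
print(f"MAE = {mae:.3f}, RMSE = {rmse:.3f}")
```

    Because RMSE squares each error before averaging, it penalizes a few large miscounts more heavily than MAE does, which is why both values are reported side by side for each dataset.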

    Abstract (Chinese)
    ABSTRACT
    Acknowledgements
    LIST OF CONTENTS
    LIST OF FIGURES
    LIST OF TABLES
    CHAPTER 1 INTRODUCTIONS
        1.1 Motivation
        1.2 Contributions
        1.3 Thesis Organization
    CHAPTER 2 RELATED WORKS
        2.1 Detection-based Approaches
        2.2 Density Estimation-based Approaches
        2.3 CNN-based Approaches
    CHAPTER 3 PROPOSED METHODS
        3.1 Data Augmentation
            3.1.1 Crop Strategy
            3.1.2 Brightness and Contrast Adjustments
            3.1.3 Saturation Adjustment
        3.2 Network Architecture
            3.2.1 The Overall Model
            3.2.2 The Backbone Network
            3.2.3 Multi-scale Attention Model
            3.2.4 Density Map Generator
        3.3 Loss Functions
            3.3.1 MAE & RMSE Losses
            3.3.2 Attention Loss
            3.3.3 SSIM Loss
        3.4 Training Setting
            3.4.1 Optimizer
            3.4.2 Learning Rate Decay
    CHAPTER 4 EXPERIMENTAL RESULTS
        4.1 Experimental Environment
        4.2 Crowd Dataset
            4.2.1 ShanghaiTech Dataset
            4.2.2 UCF-QNRF Dataset
        4.3 Evaluation Metrics
        4.4 Visualization Results
            4.4.1 Feature Maps
            4.4.2 Qualitative Evaluation
            4.4.3 Quantitative Evaluation
        4.5 Ablation Study
    CHAPTER 5 CONCLUSIONS AND FUTURE WORKS
        5.1 Conclusions
        5.2 Future Works
    REFERENCES

