
Author: Albert Christianto
Thesis Title: Pedestrian Detection Using Depth Estimation Maps and Semantic Segmentation (應用深度估計與語義分割進行行人偵測)
Advisors: Wen-Hsien Fang (方文賢), Yie-Tarng Chen (陳郁堂)
Committee Members: Gee-Sern Hsu (徐繼聖), Kuen-Tsair Lay (賴坤財), Chien-Ching Chiu (丘建青)
Degree: Doctor
Department: College of Electrical Engineering and Computer Science - Department of Electronic and Computer Engineering
Publication Year: 2019
Graduation Academic Year: 107
Language: English
Pages: 58
Keywords: Depth estimation maps, fusion network, multi-scale, pedestrian detection, semantic segmentation maps
Views: 245; Downloads: 17
    This thesis presents a pedestrian detection framework that combines
    depth estimation maps and semantic segmentation maps. It consists of two main
    components: a depth segmentation Region Proposal Network (ds-RPN)
    and a depth segmentation Region-based Convolutional Neural Network (ds-RCNN).
    We employ a Depth Input Network (DIN) that takes the depth maps as input and
    refines inaccurate depth estimation maps. Thereafter, a segmentation infusion
    network is invoked to infuse semantic features into the shared feature maps.
    Afterward, a fusion strategy is employed to effectively combine the shared
    feature maps, the semantic feature maps, and the depth maps. Finally, the
    combined feature maps are passed on to the ds-RPN and the ds-RCNN to perform
    pedestrian detection. Experimental results show that the proposed method
    achieves competitive results in terms of Miss Rate (MR) on the widely used
    Caltech dataset.


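    The data flow described in the abstract (DIN refinement, segmentation
    infusion, then fusion of the three feature-map streams) can be illustrated
    with a minimal NumPy sketch. All function names, channel counts, and the
    use of 1x1 pointwise projections are illustrative assumptions, not the
    thesis's actual layer definitions:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def conv1x1(x, w):
        # Pointwise (1x1) projection: x is (C_in, H, W), w is (C_out, C_in).
        return np.tensordot(w, x, axes=([1], [0]))

    def depth_input_network(depth, w):
        # DIN sketch: projects and rectifies the raw depth estimation map.
        return np.maximum(conv1x1(depth, w), 0.0)

    def infuse(shared, semantic, w):
        # Segmentation infusion sketch: semantic maps are projected to the
        # shared channel width and added into the shared feature maps.
        return shared + conv1x1(semantic, w)

    def fuse(shared, semantic, depth, w):
        # Fusion sketch: concatenate the three streams along the channel
        # axis, then mix them with a single 1x1 projection.
        stacked = np.concatenate([shared, semantic, depth], axis=0)
        return conv1x1(stacked, w)

    H, W = 8, 8
    shared   = rng.standard_normal((16, H, W))  # shared backbone features
    semantic = rng.standard_normal((4, H, W))   # semantic segmentation maps
    depth    = rng.standard_normal((1, H, W))   # monocular depth estimate

    depth_f  = depth_input_network(depth, rng.standard_normal((4, 1)))
    shared_f = infuse(shared, semantic, rng.standard_normal((16, 4)))
    fused    = fuse(shared_f, semantic, depth_f,
                    rng.standard_normal((32, 16 + 4 + 4)))
    print(fused.shape)  # (32, 8, 8): combined maps fed to ds-RPN / ds-RCNN
    ```

    In the actual framework the projections are learned convolutional layers
    and the fused maps feed both the ds-RPN proposal branch and the ds-RCNN
    classification branch; the sketch only shows the shape bookkeeping.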

    Table of Contents
    Abstract
    Acknowledgment
    Table of Contents
    List of Figures
    List of Tables
    List of Acronyms
    1 Introduction
    2 Related Work
      2.1 Object Detection
      2.2 Human Pose Estimation
      2.3 Face Detectors
      2.4 Semantic Segmentation
      2.5 Depth Estimation
    3 Proposed Method
      3.1 Overall Pipeline
      3.2 Depth Segmentation Region Proposal Network
      3.3 Depth Segmentation RCNN
      3.4 Semantic Segmentation Infusion Layer and Semantic Self-Attention Mechanism
      3.5 Depth Estimation Maps and Depth Input Network
      3.6 Multi-Scale Network
      3.7 Fusion Network
    4 Experimental Results
      4.1 Dataset
      4.2 Experimental Setup and Evaluation Protocol
      4.3 Ablation Studies
        4.3.1 The Impact of Semantic Segmentation Infusion Layer
        4.3.2 The Impact of Depth Estimation Maps
        4.3.3 The Impact of Semantic Self-Attention
        4.3.4 The Impact of Multi-Scale Features
      4.4 Comparisons with State-of-the-Art Works
    5 Conclusions and Future Works
      5.1 Conclusions
      5.2 Future Works
    Appendix A: Example images and depth images from the datasets
    References

    List of Figures
    3.1 Overall pipeline of the proposed method.
    3.2 The details of the ds-RPN architecture.
    3.3 The details of the ds-RCNN architecture.
    3.4 The details of the feature fusion network.
    4.1 The visualization of the first 16 feature maps of the proposal layers.
    4.2 Image examples of Caltech and their corresponding depth maps.
    4.3 Visualization of the detection suppression caused by the inaccurate depth maps.
    4.4 Visualization of some cases where the depth maps increase false alarm rates.
    4.5 Depth maps help to detect undetected pedestrians.
    4.6 Visualization of some cases where the depth maps can help suppress false positives.
    4.7 Comparison of feature map visualizations from the ds-RPN with and without the semantic self-attention mechanism.
    4.8 Comparison of feature map visualizations from lower layers with and without multi-scale feature maps.
    4.9 Comparison of feature map visualizations from the ds-RPN with and without multi-scale feature maps.
    4.10 Comparison of detection results with and without multi-scale feature maps.
    4.11 Comparison of MR vs. FPPI curves between our proposed method and the state-of-the-art methods using the Reasonable setting.
    4.12 Comparison of MR vs. FPPI curves between our proposed method and the state-of-the-art methods using the Occlusion=None setting.
    4.13 Comparison of MR vs. FPPI curves between our proposed method and the state-of-the-art methods using the Occlusion=Partial setting.
    4.14 Comparison of MR vs. FPPI curves between our proposed method and the state-of-the-art methods using the Occlusion=Heavy setting.
    4.15 Snapshots of successful detections by our proposed method.
    4.16 Snapshots of failed detections by our proposed method.
    4.17 Comparison of detection results between our proposed method and [1].
    4.18 Comparison of detection results between our proposed method and [2].
    5.1 Snapshots of the Caltech Pedestrian dataset.
    5.2 The visualization of the depth maps generated by monodepth [3] on the Caltech dataset.

    List of Tables
    4.1 Pedestrian detection results with various combinations of strategies on the Caltech dataset. The best results are marked in bold.
    4.2 Performance comparison of our proposed method with the state-of-the-art methods. The best results are marked in bold.

    References
    [1] X. Zhang, L. Cheng, B. Li, and H. Hu, "Too Far to See? Not Really! Pedestrian Detection With Scale-Aware Localization Policy," IEEE Transactions on Image Processing, vol. 27, pp. 3703-3715, Aug 2018.
    [2] G. Brazil, X. Yin, and X. Liu, "Illuminating Pedestrians via Simultaneous Detection and Segmentation," in Proceedings of the IEEE International Conference on Computer Vision, pp. 4960-4969, 2017.
    [3] C. Godard, O. Mac Aodha, and G. J. Brostow, "Unsupervised Monocular Depth Estimation with Left-Right Consistency," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 270-279, 2017.
    [4] P. Dollár, C. Wojek, B. Schiele, and P. Perona, "Pedestrian Detection: An Evaluation of the State of the Art," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, pp. 743-761, April 2012.
    [5] L. Zhang, L. Lin, X. Liang, and K. He, "Is Faster R-CNN Doing Well for Pedestrian Detection?," in Proceedings of the European Conference on Computer Vision, pp. 443-457, 2016.
    [6] J. Li, X. Liang, S. Shen, T. Xu, J. Feng, and S. Yan, "Scale-Aware Fast R-CNN for Pedestrian Detection," IEEE Transactions on Multimedia, vol. 20, no. 4, pp. 985-996, 2017.
    [7] Z. Cai, Q. Fan, R. S. Feris, and N. Vasconcelos, "A Unified Multi-scale Deep Convolutional Neural Network for Fast Object Detection," in Proceedings of the European Conference on Computer Vision, pp. 354-370, 2016.
    [8] X. Du, M. El-Khamy, J. Lee, and L. Davis, "Fused DNN: A Deep Neural Network Fusion Approach to Fast and Robust Pedestrian Detection," in Proceedings of the IEEE Winter Conference on Applications of Computer Vision, pp. 953-961, 2017.
    [9] T. Song, L. Sun, D. Xie, H. Sun, and S. Pu, "Small-Scale Pedestrian Detection Based on Topological Line Localization and Temporal Feature Aggregation," in Proceedings of the European Conference on Computer Vision, pp. 554-569, 2018.
    [10] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, "Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7291-7299, 2017.
    [11] C. Zhou, M. Wu, and S.-K. Lam, "SSA-CNN: Semantic Self-Attention CNN for Pedestrian Detection," arXiv preprint arXiv:1902.09080, 2019.
    [12] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137-1149, 2016.
    [13] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C.-Y. Fu, and A. C. Berg, "SSD: Single Shot MultiBox Detector," in Proceedings of the European Conference on Computer Vision, 2016.
    [14] J. Redmon and A. Farhadi, "YOLO9000: Better, Faster, Stronger," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7263-7271, 2017.
    [15] J. Redmon and A. Farhadi, "YOLOv3: An Incremental Improvement," arXiv preprint arXiv:1804.02767, 2018.
    [16] X. Zhou, D. Wang, and P. Krähenbühl, "Objects as Points," arXiv preprint arXiv:1904.07850, 2019.
    [17] X. Zhou, J. Zhuo, and P. Krähenbühl, "Bottom-Up Object Detection by Grouping Extreme and Center Points," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 850-859, 2019.
    [18] H. Law and J. Deng, "CornerNet: Detecting Objects as Paired Keypoints," in Proceedings of the European Conference on Computer Vision, pp. 734-750, 2018.
    [19] A. Newell, K. Yang, and J. Deng, "Stacked Hourglass Networks for Human Pose Estimation," in Proceedings of the European Conference on Computer Vision, pp. 483-499, 2016.
    [20] H. Fang, S. Xie, Y. Tai, and C. Lu, "RMPE: Regional Multi-person Pose Estimation," in Proceedings of the International Conference on Computer Vision, pp. 2353-2362, Oct 2017.
    [21] P. Hu and D. Ramanan, "Finding Tiny Faces," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 951-959, 2017.
    [22] M. Najibi, P. Samangouei, R. Chellappa, and L. S. Davis, "SSH: Single Stage Headless Face Detector," in Proceedings of the IEEE International Conference on Computer Vision, pp. 4875-4884, 2017.
    [23] Y. Bai and B. Ghanem, "Multi-Branch Fully Convolutional Network for Face Detection," arXiv preprint arXiv:1707.06330, 2017.
    [24] Y. Bai, Y. Zhang, M. Ding, and B. Ghanem, "Finding Tiny Faces in the Wild with Generative Adversarial Network," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 21-30, 2018.
    [25] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative Adversarial Nets," in Proceedings of the Neural Information Processing Systems, pp. 2672-2680, 2014.
    [26] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, "Pyramid Scene Parsing Network," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6230-6239, 2017.
    [27] P. Bilinski and V. Prisacariu, "Dense Decoder Shortcut Connections for Single-Pass Semantic Segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, June 2018.
    [28] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, "Aggregated Residual Transformations for Deep Neural Networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1492-1500, 2017.
    [29] C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang, "BiSeNet: Bilateral Segmentation Network for Real-Time Semantic Segmentation," in Proceedings of the European Conference on Computer Vision, pp. 325-341, 2018.
    [30] D. Xu, W. Wang, H. Tang, H. Liu, N. Sebe, and E. Ricci, "Structured Attention Guided Convolutional Neural Fields for Monocular Depth Estimation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3917-3925, 2018.
    [31] C. Godard, O. Mac Aodha, M. Firman, and G. Brostow, "Digging into Self-Supervised Monocular Depth Estimation," arXiv preprint arXiv:1806.01260, 2018.
    [32] K. Simonyan and A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition," arXiv preprint arXiv:1409.1556, 2014.
    [33] C. Lin, J. Lu, G. Wang, and J. Zhou, "Graininess-Aware Deep Feature Learning for Pedestrian Detection," in Proceedings of the European Conference on Computer Vision, September 2018.
    [34] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, "The Cityscapes Dataset for Semantic Urban Scene Understanding," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3213-3223, 2016.
    [35] J. Uhrig, N. Schneider, L. Schneider, U. Franke, T. Brox, and A. Geiger, "Sparsity Invariant CNNs," in Proceedings of the International Conference on 3D Vision, 2017.
    [36] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional Architecture for Fast Feature Embedding," arXiv preprint arXiv:1408.5093, 2014.
    [37] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei, "ImageNet: A Large-Scale Hierarchical Image Database," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 248-255, 2009.
    [38] W. Ouyang, H. Zhou, H. Li, Q. Li, J. Yan, and X. Wang, "Jointly Learning Deep Features, Deformable Parts, Occlusion and Classification for Pedestrian Detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, pp. 1874-1887, Aug 2018.
