
Author: Albert Christianto
Thesis Title: Pedestrian Detection Using Depth Estimation Maps and Semantic Segmentation (應用深度估計與語義分割進行行人偵測)
Advisors: Wen-Hsien Fang (方文賢), Yie-Tarng Chen (陳郁堂)
Committee Members: Gee-Sern Hsu (徐繼聖), Kuen-Tsair Lay (賴坤財), Chien-Ching Chiu (丘建青)
Degree: Doctor
Department: College of Electrical Engineering and Computer Science - Department of Electronic and Computer Engineering
Publication Year: 2019
Graduation Academic Year: 107
Language: English
Pages: 58
Keywords: Depth estimation maps, fusion network, multi-scale, pedestrian detection, semantic segmentation maps
Views: 245; Downloads: 17
    This thesis presents a pedestrian detection framework that combines
    depth estimation maps and semantic segmentation maps. It consists of two main
    components: a depth segmentation Region Proposal Network (ds-RPN)
    and a depth segmentation Region-based Convolutional Neural Network (ds-RCNN).
    We employ a Depth Input Network (DIN) that takes the depth maps as input and
    refines inaccurate depth estimation maps. Thereafter, a segmentation infusion
    network is invoked to infuse semantic features into the shared feature maps.
    Afterward, a fusion strategy is employed to effectively combine the shared
    feature maps, the semantic feature maps, and the depth maps. Finally, the
    combined feature maps are passed on to the ds-RPN and the ds-RCNN to perform
    pedestrian detection. Experimental results show that the proposed method
    achieves competitive results in terms of Miss Rate (MR) on the widely used
    Caltech dataset.


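    The data flow described in the abstract (DIN refinement, segmentation
    infusion, then fusion of the three feature-map streams) can be illustrated
    with a minimal NumPy sketch. All function names, channel counts, and the
    use of 1x1 pointwise projections are illustrative assumptions, not the
    thesis's actual layer definitions:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def conv1x1(x, w):
        # Pointwise (1x1) projection: x is (C_in, H, W), w is (C_out, C_in).
        return np.tensordot(w, x, axes=([1], [0]))

    def depth_input_network(depth, w):
        # DIN sketch: projects and rectifies the raw depth estimation map.
        return np.maximum(conv1x1(depth, w), 0.0)

    def infuse(shared, semantic, w):
        # Segmentation infusion sketch: semantic maps are projected to the
        # shared channel width and added into the shared feature maps.
        return shared + conv1x1(semantic, w)

    def fuse(shared, semantic, depth, w):
        # Fusion sketch: concatenate the three streams along the channel
        # axis, then mix them with a single 1x1 projection.
        stacked = np.concatenate([shared, semantic, depth], axis=0)
        return conv1x1(stacked, w)

    H, W = 8, 8
    shared   = rng.standard_normal((16, H, W))  # shared backbone features
    semantic = rng.standard_normal((4, H, W))   # semantic segmentation maps
    depth    = rng.standard_normal((1, H, W))   # monocular depth estimate

    depth_f  = depth_input_network(depth, rng.standard_normal((4, 1)))
    shared_f = infuse(shared, semantic, rng.standard_normal((16, 4)))
    fused    = fuse(shared_f, semantic, depth_f,
                    rng.standard_normal((32, 16 + 4 + 4)))
    print(fused.shape)  # (32, 8, 8): combined maps fed to ds-RPN / ds-RCNN
    ```

    In the actual framework the projections are learned convolutional layers
    and the fused maps feed both the ds-RPN proposal branch and the ds-RCNN
    classification branch; the sketch only shows the shape bookkeeping.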

    Table of Contents
    Abstract
    Acknowledgment
    Table of Contents
    List of Figures
    List of Tables
    List of Acronyms
    1 Introduction
    2 Related Work
      2.1 Object Detection
      2.2 Human Pose Estimation
      2.3 Face Detectors
      2.4 Semantic Segmentation
      2.5 Depth Estimation
    3 Proposed Method
      3.1 Overall Pipeline
      3.2 Depth Segmentation Region Proposal Network
      3.3 Depth Segmentation RCNN
      3.4 Semantic Segmentation Infusion Layer and Semantic Self-Attention Mechanism
      3.5 Depth Estimation Maps and Depth Input Network
      3.6 Multi-Scale Network
      3.7 Fusion Network
    4 Experimental Results
      4.1 Dataset
      4.2 Experimental Setup and Evaluation Protocol
      4.3 Ablation Studies
        4.3.1 The Impact of Semantic Segmentation Infusion Layer
        4.3.2 The Impact of Depth Estimation Maps
        4.3.3 The Impact of Semantic Self-Attention
        4.3.4 The Impact of Multi-Scale Features
      4.4 Comparisons with State-of-the-Art Works
    5 Conclusions and Future Works
      5.1 Conclusions
      5.2 Future Works
    Appendix A: Example images and depth images from the datasets
    References

    List of Figures
    3.1 Overall pipeline of the proposed method.
    3.2 The details of the ds-RPN architecture.
    3.3 The details of the ds-RCNN architecture.
    3.4 The details of the feature fusion network.
    4.1 The visualization of the first 16 feature maps of the proposal layers.
    4.2 Image examples of Caltech and their corresponding depth maps.
    4.3 Visualization of the detection suppression caused by the inaccurate depth maps.
    4.4 Visualization of some cases where the depth maps increase false alarm rates.
    4.5 Depth maps help to detect undetected pedestrians.
    4.6 Visualization of some cases where the depth maps can help suppress false positives.
    4.7 Comparison of feature map visualizations from the ds-RPN with and without the semantic self-attention mechanism.
    4.8 Comparison of feature map visualizations from lower layers with and without multi-scale feature maps.
    4.9 Comparison of feature map visualizations from the ds-RPN with and without multi-scale feature maps.
    4.10 Comparison of detection results with and without multi-scale feature maps.
    4.11 Comparison of MR vs. FPPI curves between our proposed method and the state-of-the-art methods using the Reasonable setting.
    4.12 Comparison of MR vs. FPPI curves between our proposed method and the state-of-the-art methods using the Occlusion=None setting.
    4.13 Comparison of MR vs. FPPI curves between our proposed method and the state-of-the-art methods using the Occlusion=Partial setting.
    4.14 Comparison of MR vs. FPPI curves between our proposed method and the state-of-the-art methods using the Occlusion=Heavy setting.
    4.15 Snapshots of successful detections by our proposed method.
    4.16 Snapshots of failed detections by our proposed method.
    4.17 Comparison of detection results between our proposed method and [1].
    4.18 Comparison of detection results between our proposed method and [2].
    5.1 Snapshots of the Caltech Pedestrian dataset.
    5.2 The visualization of the depth maps generated by monodepth [3] on the Caltech dataset.

    List of Tables
    4.1 Pedestrian detection results with various combinations of strategies on the Caltech dataset. The best results are marked in bold.
    4.2 Performance comparison of our proposed method with the state-of-the-art methods. The best results are marked in bold.

    References
    [1] X. Zhang, L. Cheng, B. Li, and H. Hu, "Too Far to See? Not Really! Pedestrian Detection With Scale-Aware Localization Policy," IEEE Transactions on Image Processing, vol. 27, pp. 3703-3715, Aug 2018.
    [2] G. Brazil, X. Yin, and X. Liu, "Illuminating Pedestrians via Simultaneous Detection and Segmentation," in Proceedings of the IEEE International Conference on Computer Vision, pp. 4960-4969, 2017.
    [3] C. Godard, O. Mac Aodha, and G. J. Brostow, "Unsupervised Monocular Depth Estimation with Left-Right Consistency," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 270-279, 2017.
    [4] P. Dollár, C. Wojek, B. Schiele, and P. Perona, "Pedestrian Detection: An Evaluation of the State of the Art," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, pp. 743-761, April 2012.
    [5] L. Zhang, L. Lin, X. Liang, and K. He, "Is Faster R-CNN Doing Well for Pedestrian Detection?," in Proceedings of the European Conference on Computer Vision, pp. 443-457, 2016.
    [6] J. Li, X. Liang, S. Shen, T. Xu, J. Feng, and S. Yan, "Scale-Aware Fast R-CNN for Pedestrian Detection," IEEE Transactions on Multimedia, vol. 20, no. 4, pp. 985-996, 2017.
    [7] Z. Cai, Q. Fan, R. S. Feris, and N. Vasconcelos, "A Unified Multi-scale Deep Convolutional Neural Network for Fast Object Detection," in Proceedings of the European Conference on Computer Vision, pp. 354-370, 2016.
    [8] X. Du, M. El-Khamy, J. Lee, and L. Davis, "Fused DNN: A Deep Neural Network Fusion Approach to Fast and Robust Pedestrian Detection," in Proceedings of the IEEE Winter Conference on Applications of Computer Vision, pp. 953-961, 2017.
    [9] T. Song, L. Sun, D. Xie, H. Sun, and S. Pu, "Small-Scale Pedestrian Detection Based on Topological Line Localization and Temporal Feature Aggregation," in Proceedings of the European Conference on Computer Vision, pp. 554-569, 2018.
    [10] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, "Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7291-7299, 2017.
    [11] C. Zhou, M. Wu, and S.-K. Lam, "SSA-CNN: Semantic Self-Attention CNN for Pedestrian Detection," arXiv preprint arXiv:1902.09080, 2019.
    [12] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137-1149, 2016.
    [13] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C.-Y. Fu, and A. C. Berg, "SSD: Single Shot MultiBox Detector," in Proceedings of the European Conference on Computer Vision, 2016.
    [14] J. Redmon and A. Farhadi, "YOLO9000: Better, Faster, Stronger," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7263-7271, 2017.
    [15] J. Redmon and A. Farhadi, "YOLOv3: An Incremental Improvement," arXiv preprint arXiv:1804.02767, 2018.
    [16] X. Zhou, D. Wang, and P. Krähenbühl, "Objects as Points," arXiv preprint arXiv:1904.07850, 2019.
    [17] X. Zhou, J. Zhuo, and P. Krähenbühl, "Bottom-Up Object Detection by Grouping Extreme and Center Points," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 850-859, 2019.
    [18] H. Law and J. Deng, "CornerNet: Detecting Objects as Paired Keypoints," in Proceedings of the European Conference on Computer Vision, pp. 734-750, 2018.
    [19] A. Newell, K. Yang, and J. Deng, "Stacked Hourglass Networks for Human Pose Estimation," in Proceedings of the European Conference on Computer Vision, pp. 483-499, 2016.
    [20] H. Fang, S. Xie, Y. Tai, and C. Lu, "RMPE: Regional Multi-person Pose Estimation," in Proceedings of the International Conference on Computer Vision, pp. 2353-2362, Oct 2017.
    [21] P. Hu and D. Ramanan, "Finding Tiny Faces," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 951-959, 2017.
    [22] M. Najibi, P. Samangouei, R. Chellappa, and L. S. Davis, "SSH: Single Stage Headless Face Detector," in Proceedings of the IEEE International Conference on Computer Vision, pp. 4875-4884, 2017.
    [23] Y. Bai and B. Ghanem, "Multi-Branch Fully Convolutional Network for Face Detection," arXiv preprint arXiv:1707.06330, 2017.
    [24] Y. Bai, Y. Zhang, M. Ding, and B. Ghanem, "Finding Tiny Faces in the Wild with Generative Adversarial Network," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 21-30, 2018.
    [25] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative Adversarial Nets," in Proceedings of the Neural Information Processing Systems, pp. 2672-2680, 2014.
    [26] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, "Pyramid Scene Parsing Network," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6230-6239, 2017.
    [27] P. Bilinski and V. Prisacariu, "Dense Decoder Shortcut Connections for Single-Pass Semantic Segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, June 2018.
    [28] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, "Aggregated Residual Transformations for Deep Neural Networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1492-1500, 2017.
    [29] C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang, "BiSeNet: Bilateral Segmentation Network for Real-Time Semantic Segmentation," in Proceedings of the European Conference on Computer Vision, pp. 325-341, 2018.
    [30] D. Xu, W. Wang, H. Tang, H. Liu, N. Sebe, and E. Ricci, "Structured Attention Guided Convolutional Neural Fields for Monocular Depth Estimation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3917-3925, 2018.
    [31] C. Godard, O. Mac Aodha, M. Firman, and G. Brostow, "Digging into Self-Supervised Monocular Depth Estimation," arXiv preprint arXiv:1806.01260, 2018.
    [32] K. Simonyan and A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition," arXiv preprint arXiv:1409.1556, 2014.
    [33] C. Lin, J. Lu, G. Wang, and J. Zhou, "Graininess-Aware Deep Feature Learning for Pedestrian Detection," in Proceedings of the European Conference on Computer Vision, September 2018.
    [34] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, "The Cityscapes Dataset for Semantic Urban Scene Understanding," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3213-3223, 2016.
    [35] J. Uhrig, N. Schneider, L. Schneider, U. Franke, T. Brox, and A. Geiger, "Sparsity Invariant CNNs," in Proceedings of the International Conference on 3D Vision, 2017.
    [36] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional Architecture for Fast Feature Embedding," arXiv preprint arXiv:1408.5093, 2014.
    [37] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei, "ImageNet: A Large-Scale Hierarchical Image Database," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 248-255, 2009.
    [38] W. Ouyang, H. Zhou, H. Li, Q. Li, J. Yan, and X. Wang, "Jointly Learning Deep Features, Deformable Parts, Occlusion and Classification for Pedestrian Detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, pp. 1874-1887, Aug 2018.
