
Author: Arren Matthew C. Antioquia
Thesis Title: Bigger is Not Better: Towards Faster Multi-Scale Object Detectors
Advisor: Kai-Lung Hua (花凱龍)
Committee: Chao-Lung Yang (楊朝龍)
Hsing-Kuo Pao (鮑興國)
Arnulfo Azcarraga
Chuan-Kai Yang (楊傳凱)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Thesis Publication Year: 2019
Graduation Academic Year: 107
Language: English
Pages: 51
Keywords: Object Detection, Feature Fusion, Object Recognition, Convolutional Neural Networks, Deep Learning
  • Despite recent improvements, the arbitrary sizes of objects still impede the predictive ability of object detectors. Recent solutions combine feature maps of different receptive fields to detect multi-scale objects. However, these methods incur large computational costs that slow inference, making them impractical for real-time applications. Moreover, fusion methods that depend on large networks with many skip connections require a larger memory footprint, which prohibits their use on devices with limited memory. In this paper, we propose a simpler, novel fusion method that integrates multiple feature maps using a single concatenation operation. Our method adapts flexibly to any base network, allowing performance to be tailored to different computational requirements. Our approach achieves 81.7% mAP at 41 FPS on the PASCAL VOC dataset using ResNet-50 as the base network, which is superior in both speed and mAP to several state-of-the-art baselines that use larger base networks.
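
The idea of fusing feature maps with a single concatenation can be illustrated with a minimal sketch. This is not the thesis implementation: the abstract does not say how maps of different resolutions are brought to a common size, so the nearest-neighbour upsampling used here is an assumption, and the names upsample_nearest, fuse_by_concatenation, and shape are hypothetical. Feature maps are plain nested lists shaped [channels][height][width], and spatial sizes are assumed to be integer multiples of one another.

```python
# Illustrative sketch only; not the method described in the thesis.
# A feature map is a nested list shaped [channels][height][width].


def upsample_nearest(fmap, factor):
    """Enlarge every channel by an integer factor (assumed resizing step)."""
    out = []
    for channel in fmap:
        rows = []
        for row in channel:
            # Repeat each value horizontally, then repeat the row vertically.
            wide = [value for value in row for _ in range(factor)]
            rows.extend(list(wide) for _ in range(factor))
        out.append(rows)
    return out


def fuse_by_concatenation(fmaps):
    """Resize all maps to the largest height, then concatenate channels once."""
    target = max(len(f[0]) for f in fmaps)  # height of the largest map
    resized = []
    for f in fmaps:
        height = len(f[0])
        if target % height != 0:
            raise ValueError("spatial sizes must be integer multiples")
        resized.append(upsample_nearest(f, target // height))
    # The single concatenation: stack all channels along the channel axis.
    return [channel for f in resized for channel in f]


def shape(fmap):
    """Return (channels, height, width) of a nested-list feature map."""
    return (len(fmap), len(fmap[0]), len(fmap[0][0]))


# A shallow, high-resolution map (1 x 4 x 4) and a deeper, coarser one (2 x 2 x 2).
shallow = [[[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]]
deep = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]

fused = fuse_by_concatenation([shallow, deep])
print(shape(fused))  # (3, 4, 4)
```

In a real detector the same pattern would normally be written with a tensor library, and the fused map would feed the detection layers; the sketch only shows why one concatenation is cheap compared with a chain of pairwise merges.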


    Abstract
    Acknowledgements
    Contents
    List of Figures
    List of Tables
    1 Introduction
    2 Related Work
      2.1 Convolutional Neural Networks
        2.1.1 AlexNet
        2.1.2 VGGNet
        2.1.3 ResNet
        2.1.4 DenseNet
      2.2 Object Detection
        2.2.1 R-CNN
        2.2.2 Fast R-CNN
        2.2.3 Faster R-CNN
        2.2.4 YOLO
        2.2.5 YOLOv2
        2.2.6 SSD
        2.2.7 DSSD
        2.2.8 STDN
      2.3 Feature Fusion
    3 Method
      3.1 Base Network
      3.2 Fusion Module
      3.3 Detection Module
        3.3.1 Feature Pyramid
        3.3.2 Subnetworks
      3.4 Training
        3.4.1 Anchor Boxes
        3.4.2 Matching Strategy
        3.4.3 Hard Negative Mining
        3.4.4 Data Augmentation
        3.4.5 Training Objective
    4 Results and Analysis
      4.1 PASCAL VOC Dataset
      4.2 Implementation Details
      4.3 Results on PASCAL VOC 2007
        4.3.1 Mean Average Precision (mAP)
        4.3.2 Frames Per Second (FPS)
        4.3.3 Tradeoff between mAP and FPS
    5 Conclusion
    References

