
Graduate student: 陳俊翰 (Jiun-Han Chen)
Thesis title: Monocular 3D Object Detection utilizing Auxiliary Learning with Deformable Convolutions and Gradient-Flow-Enhanced Nonlinear Function
(Chinese title: 利用可變形卷積與梯度流增強非線性函數之基於輔助學習的單目鏡頭三維物件偵測)
Advisors: 阮聖彰 (Shanq-Jang Ruan), 林昌鴻 (Chang-Hong Lin)
Oral defense committee: 阮聖彰 (Shanq-Jang Ruan), 林昌鴻 (Chang-Hong Lin), 呂政修 (Jenq-Shiou Leu), 彭文志 (Wen-Chih Peng)
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Electronic and Computer Engineering
Year of publication: 2023
Graduation academic year: 111 (ROC calendar, i.e., 2022–2023)
Language: English
Number of pages: 68
Keywords (Chinese, translated): 3D object detection, monocular camera, driving scene understanding, auxiliary learning, deep learning
Keywords (English): deep learning, monocular 3D object detection, driving scene understanding, auxiliary learning, autonomous driving
  • In autonomous driving systems, 3D object detection algorithms play a critical role, and the safety of a self-driving car depends on a well-designed detection system. Developing robust and efficient 3D object detection algorithms is therefore an important goal for many researchers, institutions, and companies. Compared with stereo-based and LiDAR-based detection methods, monocular 3D object detection infers complex 3D attributes from 2D information alone, which lowers its computational cost and gives it great potential; however, the lack of 3D depth information limits the performance of monocular methods. We therefore propose an effective, end-to-end monocular 3D object detection model that requires no external training data. Inspired by auxiliary learning, we use a robust backbone for feature extraction together with multiple auxiliary heads that learn auxiliary knowledge; these auxiliary heads are removed after training to improve inference efficiency, allowing the model to exploit auxiliary learning and learn key features more effectively. On the moderate level of the Car category, the proposed method reaches 17.28% and 20.10% on the KITTI test set and validation set, respectively, surpassing previous monocular 3D object detection methods.


    In autonomous driving systems, 3D object detection algorithms are an essential component, and the safety of self-driving cars relies on well-designed detection systems. Developing robust and efficient 3D object detection algorithms is therefore an important goal for researchers, institutions, and companies. Compared with stereo-based and LiDAR-based detection methods, monocular camera-based 3D object detection uses only 2D information to infer complex 3D attributes, which reduces its computational cost. However, the lack of depth information limits the performance of monocular camera-based methods. We therefore propose an effective, end-to-end monocular 3D object detection model that does not require external training data. Inspired by auxiliary learning, we use a robust feature extractor and multiple auxiliary heads that learn auxiliary knowledge. These auxiliary heads are removed after training to improve inference efficiency, enabling the model to learn key features effectively through auxiliary supervision. The proposed method achieves 17.28% and 20.10% on the moderate level of the Car category on the KITTI test set and validation set, respectively, surpassing previous monocular 3D object detection methods.
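    The core idea in the abstract, attaching auxiliary regression heads during training and discarding them before deployment, can be sketched in a few lines of PyTorch. This is a minimal illustration only: the names (AuxiliaryLearningDetector, strip_auxiliary_heads, the backbone and head modules) are hypothetical placeholders rather than the thesis's actual implementation; only the pattern of training-time-only heads follows the description above.

    import torch
    import torch.nn as nn

    class AuxiliaryLearningDetector(nn.Module):
        """Sketch: a shared backbone feeds one main regression head plus
        several auxiliary heads; the auxiliary heads exist only to shape
        the backbone's features during training."""

        def __init__(self, backbone: nn.Module, main_head: nn.Module,
                     aux_heads: nn.ModuleDict):
            super().__init__()
            self.backbone = backbone      # e.g. a DLA-34-style feature extractor
            self.main_head = main_head    # main 3D detection outputs
            self.aux_heads = aux_heads    # sub-task heads used only in training

        def forward(self, images: torch.Tensor):
            feats = self.backbone(images)
            outputs = {"main": self.main_head(feats)}
            if self.training:             # auxiliary outputs only while training
                for name, head in self.aux_heads.items():
                    outputs[name] = head(feats)
            return outputs

        def strip_auxiliary_heads(self):
            # After training, drop the auxiliary heads so the deployed model
            # pays no extra inference cost for having been trained with them.
            self.aux_heads = nn.ModuleDict()
            return self

    During training, the losses computed from the auxiliary outputs would be weighted and summed with the main detection loss; calling strip_auxiliary_heads() (or simply running the model in eval mode) leaves the inference path identical to that of a detector trained without auxiliary supervision.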

    Table of Contents
    摘要 (Chinese Abstract)
    Abstract
    Acknowledgments
    Table of Contents
    List of Figures
    List of Tables
    Chapter 1  Introduction
    Chapter 2  Related Works
      2.1 Auxiliary Learning
      2.2 Monocular 3D Object Detection
        2.2.1 Monocular 3D Object Detection without Additional Information
        2.2.2 Monocular 3D Object Detection with Additional Information
    Chapter 3  Monocular 3D Object Detection Utilizing Auxiliary Learning
      3.1 Framework Overview
      3.2 Backbone
      3.3 Regression Heads
      3.4 Auxiliary Regression Heads
      3.5 Loss Function
      Results
      3.6 3D Object Detection Dataset
      3.7 Setup
      3.8 Comparison with Prior Methods
      3.9 Ablation Study
    Chapter 4
      Limitation
      Discussion
      Conclusions
    References

    List of Figures
    Figure 1-1. Different types of sensors.
    Figure 1-2. An example of a monocular 3D object detection approach.
    Figure 1-3. The KITTI dataset is extensively employed as a benchmark dataset to assess computer vision algorithms pertaining to autonomous driving. It comprises a vast collection of real-world road scenes captured by a vehicle (positioned on the left) utilizing diverse sensors, including a high-resolution camera, a LiDAR sensor, and GPS/IMU inertial measurement units. This dataset offers a wealth of data, including color images (positioned at the bottom-right) and point-cloud data (positioned at the top-right), which serve as valuable resources for researchers and institutions working on the advancement of autonomous driving systems.
    Figure 2-1. An example of auxiliary learning. When input data is processed by the backbone, the main task can be solved by the regression head (located at the top-right). Meanwhile, auxiliary regression heads (located at the bottom-right) can help the learning process by solving sub-tasks.
    Figure 2-2. There are two categories of training for monocular 3D object detection models. Monocular 3D object detection with additional information (located at the bottom) usually requires external data such as depth maps, CAD models, LiDAR, and stereo image pairs to be involved in the training process. In contrast, monocular 3D object detection (located at the top) only requires images to complete the training.
    Figure 2-3. CenterNet [10] models objects as points.
    Figure 2-4. Illustration of MonoDTR [17]. Initially, the input image undergoes feature extraction using the backbone. The Depth-Aware Feature Enhancement (DFE) module enhances the features by incorporating depth information under auxiliary supervision. Simultaneously, convolution layers extract context-aware features in parallel. Subsequently, the Depth-Aware Transformer (DTR) module combines the two types of features, and the Depth Positional Encoding (DPE) module adds depth positional hints to the transformer. Ultimately, the detection head predicts the 3D bounding boxes.
    Figure 2-5. DeepMANTA [19] utilizes CAD models to aid in its operations. DeepMANTA matches the features to a CAD model from its database (located at the top). This matching process produces a 3D car template with vertices (located at the bottom), which helps facilitate the output of bounding boxes (located in the middle).
    Figure 2-6. Illustration of D4LCN [22]. Initially, the depth map is estimated from the RGB image and utilized alongside the RGB image as input for a two-branch network. Subsequently, the depth-guided filtering module is employed to fuse the two information sources within each residual block.
    Figure 2-7. Illustration of MonoDistill [27]. The process begins by generating "image-like" LiDAR maps from the LiDAR signals. These maps serve as the input for both the teacher model and the student model, which employ identical network architectures. Subsequently, the researchers introduce three distillation schemes to train the student model. These schemes leverage the knowledge and guidance from the well-trained teacher network during the training process.
    Figure 3-1. Overall architecture of the proposed model. The proposed architecture is composed of a robust DLA-34 with the ReLU activation function replaced by the Mish activation function. Also, multiple regression heads are adopted for learning different information.
    Figure 3-2. Comparison of the Mish and ReLU activation functions.
    Figure 3-3. Each regression head is composed of two convolutional layers. The first layer is a 3×3 convolutional layer that utilizes a Mish activation function and attentive normalization. The second layer is a 1×1 convolutional layer. The number of output channels for each layer is determined by the specific task of the regression head. (A minimal code sketch of this head follows the list of tables below.)
    Figure 3-4. Qualitative results produced by the proposed 3D object detection model.
    Figure 3-5. Qualitative results produced by the proposed 3D object detection model.
    Figure 3-6. Qualitative results produced by the proposed 3D object detection model.
    Figure 3-7. Qualitative results produced by the proposed 3D object detection model.
    Figure 3-8. Qualitative results produced by the proposed 3D object detection model.
    Figure 3-9. Qualitative results produced by the proposed 3D object detection model.

    List of Tables
    Table 3-1. Hyperparameters adopted for training the proposed model.
    Table 3-2. Results on the KITTI test set for the Car category.
    Table 3-3. Results on the KITTI test split for the Pedestrian and Cyclist categories.
    Table 3-4. Results of ablation studies on different components in the proposed method.
    Table 3-5. Results of ablation studies on each auxiliary regression head.
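    As a concrete reading of the regression-head layout described for Figure 3-3, the sketch below stacks a 3×3 convolution with normalization and a Mish activation, followed by a task-specific 1×1 convolution. It is a hedged approximation: attentive normalization [32] is replaced by plain BatchNorm2d to keep the example self-contained, and the channel widths and feature-map sizes are illustrative assumptions rather than values taken from the thesis. Mish [31] itself is x · tanh(softplus(x)), a smooth, non-monotonic function that appears to be the "gradient-flow-enhanced nonlinear function" referred to in the title.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def mish(x: torch.Tensor) -> torch.Tensor:
        # Mish [31]: x * tanh(softplus(x)); unlike ReLU it keeps a small
        # gradient for negative inputs, which helps gradient flow.
        return x * torch.tanh(F.softplus(x))

    class RegressionHead(nn.Module):
        """Two-layer head following Figure 3-3: a 3x3 convolution with
        normalization and Mish, then a 1x1 convolution whose output width
        matches the task (e.g. 3 channels for 3D object dimensions)."""

        def __init__(self, in_channels: int, out_channels: int,
                     hidden_channels: int = 256):
            super().__init__()
            self.conv1 = nn.Conv2d(in_channels, hidden_channels,
                                   kernel_size=3, padding=1, bias=False)
            # Stand-in for attentive normalization [32] in this sketch.
            self.norm = nn.BatchNorm2d(hidden_channels)
            self.conv2 = nn.Conv2d(hidden_channels, out_channels, kernel_size=1)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.conv2(mish(self.norm(self.conv1(x))))

    # Illustrative usage: regress 3D dimensions from 64-channel features.
    head = RegressionHead(in_channels=64, out_channels=3)
    out = head(torch.randn(1, 64, 96, 320))   # output shape: (1, 3, 96, 320)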

    References
    [1] L. Zhang, M. Yu, T. Chen, Z. Shi, C. Bao, and K. Ma, “Auxiliary training: Towards
    accurate and robust models,” in Proceedings of the IEEE/CVF conference on
    computer vision and pattern recognition, 2020, pp. 372–381.
    [2] L. Liebel and M. Körner, “Auxiliary tasks in multi-task learning,” arXiv preprint arXiv:1805.06334, 2018.
    [3] S. Liu, A. Davison, and E. Johns, “Self-supervised generalization with meta
    auxiliary learning,” Advances in Neural Information Processing Systems, vol. 32,
    2019.
    [4] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the kitti
    vision benchmark suite,” in 2012 IEEE conference on computer vision and pattern
    recognition. IEEE, 2012, pp. 3354–3361.
    [5] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed
    representations of words and phrases and their compositionality,” Advances in
    neural information processing systems, vol. 26, 2013.
    [6] Y. Tang, J. Pino, X. Li, C. Wang, and D. Genzel, “Improving speech translation by
    understanding and learning from the auxiliary text translation task,” arXiv preprint
    arXiv:2107.05782, 2021.
    [7] Y. Burda, H. Edwards, D. Pathak, A. Storkey, T. Darrell, and A. A. Efros, “Large-scale study of curiosity-driven learning,” arXiv preprint arXiv:1808.04355, 2018.
    [8] A. Mousavian, D. Anguelov, J. Flynn, and J. Kosecka, “3d bounding box estimation
    using deep learning and geometry,” in Proceedings of the IEEE conference on
    Computer Vision and Pattern Recognition, 2017, pp. 7074–7082.
    [9] Z. Liu, Z. Wu, and R. Tóth, “Smoke: Single-stage monocular 3d object detection via keypoint estimation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 996–997.
    [10] K. Duan, S. Bai, L. Xie, H. Qi, Q. Huang, and Q. Tian, “Centernet: Keypoint
    triplets for object detection,” in Proceedings of the IEEE/CVF international
    conference on computer vision, 2019, pp. 6569–6578.
    [11] A. Simonelli, S. R. Bulò, L. Porzi, M. López-Antequera, and P. Kontschieder, “Disentangling monocular 3d object detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 1991–1999.
    [12] Y. Chen, L. Tai, K. Sun, and M. Li, “Monopair: Monocular 3d object detection
    using pairwise spatial relationships,” in Proceedings of the IEEE/CVF Conference
    on Computer Vision and Pattern Recognition, 2020, pp. 12093–12102.
    [13] G. Brazil and X. Liu, “M3d-rpn: Monocular 3d region proposal network for object
    detection,” in Proceedings of the IEEE/CVF International Conference on Computer
    Vision, 2019, pp. 9287–9296.
    [14] Y. Zhang, J. Lu, and J. Zhou, “Objects are different: Flexible monocular 3d object
    detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and
    Pattern Recognition, 2021, pp. 3289–3298.
    [15] Y. Zhou, Y. He, H. Zhu, C. Wang, H. Li, and Q. Jiang, “Monoef: Extrinsic
    parameter free monocular 3d object detection,” IEEE Transactions on Pattern
    Analysis and Machine Intelligence, 2021.
    [16] Z. Qin and X. Li, “Monoground: Detecting monocular 3d objects from the ground,”
    in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
    Recognition, 2022, pp. 3793–3802.
    [17] K.-C. Huang, T.-H. Wu, H.-T. Su, and W. H. Hsu, “Monodtr: Monocular 3d object
    detection with depth-aware transformer,” in Proceedings of the IEEE/CVF
    Conference on Computer Vision and Pattern Recognition, 2022, pp. 4012–4021.
    [18] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser,
    and I. Polosukhin, “Attention is all you need,” Advances in neural information
    processing systems, vol. 30, 2017.
    [19] F. Chabot, M. Chaouch, J. Rabarisoa, C. Teuliere, and T. Chateau, “Deep manta:
    A coarse-to-fine many-task network for joint 2d and 3d vehicle analysis from
    monocular image,” in Proceedings of the IEEE conference on computer vision and
    pattern recognition, 2017, pp. 2040–2049.
    [20] X. Ma, Z. Wang, H. Li, P. Zhang, W. Ouyang, and X. Fan, “Accurate monocular
    3d object detection via color-embedded 3d reconstruction for autonomous driving,”
    in Proceedings of the IEEE/CVF International Conference on Computer Vision,
    2019, pp. 6851–6860.
    [21] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “Pointnet: Deep learning on point sets
    for 3d classification and segmentation,” in Proceedings of the IEEE conference on
    computer vision and pattern recognition, 2017, pp. 652–660.
    [22] M. Ding, Y. Huo, H. Yi, Z. Wang, J. Shi, Z. Lu, and P. Luo, “Learning depth-guided
    convolutions for monocular 3d object detection,” in Proceedings of the IEEE/CVF
    Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp.
    1000–1001.
    [23] G. Brazil, G. Pons-Moll, X. Liu, and B. Schiele, “Kinematic 3d object detection
    in monocular video,” in European Conference on Computer Vision. Springer, 2020,
    pp. 135–152.
    [24] L. Wang, L. Du, X. Ye, Y. Fu, G. Guo, X. Xue, J. Feng, and L. Zhang, “Depth-conditioned dynamic message propagation for monocular 3d object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 454–463.
    [25] C. Reading, A. Harakeh, J. Chae, and S. L. Waslander, “Categorical depth
    distribution network for monocular 3d object detection,” in Proceedings of the
    IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp.
    8555–8564.
    [26] H. Sun, Z. Fan, Z. Song, Z. Wang, K. Wu, and J. Lu, “Monopcns: Monocular 3d
    object detection via point cloud network simulation,” arXiv preprint
    arXiv:2208.09446, 2022.
    [27] Z. Chong, X. Ma, H. Zhang, Y. Yue, H. Li, Z. Wang, and W. Ouyang, “Monodistill:
    Learning spatial features for monocular 3d object detection,” arXiv preprint
    arXiv:2201.10830, 2022.
    [28] D. Park, R. Ambrus, V. Guizilini, J. Li, and A. Gaidon, “Is pseudo-lidar needed for monocular 3d object detection?” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 3142–3152.
    [29] F. Yu, D. Wang, E. Shelhamer, and T. Darrell, “Deep layer aggregation,” in
    Proceedings of the IEEE conference on computer vision and pattern recognition,
    2018, pp. 2403–2412.
    [30] X. Zhu, H. Hu, S. Lin, and J. Dai, “Deformable convnets v2: More deformable, better results,” arXiv preprint arXiv:1811.11168, 2018.
    [31] D. Misra, “Mish: A self regularized non-monotonic activation function,” arXiv
    preprint arXiv:1908.08681, 2019.
    [32] X. Li, W. Sun, and T. Wu, “Attentive normalization,” in European Conference on
    Computer Vision. Springer, 2020, pp. 70–87.
    [33] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of
    the IEEE conference on computer vision and pattern recognition, 2018, pp. 7132–
    7141.
    [34] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training
    by reducing internal covariate shift,” in International conference on machine
    learning. PMLR, 2015, pp. 448–456.
    [35] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2980–2988.
    [36] H. Law and J. Deng, “Cornernet: Detecting objects as paired keypoints,” in European Conference on Computer Vision. Springer, 2018, pp. 765–781.
    [37] X. Ma, Y. Zhang, D. Xu, D. Zhou, S. Yi, H. Li, and W. Ouyang, “Delving into localization errors for monocular 3d object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 4721–4730.
    [38] H. Rezatofighi, N. Tsoi, J. Gwak, A. Sadeghian, I. Reid, and S. Savarese,
    “Generalized intersection over union: A metric and a loss for bounding box
    regression,” in Proceedings of the IEEE/CVF conference on computer vision and
    pattern recognition, 2019, pp. 658–666.
    [39] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv
    preprint arXiv:1711.05101, 2017.
    [40] X. Ding, X. Zhang, N. Ma, J. Han, G. Ding, and J. Sun, “Repvgg: Making vgg-style convnets great again,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 13733–13742.

    Full text available from 2025/07/27 (campus network)
    Full text available from 2025/07/27 (off-campus network)
    Full text available from 2025/07/27 (National Central Library: Taiwan NDLTD system)