Graduate Student: Jiun-Han Chen (陳俊翰)
Thesis Title: Monocular 3D Object Detection Utilizing Auxiliary Learning with Deformable Convolutions and Gradient-Flow-Enhanced Nonlinear Function (利用可變形卷積與梯度流增強非線性函數之基於輔助學習的單目鏡頭三維物件偵測)
Advisors: Shanq-Jang Ruan (阮聖彰), Chang-Hong Lin (林昌鴻)
Committee Members: Shanq-Jang Ruan (阮聖彰), Chang-Hong Lin (林昌鴻), Jenq-Shiou Leu (呂政修), Wen-Chih Peng (彭文志)
Degree: Master
Department: Department of Electronic and Computer Engineering, College of Electrical Engineering and Computer Science
Publication Year: 2023
Graduation Academic Year: 111 (2022–2023)
Language: English
Number of Pages: 68
Chinese Keywords (translated): 3D object detection, monocular camera, driving scene understanding, auxiliary learning, deep learning
English Keywords: deep learning, monocular 3D object detection, driving scene understanding, auxiliary learning, autonomous driving
Abstract: In autonomous driving systems, 3D object detection algorithms play an essential role, and the safety of self-driving cars relies on well-designed detection systems. Developing robust and efficient 3D object detection algorithms is therefore an important goal for researchers, institutions, and companies. Compared with stereo-based and LiDAR-based methods, monocular 3D object detection infers complex 3D attributes from 2D information alone, which reduces computational cost and gives it great potential; however, the lack of depth information limits the performance of monocular methods. We therefore propose an effective, end-to-end monocular 3D object detection model that requires no external training data. Inspired by auxiliary learning, we use a robust backbone for feature extraction and attach multiple auxiliary heads that learn auxiliary knowledge during training. These auxiliary heads are removed after training to improve inference efficiency, allowing the model to learn key features more effectively through auxiliary learning. The proposed method achieves 17.28% and 20.10% average precision at the moderate difficulty level of the Car category on the KITTI test and validation sets, respectively, surpassing previous monocular 3D object detection methods.
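To make the training-only auxiliary-head idea concrete, the following is a minimal PyTorch sketch, not the thesis code: the backbone layers, the head names (`depth`, `box2d`), and the placeholder losses are illustrative assumptions. A shared feature extractor feeds a primary detection head plus auxiliary heads whose losses are added during training; the auxiliary heads are then deleted so inference pays no extra cost.

```python
# Illustrative sketch of training-only auxiliary heads (not the thesis code).
import torch
import torch.nn as nn

class AuxLearningDetector(nn.Module):
    def __init__(self, feat_dim=64, num_classes=3):
        super().__init__()
        # Shared feature extractor (stand-in for the robust backbone).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1), nn.ReLU(),
        )
        # Primary head: kept at inference (here, a class heatmap head).
        self.main_head = nn.Conv2d(feat_dim, num_classes, 1)
        # Auxiliary heads: hypothetical tasks, used only during training.
        self.aux_heads = nn.ModuleDict({
            "depth": nn.Conv2d(feat_dim, 1, 1),   # dense depth hints
            "box2d": nn.Conv2d(feat_dim, 4, 1),   # 2D box regression
        })

    def forward(self, x):
        feats = self.backbone(x)
        out = {"main": self.main_head(feats)}
        # Auxiliary outputs are computed only in training mode.
        if self.training and self.aux_heads is not None:
            for name, head in self.aux_heads.items():
                out[name] = head(feats)
        return out

    def strip_aux_heads(self):
        # Delete auxiliary heads after training; inference cost is unchanged.
        self.aux_heads = None

model = AuxLearningDetector()
x = torch.randn(2, 3, 96, 320)

# Training step: total loss = main loss + weighted auxiliary losses.
model.train()
out = model(x)
loss = out["main"].abs().mean()  # placeholder for the detection loss
loss = loss + 0.5 * out["depth"].abs().mean() + 0.5 * out["box2d"].abs().mean()
loss.backward()

# Deployment: auxiliary heads removed; forward returns only the main output.
model.strip_aux_heads()
model.eval()
with torch.no_grad():
    print(model(x).keys())  # dict_keys(['main'])
```

The design point this sketch captures is that the auxiliary tasks shape the shared features through their gradients during training only; once the heads are stripped, the deployed model is exactly as cheap as one trained without them.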