
Graduate student: 陳俊翰 (Jiun-Han Chen)
Thesis title: Monocular 3D Object Detection utilizing Auxiliary Learning with Deformable Convolutions and Gradient-Flow-Enhanced Nonlinear Function
(Chinese title: 利用可變形卷積與梯度流增強非線性函數之基於輔助學習的單目鏡頭三維物件偵測)
Advisors: 阮聖彰 (Shanq-Jang Ruan), 林昌鴻 (Chang-Hong Lin)
Oral defense committee: 阮聖彰 (Shanq-Jang Ruan), 林昌鴻 (Chang-Hong Lin), 呂政修 (Jenq-Shiou Leu), 彭文志 (Wen-Chih Peng)
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Electronic and Computer Engineering
Year of publication: 2023
Graduation academic year: 111 (ROC calendar, i.e., 2022–2023)
Language: English
Number of pages: 68
Keywords (Chinese, translated): 3D object detection, monocular camera, driving scene understanding, auxiliary learning, deep learning
Keywords (English): deep learning, monocular 3D object detection, driving scene understanding, auxiliary learning, autonomous driving
  • In autonomous driving systems, 3D object detection algorithms play a critical role, and the safety of a self-driving car depends on a well-designed detection system. Developing robust and efficient 3D object detection algorithms is therefore an important goal for many researchers, institutions, and companies. Compared with stereo-based and LiDAR-based detection methods, monocular 3D object detection infers complex 3D attributes from 2D information alone, which lowers its computational cost and gives it great potential; however, the lack of 3D depth information limits the performance of monocular methods. We therefore propose an effective, end-to-end monocular 3D object detection model that requires no external training data. Inspired by auxiliary learning, we use a robust backbone for feature extraction together with multiple auxiliary heads that learn auxiliary knowledge; these auxiliary heads are removed after training to improve inference efficiency, allowing the model to exploit auxiliary learning and learn key features more effectively. On the moderate level of the Car category, the proposed method reaches 17.28% and 20.10% on the KITTI test set and validation set, respectively, surpassing previous monocular 3D object detection methods.


    In autonomous driving systems, 3D object detection algorithms are an essential component, and the safety of self-driving cars relies on well-designed detection systems. Developing robust and efficient 3D object detection algorithms is therefore an important goal for researchers, institutions, and companies. Compared with stereo-based and LiDAR-based detection methods, monocular camera-based 3D object detection uses only 2D information to infer complex 3D attributes, which reduces its computational cost. However, the lack of depth information limits the performance of monocular camera-based methods. We therefore propose an effective, end-to-end monocular 3D object detection model that does not require external training data. Inspired by auxiliary learning, we use a robust feature extractor and multiple auxiliary heads that learn auxiliary knowledge. These auxiliary heads are removed after training to improve inference efficiency, enabling the model to learn key features effectively through auxiliary supervision. The proposed method achieves 17.28% and 20.10% on the moderate level of the Car category on the KITTI test set and validation set, respectively, surpassing previous monocular 3D object detection methods.
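    The core idea in the abstract, attaching auxiliary regression heads during training and discarding them before deployment, can be sketched in a few lines of PyTorch. This is a minimal illustration only: the names (AuxiliaryLearningDetector, strip_auxiliary_heads, the backbone and head modules) are hypothetical placeholders rather than the thesis's actual implementation; only the pattern of training-time-only heads follows the description above.

    import torch
    import torch.nn as nn

    class AuxiliaryLearningDetector(nn.Module):
        """Sketch: a shared backbone feeds one main regression head plus
        several auxiliary heads; the auxiliary heads exist only to shape
        the backbone's features during training."""

        def __init__(self, backbone: nn.Module, main_head: nn.Module,
                     aux_heads: nn.ModuleDict):
            super().__init__()
            self.backbone = backbone      # e.g. a DLA-34-style feature extractor
            self.main_head = main_head    # main 3D detection outputs
            self.aux_heads = aux_heads    # sub-task heads used only in training

        def forward(self, images: torch.Tensor):
            feats = self.backbone(images)
            outputs = {"main": self.main_head(feats)}
            if self.training:             # auxiliary outputs only while training
                for name, head in self.aux_heads.items():
                    outputs[name] = head(feats)
            return outputs

        def strip_auxiliary_heads(self):
            # After training, drop the auxiliary heads so the deployed model
            # pays no extra inference cost for having been trained with them.
            self.aux_heads = nn.ModuleDict()
            return self

    During training, the losses computed from the auxiliary outputs would be weighted and summed with the main detection loss; calling strip_auxiliary_heads() (or simply running the model in eval mode) leaves the inference path identical to that of a detector trained without auxiliary supervision.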

    Table of Contents
    摘要 (Chinese Abstract)
    Abstract
    Acknowledgments
    Table of Contents
    List of Figures
    List of Tables
    Chapter 1  Introduction
    Chapter 2  Related Works
      2.1 Auxiliary Learning
      2.2 Monocular 3D Object Detection
        2.2.1 Monocular 3D Object Detection without Additional Information
        2.2.2 Monocular 3D Object Detection with Additional Information
    Chapter 3  Monocular 3D Object Detection Utilizing Auxiliary Learning
      3.1 Framework Overview
      3.2 Backbone
      3.3 Regression Heads
      3.4 Auxiliary Regression Heads
      3.5 Loss Function
      Results
      3.6 3D Object Detection Dataset
      3.7 Setup
      3.8 Comparison with Prior Methods
      3.9 Ablation Study
    Chapter 4
      Limitation
      Discussion
      Conclusions
    References

    List of Figures
    Figure 1-1. Different types of sensors.
    Figure 1-2. An example of a monocular 3D object detection approach.
    Figure 1-3. The KITTI dataset is extensively employed as a benchmark dataset to assess computer vision algorithms pertaining to autonomous driving. It comprises a vast collection of real-world road scenes captured by a vehicle (positioned on the left) utilizing diverse sensors, including a high-resolution camera, a LiDAR sensor, and GPS/IMU inertial measurement units. This dataset offers a wealth of data, including color images (positioned at the bottom-right) and point-cloud data (positioned at the top-right), which serve as valuable resources for researchers and institutions working on the advancement of autonomous driving systems.
    Figure 2-1. An example of auxiliary learning. When input data is processed by the backbone, the main task can be solved by the regression head (located at the top-right). Meanwhile, auxiliary regression heads (located at the bottom-right) can help the learning process by solving sub-tasks.
    Figure 2-2. There are two categories of training for monocular 3D object detection models. Monocular 3D object detection with additional information (located at the bottom) usually requires external data such as depth maps, CAD models, LiDAR, and stereo image pairs to be involved in the training process. In contrast, monocular 3D object detection (located at the top) only requires images to complete the training.
    Figure 2-3. CenterNet [10] models objects as points.
    Figure 2-4. Illustration of MonoDTR [17]. Initially, the input image undergoes feature extraction using the backbone. The Depth-Aware Feature Enhancement (DFE) module enhances the features by incorporating depth information under auxiliary supervision. Simultaneously, convolution layers extract context-aware features in parallel. Subsequently, the Depth-Aware Transformer (DTR) module combines the two types of features, and the Depth Positional Encoding (DPE) module adds depth positional hints to the transformer. Ultimately, the detection head predicts the 3D bounding boxes.
    Figure 2-5. DeepMANTA [19] utilizes CAD models to aid in its operations. DeepMANTA matches the features to a CAD model from its database (located at the top). This matching process produces a 3D car template with vertices (located at the bottom), which helps facilitate the output of bounding boxes (located in the middle).
    Figure 2-6. Illustration of D4LCN [22]. Initially, the depth map is estimated from the RGB image and utilized alongside the RGB image as input for a two-branch network. Subsequently, the depth-guided filtering module is employed to fuse the two information sources within each residual block.
    Figure 2-7. Illustration of MonoDistill [27]. The process begins by generating "image-like" LiDAR maps from the LiDAR signals. These maps serve as the input for both the teacher model and the student model, which employ identical network architectures. Subsequently, the researchers introduce three distillation schemes to train the student model. These schemes leverage the knowledge and guidance from the well-trained teacher network during the training process.
    Figure 3-1. Overall architecture of the proposed model. The proposed architecture is composed of a robust DLA-34 with the ReLU activation function replaced by the Mish activation function. Also, multiple regression heads are adopted for learning different information.
    Figure 3-2. Comparison of the Mish and ReLU activation functions.
    Figure 3-3. Each regression head is composed of two convolutional layers. The first layer is a 3×3 convolutional layer that utilizes a Mish activation function and attentive normalization. The second layer is a 1×1 convolutional layer. The number of output channels for each layer is determined by the specific task of the regression head. (A minimal code sketch of this head follows the list of tables below.)
    Figure 3-4. Qualitative results produced by the proposed 3D object detection model.
    Figure 3-5. Qualitative results produced by the proposed 3D object detection model.
    Figure 3-6. Qualitative results produced by the proposed 3D object detection model.
    Figure 3-7. Qualitative results produced by the proposed 3D object detection model.
    Figure 3-8. Qualitative results produced by the proposed 3D object detection model.
    Figure 3-9. Qualitative results produced by the proposed 3D object detection model.

    List of Tables
    Table 3-1. Hyperparameters adopted for training the proposed model.
    Table 3-2. Results on the KITTI test set for the Car category.
    Table 3-3. Results on the KITTI test split for the Pedestrian and Cyclist categories.
    Table 3-4. Results of ablation studies on different components in the proposed method.
    Table 3-5. Results of ablation studies on each auxiliary regression head.
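    As a concrete reading of the regression-head layout described for Figure 3-3, the sketch below stacks a 3×3 convolution with normalization and a Mish activation, followed by a task-specific 1×1 convolution. It is a hedged approximation: attentive normalization [32] is replaced by plain BatchNorm2d to keep the example self-contained, and the channel widths and feature-map sizes are illustrative assumptions rather than values taken from the thesis. Mish [31] itself is x · tanh(softplus(x)), a smooth, non-monotonic function that appears to be the "gradient-flow-enhanced nonlinear function" referred to in the title.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def mish(x: torch.Tensor) -> torch.Tensor:
        # Mish [31]: x * tanh(softplus(x)); unlike ReLU it keeps a small
        # gradient for negative inputs, which helps gradient flow.
        return x * torch.tanh(F.softplus(x))

    class RegressionHead(nn.Module):
        """Two-layer head following Figure 3-3: a 3x3 convolution with
        normalization and Mish, then a 1x1 convolution whose output width
        matches the task (e.g. 3 channels for 3D object dimensions)."""

        def __init__(self, in_channels: int, out_channels: int,
                     hidden_channels: int = 256):
            super().__init__()
            self.conv1 = nn.Conv2d(in_channels, hidden_channels,
                                   kernel_size=3, padding=1, bias=False)
            # Stand-in for attentive normalization [32] in this sketch.
            self.norm = nn.BatchNorm2d(hidden_channels)
            self.conv2 = nn.Conv2d(hidden_channels, out_channels, kernel_size=1)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.conv2(mish(self.norm(self.conv1(x))))

    # Illustrative usage: regress 3D dimensions from 64-channel features.
    head = RegressionHead(in_channels=64, out_channels=3)
    out = head(torch.randn(1, 64, 96, 320))   # output shape: (1, 3, 96, 320)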

    References
    [1] L. Zhang, M. Yu, T. Chen, Z. Shi, C. Bao, and K. Ma, “Auxiliary training: Towards
    accurate and robust models,” in Proceedings of the IEEE/CVF conference on
    computer vision and pattern recognition, 2020, pp. 372–381.
    [2] L. Liebel and M. Körner, “Auxiliary tasks in multi-task learning,” arXiv preprint arXiv:1805.06334, 2018.
    [3] S. Liu, A. Davison, and E. Johns, “Self-supervised generalization with meta
    auxiliary learning,” Advances in Neural Information Processing Systems, vol. 32,
    2019.
    [4] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the kitti
    vision benchmark suite,” in 2012 IEEE conference on computer vision and pattern
    recognition. IEEE, 2012, pp. 3354–3361.
    [5] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed
    representations of words and phrases and their compositionality,” Advances in
    neural information processing systems, vol. 26, 2013.
    [6] Y. Tang, J. Pino, X. Li, C. Wang, and D. Genzel, “Improving speech translation by
    understanding and learning from the auxiliary text translation task,” arXiv preprint
    arXiv:2107.05782, 2021.
    [7] Y. Burda, H. Edwards, D. Pathak, A. Storkey, T. Darrell, and A. A. Efros, “Large-scale study of curiosity-driven learning,” arXiv preprint arXiv:1808.04355, 2018.
    [8] A. Mousavian, D. Anguelov, J. Flynn, and J. Kosecka, “3d bounding box estimation
    using deep learning and geometry,” in Proceedings of the IEEE conference on
    Computer Vision and Pattern Recognition, 2017, pp. 7074–7082.
    [9] Z. Liu, Z. Wu, and R. Tóth, “Smoke: Single-stage monocular 3d object detection via keypoint estimation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 996–997.
    [10] K. Duan, S. Bai, L. Xie, H. Qi, Q. Huang, and Q. Tian, “Centernet: Keypoint
    triplets for object detection,” in Proceedings of the IEEE/CVF international
    conference on computer vision, 2019, pp. 6569–6578.
    [11] A. Simonelli, S. R. Bulò, L. Porzi, M. López-Antequera, and P. Kontschieder, “Disentangling monocular 3d object detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 1991–1999.
    [12] Y. Chen, L. Tai, K. Sun, and M. Li, “Monopair: Monocular 3d object detection
    using pairwise spatial relationships,” in Proceedings of the IEEE/CVF Conference
    on Computer Vision and Pattern Recognition, 2020, pp. 12093–12102.
    [13] G. Brazil and X. Liu, “M3d-rpn: Monocular 3d region proposal network for object
    detection,” in Proceedings of the IEEE/CVF International Conference on Computer
    Vision, 2019, pp. 9287–9296.
    [14] Y. Zhang, J. Lu, and J. Zhou, “Objects are different: Flexible monocular 3d object
    detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and
    Pattern Recognition, 2021, pp. 3289–3298.
    [15] Y. Zhou, Y. He, H. Zhu, C. Wang, H. Li, and Q. Jiang, “Monoef: Extrinsic
    parameter free monocular 3d object detection,” IEEE Transactions on Pattern
    Analysis and Machine Intelligence, 2021.
    [16] Z. Qin and X. Li, “Monoground: Detecting monocular 3d objects from the ground,”
    in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
    Recognition, 2022, pp. 3793–3802.
    [17] K.-C. Huang, T.-H. Wu, H.-T. Su, and W. H. Hsu, “Monodtr: Monocular 3d object
    detection with depth-aware transformer,” in Proceedings of the IEEE/CVF
    Conference on Computer Vision and Pattern Recognition, 2022, pp. 4012–4021.
    [18] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser,
    and I. Polosukhin, “Attention is all you need,” Advances in neural information
    processing systems, vol. 30, 2017.
    [19] F. Chabot, M. Chaouch, J. Rabarisoa, C. Teuliere, and T. Chateau, “Deep manta:
    A coarse-to-fine many-task network for joint 2d and 3d vehicle analysis from
    monocular image,” in Proceedings of the IEEE conference on computer vision and
    pattern recognition, 2017, pp. 2040–2049.
    [20] X. Ma, Z. Wang, H. Li, P. Zhang, W. Ouyang, and X. Fan, “Accurate monocular
    3d object detection via color-embedded 3d reconstruction for autonomous driving,”
    in Proceedings of the IEEE/CVF International Conference on Computer Vision,
    2019, pp. 6851–6860.
    [21] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “Pointnet: Deep learning on point sets
    for 3d classification and segmentation,” in Proceedings of the IEEE conference on
    computer vision and pattern recognition, 2017, pp. 652–660.
    [22] M. Ding, Y. Huo, H. Yi, Z. Wang, J. Shi, Z. Lu, and P. Luo, “Learning depth-guided
    convolutions for monocular 3d object detection,” in Proceedings of the IEEE/CVF
    Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp.
    1000–1001.
    [23] G. Brazil, G. Pons-Moll, X. Liu, and B. Schiele, “Kinematic 3d object detection
    in monocular video,” in European Conference on Computer Vision. Springer, 2020,
    pp. 135–152.
    [24] L. Wang, L. Du, X. Ye, Y. Fu, G. Guo, X. Xue, J. Feng, and L. Zhang, “Depth-conditioned dynamic message propagation for monocular 3d object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 454–463.
    [25] C. Reading, A. Harakeh, J. Chae, and S. L. Waslander, “Categorical depth
    distribution network for monocular 3d object detection,” in Proceedings of the
    IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp.
    8555–8564.
    [26] H. Sun, Z. Fan, Z. Song, Z. Wang, K. Wu, and J. Lu, “Monopcns: Monocular 3d
    object detection via point cloud network simulation,” arXiv preprint
    arXiv:2208.09446, 2022.
    [27] Z. Chong, X. Ma, H. Zhang, Y. Yue, H. Li, Z. Wang, and W. Ouyang, “Monodistill:
    Learning spatial features for monocular 3d object detection,” arXiv preprint
    arXiv:2201.10830, 2022.
    [28] D. Park, R. Ambrus, V. Guizilini, J. Li, and A. Gaidon, “Is pseudo-lidar needed for monocular 3d object detection?” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 3142–3152.
    [29] F. Yu, D. Wang, E. Shelhamer, and T. Darrell, “Deep layer aggregation,” in
    Proceedings of the IEEE conference on computer vision and pattern recognition,
    2018, pp. 2403–2412.
    [30] X. Zhu, H. Hu, S. Lin, and J. Dai, “Deformable convnets v2: More deformable, better results,” arXiv preprint arXiv:1811.11168, 2018.
    [31] D. Misra, “Mish: A self regularized non-monotonic activation function,” arXiv
    preprint arXiv:1908.08681, 2019.
    [32] X. Li, W. Sun, and T. Wu, “Attentive normalization,” in European Conference on
    Computer Vision. Springer, 2020, pp. 70–87.
    [33] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of
    the IEEE conference on computer vision and pattern recognition, 2018, pp. 7132–
    7141.
    [34] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training
    by reducing internal covariate shift,” in International conference on machine
    learning. PMLR, 2015, pp. 448–456.
    [35] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2980–2988.
    [36] H. Law and J. Deng, “Cornernet: Detecting objects as paired keypoints,” in European Conference on Computer Vision. Springer, 2018, pp. 765–781.
    [37] X. Ma, Y. Zhang, D. Xu, D. Zhou, S. Yi, H. Li, and W. Ouyang, “Delving into localization errors for monocular 3d object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 4721–4730.
    [38] H. Rezatofighi, N. Tsoi, J. Gwak, A. Sadeghian, I. Reid, and S. Savarese,
    “Generalized intersection over union: A metric and a loss for bounding box
    regression,” in Proceedings of the IEEE/CVF conference on computer vision and
    pattern recognition, 2019, pp. 658–666.
    [39] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv
    preprint arXiv:1711.05101, 2017.
    [40] X. Ding, X. Zhang, N. Ma, J. Han, G. Ding, and J. Sun, “Repvgg: Making vgg-style convnets great again,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 13733–13742.

    Full text available from 2025/07/27 (campus network)
    Full text available from 2025/07/27 (off-campus network)
    Full text available from 2025/07/27 (National Central Library: Taiwan NDLTD system)