
Graduate Student: Yuan-Sheng Hsiao (蕭元昇)
Thesis Title: Deep Learning Techniques for RGB-D Visual Recognition System (基於深度學習之 RGB-D 視覺辨識系統)
Advisor: Kai-Lung Hua (花凱龍)
Committee Members: Yi-Leh Wu (吳怡樂), Chuan-Kai Yang (楊傳凱), Mei-Chen Yeh (葉梅珍), Wen-Huang Cheng (鄭文皇)
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Computer Science and Information Engineering
Publication Year: 2016
Graduation Academic Year: 104
Language: Chinese
Pages: 50
Keywords: Image Recognition, Multimodal, Deep Learning, Depth Image

In the field of multimedia information analysis, multimodal fusion has been widely studied and discussed. In recent years, however, the emergence of diverse devices and sensors, together with falling storage costs, has made data collection increasingly easy, and ever larger amounts of data are being gathered. Although this wealth of data makes information mining easier, the analysis of multimedia big data has not yet been investigated in depth. On the other hand, advanced machine learning algorithms such as deep learning have been shown, across a variety of applications, to be key techniques for improving recognition performance and achieving high recognition rates. The data fusion question should therefore be revisited in light of these emerging algorithms. The questions include: What is the most effective way to combine data of different modalities? Is a classifier's performance affected by the fusion method used? To answer these questions, this thesis presents a series of studies and experiments evaluating what early fusion and late fusion architectures can achieve with SVM and various deep learning classifiers. We chose two challenging RGB-D image recognition datasets as experimental subjects: a generic object recognition dataset and a hand gesture recognition dataset that we collected ourselves. The results of these experiments provide useful strategies and practical guidance for developing image recognition systems based on deep learning algorithms.


Data fusion from different modalities has been extensively studied for a better understanding of multimedia contents. On one hand, the emergence of new devices and decreasing storage costs have led to growing amounts of data being collected. Though bigger data makes it easier to mine information, methods for big data analytics are not yet well investigated. On the other hand, new machine learning techniques, such as deep learning, have been shown to be among the key elements in achieving state-of-the-art inference performance in a variety of applications. Therefore, some of the old questions in data fusion need to be revisited in light of these new developments. These questions are: What is the most effective way to combine data from different modalities? Does the fusion method affect the performance of different classifiers? To answer these questions, we present a comparative study evaluating early and late fusion schemes with several types of SVM and deep learning classifiers on two challenging RGB-D based visual recognition tasks: hand gesture recognition and generic object recognition. The findings from this study provide useful strategies and practical guidance for the development of visual recognition systems.
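The early-versus-late-fusion distinction studied in the abstract can be illustrated with a minimal sketch. The feature and score vectors below are hypothetical placeholders, not the thesis's actual RGB-D features or classifiers; the point is only the structural difference between the two schemes:

```python
import numpy as np

# Hypothetical per-modality feature vectors for one RGB-D sample.
rgb_feat = np.array([0.2, 0.8, 0.5])   # stand-in for RGB-image features
depth_feat = np.array([0.6, 0.1])      # stand-in for depth-map features

# Early fusion: concatenate the modality features into one joint
# vector, then train a single classifier on that vector.
early_input = np.concatenate([rgb_feat, depth_feat])

# Late fusion: train one classifier per modality and combine their
# per-class score vectors afterwards, e.g. by averaging.
rgb_scores = np.array([0.7, 0.2, 0.1])    # hypothetical class posteriors
depth_scores = np.array([0.5, 0.4, 0.1])
late_scores = (rgb_scores + depth_scores) / 2

prediction = int(np.argmax(late_scores))  # class with the highest fused score
```

In early fusion a single model must learn cross-modal interactions from the joint vector, while in late fusion each modality gets its own classifier and only the decisions (or scores) are merged, which is the trade-off the thesis's experiments compare across SVM and deep learning classifiers.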

Table of Contents
Recommendation Letter
Oral Defense Committee Approval
Chinese Abstract
English Abstract
Acknowledgments
Table of Contents
List of Tables
List of Figures
1 Introduction
2 Related Work
3 Fusion Methods
4 Classifiers
  4.1 Kernel Method
    4.1.1 SVM
    4.1.2 Data Input
  4.2 Deep Learning
    4.2.1 CNN, F-RCNN
    4.2.2 SAE
    4.2.3 RBM, DBN
    4.2.4 Data Input
5 Recognition Tasks
  5.1 RGB-D Object Dataset
  5.2 LaRED Hand Gesture Dataset
6 Experiments and Discussion
  6.1 Experimental Setup
  6.2 Results and Discussion
7 Conclusion
References

