Author: Yuan-Bang Cheng (鄭元棓)
Thesis Title: Automatic Generation of Video Navigation from Google Street View Database with Object Detection, Image Inpainting and Stereoscopic Virtual Reality Display (自動產生Google街景導覽影片並提供物件偵測、影像修補與3D虛擬實境顯示)
Advisor: Chuan-Kai Yang (楊傳凱), Teng-Wen Chang
Committee: Chao-Ming Wang (王照明), Pei-Li Sun, Kai-Lung Hua, Chuan-Kai Yang, Teng-Wen Chang
Degree: Doctoral
Department: School of Management - Department of Information Management
Thesis Publication Year: 2019
Graduation Academic Year: 107
Language: English
Pages: 167
Keywords: Google Street View, Object Detection, Image Inpainting, HOG and Exemplar-SVMs, Haar and Adaboost, GPU, Caffe and Faster R-CNN, Depth Map Prediction, Depth-Image-Based Rendering (DIBR), Stereoscopic Virtual Reality 360 (3DVR360), Unity and HTC Vive
    Over the past decade, artificial intelligence and deep learning have been studied extensively in computer science. Meanwhile, Google Street View is widely used to look up street-level views of a destination. However, little work automatically transforms Google Street View images directly into a navigation video with object detection and image inpainting, and little work makes the generated navigation video viewable on an HTC Vive as a stereoscopic 360-degree virtual reality (3DVR360) display.
    This study tries to combine two of the currently most popular research areas in computer science: deep learning (or artificial intelligence) and virtual reality. In total, three versions of my system were developed for navigation video generation. The first, GSVPlayer-HH&I (Google Street View Player with HOG+Haar and Inpainting), mainly adopts CPU-based methods for object detection and image inpainting. The second, GSVPlayer-FRRCNN&I (Google Street View Player with Faster R-CNN and Inpainting), builds on GSVPlayer-HH&I but uses a GPU-based method (Faster R-CNN) for object detection. The third, GSVPlayer-3DVR360 (Google Street View Player with Stereoscopic Virtual Reality 360 Display), implements a pipeline of image processing, monocular depth map estimation, depth-image-based rendering (DIBR), and 3DVR360 display. For this version, the results show that, although the system requires a longer computation time, all users were still satisfied with GSVPlayer-3DVR360.
    In my dissertation, quantitative and qualitative results and evaluations are presented for each of the three versions of my system, and the discussion and limitations of each are explained explicitly. In conclusion, the proposed system is a complete, integrated framework.
    Future work could explore several directions, including the use of multiple computing servers, a new CNN for monocular depth estimation that uses the temporal sequence, synthesizing novel frames, the YOLO object detection method, and object detection and image inpainting on high-resolution images.
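
    The depth-image-based rendering (DIBR) step named in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names, the grayscale list-of-rows image representation, the depth convention (0 = far, 255 = near), and the `max_disparity` parameter are all assumptions made for illustration. The sketch shifts each pixel horizontally by a disparity proportional to its nearness, lets the nearer pixel win when two pixels land on the same target, and fills the resulting holes from a neighbouring pixel in the same row (a simple stand-in for proper inpainting).

```python
from typing import List, Optional

Row = List[Optional[int]]


def render_shifted_view(image: List[List[int]],
                        depth: List[List[int]],
                        max_disparity: int = 2) -> List[Row]:
    """Warp a grayscale image into a second-eye view.

    Assumed convention: depth 0 is far (no shift), 255 is near
    (shift by max_disparity pixels to the left).
    Pixels that receive no source value are left as None (holes).
    """
    height, width = len(image), len(image[0])
    out: List[Row] = [[None] * width for _ in range(height)]
    # Nearness of the pixel currently stored at each target position.
    zbuf = [[-1] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            d = depth[y][x]
            # Integer rounding of max_disparity * d / 255.
            shift = (max_disparity * d + 127) // 255
            tx = x - shift
            # Keep the nearer pixel when two sources map to one target.
            if 0 <= tx < width and d > zbuf[y][tx]:
                out[y][tx] = image[y][x]
                zbuf[y][tx] = d
    return out


def fill_holes(view: List[Row]) -> List[List[int]]:
    """Fill holes from the nearest valid pixel to the left, then to the right."""
    filled = []
    for row in view:
        new_row = list(row)
        last = None
        for x, value in enumerate(new_row):  # forward pass: copy from the left
            if value is None:
                new_row[x] = last
            else:
                last = value
        nxt = None
        for x in range(len(new_row) - 1, -1, -1):  # backward pass: leading holes
            if new_row[x] is None:
                new_row[x] = nxt if nxt is not None else 0
            else:
                nxt = new_row[x]
        filled.append(new_row)
    return filled


if __name__ == "__main__":
    image = [[10, 20, 30, 40, 50]]
    depth = [[0, 0, 255, 0, 0]]  # only the middle pixel is near
    warped = render_shifted_view(image, depth, max_disparity=2)
    print(warped)
    print(fill_holes(warped))
```

    In this toy input the near pixel moves two positions to the left and overwrites a far pixel, leaving a hole at its old position. In a stereoscopic setup one such warped view per eye would be generated, typically with opposite shift directions.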

    Abstract (in Chinese)
    Abstract
    Acknowledgements
    Table of Contents
    List of Figures
    List of Tables
    Chapter 1 Introduction
        1.1 Motivation
        1.2 Purposes
        1.3 Contribution
        1.4 Scope
        1.5 Organization
    Chapter 2 Related Works
        2.1 Applications of Google Earth and Google Street View
        2.2 Object Detection
            2.2.1. CPU-Based Machine Learning
            2.2.2. GPU-Based Deep Learning
        2.3 Foreground Extraction
        2.4 Image Inpainting
        2.5 Depth Map Prediction
        2.6 Depth-Image-Based Rendering
        2.7 Stereoscopic VR360
    Chapter 3 Google Street View Player with HOG+Haar and Inpainting
        3.1 System Architecture
        3.2 System Flow
        3.3 Implementation
            3.3.1. Preprocessing and First-Staged Inpainting
            3.3.2. Transformation Matrices between Two Consecutive Images
            3.3.3. Object Detection using HOG+Haar and Segmentation
            3.3.4. Road Structure Propagation and Second-Staged Inpainting
            3.3.5. Generation of the Inpainted Continuous Navigation Animation
    Chapter 4 Google Street View Player with Faster R-CNN and Inpainting
        4.1 System Architecture
        4.2 System Flow
        4.3 Implementation
            4.3.1. Preprocessing and First-Staged Inpainting
            4.3.2. Transformation Matrices between Two Consecutive Images
            4.3.3. Object Detection using Faster R-CNN and Segmentation
            4.3.4. Road Structure Propagation and Second-Staged Inpainting
            4.3.5. Generation of the Inpainted Continuous Navigation Animation
    Chapter 5 Google Street View Player with Stereoscopic Virtual Reality 360 Display
        5.1 System Architecture
        5.2 System Flow
        5.3 Implementation
            5.3.1. Image Fetching and Downloading
            5.3.2. Image Stitching
            5.3.3. Monocular Depth Map Estimation
            5.3.4. Depth-Image-Based Rendering
            5.3.5. Compressing and Uploading
            5.3.6. Unity and 3D VR 360 Display
    Chapter 6 Results and Evaluations
        6.1 GSVPlayer-HH&I
            6.1.1. System Setup
            6.1.2. Result and Evaluation
            6.1.3. Discussion and Limitation
        6.2 GSVPlayer-FRRCNN&I
            6.2.1. System Setup
            6.2.2. Result and Evaluation
            6.2.3. Discussion and Limitation
        6.3 GSVPlayer-3DVR360
            6.3.1. System Setup
            6.3.2. Result and Evaluation
            6.3.3. Discussion and Limitation
    Chapter 7 Conclusion and Future Works
        7.1 Conclusion
        7.2 Future Works
            7.2.1. The Use of Multiple Computing Servers
            7.2.2. A New CNN of Monocular Depth Estimation with the Temporal Sequence
            7.2.3. Synthesizing Novel Frames
            7.2.4. The YOLO Object Detection Method
            7.2.5. Object Detection and Image Inpainting on High-Resolution Images
            7.2.6. Objective Method to Evaluate the “Smoother” Issue
            7.2.7. More Improvements of System and Function
    References
    Appendix I
    Appendix II
    Appendix III

    Full text (Intranet): public from 2024/01/30
    Full text (Internet): not authorized to be published
    Full text (National library): not authorized to be published