
Author: He-Yen Hsieh (謝禾彥)
Thesis Title: Implementing a Real-Time Image Captioning System for Scene Identification Using an Embedded System (基於嵌入式系統實現即時影像情境辨識系統)
Advisor: Jenq-Shiou Leu (呂政修)
Committee Members: Kam-Fung Yuen (袁錦鋒), Chang-Hong Lin (林昌鴻), Ching-Shun Lin (林敬舜), Hsing-Lung Chen (陳省隆)
Degree: Master
Department: Department of Electronic and Computer Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2018
Graduation Academic Year: 106
Language: Chinese
Number of Pages: 59 (full PDF)
Keywords: home care, image captioning, image processing, convolutional neural network, natural language processing, long short-term memory network, attention mechanism, embedded system
  • In recent years, home care has received more and more attention, and more people are thinking about
    how today's technology can help with it.
    With the rapid development of wireless communication technology and the Internet of Things,
    and the fact that almost everyone now carries a mobile device,
    it has become increasingly common to use a webcam to remotely watch what is happening at home.
    However, transmitting the captured images to the user's device means extra time is needed to understand
    what the images show, and too many images take up too much of the device's storage.
    We therefore rely on a model that can identify the scene in an image and extract its content
    into a sentence that humans can read.

    In this thesis, we implement a real-time image captioning system for scene identification on an embedded system.
    Images are captured by a webcam attached to the embedded system, and the image captioning model ported to
    the embedded system converts the captured images into human-readable sentences,
    so that users can quickly understand the scene in each image.
    A deep convolutional neural network inside the image captioning model analyzes the images,
    and an attention mechanism assists the long short-term memory network in generating the corresponding
    sentences from the image features.
    Thanks to the portability of embedded systems, our scene identification system can be placed anywhere in the home.
    To validate the system, we compare its execution time on several devices
    and show real captured scenes converted into natural language.
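
    The flow just described (webcam frame in, human-readable sentence out) can be pictured as a short
    capture-and-caption loop. The following is a minimal sketch, not the thesis implementation: it assumes
    OpenCV is available on the board, and generate_caption is a hypothetical placeholder standing in for
    the ported captioning model.

    import time
    import cv2

    def generate_caption(frame):
        # Hypothetical placeholder: the real system would run the CNN encoder and
        # the attention-guided LSTM decoder on this frame.
        return "a person is sitting in a living room"

    def main():
        cap = cv2.VideoCapture(0)          # webcam attached to the embedded system
        if not cap.isOpened():
            raise RuntimeError("webcam not available")
        try:
            while True:
                ok, frame = cap.read()     # grab one frame
                if not ok:
                    break
                start = time.time()
                sentence = generate_caption(frame)
                elapsed = time.time() - start
                # The sentence, not the raw image, is what the user reads.
                print(f"{sentence}  ({elapsed:.2f} s)")
        finally:
            cap.release()

    if __name__ == "__main__":
        main()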


    Recently, people have gradually paid more attention to home care
    and are considering how to use technology to assist them.
    With the rapid development of wireless communication technology and the Internet of Things,
    and the fact that most people now carry mobile devices,
    it is increasingly common to use a webcam to view the home from a remote location.
    However, transmitting the captured images to the user's device means more time is needed
    to understand the meaning of the images, and too many images consume storage space on the device.
    Therefore, we use a model to extract the content of an image into a sentence that humans can read.

    In this thesis, we implement a real-time image captioning system for scene identification using an embedded system.
    Our system captures images through a webcam and uses the image captioning model
    ported to the embedded system to convert the captured images into human-readable sentences.
    Users can quickly understand the meaning of each image with the assistance of our system.
    The image captioning model converts captured images into human-readable sentences in two steps.
    First, image features are extracted by a deep convolutional neural network.
    Then, a long short-term memory network, guided by an attention mechanism, produces the corresponding words
    from the image features.
    Thanks to the portability of embedded systems, we are able to place our image captioning system
    for scene identification anywhere in the home.
    To validate our proposed system,
    we compare the execution time on several different devices.
    In addition, we show the sentences generated from the captured images.
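
    The two steps above (convolutional features in, attention-guided LSTM words out) can be illustrated
    with a single decoding step. The following NumPy sketch uses assumed shapes (a 14x14 VGG-style feature
    map, a 256-dimensional hidden state) and randomly initialized illustrative weights; it is not the
    thesis code, and it only shows how soft attention blends spatial features into the context vector
    that would drive the next LSTM word step.

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    L, D, H = 196, 512, 256                   # 14x14 locations, feature dim, hidden dim (assumed)
    rng = np.random.default_rng(0)

    features = rng.standard_normal((L, D))    # stand-in for the CNN feature map of one frame
    h_prev = np.zeros(H)                      # previous LSTM hidden state

    # Soft attention: score each spatial location against the hidden state, normalize, blend.
    W_att_f = 0.01 * rng.standard_normal((D, H))
    W_att_h = 0.01 * rng.standard_normal((H, H))
    v_att = 0.01 * rng.standard_normal(H)

    scores = np.tanh(features @ W_att_f + h_prev @ W_att_h) @ v_att   # one score per location
    alpha = softmax(scores)                                           # attention weights over locations
    context = alpha @ features                                        # weighted feature (context) vector

    # Together with the previous word embedding, `context` would feed one LSTM step
    # that emits the next word of the caption.
    print(alpha.shape, context.shape)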

    Abstract (in Chinese) . . . I
    Abstract . . . II
    Acknowledgements . . . III
    Table of Contents . . . IV
    List of Figures . . . VI
    List of Tables . . . VIII
    1 Introduction . . . 1
        1.1 Motivation and Purpose . . . 1
        1.2 Summary . . . 2
    2 Related Works . . . 3
    3 Recurrent Neural Network . . . 5
        3.1 LSTM . . . 7
            3.1.1 Forget Gate Layer . . . 7
            3.1.2 Input Gate Layer . . . 8
            3.1.3 Output Gate Layer . . . 9
            3.1.4 LSTM Unit . . . 9
    4 Image Annotation Model . . . 11
        4.1 Encoder . . . 12
            4.1.1 VGGNet . . . 12
            4.1.2 Image Features . . . 13
        4.2 Decoder . . . 13
            4.2.1 Attention Mechanism . . . 13
            4.2.2 LSTM . . . 14
    5 Experiment and Result . . . 16
        5.1 Experiment Settings . . . 16
            5.1.1 Raspberry Pi 3 . . . 17
            5.1.2 Upboard . . . 18
            5.1.3 Personal Computer . . . 18
        5.2 Experiment . . . 19
            5.2.1 Build up the experiment environment on the Upboard embedded system . . . 19
            5.2.2 Build up the experiment environment on the Raspberry Pi 3 . . . 26
            5.2.3 Implement an image captioning system for scene identification . . . 34
        5.3 Results . . . 41
            5.3.1 Generated Captioning . . . 42
    6 Conclusion and Future Work . . . 43
    References . . . 44
    Appendix: Examples of Generated Image Captions . . . 47

