
Author: Wen-Chi Wu (吳文綺)
Thesis Title: The Study of Rank Pooling for Video Summarization Using Deep-Learned Features (運用排序學習進行影片摘要之研究)
Advisor: Yie-Tarng Chen (陳郁堂)
Committee: Yie-Tarng Chen (陳郁堂), Jenq-Shiou Leu (呂政修), Chen-Mie Wu (吳乾彌), Ming-Bo Lin (林銘波), Hsing-Lung Chen (陳省隆)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Electronic and Computer Engineering
Year of Publication: 2018
Graduation Academic Year: 106
Language: English
Pages: 38
Chinese Keywords: video summarization (影片摘要), keyframe (關鍵幀), learning to rank (排序學習), deep learning (深度學習)
English Keywords: Video Summarization, Keyframe, Convolutional Neural Network, Ranking Machine
Technology today is flourishing: social networking sites have diversified, multimedia streaming systems have become mainstream, and imaging devices such as the GoPro, Google Glass, and smartphones are increasingly common. As the audience for multimedia streaming grows, a convenient fast-browsing feature on these platforms has become a pressing need. One approach to index-based browsing is to select the important frames (keyframes); the selected keyframes constitute a video summary. Our goal is to extract from a video the frames that best represent it, while preserving the temporal cause-and-effect structure so that a viewer can clearly grasp the video's storyline.
First, we apply the Iterative Quantization (ITQ) algorithm to learn similarity-preserving binary codes over deep-learned features, and use them to remove near-duplicate frames. Next, we score the remaining frames on both temporal and appearance information with a ranking machine (VideoDarwin). Finally, we use the Statistics-sensitive Non-linear Iterative Peak-clipping (SNIP) algorithm to define sliding windows over the video, from which keyframes are selected as the summary.
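The ITQ step above can be sketched in a few lines of NumPy. This is a minimal illustration of the Gong et al. algorithm (PCA projection followed by alternating binary-code and rotation updates), not the thesis's exact implementation; the function name and parameters are illustrative.

```python
import numpy as np

def itq(features, n_bits=64, n_iters=50, seed=0):
    """Iterative Quantization (Gong et al.): learn an orthogonal rotation R
    minimizing the quantization loss ||B - V R||_F between binary codes B
    and the PCA-projected features V."""
    rng = np.random.default_rng(seed)
    # Center the data and project onto the top n_bits PCA directions.
    X = features - features.mean(axis=0)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    V = X @ Vt[:n_bits].T
    # Start from a random orthogonal rotation.
    R, _ = np.linalg.qr(rng.standard_normal((n_bits, n_bits)))
    for _ in range(n_iters):
        B = np.sign(V @ R)                  # fix R, update the binary codes
        U, _, Wt = np.linalg.svd(B.T @ V)   # fix B, solve orthogonal Procrustes
        R = (U @ Wt).T                      # R = Ŝ Sᵀ as in the ITQ paper
    return np.sign(V @ R).astype(np.int8)
```

Near-duplicate frames can then be discarded by thresholding the Hamming distance between the binary codes of neighbouring frames.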
In our experiments, the summaries produced by the proposed method preserve the complete story structure of the video. We evaluate on videos collected from the Open Video Project and the YouTube dataset, as well as videos of other genres. The results show that the proposed method successfully encodes temporal and appearance information jointly, and that the selected keyframes achieve high accuracy.
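The SNIP peak-clipping step mentioned in the abstract can be illustrated with a short sketch. It follows the usual formulation of the algorithm (an LLS transform plus iterative neighbour clipping, in the spirit of Morháč), not the thesis's exact code; the function name and the window parameter `m` are assumptions.

```python
import numpy as np

def snip_baseline(y, m=20):
    """SNIP sketch (Morháč): iteratively clip each point to the mean of its
    neighbours at growing half-widths p = 1..m; y - baseline leaves the peaks."""
    # Log-log-square-root (LLS) transform compresses the dynamic range.
    v = np.log(np.log(np.sqrt(np.asarray(y, float) + 1.0) + 1.0) + 1.0)
    n = len(v)
    for p in range(1, m + 1):
        w = v.copy()
        for i in range(p, n - p):
            w[i] = min(v[i], 0.5 * (v[i - p] + v[i + p]))
        v = w
    # Invert the LLS transform to return to the original scale.
    return (np.exp(np.exp(v) - 1.0) - 1.0) ** 2 - 1.0
```

Subtracting the estimated baseline from the frame-score curve isolates the peak regions, which can then serve as candidate windows for keyframe selection.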


Video summarization has become more important than ever, for it helps us digest, browse, and search today's fast-growing volume of video. Most existing works do not fully exploit appearance and temporal information. In this paper, we propose a novel approach for video summarization based on a rank pooling scheme that aggregates appearance and temporal information. First, we remove redundant frames by learning similarity-preserving binary codes with the Iterative Quantization (ITQ) algorithm over deep-learned features. Then, a ranking machine scores the frames on appearance and temporal information. Finally, we use the Statistics-sensitive Non-linear Iterative Peak-clipping (SNIP) algorithm to learn the sliding windows from which keyframes are selected for the summary. Experimental results show that the proposed video summarization scheme outperforms state-of-the-art approaches on the Open Video Project and YouTube datasets.
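As a rough illustration of the ranking-machine scoring, the sketch below replaces the RankSVM-style solver used in VideoDarwin (Fernando et al.) with an ordinary least-squares fit of frame time against time-averaged features; the function name and the smoothing choice are assumptions, not the thesis's implementation.

```python
import numpy as np

def rank_pooling_scores(frame_feats):
    """Rank pooling sketch: fit a linear function whose output increases
    with time over smoothed frame features, then score each frame by its
    projection onto the learned direction."""
    T = frame_feats.shape[0]
    # Time-varying mean: V[t] is the average of the features up to frame t.
    V = np.cumsum(frame_feats, axis=0) / np.arange(1, T + 1)[:, None]
    t = np.arange(1, T + 1, dtype=float)
    # Least-squares surrogate for the ranking machine: solve V w ≈ t.
    w, *_ = np.linalg.lstsq(V, t, rcond=None)
    return frame_feats @ w  # per-frame evolution score
```

The learned direction `w` summarizes how the video's appearance evolves over time, so frames projecting strongly onto it mark where the storyline advances.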

Chinese Abstract
Abstract
Acknowledgment
Table of Contents
List of Tables
List of Figures
1 Introduction
2 Related Work
3 Approach
3.1 Feature Extraction
3.2 Iterative Quantization
3.3 Temporal Encoding by Video Evolution
3.4 Low-pass Filter
3.5 Filter of Peak Region
4 Experiment
4.1 Experimental Setup
4.2 Features
4.3 Performance Metric and Parameter Setting
4.4 Experimental Results
5 Conclusions
References

