| Graduate Student: | 吳文綺 Wen-Chi Wu |
|---|---|
| Thesis Title: | The Study of Rank Pooling for Video Summarization Using Deep-Learned Features (運用排序學習進行影片摘要之研究) |
| Advisor: | 陳郁堂 Yie-Tarng Chen |
| Committee Members: | 陳郁堂 Yie-Tarng Chen, 呂政修 Jenq-Shiou Leu, 吳乾彌 Chen-Mie Wu, 林銘波 Ming-Bo Lin, 陳省隆 Hsing-Lung Chen |
| Degree: | Master |
| Department: | Department of Electronic and Computer Engineering, College of Electrical Engineering and Computer Science |
| Publication Year: | 2018 |
| Graduation Academic Year: | 106 |
| Language: | English |
| Pages: | 38 |
| Keywords (Chinese): | 影片摘要 (video summarization), 關鍵幀 (keyframe), 排序學習 (learning to rank), 深度學習 (deep learning) |
| Keywords (English): | Video Summarization, Keyframe, Convolutional Neural Network, Ranking Machine |
With today's rapid technological progress, social networking sites have diversified, multimedia streaming has become mainstream, and imaging devices such as GoPro cameras, Google Glass, and smartphones are increasingly ubiquitous. As the number of people watching streamed media keeps growing, adding a convenient fast-browsing capability to these streaming systems has become a pressing need. One way to build a browsable index is to select the important frames (keyframes); the selected keyframes constitute the video summary. Our goal is to accurately extract from a video the frames that best represent it, while preserving the temporal cause-and-effect relationships so that a viewer can clearly follow the video's storyline.
First, we apply the Iterative Quantization (ITQ) algorithm to learn similarity-preserving binary codes over deep-learned features, and use them to remove near-duplicate frames. Next, a ranking machine (VideoDarwin) scores the remaining frames on both temporal and appearance information. Finally, we use the Statistics-sensitive Non-linear Iterative Peak-clipping (SNIP) algorithm to define sliding windows over the video and select keyframes as the summary.
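As a rough illustration of the first step, the sketch below implements ITQ-style similarity-preserving binary codes and Hamming-distance de-duplication. The `max_hamming` threshold and the random features are placeholders; the thesis applies this to deep-learned frame features.

```python
import numpy as np

def itq_binary_codes(features, n_bits=32, n_iters=50, seed=0):
    """Sketch of Iterative Quantization (ITQ, Gong et al.):
    PCA-project the features, then alternate between binarizing
    and solving an orthogonal Procrustes problem for the rotation."""
    rng = np.random.default_rng(seed)
    X = features - features.mean(axis=0)          # zero-center
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    V = X @ Vt[:n_bits].T                         # PCA projection
    R, _ = np.linalg.qr(rng.standard_normal((n_bits, n_bits)))
    for _ in range(n_iters):
        B = np.sign(V @ R)                        # binarize
        U, _, Wt = np.linalg.svd(V.T @ B)         # Procrustes update
        R = U @ Wt
    return (np.sign(V @ R) > 0).astype(np.uint8)  # 0/1 codes

def drop_near_duplicates(codes, max_hamming=2):
    """Keep a frame only if its code differs from every kept frame
    by more than max_hamming bits (hypothetical threshold)."""
    kept = []
    for i, c in enumerate(codes):
        if all(np.count_nonzero(c ^ codes[j]) > max_hamming for j in kept):
            kept.append(i)
    return kept
```

Identical frames map to identical codes, so the Hamming test removes exact and near duplicates in one pass.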
In our experiments, the summaries produced by the proposed method preserve the complete story structure of the source video. We evaluate on videos collected from the Open Video Project and the YouTube dataset, as well as on other video genres. The results show that the proposed method successfully encodes temporal and appearance information jointly, and that the selected keyframes achieve high accuracy.
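The scoring step described above can be sketched in a simplified form. VideoDarwin proper fits a ranking SVM; the least-squares fit below is a lightweight stand-in, shown only to illustrate how a single weight vector can encode a video's temporal evolution (the feature values used here are placeholders):

```python
import numpy as np

def rank_pool(features):
    """Simplified rank pooling: over the time-averaged frame features,
    fit a linear function whose scores increase with frame index.
    The weight vector w summarizes the video's temporal evolution,
    and V @ w gives a per-frame score."""
    T = len(features)
    # running mean of the frame features (the smoothing used by VideoDarwin)
    V = np.cumsum(features, axis=0) / np.arange(1, T + 1)[:, None]
    t = np.arange(1, T + 1, dtype=float)
    w, *_ = np.linalg.lstsq(V, t, rcond=None)   # V @ w ≈ t
    return w, V @ w
```

Frames whose scores deviate most from the fitted trend are the ones that change the video's appearance the most, which is what makes the scores useful for keyframe selection.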
Video summarization has become more important than ever, as it helps us digest, browse, and search today's fast-growing volume of video. Most existing works do not fully exploit appearance and temporal information. In this paper, we propose a novel approach to video summarization based on a rank pooling scheme that aggregates appearance and temporal information. First, we remove redundant frames by learning similarity-preserving binary codes with the Iterative Quantization (ITQ) algorithm applied to deep-learned features. Then, a ranking machine scores the frames on appearance and temporal information. Finally, we use the Statistics-sensitive Non-linear Iterative Peak-clipping (SNIP) algorithm to learn the sliding windows that select keyframes for the summary. Experimental results show that the proposed scheme outperforms state-of-the-art approaches on the Open Video Project and YouTube datasets.
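The SNIP step named in the abstract originates in spectroscopic baseline estimation. A minimal sketch of its core clipping loop, applied here to a generic 1-D score signal (the half-window size is a hypothetical parameter; the thesis uses the result to derive sliding windows over per-frame scores):

```python
import numpy as np

def snip_baseline(scores, max_half_window):
    """Statistics-sensitive Non-linear Iterative Peak-clipping (SNIP):
    at growing half-window sizes m, clip each point to the mean of its
    neighbours m steps away. The result approximates the baseline, so
    scores - baseline isolates the peaks (candidate keyframe regions)."""
    b = np.asarray(scores, dtype=float).copy()
    n = len(b)
    for m in range(1, max_half_window + 1):
        clipped = b.copy()
        for i in range(m, n - m):
            clipped[i] = min(b[i], 0.5 * (b[i - m] + b[i + m]))
        b = clipped
    return b
```

Subtracting the estimated baseline leaves the peaks; their extents can then serve as the sliding windows from which keyframes are picked.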