
Author: 周儷潔
Li-Chieh Chou
Thesis Title: 卷積長短期記憶與雙向遞迴神經網路結合自注意力機制之深度強化學習於影片摘要
Convolutional LSTM based Bidirectional RNN with Self-attention for Deep Reinforcement Learning in Video Summarization
Advisor: 蘇順豐
Shun-Feng Su
Committee Members: 郭重顯
Chung-Hsien Kuo
王偉彥
Wei-Yen Wang
鍾聖倫
Sheng-Luen Chung
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Electrical Engineering
Year of Publication: 2023
Academic Year of Graduation: 111 (ROC calendar)
Language: English
Pages: 51
Keywords (Chinese): 影片摘要, 雙向遞迴神經網路, 卷積長短期記憶網路, 自注意力機制, 深度強化學習
Keywords (English): video summarization, bi-directional recurrent neural network, convolutional LSTM, self-attention, deep reinforcement learning
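  • In this study, we propose replacing the pretrained convolutional neural network (CNN), GoogLeNet, with ResNeXt-50 [1], and combining a bidirectional recurrent neural network (BRNN) [2] with a convolutional long short-term memory network (ConvLSTM) [3]. In addition, a self-attention mechanism [4] is added to the architecture to improve system performance. The task of video summarization is to preserve the content of the original video, capture its key points, and output a summary close to what viewers have in mind. The method is trained with deep reinforcement learning (DRL), which requires no labeled ground truth. We also add two loss functions, a regularization loss and a reconstruction loss, which help improve stability and performance. The proposed method achieves an accuracy of 53.1% on the SumMe [5] dataset. This study provides a video summarization method that produces more informative and representative summaries.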


    This study proposes a video summarization architecture that replaces GoogLeNet, the pre-trained CNN of the baseline approach, with ResNeXt-50 [1] and combines a Bi-directional Recurrent Neural Network (BRNN) [2] with Convolutional Long Short-Term Memory (ConvLSTM) [3]. Self-attention mechanisms [4] are added to further improve system performance. The goal of video summarization is to produce a summary that preserves the content of the original video, captures its key points, and stays close to what viewers would consider important. The model is trained with Deep Reinforcement Learning (DRL), which does not require labeled data. Two additional loss functions, a regularization loss and a reconstruction loss, are included to improve training stability and summarization performance. The proposed method achieves state-of-the-art performance of 53.1% on the SumMe dataset [5]. These results indicate that the approach yields more informative and representative video summaries.
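
    The abstract does not spell out the reward or the loss terms. As a rough illustration only, the sketch below follows the diversity-representativeness reward and the selection-ratio regularizer of the baseline approach [11], on the assumption that the thesis builds on something similar. All function names, the `target_ratio` default, and the toy features are hypothetical, and the reconstruction loss is omitted because the abstract does not define it.

```python
import math
import random


def cosine_dissimilarity(a, b):
    """Return 1 - cosine similarity of two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def diversity_reward(features, picks):
    """Mean pairwise dissimilarity among the selected frames."""
    if len(picks) < 2:
        return 0.0
    total = sum(
        cosine_dissimilarity(features[i], features[j])
        for i in picks for j in picks if i != j
    )
    return total / (len(picks) * (len(picks) - 1))


def representativeness_reward(features, picks):
    """exp(-mean distance from every frame to its nearest selected frame)."""
    if not picks:
        return 0.0
    total = sum(
        min(math.dist(f, features[j]) for j in picks) for f in features
    )
    return math.exp(-total / len(features))


def regularization_loss(probs, target_ratio=0.5):
    """Penalise the mean selection probability drifting from target_ratio.

    target_ratio is an assumed hyperparameter, not a value from the thesis.
    """
    mean_p = sum(probs) / len(probs)
    return (mean_p - target_ratio) ** 2


def sample_summary(probs, rng):
    """Sample one Bernoulli action per frame; return indices of kept frames."""
    return [t for t, p in enumerate(probs) if rng.random() < p]


def reinforce_weights(probs, picks, reward, baseline):
    """Per-frame gradient of the Bernoulli log-probability with respect to
    p_t, scaled by the advantage (reward - baseline), as in REINFORCE [43].
    Probabilities must lie strictly between 0 and 1."""
    chosen = set(picks)
    advantage = reward - baseline
    return [
        advantage * (1.0 / p if t in chosen else -1.0 / (1.0 - p))
        for t, p in enumerate(probs)
    ]


if __name__ == "__main__":
    # Toy 2-D "frame features" and frame-selection probabilities.
    features = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
    probs = [0.9, 0.8, 0.2, 0.1]
    picks = sample_summary(probs, random.Random(0))
    reward = (diversity_reward(features, picks)
              + representativeness_reward(features, picks))
    print(picks, reward, regularization_loss(probs))
    print(reinforce_weights(probs, picks, reward, baseline=1.0))
```

    In this toy setup, selecting frames 0 and 1 covers both distinct feature vectors, so the diversity and representativeness terms each evaluate to 1.0. In a real pipeline the features would come from the pre-trained CNN and the probabilities from the recurrent network with self-attention.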

    Abstract (Chinese)
    Abstract
    Acknowledgments
    Contents
    List of Figures
    List of Tables
    Chapter 1 Introduction
        1.1 Background
        1.2 Motivation
        1.3 Contributions
        1.4 Thesis Organization
    Chapter 2 Related Work
        2.1 Video Summarization
        2.2 Baseline Approach
            2.2.1 Convolutional Neural Network
            2.2.2 Recurrent Neural Network
            2.2.3 Loss Function
    Chapter 3 Methodology
        3.1 Network Architecture
            3.1.1 CNN pre-trained network
            3.1.2 BRNN
            3.1.3 Convolutional Long Short-Term Memory
            3.1.4 Self-attention
        3.2 Deep Reinforcement Learning
            3.2.1 Policy Gradient Methods
            3.2.2 Reward Function
            3.2.3 Optimization
        3.3 Video Summary
    Chapter 4 Experiments
        4.1 Dataset
        4.2 Evaluation Metrics and Protocol
        4.3 Implementation Details
        4.4 Comparison with State-of-the-arts
            4.4.1 Quantitative Evaluation
            4.4.2 Qualitative Evaluation
    Chapter 5 Conclusions and Future Work
        5.1 Conclusions
        5.2 Future Work
    References

    [1] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, "Aggregated residual
    transformations for deep neural networks," arXiv preprint arXiv:1611.05431,
    2017.
    [2] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE
    Transactions on Signal Processing, vol. 45, no. 11, pp. 2673-2681, 1997, doi:
    10.1109/78.650093.
    [3] X. Shi, Z. Chen, H. Wang, D.-Y. Yeung, W.-k. Wong, and W.-c. Woo,
    "Convolutional LSTM network: a machine learning approach for precipitation
    nowcasting," in Proceedings of the 28th International Conference on Neural
    Information Processing Systems - Volume 1, Montreal, Canada, 2015, vol. 1:
    MIT Press, pp. 802-810.
    [4] A. Vaswani et al., "Attention is all you need," arXiv preprint
    arXiv:1706.03762, 2017.
    [5] M. Gygli, H. Grabner, H. Riemenschneider, and L. V. Gool, "Creating
    summaries from user videos," in Computer Vision – ECCV 2014, vol. 8695:
    Springer International Publishing, 2014, pp. 505-520.
    [6] T. Tsoneva, M. Barbieri, and H. Weda, "Automated summarization of narrative
    video on a semantic level," in International Conference on Semantic Computing
    (ICSC 2007), Irvine, CA, USA, 17-19 Sept. 2007, pp. 169-176, doi:
    10.1109/ICSC.2007.42.
    [7] T. Liu, Q. Meng, A. Vlontzos, J. Tan, D. Rueckert, and B. Kainz, "Ultrasound
    video summarization using deep reinforcement learning," arXiv preprint
    arXiv:2005.09531, 2020.
    [8] R. P. Mathews et al., "Unsupervised multi-latent space RL framework for video
    summarization in ultrasound imaging," IEEE Journal of Biomedical and Health
    Informatics, vol. 27, no. 1, pp. 227-238, 2023, doi: 10.1109/JBHI.2022.3208779.
    [9] T. Liu, Q. Meng, J.-J. Huang, A. Vlontzos, D. Rueckert, and B. Kainz, "Video
    summarization through reinforcement learning with a 3D spatio-temporal
    U-Net," IEEE Transactions on Image Processing, vol. 31, pp. 1573-1586, 2022,
    doi: 10.1109/TIP.2022.3143699.
    [10] K. Zhang, W.-L. Chao, F. Sha, and K. Grauman, "Video summarization with
    long short-term memory," arXiv preprint arXiv:1605.08110, 2016.
    [11] K. Zhou, Y. Qiao, and T. Xiang, "Deep reinforcement learning for unsupervised
    video summarization with diversity-representativeness reward," in
    Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence
    and Thirtieth Innovative Applications of Artificial Intelligence Conference and
    Eighth AAAI Symposium on Educational Advances in Artificial Intelligence,
    New Orleans, Louisiana, USA, 02 February 2018.
    [12] A. Phaphuangwittayakul, Y. Guo, F. Ying, W. Xu, and Z. Zheng, "Self-attention
    recurrent summarization network with reinforcement learning for video
    summarization task," in 2021 IEEE International Conference on Multimedia and
    Expo (ICME), Shenzhen, China, 09 June 2021: Institute of Electrical and
    Electronics Engineers (IEEE), doi: 10.1109/ICME51207.2021.9428142.
    [13] M. S. Afzal and M. A. Tahir, "Reinforcement learning based video
    summarization with combination of ResNet and gated recurrent unit," in
    VISAPP 2021 - 16th International Conference on Computer Vision Theory and
    Applications, 2021, vol. 4, pp. 261-268.
    [14] P. Kadam, D. Vora, S. Mishra, S. Patil, K. Kotecha, A. Abraham, and L. A.
    Gabralla, "Recent challenges and opportunities in video summarization with
    machine learning algorithms," IEEE Access, vol. 10, pp. 122762-122785, 2022,
    doi: 10.1109/access.2022.3223379.
    [15] S. S. Thomas, S. Gupta, and V. K. Subramanian, "Smart surveillance based on
    video summarization," in 2017 IEEE Region 10 Symposium (TENSYMP),
    Cochin, India, 14-16 July 2017, pp. 1-5, doi:
    10.1109/TENCONSpring.2017.8070003.
    [16] K. Muhammad, T. Hussain, J. D. Ser, W. Ding, A. H. Gandomi, and V. H. C. D.
    Albuquerque, "Efficient video summarization for smart surveillance systems," in
    2022 IEEE Symposium Series on Computational Intelligence (SSCI), Singapore,
    Singapore, 4-7 Dec. 2022, pp. 672-677, doi:
    10.1109/SSCI51031.2022.10022220.
    [17] E. Apostolidis, E. Adamantidou, A. I. Metsai, V. Mezaris, and I. Patras, "Video
    summarization using deep neural networks: A survey," Proceedings of the IEEE,
    vol. 109, no. 11, pp. 1838-1863, 2021, doi: 10.1109/JPROC.2021.3117472.
    [18] M. Rochan, L. Ye, and Y. Wang, "Video summarization using fully convolutional
    sequence networks," in Computer Vision – ECCV 2018, 06
    October 2018.
    [19] J. Fajtl, H. S. Sokeh, V. Argyriou, D. Monekosso, and P. Remagnino,
    "Summarizing videos with attention," in Computer Vision – ACCV 2018
    Workshops: Springer International Publishing, 2019, pp. 39-54.
    [20] Y.-T. Liu, Y.-J. Li, F.-E. Yang, S.-F. Chen, and Y.-C. F. Wang, "Learning
    hierarchical self-attention for video summarization," in 2019 IEEE International
    Conference on Image Processing (ICIP), Taipei, Taiwan, 22-25 September 2019:
    Institute of Electrical and Electronics Engineers (IEEE), pp. 3377-3381, doi:
    10.1109/ICIP.2019.8803639.
    [21] B. Mahasseni, M. Lam, and S. Todorovic, "Unsupervised video summarization
    with adversarial LSTM networks," in 2017 IEEE Conference on Computer
    Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 09 November
    2017, pp. 2982-2991, doi: 10.1109/CVPR.2017.318.
    [22] Y. Jung, D. Cho, D. Kim, S. Woo, and I. S. Kweon, "Discriminative feature
    learning for unsupervised video summarization," in AAAI'19/IAAI'19/EAAI'19:
    Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence and
    Thirty-First Innovative Applications of Artificial Intelligence Conference and
    Ninth AAAI Symposium on Educational Advances in Artificial Intelligence,
    Honolulu, Hawaii, USA, January 2019: AAAI Press, pp. 8537–8544, doi:
    10.1609/aaai.v33i01.33018537.
    [23] E. Apostolidis, G. Balaouras, V. Mezaris, and I. Patras, "Summarizing videos
    using concentrated attention and considering the uniqueness and diversity of the
    video frames," in Proceedings of the 2022 International Conference on
    Multimedia Retrieval, Newark, NJ, USA, 27 June 2022: Association for
    Computing Machinery, pp. 407-415, doi: 10.1145/3512527.3531404.
    [24] G. Yaliniz and N. Ikizler-Cinbis, "Using independently recurrent networks for
    reinforcement learning based unsupervised video summarization," Multimedia
    Tools and Applications, vol. 80, no. 12, pp. 17827-17847, 2021, doi:
    10.1007/s11042-020-10293-x.
    [25] X. Wang, Y. Li, H. Wang, L. Huang, and S. Ding, "A video summarization
    model based on deep reinforcement learning with long-term dependency,"
    Sensors, vol. 22, no. 19, p. 7689, 2022, doi: 10.3390/s22197689.
    [26] S.-S. Zang, H. Yu, Y. Song, and R. Zeng, "Unsupervised video summarization
    using deep non-local video summarization networks," Neurocomputing, vol. 519,
    pp. 26-35, 28 January 2023.
    [27] U. N. Yoon, M. D. Hong, and G.-S. Jo, "Unsupervised video summarization
    based on deep reinforcement learning with interpolation," Sensors, vol. 23, no. 7,
    p. 3384, 2023, doi: 10.3390/s23073384.
    [28] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied
    to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp.
    2278-2324, Nov. 1998, doi: 10.1109/5.726791.
    [29] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the
    inception architecture for computer vision," arXiv preprint arXiv:1512.00567,
    2015.
    [30] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by
    back-propagating errors," Nature, vol. 323, no. 6088, pp. 533-536, 1986, doi:
    10.1038/323533a0.
    [31] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural
    Computation, vol. 9, no. 8, pp. 1735-1780, November 15, 1997, doi:
    10.1162/neco.1997.9.8.1735.
    [32] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A
    large-scale hierarchical image database," in 2009 IEEE Conference on Computer
    Vision and Pattern Recognition, Miami, FL, USA, 2009, pp. 248-255, doi:
    10.1109/CVPR.2009.5206848.
    [33] R. Bellman, "A Markovian Decision Process," J. Math. Mech., vol. 6, no. 5, pp.
    679-684, 1957.
    [34] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and
    M. Riedmiller, "Playing Atari with deep reinforcement learning," arXiv preprint
    arXiv:1312.5602, 2013.
    [35] V. Mnih et al., "Human-level control through deep reinforcement learning,"
    Nature, vol. 518, no. 7540, pp. 529-533, 25 February 2015, doi:
    10.1038/nature14236.
    [36] L. A. d. Almeida and M. R. Thielo, "An intelligent agent playing generic action
    games based on deep reinforcement learning with memory restrictions," in 2020
    19th Brazilian Symposium on Computer Games and Digital Entertainment
    (SBGames), Recife, Brazil, 2020, pp. 29-37, doi:
    10.1109/SBGames51465.2020.00015.
    [37] M. Bojarski et al., "End to end learning for self-driving cars," arXiv preprint
    arXiv:1604.07316, 2016.
    [38] B. R. Kiran, I. Sobh, V. Talpaert, P. Mannion, A. A. A. Sallab, S. Yogamani, and
    P. Pérez, "Deep reinforcement learning for autonomous driving: A survey," IEEE
    Transactions on Intelligent Transportation Systems, vol. 23, no. 6, pp.
    4909-4926, June 2022, doi: 10.1109/TITS.2021.3054625.
    [39] D. Zhang, J. Han, L. Zhao, and T. Zhao, "From discriminant to complete:
    reinforcement searching-agent learning for weakly supervised object detection,"
    IEEE Transactions on Neural Networks and Learning Systems, vol. 31, no. 12,
    pp. 5549-5560, December 2020, doi: 10.1109/TNNLS.2020.2969483.
    [40] X. Han, H. Liu, F. Sun, and X. Zhang, "Active object detection with multistep
    action prediction using deep Q-network," IEEE Transactions on Industrial
    Informatics, vol. 15, no. 6, pp. 3723-3731, June 2019, doi:
    10.1109/TII.2019.2890849.
    [41] S. Sangve, V. Govilkar, N. Shingade, S. Jathar, and A. Jadhav, "Multiple stock
    trading using ensemble strategy and deep reinforcement learning," in 2023
    International Conference on Sustainable Computing and Smart Systems
    (ICSCSS), Coimbatore, India, 2023, pp. 222-228, doi:
    10.1109/ICSCSS57650.2023.10169379.
    [42] H. Yang, X.-Y. Liu, S. Zhong, and A. Walid, "Deep reinforcement learning for
    automated stock trading," in ICAIF '20: Proceedings of the First ACM
    International Conference on AI in Finance, October 2020, no. 31: ACM, pp. 1-8,
    doi: 10.1145/3383455.3422540.
    [43] R. J. Williams, "Simple statistical gradient-following algorithms for
    connectionist reinforcement learning," Machine Learning, vol. 8, no. 3-4, pp.
    229-256, 1992, doi: 10.1007/bf00992696.
    [44] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv
    preprint arXiv:1412.6980, 2017.
    [45] Y. Song, J. Vallmitjana, A. Stent, and A. Jaimes, "TVSum: Summarizing web
    videos using titles," in 2015 IEEE Conference on Computer Vision and Pattern
    Recognition (CVPR), Boston, MA, USA, 2015, pp. 5179-5187, doi:
    10.1109/CVPR.2015.7299154.
    [46] D. Potapov, M. Douze, Z. Harchaoui, and C. Schmid, "Category-specific video
    summarization," in Computer Vision – ECCV 2014, vol. 8694: Springer
    International Publishing, 2014, pp. 540-555.
    [47] K. Zhang, W.-L. Chao, F. Sha, and K. Grauman, "Summary transfer:
    exemplar-based subset selection for video summarization," arXiv preprint
    arXiv:1603.03369, 2016.
    [48] B. Zhao, X. Li, and X. Lu, "HSA-RNN: Hierarchical structure-adaptive RNN for
    video summarization," in 2018 IEEE/CVF Conference on Computer Vision and
    Pattern Recognition, Salt Lake City, UT, USA, 2018, pp. 7405-7414, doi:
    10.1109/CVPR.2018.00773.
    [49] L. Yuan, F. E. H. Tay, P. Li, and J. Feng, "Unsupervised video summarization
    with cycle-consistent adversarial LSTM networks," IEEE Transactions on
    Multimedia, vol. 22, no. 10, pp. 2711-2722, Oct. 2020, doi:
    10.1109/TMM.2019.2959451.
    [50] M. Rochan and Y. Wang, "Video summarization by learning from unpaired
    data," arXiv preprint arXiv:1805.12174, 2019.
    [51] E. Apostolidis, A. I. Metsai, E. Adamantidou, V. Mezaris, and I. Patras, "A
    stepwise, label-based approach for improving the adversarial training in
    unsupervised video summarization," in AI4TV '19: Proceedings of the 1st
    International Workshop on AI for Smart TV Content Production, Access and
    Delivery, October 2019: ACM, pp. 17-25, doi: 10.1145/3347449.3357482.
    [52] Z. Lei, C. Zhang, Q. Zhang, and G. Qiu, "FrameRank: A text processing
    approach to video summarization," in 2019 IEEE International Conference on
    Multimedia and Expo (ICME), Shanghai, China, 2019, pp. 368-373, doi:
    10.1109/ICME.2019.00071.
    [53] E. Apostolidis, E. Adamantidou, A. I. Metsai, V. Mezaris, and I. Patras,
    "Unsupervised video summarization via attention-driven adversarial learning,"
    in MultiMedia Modeling. MMM 2020. Lecture Notes in Computer Science, vol.
    11961: Springer International Publishing, 2020, pp. 492-504.
    [54] P. Alexoudi, I. Mademlis, and I. Pitas, "Escaping local minima in deep
    reinforcement learning for video summarization," in ICMR '23: Proceedings of
    the 2023 ACM International Conference on Multimedia Retrieval, Thessaloniki,
    Greece, June 2023: ACM, pp. 530-534, doi: 10.1145/3591106.3592288.
    [55] E. Apostolidis, E. Adamantidou, A. I. Metsai, V. Mezaris, and I. Patras,
    "AC-SUM-GAN: connecting actor-critic and generative adversarial networks for
    unsupervised video summarization," IEEE Transactions on Circuits and Systems
    for Video Technology, vol. 31, no. 8, pp. 3278-3292, Aug. 2021, doi:
    10.1109/TCSVT.2020.3037883.
