Convolutional LSTM based Bidirectional RNN with Self-attention for deep reinforcement learning in Video Summarization

Graduate student:	周儷潔 Li-Chieh Chou
Thesis title:	Convolutional LSTM based Bidirectional RNN with Self-attention for deep reinforcement learning in Video Summarization
Advisor:	蘇順豐 Shun-Feng Su
Oral defense committee:	郭重顯 Chung-Hsien Kuo, 王偉彥 Wei-Yen Wang, 鍾聖倫 Sheng-Luen Chung
Degree:	Master
Department:	College of Electrical Engineering and Computer Science - Department of Electrical Engineering
Year of publication:	2023
Academic year of graduation:	111 (ROC calendar)
Language:	English
Number of pages:	51
Keywords:	video summarization, bi-directional recurrent neural network, convolutional LSTM, self-attention, deep reinforcement learning

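In this study, we propose replacing the pretrained convolutional neural network (CNN), GoogLeNet, with ResNeXt-50 [1] and combining a bidirectional recurrent neural network (BRNN) [2] with a convolutional long short-term memory network (ConvLSTM) [3]. A self-attention mechanism [4] is also added to the architecture to improve system performance. The task of video summarization is to preserve the content of the original video, capture its key points, and output a summary close to what viewers have in mind. The method is trained with deep reinforcement learning (DRL), which does not require labeled ground truth. We further add two loss functions, a regularization loss and a reconstruction loss, which help improve stability and performance. The proposed method achieves an accuracy of 53.1% on the SumMe [5] dataset. This study provides a video summarization method that yields more informative and representative summaries.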

This study proposes an architecture for video summarization that replaces GoogLeNet in the baseline approach with ResNeXt-50 [1] as the pre-trained CNN and combines a Bi-directional Recurrent Neural Network [2] with Convolutional Long Short-Term Memory (ConvLSTM) [3]. In addition, self-attention mechanisms [4] are added to improve system performance. The goal of video summarization is to produce a summary that is close to what the audience has in mind, preserves the content of the original video, and captures its key points. The model is trained with Deep Reinforcement Learning, which does not require labeled data. Two loss functions, a regularization loss and a reconstruction loss, are also included; they help improve the stability and performance of video summarization. The proposed method achieves state-of-the-art performance of 53.1% on the SumMe dataset [5]. The results indicate that the approach provides more informative and representative video summaries.
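
The label-free training described in the abstract can be illustrated with a minimal sketch of a REINFORCE-style objective [43] with a regularization term. This is not the thesis implementation: the record does not give the reward or the exact loss terms, so the reward is passed in as a plain number, the reconstruction loss is omitted, and the names `sample_actions`, `log_prob`, `regularization_loss`, `reinforce_loss`, `beta`, and `target_ratio` are all hypothetical choices made for illustration.

```python
import math
import random


def sample_actions(probs, rng):
    """Sample a keep (1) or drop (0) decision per frame from Bernoulli(p_t)."""
    return [1 if rng.random() < p else 0 for p in probs]


def log_prob(probs, actions):
    """Log-likelihood of the sampled decisions under the selection policy."""
    eps = 1e-8  # guards against log(0)
    return sum(
        math.log(p + eps) if a == 1 else math.log(1.0 - p + eps)
        for p, a in zip(probs, actions)
    )


def regularization_loss(probs, target_ratio=0.5):
    """Squared gap between the mean selection probability and target_ratio.

    target_ratio is an assumed hyperparameter, not a value from the thesis.
    """
    mean_p = sum(probs) / len(probs)
    return (mean_p - target_ratio) ** 2


def reinforce_loss(probs, actions, reward, baseline,
                   beta=0.01, target_ratio=0.5):
    """Policy-gradient surrogate loss plus a weighted regularization term.

    Minimizing -(reward - baseline) * log pi(actions) raises the probability
    of decisions that earned more reward than the baseline.
    """
    policy_term = -(reward - baseline) * log_prob(probs, actions)
    return policy_term + beta * regularization_loss(probs, target_ratio)


rng = random.Random(0)
frame_probs = [0.9, 0.2, 0.7, 0.4]  # hypothetical per-frame scores
actions = sample_actions(frame_probs, rng)
# reward and baseline are placeholder numbers, not results from the thesis
loss = reinforce_loss(frame_probs, actions, reward=0.8, baseline=0.5)
```

In a real model the frame probabilities would come from the network output and the loss would be minimized with an automatic-differentiation framework; plain Python floats are used here only to keep the arithmetic visible.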

Chinese Abstract
Abstract
Acknowledgements
Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Background
1.2 Motivation
1.3 Contributions
1.4 Thesis Organization
Chapter 2 Related Work
2.1 Video Summarization
2.2 Baseline Approach
2.2.1 Convolutional Neural Network
2.2.2 Recurrent Neural Network
2.2.3 Loss Function
Chapter 3 Methodology
3.1 Network Architecture
3.1.1 CNN pre-trained network
3.1.2 BRNN
3.1.3 Convolutional Long Short-Term Memory
3.1.4 Self-attention
3.2 Deep Reinforcement Learning
3.2.1 Policy Gradient Methods
3.2.2 Reward Function
3.2.3 Optimization
3.3 Video Summary
Chapter 4 Experiments
4.1 Dataset
4.2 Evaluation Metrics and Protocol
4.3 Implementation Details
4.4 Comparison with State-of-the-arts
4.4.1 Quantitative Evaluation
4.4.2 Qualitative Evaluation
Chapter 5 Conclusions and Future Work
5.1 Conclusions
5.2 Future Work
References

[1] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, "Aggregated residual
transformations for deep neural networks," arXiv preprint arXiv:1611.05431,
2017.
[2] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE
Transactions on Signal Processing, vol. 45, no. 11, pp. 2673-2681, 1997, doi:
10.1109/78.650093.
[3] X. Shi, Z. Chen, H. Wang, D.-Y. Yeung, W.-k. Wong, and W.-c. Woo,
"Convolutional LSTM network: a machine learning approach for precipitation
nowcasting," in Proceedings of the 28th International Conference on Neural
Information Processing Systems - Volume 1, Montreal, Canada, 2015, vol. 1:
MIT Press, pp. 802-810.
[4] A. Vaswani et al., "Attention is all you need," arXiv preprint
arXiv:1706.03762, 2017.
[5] M. Gygli, H. Grabner, H. Riemenschneider, and L. Van Gool, "Creating
summaries from user videos," in Computer Vision – ECCV 2014, vol. 8695:
Springer International Publishing, 2014, pp. 505-520.
[6] T. Tsoneva, M. Barbieri, and H. Weda, "Automated summarization of narrative
video on a semantic level," in International Conference on Semantic Computing
(ICSC 2007), Irvine, CA, USA, 17-19 Sept. 2007, pp. 169-176, doi:
10.1109/ICSC.2007.42.
[7] T. Liu, Q. Meng, A. Vlontzos, J. Tan, D. Rueckert, and B. Kainz, "Ultrasound
video summarization using deep reinforcement learning," arXiv preprint
arXiv:2005.09531, 2020.
[8] R. P. Mathews et al., "Unsupervised multi-latent space RL framework for video
summarization in ultrasound imaging," IEEE Journal of Biomedical and Health
Informatics, vol. 27, no. 1, pp. 227-238, 2023, doi: 10.1109/JBHI.2022.3208779.
[9] T. Liu, Q. Meng, J.-J. Huang, A. Vlontzos, D. Rueckert, and B. Kainz, "Video
summarization through reinforcement learning with a 3D spatio-temporal
U-Net," IEEE Transactions on Image Processing, vol. 31, pp. 1573-1586, 2022,
doi: 10.1109/TIP.2022.3143699.
[10] K. Zhang, W.-L. Chao, F. Sha, and K. Grauman, "Video summarization with
long short-term memory," arXiv preprint arXiv:1605.08110, 2016.
[11] K. Zhou, Y. Qiao, and T. Xiang, "Deep reinforcement learning for unsupervised
video summarization with diversity-representativeness reward," in
Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence
and Thirtieth Innovative Applications of Artificial Intelligence Conference and
Eighth AAAI Symposium on Educational Advances in Artificial Intelligence,
New Orleans, Louisiana, USA, February 2018.
[12] A. Phaphuangwittayakul, Y. Guo, F. Ying, W. Xu, and Z. Zheng, "Self-attention
recurrent summarization network with reinforcement learning for video
summarization task," in 2021 IEEE International Conference on Multimedia and
Expo (ICME), Shenzhen, China, 09 June 2021: Institute of Electrical and
Electronics Engineers (IEEE), doi: 10.1109/ICME51207.2021.9428142.
[13] M. S. Afzal and M. A. Tahir, "Reinforcement learning based video
summarization with combination of ResNet and gated recurrent unit," in
VISAPP 2021 - 16th International Conference on Computer Vision Theory and
Applications, 2021, vol. 4, pp. 261-268.
[14] P. Kadam, D. Vora, S. Mishra, S. Patil, K. Kotecha, A. Abraham, and L. A.
Gabralla, "Recent challenges and opportunities in video summarization with
machine learning algorithms," IEEE Access, vol. 10, pp. 122762-122785, 2022,
doi: 10.1109/access.2022.3223379.
[15] S. S. Thomas, S. Gupta, and V. K. Subramanian, "Smart surveillance based on
video summarization," in 2017 IEEE Region 10 Symposium (TENSYMP),
Cochin, India, 14-16 July 2017, pp. 1-5, doi:
10.1109/TENCONSpring.2017.8070003.
[16] K. Muhammad, T. Hussain, J. Del Ser, W. Ding, A. H. Gandomi, and V. H. C. de
Albuquerque, "Efficient video summarization for smart surveillance systems," in
2022 IEEE Symposium Series on Computational Intelligence (SSCI), Singapore,
Singapore, 4-7 Dec. 2022, pp. 672-677, doi:
10.1109/SSCI51031.2022.10022220.
[17] E. Apostolidis, E. Adamantidou, A. I. Metsai, V. Mezaris, and I. Patras, "Video
summarization using deep neural networks: A survey," Proceedings of the IEEE,
vol. 109, no. 11, pp. 1838-1863, 2021, doi: 10.1109/JPROC.2021.3117472.
[18] M. Rochan, L. Ye, and Y. Wang, "Video summarization using fully convolutional
sequence networks," in Computer Vision – ECCV 2018, October 2018.
[19] J. Fajtl, H. S. Sokeh, V. Argyriou, D. Monekosso, and P. Remagnino,
"Summarizing videos with attention," Springer International Publishing, 2019,
pp. 39-54.
[20] Y.-T. Liu, Y.-J. Li, F.-E. Yang, S.-F. Chen, and Y.-C. F. Wang, "Learning
hierarchical self-attention for video summarization," in 2019 IEEE International
Conference on Image Processing (ICIP), Taipei, Taiwan, 22-25 September 2019:
Institute of Electrical and Electronics Engineers (IEEE), pp. 3377-3381, doi:
10.1109/ICIP.2019.8803639.
[21] B. Mahasseni, M. Lam, and S. Todorovic, "Unsupervised video summarization
with adversarial LSTM networks," in 2017 IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 09 November
2017, pp. 2982-2991, doi: 10.1109/CVPR.2017.318.
[22] Y. Jung, D. Cho, D. Kim, S. Woo, and I. S. Kweon, "Discriminative feature
learning for unsupervised video summarization," in AAAI'19/IAAI'19/EAAI'19:
Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence and
Thirty-First Innovative Applications of Artificial Intelligence Conference and
Ninth AAAI Symposium on Educational Advances in Artificial Intelligence,
Honolulu, Hawaii, USA, January 2019: AAAI Press, pp. 8537-8544, doi:
10.1609/aaai.v33i01.33018537.
[23] E. Apostolidis, G. Balaouras, V. Mezaris, and I. Patras, "Summarizing videos
using concentrated attention and considering the uniqueness and diversity of the
video frames," in Proceedings of the 2022 International Conference on
Multimedia Retrieval, Newark, NJ, USA, 27 June 2022: Association for
Computing Machinery, pp. 407-415, doi: 10.1145/3512527.3531404.
[24] G. Yaliniz and N. Ikizler-Cinbis, "Using independently recurrent networks for
reinforcement learning based unsupervised video summarization," Multimedia
Tools and Applications, vol. 80, no. 12, pp. 17827-17847, 2021, doi:
10.1007/s11042-020-10293-x.
[25] X. Wang, Y. Li, H. Wang, L. Huang, and S. Ding, "A video summarization
model based on deep reinforcement learning with long-term dependency,"
Sensors, vol. 22, no. 19, p. 7689, 2022, doi: 10.3390/s22197689.
[26] S.-S. Zang, H. Yu, Y. Song, and R. Zeng, "Unsupervised video summarization
using deep non-local video summarization networks," Neurocomputing, vol. 519,
pp. 26-35, 28 January 2023.
[27] U. N. Yoon, M. D. Hong, and G.-S. Jo, "Unsupervised video summarization
based on deep reinforcement learning with interpolation," Sensors, vol. 23, no. 7,
p. 3384, 2023, doi: 10.3390/s23073384.
[28] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied
to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp.
2278-2324, Nov. 1998, doi: 10.1109/5.726791.
[29] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the
inception architecture for computer vision," arXiv preprint arXiv:1512.00567,
2015.
[30] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by
back-propagating errors," Nature, vol. 323, no. 6088, pp. 533-536, 1986, doi:
10.1038/323533a0.
[31] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural
Computation, vol. 9, no. 8, pp. 1735-1780, November 15, 1997, doi:
10.1162/neco.1997.9.8.1735.
[32] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A
large-scale hierarchical image database," in 2009 IEEE Conference on Computer
Vision and Pattern Recognition, Miami, FL, USA, 2009, pp. 248-255, doi:
10.1109/CVPR.2009.5206848.
[33] R. Bellman, "A Markovian Decision Process," J. Math. Mech., vol. 6, no. 5, pp.
679-684, 1957.
[34] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and
M. Riedmiller, "Playing Atari with deep reinforcement learning," arXiv preprint
arXiv:1312.5602, 2013.
[35] V. Mnih et al., "Human-level control through deep reinforcement learning,"
Nature, vol. 518, no. 7540, pp. 529-533, 25 February 2015, doi:
10.1038/nature14236.
[36] L. A. d. Almeida and M. R. Thielo, "An intelligent agent playing generic action
games based on deep reinforcement learning with memory restrictions," in 2020
19th Brazilian Symposium on Computer Games and Digital Entertainment
(SBGames), Recife, Brazil, 2020, pp. 29-37, doi:
10.1109/SBGames51465.2020.00015.
[37] M. Bojarski et al., "End to end learning for self-driving cars," arXiv preprint
arXiv:1604.07316, 2016.
[38] B. R. Kiran, I. Sobh, V. Talpaert, P. Mannion, A. A. Al Sallab, S. Yogamani, and
P. Pérez, "Deep reinforcement learning for autonomous driving: A survey," IEEE
Transactions on Intelligent Transportation Systems, vol. 23, no. 6, pp.
4909-4926, June 2022, doi: 10.1109/TITS.2021.3054625.
[39] D. Zhang, J. Han, L. Zhao, and T. Zhao, "From discriminant to complete:
reinforcement searching-agent learning for weakly supervised object detection,"
IEEE Transactions on Neural Networks and Learning Systems, vol. 31, no. 12,
pp. 5549-5560, December 2020, doi: 10.1109/TNNLS.2020.2969483.
[40] X. Han, H. Liu, F. Sun, and X. Zhang, "Active object detection with multistep
action prediction using deep Q-network," IEEE Transactions on Industrial
Informatics, vol. 15, no. 6, pp. 3723-3731, June 2019, doi:
10.1109/TII.2019.2890849.
[41] S. Sangve, V. Govilkar, N. Shingade, S. Jathar, and A. Jadhav, "Multiple stock
trading using ensemble strategy and deep reinforcement learning," in 2023
International Conference on Sustainable Computing and Smart Systems
(ICSCSS), Coimbatore, India, 2023, pp. 222-228, doi:
10.1109/ICSCSS57650.2023.10169379.
[42] H. Yang, X.-Y. Liu, S. Zhong, and A. Walid, "Deep reinforcement learning for
automated stock trading," in ICAIF '20: Proceedings of the First ACM
International Conference on AI in Finance, October 2020, no. 31: ACM, pp. 1-8,
doi: 10.1145/3383455.3422540.
[43] R. J. Williams, "Simple statistical gradient-following algorithms for
connectionist reinforcement learning," Machine Learning, vol. 8, no. 3-4, pp.
229-256, 1992, doi: 10.1007/bf00992696.
[44] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv
preprint arXiv:1412.6980, 2017.
[45] Y. Song, J. Vallmitjana, A. Stent, and A. Jaimes, "TVSum: Summarizing web
videos using titles," in 2015 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), Boston, MA, USA, 2015, pp. 5179-5187, doi:
10.1109/CVPR.2015.7299154.
[46] D. Potapov, M. Douze, Z. Harchaoui, and C. Schmid, "Category-specific video
summarization," in Computer Vision – ECCV 2014, vol. 8694: Springer
International Publishing, 2014, pp. 540-555.
[47] K. Zhang, W.-L. Chao, F. Sha, and K. Grauman, "Summary transfer:
exemplar-based subset selection for video summarization," arXiv preprint
arXiv:1603.03369, 2016.
[48] B. Zhao, X. Li, and X. Lu, "HSA-RNN: Hierarchical structure-adaptive RNN for
video summarization," in 2018 IEEE/CVF Conference on Computer Vision and
Pattern Recognition, Salt Lake City, UT, USA, 2018, pp. 7405-7414, doi:
10.1109/CVPR.2018.00773.
[49] L. Yuan, F. E. H. Tay, P. Li, and J. Feng, "Unsupervised video summarization
with cycle-consistent adversarial LSTM networks," IEEE Transactions on
Multimedia, vol. 22, no. 10, pp. 2711-2722, Oct. 2020, doi:
10.1109/TMM.2019.2959451.
[50] M. Rochan and Y. Wang, "Video summarization by learning from unpaired
data," arXiv preprint arXiv:1805.12174, 2019.
[51] E. Apostolidis, A. I. Metsai, E. Adamantidou, V. Mezaris, and I. Patras, "A
stepwise, label-based approach for improving the adversarial training in
unsupervised video summarization," in AI4TV '19: Proceedings of the 1st
International Workshop on AI for Smart TV Content Production, Access and
Delivery, October 2019: ACM, pp. 17-25, doi: 10.1145/3347449.3357482.
[52] Z. Lei, C. Zhang, Q. Zhang, and G. Qiu, "FrameRank: A text processing
approach to video summarization," in 2019 IEEE International Conference on
Multimedia and Expo (ICME), Shanghai, China, 2019, pp. 368-373, doi:
10.1109/ICME.2019.00071.
[53] E. Apostolidis, E. Adamantidou, A. I. Metsai, V. Mezaris, and I. Patras,
"Unsupervised video summarization via attention-driven adversarial learning,"
in MultiMedia Modeling. MMM 2020. Lecture Notes in Computer Science, vol.
11961: Springer International Publishing, 2020, pp. 492-504.
[54] P. Alexoudi, I. Mademlis, and I. Pitas, "Escaping local minima in deep
reinforcement learning for video summarization," in ICMR '23: Proceedings of
the 2023 ACM International Conference on Multimedia Retrieval, Thessaloniki,
Greece, June 2023: ACM, pp. 530-534, doi: 10.1145/3591106.3592288.
[55] E. Apostolidis, E. Adamantidou, A. I. Metsai, V. Mezaris, and I. Patras,
"AC-SUM-GAN: connecting actor-critic and generative adversarial networks for
unsupervised video summarization," IEEE Transactions on Circuits and Systems
for Video Technology, vol. 31, no. 8, pp. 3278-3292, Aug. 2021, doi:
10.1109/TCSVT.2020.3037883.
