| Graduate Student: | 邱奕翔 (Yi-Xiang Qiu) |
|---|---|
| Thesis Title: | 運用類神經變換器和前文熵編模型以提升深度視訊編碼效率 (Using Neural-Transformer and Context-Based Entropy Modeling for Efficient Deep Video Compression) |
| Advisor: | 陳建中 (Jiann-Jone Chen) |
| Committee Members: | 杭學鳴 (Hsueh-Ming Hang), 郭天穎 (Tien-Ying Kuo), 吳怡樂 (Yi-Leh Wu), 陳建中 (Jiann-Jone Chen) |
| Degree: | Master |
| Department: | Department of Electrical Engineering, College of Electrical Engineering and Computer Science |
| Year of Publication: | 2022 |
| Academic Year of Graduation: | 110 |
| Language: | Chinese |
| Number of Pages: | 56 |
| Chinese Keywords: | 深度視訊壓縮, 深度幀間預測編碼, 高效率視訊編碼 H.265/HEVC, 深度學習, 類神經變換器, 自回歸模型 |
| Keywords: | Deep Video Compression, Deep Inter-frame Prediction and Coding, High Efficiency Video Coding (H.265/HEVC), Deep Learning, Transformer, Autoregressive Model |
With the rise of communication technologies and multimedia applications, and the growing demand for higher-quality video services, network video traffic has increased dramatically. Effective video compression has therefore become one of the most critical technologies in multimedia signal processing. Current international video coding standards, such as High Efficiency Video Coding (H.265/HEVC) and Versatile Video Coding (H.266/VVC), mainly adopt a block-based hybrid coding framework whose coding modules have been hand-designed and incrementally refined for efficiency. Recent research has turned to deep learning to design end-to-end optimized video compression systems, for example replacing transform coding with convolutional neural networks, or strengthening the entropy coding module with a learned hyperprior model. In recent years, Neural-Transformer-based architectures have been shown to outperform convolutional neural networks in computer vision applications. In image compression, combining transformers with convolutions has also been explored; it fuses feature information more effectively and reduces the entropy of the latent features, enabling more efficient compression. The literature further shows that autoregressive models help remove redundant information between latent features, and we believe they can improve video compression performance; however, their decoding process must follow a strict Z-scan order, so the convolutions cannot be parallelized and decoding is very inefficient. This study therefore explores how to apply Neural-Transformers and autoregressive models in a deep video compression system to further improve the efficiency of inter-frame prediction and inter-frame coding. Experimental results show that the proposed method outperforms H.265 and most deep coding frameworks, while achieving fast encoding and decoding speeds.
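The Z-scan bottleneck described above can be illustrated with a small sketch. This is a pure-Python toy, not the thesis's model: the 3x3 causal mask and the averaging predictor are illustrative assumptions standing in for a PixelCNN-style masked convolution, but the control flow shows why each symbol must wait for all previously scanned symbols.

```python
def causal_context(latent, y, x):
    """Gather already-decoded neighbors under a causal 3x3 mask:
    positions strictly above, or strictly to the left in the same row.
    (Illustrative mask; the real model's receptive field differs.)"""
    ctx = []
    for dy in (-1, 0):
        for dx in (-1, 0, 1):
            if dy == 0 and dx >= 0:
                continue  # current and future positions are masked out
            yy, xx = y + dy, x + dx
            if 0 <= yy < len(latent) and 0 <= xx < len(latent[0]):
                ctx.append(latent[yy][xx])
    return ctx

def decode_z_scan(h, w, residuals):
    """Decode latents in strict Z-scan (raster) order. Each symbol's
    prediction depends on symbols decoded earlier in the scan, so the
    loop cannot be parallelized across positions."""
    latent = [[0.0] * w for _ in range(h)]
    order = []
    for y in range(h):           # rows top to bottom
        for x in range(w):       # columns left to right -> Z-scan
            ctx = causal_context(latent, y, x)
            pred = sum(ctx) / len(ctx) if ctx else 0.0  # toy AR predictor
            latent[y][x] = pred + residuals[y * w + x]  # reconstruct symbol
            order.append((y, x))
    return latent, order

latent, order = decode_z_scan(2, 3, [1, 2, 3, 4, 5, 6])
# order visits (0,0),(0,1),(0,2),(1,0),(1,1),(1,2) -- strictly serial.
```

Because `latent[y][x]` is an input to every later prediction, an h-by-w latent map needs h*w sequential steps, which is the decoding inefficiency the thesis sets out to mitigate.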
With advances in multimedia communication technologies and rising quality-of-service requirements, media streaming now consumes most of the internet bandwidth. How to compress video signals effectively to reduce the bandwidth load has therefore become important. Current video coding standards, such as High Efficiency Video Coding (H.265/HEVC) and Versatile Video Coding (H.266/VVC), adopt a block-based hybrid coding framework in which most coding modules are designed from professional human experience. In an end-to-end deep video compression system, these modules are instead designed as deep-learning models to further improve compression efficiency. For example, the discrete cosine transform can be replaced by a convolutional neural network (CNN) that transforms video signals into latent variables through a higher-dimensional inference process, and entropy coding can be improved by employing a learned hyperprior. In recent years, transformer-based architectures have been shown to achieve even better results than convolutional neural networks in computer vision. In addition, combining Neural-Transformers with convolutions can effectively fuse feature information, reduce the entropy of latent features, and outperform CNNs in image compression. The literature shows that an autoregressive (AR) model helps remove redundancy between extracted latent features; however, its decoding must follow a strict Z-scan order, so the convolutions cannot be computed in parallel and decoding is very inefficient. This research therefore explores how to utilize Neural-Transformer and AR models to further improve the efficiency of inter-frame prediction and inter-frame coding in deep video compression systems. Experiments show that the proposed method outperforms H.265 and most recent deep video compression methods while achieving faster encoding and decoding speeds.
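The hyperprior mentioned above improves entropy coding by predicting, from side information, a probability model for each quantized latent; the bit cost is then the negative log-probability of the quantization bin. The sketch below shows that accounting with a Gaussian scale model in pure Python; the means and scales are made-up illustrative values, not outputs of the thesis's network.

```python
import math

def gaussian_cdf(x, mu, sigma):
    """CDF of a Gaussian via the error function (no external libraries)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def bits_for_latents(latents, mus, sigmas):
    """Estimated bit cost of integer-quantized latents under a Gaussian
    model whose (mu, sigma) would come from the hyperprior decoder.
    Each symbol y costs -log2 P(y - 0.5 < Y < y + 0.5)."""
    total = 0.0
    for y, mu, s in zip(latents, mus, sigmas):
        p = gaussian_cdf(y + 0.5, mu, s) - gaussian_cdf(y - 0.5, mu, s)
        total += -math.log2(max(p, 1e-9))  # clamp to avoid log(0)
    return total

# A well-predicted latent (mean near the symbol, small scale) is cheap to
# code; a poorly predicted one costs many more bits:
cheap = bits_for_latents([0], [0.0], [0.3])
costly = bits_for_latents([3], [0.0], [0.3])
```

This is why sharper conditional models (hyperprior, AR context, transformer-fused features) translate directly into rate savings: they shrink the effective sigma around each symbol.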