
Student: Yi-Xiang Qiu (邱奕翔)
Thesis Title: Using Neural-Transformer and Context-based Entropy Modeling for Efficient Deep Video Compression (運用類神經變換器和前文熵編模型以提升深度視訊編碼效率)
Advisor: Jiann-Jone Chen (陳建中)
Committee Members: Hsueh-Ming Hang (杭學鳴), Tien-Ying Kuo (郭天穎), Yi-Leh Wu (吳怡樂), Jiann-Jone Chen (陳建中)
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Publication Year: 2022
Graduation Academic Year: 110 (2021–2022)
Language: Chinese
Pages: 56
Keywords: Deep Video Compression, Deep Inter-frame Prediction and Coding, High Efficiency Video Coding (H.265/HEVC), Deep Learning, Neural-Transformer, Autoregressive Model
    With the rise of communication technologies and multimedia applications, together with growing demand for higher-quality video services, network video traffic has increased sharply, making effective video compression one of the most critical techniques in multimedia signal processing. Current international video coding standards, such as High Efficiency Video Coding (H.265/HEVC) and Versatile Video Coding (H.266/VVC), mainly use a block-based hybrid coding framework whose coding modules have all been hand-designed and incrementally refined. Research has since shifted toward deep-learning methods for building end-to-end optimized video compression systems, for example replacing transform coding with convolutional neural networks, or strengthening the entropy coding module with a deep-learned hyperprior model. In recent computer vision applications, Neural-Transformer-based architectures have been shown to achieve better results than convolutional neural networks. In image compression, combining a Neural-Transformer with convolution has been found to fuse feature information more effectively and lower the entropy of the latent features, allowing them to be compressed more efficiently. The literature also shows that an autoregressive model helps remove redundant information among latent features, and we believe it can be applied to video compression to improve performance; however, its decoding must follow a strict Z-scan order, so the convolutions cannot be parallelized and decoding is very inefficient. This study therefore investigates how to apply the Neural-Transformer and the autoregressive model in a deep video compression system to further improve the efficiency of inter-frame prediction and inter-frame coding. Experimental results show that the proposed method outperforms H.265 and most deep coding frameworks while offering fast encoding and decoding.


    With advances in multimedia communication technologies and rising quality-of-service requirements, media streaming now consumes most internet bandwidth. Compressing video signals effectively to reduce this bandwidth load has therefore become essential. Current video coding standards, such as High Efficiency Video Coding (H.265/HEVC) and Versatile Video Coding (H.266/VVC), adopt a block-based hybrid coding framework in which most coding modules are hand-designed from professional experience. In end-to-end deep video compression systems, these modules are instead realized with deep-learning models to further improve compression efficiency. For example, the discrete cosine transform can be replaced by a convolutional neural network (CNN) that maps video signals into latent variables through a higher-dimensional inference process, and entropy coding can be strengthened with a deep-learned hyperprior model. In recent years, transformer-based architectures have been shown to outperform CNNs in computer vision. In image compression, combining a Neural-Transformer with convolution fuses feature information more effectively, reduces the entropy of the latent features, and outperforms purely convolutional designs. The literature also shows that an autoregressive (AR) context model helps remove redundancy among the extracted latent features; however, decoding must follow a strict Z-scan order, so the convolutions cannot be computed in parallel and decoding is very slow. This research therefore explores how to apply Neural-Transformer and AR models to further improve inter-frame prediction and inter-frame coding efficiency in deep video compression systems. Experiments show that the proposed method outperforms H.265 and most recent deep video compression methods while achieving fast encoding and decoding.
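
    The "end-to-end optimized" training both abstracts refer to is typically the minimization of a rate-distortion objective of the following form. This is the generic formulation used across the learned-compression literature with a hyperprior entropy model; the symbols are illustrative and not necessarily the thesis's exact notation.

```latex
\mathcal{L}
  = \lambda \, D\!\left(x_t, \hat{x}_t\right)
  + \underbrace{\mathbb{E}\!\left[-\log_2 p\!\left(\hat{y}_t \mid \hat{z}_t\right)\right]}_{\text{rate of motion/residual latents}}
  + \underbrace{\mathbb{E}\!\left[-\log_2 p\!\left(\hat{z}_t\right)\right]}_{\text{rate of hyperprior side information}}
```

    Here $x_t$ is the current frame, $\hat{x}_t$ its reconstruction after motion compensation and residual decoding, $D$ a distortion measure such as MSE (for PSNR) or $1-\text{MS-SSIM}$, $\hat{y}_t$ the quantized latents, $\hat{z}_t$ the hyperprior, and $\lambda$ sets the rate-distortion trade-off.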
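    To make the "Neural-Transformer combined with convolution" idea concrete, the following is a minimal PyTorch sketch of a block that mixes a local convolutional branch with global multi-head self-attention over spatial tokens. All names and sizes are illustrative assumptions rather than the thesis's architecture; Swin-style transforms (covered in Chapter 3) restrict attention to shifted local windows, whereas this sketch attends globally only to stay short.

```python
# Minimal sketch of a conv + self-attention block: convolution captures local
# detail, attention fuses long-range context. Illustrative only.
import torch
import torch.nn as nn

class ConvAttnBlock(nn.Module):
    def __init__(self, channels: int, heads: int = 4):
        super().__init__()
        self.conv = nn.Sequential(                  # local feature branch
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        x = x + self.conv(x)                        # residual local mixing
        t = self.norm(x.flatten(2).transpose(1, 2)) # (B, H*W, C) token view
        a, _ = self.attn(t, t, t)                   # global spatial attention
        return x + a.transpose(1, 2).reshape(b, c, h, w)

x = torch.randn(1, 32, 16, 16)                      # toy latent feature map
print(ConvAttnBlock(32)(x).shape)                   # torch.Size([1, 32, 16, 16])
```

    In practical transforms, windowed (Swin-style) attention replaces the global attention above to keep the cost manageable at image resolutions.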
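    The decoding-speed argument against the AR context model is easiest to see in code. The toy PyTorch sketch below (hypothetical names, toy sizes, entropy decoding elided to a comment) contrasts the H×W sequential steps forced by Z-scan AR decoding with the two parallel passes of a checkerboard context model of the kind discussed in Chapter 4.

```python
import torch
import torch.nn as nn

C, H, W = 8, 4, 4                       # toy latent: C channels, H x W positions
# A real AR model uses a causally *masked* convolution; masking is omitted
# here because we only count sequential decoding steps.
ctx = nn.Conv2d(C, 2 * C, kernel_size=5, padding=2)  # predicts (mean, scale)

def ar_decode(y_hat: torch.Tensor) -> int:
    """Serial AR decode: position (i, j) needs its already-decoded causal
    neighbors, so positions must be visited one by one in Z-scan order."""
    steps = 0
    for i in range(H):
        for j in range(W):
            params = ctx(y_hat)[..., i, j]   # context parameters at (i, j)
            # ...entropy-decode y_hat[..., i, j] from params here...
            steps += 1
    return steps                              # H * W sequential steps

def checkerboard_decode(y_hat: torch.Tensor) -> int:
    """Checkerboard decode: 'anchor' positions (one checker color) depend only
    on the hyperprior and decode in parallel; the remaining positions then
    condition on the decoded anchors, so two parallel passes suffice."""
    anchor = torch.zeros(1, 1, H, W)
    anchor[..., 0::2, 0::2] = 1
    anchor[..., 1::2, 1::2] = 1
    # pass 1: decode all anchors at once (context from the hyperprior only)
    # pass 2: one context evaluation over the decoded anchors yields the
    #         parameters for every non-anchor position simultaneously
    params = ctx(y_hat * anchor)
    return 2                                  # two sequential passes total

y_hat = torch.zeros(1, C, H, W)
print(ar_decode(y_hat), "sequential steps vs", checkerboard_decode(y_hat))
```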

    Abstract (Chinese); Abstract (English); Table of Contents; List of Figures; List of Tables
    Chapter 1 Introduction: 1.1 Research Motivation and Objectives; 1.2 Problem Description and Research Methods; 1.3 Thesis Organization
    Chapter 2 Background: 2.1 Deep Video Compression Model Architectures; 2.1.1 Introduction to Traditional Image and Video Compression; 2.1.2 Classic Architectures for Traditional Video Compression; 2.1.3 Deep-Learning-Based Video Compression Architectures; 2.2 Deep Learning Fundamentals; 2.2.1 Artificial Neural Networks (ANN); 2.2.2 Convolutional Neural Networks (CNN); 2.2.3 Convolution; 2.2.4 Transposed Convolution; 2.2.5 Activation Functions; 2.2.5.1 Sigmoid; 2.2.5.2 Tanh; 2.2.5.3 ReLU; 2.2.5.4 LReLU; 2.2.6 Pooling; 2.3 Image Quality Metrics; 2.3.1 Peak Signal-to-Noise Ratio (PSNR); 2.3.2 Structural Similarity (SSIM)
    Chapter 3 Model Overview: 3.1 Vision Transformer; 3.1.1 Multi-Head Attention; 3.2 Swin Transformer
    Chapter 4 End-to-End Deep Video Compression System: 4.1 Related Work; 4.1.1 Transformer-Based Image Compression; 4.1.2 Checkerboard Context Model Acceleration; 4.1.2.1 One-Pass Encoding of Latents; 4.1.2.2 Two-Pass Decoding of Latents; 4.2 Network Design for Deep Video Compression; 4.2.1 Motion Estimation Network; 4.2.2 Motion Encoder and Decoder Networks; 4.2.3 Motion Compensation Network; 4.2.4 Residual Encoder and Decoder Networks; 4.3 Training Strategy; 4.3.1 Loss Function; 4.3.2 Quantization
    Chapter 5 Experimental Results and Discussion: 5.1 Experimental Setup; 5.2 Analysis of Results; 5.2.1 Ablation Study and Analysis; 5.2.2 Visualization of Results; 5.2.3 Encoding and Decoding Speed Comparison
    Chapter 6 Conclusions and Future Work: 6.1 Conclusions; 6.2 Future Work
    References


    Full-text release date: 2024/08/21 (campus network)
    Full-text release date: 2024/08/21 (off-campus network)
    Full-text release date: 2024/08/21 (National Central Library: Taiwan NDLTD system)