
Graduate student: Ting-Yu Tsai (蔡婷羽)
Thesis title: The Algorithm and VLSI Architecture of Low-Latency and High-Throughput Tensor Decomposition Processor (低延遲與高處理速度張量分解演算法及電路架構設計)
Advisor: Chung-An Shen (沈中安)
Oral defense committee: Yuan-Hao Huang (黃元豪), Pei-Yun Tsai (蔡佩芸)
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Electronic and Computer Engineering
Year of publication: 2022
Academic year of graduation: 110 (ROC calendar)
Language: English
Number of pages: 63
Keywords (Chinese): 張量分解, 塔克分解, 高階正交迭代, 低延遲, 高吞吐量
Keywords (English): Tensor decomposition, Tucker decomposition, Higher-order orthogonal iteration, Low-latency, High-throughput

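The need to represent and process massive data is a common feature of important contemporary applications such as machine learning, image processing, and wireless communications. A tensor is a data structure for representing high-dimensional data, so its importance in massive-data applications is growing steadily. In tensor-related signal processing, tensor decomposition plays the most critical role; it is typically used to compress massive data and to extract key features from it. However, because tensors involve large data volumes and high computational complexity, designing a high-performance, low-complexity tensor decomposition processor is a major challenge. This thesis jointly studies the tensor decomposition algorithm and the integrated-circuit architecture in order to design and implement low-latency and high-throughput tensor decomposition processors. Specifically, it proposes a parallel higher-order orthogonal iteration tensor decomposition algorithm that reduces the data dependence in the decomposition. The proposed algorithm also lends itself to the architecture design of low-latency, high-throughput tensor decomposition processors. Processor architectures are designed based on both the conventional higher-order orthogonal iteration algorithm and the proposed parallel higher-order orthogonal iteration algorithm. In the architecture based on the conventional algorithm, we focus on improving the operation flow and designing the data flow to reduce the latency of the decomposition. In the architecture based on the proposed parallel algorithm, we design an integrated-circuit architecture that raises the utilization of the hardware components, achieving the highest degree of parallelism and the highest throughput with the fewest components. The experiments show that the proposed low-latency Tucker decomposition processor achieves a latency of 0.28086 ms with a complexity of about 11,626,999 logic gates, and that the proposed high-throughput tensor decomposition processor achieves a throughput of 5,670 tensor decompositions per second with a complexity of about 13,293,013 logic gates. Compared with the tensor decomposition processors in the most recent literature, the proposed architecture improves throughput by 60% and efficiency by 47%.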
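As a rough reading aid for the figures quoted in the abstract, the short Python sketch below converts them into comparable units. It assumes, purely for illustration, that the low-latency design would process one tensor at a time with no overlap between tensors; the abstract does not say so, and it does not define how "efficiency" is computed, so the 47% figure is not reproduced here. All variable names are hypothetical.

```python
# Figures quoted in the abstract.
latency_ms = 0.28086                 # low-latency processor, per decomposition
throughput_per_s = 5670              # high-throughput processor
gates_low_latency = 11_626_999
gates_high_throughput = 13_293_013

# Assumption (not from the thesis): one tensor at a time, no pipelining,
# so the achievable rate is simply the reciprocal of the latency.
serial_rate_per_s = 1000.0 / latency_ms

# Average time between completed decompositions at the quoted throughput.
interval_ms = 1000.0 / throughput_per_s

# Relative increase in gate count of the high-throughput design.
gate_overhead = gates_high_throughput / gates_low_latency - 1.0

print(f"serial rate at quoted latency: {serial_rate_per_s:.1f} decompositions/s")
print(f"interval at quoted throughput: {interval_ms:.4f} ms")
print(f"gate-count increase:           {gate_overhead:.1%}")
```

Under that assumption, the high-throughput design completes decompositions at a shorter average interval than the quoted single-decomposition latency, at the cost of a moderate increase in gate count.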


The tensor has become an essential data structure for representing high-dimensional signals in applications such as machine learning, image processing, and wireless communications. Among tensor-related signal processing operations, tensor decomposition plays an important role and is commonly used to compress large amounts of data and to extract critical features from them. However, the large data volume and the high computational complexity pose great challenges for both the computation and the storage of a tensor decomposition processor. This thesis aims to design and implement low-latency and high-throughput tensor decomposition processors through the joint design of the algorithm and the VLSI architecture. Specifically, this thesis presents a parallel higher-order orthogonal iteration (P-HOOI) algorithm. The proposed algorithm overcomes the high data dependence of the iterative operation and leads to a low-latency tensor decomposition data flow. Furthermore, this thesis presents two tensor decomposition processors. First, a low-latency tensor decomposition processor is designed and implemented based on the conventional higher-order orthogonal iteration (HOOI) algorithm. The experimental results show that this processor achieves a latency of 0.28086 ms with a complexity of approximately 11,626,999 gates. Second, a high-throughput and low-latency tensor decomposition processor is designed and implemented based on the proposed P-HOOI algorithm. The data processing flow of this processor is optimized to improve the utilization of each hardware component, so that the throughput is maximized with a manageable increase in hardware complexity. The proposed high-throughput processor achieves a throughput of 5,670 tensor decompositions per second with a complexity of about 13,293,013 gates. Compared with the state-of-the-art tensor decomposition processor, the proposed high-throughput processor achieves a 60% improvement in throughput and a 47% improvement in efficiency.
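For readers unfamiliar with the conventional HOOI algorithm named in the abstract, the following is a minimal NumPy sketch of textbook HOOI for Tucker decomposition. It is not the thesis's hardware data flow and not the proposed P-HOOI algorithm, whose details the abstract does not give. The function names (`unfold`, `mode_dot`, `hooi`, `reconstruct`), the HOSVD initialization, and the fixed iteration count are illustrative assumptions.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-n unfolding: bring axis `mode` to the front, flatten the rest."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def mode_dot(tensor, matrix, mode):
    """Mode-n product: apply `matrix` (J x I_n) along axis `mode`."""
    out = np.tensordot(matrix, tensor, axes=(1, mode))  # new axis is axis 0
    return np.moveaxis(out, 0, mode)

def hooi(tensor, ranks, n_iter=10):
    """Textbook HOOI; returns (core, factors) with orthonormal factors."""
    n_modes = tensor.ndim
    # Initialization (assumed here): truncated HOSVD of each unfolding.
    factors = []
    for n in range(n_modes):
        u, _, _ = np.linalg.svd(unfold(tensor, n), full_matrices=False)
        factors.append(u[:, :ranks[n]])
    for _ in range(n_iter):
        for n in range(n_modes):
            # Project the tensor onto every factor except mode n.
            y = tensor
            for m in range(n_modes):
                if m != n:
                    y = mode_dot(y, factors[m].T, m)
            u, _, _ = np.linalg.svd(unfold(y, n), full_matrices=False)
            # factors[n] is overwritten immediately, so the update for the
            # next mode depends on this result: the sequential data
            # dependence inside each sweep.
            factors[n] = u[:, :ranks[n]]
    core = tensor
    for n in range(n_modes):
        core = mode_dot(core, factors[n].T, n)
    return core, factors

def reconstruct(core, factors):
    """Rebuild the full tensor from a Tucker core and factor matrices."""
    out = core
    for n, u in enumerate(factors):
        out = mode_dot(out, u, n)
    return out
```

Each rank must not exceed the product of the other ranks, otherwise the truncated SVD cannot supply enough columns. The inner loop makes the mode-by-mode dependence visible: each factor update consumes the factors updated earlier in the same sweep.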

Abstract in Chinese
Abstract in English
Contents
List of Figures
List of Tables
List of Algorithms
1 Introduction
2 Background and Related Work
2.1 The Tensor and Tensor Operations
2.2 The Introduction of Tensor Decomposition
2.3 Related Work
3 The Proposed Algorithm and Simulation Results
3.1 The Proposed Parallel HOOI (P-HOOI) Algorithm
3.2 Simulation Results and Analyses
4 The Proposed Low-Latency Tucker Decomposition Architecture
4.1 Analysis of Operation Flow for Related Work
4.2 Proposed Low-Latency Tucker Decomposition Operation Process
4.3 The Requirements for Tensor Size
5 The Proposed High-Throughput Tucker Decomposition Architecture
5.1 Proposed High-Throughput Tucker Decomposition Operation Process
5.2 Architectural Overview of the High-Throughput Tucker Decomposition
5.3 The Requirements for Tensor Size
5.4 Fixed-Point Simulation Results
6 Experimental Results and Comparisons
6.1 Implementation Results
6.2 Comparison with Prior Designs
7 Conclusion
References


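Full-text release date: 2024/08/22 (campus network)
Full-text release date: 2024/08/22 (off-campus network)
Full-text release date: 2024/08/22 (National Central Library: Taiwan thesis and dissertation system)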