簡易檢索 / 詳目顯示

研究生: 洪麗玲
Li-ling Hung
論文名稱: 在多重電腦模式及電路模式的有效率平行前置演算法
Efficient Parallel Prefix Algorithms on the Multicomputer and Circuit Models
指導教授: 林彥君
Yen-Chun Lin
口試委員: 黃為德
none
蔡尚榮
none
王勝德
none
陳健輝
none
廖弘源
none
林順喜
none
鍾國亮
none
學位類別: 博士
Doctor
系所名稱: 電資學院 - 資訊工程系
Department of Computer Science and Information Engineering
論文出版年: 2008
畢業學年度: 96
語文別: 英文
論文頁數: 92
中文關鍵詞: 組合電路計算有效率的平行前置演算法成本最佳深度及節點數最佳訊息傳遞的多重電腦平行演算法前置問題與問題大小無關腰部及節點數最佳
外文關鍵詞: combinational circuit, computation-efficient parallel prefix, cost optimality, depth-size optimal, message-passing multicomputer, parallel algorithms, prefix problem, problem-size-independent, waist-size optimal
相關次數: 點閱:266下載:1
分享至:
查詢本校圖書館目錄 查詢臺灣博碩士論文知識加值系統 勘誤回報

前置運算是對n個輸入x1, x2,…, xn進行具結合性的二元運算♁計算求得n個前置值x1 ♁ x2 ♁…♁ xi, 1 ≤ i ≤ n。前置運算已被廣泛應用在很多領域,且被視為一種基本運算。本論文分別在多重電腦(multicomputer)模式及組合電路(combinational circuit)模式提出有效率的平行前置演算法。多重電腦使用分散式記憶體(distributed memory)架構,利用訊息傳遞(message passing)進行資料的交換,其執行時間包含計算時間及通訊時間。我們提出四群計算有效率的演算法。前兩群演算法乃改善既有之演算法,減少其執行時間;另外兩群則使用更強的通訊模式,分別縮短前兩群演算法的通訊時間。在論文中也提出可有效使用此四群演算法應有的先決條件。這些演算法提供使用者彈性,可視軟硬體設備等級來選擇適當演算法以達到最短的執行時間。
前置電路是組合電路的前置演算法。對任一寬度(即輸入埠個數)為m的前置電路A,其深度d(A)與運算節點數s(A)有d(A) + s(A) ≥ 2m – 2的關係。因此,若d(A) + s(A) = 2m – 2則稱A為深度及節點數最佳電路。較小的深度代表執行時間較快,運算節點數少代表所需的成本及消耗功率低。若前置電路A的腰部(waist)為w(A),則有w(A) + s(A) ≥ 2m – 2的關係。若w(A) + s(A) = 2m – 2且w(A) = 1,則稱A為WSO-1電路。另外,節點的扇出(fan-out)等於節點的出度(out degree),前置電路的扇出為電路中所有節點的最大扇出。由於節點的扇出愈大,除了功率損耗愈大外,所需要的積體電路面積愈大,也會使輸出的速度變慢。因此,實用上,前置電路的扇出要愈小愈好。本論文所提出的前置電路扇出均為2。
本論文提出一群WSO-1平行前置電路,這群電路不但可以組成深度小的深度及節點數最佳電路,也是在輸入資料個數大於電路寬度時較快的電路。這群電路與其他的電路比較,當輸入個數比電路寬度大越多時,速度越快。而且,當輸入資料個數大於或等於電路寬度2倍時,這群新電路比所有電路都要快。本論文也建構一群深度及節點數最佳電路;與另外兩群深度及節點數最佳電路的建造方法相較,我們建造新電路的方法是最直接且最適合自動化。電路結構平衡也是新電路的一個優勢。


Given n inputs x1, x2,…, xn and an associative binary operator ♁, the prefix problem, or prefix computation, is to compute yi = x1 ♁ x2 ♁…♁ xi, for 1 ≤ i ≤ n. Prefix computation is used in various areas and is considered as a primitive operation. This thesis presents efficient algorithms for both multicomputer and combinational circuit models. Four families of computation-efficient parallel prefix algorithms for the multicomputer are given. The first two families generalize previous algorithms that use only half-duplex communications, and can improve the running time. The third and fourth families adopt collective communication operations to improve the communication time of the first two, respectively. The precondition of all the presented algorithms is also derived. The families of algorithms each provide the flexibility of choosing either less computation time or less communication time to achieve the minimal running time depending on the ratio of the time required by a communication step to the time required by a computation step.
Parallel prefix circuits are parallel prefix algorithms on the combinational circuit model. The depth of a prefix circuit A, d(A), is a measure of its processing time; smaller depth implies faster computation. The size of a prefix circuit A, s(A), is the number of operation nodes in it; smaller size implies less power consumption, less VLSI area, and less cost. For any prefix circuit A of width m, which is the number of inputs, it has been shown that d(A) + s(A) ≥ 2m – 2; thus, A is depth-size optimal if d(A) + s(A) = 2m – 2. With the definition w(A), which is called waist of A, for any prefix circuit A of width m, it has been known that s(A) + w(A) ≥ 2m – 2. If s(A) + w(A) = 2m – 2 and w(A) = 1, then A is said to be waist-size optimal with waist 1 (WSO-1). A circuit with a smaller fan-out is in general faster and occupies less VLSI area. To be of practical use, the depth and fan-out of a prefix circuit should be small. This thesis presents prefix circuits with the smallest fan-out 2.
In this thesis, a family of parallel prefix circuits is presented. These prefix circuits are WSO-1. They are not only building blocks for constructing fast depth-size optimal prefix circuits, but also themselves fast problem-size-independent prefix circuits. When the problem size is greater than the circuit width, a new prefix circuit may be well faster than any other prefix circuits of the same width, especially when the problem size is greater than or equal to twice the circuit width. The new prefix circuits are compared analytically with other representative ones circuits to show how fast they are. They have the minimum depth and are the fastest among all WSO-1 prefix circuits of the same width and fan-out. They are also better building blocks than other circuits for constructing fast depth-size optimal prefix circuits. A family of depth-size optimal, parallel prefix circuits with fan-out 2 is also presented. It is easier to construct and more amenable to automatic synthesis than two other families, all sharing the same minimum depth. The balanced structure of the new family is also a merit.

摘要i Abstractiii 1.Introduction1 1.1. Prefix computation1 1.1.1. Multicomputer model2 1.1.2. Prefix circuit model3 1.2. Related work5 1.2.1. Multicomputer algorithms5 1.2.2. Prefix circuits5 1.3. Contributions8 1.4. Overview8 2.Previous prefix algorithms on the multicomputer10 3.Four new families of parallel prefix algorithms12 3.1. Algorithm A12 3.2. Algorithm B16 3.3. Algorithms E and F17 4.Analysis of new algorithms18 4.1. Computation time of the algorithms18 4.2. Communication time of Algorithm A21 4.3. Communication time of Algorithm B24 4.4. Communication time of Algorithms E and F27 4.5. Precondition of algorithms29 5.Comparisons of related algorithms32 6.Previous results on prefix circuits38 7.New parallel prefix circuits with fan-out 244 7.1. H: a family of waist-size optimal prefix circuits44 7.2. HI and HD: Two families of waist-size optimal prefix circuits51 7.3. C2: A family of depth-size optimal prefix circuits52 8.Comparisons of prefix circuits57 8.1. Problem-size-independent prefix circuits57 8.2. Depth-size optimal prefix circuits62 9.Conclusions and future work64 Appendix A. Lemma A.1 66 Appendix B. Proof of Theorem 1 67 References 71

[1]S. G. Akl, Parallel Computation: Models and Methods, Prentice-Hall, 1997.
[2]S. Aluru, N. Futamura, and K. Mehrotra, Parallel biological sequence comparison using prefix computations, Journal of Parallel and Distributed Computing, Vol. 63, No. 3, 2003, pp. 264-272.
[3]A. Bilgory and D. D. Gajski, A heuristic for suffix solutions, IEEE Transactions on Computers, Vol. C-35, No. 1, 1986, pp. 34-42.
[4]G. E. Blelloch, Scans as primitive operations, IEEE Transactions on Computers, Vol. 38, No. 11, 1989, pp. 1526-1538.
[5]R. P. Brent and H. T. Kung, A regular layout for parallel adders, IEEE Transactions on Computers, Vol. C-31, No. 3, 1982, pp. 260-264.
[6]D. A. Carlson and B. Sugla, Limited width parallel prefix circuits, Journal of Supercomputing, Vol. 4, No. 2, 1990, pp. 107-129.
[7]L. Cinque and G. Bongiovanni, Parallel prefix computation on a pyramid computer, Pattern Recognition Letters, Vol. 16, No. 1, 1995, pp. 19-22.
[8]A. Datta, Multiple addition and prefix sum on a linear array with a reconfigurable pipelined bus system, Journal of Supercomputing, Vol. 29, No. 3, 2004, pp. 303-317.
[9]C. Efstathiou, H. T. Vergos, and D. Nikolos, Fast parallel-prefix modulo 2n + 1 adders, IEEE Transactions on Computers, Vol. 53, No. 9, 2004, pp. 1211-1216.
[10]O. Egecioglu and C. K. Koc, Parallel prefix computation with few processors, Computers and Mathematics with Applications, Vol. 24, No. 4, 1992, pp. 77-84.
[11]S. C. Eisenstat, O(log* n) algorithms on a Sum-CRCW PRAM, Computing, Vol. 79, No. 1, 2007, pp. 93-97.
[12]F. E. Fich, New bounds for parallel prefix circuits, in Proceedings 15th Symposium on the Theory of Computing, 1983, pp. 100-109.
[13]W. Gropp, E. Lusk, and A. Skjellum, Using MPI: Portable Parallel Programming with the Message-Passing Interface, MIT Press, 1994.
[14]T. Han and D. A. Carlson, Fast area-efficient VLSI adders, in Proceedings 8th Computer Arithmetic Symposium, Como, Italy, 1987, pp. 49-56.
[15]D. R. Helman and J. JaJa, Prefix computations on symmetric multiprocessors, Journal of Parallel and Distributed Computing, Vol. 61, 2001, pp. 265-278.
[16]L.-L. Hung and Y.-C. Lin, Parallel prefix algorithms on the multicomputer, WSEAS Transactions on Computer Research, Vol. 3, No. 4, 2008, pp. 229-239.
[17]Inmos, The Transputer Databook, 3rd ed., Inmos, 1992.
[18]P. M. Kogge and H. S. Stone, A parallel algorithm for the efficient solution of a general class of recurrence equations, IEEE Transactions on Computers, Vol. C-22, No. 8, 1973, pp. 783-791.
[19]D. W. Krumme, G. Cybenko, and K. N. Venkataraman, Gossiping in minimal time, SIAM Journal on Computing, Vol. 21, No. 1, 1992, pp. 111-139.
[20]C. P. Kruskal, L. Rudolph, and M. Snir, The power of parallel prefix, IEEE Transactions on Computers, Vol. C-34, 1985, pp. 965-968.
[21]R. E. Ladner and M. J. Fischer, Parallel prefix computation, Journal of the Association for Computing Machinery, Vol. 27, No. 4, 1980, pp. 831-838.
[22]S. Lakshmivarahan and S. K. Dhall, Parallel Computing Using the Prefix Problem, Oxford University Press, 1994.
[23]S. Lakshmivarahan, C. M. Yang, and S. K. Dhall, On a new class of optimal parallel prefix circuits with (size + depth) = 2n – 2 and log n ≤ depth ≤ (2 log n – 3), in Proceedings International Conference on Parallel Processing, St. Charles, IL, 1987, pp. 58-65.
[24]F. T. Leighton, Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes, Morgan Kaufmann, 1992.
[25]R. Lin, K. Nakano, S. Olariu, M. C. Pinotti, J. L. Schwing, and A. Y. Zomaya, Scalable hardware-algorithms for binary prefix sums, IEEE Transactions on Parallel and Distributed Systems, Vol. 11, No. 8, 2000, pp. 838-850.
[26]Y.-C. Lin, Optimal parallel prefix circuits with fan-out 2 and corresponding parallel algorithms, Neural, Parallel & Scientific Computations, Vol. 7, No. 1, 1999, pp. 33-42.
[27]Y.-C. Lin, A family of computation-efficient parallel prefix algorithms, WSEAS Transactions on Computers, Vol. 5, No. 12, 2006, pp. 3060-3066.
[28]Y.-C. Lin and J.-N. Chen, Z4: A new depth-size optimal parallel prefix circuit with small depth, Neural, Parallel & Scientific Computations, Vol. 11, No. 3, 2003, pp. 221-235.
[29]Y.-C. Lin and J.-W. Hsiao, A new approach to constructing optimal parallel prefix circuits with small depth, Journal of Parallel and Distributed Computing, Vol. 64, No. 1, 2004, pp. 97-107.
[30]Y.-C. Lin, Y.-H. Hsu, and C.-K. Liu, Constructing H4, a fast depth-size optimal parallel prefix circuit, Journal of Supercomputing, Vol. 24, No. 3, 2003, pp. 279-304.
[31]Y.-C. Lin and L.-L. Hung, Straightforward Construction of Depth-Size Optimal, Parallel Prefix Circuits with Fan-Out 2, Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan, Technical Report: NTUST-CSIE-07-02, July 2007.
[32]Y.-C. Lin and L.-L. Hung, Waist-size optimal parallel prefix circuits, Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan, Technical Report: NTUST-CSIE-07-01, June 2007.
[33]Y.-C. Lin and C. M. Lin, Efficient parallel prefix algorithms on multicomputers, Journal of Information Science and Engineering, Vol. 16, No. 1, 2000, pp. 41-64.
[34]Y.-C. Lin and C.-K. Liu, Finding optimal parallel prefix circuits with fan-out 2 in constant time, Information Processing Letters, Vol. 70, No. 4, 1999, pp. 191-195.
[35]Y.-C. Lin and C.-C. Shih, Optimal parallel prefix circuits with fan-out at most 4, in Proceedings 2nd IASTED International Conference on Parallel and Distributed Computing and Networks, Brisbane, Australia, 1998, pp. 312-317.
[36]Y.-C. Lin and C.-C. Shih, A new class of depth-size optimal parallel prefix circuits, Journal of Supercomputing, Vol. 14, No. 1, 1999, pp. 39-52.
[37]Y.-C. Lin and C.-Y. Su, Faster optimal parallel prefix circuits: New algorithmic construction, Journal of Parallel and Distributed Computing, Vol. 65, No. 12, 2005, pp. 1585-1595.
[38]Y.-C. Lin and C.-S. Yeh, Efficient parallel prefix algorithms on multiport message-passing systems, Information Processing Letters, Vol. 71, No. 2, 1999, pp. 91-95.
[39]Y.-C. Lin and C.-S. Yeh, Optimal parallel prefix on the postal model, Journal of Information Science and Engineering, Vol. 19, No. 1, 2003, pp. 75-83.
[40]J. Liu, S. Zhou, H. Zhu, and C.-K. Cheng, An algorithmic approach for generic parallel adders, in Proceedings International Conference on Computer-Aided Design, San Jose, CA, 2003, pp. 734-740.
[41]R. Manohar and J. A. Tierno, Asynchronous parallel prefix computation, IEEE Transactions on Computers, Vol. 47, No. 11, 1998, pp. 1244-1252.
[42]J. H. Park and H. K. Dai, Reconfigurable hardware solution to parallel prefix computation, Journal of Supercomputing, Vol. 43, No. 1, 2008, pp. 43-58.
[43]R. A. Patel, M. Benaissa, and S. Boussakta, Fast parallel-prefix architectures for modulo 2n – 1 addition with a single representation of zero, IEEE Transactions on Computers, Vol. 56, No. 11, 2007, pp. 1484-1492.
[44]E. E. Santos, Optimal and efficient algorithms for summing and prefix summing on parallel machines, Journal of Parallel and Distributed Computing, Vol. 62, 2002, pp. 517-543.
[45]M. Sheeran and I. Parberry, A New Approach to the Design of Optimal Parallel Prefix Circuits, Department of Computer Science and Engineering, Chalmers University of Technology, Goteborg, Sweden, Technical Report: 2006:1, 2006.
[46]M. Snir, Depth-size trade-offs for parallel prefix computation, Journal of Algorithms, Vol. 7, 1986, pp. 185-201.
[47]M. Snir, P. Hochschild, D. D. Frye, and K. J. Gildea, The communication software and parallel environment of the IBM SP2, IBM Systems Journal, Vol. 34, No. 2, 1995, pp. 205-221.
[48]Thinking Machines, Connection Machine Parallel Instruction Set (PARIS), Thinking Machines, 1986.
[49]H. Wang, A. Nicolau, and K. S. Siu, The strict time lower bound and optimal schedules for parallel prefix with resource constraints, IEEE Transactions on Computers, Vol. 45, No. 11, 1996, pp. 1257-1271.
[50]B. W.-Y. Wei, Synthesis and Optimization of VLSI Prefix Circuits, Ph.D. thesis, University of California, Berkeley, 1987.
[51]N. H. E. Weste and K. Eshraghian, Principles of CMOS VLSI Design: A System Perspective, 2nd ed., Addison-Wesley, 1993.
[52]Z. Xu and K. Hwang, Modeling communication overhead: MPI and MPL performance on the IBM SP2, IEEE Parallel & Distributed Technology, Vol. 4, No. 1, 1996, pp. 9-23.
[53]H. Zhu, C.-K. Cheng, and R. Graham, Constructing zero-deficiency parallel prefix circuits of minimum depth, ACM Transactions on Design Automation of Electronic Systems, Vol. 11, No. 2, 2006, pp. 387-409.
[54]R. Zimmermann, Binary Adder Architectures for Cell-Based VLSI and Their Synthesis, Ph.D. thesis, Swiss Federal Institute of Technology (ETH), Zurich, 1997.

QR CODE