研究生: |
溫智旻 Chih-ming Wen |
---|---|
論文名稱: |
使用詞彙式大字符集之文本資料壓縮 Text Compression Using a Word-Based Large Alphabet |
指導教授: |
古鴻炎
Hung-yan Gu |
口試委員: |
鍾國亮
Kuo-liang Chung 范欽雄 Chin-shyurng Fahn 陳建華 Chien-hua Chen 黃紹華 Shaw-hua Hwang |
學位類別: |
碩士 Master |
系所名稱: |
電資學院 - 資訊工程系 Department of Computer Science and Information Engineering |
論文出版年: | 2005 |
畢業學年度: | 93 |
語文別: | 中文 |
論文頁數: | 96 |
中文關鍵詞: | 文本資料壓縮 、大字符集 、詞彙式 |
外文關鍵詞: | text compression, word-based, large alphabet |
相關次數: | 點閱:127 下載:2 |
分享至: |
查詢本校圖書館目錄 查詢臺灣博碩士論文知識加值系統 勘誤回報 |
本論文研究基於詞彙式大字符集之文本資料壓縮方法,將中、英文檔案以詞彙為單位剖析出token,再對token以混合式預測模型或部分匹配預測模型來估算出現機率,接著以算術編碼對該機率編碼。我們也針對使用大字符集的預測模型結構,研究一些可以加快處理速度的方法。
在把上述流程實作為實際可壓縮、解壓縮的程式後,對此程式作壓縮率的實驗,實驗結果顯示本論文實作的壓縮程式,對中、英文測試檔壓縮所得到的壓縮率,比同樣使用預測式編碼的壓縮程式PPMd、及以BWT為基礎的bzip2程式、及以LZ77為基礎的GZIP程式都還好。對中文檔案來說,我們的程式得到的平均壓縮率,比PPMd程式好1.12%,比bzip2程式好5.48%,比GZIP程式好17.02%。對英文檔案來說,我們的程式比PPMd好0.29%,比bzip2程式好2.04%,比GZIP好12.08%。
In this thesis, some word-based large-alphabet text compression schemes are studied. After a word token is parsed from an English or Chinese text file, its occurrence probability is predicted with blended predictive models or with partially matched predictive model. Then, this probability is encoded in the module of arithmetic coding. In order to improve the speed of our compression schemes, we have also studied some data structures and their corresponding processing methods under the condition of large alphabet size.
For the schemes studied here, we have implemented them as executable programs to practically compress and decompress typical text files. Performance comparison is made between our program and other text compression programs such as GZIP, bzip2, and PPMd. Experiment results show that our program have better compression rate than the programs mentioned for compressing both kinds of Chinese and English text files. In average, rate improvements are 17.02%, 5.48%, and 1.12% for Chinese text files, and 12.08%, 2.04%, 0.29% for English text files, respectively.
[1] 7-ZIP, http://www.7-zip.org/.
[2] Huffman, D.A., “A Method for the Construction of Minimum Redundancy Codes”, Proc. Inst. Electr. Radio Eng. 40, 9, pp.1098-1101(1952).
[3] Rissanen, J.J., “Generalized Kraft Inequality and Arithmetic Coding”, IBM J. Res. Dev.20, pp.198-203(1976).
[4] Shannon, C.E., “A Mathematical Theory of Communication”, The Bell System Technical Journal, Vol.27, pp.379-423, 623-656(1948).
[5] Ziv, J., Lempel, A., “A universal Algorithm for Sequential Data Compression”, Information Theory, IEEE Transactions, Vol.23, Issue 3, pp.337-343(1977).
[6] Ziv, J., Lempel, A., “Compression of individual sequences via variable-rate coding”, Information Theory, IEEE Transactions, Vol.24, Issue 5, pp.530-536(1978).
[7] Welch, T.A., “A technique for high-performance data compression”, IEEE Computer, Vol.17, No.6, pp.8-19(1984).
[8] Sayood, Khalid, Introduction to Data Compression, 2’nd ed., Morgan Kaufmann(2000).
[9] Witten, I.H., Bell, T.C., “The zero-frequency problem: estimating the probabilities of novel events in adaptive text compression”, Information Theory, IEEE Transactions, Vol.37, No.4, pp. 1085-1094(1991).
[10] Moffat, A., Neal, R., and Witten, I.H.,” Arithmetic Coding Revisited”, Data Compression Conference, DCC '95. Proceedings, pp.202-211(1995).
[11] Phil Vines, Justin Zobel, “Compression Techniques for Chinese Text”, Software Practice and Experience, Vol.28, No.12, pp.1299-1314(1998).
[12] Gu, H.Y., “A Large-Alphabet-Oriented Scheme for Chinese and English Text Compression”, Software-Practice and Experience, Vol.35, pp.1-13(2005).
[13] J.G. Cleary, I.H. Witten, “Data Compression using Adaptive Coding and Partial String Matching”, IEEE Transactions on Communication 32 (1984).
[14] Shkarin D., “PPM: One Step To Practicality”, Proceedings of the Data Compression Conference, UT. pp.202-211(2002).
Available at: http://compression.graphicon.ru/ds/.
[15] Horspool, R.N., Cormack, G.V., “Constructing Word-Based Text Compression Algorithms”, Data Compression Conference, DCC '92, pp.62-71(1992).
[16] Kwok-Shing Cheng, Gilbert H. Young, and Kam-Fai Wong, "A Study on Word-Based and Integral-Bit Chinese Text Compression Algorithms", Journal of the American Society for Information Science, Vol.50, No.3, pp.218-228(1999).
[17] 劉景民,基於大字符集柏洛菲勒轉換之中文文本資料壓縮方法,國立台灣科技大學電機工程研究所碩士論文,2004。
[18] Microsoft Windows Codepage 950,
http://www.microsoft.com/globaldev/reference/dbcs/950.mspx.
[19] Visual C# Developer Center, “An Extensive Examination of Data Structures Part 2: The Queue, Stack, and Hashtable”, http://msdn.microsoft.com/vcsharp/default.aspx?pull=/library/en-us/dv_vstechart/html/datastructures_guide2.asp.
[20] Mark Nelson, “Arithmetic Coding + Statistical Modeling = Data Compression”, http://www.dogma.net/markn/articles/arith/part1.htm.
[21] P.M. Fenwick, “A New Data Structure for Cumulative Probability Tables”, Software-Practice and Experience, Vol.24, pp.327-336(1994)
[22] A. Moffat, “Implementing the PPM Data Compression Scheme”, IEEE Transactions on Communications, COM-38, pp.1917-1921(1990)
[23] T.C. Bell, J.G. Cleary, and I.H. Witten, Text Compression, Prentice Hall Inc., Englewood Cliffs, New Jersey(1990)
[24] GZIP source code, ftp://prep.ai.mit.edu/pub/gnu/gzip/
[25] Bzip2 home page, http://www.bzip.org/