簡易檢索 / 詳目顯示

研究生: 溫智旻
Chih-ming Wen
論文名稱: 使用詞彙式大字符集之文本資料壓縮
Text Compression Using a Word-Based Large Alphabet
指導教授: 古鴻炎
Hung-yan Gu
口試委員: 鍾國亮
Kuo-liang Chung
范欽雄
Chin-shyurng Fahn
陳建華
Chien-hua Chen
黃紹華
Shaw-hua Hwang
學位類別: 碩士
Master
系所名稱: 電資學院 - 資訊工程系
Department of Computer Science and Information Engineering
論文出版年: 2005
畢業學年度: 93
語文別: 中文
論文頁數: 96
中文關鍵詞: 文本資料壓縮大字符集詞彙式
外文關鍵詞: text compression, word-based, large alphabet
相關次數: 點閱:127下載:2
分享至:
查詢本校圖書館目錄 查詢臺灣博碩士論文知識加值系統 勘誤回報
  • 本論文研究基於詞彙式大字符集之文本資料壓縮方法,將中、英文檔案以詞彙為單位剖析出token,再對token以混合式預測模型或部分匹配預測模型來估算出現機率,接著以算術編碼對該機率編碼。我們也針對使用大字符集的預測模型結構,研究一些可以加快處理速度的方法。
    在把上述流程實作為實際可壓縮、解壓縮的程式後,對此程式作壓縮率的實驗,實驗結果顯示本論文實作的壓縮程式,對中、英文測試檔壓縮所得到的壓縮率,比同樣使用預測式編碼的壓縮程式PPMd、及以BWT為基礎的bzip2程式、及以LZ77為基礎的GZIP程式都還好。對中文檔案來說,我們的程式得到的平均壓縮率,比PPMd程式好1.12%,比bzip2程式好5.48%,比GZIP程式好17.02%。對英文檔案來說,我們的程式比PPMd好0.29%,比bzip2程式好2.04%,比GZIP好12.08%。


    In this thesis, some word-based large-alphabet text compression schemes are studied. After a word token is parsed from an English or Chinese text file, its occurrence probability is predicted with blended predictive models or with partially matched predictive model. Then, this probability is encoded in the module of arithmetic coding. In order to improve the speed of our compression schemes, we have also studied some data structures and their corresponding processing methods under the condition of large alphabet size.
    For the schemes studied here, we have implemented them as executable programs to practically compress and decompress typical text files. Performance comparison is made between our program and other text compression programs such as GZIP, bzip2, and PPMd. Experiment results show that our program have better compression rate than the programs mentioned for compressing both kinds of Chinese and English text files. In average, rate improvements are 17.02%, 5.48%, and 1.12% for Chinese text files, and 12.08%, 2.04%, 0.29% for English text files, respectively.

    第一章 緒論 1 1.1 資料壓縮的功用 1 1.2 壓縮方法之回顧 2 1.2.1 編碼(coding) 2 1.2.2 模式(modeling) 3 1.2.3 剖析 5 1.3 研究動機與方法 7 1.4 論文架構 10 第二章 模型之資料結構與算術編碼 11 2.1 大字符集簡介 11 2.2 基於大字符集之剖析 12 2.2.1 中文詞彙之剖析 13 2.2.2 英文詞彙之剖析 16 2.2.3 符號詞彙之剖析 17 2.3 大字符集與機率計數器的空間管理 19 2.3.1 雜湊表(hash table) 20 2.3.2 字符集資料結構 24 2.3.3 詞符集資料結構 27 2.4 算術編碼 30 2.4.1 算術編碼簡介 30 2.4.2 支援大字符集之算術編碼 31 2.5 支援累積之二元搜尋樹 33 2.5.1 支援累積之二元搜尋樹簡介 34 2.5.2 支援累積之二元搜尋樹之操作 35 第三章 一般的預測模型 39 3.1 基於部分匹配之預測模型 39 3.1.1 基於部分匹配之預測模型簡介 39 3.1.2 模型的階數 40 3.1.3 逃脫機率之計算 45 3.1.4 基於部分匹配預測模型實作 48 3.2 混合式模型 51 3.2.1 混合式模型簡介 51 3.2.2 混合式模型之權重計算 55 3.2.3 支援累積之二元搜尋樹之混合 57 第四章 區分語詞、非語詞之預測模型 64 4.1 區分語詞詞彙與非語詞詞彙 64 4.2 詞彙類型機率計算 65 4.3 基於部分匹配之預測模型 66 4.4 混合式模型 67 第五章 測試實驗與結果 69 5.1 雜湊表與二元搜尋樹效率實驗 71 5.1.1 雜湊表新增詞彙實驗 71 5.1.2 雜湊表與二元搜尋樹效率比較實驗 72 5.2 一般的預測模型實驗 73 5.2.1 基於部分匹配之預測模型實驗 74 5.2.2 混合式模型實驗 76 5.3 區分語詞、非語詞之預測模型實驗 79 5.3.1 基於部分匹配預測模型之實驗 79 5.3.2 混合式模型實驗 81 5.4 不同詞典之實驗 83 5.4.1 中文檔案實驗 83 5.4.2 英文檔案實驗 84 5.5 英文不同詞符集之比較 85 5.6 與其他壓縮法比較 87 第六章 結論 90 參考文獻 93

    [1] 7-ZIP, http://www.7-zip.org/.
    [2] Huffman, D.A., “A Method for the Construction of Minimum Redundancy Codes”, Proc. Inst. Electr. Radio Eng. 40, 9, pp.1098-1101(1952).
    [3] Rissanen, J.J., “Generalized Kraft Inequality and Arithmetic Coding”, IBM J. Res. Dev.20, pp.198-203(1976).
    [4] Shannon, C.E., “A Mathematical Theory of Communication”, The Bell System Technical Journal, Vol.27, pp.379-423, 623-656(1948).
    [5] Ziv, J., Lempel, A., “A universal Algorithm for Sequential Data Compression”, Information Theory, IEEE Transactions, Vol.23, Issue 3, pp.337-343(1977).
    [6] Ziv, J., Lempel, A., “Compression of individual sequences via variable-rate coding”, Information Theory, IEEE Transactions, Vol.24, Issue 5, pp.530-536(1978).
    [7] Welch, T.A., “A technique for high-performance data compression”, IEEE Computer, Vol.17, No.6, pp.8-19(1984).
    [8] Sayood, Khalid, Introduction to Data Compression, 2’nd ed., Morgan Kaufmann(2000).
    [9] Witten, I.H., Bell, T.C., “The zero-frequency problem: estimating the probabilities of novel events in adaptive text compression”, Information Theory, IEEE Transactions, Vol.37, No.4, pp. 1085-1094(1991).
    [10] Moffat, A., Neal, R., and Witten, I.H.,” Arithmetic Coding Revisited”, Data Compression Conference, DCC '95. Proceedings, pp.202-211(1995).
    [11] Phil Vines, Justin Zobel, “Compression Techniques for Chinese Text”, Software Practice and Experience, Vol.28, No.12, pp.1299-1314(1998).
    [12] Gu, H.Y., “A Large-Alphabet-Oriented Scheme for Chinese and English Text Compression”, Software-Practice and Experience, Vol.35, pp.1-13(2005).
    [13] J.G. Cleary, I.H. Witten, “Data Compression using Adaptive Coding and Partial String Matching”, IEEE Transactions on Communication 32 (1984).
    [14] Shkarin D., “PPM: One Step To Practicality”, Proceedings of the Data Compression Conference, UT. pp.202-211(2002).
    Available at: http://compression.graphicon.ru/ds/.
    [15] Horspool, R.N., Cormack, G.V., “Constructing Word-Based Text Compression Algorithms”, Data Compression Conference, DCC '92, pp.62-71(1992).
    [16] Kwok-Shing Cheng, Gilbert H. Young, and Kam-Fai Wong, "A Study on Word-Based and Integral-Bit Chinese Text Compression Algorithms", Journal of the American Society for Information Science, Vol.50, No.3, pp.218-228(1999).
    [17] 劉景民,基於大字符集柏洛菲勒轉換之中文文本資料壓縮方法,國立台灣科技大學電機工程研究所碩士論文,2004。
    [18] Microsoft Windows Codepage 950,
    http://www.microsoft.com/globaldev/reference/dbcs/950.mspx.
    [19] Visual C# Developer Center, “An Extensive Examination of Data Structures Part 2: The Queue, Stack, and Hashtable”, http://msdn.microsoft.com/vcsharp/default.aspx?pull=/library/en-us/dv_vstechart/html/datastructures_guide2.asp.
    [20] Mark Nelson, “Arithmetic Coding + Statistical Modeling = Data Compression”, http://www.dogma.net/markn/articles/arith/part1.htm.
    [21] P.M. Fenwick, “A New Data Structure for Cumulative Probability Tables”, Software-Practice and Experience, Vol.24, pp.327-336(1994)
    [22] A. Moffat, “Implementing the PPM Data Compression Scheme”, IEEE Transactions on Communications, COM-38, pp.1917-1921(1990)
    [23] T.C. Bell, J.G. Cleary, and I.H. Witten, Text Compression, Prentice Hall Inc., Englewood Cliffs, New Jersey(1990)
    [24] GZIP source code, ftp://prep.ai.mit.edu/pub/gnu/gzip/
    [25] Bzip2 home page, http://www.bzip.org/

    QR CODE