
Graduate student: 陳弈璁
Yi-cong Chen
Thesis title: 應用多特徵與正規化於統計式未知詞萃取之研究
Combination of Multiple Feature and Normalization for Statistical Unknown Word Extraction
Advisor: 林伯慎
Bor-shen Lin
Oral defense committee: 羅乃維
Nai-wei Lo
古鴻炎
Huan-yan Gu
Degree: Master
Department: College of Management - Department of Information Management
Year of publication: 2010
Academic year of graduation: 98
Language: Chinese
Number of pages: 56
Keywords (Chinese): 未知詞, 統計方法, 類神經網路, 詞彙萃取, 特徵值分佈正規化
Keywords (English): Unknown Word, Statistical Based Method, Multilayer Perceptron, Chinese Word Extraction, Distribution Normalization

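The structure of word formation is highly complex. As information spreads ever faster, new things and new concepts keep emerging and ways of life keep changing, so new words appear at a rapid rate and new word-formation patterns arise continually. This thesis therefore takes a statistical machine-learning approach: statistical features that capture different lexical properties are combined to train a word classifier, which is then used for word verification. Natural language applications, however, span a very wide range of fields, and the corpora they use differ in domain and size. A statistics-based method therefore cannot ignore the mismatch between the feature distributions of the training corpus and the test corpus.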
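This thesis proposes two novel statistical features: linkage strength, which carries information about the internal structure of a candidate word, and prefix separation, which measures how far the candidate's prefix is separated from its substructure. These two features have different properties from Description Length Gain and Accessor Variety, so combining them yields a strong complementary effect. In addition, Histogram Equalization is proposed to further normalize the Description Length Gain feature values so that the feature distributions of the test set and the training set match. This addresses the shifts in feature range and distribution caused by differences in corpus size or domain, and makes the word extraction method more general: the extraction model does not have to be retrained to process cross-domain data.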
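We tested the method on the Traditional Chinese corpora of SIGHAN2. Word verification performance is best when all four features are combined and the feature distribution is normalized. On the corpora provided by the CKIP group of the Institute of Information Science, Academia Sinica, and by City University of Hong Kong, the F-Measure reaches 68.43% and 71.40%, respectively. Finally, we applied the method to extracting unknown words from a novel domain. We found that the unknown words it extracts are complementary to those extracted by two word segmentation systems: the method extracts unknown words that have strong statistical word characteristics but are hard to extract by semantic means, such as 「海角7號」 (Cape No. 7) and 「金融海嘯」 (financial tsunami). It is comparatively weak, however, at extracting unknown words that are personal names or place names.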


As information spreads ever faster, new things and new concepts are constantly generated and the way people live keeps evolving, so new word formations emerge endlessly. This thesis therefore uses machine learning to train a classifier that combines different statistical features for Chinese unknown word extraction. However, natural language applications cover a broad range of fields, and the corpora they use differ in size and domain. A statistics-based approach must therefore address the mismatch between the feature distributions of the training set and the testing set.
This thesis presents two novel statistical features. One carries linkage information about the internal structure of a word candidate; the other measures the degree of separation between the prefix and the substring of a word candidate. Both are complementary to the description length gain and the accessor variety when combined with them for word extraction. Furthermore, normalization of the description length gain by histogram equalization is proposed. It matches the feature distributions of the training and testing corpora when they differ in size or domain. This makes the scheme more general, because the word extraction model does not have to be retrained to process data from a different domain.
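Of the four features, accessor variety is the simplest to illustrate: a string that behaves like a word tends to be preceded and followed by many different characters. The sketch below takes the smaller of the two distinct-neighbor counts. It is not the thesis's implementation; in particular it collapses every sentence boundary into one shared marker, which is a simplifying assumption, and the toy sentences are invented.

```python
def accessor_variety(sentences, candidate):
    """Return min(#distinct left neighbors, #distinct right neighbors).

    Assumption: each sentence start is represented by the single marker
    "<s>" and each sentence end by "</s>", so boundaries count at most
    once per side.
    """
    if not candidate:
        raise ValueError("candidate must be a non-empty string")
    left, right = set(), set()
    size = len(candidate)
    for sentence in sentences:
        start = sentence.find(candidate)
        while start != -1:
            end = start + size
            left.add(sentence[start - 1] if start > 0 else "<s>")
            right.add(sentence[end] if end < len(sentence) else "</s>")
            # Continue from the next position to catch overlapping matches.
            start = sentence.find(candidate, start + 1)
    return min(len(left), len(right))


# Invented toy corpus, one sentence per string.
toy_sentences = ["金融海嘯來了", "面對金融海嘯", "這次金融海嘯很大", "金融危機"]

av_word = accessor_variety(toy_sentences, "金融海嘯")
av_fragment = accessor_variety(toy_sentences, "融海")
print(av_word, av_fragment)
```

In this toy corpus the full string 金融海嘯 appears in three different contexts on each side, while the fragment 融海 is always preceded by 金 and followed by 嘯, so the fragment scores lower.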
The scheme was tested on the corpora provided by SIGHAN2. The best result was obtained when all four features were combined and the feature distribution was normalized: F-Measures of 68.43% and 71.40% for the CKIP corpus and the HKCU corpus, respectively. Finally, we applied the scheme to extract unknown words from a novel-domain corpus. We found that the unknown words it extracts are complementary to those extracted by two Chinese word segmentation systems. The method can extract unknown words that have strong statistical characteristics but are difficult to extract through semantic characteristics, such as “Cape No7” (海角七號) and “Financial Tsunami” (金融海嘯). However, it was less capable of extracting personal names and place names.
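For readers unfamiliar with the reported metric, the following sketch computes an F-Measure from a set of extracted words and a set of reference words. It assumes the balanced form (the harmonic mean of precision and recall); the abstract does not state the weighting, and the word sets are invented, not taken from the thesis.

```python
def f_measure(predicted, gold):
    """Balanced F-Measure (harmonic mean of precision and recall).

    Assumption: equal weighting of precision and recall.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Invented example: two correct words, two spurious fragments,
# and one reference word (a place name) that was missed.
extracted = {"金融海嘯", "海角7號", "融海", "的金"}
reference = {"金融海嘯", "海角7號", "臺北"}

score = f_measure(extracted, reference)
print(f"{score:.4f}")
```

Here precision is 2/4 and recall is 2/3, giving an F-Measure of 4/7.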

Chapter 1 Introduction
1.1 Research Motivation
1.2 Background
1.3 Thesis Objectives and Summary of Results
1.4 Thesis Organization and Structure
Chapter 2 Literature and Background Techniques
2.1 Candidate Word Filtering Methods
2.2 Word Verification
2.3 Multilayer Backpropagation Neural Network Classifier
2.4 Chapter Summary
Chapter 3 Word Extraction
3.1 Word Extraction Workflow
3.2 Generation of Classification Data
3.3 Verification Performance Analysis of Statistical Feature Combinations
3.4 Chapter Summary
Chapter 4 Normalization Methods for Statistical Features
4.1 The Problem of Statistical Feature Distributions
4.2 Introduction to Feature Normalization Methods
4.3 Word Verification Performance Analysis of Different Normalization Methods
4.4 Chapter Summary
Chapter 5 Application to Extraction in Novel Domains
5.1 Word Extraction Workflow for Novel Domains
5.2 Word Labeling Method
5.3 Analysis of Unknown Word Extraction in Novel Domains
5.4 Chapter Summary
Chapter 6 Conclusions and Future Research Directions
6.1 Conclusions
6.2 Future Research Directions

