
Graduate student: 陳怡婷 (Yee-ting Chen)
Thesis title: 以事件詞彙鏈為基礎之多文件摘要 (Multi-Document Summarization Based on Event Lexical Chain)
Advisor: 徐俊傑 (Chun-chieh Hsu)
Oral defense committee: 王有禮 (Yue-li Wang), 王建民 (Chien-min Wang)
Degree: Master
Department: College of Management - Department of Information Management
Year of publication: 2006
Academic year of graduation: 94 (ROC calendar)
Language: Chinese
Number of pages: 82
Keywords (Chinese): 多文件摘要, 事件詞彙鏈, 事件特徵
Keywords (English): Multi-Document Summarization, Event Lexical Chain, Event Feature

With the spread of the Internet, people can obtain information more easily than ever before, but this has led to information overload and, in addition, to information redundancy. Online news is a typical example: news websites abound, and different sites may report the same event from different viewpoints. A user who wants the full picture of an event must spend a great deal of time searching and browsing related reports across these sites, and obtains the complete information only after collecting and reading all of them. Busy modern readers, however, do not have that much time. A mechanism that automatically produces a high-quality summary is therefore needed, so that people can gain as much knowledge as possible in the shortest time.
In this thesis, we propose a method called "Event Lexical Chain". The method adds event features, which describe the events in the documents, and a relation calculation that detects the relations between events and concept terms in the documents in order to form Event Lexical Chains. The importance of each Event Lexical Chain is judged objectively by a weighting calculation, and the important chains are selected to extract a summary that helps users grasp the key points of the documents. The experimental results show that basing multi-document summarization on Event Lexical Chains effectively improves the document summaries.
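The pipeline outlined in the abstract (form chains around event terms, weight the chains, extract sentences from the important ones) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration, since this record does not give the thesis's formulas: the relation score is a Dice coefficient over sentence-level co-occurrence, the chain weight is a plain sum of member-term frequencies, and the names `build_event_chains`, `chain_weight`, `summarize`, `min_relation`, and `top_chains` are hypothetical. Input sentences are assumed to be already segmented into tokens, and the event terms are assumed to be given.

```python
from collections import Counter
from itertools import combinations


def build_event_chains(sentences, event_terms, min_relation=0.5):
    """Link each event term to the concept terms it co-occurs with.

    Assumption: the relation score is a Dice coefficient over
    sentence-level co-occurrence, a stand-in for the thesis's own
    relation calculation, which this record does not specify.
    """
    term_count = Counter()  # number of sentences containing each term
    pair_count = Counter()  # number of sentences containing both terms
    for tokens in sentences:
        unique = sorted(set(tokens))
        term_count.update(unique)
        pair_count.update(combinations(unique, 2))

    chains = {}
    for event in event_terms:
        members = {event}
        for term in term_count:
            if term == event or term in event_terms:
                continue
            pair = tuple(sorted((event, term)))
            dice = 2 * pair_count[pair] / (term_count[event] + term_count[term])
            if dice >= min_relation:
                members.add(term)
        chains[event] = members
    return chains


def chain_weight(chain, sentences):
    """Assumed weighting: total occurrences of the chain's member terms."""
    return sum(1 for tokens in sentences for tok in tokens if tok in chain)


def summarize(sentences, event_terms, top_chains=2, min_relation=0.5):
    """Pick one sentence for each of the highest-weighted chains."""
    chains = build_event_chains(sentences, event_terms, min_relation)
    ranked = sorted(
        chains,
        key=lambda event: chain_weight(chains[event], sentences),
        reverse=True,
    )
    picked = []
    for event in ranked[:top_chains]:
        # Choose the sentence that covers the most members of this chain.
        best = max(
            range(len(sentences)),
            key=lambda i: len(chains[event] & set(sentences[i])),
        )
        if best not in picked:  # crude redundancy removal: no repeats
            picked.append(best)
    # Return the selected sentences in their original order.
    return [sentences[i] for i in sorted(picked)]
```

Calling `summarize(sentences, event_terms)` with a list of token lists and a list of event terms returns the selected token lists in document order. The sketch skips the preprocessing, event feature detection, and document clustering steps listed in the thesis outline, and its duplicate check is far simpler than a real redundancy-removal step.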

Chinese Abstract
English Abstract
Acknowledgments
List of Figures
List of Tables
Chapter 1 Introduction
  1.1 Research Background
  1.2 Research Motivation
  1.3 Research Objectives and Methods
  1.4 Expected Contributions
  1.5 Thesis Organization
Chapter 2 Literature Review
  2.1 Definition of a Summary
  2.2 Origins and Development of Automatic Summarization
  2.3 Categories of Automatic Summarization
    2.3.1 Classification by Number of Documents and Language
    2.3.2 Classification by Form of Output
    2.3.3 Classification by Function
    2.3.4 Classification by Reader Needs
  2.4 Summarization Methods
    2.4.1 Redundancy Removal Methods
    2.4.2 Shallow Approaches
    2.4.3 Deep Approaches
    2.4.4 Other Related Research and Literature
  2.5 Evaluation Methods
    2.5.1 Intrinsic Evaluation
    2.5.2 Extrinsic Evaluation
    2.5.3 Literature on Other Evaluation Methods
Chapter 3 Multi-Document Summarization Based on Event Lexical Chain
  3.1 System Architecture
  3.2 Document Preprocessing
    3.2.1 Chinese Word Segmentation
    3.2.2 Personal Name Recognition (Named-Entity Identification)
    3.2.3 Compound Word Detection
  3.3 Feature Selection and Document Clustering
    3.3.1 Part-of-Speech Filtering
    3.3.2 Term Frequency and Document Frequency Filtering (TF and DF Filtering)
    3.3.3 Event Feature Detection
    3.3.4 Document Clustering
  3.4 Event Lexical Chain
    3.4.1 Relation Calculation
    3.4.2 Event Concept
    3.4.3 Concept Term Chaining
    3.4.4 Event Lexical Chain Weighting
  3.5 Summary Extraction
    3.5.1 Extracting Event Concept Sentences (Sentence Extraction)
    3.5.2 Redundancy Removal
Chapter 4 Experimental Results and Analysis
  4.1 Data Set and Document Clustering Results
    4.1.1 Data Set
    4.1.2 Clustering Evaluation Method
    4.1.3 Effect of Feature Selection, Degree of Variation, and Minimum Document Count Settings on Clustering
  4.2 Quality Analysis of News Summaries Produced by Event Lexical Chains
    4.2.1 Summary Evaluation Method
    4.2.2 Effect of Relation Parameter Settings
    4.2.3 Effect of Event Lexical Chain Weight Parameter Settings
    4.2.4 Comparison of Summary Quality across the Six Combinations
    4.2.5 Comparison of Summary Quality among Event Lexical Chains, Lexical Chains, and TF*IDF Summation
    4.2.6 Effect of Event Features and Relation Calculation on Event Lexical Chain Summary Quality
    4.2.7 Comparison of Summary Quality between Event Lexical Chains and Lexical Chains under the Same Clustering
Chapter 5 Conclusions and Future Research
  5.1 Conclusions
  5.2 Future Research Directions
References
Appendix 1 Summary Demonstration

List of Figures
Figure 1-1 Thesis organization
Figure 2-1 Ontology example
Figure 2-2 Schematic diagram of the discourse model
Figure 3-1 Architecture of the multi-document summarization system based on event lexical chains
Figure 3-2 Event feature detection diagram
Figure 3-3 k-means clustering algorithm
Figure 3-4 Modified bisecting k-means algorithm
Figure 3-5 Flowchart of sentence selection and redundancy removal
Figure 4-1 Effect of feature selection, degree of variation, and minimum document count settings on clustering accuracy
Figure 4-2 Effect of relation parameter settings (Experiment 4-2-1)
Figure 4-3 Effect of event lexical chain weight parameter settings (Experiment 4-2-2)
Figure 4-4 Comparison of the combinations (Experiment 4-2-3)
Figure 4-5 Comparison of different methods (Experiment 4-2-4)
Figure 4-6 Effect of event features and relation calculation (Experiment 4-2-5)
Figure 4-7 Comparison of summary quality under the same clustering

List of Tables
Table 2-1 Related literature on redundancy removal
Table 2-2 Shallow approaches: related literature at the surface level
Table 2-3 Shallow approaches: related literature at the entity level
Table 2-4 Template example
Table 2-5 Deep approaches: related literature on templates
Table 2-6 Deep approaches: related literature on concept fusion
Table 2-7 Deep approaches: related literature on discourse models
Table 2-8 Related literature comparing summarization methods
Table 2-9 Other evaluation methods for summarization systems
Table 3-1 Clustering termination conditions
Table 3-2 Selecting topic concept sentences
Table 3-3 Example of sentence selection and redundancy removal
Table 3-4 Method 1-(a)
Table 3-5 Method 1-(b)
Table 3-6 Method 2-(a)
Table 3-7 Method 3-(a)
Table 3-8 Method 3-(b)
Table 4-1 All topics and their document counts
Table 4-2 Settings for feature selection, degree of variation, and minimum document count
Table 4-3 Clustering results for Setting 1
Table 4-4 List of event lexical chain experiments
Table 4-5 Sentence rating levels
Table 4-6 Effect of relation parameter settings
Table 4-7 Effect of event lexical chain weight parameter settings
Table 4-8 Six combinations of sentence selection and redundancy removal methods
Table 4-9 Summary results for Combination 1
Table 4-10 Summary results for Combination 2
Table 4-11 Summary results for Combination 3
Table 4-12 Summary results for Combination 4
Table 4-13 Summary results for Combination 5
Table 4-14 Summary results for Combination 6
Table 4-15 Comparison of summary F-measure across the combinations
Table 4-16 Comparison of summary F-measure across different methods
Table 4-17 Effect of event features and relation calculation
Table 4-18 Comparison of summary quality for documents under the same clustering


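Full-text release date: 2011/06/16 (campus network)
Full-text release date: full text not authorized for public release (off-campus network)
Full-text release date: full text not authorized for public release (National Central Library: Taiwan Theses and Dissertations System)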