
Author: 邱詩佩
Shih-pei Chiu
Thesis Title: 以事件特徵為基礎的階層式新聞偵測系統
Hierarchical News Detection based on Event Feature
Advisor: 徐俊傑
Chun-Chieh Hsu
Committee: 蕭顯勝
Hsien-Sheng Hsiao
賴源正
Yuan-Cheng Lai
Degree: 碩士
Master
Department: 管理學院 - 資訊管理系
Department of Information Management
Thesis Publication Year: 2005
Graduation Academic Year: 93 (ROC calendar; academic year 2004-2005)
Language: Chinese
Pages: 88
Keywords (in Chinese): 事件特徵, 主題偵測, 事件偵測, 新聞偵測, 文件分群
Keywords (in other languages): Event Feature, Topic Detection, Event Detection, News Detection, Document Clustering
  • 由於資訊科技的進步,電子文件充斥在我們生活的周遭,加上網際網路的蓬勃發展,讓我們可以輕而易舉地獲得所需的資料。然而在浩瀚的資料堆中,要如何快速地獲取正確的資訊即變成一個非常重要的課題。電子新聞是人們獲取生活新知主要的管道之一,雖然電子新聞入口網站亦提供新聞檢索的功能,但是必須在使用者能夠下對正確的關鍵字,才能獲得有興趣的新聞報導。因此,必須要有一個機制,能夠自動將相關主題與事件的新聞報導聚集在一起。
    本研究分析新聞報導的特性,偵測出新聞報導中的事件特徵(Event Feature),以此事件特徵來萃取主題詞彙及事件導向詞彙。本研究為階層式的新聞偵測架構,主題階層是以監督式(Supervised)的學習模式,根據主題詞彙(Topic Term)將相同主題的新聞報導聚集在一起;事件階層是以非監督式(Unsupervised)的學習模式,利用事件導向詞彙與Modified Bisecting K-means分群演算法將相同事件的新聞報導聚集在一起。
    經由實驗發現,以本研究提出的以事件特徵為基礎的新聞偵測系統,在主題階層的新聞偵測最多可提昇約21%的精確率,在事件階層的新聞偵測最多可提昇約22%的效果。


    With advances in information technology, electronic documents have become pervasive in daily life, and the rapid growth of the Internet has made it easy to obtain the information we need. However, quickly extracting the right information from this vast volume of documents has become an important problem. Electronic news is one of the main channels through which people keep up with current affairs. Although news portal sites provide news search, users retrieve the stories they are interested in only if they can supply the right keywords. It is therefore desirable to have a mechanism that automatically groups news stories about related topics and events.
    In this thesis, we analyze the properties of news stories to detect their event features, and we use these event features to extract topic terms and event-oriented terms. We propose a hierarchical news detection architecture with a topic level and an event level. At the topic level, a supervised learning model uses topic terms to assign news stories to predefined topic categories. At the event level, an unsupervised learning model uses event-oriented terms and a Modified Bisecting K-means clustering algorithm to group stories that report the same event.
    We conducted experiments to study the effectiveness of this approach. The results show that event-feature-based detection improves precision at the topic level by up to about 21 percent and improves detection performance at the event level by up to about 22 percent.
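The event-level step named in the abstract builds on bisecting k-means. This record does not describe how the thesis modifies the algorithm, so the following is only a minimal sketch of plain, unmodified bisecting k-means for orientation. The function names, the use of cosine similarity, the deterministic seeding rule, and the choice to always split the largest cluster are all assumptions made for illustration; none of them is taken from the thesis.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length term-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def two_means(vectors, iterations=10):
    """Split vectors into two groups with a basic 2-means on cosine similarity."""
    # Assumed seeding rule: the first vector and the vector least similar to it.
    first = vectors[0]
    second = min(vectors, key=lambda v: cosine(first, v))
    centers = [first, second]
    left, right = [], []
    for _ in range(iterations):
        left, right = [], []
        for v in vectors:
            if cosine(v, centers[0]) >= cosine(v, centers[1]):
                left.append(v)
            else:
                right.append(v)
        if not left or not right:
            break  # the group cannot be split any further
        centers = [centroid(left), centroid(right)]
    return left, right


def bisecting_kmeans(vectors, k):
    """Repeatedly bisect the largest cluster until k clusters exist."""
    clusters = [list(vectors)]
    while len(clusters) < k:
        # Assumed selection rule: split the cluster with the most documents.
        clusters.sort(key=len, reverse=True)
        target = clusters.pop(0)
        left, right = two_means(target)
        if not left or not right:
            clusters.append(target)  # no valid split; stop early
            break
        clusters.extend([left, right])
    return clusters


if __name__ == "__main__":
    # Toy document vectors over three hypothetical event-oriented terms.
    docs = [
        [1.0, 0.0, 0.0],
        [0.9, 0.1, 0.0],
        [0.0, 1.0, 0.0],
        [0.0, 0.9, 0.1],
        [0.0, 0.0, 1.0],
    ]
    for cluster in bisecting_kmeans(docs, 3):
        print(cluster)
```

Splitting the largest cluster is only one common selection criterion; choosing the cluster with the lowest internal similarity is another, and the thesis may use either or something different.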

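    Chinese Abstract
    English Abstract
    Acknowledgments
    Table of Contents
    List of Figures
    List of Tables
    Chapter 1. Introduction
        1.1 Research Background
        1.2 Research Objectives and Methods
        1.3 Thesis Organization
    Chapter 2. Literature Review
        2.1 Definitions of Topic and Event
        2.2 Topic Detection and Tracking
        2.3 Categories of Event Detection
            2.3.1 Retrospective Detection
            2.3.2 On-line Detection
        2.4 Related Work on Event Detection
        2.5 Term Selection Methods
        2.6 Vector Space Model (VSM)
        2.7 Document Clustering Techniques
            2.7.1 Hierarchical Clustering Algorithms
            2.7.2 Partitional Clustering Algorithms
    Chapter 3. Hierarchical News Detection Based on Event Features
        3.1 System Architecture
        3.2 Data Preprocessing
            3.2.1 Chinese Word Segmentation
            3.2.2 Personal Name Identification
            3.2.3 Compound-Word Detection
            3.2.4 Part-of-Speech Filtering
            3.2.5 Term Frequency and Document Frequency Filtering (TF and DF Filtering)
            3.2.6 Event Feature Detection
        3.3 Hierarchical News Detection Architecture
        3.4 Topic-Level News Detection
            3.4.1 Building the Topic-Level Document Vector Space
            3.4.2 Extracting the Topic Terms That Represent Each Topic
            3.4.3 Topic-Level News Detection Algorithm
        3.5 Event-Level News Detection
            3.5.1 Building the Event-Level Document Vector Space
            3.5.2 Filtering Common Terms Associated with the Topic
            3.5.3 Filtering Specialized Terms to Select a Document's Event-Oriented Terms
            3.5.4 Example of Selecting a Document's Event-Oriented Terms
            3.5.5 Event-Level News Detection Algorithm: Modified Bisecting K-means
    Chapter 4. Experimental Results and Analysis
        4.1 Dataset and Evaluation Methods
            4.1.1 Dataset
            4.1.2 Evaluation Methods
        4.2 Effectiveness of Personal Name Identification
        4.3 Analysis of Topic-Level News Detection Results
            4.3.1 Effect of Basing Topic Terms on Event Features on Topic-Level News Detection
            4.3.2 Effect of the Number of Topic Terms (Parameter m) on Topic-Level News Detection
            4.3.3 Effect of the Criteria for Selecting Document Vector Elements on Topic-Level News Detection
            4.3.4 Effect of Topic Category on Topic-Level News Detection
            4.3.5 Effect of the Number of Topic Categories on Topic-Level News Detection
        4.4 Analysis of Event-Level News Detection Results
            4.4.1 Effect of the Common-Term Filtering Parameter on Event-Level News Detection
            4.4.2 Effect of the Specialized-Term Filtering Parameter on Event-Level News Detection
            4.4.3 Effect of Basing Event-Oriented Term Extraction on Event Features on Event-Level News Detection
    Chapter 5. Conclusions and Future Work
        5.1 Conclusions
        5.2 Future Research Directions
    References
    Appendix 1. Demonstration of the Hierarchical News Detection System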


    Full-text release date: 2006/06/07 (intranet)
    Full text not authorized for release (Internet)
    Full text not authorized for release (National Library)