| Author: | 邱詩佩 Shih-pei Chiu |
|---|---|
| Thesis Title: | 以事件特徵為基礎的階層式新聞偵測系統 Hierarchical News Detection based on Event Feature |
| Advisor: | 徐俊傑 Chun-Chieh Hsu |
| Committee: | 蕭顯勝 Hsien-Sheng Hsiao, 賴源正 Yuan-Cheng Lai |
| Degree: | 碩士 Master |
| Department: | 管理學院 - 資訊管理系 Department of Information Management |
| Thesis Publication Year: | 2005 |
| Graduation Academic Year: | 93 |
| Language: | Chinese |
| Pages: | 88 |
| Keywords (in Chinese): | 事件特徵、主題偵測、事件偵測、新聞偵測、文件分群 |
| Keywords (in other languages): | Event Feature, Topic Detection, Event Detection, News Detection, Document Clustering |
With advances in information technology, electronic documents have become pervasive in daily life, and the rapid growth of the Internet has made it easy to obtain the material we need. Within this vast body of data, however, retrieving the right information quickly has become an important problem. Electronic news is one of the main channels through which people keep up with current events. News portals do provide search functions, but users obtain the reports they are interested in only if they can supply the right keywords. A mechanism is therefore needed that automatically groups together news reports on related topics and events.
In this thesis, we analyze the characteristics of news reports in order to detect their event features. The event features are then used to extract topic terms and event-oriented terms. We propose a hierarchical news detection architecture with two levels, a topic level and an event level. At the topic level, a supervised learning model uses the topic terms to group news reports on the same topic into predefined topic categories. At the event level, an unsupervised learning model uses the event-oriented terms together with a Modified Bisecting K-means clustering algorithm to group news reports on the same event.
We conducted experiments to evaluate the effectiveness of this approach. The results show that the proposed event-feature-based news detection system improves precision at the topic level by up to about 21 percent and improves detection performance at the event level by up to about 22 percent.
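The event-level clustering step can be illustrated with a small sketch. The abstract does not describe how the thesis modifies bisecting K-means, so the code below shows only the standard algorithm: repeatedly pick the largest cluster, split it in two with a basic 2-means several times, and keep the split with the highest cohesion. The toy documents, the plain term-frequency weighting, and all function names are assumptions made for illustration and are not taken from the thesis.

```python
import math
import random
from collections import Counter


def tf_vector(tokens):
    """Unit-length term-frequency vector stored as a sparse dict."""
    counts = Counter(tokens)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()}


def cosine(u, v):
    """Sparse dot product; equals cosine similarity for unit-length vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def centroid(vecs):
    """Sum of the vectors, rescaled to unit length."""
    total = Counter()
    for vec in vecs:
        for t, w in vec.items():
            total[t] += w
    norm = math.sqrt(sum(w * w for w in total.values())) or 1.0
    return {t: w / norm for t, w in total.items()}


def cohesion(vectors, members):
    """Sum of similarities between each member and its cluster centroid."""
    center = centroid([vectors[i] for i in members])
    return sum(cosine(vectors[i], center) for i in members)


def two_means(vectors, members, rng, iters=10):
    """Split one cluster (a list of document indices) into two parts."""
    first, second = rng.sample(members, 2)
    centers = [vectors[first], vectors[second]]
    left, right = [], []
    for _ in range(iters):
        left, right = [], []
        for i in members:
            nearer_first = cosine(vectors[i], centers[0]) >= cosine(vectors[i], centers[1])
            (left if nearer_first else right).append(i)
        if not left or not right:
            # Degenerate split (e.g. identical seeds): peel off one document.
            return members[:1], members[1:]
        centers = [centroid([vectors[i] for i in left]),
                   centroid([vectors[i] for i in right])]
    return left, right


def bisecting_kmeans(vectors, k, trials=5, seed=0):
    """Standard bisecting K-means over document indices (not the thesis variant)."""
    rng = random.Random(seed)
    clusters = [list(range(len(vectors)))]
    while len(clusters) < k:
        clusters.sort(key=len)
        target = clusters.pop()          # always split the largest cluster
        if len(target) < 2:
            clusters.append(target)      # nothing left to split
            break
        best = None
        for _ in range(trials):          # keep the most cohesive trial split
            left, right = two_means(vectors, target, rng)
            score = cohesion(vectors, left) + cohesion(vectors, right)
            if best is None or score > best[0]:
                best = (score, left, right)
        clusters.extend(best[1:])
    return clusters


# Hypothetical event-oriented terms for six news reports about two events.
docs = [
    ["typhoon", "landfall", "taipei", "flood"],
    ["typhoon", "flood", "evacuation"],
    ["typhoon", "landfall", "rain"],
    ["election", "mayor", "vote"],
    ["election", "vote", "ballot"],
    ["mayor", "ballot", "election"],
]
vectors = [tf_vector(d) for d in docs]
clusters = bisecting_kmeans(vectors, k=2)
print(sorted(sorted(c) for c in clusters))
```

On this toy input the two groups share no terms, so the reports about the typhoon and the reports about the election end up in separate clusters. Real news text would need word segmentation and a term-selection step such as the event-oriented terms the thesis describes.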