Graduate Student: 簡琮祐 (Tsung-Yu Chien)
Thesis Title: 預測事件文件之樣式探勘與自動驗證技術 / Pattern Mining and Verification of On-Line Event Prediction Document
Advisors: 鍾聖倫 (Sheng-luen Chung), 陸敬互 (Ching-hu Lu)
Committee Members: 蘇順豐 (Shun-feng Su), 古倫維, 毛敬豪
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Electrical Engineering
Year of Publication: 2015
Academic Year: 103 (ROC calendar)
Language: Chinese
Pages: 127
Chinese Keywords: 樣式探勘, 自然語言處理, 語料庫, 預測驗證
English Keywords: pattern mining, natural language processing, corpus, prediction verification
Chinese Abstract (translated): The value of a prediction depends on whether it agrees with the outcome that subsequently occurs. Many prediction services exist on the Internet, yet follow-up verification of their accuracy is rarely seen. The purpose of this study is to automatically trace prediction documents on the Internet so as to objectively verify the accuracy of their predictions. Accordingly, modeled on the four components of Shang- and Zhou-dynasty oracle-bone divinations, namely the preface (前辭), charge (命辭), prognostication (占辭), and verification (驗辭), the data-mining technique realized in this thesis automatically extracts from predictive articles a small divination record (an "oraclet") corresponding to these four components, and then verifies the verification component against the outcome that later occurs. Technically, we use Stanford University's Chinese natural-language-processing toolkit (NLTK) to parse predictive articles. In particular, for the word-segmentation and part-of-speech tagging errors that frequently arise when applying NLTK, we propose correction algorithms: without retraining the corpus, the adapted NLP pipeline converts article content into parse trees carrying syntactic structure and automatically maps them to the oraclet's preface, charge, and prognostication, so that the subsequent, domain-specific verification can follow. Through the above steps of word segmentation, POS tagging, and syntactic parsing, past predictive articles can be checked for agreement between their predictions and the actual outcomes, and by analyzing the historical prediction results of a prediction source, its credibility can be objectively assessed. Moreover, when a new prediction article is published online, the online prediction-evaluation system we designed automatically builds a pending oraclet awaiting verification; after the predicted event occurs, it automatically checks the prediction's correctness and merges the result into that source's historical prediction record. As a demonstration, the "Prediction Event Verification Website" built in this study verifies, as an example, the predictions of Taiwan's stock market made by investment trust institutions over the past several years.
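The segmentation correction described above (repairing over-segmented tokens without retraining the corpus) can be sketched roughly as a dictionary-based post-pass. The function name, the greedy longest-match strategy, and the sample lexicon below are illustrative assumptions, not the thesis's actual algorithm.

```python
# A minimal sketch of dictionary-based post-correction for over-segmented
# Chinese tokens: adjacent tokens are re-joined whenever their concatenation
# appears in a domain lexicon, trying longer spans first.
def merge_oversegmented(tokens, lexicon, max_span=3):
    """Greedily re-join adjacent tokens whose concatenation is a known word."""
    out, i = [], 0
    while i < len(tokens):
        merged = None
        # Try the longest span first, so several fragments of one
        # financial term can be joined in a single step.
        for span in range(min(max_span, len(tokens) - i), 1, -1):
            candidate = "".join(tokens[i:i + span])
            if candidate in lexicon:
                merged = candidate
                i += span
                break
        if merged is None:
            merged = tokens[i]
            i += 1
        out.append(merged)
    return out
```

A pass like this leaves correctly segmented text untouched, since only concatenations present in the lexicon trigger a merge.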
English Abstract: The value of a prediction hinges on the validity of its outcome. Many prediction services are available on the Internet, yet few keep track of their follow-up validity. This study aims to verify the validity of Internet event-prediction documents through an automatic mining process. We draw an analogy from ancient Chinese oracle-bone archives, where each divination record is composed of four components: preface, charge, prognostication, and verification. Our goal is to extract the contents corresponding to these four components as a four-tuple prediction entity, which we call an oraclet. An oraclet yet to be verified is called pending, whereas a verified one is called final. In essence, every prediction document starts as a pending oraclet with the verification part unfilled. To conduct pattern mining, Stanford University's Chinese Natural Language Processing Toolkit (NLTK) is utilized. To tackle the common segmentation and part-of-speech tagging errors that occur when using NLTK, we propose correction algorithms such that, without retraining the corpus, a prediction document can be parsed, yielding a pending oraclet later to be verified against pertinent but separate authoritative sources. In this way, the prediction confidence of a particular prediction source can be assessed through systematic collection and verification of its past predictions. An online verification service, the Web Service for Prediction Verification (WSPV), is developed. Upon spotting a new prediction, WSPV maps it to a corresponding oraclet, schedules a time for verification, and verifies the predicted result for a later confidence check. For demonstration, prediction documents for the following year's Taiwan stock-market performance from 2004 to 2014 are mined, verified, and presented. The proposed pattern-mining technique can also be generalized to other domains of prediction.
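The pending-to-final oraclet life cycle described above can be sketched as a small data structure. The field names, the `verify` helper, and the `matches` predicate are hypothetical illustrations, not the WSPV's actual schema.

```python
from dataclasses import dataclass
from typing import Optional

# A minimal sketch of the four-tuple "oraclet": an oraclet is pending until
# its verification component is filled in after the predicted event occurs.
@dataclass
class Oraclet:
    preface: str           # who made the prediction, and when
    charge: str            # the question or event being predicted
    prognostication: str   # the predicted outcome
    verification: Optional[str] = None  # filled in once the outcome is known

    @property
    def status(self) -> str:
        return "pending" if self.verification is None else "final"

def verify(oraclet: Oraclet, actual_outcome, matches) -> bool:
    """Fill in the verification component once the actual outcome is known.

    `matches` is a domain-specific predicate comparing the prognostication
    with the actual outcome (e.g., comparing a forecast index level with
    the realized closing level).
    """
    hit = matches(oraclet.prognostication, actual_outcome)
    oraclet.verification = f"{'hit' if hit else 'miss'}: {actual_outcome}"
    return hit
```

For example, an oraclet built from a 2013 forecast of the 2014 TAIEX would stay pending until year-end, when a numeric predicate fills in the verification and flips its status to final.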