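Graduate student: | 王嘉群 Jia-Chiun Wang |
---|---|
Thesis title: | 建置以機器學習理論為基礎之中英文電子郵件分類器 Apply Machine learning Theory to Build E-mail Filter |
Advisor: | 洪西進 Shi-Jinn Horng |
Committee members: | 賴祐吉 Yu-Chi Lai, 鮑興國 Hsing-Kuo Pao, 鄧惟中 Wei-Chung Teng, 吳怡樂 Yi-Leh Wu |
Degree: | Master |
Department: | College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering |
Year of publication: | 2010 |
Graduation academic year: | 98 (ROC calendar) |
Language: | Chinese |
Number of pages: | 59 |
Chinese keywords: | 貝式郵件分類器 (Bayesian e-mail classifier), N-gram斷詞 (N-gram word segmentation), 相關性係數與距離權重 (correlation coefficient and distance weight) |
Foreign-language keywords: | Bayesian spam filter, N-gram segmentation, correlation and distance coefficients |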
As the Internet has grown, more and more people rely on e-mail as a fast, inexpensive means of communication, and the resulting flood of spam has become a serious problem for recipients. This thesis therefore adopts the Naïve Bayes algorithm as the classifier in its spam filter, because it classifies both accurately and quickly. To reduce the effect of spammers tampering with message content and to improve the discrimination of spam, features are extracted by weighting keywords with correlation and distance coefficients, which establishes the relationships among important keywords. Chinese e-mail is additionally pre-processed with N-gram word segmentation. System performance is evaluated in k-fold fashion on the TREC 2006 Chinese e-mail data set and the TREC 2007 English e-mail data set. The experimental results show that SP (Spam Precision) and SR (Spam Recall) are, overall, better than those reported in other studies.
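To make the pipeline named in the abstract concrete, the following is a minimal Python sketch of character N-gram segmentation feeding a Naïve Bayes classifier, plus the SP and SR measures. It is an illustration under stated assumptions, not the thesis's implementation: the names `char_ngrams`, `NaiveBayesSpamFilter`, and `spam_precision_recall` are hypothetical, bigrams and Laplace smoothing are assumed choices, the toy messages are invented, and the correlation and distance coefficient weighting is left out because the abstract does not give its formulas.

```python
import math
from collections import Counter


def char_ngrams(text, n=2):
    """Split text into overlapping character n-grams, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) < n:
        return ["".join(chars)] if chars else []
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


class NaiveBayesSpamFilter:
    """Multinomial Naive Bayes over n-gram tokens (assumed variant)."""

    LABELS = ("spam", "ham")

    def __init__(self, n=2):
        self.n = n
        self.token_counts = {label: Counter() for label in self.LABELS}
        self.doc_counts = {label: 0 for label in self.LABELS}
        self.vocab = set()

    def train(self, text, label):
        tokens = char_ngrams(text, self.n)
        self.token_counts[label].update(tokens)
        self.doc_counts[label] += 1
        self.vocab.update(tokens)

    def _log_score(self, tokens, label):
        # Smoothed class prior, so an unseen class never yields log(0).
        total_docs = sum(self.doc_counts.values())
        score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
        # Laplace-smoothed token likelihoods.
        denom = sum(self.token_counts[label].values()) + len(self.vocab)
        for token in tokens:
            score += math.log((self.token_counts[label][token] + 1) / denom)
        return score

    def classify(self, text):
        tokens = char_ngrams(text, self.n)
        return max(self.LABELS, key=lambda label: self._log_score(tokens, label))


def spam_precision_recall(true_labels, predicted_labels):
    """Return (SP, SR): precision and recall measured on the spam class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == "spam" and p == "spam")
    fp = sum(1 for t, p in pairs if t == "ham" and p == "spam")
    fn = sum(1 for t, p in pairs if t == "spam" and p == "ham")
    sp = tp / (tp + fp) if tp + fp else 0.0
    sr = tp / (tp + fn) if tp + fn else 0.0
    return sp, sr


# Toy data, invented for illustration only.
spam_filter = NaiveBayesSpamFilter(n=2)
spam_filter.train("免費贈品 立即領取", "spam")
spam_filter.train("限時優惠 免費試用", "spam")
spam_filter.train("明天下午開會 請準時", "ham")
spam_filter.train("會議記錄已寄出", "ham")
```

With this toy model, `spam_filter.classify("免費優惠")` selects the spam class because its bigrams occur only in the spam training messages. A k-fold evaluation would repeat the train and classify steps on each held-out fold and pass the collected labels to `spam_precision_recall`.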