
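Graduate student: 王嘉群
Jia-Chiun Wang
Thesis title: 建置以機器學習理論為基礎之中英文電子郵件分類器
Apply Machine learning Theory to Build E-mail Filter
Advisor: 洪西進
Shi-Jinn Horng
Oral defense committee: 賴祐吉
Yu-Chi Lai
Hsing-Kuo Pao
Wei-Chung Teng
Yi-Leh Wu
Degree: Master's
Department: College of Electrical Engineering and Computer Science,
Department of Computer Science and Information Engineering
Year of publication: 2010
Academic year of graduation: 98
Language: Chinese
Number of pages: 59
Keywords (translated from Chinese): Bayesian e-mail classifier, N-gram word segmentation, correlation coefficient and distance weights
Keywords (English): Bayesian spam filter, N-gram segmentation, correlation and distance coefficients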
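  • With the growth of the Internet, e-mail has become an economical and fast way to deliver messages, so more and more users rely on it as a means of contact, and the flood of spam has become a major nuisance for recipients. For these reasons, this thesis adopts the Naïve Bayes algorithm, which classifies well and runs quickly, as its spam filter. To raise the recognition rate for Chinese spam, it adds two techniques: a feature extraction method that establishes the relationships among important keywords by weighting them with correlation coefficients and distance coefficients, and an e-mail preprocessing method based on N-gram Chinese word segmentation. System performance is evaluated by k-fold validation on the TREC 2006 Chinese e-mail dataset and the TREC 2007 English e-mail dataset. The experimental data show that SP (Spam Precision) and SR (Spam Recall) are, overall, better than those reported in other studies.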

    As the Internet has developed, more and more people use e-mail as a means of communication. At the same time, the flood of spam has become a serious problem for recipients. This thesis chooses Naïve Bayes as the classifier in its spam filter because of its good classification results and its speed. To reduce the influence of content tampering by spammers and to improve spam recognition, we add correlation and distance coefficients to the features in order to establish how important keywords relate to each other. As a further preprocessing step for Chinese e-mail, we use N-grams for word segmentation. The datasets are the TREC 2006 Chinese e-mail dataset and the TREC 2007 English e-mail dataset. Experiments show that our SP (Spam Precision) and SR (Spam Recall) results are, overall, better than those of other studies.
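
    The pipeline described in the abstract (N-gram segmentation, a Naïve Bayes classifier, and scoring by SP and SR) can be illustrated with a minimal sketch. This is not the thesis implementation: the names `char_ngrams`, `NaiveBayesFilter`, and `spam_precision_recall` are hypothetical, the add-one smoothing is an assumption, and the correlation and distance coefficient weighting is left out because this record does not give its formulas. SP is taken here as the share of messages flagged as spam that really are spam, and SR as the share of actual spam that is flagged.

```python
import math
from collections import Counter


def char_ngrams(text, n=2):
    """Split text into overlapping character n-grams, ignoring whitespace.

    No dictionary is needed, which is why n-grams suit Chinese text.
    """
    chars = [c for c in text if not c.isspace()]
    if len(chars) < n:
        return ["".join(chars)] if chars else []
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


class NaiveBayesFilter:
    """Multinomial Naive Bayes over n-gram tokens (illustrative only)."""

    LABELS = ("spam", "ham")

    def __init__(self, n=2):
        self.n = n
        self.token_counts = {label: Counter() for label in self.LABELS}
        self.doc_counts = Counter()
        self.vocab = set()

    def train(self, text, label):
        tokens = char_ngrams(text, self.n)
        self.token_counts[label].update(tokens)
        self.doc_counts[label] += 1
        self.vocab.update(tokens)

    def _log_score(self, tokens, label):
        # Smoothed class prior (assumption: add-one smoothing over 2 classes).
        total_docs = sum(self.doc_counts.values())
        score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
        # Smoothed token likelihoods, summed in log space to avoid underflow.
        denom = sum(self.token_counts[label].values()) + len(self.vocab)
        for tok in tokens:
            score += math.log((self.token_counts[label][tok] + 1) / denom)
        return score

    def classify(self, text):
        tokens = char_ngrams(text, self.n)
        spam = self._log_score(tokens, "spam")
        ham = self._log_score(tokens, "ham")
        return "spam" if spam > ham else "ham"


def spam_precision_recall(predicted, actual):
    """Return (SP, SR) for parallel lists of 'spam'/'ham' labels."""
    pairs = list(zip(predicted, actual))
    tp = sum(p == "spam" and a == "spam" for p, a in pairs)
    fp = sum(p == "spam" and a == "ham" for p, a in pairs)
    fn = sum(p == "ham" and a == "spam" for p, a in pairs)
    sp = tp / (tp + fp) if tp + fp else 0.0
    sr = tp / (tp + fn) if tp + fn else 0.0
    return sp, sr


if __name__ == "__main__":
    nb = NaiveBayesFilter(n=2)
    nb.train("免費贈品 立即領取", "spam")   # toy training messages
    nb.train("明天開會 請準時", "ham")
    print(char_ngrams("免費贈品"))
    print(nb.classify("免費領取"))
```

    In a k-fold evaluation such as the one the abstract mentions, a filter like this would be trained on k-1 folds and scored with `spam_precision_recall` on the held-out fold, repeating once per fold.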

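    Abstract (Chinese)
    Abstract (English)
    Acknowledgments
    Table of Contents
    List of Figures
    List of Tables
    Chapter 1 Introduction
        1.1 Research Background and Problem Discussion
        1.2 Research Motivation and Objectives
        1.3 Thesis Organization
    Chapter 2 E-mail Analysis
        2.1 Communication Protocols
            2.1.1 Basic Architecture
            2.1.2 SMTP
            2.1.3 POP3
            2.1.4 MIME
        2.2 E-mail Problems
            2.2.1 Definition of Spam
            2.2.2 Harm Caused by Spam
            2.2.3 Forms of Spam
    Chapter 3 Related Work
        3.1 Anti-Spam Architectures
            3.1.1 Single-Host Filtering
            3.1.2 Multi-Host Collaborative Defense
        3.2 Naïve Bayes Algorithm
            3.2.1 Principles of the Naïve Bayes Algorithm
            3.2.2 Reasons for Choosing the Naïve Bayes Algorithm
    Chapter 4 System Architecture
        4.1 System Block Diagram
        4.2 Preprocessing
            4.2.1 English E-mail Preprocessing
            4.2.2 Dictionary Matching
            4.2.3 N-gram Word Segmentation
        4.3 Feature Selection Methods
            4.3.1 Term Frequency (TF)
            4.3.2 Term Frequency-Inverse Document Frequency (TF-IDF)
            4.3.3 Chi-Square
            4.3.4 Markov Feature Extraction
            4.3.5 Correlation Coefficient
        4.4 Training Method Analysis
            4.4.1 TEFT
            4.4.2 TOE
            4.4.3 TUNE
    Chapter 5 Experimental Methods and Result Analysis
        5.1 Datasets and Validation Method
        5.2 Performance Evaluation
        5.3 Experimental Environment and Results
            5.3.1 Improvement Step 1: Chinese Word Segmentation
            5.3.2 Improvement Step 2: Feature Selection
            5.3.3 Improvement Step 3: Training Method
            5.3.4 Comparison with Other Classification Systems
    Chapter 6 Conclusions and Future Work
        6.1 Conclusions
        6.2 Future Work
    References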
