
Author: 廖俊雄
Chun-Hsiung Liao
Thesis title: 使用平滑支撐向量機之個人化垃圾郵件過濾系統
Personalized Spam Mail Filtering by Using SSVM
Advisor: 李育杰
Yuh-Jye Lee
Committee members: 鄧惟中
Teng Wei-Chung
鮑興國
Hsing-Kuo Pao
阮聖彰
Shanq-Jang Ruan
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Year of publication: 2011
Academic year of graduation: 99 (ROC calendar, 2010-2011)
Language: Chinese
Number of pages: 59
Chinese keywords: 垃圾郵件, 黑名單, 白名單, 平滑式支撐向量機
English keywords: Spam Mail, Blacklist, Whitelist, Smooth Support Vector Machine
  • With the rapid growth of the Internet, spam has become a major information-security challenge for businesses and individuals. Beyond traditional commercial spam, phishing attacks, pornographic messages, and malicious programs (viruses) are all spread through spam. Such mail not only consumes large amounts of network resources but also exposes businesses and individuals to the risk of data leakage.
    In this study, we build a personalized spam filtering framework. In addition to judging spam against personalized blacklists and whitelists, it provides a user feedback mechanism that lets users decide for themselves whether a message is legitimate mail or spam. Using each user's historical mail (both legitimate mail and spam), a Smooth Support Vector Machine (SSVM) is trained to produce a classification model, which is then used to classify newly arriving mail.
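
    The flow described in the abstract (list lookup, user feedback, then a learned classifier) can be sketched roughly as follows. This is only an illustration, not the thesis implementation: the class `PersonalFilter`, its method names, the order in which the whitelist and blacklist are consulted, and the idea that feedback updates the lists are all assumptions, and `keyword_classifier` is a toy placeholder standing in for a trained SSVM model.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set, Tuple


@dataclass
class PersonalFilter:
    """Hypothetical per-user filter: list lookup first, then a learned classifier."""
    classify: Callable[[str], bool]  # stand-in for the trained SSVM model; True means spam
    whitelist: Set[str] = field(default_factory=set)
    blacklist: Set[str] = field(default_factory=set)
    # (body, is_spam) pairs kept as personal training data for later retraining
    history: List[Tuple[str, bool]] = field(default_factory=list)

    def is_spam(self, sender: str, body: str) -> bool:
        sender = sender.lower()
        if sender in self.whitelist:   # assumed order: whitelist is checked first
            return False
        if sender in self.blacklist:
            return True
        return self.classify(body)     # unknown sender: defer to the classifier

    def feedback(self, sender: str, body: str, is_spam: bool) -> None:
        """Record the user's verdict and update the personal lists (assumed behavior)."""
        sender = sender.lower()
        (self.blacklist if is_spam else self.whitelist).add(sender)
        (self.whitelist if is_spam else self.blacklist).discard(sender)
        self.history.append((body, is_spam))


def keyword_classifier(body: str) -> bool:
    """Toy placeholder, NOT an SSVM: flags mail containing a few fixed phrases."""
    return any(word in body.lower() for word in ("viagra", "lottery", "free money"))


f = PersonalFilter(classify=keyword_classifier)
f.feedback("boss@example.com", "Meeting at 10", is_spam=False)
print(f.is_spam("boss@example.com", "You won the lottery"))      # False: whitelisted sender
print(f.is_spam("stranger@example.net", "You won the lottery"))  # True: classifier decides
```

    In a real system, the `history` list would be converted to feature vectors and used to retrain the per-user model offline; that step is omitted here.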

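    Table of Contents
    Abstract
    Table of Contents
    List of Figures
    List of Tables
    Chapter 1 Introduction
        1.1 Preface
        1.2 Research Background and Motivation
        1.3 Research Objectives
        1.4 Thesis Organization
    Chapter 2 Background and Related Technologies
        2.1 Simple Mail Transfer Protocol
            2.1.1 Mail Delivery Flow
        2.2 Introduction to Spam
            2.2.1 How Spam Works
            2.2.2 Phishing
    Chapter 3 Document Classification Techniques and Research Methods
        3.1 Overview of Mail Classification Methods
            3.1.1 List Matching
            3.1.2 Content Filtering
        3.2 Preprocessing Procedure
            3.2.1 Chinese Word Segmentation
            3.2.2 Stop Word Removal
        3.3 Vector Space Model
            3.3.1 Feature Selection
            3.3.2 Term Weight Construction
        3.4 Performance Measure
    Chapter 4 Smooth Support Vector Machine
        4.1 Support Vector Machine
        4.2 Smooth Support Vector Machine
    Chapter 5 System Architecture and Research Methods
        5.1 System Architecture
        5.2 Experimental Platform and System Configuration
            5.2.1 Blacklist and Whitelist Processing
            5.2.2 E-mail Content Processing
            5.2.3 Chinese Word Segmentation and Text Preprocessing
            5.2.4 Mail Feature Selection and Training
            5.2.5 SSVM Offline Training
        5.3 Personalized E-mail Processing
            5.3.1 Personalized Mail Training Data Collection
            5.3.2 Personalized Spam Filtering System
    Chapter 6 Experimental Results and Analysis
        6.1 Results on the PU123A Corpora Training Data Set
        6.2 Results on the Ling-Spam Training Data Set
    Chapter 7 Conclusion
        7.1 Conclusion
        7.2 Future Work
    References
        Chinese-language references
        English-language references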

