
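Graduate student: 林榮雄
Rong-Syong Lin
Thesis title: 基於動態語意擷取的垃圾郵件過濾
Spam Filtering based on Dynamic Semantic Extraction
Advisor: 戴碧如
Bi-Ru Dai
Oral defense committee: 鮑興國
Hsing-Kuo Pao
鄧維光
Wei-Guang Teng
胡誌麟
Chih-Lin Hu
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Computer Science and Information Engineering
Year of publication: 2008
Academic year of graduation: 96 (ROC calendar)
Language: English
Number of pages: 75
Keywords (Chinese): 垃圾郵件過濾 (spam filtering), 潛藏語意分析 (latent semantic analysis), 樣式擷取 (pattern extraction), 線上動態學習 (online dynamic learning), 個人化過濾器 (personalized filter)
Keywords (English): spam filter, latent semantic analysis, pattern extraction, online active learning, personalized filter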

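Spam attacks change over time, so a good mechanism is needed to cope with them as they evolve. Dynamic, incremental learning will be an important direction of development for spam filters in recent years.
Because certain words or patterns that appear together in mails provide a strong basis for discrimination, this thesis analyzes the semantics and summaries of mails and collects sets of semantic patterns to match against.
The filter maintains separate semantic sets for legitimate and illegitimate mails. The purpose is to find mails that resemble neither class and to analyze new semantics from them, so that the filter learns those semantics and can filter similar mails when they appear later.
In addition, user feedback helps the filter obtain the true class of previously unseen mails, and a simple clustering step performed beforehand reduces the burden of user feedback.
Experimental results show that the filter achieves better results after semantic analysis.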


Spam attacks change over time and call for good strategies to cope with spam mails whose characteristics vary. In recent years, several methods based on active learning have been developed to handle such mails. Because some words or patterns that emerge from mails serve as the basis of detection, this thesis analyzes the semantics and contents of mails to collect semantic patterns for identifying spam. The filter maintains two semantic classes, the legal mail class and the illegal mail class, in order to identify mails that differ from both classes. New semantic patterns are then extracted from those mails and used in subsequent filtering. In addition, user feedback helps the filter obtain the real class labels of unrevealed mails, and the burden of user feedback is further reduced by clustering the mails beforehand. Experiments show that our method achieves better performance than prior filters that are also based on latent semantic analysis.
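The routing idea in the abstract (compare a mail with a legitimate class and a spam class, and set aside mails that resemble neither) can be illustrated with a minimal sketch. This is not the thesis's implementation: it swaps in plain TF-IDF cosine similarity against class centroids instead of latent semantic analysis, and the function names, the sample mails, and the `threshold` value of 0.2 are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase, split on whitespace, keep purely alphabetic tokens.
    return [w for w in text.lower().split() if w.isalpha()]


def build_idf(token_lists):
    # Smoothed inverse document frequency over the training mails.
    n = len(token_lists)
    df = Counter(term for tokens in token_lists for term in set(tokens))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def vectorize(tokens, idf):
    # Sparse TF-IDF vector; terms never seen in training get weight 0.
    return {t: tf * idf.get(t, 0.0) for t, tf in Counter(tokens).items()}


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def centroid(vectors):
    # Average of sparse vectors: a crude stand-in for a class's pattern set.
    total = Counter()
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    return {t: w / len(vectors) for t, w in total.items()}


def route(mail, ham_c, spam_c, idf, threshold=0.2):
    # Mails dissimilar to BOTH classes are deferred to user feedback.
    vec = vectorize(tokenize(mail), idf)
    ham_sim, spam_sim = cosine(vec, ham_c), cosine(vec, spam_c)
    if max(ham_sim, spam_sim) < threshold:
        return "ask_user"
    return "ham" if ham_sim >= spam_sim else "spam"


# Hypothetical training mails, for illustration only.
ham = ["meeting agenda for the project review tomorrow",
       "please send the project report before the meeting"]
spam = ["cheap pills buy now limited offer",
        "buy cheap watches now best offer"]

ham_tokens = [tokenize(m) for m in ham]
spam_tokens = [tokenize(m) for m in spam]
idf = build_idf(ham_tokens + spam_tokens)
ham_c = centroid([vectorize(t, idf) for t in ham_tokens])
spam_c = centroid([vectorize(t, idf) for t in spam_tokens])

for mail in ["project meeting tomorrow", "buy cheap pills now",
             "lottery winner claim prize"]:
    print(mail, "->", route(mail, ham_c, spam_c, idf))
```

In this sketch, a mail routed to `ask_user` would be the kind the abstract describes as different from both classes; once the user labels it, its terms could be folded into the matching class so that later similar mails are filtered automatically.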

Table of Contents
Advisor's Recommendation Letter
Oral Defense Committee Approval
Abstract (Chinese)
Abstract
Acknowledgments
Table of Contents
List of Tables
List of Figures
1 Introduction
  1.1 Motivation
  1.2 Thesis Contributions
  1.3 Thesis Organization
2 Related Work
  2.1 Vector Space Model and Latent Semantic Analysis
  2.2 Feature Extraction and Feature Selection
  2.3 Online Active Learning Task
3 Spam Filter Based On Semantic Extraction
  3.1 Corpus Collection
    3.1.1 Port Stemming
    3.1.2 Stop Words
    3.1.3 Term Frequency and Inverse Document Frequency Matrix
  3.2 Latent Semantic Analysis
  3.3 Pattern Extraction and Semantic Pattern set
  3.4 Subject Collection
  3.5 Similarity Measure
    3.5.1 Subject Similarity
    3.5.2 Text-Based Similarity
4 Online Active Learning in Spam Filter
  4.1 Sliding Window Strategy
  4.2 User Feedback and Text Clustering
  4.3 Maintaining Well Semantic Pattern Sets and Well Subject text
  4.4 False Positive Reduction
  4.5 Personalized Filter
5 Experiments
  5.1 Metrics
  5.2 Performance
6 Conclusion and Future Work
  6.1 Conclusion
  6.2 Future Work
References
Authorization Letter


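Full-text release date: 2013/07/29 (campus network)
Full-text release date: full text not authorized for public release (off-campus network)
Full-text release date: full text not authorized for public release (National Central Library: National Digital Library of Theses and Dissertations in Taiwan)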