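Graduate student: | 廖婉淑 Wan-Shu Liao |
---|---|
Thesis title: | 利用整合式的機器學習方式提高垃圾網站偵測率 Improving WEBSPAM Detection by Using a Hybrid Machine Learning |
Advisor: | 洪西進 Shi-Jinn Horng |
Oral defense committee: | 陳秋華, 王毓饒 |
Degree: | Master |
Department: | College of Electrical Engineering and Computer Science, Department of Computer Science and Information Engineering |
Publication year: | 2009 |
Academic year of graduation: | 97 |
Language: | Chinese |
Pages: | 46 |
Chinese keywords: | web spam, machine learning, information gain, support vector machine, decision tree, clustering |
Foreign-language keywords: | WEBSPAM, imbalance distribution |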
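Web spam (WEBSPAM) refers to websites that use artificial means to improperly raise their own ranking in search engines. Detecting and identifying WEBSPAM has therefore become a pressing task for search engine companies: too many spam websites reduce a search engine's efficiency and harm its reputation, and users who land on spam websites may be attacked by malicious sites.
This study uses link-based features and content-based features of websites, takes the well-known WEBSPAM UK2007 data set as its benchmark, and combines supervised SVM with unsupervised clustering as the machine learning approach. After feature selection (from 275 features down to 21), the method not only improves classification efficiency but still achieves a detection rate of 0.832, which is 0.221 higher than the 0.611 obtained without clustering; the misclassification rate also drops from 0.668 to 0.349. For classification problems with an imbalanced data set distribution, this offers a solution with a better classification rate.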
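The abstract mentions feature selection (275 features reduced to 21), and the keywords list information gain. The following is a minimal sketch of ranking discrete features by information gain. It is an illustration only and not the thesis's implementation: the function names, the feature names, and the toy data are all hypothetical, and it assumes features have already been discretized.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy reduction obtained by splitting `labels` on a discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted average entropy of the label subsets after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def top_k_features(columns, labels, k):
    """Return the names of the k features with the highest information gain."""
    ranked = sorted(
        columns,
        key=lambda name: information_gain(columns[name], labels),
        reverse=True,
    )
    return ranked[:k]


# Toy data (hypothetical): 1 = spam, 0 = normal.
labels = [1, 1, 0, 0, 0, 0]
columns = {
    "many_outlinks": [1, 1, 0, 0, 0, 0],  # separates the two classes exactly
    "has_title": [1, 0, 1, 0, 1, 0],      # carries no class information here
}
print(top_k_features(columns, labels, k=1))
```

In this toy data the first feature splits the labels perfectly and the second leaves the class mix unchanged, so the ranking keeps only the first. A real pipeline would apply the same ranking to all 275 features and keep the top-scoring subset.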
Web spamming is the practice of deliberately boosting a website's search-engine ranking above what it deserves. We call such websites “WEBSPAM”. Detecting them is a crucial issue for search engine providers, because these websites damage not only the performance but also the reputation of a search engine.
Our research aims to improve WEBSPAM detection with a hybrid machine learning approach that combines a supervised method (SVM) with an unsupervised one (clustering). We evaluated our model on the well-known WEBSPAM UK2007 corpus, in which spam and reputable websites are imbalanced (1:18). The detection rate improved from 0.611 to 0.832, while the false positive rate fell from 0.668 to 0.349. The approach should therefore be a good solution to the problem of imbalanced data set distributions.
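To illustrate why detection rate and false positive rate are reported for a 1:18 class ratio, here is a small sketch of the usual confusion-matrix definitions. The abstract does not state its exact formulas, so the definitions below (detection rate as recall on spam, false positive rate as FP / (FP + TN)) are assumptions, and the counts are hypothetical.

```python
def detection_rate(tp, fn):
    """Fraction of actual spam hosts that were flagged (recall)."""
    return tp / (tp + fn)


def false_positive_rate(fp, tn):
    """Fraction of normal hosts that were wrongly flagged as spam."""
    return fp / (fp + tn)


def accuracy(tp, fn, fp, tn):
    """Fraction of all hosts that were labeled correctly."""
    return (tp + tn) / (tp + fn + fp + tn)


# Hypothetical counts with a 1:18 spam-to-normal ratio (100 spam, 1800 normal).
# A classifier that labels every host "normal" looks accurate but detects nothing.
tp, fn, fp, tn = 0, 100, 0, 1800
print(round(accuracy(tp, fn, fp, tn), 3))  # 0.947
print(detection_rate(tp, fn))              # 0.0

# Improvement in detection rate reported in the abstract.
print(round(0.832 - 0.611, 3))             # 0.221
```

On imbalanced data a trivial classifier can reach high accuracy with zero detections, which is why the two rates above are more informative than accuracy alone.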