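Improving WEBSPAM Detection by Using a Hybrid Machine Learning | National Taiwan University of Science and Technology Theses and Dissertations System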

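Graduate student:	Wan-Shu Liao (廖婉淑)
Thesis title:	Improving WEBSPAM Detection by Using a Hybrid Machine Learning (利用整合式的機器學習方式提高垃圾網站偵測率)
Advisor:	Shi-Jinn Horng (洪西進)
Oral defense committee:	陳秋華, 王毓饒
Degree:	Master
Department:	College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Year of publication:	2009
Graduation academic year:	97 (ROC calendar)
Language:	Chinese
Number of pages:	46
Chinese keywords:	web spam, machine learning, information gain, support vector machine, decision tree, clustering
Foreign-language keywords:	WEBSPAM, imbalance distribution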

Web spam (WEBSPAM) refers to websites that use artificial means to push their search engine ranking higher than they deserve. Detecting such websites has become a pressing task for search engine companies: too many spam sites degrade a search engine's performance and damage its reputation, and users who land on spam sites may be exposed to attacks from malicious websites.

This research uses link-based and content-based features of websites and combines a supervised method (SVM) with an unsupervised method (clustering) into a hybrid machine learning approach. The model was evaluated on the well-known WEBSPAM UK2007 benchmark dataset, in which spam and reputable websites are imbalanced (1:18). After feature selection reduced the feature set from 275 features to 21, classification performance improved and the detection rate still reached 0.832, which is 0.221 higher than the 0.611 obtained without clustering. The false positive rate also fell from 0.668 to 0.349. The approach therefore offers a solution with a better classification rate for classification problems whose data sets are imbalanced.
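The record lists information gain among its keywords and reports results as a detection rate and a false positive rate, but it does not spell out the procedure. The following is a minimal sketch of how such a feature ranking and such rates might be computed, not the thesis's implementation. It assumes discrete feature values, takes the detection rate to be TP / (TP + FN) and the false positive rate to be FP / (FP + TN), and uses invented toy data; the feature names `many_outlinks` and `has_title` and all function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(values, labels):
    """Entropy of the labels minus their entropy after splitting on a discrete feature."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def select_top_features(columns, labels, k):
    """Names of the k features with the highest information gain."""
    ranked = sorted(columns,
                    key=lambda name: information_gain(columns[name], labels),
                    reverse=True)
    return ranked[:k]


def detection_and_false_positive_rate(y_true, y_pred, positive="spam"):
    """Return (TP / (TP + FN), FP / (FP + TN)); assumes both classes occur in y_true."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp / (tp + fn), fp / (fp + tn)


# Toy data, invented for illustration (not taken from WEBSPAM UK2007).
labels = ["spam", "spam", "normal", "normal", "normal", "normal"]
columns = {
    "many_outlinks": [1, 1, 0, 0, 0, 0],  # hypothetical feature that separates the classes
    "has_title":     [1, 0, 1, 0, 1, 0],  # hypothetical feature that is uninformative here
}
selected = select_top_features(columns, labels, k=1)
predictions = ["spam", "normal", "spam", "normal", "normal", "normal"]
rates = detection_and_false_positive_rate(labels, predictions)
print(selected, rates)
```

On an imbalanced set such as the 1:18 ratio mentioned above, reporting both rates matters because a classifier that labels everything as reputable would have a low false positive rate while detecting no spam at all.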

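Chinese Abstract
Abstract
Acknowledgments
List of Figures
List of Tables
List of Appendix Tables
Chapter 1 Introduction
1.1 Background
1.2 Contributions
Chapter 2 Related Work
2.1 Search Engine Ranking Strategies
2.2 Techniques Spam Websites Use to Manipulate Rankings
2.3 Discussion of Related Research
Chapter 3 Research Method
3.1 Data Mining
3.2 Architecture and Method of This Thesis
Chapter 4 Experimental Results
4.1 Dataset
4.2 Feature Selection and Classifier
4.3 Experimental Results
Chapter 5 Conclusion and Future Work
Appendix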

                                

