Graduate student: | 鐘仕廷 Shi-ting Zhong |
---|---|
Thesis title: | 基於文字探勘應用於使用者特徵向量擷取及行為分析的垃圾微網誌偵測系統 A Micro-blog Spammer Detection Framework Based on Mining User-Generated Context and Behavior |
Advisor: | 洪西進 Shi-Jinn Horng |
Oral defense committee: | 楊士萱 Shih-Hsuan Yang, 吳有基 Yu-Chi Wu, 黃元欣 Yuan-Shin Hwang |
Degree: | Master |
Department: | College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering |
Year of publication: | 2011 |
Academic year of graduation: | 99 (ROC calendar) |
Language: | Chinese |
Number of pages: | 51 |
Keywords (Chinese): | Twitter, 資訊檢索模型, 文字探勘, Support Vector Machine |
Keywords (English): | Twitter, Information Retrieval Model, Text Mining, Support Vector Machine |
Twitter is a social networking site on which each post, called a Tweet, is limited to 140 characters. Compared with a traditional blog, posts on Twitter are much shorter, yet they may still contain image and video links, and the site lets users exchange information with one another. People use Twitter to find topics and posts that interest them. Unfortunately, Twitter has been infiltrated by a large amount of spam, which lowers the quality of Twitter search results and wastes network resources. The main goal of this thesis is to detect the accounts that spread spam on Twitter, so that users have a cleaner environment. Before a classifier for identifying spammers can be trained, the features that help the classification must be extracted. In this thesis we combine text mining with an information retrieval model to generate text-based features, and we observe users' posting patterns to generate user behavior features. Finally, we use a Support Vector Machine (SVM) on the combination of these two feature vectors to train a classifier that automatically detects spammer accounts on Twitter.
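The pipeline described in the abstract (text-based features, user behavior features, and an SVM over their concatenation) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis's implementation: the abstract does not name the term weighting or the behavior features, so TF-IDF is assumed for the text part, and the two behavior features (share of tweets with a URL, share of duplicate tweets) are hypothetical. The classifier is a plain linear SVM trained by sub-gradient descent on the hinge loss, standing in for whatever SVM solver the thesis used. All function names and the toy data are invented for the example.

```python
import math
import re
from collections import Counter

URL_RE = re.compile(r"https?://\S+")

def tokenize(text):
    # Lowercase word tokens; URLs are stripped first so they do not become terms.
    return re.findall(r"[a-z']+", URL_RE.sub(" ", text.lower()))

def build_idf(users):
    # Assumption: each user's concatenated tweets form one document.
    docs = [set(tokenize(" ".join(tweets))) for tweets in users]
    df = Counter(term for doc in docs for term in doc)
    n = len(docs)
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

def text_features(tweets, idf, vocab):
    # L2-normalized TF-IDF vector over a fixed vocabulary.
    counts = Counter(tokenize(" ".join(tweets)))
    vec = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def behavior_features(tweets):
    # Hypothetical behavior features, not taken from the thesis.
    with_url = sum(1 for t in tweets if URL_RE.search(t)) / len(tweets)
    duplicates = 1.0 - len(set(tweets)) / len(tweets)
    return [with_url, duplicates]

def featurize(tweets, idf, vocab):
    # Concatenate the two feature groups into one vector per user.
    return text_features(tweets, idf, vocab) + behavior_features(tweets)

def train_linear_svm(X, y, lam=0.01, epochs=200, lr=0.05):
    # Sub-gradient descent on lam/2 * |w|^2 + hinge loss; labels are +1 / -1.
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            for j in range(len(w)):
                grad = lam * w[j] - (yi * xi[j] if margin < 1 else 0.0)
                w[j] -= lr * grad
            if margin < 1:
                b += lr * yi
    return w, b

def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Toy data: +1 marks a spammer account, -1 a legitimate one.
users = [
    ["win a free iphone http://spam.example/a", "win a free iphone http://spam.example/a"],
    ["free followers now http://spam.example/b", "free followers now http://spam.example/c"],
    ["had coffee with an old friend today", "reading a good book tonight"],
    ["great talk at the conference", "slides from my talk http://example.org/slides"],
]
labels = [1, 1, -1, -1]

idf = build_idf(users)
vocab = sorted(idf)
X = [featurize(tweets, idf, vocab) for tweets in users]
w, b = train_linear_svm(X, labels)
predictions = [predict(w, b, x) for x in X]
print(predictions)
```

On real data the vocabulary would need pruning (stop words, stemming, feature selection) and the classifier would be evaluated on held-out accounts rather than on the training set as in this toy run.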