
Graduate student: Shi-ting Zhong (鐘仕廷)
Thesis title: A Micro-blog Spammer Detection Framework Based on Mining User-Generated Context and Behavior
Advisor: Shi-Jinn Horng (洪西進)
Oral defense committee: Shih-Hsuan Yang (楊士萱), Yu-Chi Wu (吳有基), Yuan-Shin Hwang (黃元欣)
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Computer Science and Information Engineering
Year of publication: 2011
Academic year of graduation: 99 (ROC calendar)
Language: Chinese
Pages: 51
Keywords: Twitter, Information Retrieval Model, Text Mining, Support Vector Machine

Twitter is a social networking site on which each post, called a Tweet, is limited to 140 characters. Compared with traditional blogs, posts on Twitter are short, yet they can still carry image and video links, and the site lets users exchange information with one another. People use Twitter to find topics and posts that interest them. Unfortunately, Twitter is flooded with spam, which degrades the quality of Twitter search results and wastes network resources. This thesis aims to detect the accounts that spread spam on Twitter and so give users a cleaner environment. Before a classifier for spammers can be trained, the features that help the classification must be identified. We apply text mining combined with an information retrieval model to generate text-based features, and we observe users' posting patterns to generate user-behavior features. Finally, we use a Support Vector Machine (SVM) on the combination of these two feature vectors to train a classifier that automatically detects spam-spreading accounts on Twitter.
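
As a rough illustration of what per-account behavior features could look like, the sketch below computes a few statistics from one account's posts. It is not the thesis's implementation: the function name `behavior_features`, the regular expressions, the exact feature definitions, and the input format (a list of `(timestamp_in_seconds, text)` pairs) are all assumptions made for this example.

```python
import re
from statistics import mean, pstdev

# Hypothetical patterns; real tweets may need more careful tokenization.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")


def behavior_features(tweets):
    """Summarize one account's posting behavior.

    tweets: non-empty list of (timestamp_in_seconds, text) pairs (assumed format).
    Returns a dict of illustrative features, not the thesis's exact feature set.
    """
    if not tweets:
        raise ValueError("need at least one tweet")
    tweets = sorted(tweets)  # order by timestamp
    times = [ts for ts, _ in tweets]
    texts = [text for _, text in tweets]
    n = len(texts)

    # Remove URLs before counting mentions and hashtags, so that a URL
    # fragment such as "#section" is not counted as a hashtag.
    stripped = [URL_RE.sub("", t) for t in texts]
    lengths = [len(t) for t in texts]
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]

    return {
        "urls_per_tweet": sum(len(URL_RE.findall(t)) for t in texts) / n,
        "mentions_per_tweet": sum(len(MENTION_RE.findall(t)) for t in stripped) / n,
        "hashtags_per_tweet": sum(len(HASHTAG_RE.findall(t)) for t in stripped) / n,
        # Share of posts that repeat an earlier post verbatim.
        "duplicate_ratio": 1 - len(set(texts)) / n,
        "mean_length": mean(lengths),
        "std_length": pstdev(lengths),
        # Time between consecutive posts; zero when there is only one post.
        "mean_gap": mean(gaps) if gaps else 0.0,
        "std_gap": pstdev(gaps) if gaps else 0.0,
    }
```

A feature dictionary of this kind, concatenated with text-based features, is the sort of vector that could then be passed to an SVM trainer.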

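Chinese Abstract
English Abstract
Acknowledgements
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
  1.1 Research Motivation
  1.2 Research Objectives
  1.3 Thesis Organization
Chapter 2 Related Work
  2.1 Features of Twitter
    2.1.1 "@Replies and Mentions"
    2.1.2 "#Trending Topics"
    2.1.3 "Following"
  2.2 Definition and Types of Twitter Spammers
  2.3 Web Crawlers
  2.4 Regular Expressions
  2.5 Related Research
Chapter 3 System Architecture and Data Collection
  3.1 System Architecture and Workflow
  3.2 Data Collection: Writing the Web Crawler
  3.3 Description of the Data
Chapter 4 Research Methods
  4.1 Number of URL Links
  4.2 Number of "@Replies and Mentions"
  4.3 Number of "#Trending Topics"
  4.4 Ratio of Duplicated Tweet Text
  4.5 Average Tweet Posting Time
  4.6 Average Tweet Length
  4.7 Standard Deviation of Tweet Length
  4.8 Standard Deviation of Tweet Time
  4.9 Tweet Length Feature
  4.10 User Behavior Feature Data Set
Chapter 5 Building the Feature Set with Text Mining
  5.1 Text Mining Workflow
  5.2 Preprocessing of User Data
  5.3 Term Vectorization
  5.4 Term Feature Selection
    5.4.1 TF-IDF
    5.4.2 Information Gain
    5.4.3 Chi-square
    5.4.4 Term Ranking, Setting σ, and Term Feature Extraction
Chapter 6 Experimental Results
  6.1 Description of the Data Set
  6.2 Cross Validation (CV)
  6.3 Presentation of Experimental Results
    6.3.1 True Positive Rate (TPR, Recall)
    6.3.2 False Positive Rate (FPR)
    6.3.3 Precision
    6.3.4 F1-measure
    6.3.5 Accuracy
    6.3.6 Experimental Results and Comparison
Chapter 7 Conclusion
References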

