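Author: 宋文軒 (Wen-Xuan Song)
Thesis title: 基於微博用戶主題相似性和關係結構之社群挖掘方法
Micro-blog Community Detection via Topic Similarity and Relational Structure Mining
Advisors: 李育杰 (Yuh-Jye Lee), 鮑興國 (Hsing-Kuo Pao)
Oral defense committee: 項天瑞 (Tien-Ruey Hsiang), 蘇黎 (Li Su)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Year of publication: 2016
Academic year of graduation: 104 (ROC calendar)
Language: English
Number of pages: 74
Chinese keywords: 微博 (micro-blog), 社群挖掘 (community mining), 主題模型 (topic model)
English keywords: Micro-blog, community detection, topic similarity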
  • With the arrival of the Web 2.0 era, micro-blogging has risen rapidly as a new kind of Internet social networking service and has taken the world by storm with its "short, flat, and fast" character. As more and more people join this virtual community, once relatively independent Internet users have gradually formed a large and complex social network, and micro-blogs have become a second world for users on the Internet and a part of their daily lives. At present, Sina Weibo accounts for about 57% of micro-blog users and 87% of micro-blog activity in China, and it is one of the most-visited websites in China.

    This huge and complex micro-blog social network brings analysts many new opportunities and challenges: how users can find their own circles within such a large network; how operators and businesses can mine meaningful information from massive data to create commercial value; and how data analysts can analyze big data and make predictions from it. To address these problems, we apply community detection techniques. A "community" is a set of similar users. The micro-blog network is formed by real people who are connected by follow relationships. Traditional community detection in complex networks usually considers only the link structure. In an emerging and open network such as a micro-blog, however, each user's individual interests are equally important: users with similar interests form communities, and at the same time they are influenced by the attributes of those communities. This thesis aims to find a method that analyzes both the users' topic model and the follow-relationship structure among users, in order to obtain a community partition and topic distribution in which communities are similar in content and tightly connected in structure. All data sets used in the experiments come from Sina Weibo.
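
    The idea in the abstract of combining topic similarity with the follow structure can be illustrated with a small sketch. This record does not give the thesis's actual similarity formula or pruning rule, so everything below is an assumption made for illustration: per-user topic distributions (for example, from an LDA model) are compared with a Jensen-Shannon-based similarity, and each follow relation becomes a directed edge weighted by that similarity. The function names and the `min_weight` threshold are hypothetical.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions.

    Terms with p_i == 0 contribute nothing; q_i must be > 0 wherever p_i > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_similarity(p, q):
    """Similarity in [0, 1] derived from the Jensen-Shannon divergence.

    Assumption: this is one common choice, not necessarily the thesis's measure.
    """
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    js = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    # With natural logarithms the JS divergence is bounded above by ln 2.
    return 1.0 - js / math.log(2)


def weighted_follow_graph(follows, topics, min_weight=0.0):
    """Build a weighted directed graph from follow pairs.

    follows: iterable of (follower, followee) pairs.
    topics: dict mapping user -> topic distribution (list of probabilities).
    min_weight: hypothetical pruning threshold; weaker edges are dropped.
    Returns a dict of dicts: graph[follower][followee] = similarity weight.
    """
    graph = {}
    for follower, followee in follows:
        # Skip relations for which no topic distribution is available.
        if follower not in topics or followee not in topics:
            continue
        weight = js_similarity(topics[follower], topics[followee])
        if weight >= min_weight:
            graph.setdefault(follower, {})[followee] = weight
    return graph
```

    A community detection step could then operate on this weighted graph instead of the unweighted follow network, so that edges between users with dissimilar topics count for less or are removed entirely.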

    1 Introduction
        1.1 Background
        1.2 Motivation
        1.3 Our Work In this paper
        1.4 Organization of Thesis
    2 Related Work
        2.1 Topic Mining
            2.1.1 Traditional Topic Mining
            2.1.2 Topic Mining Based on Linear Algebra
            2.1.3 Topic Mining Based on Topic Model
        2.2 Community Detection
    3 Introduction of LDA
        3.1 Latent Dirichlet Allocation
        3.2 Mixture modelling & Generative modelling
        3.3 Likelihoods
        3.4 Inference via Gibbs Sampling
        3.5 Joint Probability Distribution
        3.6 Parameter Estimation
        3.7 New Come Document
    4 Algorithm and Method
        4.1 User Similarity
            4.1.1 The Similarity of Inner Social Network
            4.1.2 The Similarity of Surface Social Network
            4.1.3 The Similarity of Intermediate Social Network
            4.1.4 Compute User Similarity
        4.2 Graph Structure
            4.2.1 Pruning Process
            4.2.2 Weighted Directed Graph
        4.3 Community Detection
            4.3.1 Initial Center Cluster Selection
            4.3.2 Community Mining
            4.3.3 Community Merge
    5 Experiment and Result
        5.1 Experiment Process
        5.2 Data Format
        5.3 Experiment Set
        5.4 Text Data Process
            5.4.1 Stop Key
            5.4.2 Text Segmentation
        5.5 Building Graph Structure
            5.5.1 Pruning Process
            5.5.2 LDA Model
            5.5.3 User Similarity
            5.5.4 Weighted Directed Graph
        5.6 Detect The Community
            5.6.1 Initial Center Cluster Selection
            5.6.2 Community Mining
            5.6.3 Community Merge
        5.7 Evaluation Index
        5.8 Discussion
            5.8.1 The Comparison of MMDA and Canopy
            5.8.2 The Comparison of CDRSTS and FN
        5.9 Result
        5.10 Community Visualization
    6 Conclusion and Future Work
        6.1 Conclusion
        6.2 Future Work


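    Full-text release date: 2021/08/10 (campus network)
    Full-text release date: 2024/08/10 (off-campus network)
    Full-text release date: 2024/08/10 (National Central Library: Taiwan theses and dissertations system)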