
Graduate Student: 張鶴霖 (Hengky Arviano)
Thesis Title: 使用隱含狄利克雷分佈從社交網路資料中提取熱點話題 - 詞袋和TF-IDF預處理方式的探討
Extracting Hot Topics Using Latent Dirichlet Allocation from Social Network Data – A Study of Bag of Words and TF-IDF Preprocessing
Advisor: 楊朝龍 (Chao-Lung Yang)
Oral Defense Committee: 林希偉 (Shi-Woei Lin), 陳怡伶 (Yi-Ling Chen)
Degree: Master
Department: College of Management - Department of Industrial Management
Year of Publication: 2021
Graduation Academic Year: 109
Language: English
Number of Pages: 59
Chinese Keywords: 熱點話題分析 (Hot Topic Analysis)
Foreign-Language Keywords: Hot Topic Analysis, Latent Dirichlet Allocation (LDA), Bigrams, Lemmatization, Bag of Words
Access Count: 142 views, 4 downloads
Chinese Abstract (translated): Social network activity is growing rapidly, and a large amount of data continues to accumulate on social network platforms. Identifying important or trending topics in social networks has attracted the attention of researchers, the media, and even businesses. This study uses Latent Dirichlet Allocation (LDA) to extract hot topics from social networks and compares preprocessing based on Bag of Words (BOW) and Term Frequency–Inverse Document Frequency (TF-IDF). LDA is a pioneering algorithm for extracting topics from a corpus or collection of documents. In this study, a series of preprocessing experiments based on bigrams, lemmatization, and vectorization was designed, and BOW and TF-IDF were used to compare the coherence of the extracted topics. In addition, the parameters of LDA were tuned over a range of values to obtain the most coherent topic-modeling result. Topic diversity and the average size of each topic cluster were used to evaluate the extracted hot topics. The experimental results show that BOW and TF-IDF vectorization can semantically increase the diversity of hot topics.


English Abstract: Social network activity is growing rapidly, and vast amounts of information are accumulating on social network platforms. Identifying important or trending topics in social networks has attracted the attention of researchers, the media, and even businesses. This research presents a study of applying Latent Dirichlet Allocation (LDA) with bag-of-words (BOW) and Term Frequency–Inverse Document Frequency (TF-IDF) preprocessing to extract hot topics from social networks. LDA is a pioneering algorithm for extracting topics from a corpus or collection of documents. In this work, a set of preprocessing experiments based on bigrams, lemmatization, and vectorization with BOW and TF-IDF was designed to examine topic coherence. Additionally, the LDA parameters were tuned over a range of values to obtain the most coherent topic-modeling result. Topic diversity and the average size of each topic cluster are used to evaluate the extracted topics. Experimental results show that BOW and TF-IDF vectorization can semantically increase topic diversity.
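The following is a minimal sketch of the pipeline described in the abstract: spaCy lemmatization, bigram detection, BOW versus TF-IDF vectorization, LDA fitting, coherence scoring, and topic diversity. The abstract names the techniques but not the implementation, so the choice of gensim, the helper-function names, and all parameter values (bigram thresholds, number of passes, alpha, topn) are illustrative assumptions rather than the thesis's actual code.

```python
# Sketch of the described workflow; library choice (gensim) and parameters are assumptions.
import spacy
from gensim.corpora import Dictionary
from gensim.models import Phrases, TfidfModel, LdaModel, CoherenceModel

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def preprocess(docs):
    """Keep alphabetic, non-stopword lemmas, then merge frequently co-occurring bigrams."""
    texts = [[tok.lemma_.lower() for tok in nlp(doc)
              if tok.is_alpha and not tok.is_stop] for doc in docs]
    bigram = Phrases(texts, min_count=5, threshold=100)  # assumed thresholds
    return [bigram[text] for text in texts]

def extract_topics(texts, num_topics=10, use_tfidf=False):
    """Fit LDA on a BOW corpus, optionally re-weighted with TF-IDF, and score coherence."""
    dictionary = Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]
    if use_tfidf:
        corpus = TfidfModel(corpus)[corpus]          # TF-IDF-weighted corpus
    lda = LdaModel(corpus=corpus, id2word=dictionary,
                   num_topics=num_topics, alpha="auto", passes=10)
    coherence = CoherenceModel(model=lda, texts=texts, dictionary=dictionary,
                               coherence="c_v").get_coherence()
    return lda, coherence

def topic_diversity(lda, topn=25):
    """Fraction of unique words across the top-n words of every topic."""
    words = [w for t in range(lda.num_topics)
             for w, _ in lda.show_topic(t, topn=topn)]
    return len(set(words)) / len(words)
```

In line with the abstract, the two vectorization settings (use_tfidf=False and use_tfidf=True) could be compared by sweeping num_topics and other LDA parameters, keeping the configuration with the highest coherence, and then reporting topic diversity for each setting.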

Table of Contents:
摘要 (Chinese Abstract) ii
ABSTRACT iii
TABLE OF CONTENTS iv
LIST OF FIGURE vi
LIST OF TABLE vii
CHAPTER I 1
CHAPTER 2 4
  2.1 Hot Topics Analysis 4
  2.2 Coherence Measure 6
CHAPTER 3 8
  3.1 Framework 8
  3.2 Data Preprocessing 9
    3.2.1 SpaCy 9
    3.2.2 Bigrams 9
    3.2.3 Lemmatization 10
  3.3 BOW and TF-IDF 11
    3.3.1 BOW 11
    3.3.2 TF-IDF 12
  3.4 LDA 14
  3.5 Coherence Measurement 16
  3.6 Topic Diversity 17
  3.7 Radius of Topics 18
CHAPTER 4 20
  4.1 Data Description 20
  4.2 Hyper Parameters Determination of LDA 20
  4.3 Experimental Description 21
    4.3.1 Experiment 1 21
    4.3.2 Experiment 2 22
    4.3.3 Experiment 3 24
    4.3.4 Experiment 4 24
    4.3.5 Chinese Dataset 25
  4.4 Experimental Result 26
CHAPTER 5 30
REFERENCE 32
APPENDIX 35
APPENDIX A 35
  Experiment #1 35
  Experiment #2 39
  Experiment #3 43
  Experiment #4 47

