
Graduate student: Vincent Utomo
Thesis title: Automatic News-Roundup Generation using Clustering, Extraction, and Presentation
Advisor: Jenq-Shiou Leu (呂政修)
Oral defense committee: Chang-Hong Lin (林昌鴻)
Ray-Guang Cheng (鄭瑞光)
Kam-Fung Yuen (袁錦鋒)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Electronic and Computer Engineering
Year of publication: 2018
Graduation academic year: 106
Language: English
Number of pages: 108
Keywords: information system, information overload, news article, search engine, document clustering, search result clustering, sub-topic discovery, technical experiment, composite keyword density, content extraction, information retrieval, web mining, user query, second-stage noise removal

Along with the growth of the internet, the amount of published information has increased exponentially. This huge flow of information causes an information overload problem, which leaves internet users struggling to find the key information they need. To address this, this paper proposes an application that helps users easily find trending news for their query or interest. The application has three core modules: clustering, extraction, and presentation.
Clustering is required to separate the subtopics found, so that the information received is not blurred or mixed. While clustering is a research topic with a rich history, no previous work has applied it to the domain of news articles. Several methods are tested in this study, including naïve, manual-thresholding, and heuristic clustering methods. The results show that hierarchical clustering with tf-idf word weighting, cosine similarity as the distance measure, and heuristic termination by elbow-point analysis achieves the best result, at 50.84% accuracy (Acc) and 61.96% normalized mutual information (NMI) when compared to human standards.
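The best-performing configuration named above (tf-idf weighting, cosine distance, hierarchical clustering stopped at an elbow point) can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the abstract does not state the linkage criterion or the exact elbow rule, so average linkage and "stop before the largest jump in merge distance" are assumptions, and all function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map tokenised documents to sparse tf-idf vectors (dicts)."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_distance(a, b):
    """1 - cosine similarity; all-zero vectors are treated as maximally distant."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def agglomerate(vectors):
    """Average-linkage agglomerative clustering (linkage choice is an assumption).

    Returns a list of (merge_distance, clusters_after_merge) tuples.
    """
    def linkage(c1, c2):
        total = sum(cosine_distance(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    clusters = [[i] for i in range(len(vectors))]
    history = []
    while len(clusters) > 1:
        dist, a, b = min(
            (linkage(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
        history.append((dist, [sorted(c) for c in clusters]))
    return history


def cut_at_elbow(vectors):
    """Stop merging just before the largest jump in merge distance (assumed elbow rule)."""
    history = agglomerate(vectors)
    if len(history) < 2:
        return [[i] for i in range(len(vectors))]
    distances = [dist for dist, _ in history]
    jumps = [distances[i] - distances[i - 1] for i in range(1, len(distances))]
    elbow = jumps.index(max(jumps)) + 1  # index of the first merge after the jump
    return history[elbow - 1][1]


if __name__ == "__main__":
    headlines = [
        "apple iphone launch event",
        "apple iphone launch price",
        "football league final score",
        "football league final goal",
    ]
    docs = [headline.split() for headline in headlines]
    print(cut_at_elbow(tfidf_vectors(docs)))
```

Each returned cluster is a list of document indices. The quadratic search for the closest pair keeps the sketch short; it is only practical for small result sets such as a single page of search results.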
Web content extraction is also a well-known research topic with a considerable history of development. One commonly faced challenge is that an extraction method tends to be less effective on web pages from a different time period than the one it was developed for. In this paper, the researcher approaches content extraction from a different angle: using a previously known subject or keyword to help the extraction process. The researcher describes how to assign a score to each node in the DOM tree, using the keywords, text, and tags inside the node, in order to find the main content of a web page. A second stage of noise removal is also introduced to further remove noise remaining within the content block. The evaluation results show that, by using prior knowledge, it is possible to create a time-resistant extraction algorithm that gives good and steady results compared to existing algorithms, with a score improvement of 7.48%.
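The idea of scoring DOM nodes with a known keyword can be illustrated with a small sketch. The abstract does not give the composite keyword density formula, so the score below (keyword hits times word count, divided by one plus the number of descendant tags) is a hypothetical stand-in chosen only to show the mechanism; the function names and the sample page are likewise assumptions. The second noise-removal stage is not modelled here.

```python
import re
import xml.etree.ElementTree as ET


def node_score(node, keywords):
    """Hypothetical score: favours keyword-rich, text-heavy, tag-light nodes."""
    words = re.findall(r"\w+", " ".join(node.itertext()).lower())
    hits = sum(1 for word in words if word in keywords)
    descendant_tags = sum(1 for _ in node.iter()) - 1  # iter() includes the node itself
    return hits * len(words) / (1 + descendant_tags)


def find_main_content(root, keywords):
    """Return the element with the highest score anywhere in the tree."""
    return max(root.iter(), key=lambda node: node_score(node, keywords))


# A well-formed sample page (assumption: real pages need a tolerant HTML parser).
PAGE = (
    '<html><body>'
    '<div id="nav"><a href="/">Home</a><a href="/s">Sports</a><a href="/w">World</a></div>'
    '<div id="story">'
    '<p>The typhoon made landfall on Tuesday morning.</p>'
    '<p>Officials said the typhoon forced thousands to evacuate.</p>'
    '</div>'
    '<div id="footer"><a href="/c">Contact</a><p>Related: typhoon photos</p></div>'
    '</body></html>'
)

if __name__ == "__main__":
    best = find_main_content(ET.fromstring(PAGE), {"typhoon"})
    print(best.get("id"))
```

On this sample the story block scores higher than its individual paragraphs (it collects both keyword hits) and higher than the page body (which is diluted by navigation and footer tags), which is the behaviour a keyword-assisted score is meant to produce.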
Helping to solve the information overload problem also requires a good presentation method. Several presentation methods, backed by a user questionnaire, are used in this study. Respondents gave the final proposed application a score of 4.18 out of 5 for helpfulness and 4.35 out of 5 for effectiveness, indicating that the proposed application can help users find information and help solve the information overload problem.



ABSTRACT
ACKNOWLEDGEMENTS
CONTENTS
LIST OF FIGURES
LIST OF TABLES
LIST OF ALGORITHMS
LIST OF EQUATIONS
CHAPTER 1 INTRODUCTION
1.1 Research Background
1.1.1 Clustering Background
1.1.2 Extraction Background
1.2 Research Objective
1.3 Research Scope and Constraint
1.4 Outline and Report
CHAPTER 2 LITERATURE REVIEW
2.1 Clustering
2.1.1 JSON News-Search Result
2.1.2 Document conversion to Word Vector
2.1.3 PRGraph
2.1.4 Famous Clustering Algorithms
2.1.5 Clustering Analysis
2.1.6 Elbow Analysis
2.1.7 Word weighting
2.1.8 Distance Function
2.2 Extraction
2.2.1 DOM Tree
2.2.2 Webpage Noises
2.3 Presentation
2.3.1 User Search Behavior
2.3.2 SumBasic
CHAPTER 3 METHODOLOGY
3.1 Application Process
3.2 Clustering
3.2.1 Consideration for Clustering
3.2.2 Naïve Bidirectional Phrase-Graph Method
3.2.3 Manual Thresholding of Clustering Algorithms
3.2.4 Heuristic Method
3.2.5 Combination using Date Published
3.3 Extraction
3.3.1 Composite Keyword Density
3.3.2 Selecting the Main Content
3.4 Presentation
3.4.1 User Preference
CHAPTER 4 EVALUATION RESULTS
4.1 Clustering
4.1.1 Datasets
4.1.2 Performance Metrics
4.1.3 Clustering using Manual Threshold Selection
4.1.4 Heuristic Analysis
4.1.5 Results
4.2 Extraction
4.2.1 Implementations
4.2.2 Datasets
4.2.3 Performance Metrics
4.2.4 The influence of threshold modification
4.2.5 Experiment on Keyword as time-resistant feature
4.2.6 Subject Assisted Content Extraction (SACE)
4.2.7 Comparison with other method
4.3 Presentation
4.3.1 General characteristic of using search engine
4.3.2 Assessment of the Application
CHAPTER 5 CONCLUSION AND FUTURE WORKS
5.1 Conclusion
5.1.1 Clustering
5.1.2 Extraction
5.2 Future Works
REFERENCES

