Automatic News-Roundup Generation using Clustering, Extraction, and Presentation

Graduate student:	Vincent Utomo
Thesis title:	Automatic News-Roundup Generation using Clustering, Extraction, and Presentation
Advisor:	Jenq-Shiou Leu (呂政修)
Oral defense committee:	Chang-Hong Lin (林昌鴻), Ray-Guang Cheng (鄭瑞光), Kam-Fung Yuen (袁錦鋒)
Degree:	Master
Department:	College of Electrical Engineering and Computer Science, Department of Electronic and Computer Engineering
Year of publication:	2018
Academic year of graduation:	106 (ROC academic year)
Language:	English
Number of pages:	108
Keywords:	information system, information overload, news article, search engine, document clustering, search result clustering, sub-topic discovery, technical experiment, composite keyword density, content extraction, information retrieval, web mining, user query, second-stage noise removal

Along with the growth of the internet, the amount of published information has increased exponentially. This huge flow of information causes an information overload problem: internet users have difficulty finding the key information they need. To address this, this paper proposes an application that helps users easily find trending news related to their query or interest. The three core modules of the application are clustering, extraction, and presentation.
Clustering is required to separate the subtopics that are found, so that users do not receive blurred or mixed information. While clustering is a research topic with a rich history, none of the existing work applies it to the domain of news articles. Several methods are tested in this study, including naïve, manual-thresholding, and heuristic clustering methods. The results show that hierarchical clustering using tf-idf word weighting, with cosine similarity as the distance measure and heuristic termination by elbow-point analysis, achieves the best result, at 50.84% accuracy (Acc) and 61.96% normalized mutual information (NMI) when compared to human standards.
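The best-performing pipeline described above (tf-idf weighting, cosine distance, hierarchical merging, elbow-based termination) can be sketched as follows. This is a minimal illustration, not the thesis implementation: the function names are hypothetical, average linkage is assumed because the abstract does not name a linkage criterion, and the "elbow" is approximated as the largest jump between consecutive merge distances.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn tokenised documents into sparse tf-idf vectors (term -> weight)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return vectors


def cosine_distance(a, b):
    """1 - cosine similarity of two sparse vectors; 1.0 if either is all zero."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    if na == 0 or nb == 0:
        return 1.0
    return 1.0 - dot / (na * nb)


def cluster_with_elbow(vectors):
    """Average-linkage agglomerative clustering, cut at the largest distance jump.

    Assumption: the elbow is taken as the biggest gap between consecutive
    merge distances; the thesis may define its elbow point differently.
    """
    dist = [[cosine_distance(a, b) for b in vectors] for a in vectors]
    clusters = [[i] for i in range(len(vectors))]
    history = [[list(c) for c in clusters]]  # history[m] = partition after m merges
    merge_dists = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                total = sum(dist[a][b] for a in clusters[i] for b in clusters[j])
                d = total / (len(clusters[i]) * len(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        merge_dists.append(d)
        history.append([list(c) for c in clusters])
    if len(merge_dists) < 2:
        return sorted(sorted(c) for c in history[0])  # too few merges to find a gap
    gaps = [merge_dists[k + 1] - merge_dists[k] for k in range(len(merge_dists) - 1)]
    k_star = max(range(len(gaps)), key=gaps.__getitem__)
    return sorted(sorted(c) for c in history[k_star + 1])


# Toy headlines on two unrelated subtopics (made up for illustration).
docs = [s.split() for s in [
    "quake hits city rescue teams",
    "rescue teams search quake rubble",
    "quake rubble city rescue",
    "team wins cup final match",
    "cup final match goal",
    "goal wins match cup",
]]
print(cluster_with_elbow(tfidf_vectors(docs)))
```

Cutting at the largest gap stops merging just before two dissimilar groups would be joined, so the number of subtopics does not have to be fixed in advance.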
Web content extraction is also a well-known research topic with a considerable history of development. One commonly faced challenge is that an extraction algorithm tends to be less effective on web pages from a time period different from the one in which it was developed. This paper looks at content extraction from a different angle: using a subject or keyword known in advance to assist the extraction process. It describes how to assign a score to each node in the DOM tree, using the keywords, text, and tags inside the node, in order to find the main content of a webpage. A second-stage noise removal process is also introduced to further remove noise that remains within the content block. The evaluation results show that, by using prior knowledge, it is possible to create a time-resistant extraction algorithm that gives good and steady results compared to existing algorithms, with a score improvement of 7.48%.
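The idea of scoring every DOM node from its keywords, text, and tags can be illustrated with the sketch below. The scoring formula here (characters per tag, boosted by keyword hits) is a hypothetical stand-in chosen for illustration; the abstract does not give the actual composite keyword density formula, and the second-stage noise removal is not shown. The names `Node`, `TreeBuilder`, `score`, and `main_content` are likewise assumptions.

```python
from html.parser import HTMLParser

VOID = {"br", "img", "hr", "meta", "link", "input"}  # tags without a closing tag
SKIP = {"script", "style"}                           # text here is never content


class Node:
    def __init__(self, tag, attrs=None, parent=None):
        self.tag, self.attrs, self.parent = tag, dict(attrs or []), parent
        self.children, self.text = [], ""


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM tree; assumes reasonably well-formed HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.cur = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.cur)
        self.cur.children.append(node)
        if tag not in VOID:
            self.cur = node

    def handle_endtag(self, tag):
        node = self.cur
        while node is not None and node.tag != tag:
            node = node.parent  # tolerate unclosed inner tags
        if node is not None and node.parent is not None:
            self.cur = node.parent

    def handle_data(self, data):
        if self.cur.tag not in SKIP:
            self.cur.text += data


def pieces(node):
    """All text fragments in the subtree rooted at node."""
    out = [node.text]
    for child in node.children:
        out.extend(pieces(child))
    return out


def count_tags(node):
    """Number of element nodes in the subtree, including node itself."""
    return 1 + sum(count_tags(c) for c in node.children)


def score(node, keywords):
    """Hypothetical score: text density (chars per tag) times (1 + keyword hits)."""
    text = " ".join(pieces(node)).lower()
    chars = len("".join(text.split()))                  # non-whitespace characters
    hits = sum(text.count(k.lower()) for k in keywords)  # substring matches
    return chars / count_tags(node) * (1 + hits)


def all_nodes(node):
    yield node
    for child in node.children:
        yield from all_nodes(child)


def main_content(html, keywords):
    """Return the highest-scoring node for the given prior keywords."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return max(all_nodes(builder.root), key=lambda n: score(n, keywords))


# Made-up page: a navigation bar, a three-paragraph story, and a footer.
PAGE = """<html><body>
<div id="nav"><a href="/">Home</a><a href="/w">World</a><a href="/s">Sport</a></div>
<div id="story"><p>The earthquake struck the coast before dawn on Monday.</p>
<p>Rescue teams said the earthquake damaged hundreds of homes.</p>
<p>Officials expect earthquake aftershocks for several days.</p></div>
<div id="footer"><a href="/about">About</a><span>Contact us</span></div>
</body></html>"""

best = main_content(PAGE, ["earthquake"])
print(best.tag, best.attrs)
```

Because the keyword comes from the user's query rather than from page layout, a score of this kind does not depend on template conventions that change over time, which is the intuition behind the time-resistance claim.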
To help solve the information overload problem, a good presentation method is also required. Several presentation methods, backed by a user questionnaire, are used in this study. Respondents gave the final proposed application a score of 4.18 out of 5 for helpfulness and 4.35 out of 5 for effectiveness, indicating that the proposed application can help users find information and help alleviate the information overload problem.

ABSTRACT
ACKNOWLEDGEMENTS
CONTENTS
LIST OF FIGURES
LIST OF TABLES
LIST OF ALGORITHMS
LIST OF EQUATIONS
CHAPTER 1 INTRODUCTION
1    Research Background
1.1    Clustering Background
1.2    Extraction Background
2    Research Objective
3    Research Scope and Constraint
4    Outline and Report
CHAPTER 2 LITERATURE REVIEW
1    Clustering
1.1    JSON News-Search Result
1.2    Document conversion to Word Vector
1.3    PRGraph
1.4    Famous Clustering Algorithms
1.5    Clustering Analysis
1.6    Elbow Analysis
1.7    Word weighting
1.8    Distance Function
2    Extraction
2.1    DOM Tree
2.2    Webpage Noises
3    Presentation
3.1    User Search Behavior
3.2    SumBasic
CHAPTER 3 METHODOLOGY
1    Application Process
2    Clustering
2.1    Consideration for Clustering
2.2    Naïve Bidirectional Phrase-Graph Method
2.3    Manual Thresholding of Clustering Algorithms
2.4    Heuristic Method
2.5    Combination using Date Published
3    Extraction
3.1    Composite Keyword Density
3.2    Selecting the Main Content
4    Presentation
4.1    User Preference
CHAPTER 4 EVALUATION RESULTS
1    Clustering
1.1    Datasets
1.2    Performance Metrics
1.3    Clustering using Manual Threshold Selection
1.4    Heuristic Analysis
1.5    Results
2    Extraction
2.1    Implementations
2.2    Datasets
2.3    Performance Metrics
2.4    The influence of threshold modification
2.5    Experiment on Keyword as time-resistant feature
2.6    Subject Assisted Content Extraction (SACE)
2.7    Comparison with other method
3    Presentation
3.1    General characteristic of using search engine
3.2    Assessment of the Application
CHAPTER 5 CONCLUSION AND FUTURE WORKS
1    Conclusion
1.1    Clustering
1.2    Extraction
2    Future Works
REFERENCES

                                

