As the internet has grown, the amount of published information has increased exponentially. This flood of information causes information overload, which makes it difficult for internet users to find the key information they need. To address this, this paper proposes an application that helps users easily find trending news related to their query or interest. The application has three core modules: clustering, extraction, and presentation.
Clustering is required to separate the subtopics found, so that users do not receive blurred or mixed information. Although clustering is a research topic with a rich history, no previous work has applied it to the domain of news articles. Several methods are tested in this study, including naïve, manual-thresholding, and heuristic clustering methods. The results show that hierarchical clustering with tf-idf term weighting, cosine similarity as the distance measure, and heuristic termination by elbow-point analysis achieves the best result, with 50.84% accuracy (Acc) and 61.96% normalized mutual information (NMI) when compared with a human-made standard.
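The best-performing pipeline named above (tf-idf weighting, cosine distance, hierarchical clustering stopped at an elbow point) can be illustrated with a minimal pure-Python sketch. The abstract does not specify the linkage criterion or the exact elbow rule, so this sketch assumes average linkage and treats the largest jump between consecutive merge distances as the elbow. The function names (`tfidf_vectors`, `cosine_distance`, `cluster_with_elbow`) and the sample headlines are hypothetical and are not taken from the study.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (dict) per document, using whitespace tokens."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed idf (an assumption): terms found in every document keep a small weight.
        vectors.append({t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()})
    return vectors


def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity) between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return 1.0 - dot / norm if norm else 1.0


def cluster_with_elbow(docs):
    """Agglomerative clustering (average linkage, assumed) cut at the largest distance jump."""
    vecs = tfidf_vectors(docs)
    n = len(vecs)
    dist = [[cosine_distance(vecs[i], vecs[j]) for j in range(n)] for i in range(n)]
    clusters = [[i] for i in range(n)]
    history = [[list(c) for c in clusters]]  # history[k] is the partition after k merges
    merge_dists = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest average pairwise distance.
        d, a, b = min(
            (sum(dist[i][j] for i in clusters[x] for j in clusters[y])
             / (len(clusters[x]) * len(clusters[y])), x, y)
            for x in range(len(clusters)) for y in range(x + 1, len(clusters))
        )
        merge_dists.append(d)
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
        history.append([list(c) for c in clusters])
    if len(merge_dists) < 2:
        return history[0]  # too few merges to detect an elbow; leave documents unmerged
    # Elbow heuristic (assumed): keep every merge made before the largest jump in distance.
    jumps = [merge_dists[k + 1] - merge_dists[k] for k in range(len(merge_dists) - 1)]
    return history[jumps.index(max(jumps)) + 1]


headlines = [
    "stock market shares fall",
    "market shares rise stock",
    "football team wins match",
    "team loses football match",
]
print(sorted(sorted(c) for c in cluster_with_elbow(headlines)))  # [[0, 1], [2, 3]]
```

The elbow cut removes the need for a hand-tuned distance threshold, which is presumably why a heuristic termination rule compares favorably with manual thresholding; a production version would more likely rely on an established clustering library than on this quadratic loop.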
Web content extraction is also a well-known research topic with a long history of development. One commonly faced challenge is that an extractor tends to be less effective on pages from a time period different from the one in which it was developed. This paper approaches content extraction from a different angle: it uses a subject or keyword known in advance to assist the extraction process. It describes how to assign a score to each node in the DOM tree, using the keywords, text, and tags inside the node, in order to find the main content of a webpage. A second-stage noise removal process is also introduced to remove noise that remains within the content block. The evaluation shows that, with prior knowledge, it is possible to create a time-resistant extraction algorithm that gives good, steady results compared with existing algorithms, with a score improvement of 7.48%.
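The idea of scoring DOM nodes by keyword, text, and tag can be illustrated with a small sketch built on Python's standard `html.parser`. The abstract does not give the scoring formula, so the one used here is an assumption chosen for illustration: a node's score is the word count of the text held by the node and its immediate children, multiplied by one plus the number of keyword hits, with everything under a hypothetical list of noise tags (`NOISE_TAGS`) ignored. The names `TreeBuilder`, `score`, `find_main_content`, and `subtree_text` are likewise hypothetical, and the second-stage noise removal is not sketched.

```python
from html.parser import HTMLParser

NOISE_TAGS = {"script", "style", "nav", "header", "footer", "aside", "form"}  # assumed list
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent = tag, parent
        self.children, self.text = [], []


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM tree: tags and the text directly inside each tag."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.text.append(data.strip())


def score(node, keywords):
    """Illustrative score: words in the node and its direct children, boosted by keyword hits."""
    total = 0
    for block in [node] + node.children:
        if block.tag in NOISE_TAGS:
            continue
        words = [w.strip(".,;:!?\"'()") for w in " ".join(block.text).lower().split()]
        hits = sum(words.count(k) for k in keywords)
        total += len(words) * (1 + hits)
    return total


def find_main_content(html, keywords):
    """Return the highest-scoring node, skipping everything under a noise tag."""
    builder = TreeBuilder()
    builder.feed(html)
    keywords = [k.lower() for k in keywords]
    best, best_score = None, 0
    stack = [builder.root]
    while stack:
        node = stack.pop()
        if node.tag in NOISE_TAGS:
            continue
        node_score = score(node, keywords)
        if node_score > best_score:
            best, best_score = node, node_score
        stack.extend(node.children)
    return best


def subtree_text(node):
    """Text of a node's subtree (direct text first, then children), without noise tags."""
    parts = list(node.text)
    parts += [subtree_text(c) for c in node.children if c.tag not in NOISE_TAGS]
    return " ".join(p for p in parts if p)


page = """<html><body>
<nav><a href="/">Home</a> <a href="/world">World earthquake news</a></nav>
<div id="main">
<p>A strong earthquake struck the coast on Monday morning.</p>
<p>Officials said the earthquake damaged several buildings.</p>
</div>
<footer>Copyright notice and contact links</footer>
</body></html>"""
main = find_main_content(page, ["earthquake"])
print(main.tag, "|", subtree_text(main))
```

Because the score depends on the query keyword and on generic tag roles rather than on site-specific templates, a scheme of this kind does not need retraining when page layouts change, which is one plausible reading of the "time-resistant" property described above.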
A good presentation method is also required to help solve the information overload problem. Several presentation methods, backed by human questionnaires, are used in this study. Respondents rated the final proposed application 4.18 out of 5 for helpfulness and 4.35 out of 5 for effectiveness, which suggests that the application can help users find information and ease the information overload problem.