
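Author: 葉柏毅 (Po-Yi Yen)
Thesis title: Block Importance Evaluation for Multi-Template Web Sites by Using Template Clustering (以樣版分群方法評估網頁區塊重要性-應用於多樣版網站之研究)
Advisor: 李漢銘 (Hahn-Ming Lee)
Oral defense committee: 許清琦 (Ching-Chi Hsu), 何正信 (Cheng-Seen Ho), 何建明 (Jan-Ming Ho), 李育杰 (Yuh-Jye Lee)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Year of publication: 2005
Academic year of graduation: 93 (ROC calendar)
Language: English
Number of pages: 58
Keywords (Chinese): 資訊擷取 (information extraction), 分群 (clustering), 網路探勘 (web mining)
Keywords (English): Informative Extraction, Clustering, Web Mining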
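  • On commercial web sites, the blocks of a web page may carry information of differing importance. Evaluating block importance is therefore an important task, because it lets unimportant blocks be removed for web mining or for browsing the web on small-screen devices. When applying current block evaluation methods to multi-template web sites, we found two problems: the multi-template problem and the problem of important blocks with little content. Both problems reduce the accuracy of block evaluation.
    In this thesis, we propose a new block evaluation technique that addresses these two problems. We use template clustering to group blocks with similar templates and then analyze each group of blocks separately, so that a multi-template web site is reduced to the single-template case. The experimental analysis shows that the proposed technique can be applied to multi-template web sites and does improve quality.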


    The blocks of a web page may not be equally important, especially on commercial web sites. Block importance evaluation is therefore an important task: it allows noisy blocks to be removed before web mining or when browsing the web on small-screen devices. We find that current block importance evaluation approaches run into two problems when a web site uses several predefined templates: the multi-template problem and the problem of informative blocks with little content. Both problems reduce the precision of block importance evaluation.
    In this thesis, we propose a novel block importance evaluation method, named Block Analyzer, to solve the multi-template problem and the problem of informative blocks with little content. The method uses template clustering to group blocks that share a similar template and then analyzes each cluster individually, effectively reducing a multi-template web site to the single-template case. Experiments on several news web sites with multiple predefined templates show that Block Analyzer works well on multi-template web sites and improves performance.
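
    The cluster-then-analyze idea in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's actual algorithm: the block representation (`page_id`, `tag_path`, `text`), the use of an exact tag-path match as the clustering criterion, and the distinct-text ratio used as an importance score are all assumptions made for illustration. A real system would need a similarity measure over block structure rather than exact matching.

```python
from collections import defaultdict

# Hypothetical input: each block is (page_id, tag_path, text).
# tag_path is an assumed stand-in for the block's template structure.


def cluster_by_template(blocks):
    """Group blocks whose structural signature (tag path) is identical."""
    clusters = defaultdict(list)
    for page_id, tag_path, text in blocks:
        clusters[tag_path].append((page_id, text))
    return dict(clusters)


def cluster_importance(members):
    """Score a cluster by the share of distinct texts among its blocks.

    Assumption: text repeated verbatim across pages (menus, footers)
    is noise, while text that differs from page to page is informative.
    """
    texts = [text for _, text in members]
    return len(set(texts)) / len(texts)


blocks = [
    ("p1", "body/div.nav", "Home | World | Sport"),
    ("p2", "body/div.nav", "Home | World | Sport"),
    ("p3", "body/div.nav", "Home | World | Sport"),
    ("p1", "body/div.story", "Election results announced"),
    ("p2", "body/div.story", "Storm hits the coast"),
    ("p3", "body/table.score", "Team A 2 - 1 Team B"),
]

clusters = cluster_by_template(blocks)
scores = {path: cluster_importance(m) for path, m in clusters.items()}
for path, score in sorted(scores.items()):
    print(f"{path}: {score:.2f}")
```

    In this sketch each template cluster is scored on its own, so a short block such as the score table is judged against blocks of the same template rather than against long article bodies, which is one plausible way to read the abstract's concern about informative blocks with little content.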

    Abstract
    Acknowledgements
    Content
    List of Figures
    List of Tables
    Chapter 1 Introduction
      1.1 Motivation
      1.2 Problems of current block importance evaluating methods
        1.2.1 Multi-template problem
        1.2.2 The problem of informative block with fewer contents
      1.3 Goals
      1.4 Outlines of the thesis
    Chapter 2 Background
      2.1 Block importance evaluation
      2.2 Approaches for block importance evaluation
        2.2.1 Web site based approaches
          2.2.1.1 Data-rich Subtree Extraction (DSE)
          2.2.1.2 Site Style Tree based approaches
          2.2.1.3 Link Analysis of Mining Informative Structure (LAMIS) and Discovering informative content blocks (InfoDiscoverer)
        2.2.2 Web page based approaches
          2.2.2.1 Presentational layout analysis
          2.2.2.2 Block feature based approach
      2.3 Summary for related work
    Chapter 3 Block Analyzer
      3.1 Concept of Block Analyzer
      3.2 System architecture of Block Analyzer
        3.2.1 Page Segmentation Unit
        3.2.2 Structure Based Clustering Unit
        3.2.3 Cluster Importance Degree Analyzer
        3.2.4 Pattern Matching Unit
      3.3 Characteristics of Block Analyzer
    Chapter 4 Experiment
      4.1 Experimental design
        4.1.1 Ranking criterions
      4.2 Experimental results
    Chapter 5 Conclusion
      5.1 Discussion
      5.2 Conclusion
      5.3 Further work
    References

    Full text: not authorized for public release (campus network, off-campus network, or the National Central Library's Taiwan thesis and dissertation system)