
Graduate student: 林宥廷 (Yu-ting Lin)
Thesis title: Automatic Data Extraction from the Web - Using Data Path and Visual Information Methods (自動化網頁資料擷取-使用資料路徑與視覺資訊方法)
Advisor: 徐俊傑 (Chiun-Chieh Hsu)
Oral defense committee: 黃世禎 (Sun-Jen Huang), 洪政煌 (Cheng-Huang Hung)
Degree: Master
Department: College of Management - Department of Information Management
Year of publication: 2008
Academic year of graduation: 96 (ROC calendar)
Language: Chinese
Number of pages: 61
Chinese keywords: 網路探勘, 網頁資料擷取, 擷取程式, 文件物件模型
Foreign-language keywords: Web Mining, Web data extraction, Wrapper, DOM (Document Object Model)
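  • With the maturing of network technology and the digitization of large amounts of data, the volume of information on the Internet is growing rapidly, by nearly 60% per year, and the World Wide Web has effectively become an enormous database. Resources on the Web are abundant and mostly presented as web pages, so important information is embedded in those pages. However, most web pages are presented in the HTML document format, which is poorly suited to collecting, analyzing, and comparing data. How to automatically extract the important information in web pages, so that Web resources can be integrated effectively, has therefore been a closely watched research topic in recent years.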

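    To obtain information from the Web, this thesis proposes a method that automatically extracts the main informational content of a web page and accurately extracts the data records in it, such as the product lists of auction sites, the result entries returned by search engines, and the news lists of news sites. The proposed method has two main steps. Step 1 proposes the Data Path Matching (DPM) method, which finds the structurally repeated parts of an HTML document using a relatively simple and small number of path code similarity computations, and thereby obtains the data records effectively. Visual information from the page is then used to filter out data records with low information content, improving the overall precision of data extraction. Step 2 uses a multiple path code string alignment method to align attributes across records, and stores the extracted data records in a database.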

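    The proposed method works on any single web page: it effectively filters out noise, automatically extracts data with high information content, and stores that data in structured database form. Experiments were run on pages from 50 websites of different origins, and the results show an information extraction precision as high as 93%, which contributes greatly to integrating information from different sources on the Web.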
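The idea behind Step 1, finding structurally repeated parts of a page by comparing tag paths, can be illustrated with a small sketch. This is not the thesis's DPM algorithm: the abstract does not define its path codes or similarity measure, so the sketch assumes a plain Jaccard similarity over sets of tag paths, a hypothetical threshold of 0.6, and a well-formed XHTML snippet parsed with Python's standard `xml.etree.ElementTree`. The function names and the sample page are illustrative only.

```python
import xml.etree.ElementTree as ET


def subtree_paths(node, prefix=""):
    """Return the set of tag paths (e.g. 'div/a') found in a subtree."""
    path = f"{prefix}/{node.tag}" if prefix else node.tag
    paths = {path}
    for child in node:
        paths |= subtree_paths(child, path)
    return paths


def jaccard(a, b):
    """Jaccard similarity of two sets (assumed measure, not from the thesis)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def group_similar_children(parent, threshold=0.6):
    """Greedily group child subtrees whose tag-path sets are similar."""
    groups = []  # each group is a list of (child, paths) pairs
    for child in parent:
        paths = subtree_paths(child)
        for group in groups:
            # Compare against the first member of each existing group.
            if jaccard(paths, group[0][1]) >= threshold:
                group.append((child, paths))
                break
        else:
            groups.append([(child, paths)])
    return [[child for child, _ in group] for group in groups]


# Hypothetical page: three product-like records plus a heading and a footer.
PAGE = """
<div>
  <h1>Results</h1>
  <div><a>Item A</a><span>$10</span></div>
  <div><a>Item B</a><span>$12</span><em>new</em></div>
  <div><a>Item C</a><span>$9</span></div>
  <p>footer</p>
</div>
"""

root = ET.fromstring(PAGE)
groups = group_similar_children(root)
records = max(groups, key=len)  # the largest group is taken as the record list
titles = [record.find("a").text for record in records]
print(titles)
```

The largest group of mutually similar siblings is treated as the candidate data region. A real page would also need an HTML-tolerant parser and the visual filtering that the abstract describes, both of which are outside this sketch.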


    Due to the explosive growth and popularity of the World Wide Web, the Internet presents a huge amount of useful information and seems to have become an enormous database. However, the majority of Web documents are HTML files, which lack structure and make data-related applications hard to support. That is why automatic data extraction is essential for integrating multiple Web resources.

    To retrieve Internet data automatically, this research proposes a novel approach for extracting the major information from Web pages, e.g., a shopping website's product lists, search engine results, and a news website's news lists. The method consists of two main steps. In Step 1, a novel and effective technique called DPM (Data Path Matching) is proposed to extract the data records in a page; the system then filters out unimportant sets of data records based on visual information. In Step 2, a novel method called Data Path Code Alignment is presented to identify data items in the extracted records.

    The proposed two-step technique can separate noisy and unimportant data from Web pages, extract the primary information automatically, and store it in a database. Experimental results on pages from 50 diverse Web sites demonstrate the effectiveness of the method: the precision of both data record extraction and data item alignment is above 93%. This is a substantial contribution to the field of Web information integration.
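Step 2 aligns the items of different records so that matching fields land in the same column. The abstract does not give the scoring scheme or the path code format used by Data Path Code Alignment, so the sketch below substitutes the standard Needleman-Wunsch global alignment for two sequences, with assumed scores (match +1, mismatch -1, gap -1) and plain tag names standing in for path codes. It is meant only to show what aligning two records looks like.

```python
def align(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align two token sequences; gaps are returned as None."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append(None)
            i -= 1
        else:
            out_a.append(None)
            out_b.append(b[j - 1])
            j -= 1
    return out_a[::-1], out_b[::-1]


# Hypothetical records: the second one has no price field ("span").
record_1 = ["a", "span", "em"]
record_2 = ["a", "em"]
aligned_1, aligned_2 = align(record_1, record_2)
print(aligned_1)
print(aligned_2)
```

Each aligned position can then be read as one database column, with `None` marking a field that a record lacks. Aligning more than two records, as the thesis does, would need a multiple-alignment strategy on top of this pairwise step.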

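    Chinese Abstract
    English Abstract
    Acknowledgments
    Table of Contents
    List of Figures
    List of Tables
    Chapter 1 Introduction
        1-1 Research Background and Motivation
        1-2 Research Objectives
        1-3 Thesis Organization
    Chapter 2 Related Work
        2-1 Classification of Web Content Information Extraction Systems
        2-2 Record-Level Extraction Systems
        2-3 Web Page Block Segmentation and Importance Analysis
        2-4 Document Object Model
    Chapter 3 Automatic Web Data Extraction System
        3-1 System Workflow
        3-2 Data Record Identification
            3-2-1 Preliminary Subtree Clustering: Data Path Matching
            3-2-2 Merging and Removing Groups: Using Visual Position Information
        3-3 Eliminating Data Regions with Low Information Content
            3-3-1 Filtering Data Regions by Spatial Features
            3-3-2 Central Block Setting
        3-4 Identifying Extraction Targets in Nested Data Regions
        3-5 Data Item Extraction
            3-5-1 Center Alignment Algorithm
            3-5-2 Alignment of Two Path Code Strings
            3-5-3 Alignment of Multiple Path Code Strings
    Chapter 4 Experimental Results and Analysis
        4-1 Web Page Collection
        4-2 Experimental Results
        4-3 Analysis and Discussion of Experimental Results
    Chapter 5 Conclusions and Future Directions
        5-1 Conclusions
        5-2 Future Research Directions
    References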


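    Full-text release date: 2013/07/09 (campus network)
    Full-text release date: full text not authorized for public release (off-campus network)
    Full-text release date: full text not authorized for public release (National Central Library: Taiwan theses and dissertations system)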