
Graduate student: 林雅惠 (Ya-Huei Lin)
Thesis title: 基於條件機率域萃取引用文獻資訊於個人著述網頁
Mining Publication Records on Publication Pages based on Conditional Random Fields
Advisors: 李漢銘 (Hahn-Ming Lee), 何建明 (Jan-Ming Ho)
Oral defense committee: 莊庭瑞 (Tyng-Ruey Chuang), 鄧惟中 (Wei-Chung Teng), 項天瑞 (Tien-Ruey Hsiang)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Year of publication: 2012
Academic year of graduation: 100 (ROC calendar)
Language: English
Number of pages: 47
Keywords (Chinese): 條件機率域, 引用文獻, 網頁探勘
Keywords (English): Conditional Random Fields, Publication Record, Web Mining

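A publication record carries the authors, title, year of publication, and other information about a work. Digital libraries analyze publication records to support many applications, such as academic community analysis and analysis of researchers' expertise. Publication records can be extracted from journal websites or from researchers' personal publication pages for use by digital libraries. Web pages that contain publication records usually contain other information as well; for example, a researcher's personal publication page may also include a biography or a list of past talks. Extracting publication records correctly from different Web pages, and from researchers' personal publication pages in particular, is an interesting problem, because publication records that look visually regular and similar on a page are not necessarily built from similar HTML, and different pages present publication records in different ways.
In this thesis, we propose a publication record extraction system that effectively extracts publication records from different Web pages. Observing websites that contain publication records, we found that the records on a single page usually follow a similar field order, for example "author, title, year" or "year, title, author". We therefore train a field-analysis model with Conditional Random Fields to identify the likely field order on a page, and use a diffusion-based algorithm to segment the individual publication records. Finally, experiments show that the proposed system extracts publication records more accurately than previous work.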


A publication record is a list of semi-structured citation strings for the publications of a research institute or an individual researcher. Publication records integrated into a digital library form an important knowledge base and thereby enable a variety of applications. A publication record is usually found among other information on a publication Web page (or "publication page" for short), so extracting publication records from such Web pages is an interesting problem. The problem is difficult for several reasons, e.g., flexibility in formatting the metadata of a publication as a semi-structured citation string and flexibility in presenting the citation string visually in HTML. Furthermore, two citation strings with a similar visual presentation on the same Web page may have different HTML constructs. In this paper, we present a content analysis approach, based on Conditional Random Fields and data region boundary analysis, to the problem of automatically extracting publication records from a publication page. Experimental results show that our method performs well on a benchmark containing manually crafted publication pages. The precision, recall, and F-measure are 82.5%, 87.6%, and 85.0%, respectively. This is a significant improvement over previous work.
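
As a quick consistency check on the figures above, the F-measure can be recomputed from the reported precision and recall. This is a minimal sketch that assumes the F-measure is the balanced F1 score (the harmonic mean of precision and recall); the abstract does not state which variant was used, and the function name `f_measure` is illustrative only.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Return the F-beta score; beta = 1 gives the harmonic mean (F1).

    Assumption: the abstract's "F-measure" is the balanced F1 score.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when both inputs are zero
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall as quoted in the abstract (82.5% and 87.6%).
f1 = f_measure(0.825, 0.876)
print(f"{f1:.3f}")  # prints 0.850
```

Under that assumption the recomputed value rounds to 85.0%, which agrees with the reported F-measure.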

ABSTRACT
ACKNOWLEDGEMENTS
1 Introduction
  1.1 Motivation
  1.2 Example of personal publication pages
  1.3 Goals
  1.4 Contribution
  1.5 Outlines of the Thesis
2 Background
  2.1 Publication Extraction and Parsing
  2.2 Information Extraction on the Web
  2.3 Conditional Random Fields based Approach
3 Publication Record Miner
  3.1 Publication Page Segmentation
    3.1.1 DOM Tree Constructor
    3.1.2 Data Region Segmenter
  3.2 Publication field Labeling
    3.2.1 Content Tokenizer
    3.2.2 Feature Assigner
    3.2.3 CRF-based Labelor
  3.3 Publication Record Extraction
    3.3.1 Diffusion-based Candidate Extractor
    3.3.2 Publication Record Filter
4 Empirical Experiments and Results
  4.1 CORA information extraction dataset
  4.2 Dataset P
  4.3 Evaluation Metrics and Experiment Design
    4.3.1 Results and Discussions
    4.3.2 The Limitation of PRM
5 Conclusion and Further Work
  5.1 Conclusion
  5.2 Further Work


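Full-text release date: 2017/07/25 (campus network)
Full-text release date: not authorized for public release (off-campus network)
Full-text release date: not authorized for public release (National Central Library: National Digital Library of Theses and Dissertations in Taiwan)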