Author: 洪薏璇 (Yi-Hsuan Hung)
Thesis Title: 系統化多樣式段落合併處理與機器學習於法律文件之分類研究 (Systematic Processing of Multi-Style Paragraph Merging and Machine Learning for Legal Document Classification)
Advisor: 查士朝 (Shi-Cho Cha)
Committee: 查士朝 (Shi-Cho Cha), Uwe Haneke, Jannik Strötgen
Degree: 碩士 (Master)
Department: 管理學院 - 資訊管理系 (Department of Information Management)
Thesis Publication Year: 2024
Graduation Academic Year: 113
Language: English
Pages: 109
Keywords: Machine Learning, Legal Document Classification, OCR, Feature Engineering, Deep Learning, Embedding Techniques

Background: Effective contract management is crucial in modern business, particularly in supply chains, where companies must coordinate with multiple suppliers. Paper-based contracts prolong negotiations and add complexity. Unstructured data such as PDF contracts is also difficult to process and analyze, especially in the case of legal documents.
Objective: This research develops methods to systematically merge and structure multi-style paragraphs in legal contracts, and uses machine learning to improve legal document classification, particularly for healthcare-related contracts.
Methods: We combine Optical Character Recognition (OCR) with machine learning techniques to process unstructured legal contracts. To handle multi-style paragraphs, we introduce tools that merge text spanning pages and columns. We apply the traditional models SVM, XGBoost, KNN, W-KNN, and NB, as well as the deep learning models BERT, TextCNN, and BiLSTM. We explore the feature extraction techniques TF-IDF, GloVe, and Word2Vec, and optimize hyperparameters with Optuna.
Results: Traditional models, particularly XGBoost, outperformed deep learning models in legal document classification, achieving the highest accuracy and F1-score with TF-IDF features. XGBoost performed best with the top 2500 features, reaching 82% accuracy. Deep learning models such as BiLSTM and BERT suffered from overfitting but showed potential for larger datasets. TF-IDF proved the most reliable feature extraction method, while GloVe embeddings improved the performance of the deep learning models.
Conclusion: This research provides a systematic approach to handling multi-style legal documents and, through comprehensive experiments, identifies the best-performing machine learning models for legal text classification.
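
To make the feature-extraction step mentioned in the abstract more concrete, the following is a minimal sketch of TF-IDF weighting restricted to a top-k vocabulary. It is not the thesis's implementation. The abstract reports the best results with the top 2500 features but does not say how features are ranked, so ranking by total corpus frequency is an assumption here. The function names `tokenize`, `build_vocab`, and `tfidf_vectors`, the toy corpus, and the smoothed IDF formula are all illustrative choices.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters (illustrative tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocab(docs, top_k):
    """Keep the top_k terms by total corpus frequency (assumed ranking rule)."""
    counts = Counter(tok for doc in docs for tok in tokenize(doc))
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_k]]


def tfidf_vectors(docs, vocab):
    """Return one TF-IDF vector per document over the given vocabulary."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = {t: sum(1 for toks in tokenized if t in toks) for t in vocab}
    # Smoothed IDF, one common variant: ln((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)  # raw term counts; missing terms count as 0
        vectors.append([tf[t] * idf[t] for t in vocab])
    return vectors


# Toy corpus (hypothetical); a real run would use top_k=2500 on contract text.
docs = [
    "The supplier shall deliver the medical devices",
    "The hospital shall pay the supplier",
    "This lease covers the office premises",
]
vocab = build_vocab(docs, top_k=3)
vectors = tfidf_vectors(docs, vocab)
```

In the setting the abstract describes, vectors like these would be passed to a classifier such as XGBoost; that step is omitted here to keep the sketch dependency-free.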

ABSTRACT (IN CHINESE)
ABSTRACT
TABLE OF CONTENT
TABLE
FIGURE
ALGORITHM
1 INTRODUCTION
  1.1 PROBLEM DEFINITION
  1.2 MAIN CONTRIBUTION
  1.3 THESIS STRUCTURE
2 RELATED WORK
  2.1 NATURAL LANGUAGE PROCESSING
  2.2 OCR TECHNOLOGY
  2.3 TEXT CLASSIFICATION
  2.4 HYPERPARAMETER OPTIMIZATION
3 PARAGRAPH MERGED
  3.1 PROBLEM DEFINITION
    3.1.1 AWS Textract
    3.1.2 Use Case
  3.2 PROPOSED METHOD BY TEXTRACT LAYOUT
    3.2.1 Main Merge
    3.2.2 Performance
  3.3 PROPOSED METHOD BY LINES
    3.3.1 Step1: Merge Lines to Blocks
    3.3.2 Step2: Filter Out Non-Main Blocks
    3.3.3 Performance
  3.4 TOLERANCE DETERMINATION
  3.5 LIMITATION
4 TEXT CLASSIFICATION
  4.1 CHALLENGE
  4.2 DATA PREPARATION
    4.2.1 Data Structure
    4.2.2 Dataset Split
    4.2.3 Data Cleaning
    4.2.4 Feature Building
  4.3 MODELS SELECTION
    4.3.1 Experiment Setup Process
    4.3.2 Model Setup
  4.4 PERFORMANCE RESULT
    4.4.1 Accuracy and F1 Score in Different Models
    4.4.2 Confusion Matrix
    4.4.3 Training and Validation Process
    4.4.4 Different Features in Different Models
5 CONCLUSION
6 FUTURE WORK
REFERENCE
