Syntactic analysis with Natural Language Processing using large language models

This document presents an approach to automatic constituency parsing of Spanish sentences based on fine-tuning Large Language Models such as Bloom or GPT2 using the seq2seq approach. It also aims to make the resulting system widely accessible; to that end, we use the Amazon Web Services platform for hosting and distribution. The successful completion of this project will benefit the MiSintaxis [18] application, thus providing quality education to its thousands of users worldwide.
We begin with the history of Spanish grammar studies, exploring its components and the methodologies used to teach it at the elementary and secondary education levels. This analysis provides the foundation for the subsequent stages of our research and development.
Subsequently, we present a review of the state-of-the-art developments in Natural Language Modeling and Parsing. We traverse the history of Neural Networks and their application in the realm of Natural Language Modeling, discussing the evolution of the various architectures that laid the foundation for the advent of Transformers. We explore the Transformer architecture in detail, focusing on the critical elements that propelled the success of Large Language Models. In addition, we introduce the Hugging Face ecosystem, a notable platform that fosters the accessibility and usability of these advanced models. We also shed light on traditional parsing algorithms, delineating their role and significance in the broader context of language parsing.
Using an automatic process, we converted the Spanish AnCora corpus to our own grammar notation, which is based on the recommendations of the Nueva gramática BÁSICA de la lengua española [15]. This process resulted in a Spanish corpus comprising 500,000 words spread across 17,300 sentences, thus encompassing the entirety of AnCora.
We fine-tuned the Hugging Face models bloom-560m, bloom-1b1, gpt2-base-bne and gpt2-large-bne on this customized corpus and compared them using the F1 metric over the test dataset from the AnCora corpus, obtaining the following scores: 0.8141 for gpt2-large-bne, 0.7939 for bloom-560m, 0.7790 for bloom-1b1, and 0.7234 for gpt2-base-bne. On a simplified test dataset that we called the Argos dataset, we obtained the following F1 scores: 0.9123 for bloom-1b1, 0.8642 for bloom-560m, 0.8321 for gpt2-large-bne, and 0.8190 for gpt2-base-bne.
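To make the F1 comparison concrete, the following is a minimal sketch of how an F1 score over constituency parses can be computed. It assumes that model outputs are bracketed strings and that F1 is taken over labeled constituent spans (a PARSEVAL-style measure); the abstract does not state the exact variant used, and the labels in the example (O, SN, SV, D, N, V) are illustrative rather than the notation of the converted corpus.

```python
from collections import Counter


def labeled_spans(bracketed):
    """Collect (label, start, end) for every constituent in a bracketed parse.

    Assumes a well-formed string in which every "(" is followed by a label.
    """
    tokens = bracketed.replace("(", " ( ").replace(")", " ) ").split()
    spans = Counter()
    stack = []  # open constituents as (label, start word index)
    pos = 0     # number of words consumed so far
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "(":
            stack.append((tokens[i + 1], pos))
            i += 2
        elif tok == ")":
            label, start = stack.pop()
            spans[(label, start, pos)] += 1
            i += 1
        else:
            pos += 1  # a word of the sentence
            i += 1
    return spans


def bracket_f1(gold, pred):
    """Harmonic mean of labeled-span precision and recall."""
    gold_spans = labeled_spans(gold)
    pred_spans = labeled_spans(pred)
    matched = sum((gold_spans & pred_spans).values())
    if matched == 0:
        return 0.0
    precision = matched / sum(pred_spans.values())
    recall = matched / sum(gold_spans.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the prediction attaches "perro" to the wrong phrase.
gold = "(O (SN (D El) (N perro)) (SV (V ladra)))"
pred = "(O (SN (D El)) (SV (N perro) (V ladra)))"
print(round(bracket_f1(gold, pred), 4))
```

In this example four of the six predicted spans match the gold spans, so precision and recall are both 4/6. A corpus-level score would normally pool the span counts across all test sentences before computing precision and recall, rather than averaging per-sentence scores.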
Finally, we present Amazon Web Services and describe how to deploy a large language model in a real-world usage scenario.

Licence: Creative Commons Attribution Non Commercial No Derivatives 4.0 International

Keywords: Neural networks

