Fine-tuning and evaluation of RAG models
Introduction
This article covers the fine-tuning and evaluation of Retrieval Augmented Generation (RAG) models. It begins with a brief review of the basics of RAG, then describes the target system and explains how a RAG model can be tailored to specific needs. It also presents methods and metrics for evaluating the models and concludes with the results of this evaluation and the conclusions drawn from it.
Fine-tuning RAG models is a crucial step in AI development because it adapts and optimizes the models for specific application areas. Advanced machine learning and natural language processing (NLP) techniques improve the models and keep their complexity manageable. During fine-tuning, various experiments are run to identify the training data and approaches that work best and to support error analysis and the interpretation of the results.
Another important topic is quality assurance and the analysis of the generated answers. Text analysis and text generation techniques are used to evaluate the models and to improve their performance in data processing and analysis. Careful selection and preparation of the training data can further increase the performance of the AI models. Neural networks play a central role here and contribute to the continuous improvement of AI technology.
What is RAG?
Retrieval-Augmented Generation (RAG) is an innovative method for building question-answering systems. RAG models combine the strengths of retrieval and generation approaches to produce more accurate and comprehensive answers: they pair the precision of extraction with the creativity of generation and can therefore answer both quickly and accurately. The following graphic illustrates the different approaches to QA systems and shows how RAG combines the advantages of extraction and generation.
RAG models use internal data sources such as documents, PDFs, plain texts and Confluence pages. This data is often in German and contains numerous images and technical jargon. Such models play a crucial role in AI development and text generation.
Aligning the RAG model
Aligning the RAG model with specific needs is an essential step to maximize performance across different domains. A closed generative QA system, which relies exclusively on a generator, is well suited to very general and frequent use cases because it uses only internal knowledge. In contrast, the open generative QA model (RAG) is particularly suitable for complex domains because it incorporates both internal and external knowledge sources. The challenge is to assess whether RAG fits a specific domain and whether a domain-specifically optimized RAG model can be created. In theory, such a model can solve almost all problems, but its implementation and use are extremely complex and demanding.
There are two main approaches to aligning the RAG model: parametric and non-parametric methods. Parametric approaches fine-tune the generator and the retriever to improve the accuracy and relevance of the generated responses. Non-parametric approaches, such as prompt engineering and building a more elaborate RAG pipeline, improve performance through targeted adjustments to the prompts and the processing steps. These alignment measures allow the RAG model to better address the specific requirements and challenges of the domain in question and thus provide more precise and relevant answers.
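To make the non-parametric path more concrete, here is a minimal sketch of how prompt engineering slots into a RAG pipeline. The `retrieve` and `generate` callables, the prompt template and the parameter `k` are illustrative assumptions, not the pipeline described in this article.

```python
# Minimal sketch of the non-parametric approach: the model weights stay untouched,
# only the prompt that wraps the retrieved context (and retrieval settings) is tuned.
# `retrieve` and `generate` stand in for whatever retriever/LLM client is used.

def build_rag_prompt(question: str, contexts: list[str]) -> str:
    """Assemble a grounded prompt from retrieved passages (illustrative template)."""
    context_block = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(contexts))
    return (
        "Answer the question using only the sources below. "
        "Cite the number of the source you used.\n\n"
        f"Sources:\n{context_block}\n\n"
        f"Question: {question}\nAnswer:"
    )

def answer(question: str, retrieve, generate, k: int = 3) -> str:
    contexts = retrieve(question, top_k=k)          # non-parametric lever: k, chunking, filters
    prompt = build_rag_prompt(question, contexts)   # non-parametric lever: the template
    return generate(prompt)                         # generator weights remain frozen
```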
Data set generation and selection of the retriever
A suitable dataset is essential for fine-tuning and evaluating the RAG model. It is created with a multimodal model such as GPT-4, using a few-shot prompt to generate the examples. The dataset is then split into training, validation and test data, with 70% of the data used for training, 15% for validation and 15% for testing. Each entry contains the page ID, the question and the corresponding answer.
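A minimal sketch of the split described above, assuming the generated examples are stored as JSON lines with the fields page_id, question and answer; the file name and field names are illustrative.

```python
import json
import random

# Hypothetical record layout mirroring the structure described above:
# each entry holds the page ID, the generated question and the reference answer.
with open("qa_dataset.jsonl", encoding="utf-8") as f:
    records = [json.loads(line) for line in f]  # {"page_id": ..., "question": ..., "answer": ...}

random.seed(42)
random.shuffle(records)

# 70% training, 15% validation, 15% test
n = len(records)
train = records[: int(0.70 * n)]
val = records[int(0.70 * n): int(0.85 * n)]
test = records[int(0.85 * n):]
```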
Selecting the right retriever is another important step. The Massive Text Embedding Benchmark (MTEB) can be used to pick models that are optimized for the German language, for example the open-source model “Multilingual-e5-large” and the closed-source model “Cohere-embed-multilingual-v3.0”. The retriever is fine-tuned with the Multiple Negatives Ranking (MNR) loss, which minimizes the distance between a question and the relevant page while maximizing the distance to randomly selected, non-relevant pages.
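The following sketch shows how such a fine-tuning run could look with the sentence-transformers library and Multiple Negatives Ranking loss. The `page_text` field (assumed to be resolved from the page ID), the `train` split from the previous snippet and the hyperparameters are assumptions, not the exact setup used here.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Fine-tune the retriever with Multiple Negatives Ranking loss.
# Each positive pair is (question, text of the relevant page); the other pages
# in the batch serve as negatives. `page_text` is assumed to be the page
# content looked up via the page ID from the generated dataset.
model = SentenceTransformer("intfloat/multilingual-e5-large")

examples = [InputExample(texts=[r["question"], r["page_text"]]) for r in train]
loader = DataLoader(examples, shuffle=True, batch_size=16)
loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)
model.save("retriever-finetuned-de")
```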
Aligning the generator brings its own challenges, such as the enormous size of the models and the fact that most models are English-centric. Solutions include 4-bit quantization and restricting fine-tuning to a 7B model. German-adapted benchmarks and models from the LAION team can be used for this. Two options are available for fine-tuning the generator: full language modeling, where the loss is computed over the entire sequence, and prefix language modeling, where only the answer tokens contribute to the loss. By carefully creating and adapting the dataset and by selecting and fine-tuning the retriever and the generator, the RAG model is tailored to the specific requirements and can provide more precise and relevant answers.
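As a rough sketch, a 7B generator could be loaded in 4-bit with Hugging Face transformers and bitsandbytes before fine-tuning. The model name below is only an example of a German-adapted 7B model, not necessarily the one used here, and the quantization settings are common defaults rather than the article's configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Load a 7B generator in 4-bit so that fine-tuning fits on a single GPU.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_name = "LeoLM/leo-hessianai-7b"  # illustrative German-adapted 7B model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
```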
Methods and metrics for the fine-tuning and evaluation of RAG models
The RAG models are evaluated with separate metrics for the retriever and the generator as well as end-to-end metrics. The most important retriever metrics are DocHitRate@k and HitRate@k, which measure how often the relevant documents appear among the top-k retrieved results. The generator is evaluated with metrics such as Faithfulness, which checks how well the generated answer sticks to the retrieved facts, and the average word count. End-to-end metrics include the correctness of the answers and the ROUGE-N score, which measures the n-gram overlap between the generated answer and the reference answer.
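The snippet below illustrates two of these metrics in simplified form, HitRate@k for the retriever and a recall-oriented ROUGE-N for the answers. It is an illustration of the idea, not the evaluation harness behind the results reported here.

```python
def hit_rate_at_k(retrieved_ids: list[list[str]], relevant_ids: list[str], k: int) -> float:
    """Fraction of questions whose relevant page appears in the top-k retrieved results."""
    hits = sum(rel in ret[:k] for ret, rel in zip(retrieved_ids, relevant_ids))
    return hits / len(relevant_ids)

def rouge_n(candidate: str, reference: str, n: int = 1) -> float:
    """Recall-oriented n-gram overlap between the generated and the reference answer."""
    def ngrams(text: str) -> list[tuple]:
        tokens = text.lower().split()
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    cand, ref = ngrams(candidate), ngrams(reference)
    if not ref:
        return 0.0
    overlap = sum(min(cand.count(g), ref.count(g)) for g in set(ref))
    return overlap / len(ref)
```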
The results show that fine-tuning the retriever significantly increases performance and allows fewer documents to be retrieved, which reduces costs. Fine-tuning the generator does not always lead to better results, but it can yield more precise and shorter answers. In domain-specific RAG pipelines, fine-tuning with prefix language modeling usually outperforms full language modeling.
Conclusion
A domain-specifically optimized RAG model outperforms conventional RAG models, but requires considerable implementation effort. Fine-tuning the retriever proves particularly advantageous, while fine-tuning the generator is situation-dependent and can be especially useful for long input prompts. Evaluation systems are essential for assessing performance and are currently more of an art than a science. Overall, it is clear that the customization and careful evaluation of RAG models are critical to their effectiveness in specific application domains.
More exciting articles on this topic:
The traceability of AI in language understanding: focus on transparency and quality
Retrieval Augmented Fine-Tuning (RAFT): How language models become smarter with new knowledge
Knowledge Graphs and Retrieval-Augmented Generation: A Guide to Improving RAG Systems