Supervised Fine-Tuning: Tailoring LLMs for Specific Tasks
In the fast-paced domain of Natural Language Processing (NLP), fine-tuning has become a vital method for tailoring pre-trained Large Language Models (LLMs) to specialized downstream applications. These large-scale models, such as those in the GPT series, have achieved remarkable progress in understanding and generating language. However, their initial training typically relies on a vast corpus of text data through unsupervised learning, which may not align perfectly with specific tasks.
Fine-tuning addresses this limitation by leveraging the general language knowledge acquired during pre-training and refining it for a particular application through supervised learning. By adapting a pre-trained model on a dataset that is specific to the task at hand, NLP professionals can yield impressive outcomes using considerably less training data and computational resources compared to developing a model from scratch. This is particularly important for Large Language Models, as retraining on the entire dataset can be computationally intensive.
The effectiveness of fine-tuning has resulted in numerous state-of-the-art achievements across various NLP tasks, establishing it as a standard practice in creating high-performing language models. Researchers and practitioners are continuously investigating different variations and improvements of fine-tuning methods to expand the capabilities of NLP further.
This article will provide an in-depth look at fine-tuning an Instruction-based Large Language Model with the transformers ecosystem, using two distinct approaches: the plain transformers Trainer class and the trl module's SFTTrainer.
Supervised Fine-Tuning (SFT)
Supervised fine-tuning refers to the process of customizing a pre-trained Large Language Model (LLM) for a specific downstream task through the use of labeled data. In this method, the fine-tuning dataset consists of responses that have been previously validated, in contrast with unsupervised techniques, which do not require prior validation. While LLM pre-training is generally unsupervised, fine-tuning is predominantly supervised.
During the supervised fine-tuning phase, the pre-trained LLM is adapted to this labeled dataset using supervised learning techniques. The model's weights are modified based on gradients derived from the task-specific loss, which assesses the discrepancy between the model's predictions and the true labels.
This supervised fine-tuning process enables the model to recognize task-specific patterns and intricacies found in the labeled data. By adjusting its parameters to fit the specific data distribution and task demands, the model enhances its performance on the target task.
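In most SFT setups this task-specific loss is simply the token-level cross-entropy over the validated responses, the same objective as in pre-training but restricted to the labeled pairs. A sketch of the objective (the notation here is ours, not from any particular paper):

```latex
\mathcal{L}_{\mathrm{SFT}}(\theta) = -\sum_{(x,\,y)\in\mathcal{D}} \sum_{t=1}^{|y|} \log p_{\theta}\left(y_t \mid x,\, y_{<t}\right)
```

where x is the prompt, y the validated response, and θ the model weights.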
For instance, consider a pre-trained LLM responding to the input "I can't log into my account. What should I do?" with the reply "Try to reset your password using the 'Forgot Password' option."
Now, imagine developing a chatbot for Customer Support. While the response above may be correct, it lacks adequacy for a Customer Support context, which necessitates more empathy, a different presentation, additional contact information, or other specific guidelines. This is where Supervised Fine-Tuning becomes essential.
By providing a series of validated Training Examples, your model can learn to respond more effectively to prompts and inquiries. The examples below illustrate the kind of Customer Support empathy statements the model can be taught to produce.
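Here is a hypothetical pair of training examples (the field names and wording are illustrative, not taken from a real dataset):

```python
# Hypothetical Customer Support training examples: the validated responses add empathy
# and follow-up guidance that the base model's generic reply lacked.
training_examples = [
    {
        "prompt": "I can't log into my account. What should I do?",
        "response": (
            "I'm sorry to hear you're having trouble logging in - that can be frustrating. "
            "Please try resetting your password with the 'Forgot Password' option, and if the "
            "issue persists, our support team will be happy to help you directly."
        ),
    },
    {
        "prompt": "My order arrived damaged.",
        "response": (
            "I apologize for the inconvenience. Could you share your order number so we can "
            "arrange a replacement or a refund right away?"
        ),
    },
]
```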
There are several reasons to consider fine-tuning LLMs:
- Achieve enhanced responses that align with your business guidelines, as previously mentioned.
- Incorporate new specific/private data that was not publicly available during the initial training phase, ensuring the LLM aligns with your unique knowledge base.
- Train the LLM to address new (unseen) prompts.
Supervised Fine-Tuning (SFT) Using Transformers Library
Hugging Face’s transformers library has become the primary tool for training and fine-tuning models, including LLMs. Fine-tuning has always been a core feature, nearly seamlessly integrated within the Trainer class.
Recently, the introduction of the trl module for Reinforcement Learning has led to the creation of a new class, SFTTrainer, specifically designed for Supervised Fine-Tuning on LLMs. Let’s explore the differences between the two.
Fine-Tuning Using the Trainer Class
The Trainer class simplifies the pre-training and fine-tuning processes for models, including LLMs. It requires several key arguments:
- A model, loaded with AutoModelForCausalLM.from_pretrained (older code often uses the deprecated AutoModelWithLMHead);
- A TrainingArguments instance, which configures the run (output directory, batch size, number of epochs, and so on);
- A training dataset and an evaluation dataset;
- A data collator that applies various transformations to the datasets. Padding (to create batches of uniform length) is one such transformation, though this can also be handled by the tokenizer. In the context of LLMs, the collator also builds the labels for the language-modeling objective: masking random tokens for masked language modeling, or simply shifting the inputs when training for next-token (causal) prediction.
The eval_dataset and train_dataset are instances of the Dataset class. Datasets can be constructed from various formats. For this example, let’s assume I have the datasets saved in two text files at <TRAINING_DATASET_PATH> and <TEST_DATASET_PATH>. To obtain the datasets for both splits and the collator, you would do:
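Below is a minimal sketch of one way to do this; the checkpoint name, sequence length, and training arguments are assumptions you should adapt, and the two placeholder paths stand for your own text files:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Assumption: any causal-LM checkpoint works here; "gpt2" is only an example.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no padding token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Load the two plain-text files as the train and test splits.
raw_datasets = load_dataset(
    "text",
    data_files={"train": "<TRAINING_DATASET_PATH>", "test": "<TEST_DATASET_PATH>"},
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = raw_datasets.map(tokenize, batched=True, remove_columns=["text"])

# mlm=False: the collator builds labels for next-token (causal) prediction instead of masking.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

training_args = TrainingArguments(
    output_dir="./sft-output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    evaluation_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    data_collator=data_collator,
)
trainer.train()
```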
If you’re looking to train an Instruction-based LLM, consider these datasets:
- GPT-4all Dataset: GPT-4all (Pairs, English, 400k entries) — A mix of subsets from OIG, P3, and Stackoverflow, covering general Q&A and tailored creative inquiries.
- RedPajama-Data-1T: RedPajama (PT, Primarily English, 1.2T tokens, 5TB) — An entirely open pre-training dataset adhering to LLaMA’s methodology.
- OASST1: OpenAssistant (Pairs, Dialog, Multilingual, 66,497 conversation trees) — A significant, human-written, and annotated dialogue dataset aimed at enhancing LLM responses.
- databricks-dolly-15k: Dolly2.0 (Pairs, English, 15K+ entries) — A collection of human-crafted prompts and answers, encompassing tasks like Q&A and summarization.
- AlpacaDataCleaned: Models similar to Alpaca/LLaMA (Pairs, English) — A refined version of Alpaca, GPT_LLM, and GPTeacher.
- GPT-4-LLM Dataset: Models akin to Alpaca (Pairs, RLHF, English, Chinese, 52K entries for English and Chinese, 9K entries unnatural-instruction) — A dataset generated by GPT-4 and other LLMs for improved pairs and RLHF, featuring instruction and comparison data.
- GPTeacher: (Pairs, English, 20k entries) — A dataset containing targets produced by GPT-4, including seed tasks from Alpaca and new roleplay tasks.
- Alpaca data: Alpaca, ChatGLM-fine-tune-LoRA, Koala (Dialog, Pairs, English, 52K entries, 21.4MB) — A dataset generated by text-davinci-003 to enhance LLMs' capacity to follow human instructions.
Fine-Tuning Using trl SFTTrainer Class
As previously mentioned, the SFTTrainer class is found within Hugging Face’s trl library, which is dedicated to Reinforcement Learning. Given that Supervised Fine-Tuning is the initial stage in Reinforcement Learning from Human Feedback (RLHF), developers recognized the need to isolate this process into its own class while adding functions that simplify tasks which would otherwise need to be performed manually with the Trainer class. Let’s examine its structure.
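The following is a minimal sketch of SFTTrainer usage; the checkpoint and dataset are only examples, and depending on your trl version, arguments such as dataset_text_field and max_seq_length may instead be passed through an SFTConfig object:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer

# Assumption: any causal-LM checkpoint and any dataset with a text column will do.
model_name = "facebook/opt-350m"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

train_dataset = load_dataset("imdb", split="train")
eval_dataset = load_dataset("imdb", split="test")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    dataset_text_field="text",  # which column holds the raw text
    max_seq_length=512,
)
trainer.train()
```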
You might not observe many differences in the above example. Indeed, you are correct; the SFTTrainer class inherits from the Trainer class if you examine the source code. Essentially, the same model, train_dataset, evaluation dataset, and collator are required.
However, the SFTTrainer class introduces several features that facilitate training when working with LLMs:
- Support for `peft`: The SFTTrainer includes support for the Parameter-Efficient Fine-Tuning (peft) library, which covers methods such as LoRA, QLoRA, and others. LoRA injects small adapter weights into the model; only these adapters are fine-tuned, while the rest of the weights stay frozen. QLoRA is the quantized variant, reducing memory usage even further. Both methods significantly enhance fine-tuning efficiency, which is particularly relevant given the high computational costs associated with LLM fine-tuning.
- Batch `packing`: Rather than relying on the Tokenizer to pad each sentence to the model's maximum supported length, packing concatenates several short examples into a single sequence, so batches contain far fewer padding tokens and training throughput improves. Both features are shown in the sketch after this list.
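A sketch of enabling both features is below; the LoRA hyperparameters are only illustrative, and the exact argument placement (directly on SFTTrainer versus on an SFTConfig) has changed across trl releases:

```python
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer

model_name = "facebook/opt-350m"  # illustrative checkpoint
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

dataset = load_dataset("imdb", split="train")

# Illustrative LoRA hyperparameters: only the low-rank adapter weights are trained.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=512,
    peft_config=peft_config,  # wraps the model in a LoRA adapter
    packing=True,             # concatenate short examples instead of padding each one
)
trainer.train()
```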
Conclusions
Fine-tuning has traditionally been a strategy to optimize the performance of Transformer architectures. Whether you need to train a Language Model for next-token prediction or masking, or if you aim to train a sequence or token classifier, fine-tuning is carried out in a supervised manner, necessitating data labeled by humans.
For Large Language Models, which are typically trained on extensive, openly available datasets, customizing their responses to meet your specific needs is often essential.
There are numerous methods to fine-tune a Large Language Model. One of the most straightforward options is the Trainer class from the transformers library, which has been employed for quite some time to fine-tune various transformer-based models.
Recently, Hugging Face introduced the trl library to facilitate Reinforcement Learning from Human Feedback training. A key step in this training process is Supervised Fine-Tuning, for which they have made available the SFTTrainer class, which not only streamlines the procedure but also includes parameter-efficient (peft) and packing optimizations.
Want to Know More?
If you'd like assistance in fine-tuning your Language Model, reach out to us at [email protected].