Language models have quickly become cornerstones of many business applications, and their usefulness is proven daily by the many people who interact with them. As language models continue to find their place in people's lives, the community has made many breakthroughs to improve models' capabilities, primarily through fine-tuning.
Fine-tuning is the process of adapting a pre-trained language model to specific downstream tasks by further training it on a relevant dataset. The process leverages the base model's knowledge and incorporates insight from the new dataset to customize the model for more focused applications.
There are several different methodologies for fine-tuning language models. In this article, we will explore three easy ways to do that.
Let’s get into it!
Full Fine-Tuning
Full fine-tuning is a technique for adapting pre-trained models by updating all the weights or parameters. It optimizes the pre-trained model fully for specific downstream tasks such as sentiment analysis, question answering, translation, and more.
As all the parameters within the model are updated, the model can fully adapt to the specific task and achieve state-of-the-art (SOTA) performance. However, the process requires much more computational power, especially with large language models. Moreover, catastrophic forgetting, where a model loses pre-trained knowledge while learning a new task, can occur.
Nevertheless, it is still an important method to learn. Let's start with full fine-tuning by installing the essential packages, which you can do with the following command.
pip install transformers datasets peft
We will also use PyTorch in our work, so select and install the version that is most appropriate for the system.
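For example, the default build can be installed with the command below, though you should check pytorch.org for the variant matching your hardware and CUDA version.

pip install torch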
We will fine-tune the language model for the sentiment analysis task using the IMDB sample dataset for this example. It contains IMDB reviews labeled as negative (0) or positive (1).
from datasets import load_dataset

dataset = load_dataset("imdb")
We will not use the full dataset, as it would take too long to fine-tune. Instead, we will use small subsets for the training and test data.
train_subset = dataset["train"].shuffle(seed=42).select(range(500))
test_subset = dataset["test"].shuffle(seed=42).select(range(100))
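It can be helpful to peek at an example before training; a quick check using the subset we just created:

print(train_subset[0]["text"][:100])  # first 100 characters of the review
print(train_subset[0]["label"])       # 0 = negative, 1 = positive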
Next, we will prepare the pre-trained language model and tokenizer. For our example, we will use the standard BERT model.
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)
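Since full fine-tuning updates every weight in the network, it is worth checking how many that is; a quick count using the model we just loaded:

total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total_params:,}")  # roughly 110 million for bert-base-uncased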
We then tokenize our dataset using the tokenize_function we prepared above.
tokenized_train = train_subset.map(tokenize_function, batched=True)
tokenized_test = test_subset.map(tokenize_function, batched=True)
Next, we will prepare training arguments to direct the training process. For our example, we will use the simplest process with one epoch, as we want to see the results of a quick training process.
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    num_train_epochs=1,
    weight_decay=0.01,
)
Once everything is ready, we will set up the training object and start the full fine-tuning process.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
)

trainer.train()
trainer.evaluate()
Output:
{'eval_loss': 0.6262330412864685, 'eval_runtime': 1.4327, 'eval_samples_per_second': 69.798, 'eval_steps_per_second': 9.074, 'epoch': 1.0}
As we can see, the full fine-tuning process produced a working model on the dataset we provided. Because we used a small subset and a single epoch, the run was fast and did not take much memory. As you might guess, though, the process takes much longer, and requires far more resources, on a bigger dataset.
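Note that trainer.evaluate() only reports the loss and runtime by default. If you also want a task metric such as accuracy, you can pass a compute_metrics function when constructing the Trainer; a minimal sketch:

import numpy as np

def compute_metrics(eval_pred):
    # eval_pred is a (logits, labels) tuple supplied by the Trainer
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": (predictions == labels).mean()}

# pass it when building the Trainer:
# trainer = Trainer(model=model, args=training_args, train_dataset=tokenized_train,
#                   eval_dataset=tokenized_test, compute_metrics=compute_metrics)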
And this is why we now turn our attention to the following technique, PEFT.
Parameter-Efficient Fine-Tuning (PEFT)
Parameter-efficient fine-tuning (PEFT) is a language model fine-tuning technique specifically designed to update only a small portion of the model's parameters instead of all of them. It alleviates the computational cost and the catastrophic forgetting problem that full fine-tuning has.
PEFT is a perfect technique for working with LLMs when resources are constrained. Because the base model's weights stay frozen, the same model can be reused across multiple tasks by swapping out the small task-specific components.
The most famous technique within PEFT is LoRA (Low-Rank Adaptation). It adapts a pre-trained model by injecting trainable low-rank matrices into the model's layers to modify their behavior while keeping the original parameters frozen. The technique has proven effective at adapting pre-trained models at a fraction of the cost of full fine-tuning.
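To build intuition for what LoRA does, here is a minimal PyTorch sketch of the idea, not the peft library's actual implementation: a frozen linear layer whose output is adjusted by a trainable low-rank update.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, linear: nn.Linear, r=8, alpha=32):
        super().__init__()
        self.linear = linear
        for p in self.linear.parameters():
            p.requires_grad_(False)  # freeze the original weights
        # low-rank factors: only these small matrices are trained
        self.lora_A = nn.Parameter(torch.randn(r, linear.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(linear.out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        # original output plus the scaled low-rank update (B @ A) applied to x
        return self.linear(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

Because only the small lora_A and lora_B matrices receive gradients, the number of trainable parameters drops dramatically compared to updating the full weight matrix.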
Let’s try PEFT with a code example.
First, we will use the same dataset as in the previous example. This time, however, we will bring in the peft library, as shown in the code below.
from peft import get_peft_model, LoraConfig, PeftType
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
To train with PEFT, we set up a LoRA configuration and wrap the pre-trained model with it. You can play around with the LoRA parameters to see how they affect the model's output.
peft_config = LoraConfig(
    peft_type=PeftType.LORA,
    task_type="SEQ_CLS",
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
)

peft_model = get_peft_model(model, peft_config)
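You can verify how small the trainable portion really is with a helper that the peft library provides:

peft_model.print_trainable_parameters()

This prints the number of trainable parameters against the total, which for this configuration is only a small fraction of the full model.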
Next, we will tokenize the dataset and set up the model training arguments.
tokenized_train = train_subset.map(tokenize_function, batched=True)
tokenized_test = test_subset.map(tokenize_function, batched=True)

training_args = TrainingArguments(
    output_dir="./peft_results",
    eval_strategy="epoch",
    learning_rate=1e-4,
    per_device_train_batch_size=8,
    num_train_epochs=1,
)
Lastly, we will fine-tune the model using PEFT with the code below.
trainer = Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
)

trainer.train()
trainer.evaluate()
Output:
{'eval_loss': 0.6886218190193176, 'eval_runtime': 1.5295, 'eval_samples_per_second': 65.382, 'eval_steps_per_second': 8.5, 'epoch': 1.0}
The results do not differ much from full fine-tuning yet, as we have only used a small data subset and one epoch. You will see increasingly different evaluation results as you vary the parameters.
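A practical benefit of PEFT is that only the small adapter weights need to be saved, so adapters can be swapped per task on top of the same base model. A sketch of saving and re-loading the adapter, with a hypothetical directory name:

# save only the LoRA adapter weights (a few megabytes, not the full model)
peft_model.save_pretrained("./lora_adapter")

# later: load the base model again and attach the adapter
from peft import PeftModel
base_model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
loaded_model = PeftModel.from_pretrained(base_model, "./lora_adapter")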
Instruction Tuning
Instruction tuning is a fine-tuning technique that teaches a pre-trained model to follow natural language directions for various tasks. Unlike the fine-tuning processes we have discussed thus far, instruction tuning usually does not focus on one specific task; instead, it uses a dataset spanning diverse tasks formatted as instructions with the expected output.
The intention behind instruction tuning is for the model to interpret and execute instructions it has not seen before, becoming better at generalizing to unseen tasks. Performance depends heavily on the quality of the instruction dataset, but it's the ideal approach when we want a more general-purpose model, which may initially seem incongruent with the concept of fine-tuning.
Let's try out instruction tuning with code. First, we will prepare the sample data. As developing an instruction dataset can take some time, we will create a few toy examples instead.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, Trainer, TrainingArguments
from datasets import Dataset

data = {
    "instruction": [
        "Summarize the following text in one sentence.",
        "Answer the question based on the text.",
    ],
    "input": [
        "The rain in Spain stays mainly in the plain.",
        "Who is the president of the United States who won the 2024 election?",
    ],
    "output": [
        "Rain in Spain falls in the plain.",
        "Donald Trump.",
    ],
}

dataset = Dataset.from_dict(data)
For the next part, we will need train and test datasets. As we only have two examples, I will use the first for training and the second for testing.
train_dataset = dataset.select(range(1))
eval_dataset = dataset.select(range(1, 2))
Next, we will prepare the pre-trained model we want to fine-tune. In this example, let's use a model from the T5 family, the small t5-small checkpoint.
model_name = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
Then, we will tokenize the dataset. For instruction tuning, we format each example into a single prompt that combines the instruction and input columns.
def preprocess_function(examples):
    inputs = [
        f"Instruction: {inst}\nInput: {inp}"
        for inst, inp in zip(examples["instruction"], examples["input"])
    ]
    labels = examples["output"]
    model_inputs = tokenizer(inputs, padding="max_length", truncation=True)
    labels = tokenizer(labels, padding="max_length", truncation=True)["input_ids"]
    model_inputs["labels"] = labels
    return model_inputs

tokenized_train = train_dataset.map(preprocess_function, batched=True)
tokenized_eval = eval_dataset.map(preprocess_function, batched=True)
Once everything is ready, we will instruction-tune our pre-trained model.
training_args = TrainingArguments(
    output_dir="./instruction_result",
    eval_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    num_train_epochs=1,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_eval,
)

trainer.train()
Output:
TrainOutput(global_step=1, training_loss=19.483064651489258, metrics={'train_runtime': 2.0692, 'train_samples_per_second': 0.483, 'train_steps_per_second': 0.483, 'total_flos': 135341801472.0, 'train_loss': 19.483064651489258, 'epoch': 1.0})
A proper evaluation would require a more extensive dataset, but for now, we have succeeded in performing instruction tuning on our simple examples.
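As a quick sanity check, we can also generate a prediction from the tuned model using the same prompt format we trained with; a minimal sketch:

prompt = "Instruction: Summarize the following text in one sentence.\nInput: The rain in Spain stays mainly in the plain."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))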
Conclusion
In this article, we have explored three easy ways to fine-tune language models: full fine-tuning, parameter-efficient fine-tuning (PEFT), and instruction tuning.
Chances are that language models will continue to get larger in the years to come. By fine-tuning these large foundational models, we increase their usefulness, and the resulting fine-tuned models become much more versatile.
I hope this has helped!