Fine-Tuning a Transformer Model
Suppose you have some training data you’d like to use to update a language model for a specific task: in this case, I’d like to “teach” the model that two passages of text are semantically similar, to help me with similarity search.
We’ll use the (wonderful) sentence-transformers library, and I’ll assume you have some training data of the form
data = [
    ['sentence 0', 'very similar to sentence 0'],
    ['sentence 1', 'very similar to sentence 1'],
    ...
]
Import necessary libraries:
from numpy.random import permutation, random
from sentence_transformers import SentenceTransformer, SentencesDataset, InputExample
from sentence_transformers import evaluation, losses, util
from torch.utils.data import DataLoader
Our first step will be to process the training data to get it ready for the model. Specifically, we need to assign a “score” for how similar each sentence pair is. If you have human-generated scores, those are preferable – if not, we can randomly assign one as shown below. Here the scores are drawn uniformly around a mean of 0.9, with a total spread of 1/scaling = 0.05 (so all values land in [0.875, 0.925]).
mean = 0.9
scaling = 20.0
scores = mean + (random(len(data)) - 0.5) / scaling
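To see concretely what this formula produces, here is a small self-contained check (on dummy scores, seeded with numpy's Generator API for reproducibility) confirming that the values land in a narrow band of width 1/scaling around the mean:

```python
import numpy as np

mean = 0.9
scaling = 20.0

# Same formula as above, applied to 10,000 dummy draws with a fixed seed.
rng = np.random.default_rng(0)
scores = mean + (rng.random(10_000) - 0.5) / scaling

# rng.random() is uniform on [0, 1), so every score falls in
# [mean - 1/(2*scaling), mean + 1/(2*scaling)] = [0.875, 0.925).
print(round(scores.min(), 3), round(scores.max(), 3))
```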
In the absence of better scoring, this is a reasonable substitute. Reducing the mean or widening the spread expresses less confidence in the similarity of each pair, and may help avoid overfitting. Now, let’s build data structures suitable for the trainer and split the dataset into training and evaluation sets. I first permute the data so that if I run this again, I get a different training / evaluation split.
fract_train = 0.8
pairs = [InputExample(texts=d, label=scores[ix]) for ix, d in enumerate(data)]
perm = permutation(len(pairs))
num_train_pairs = int(len(pairs) * fract_train)
train_pairs = [pairs[perm[ix]] for ix in range(num_train_pairs)]
# Take everything that remains, so integer rounding can't drop examples
eval_pairs = [pairs[perm[ix]] for ix in range(num_train_pairs, len(pairs))]
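As a quick sanity check of the splitting logic, the same pattern on toy data (plain integers standing in for the InputExample pairs) shows the two sets are disjoint and, when the evaluation slice runs to the end of the list, that nothing is lost to integer rounding:

```python
from numpy.random import permutation

# Toy stand-ins for the InputExample pairs.
toy_pairs = list(range(10))
fract_train = 0.8

toy_perm = permutation(len(toy_pairs))
n_train = int(len(toy_pairs) * fract_train)
toy_train = [toy_pairs[toy_perm[ix]] for ix in range(n_train)]
# Slicing to len(toy_pairs) guards against int() truncation dropping items.
toy_eval = [toy_pairs[toy_perm[ix]] for ix in range(n_train, len(toy_pairs))]

print(len(toy_train), len(toy_eval))  # 8 2
```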
Now, we define our model and its corresponding loss function. The model must exist before we can build the data loader, since SentencesDataset uses it to prepare the text.
model = SentenceTransformer('all-MiniLM-L6-v2',
                            cache_folder='./')
train_loss = losses.MultipleNegativesRankingLoss(model=model,
                                                 similarity_fct=util.dot_score,
                                                 scale=1)
Note that MultipleNegativesRankingLoss treats each pair as a positive and the other pairs in the batch as negatives, ignoring the label scores; if you want the scores to drive training directly, losses.CosineSimilarityLoss uses them. With the model in hand, we can set up our data loader:
train_batch_size = 32
train_dataset = SentencesDataset(train_pairs, model)
train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=train_batch_size)
We’re ready to fit the model:
warmup_steps = 100
num_epochs = 8
model.fit(train_objectives=[(train_dataloader, train_loss)],
epochs=num_epochs,
warmup_steps=warmup_steps,
output_path='./new-model/')