πŸ›» Training with QLoRA

Fine-tuning models on consumer hardware

Thanks to Benjamin Marie's article; all of the content here comes from it.

Overview

I am interested in almost everything about running AI on consumer hardware, so let's take a look at this article.

It introduces the QLoRa technique for fine-tuning large language models on consumer hardware.

Here are some numbers from the article: fine-tuning a 65-billion-parameter model normally requires more than 780 GB of GPU memory, which is equivalent to ten A100 80 GB GPUs.
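
As a rough sanity check on that number (assuming, on my part, 16-bit weights and gradients plus 32-bit Adam optimizer states, roughly 12 bytes per parameter for standard fine-tuning):

# Back-of-the-envelope estimate for standard 16-bit fine-tuning (assumed breakdown):
# 2 bytes (fp16 weights) + 2 bytes (gradients) + 8 bytes (Adam optimizer states)
params = 65e9
bytes_per_param = 2 + 2 + 8
print(params * bytes_per_param / 1e9, "GB")   # 780.0 GB, matching the figure above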

With QLoRa, we can do this with only one A100. The computing resource used in that article is an NVIDIA RTX 3060 12 GB (I do not think Google Colab has a free instance for this), and the model used in the example is a GPT model with 20 billion parameters.

What is QLoRa?

QLoRa stands for Quantized LLMs with Low-Rank Adapters. The article mentions that "LoRa adds a tiny amount of trainable parameters, i.e., adapters, for each layer of the LLM and freezes all the original parameters. For fine-tuning, we only have to update the adapter weights which significantly reduces the memory footprint".
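
To make the idea concrete, here is a minimal hand-rolled sketch of a LoRa adapter wrapped around a frozen linear layer. This is my own illustration, not the PEFT implementation used later in this post:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a small trainable low-rank update."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)          # freeze the original weights
        # Low-rank factors: A is small random, B starts at zero so the
        # adapter is a no-op before training.
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        # frozen path + trainable low-rank path
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

layer = LoRALinear(nn.Linear(1024, 1024))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)   # 16,384 adapter parameters; the 1,049,600 base parameters stay frozen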

QLoRa goes three steps further by introducing:

4-bit NormalFloat quantization

This is a method that improves upon quantile quantization. It ensures an equal number of values in each quantization bin. This avoids computational issues and errors for outlier values.
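
A toy sketch of the idea (my own simplification, not the exact NF4 codebook used by bitsandbytes): build 16 levels from the quantiles of a standard normal distribution, so each level covers roughly the same share of normally distributed weights, then snap every weight to its nearest level.

import torch

# 16 levels taken from quantiles of N(0, 1), normalized to [-1, 1]
normal = torch.distributions.Normal(0.0, 1.0)
levels = normal.icdf(torch.linspace(0.02, 0.98, 16))   # avoid the infinite tails
levels = levels / levels.abs().max()

def quantize_nf4_like(w):
    """Store each weight as the 4-bit index of its nearest codebook level."""
    absmax = w.abs().max()                       # per-tensor quantization constant
    idx = ((w / absmax).unsqueeze(-1) - levels).abs().argmin(dim=-1)
    return idx.to(torch.uint8), absmax

def dequantize(idx, absmax):
    return levels[idx.long()] * absmax

w = torch.randn(8)
idx, absmax = quantize_nf4_like(w)
print(w)
print(dequantize(idx, absmax))                   # 4-bit approximation of w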

Double quantization

The authors of QLoRa define it as follows: "the process of quantizing the quantization constants for additional memory savings."
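
A toy sketch of what this means (my own simplification, with an assumed block size of 64): the per-block fp32 absmax constants produced by the first quantization step are themselves stored in 8-bit, with a single fp32 scale for all of them.

import torch

weights = torch.randn(4096 * 4096)
blocks = weights.view(-1, 64)                      # assumed block size of 64
absmax = blocks.abs().max(dim=1).values            # first-level constants (fp32)

# Second-level quantization: keep the constants in 8-bit plus one fp32 scale
c_scale = absmax.max()
absmax_q = torch.round(absmax / c_scale * 255).to(torch.uint8)
absmax_dq = absmax_q.float() / 255 * c_scale

print((absmax - absmax_dq).abs().max())            # small error from 8-bit constants
print(absmax.numel() * 4, absmax_q.numel() + 4)    # bytes for constants: before vs. after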

Paging with unified memory

It relies on the NVIDIA Unified Memory feature and automatically handles page-to-page transfers between the CPU and GPU. It ensures error-free GPU processing, especially in situations where the GPU may run out of memory.
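
Under the hood this is provided by bitsandbytes' paged optimizers; the training script later in this post enables it through optim="paged_adamw_8bit". As a standalone sketch (with a dummy model, since the real one is only loaded further down):

import torch
import bitsandbytes as bnb

# Paged 8-bit AdamW: optimizer states can be paged between GPU and CPU
# memory when VRAM runs short, instead of crashing with an OOM error.
dummy = torch.nn.Linear(16, 16).cuda()
optimizer = bnb.optim.PagedAdamW8bit(dummy.parameters(), lr=2e-4)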

Fine-tuning a GPT model with QLoRa

Hardware requirements for QLoRa

GPU

We mentioned it above (the article uses an NVIDIA RTX 3060 with 12 GB of VRAM).

RAM

At least 6 GB.

Hard drive

GPT-J and GPT-NeoX-20b are both very big models. I recommend at least 80 GB of free space.

Software requirements for QLoRa

We need CUDA and other dependencies:

bitsandbytes

A library that contains everything we need to quantize an LLM.

Hugging Face Transformers and Accelerate

These are standard libraries used to efficiently train models from the Hugging Face Hub.

PEFT

A library that provides implementations of various methods for fine-tuning only a small number of (extra) model parameters. It supports LoRa.

Datasets

This one is not a requirement. We will only use it to get a dataset for fine-tuning. Of course, you can provide your own dataset instead.

Get all of them

pip install -q -U bitsandbytes
pip install -q -U git+https://github.com/huggingface/transformers.git 
pip install -q -U git+https://github.com/huggingface/peft.git
pip install -q -U git+https://github.com/huggingface/accelerate.git
pip install -q datasets

The Python script

Loading and Quantization of a GPT model

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

Here we will fine-tune the GPT NeoX model pre-trained by EleutherAI. This is a model with 20 billion parameters. Note: GPT NeoX has a permissive license (Apache 2.0) that allows commercial use.

Get the model and the associated tokenizer

model_name = "EleutherAI/gpt-neox-20b"

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

Then, we need to detail the configuration of the quantizer, as follows:

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
  • load_in_4bit: The modeal will loaded in the memory with 4-bit precision

  • bnb_4bit_use_double_quant: We will do the double quantization proposed by QLoRa.

  • bnb_4bit-quant_type: This is the type of quantization. "nf4" stands for 4-bit NormalFloat.

  • bnb_4bit_compute_dtype: While we load the store the model in 4-bit, we will partially dequantize it when needed and do all the computations with 16-bit orecision (bfloat16).

Loading the model in 4-bit

model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=quant_config, device_map={"": 0})

Then, we enable gradient checkpointing:

model.gradient_checkpointing_enable()

Preprocessing the GPT model for LoRa

Using PEFT, we prepare the model for LoRa, adding trainable adapters to each layer.

from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model

model = prepare_model_for_kbit_training(model)

config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["query_key_value"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)

In LoraConfig, you can play with r, alpha, and dropout to obtain better results on your task. You can find more details and options in the PEFT repo.

With LoRa, we add 8 million parameters. We will only train these parameters and freeze everything else. Fine-tuning should be fast.
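
You can verify the number of trainable parameters with PEFT's built-in helper (the exact count depends on the LoRa configuration):

# Print the trainable (adapter) parameter count vs. the frozen total
model.print_trainable_parameters()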

Dataset

Here the author uses the "english_quotes" dataset. This is a dataset made of famous quotes distributed under a CC BY 4.0 license.

from datasets import load_dataset
data = load_dataset("Abirate/english_quotes")
data = data.map(lambda samples: tokenizer(samples["quote"]), batched=True)

Fine-tuning GPT-NeoX-20B with QLoRa

Finally, the fine-tuning with Hugging Face Transformers is very standard.

import transformers

tokenizer.pad_token = tokenizer.eos_token

trainer = transformers.Trainer(
    model=model,
    train_dataset=data["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        warmup_steps=2,
        max_steps=20,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit"
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

Don't forget optim="paged_adamw_8bit". It activates paging for better memory management. Without it, we get out-of-memory errors.

The VRAM consumption should peak at 15 GB.
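
If you want to check this on your own run, PyTorch can report its peak allocation (a rough proxy, since the CUDA context and memory fragmentation add some overhead on top):

# Peak GPU memory allocated by PyTorch during training, in GB
print(torch.cuda.max_memory_allocated(0) / 1024**3)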

GPT Inference with QLoRa

The QLoRa model we fine-tuned can be used directly with standard Hugging Face Transformers inference, as follows:

text = "Ask not what your country"
device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)

outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0],skip_special_tokens=True))

You should get this quote as output:

Ask not what your country can do for you, ask what you can do for your country.”

– John F.

Conclusion

Thanks to new techniques like LoRa, QLoRa, PEFT, and DeepSpeed, we can fine-tune models with billions of parameters without relying on cloud computing and, according to the papers introducing these techniques, without a significant drop in performance.

Credit

All content is based on Benjamin Marie's article on QLoRa.
