Training with QLoRA
Fine-tuning models on consumer hardware
Thanks to Benjamin Marie for his article; all of the content here comes from it.
Overview
I am interested in almost everything about running AI on consumer hardware, so let's look at this article.
It introduces the QLoRa technique.
Here are some figures from the article. Fine-tuning a 65-billion-parameter model requires more than 780 GB of GPU memory, which is equivalent to ten A100 80 GB GPUs.
With QLoRa, we can do this with only one A100. The GPU used in the article is an Nvidia RTX 3060 12 GB (I do not think that Google Colab has a free instance), and the model used in the example is a GPT model with 20 billion parameters.
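The figures above can be checked with a few lines of arithmetic. This is only a sketch: it assumes 1 GB = 10^9 bytes and 80 GB per A100, and the last line counts the 4-bit weights alone (no adapters, activations, or optimizer state).

```python
import math

PARAMS = 65e9      # 65 billion parameters (figure from the article)
TOTAL_GB = 780     # GPU memory quoted for regular fine-tuning
A100_GB = 80       # assumption: one A100 with 80 GB of memory

# Memory per parameter implied by the 780 GB figure
bytes_per_param = TOTAL_GB * 1e9 / PARAMS

# Number of 80 GB cards needed to hold 780 GB
gpus_needed = math.ceil(TOTAL_GB / A100_GB)

# Weights alone at 4 bits (0.5 byte) per parameter
four_bit_weights_gb = PARAMS * 0.5 / 1e9

print(f"{bytes_per_param:.1f} bytes per parameter")
print(f"{gpus_needed} A100 80 GB GPUs")
print(f"{four_bit_weights_gb:.1f} GB for the 4-bit weights alone")
```

The 4-bit weights fit comfortably on a single large GPU, which is what makes the one-GPU claim plausible.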
What is QLoRa?
QLoRa stands for Quantized LLMs with Low-Rank Adapters. The article notes that "LoRa adds a tiny amount of trainable parameters, i.e., adapters, for each layer of the LLM and freezes all the original parameters. For fine-tuning, we only have to update the adapter weights which significantly reduces the memory footprint".
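To make the adapter idea concrete, here is a minimal pure-Python sketch of a LoRa-style layer. The toy sizes and the names `W`, `A`, `B`, `alpha`, and `r` are chosen for illustration and are not taken from the article. The frozen weight `W` is left untouched, and only the two small matrices `A` and `B` would be trained. `B` starts at zero, so the adapted layer initially behaves exactly like the original one.

```python
import random

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def lora_forward(W, A, B, x, alpha, r):
    """h = W x + (alpha / r) * B (A x); W is frozen, A and B are the adapter."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

d, k, r, alpha = 16, 16, 2, 4   # toy sizes, for illustration only
random.seed(0)
W = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]     # frozen, d x k
A = [[random.gauss(0, 0.01) for _ in range(k)] for _ in range(r)]  # trainable, r x k
B = [[0.0] * r for _ in range(d)]                                  # trainable, d x r, zero init

x = [1.0] * k
full_params = d * k            # parameters in the frozen matrix
adapter_params = r * (d + k)   # parameters in the adapter

print(full_params, adapter_params)
```

Even at this toy size the adapter holds a quarter of the parameters of the full matrix, and the gap grows quickly with larger layers.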
QLoRa goes three steps further by introducing:
4-bit NormalFloat quantization
This is a method that improves upon quantile quantization. It ensures an equal number of values in each quantization bin. This avoids computational issues and errors for outlier values.
Double quantization
The authors of QLoRa define it as follows: "the process of quantizing the quantization constants for additional memory savings."
Paging with unified memory
It relies on the NVIDIA Unified Memory feature and automatically handles page-to-page transfers between the CPU and GPU. It ensures error-free GPU processing, especially in situations where the GPU may run out of memory.
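The first two ideas can be illustrated with a small sketch. The first part builds equal-count bins from sample data, which is plain quantile quantization; the NF4 data type itself uses a fixed set of levels derived from a normal distribution, so treat this as a simplification. The second part computes the memory overhead of the quantization constants using the block sizes reported in the QLoRA paper (64 weights per block, 256 constants per second-level block), which are not stated in the article.

```python
import random
from bisect import bisect_right
from collections import Counter

def quantile_edges(values, n_bins):
    """Inner bin edges chosen so that every bin holds the same number of values."""
    ordered = sorted(values)
    step = len(ordered) / n_bins
    return [ordered[int(i * step)] for i in range(1, n_bins)]

def quantize(value, edges):
    """Return the index of the bin that contains the value."""
    return bisect_right(edges, value)

random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(1600)]
edges = quantile_edges(weights, n_bins=16)   # 16 bins = 4 bits
counts = Counter(quantize(w, edges) for w in weights)

# Overhead of the quantization constants, in bits per model parameter
BLOCK, CONST_BITS = 64, 32      # one 32-bit constant per block of 64 weights
BLOCK2, CONST2_BITS = 256, 8    # constants re-quantized to 8 bits in blocks of 256

plain = CONST_BITS / BLOCK
double = CONST2_BITS / BLOCK + CONST_BITS / (BLOCK * BLOCK2)
saved_gb_65b = (plain - double) * 65e9 / 8 / 1e9

print(sorted(counts.values()))
print(f"{plain:.3f} -> {double:.3f} bits per parameter")
print(f"about {saved_gb_65b:.1f} GB saved on a 65B model")
```

Because the bin edges follow the data, outliers end up in the first or last bin instead of stretching a uniform grid.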
Fine-tuning a GPT model with QLoRa
Hardware requirements for QLoRa
GPU
As mentioned in the Overview, the article uses an Nvidia RTX 3060 12 GB.
RAM
At least 6 GB
Hard drive
GPT-J and GPT-NeoX-20b are both very big models. I recommend at least 80 GB of free space.
Software requirements for QLoRa
We need CUDA and other dependencies:
bitsandbytes
A library that contains all we need to quantize an LLM
Hugging Face Transformers and Accelerate
These are standard libraries that are used to efficiently train models from Hugging Face hub.
PEFT
A library that provides implementations of various methods for fine-tuning only a small number of (extra) model parameters. It supports LoRa.
Datasets
This one is not a requirement. We will only use it to get a dataset for fine-tuning. Of course, you can provide your own dataset instead.
Get all of them
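As a quick check before going further, the following sketch reports which of the libraries listed above are missing from the current environment. It assumes the import names match the package names, and the helper name `missing_packages` is made up for this example.

```python
import importlib.util

# Libraries listed above (import names assumed equal to package names)
REQUIRED = ["bitsandbytes", "transformers", "accelerate", "peft", "datasets"]

def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_packages(REQUIRED)
if missing:
    # The usual way to install them is with pip
    print("Missing, install with: pip install " + " ".join(missing))
else:
    print("All required libraries are installed.")
```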
The Python script
Loading and Quantization of a GPT model
Here we will fine-tune the GPT NeoX model pre-trained by EleutherAI. This is a model with 20 billion parameters. Note: GPT NeoX has a permissive license (Apache 2.0) that allows commercial use.
Get the model and the associated tokenizer
Then, we need to detail the configuration of the quantizer, as follows:
load_in_4bit: The model will be loaded into memory with 4-bit precision.
bnb_4bit_use_double_quant: We will do the double quantization proposed by QLoRa.
bnb_4bit_quant_type: This is the type of quantization. "nf4" stands for 4-bit NormalFloat.
bnb_4bit_compute_dtype: While we load and store the model in 4-bit, we will partially dequantize it when needed and do all the computations with 16-bit precision (bfloat16).
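The four options above could be written as follows. This is a sketch rather than the article's exact code: the settings are kept in a plain dictionary, and the helper `build_quant_config` (a name chosen for this example) imports `torch` and `transformers` only when it is called.

```python
# Quantizer settings described above, kept as plain Python values
QUANT_KWARGS = {
    "load_in_4bit": True,                  # load the model in 4-bit precision
    "bnb_4bit_use_double_quant": True,     # double quantization
    "bnb_4bit_quant_type": "nf4",          # 4-bit NormalFloat
    "bnb_4bit_compute_dtype": "bfloat16",  # dtype name, resolved to a torch dtype below
}

def build_quant_config():
    """Build a BitsAndBytesConfig from QUANT_KWARGS (needs torch and transformers)."""
    import torch
    from transformers import BitsAndBytesConfig

    kwargs = dict(QUANT_KWARGS)
    # Turn the dtype name into the actual torch dtype (torch.bfloat16)
    kwargs["bnb_4bit_compute_dtype"] = getattr(torch, kwargs["bnb_4bit_compute_dtype"])
    return BitsAndBytesConfig(**kwargs)
```

Calling `build_quant_config()` returns the object that is later passed to the model loader as `quantization_config`.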
Loading the model in 4-bit
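A minimal sketch of the loading step, assuming the Hub ID `EleutherAI/gpt-neox-20b` and a single GPU at index 0. The helper name `load_tokenizer_and_model` is made up for this example, and `quant_config` is expected to be a `BitsAndBytesConfig` built from the options listed above.

```python
MODEL_NAME = "EleutherAI/gpt-neox-20b"  # assumed Hugging Face Hub ID

def load_tokenizer_and_model(quant_config, model_name=MODEL_NAME):
    """Download the tokenizer and load the model quantized on GPU 0."""
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=quant_config,  # 4-bit settings from bitsandbytes
        device_map={"": 0},                # place the whole model on GPU 0
    )
    return tokenizer, model
```

The first call downloads the checkpoint, which is why the free disk space mentioned earlier matters.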
Then, we enable gradient checkpointing:
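Gradient checkpointing trades compute for memory: some activations are recomputed during the backward pass instead of being stored. With Transformers models it is a single method call; the wrapper name `enable_checkpointing` below is only for illustration.

```python
def enable_checkpointing(model):
    """Turn on gradient checkpointing for a Transformers model and return it."""
    model.gradient_checkpointing_enable()
    return model
```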
Preprocessing the GPT model for LoRa
Using PEFT, we prepare the model for LoRa by adding trainable adapters for each layer.
In LoraConfig, you can play with r, alpha, and dropout to obtain better results on your task. More details and options are in the PEFT repo.
With LoRa, we add 8 million parameters. We will only train these parameters and freeze everything else. Fine-tuning should be fast.
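Here is a sketch of what this step might look like. The hyperparameter values and the `query_key_value` target module are assumptions for illustration, not values confirmed by the text above. The last lines check the "8 million" figure under those assumptions, using the published GPT-NeoX-20B shape (hidden size 6144, 44 layers).

```python
# Assumed LoRa hyperparameters; tune r, lora_alpha and lora_dropout for your task
LORA_KWARGS = {
    "r": 8,
    "lora_alpha": 32,
    "target_modules": ["query_key_value"],  # attention projection in GPT-NeoX
    "lora_dropout": 0.05,
    "bias": "none",
    "task_type": "CAUSAL_LM",
}

def add_lora_adapters(model):
    """Prepare a quantized model for training and attach LoRa adapters (needs peft)."""
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

    model = prepare_model_for_kbit_training(model)
    return get_peft_model(model, LoraConfig(**LORA_KWARGS))

# Adapter size: each targeted layer gets r * (inputs + outputs) extra parameters
HIDDEN, LAYERS = 6144, 44   # GPT-NeoX-20B shape (assumption, see lead-in)
QKV_OUT = 3 * HIDDEN        # query, key and value are fused into one projection
adapter_params = LAYERS * LORA_KWARGS["r"] * (HIDDEN + QKV_OUT)

print(f"{adapter_params:,} trainable parameters")
```

Under these assumptions the count comes to about 8.65 million, consistent with the figure quoted above.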
Dataset
Here the author uses the "english_quotes" dataset. This is a dataset made of famous quotes distributed under a CC BY 4.0 license.
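A sketch of loading and tokenizing the dataset with the Datasets library. The Hub ID `Abirate/english_quotes` and the `quote` column name are assumptions, and both helper names are made up for this example.

```python
def tokenize_quotes(batch, tokenizer):
    """Tokenize the 'quote' column of a batch of examples."""
    return tokenizer(batch["quote"])

def load_quotes(tokenizer, dataset_name="Abirate/english_quotes"):
    """Download the dataset and tokenize every quote (needs the datasets library)."""
    from datasets import load_dataset

    data = load_dataset(dataset_name)
    return data.map(lambda batch: tokenize_quotes(batch, tokenizer), batched=True)
```

To use your own data, pass a different `dataset_name` and adjust the column name in `tokenize_quotes`.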
Fine-tuning GPT-NeoX-20B with QLoRa
Finally, the fine-tuning with Hugging Face Transformers is very standard.
Don't forget optim="paged_adamw_8bit". It activates the paging for better memory management. Without it, we get out-of-memory errors.
The VRAM consumption should peak at 15 GB.
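A sketch of a training setup along these lines. Apart from `optim="paged_adamw_8bit"`, every value below is an assumption chosen for a short demo run, and the helper name `build_trainer` is made up for this example.

```python
TRAINING_KWARGS = {
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 8,
    "warmup_steps": 2,
    "max_steps": 20,              # very short run, for demonstration only
    "learning_rate": 2e-4,
    "logging_steps": 1,
    "output_dir": "outputs",
    "optim": "paged_adamw_8bit",  # paged optimizer, avoids out-of-memory errors
}

def build_trainer(model, tokenizer, train_dataset):
    """Create a Trainer for causal language modeling (needs transformers)."""
    import transformers

    tokenizer.pad_token = tokenizer.eos_token  # use the end-of-sequence token for padding
    return transformers.Trainer(
        model=model,
        train_dataset=train_dataset,
        args=transformers.TrainingArguments(**TRAINING_KWARGS),
        data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )

# Examples seen per optimizer step on one GPU
effective_batch = (
    TRAINING_KWARGS["per_device_train_batch_size"]
    * TRAINING_KWARGS["gradient_accumulation_steps"]
)
print(effective_batch)
```

Calling `build_trainer(model, tokenizer, data["train"]).train()` would then start the fine-tuning, where `data` is the tokenized dataset.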
GPT Inference with QLoRa
The QLoRa model we fine-tuned can be used directly with standard Hugging Face Transformers inference, as follows:
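A minimal sketch of the inference call, assuming the model sits on `cuda:0`. The helper name `generate_text` and the default of 20 new tokens are choices made for this example.

```python
def generate_text(model, tokenizer, prompt, max_new_tokens=20, device="cuda:0"):
    """Generate a continuation of the prompt with the fine-tuned model."""
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```

Pass the beginning of a quote as `prompt` and the model should continue it.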
You should get this quote as output:
Conclusion
Thanks to new techniques like LoRa, QLoRa, PEFT and DeepSpeed, we can fine-tune models with billions of parameters without relying on cloud computing and, according to the papers on these techniques, without a significant drop in performance.
Credit