Training with QLoRA
Fine-tuning models on consumer hardware
Thanks to Benjamin Marie for his article; all of the content here comes from it.
I am interested in almost everything about AI on consumer hardware, so let's look at this article.
It introduces the QLoRA technique.
Here are some figures from the article. Fine-tuning a 65-billion-parameter model requires more than 780 GB of GPU memory, which is equivalent to ten A100 80 GB GPUs.
With QLoRA, we can do this with a single A100. The hardware used in the article is an Nvidia RTX 3060 with 12 GB of VRAM. (I do not think Google Colab offers a free instance for this.) The model used in the example is a GPT model with 20 billion parameters.
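The memory figures above can be checked with back-of-the-envelope arithmetic. The sketch below assumes 12 bytes per parameter for regular 16-bit fine-tuning with Adam (2 for the weight, 2 for its gradient, 8 for two 32-bit optimizer moments). That breakdown is my assumption, picked because it reproduces the roughly 780 GB quoted in the article, and it ignores activations.

```python
import math

PARAMS = 65_000_000_000   # 65-billion-parameter model
GB = 1e9                  # decimal gigabytes

# Assumed per-parameter cost of regular 16-bit fine-tuning with Adam:
# 2 bytes weight + 2 bytes gradient + 8 bytes optimizer state (two fp32 moments).
BYTES_PER_PARAM_FULL = 2 + 2 + 8

full_finetune_gb = PARAMS * BYTES_PER_PARAM_FULL / GB
a100_count = math.ceil(full_finetune_gb / 80)   # A100 cards with 80 GB each

# With 4-bit weights each parameter takes half a byte.
# Adapters, quantization constants and activations are not counted here.
weights_4bit_gb = PARAMS * 0.5 / GB

print(f"16-bit fine-tuning: ~{full_finetune_gb:.0f} GB, i.e. {a100_count} x A100 80 GB")
print(f"4-bit frozen weights: ~{weights_4bit_gb:.1f} GB")
```

Under these assumptions the frozen 4-bit weights alone come to about 32.5 GB, which is why a single 80 GB card becomes plausible.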
QLoRA stands for Quantized LLMs with Low-Rank Adapters. The article explains that "LoRa adds a tiny amount of trainable parameters, i.e., adapters, for each layer of the LLM and freezes all the original parameters. For fine-tuning, we only have to update the adapter weights which significantly reduces the memory footprint".
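To make the adapter idea concrete, here is a minimal pure-Python sketch of a linear layer with a low-rank adapter. It is not the PEFT implementation: the class name `LoRALinear`, the tiny matrix sizes, and the default `r` and `alpha` are illustrative assumptions. The only convention borrowed from the LoRA paper is that `B` starts at zero, so the adapted layer initially behaves exactly like the frozen one.

```python
import random

def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

class LoRALinear:
    """Frozen weight W (d_out x d_in) plus a trainable low-rank update B @ A."""

    def __init__(self, weight, r=2, alpha=4, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen, never updated during fine-tuning
        # Trainable adapter matrices: A is r x d_in, B is d_out x r.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(r)]
        self.B = [[0.0] * r for _ in range(d_out)]  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_parameters(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])

W = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
layer = LoRALinear(W, r=1)
print(layer.forward([1.0, 0.0, -1.0]))   # identical to W @ x while B is zero
print(layer.trainable_parameters(), "trainable values vs", 2 * 3, "frozen ones")
```

On a matrix this small the adapter is not smaller than the frozen weight; the saving only shows up when the layer dimensions are much larger than `r`.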
QLoRA goes three steps further by introducing:
4-bit NormalFloat: A method that improves upon quantile quantization. It ensures an equal number of values in each quantization bin, which avoids computational issues and errors for outlier values.
Double Quantization: The authors of QLoRA define it as "the process of quantizing the quantization constants for additional memory savings."
Paged Optimizers: These rely on the NVIDIA Unified Memory feature, which automatically handles page-to-page transfers between the CPU and GPU. This keeps GPU processing error-free in situations where the GPU would otherwise run out of memory.
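Two of the ideas above lend themselves to a small numeric sketch. The first part builds 4-bit bins that each hold the same share of normally distributed values, which is the "equal number of values in each bin" property; it is a simplification and not the actual NF4 code book, whose construction differs in its details. The second part counts the bits per parameter spent on quantization constants, assuming a block size of 64, 8-bit second-level constants, and groups of 256 as I recall them from the QLoRA paper. All function names are hypothetical.

```python
import bisect
import random
from statistics import NormalDist

def equal_mass_edges(bits=4):
    """Bin edges that split a standard normal into 2**bits equally likely bins."""
    n_bins = 2 ** bits
    normal = NormalDist()
    return [normal.inv_cdf(i / n_bins) for i in range(1, n_bins)]

def bin_counts(values, edges):
    """Count how many values fall into each bin delimited by the sorted edges."""
    counts = [0] * (len(edges) + 1)
    for value in values:
        counts[bisect.bisect_right(edges, value)] += 1
    return counts

def constant_overhead_bits(block_size=64, second_block_size=256, double_quant=True):
    """Extra bits per parameter spent on storing quantization constants."""
    if not double_quant:
        return 32 / block_size  # one fp32 constant per block of weights
    # 8-bit constants, plus one fp32 constant per group of second_block_size constants
    return 8 / block_size + 32 / (block_size * second_block_size)

rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(16000)]
counts = bin_counts(samples, equal_mass_edges())
print("smallest and largest bin:", min(counts), max(counts))
print("overhead without double quantization:", constant_overhead_bits(double_quant=False))
print("overhead with double quantization:", round(constant_overhead_bits(), 3))
```

With these assumed sizes the overhead drops from 0.5 to about 0.127 bits per parameter, which adds up across billions of parameters.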
GPU: As mentioned above, the article uses an Nvidia RTX 3060 with 12 GB of VRAM.
RAM: At least 6 GB.
Hard drive: GPT-J and GPT-NeoX-20b are both very big models. I recommend at least 80 GB of free space.
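Before downloading anything, the free-space recommendation can be checked with the standard library. This is a small sketch; the helper names are hypothetical, and the path should point at wherever the model files will actually be stored (the current directory is only a placeholder).

```python
import shutil

def free_gb(path="."):
    """Free space on the drive holding `path`, in decimal gigabytes."""
    return shutil.disk_usage(path).free / 1e9

def has_room_for_models(path=".", required_gb=80):
    """True if the drive has at least `required_gb` GB free."""
    return free_gb(path) >= required_gb

print(f"{free_gb():.1f} GB free; enough for the models: {has_room_for_models()}")
```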
We need CUDA and other dependencies:
bitsandbytes: A library that contains all we need to quantize an LLM.
Hugging Face Transformers and Accelerate: These are standard libraries used to efficiently train models from the Hugging Face Hub.
PEFT: A library that provides implementations of various methods for fine-tuning only a small number of (extra) model parameters. It supports LoRA.
Datasets: This one is not a requirement. We will only use it to get a dataset for fine-tuning. Of course, you can provide your own dataset instead.
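A quick way to see which of these dependencies are already present is to ask the installed package metadata. The sketch assumes the descriptions above refer to the PyPI packages `bitsandbytes`, `transformers`, `accelerate`, `peft`, and `datasets`; the helper name is hypothetical.

```python
from importlib import metadata

# Assumed PyPI names of the libraries described above.
REQUIRED = ("bitsandbytes", "transformers", "accelerate", "peft", "datasets")

def missing_packages(names=REQUIRED):
    """Return the subset of `names` that is not installed in this environment."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing

print("Missing:", missing_packages() or "nothing")
```

Anything reported as missing can usually be installed with pip under the same name.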
Here we will fine-tune the GPT NeoX model pre-trained by EleutherAI. This is a model with 20 billion parameters. Note: GPT NeoX has a permissive license (Apache 2.0) that allows commercial use.
Then, we need to detail the configuration of the quantizer, as follows:
load_in_4bit: The model will be loaded into memory with 4-bit precision.
bnb_4bit_use_double_quant: We will use the double quantization proposed by QLoRA.
bnb_4bit_quant_type: This is the type of quantization. "nf4" stands for 4-bit NormalFloat.
bnb_4bit_compute_dtype: While we load and store the model in 4-bit, we will partially dequantize it when needed and do all the computations with 16-bit precision (bfloat16).
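The four options above map directly onto `BitsAndBytesConfig` in Transformers. Below is a minimal sketch of loading the model with them. The Hub id `EleutherAI/gpt-neox-20b`, the helper name `load_quantized_model`, and placing everything on GPU 0 are my assumptions; the heavy imports sit inside the function so the settings can be inspected on a machine without a GPU.

```python
MODEL_NAME = "EleutherAI/gpt-neox-20b"  # assumed Hub id of the 20B GPT-NeoX model

QUANT_SETTINGS = {
    "load_in_4bit": True,                  # store the weights in 4-bit
    "bnb_4bit_use_double_quant": True,     # also quantize the quantization constants
    "bnb_4bit_quant_type": "nf4",          # 4-bit NormalFloat
    "bnb_4bit_compute_dtype": "bfloat16",  # resolved to torch.bfloat16 below
}

def load_quantized_model(model_name=MODEL_NAME, settings=QUANT_SETTINGS):
    """Load the tokenizer and the 4-bit quantized model (needs a CUDA GPU)."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    kwargs = dict(settings)
    kwargs["bnb_4bit_compute_dtype"] = getattr(torch, kwargs["bnb_4bit_compute_dtype"])
    quant_config = BitsAndBytesConfig(**kwargs)

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=quant_config,
        device_map={"": 0},  # assumption: place the whole model on the first GPU
    )
    return model, tokenizer
```

Calling `model, tokenizer = load_quantized_model()` triggers the download, so it is worth making sure the disk has enough room first.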
Then, we enable gradient checkpointing:
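With a Transformers model this is a single method call. The wrapper below is only a sketch (the function name is hypothetical) that records what the call trades off.

```python
def enable_gradient_checkpointing(model):
    """Turn on gradient checkpointing for a Hugging Face model.

    Activations are recomputed during the backward pass instead of all being
    kept in memory, which lowers memory use at the cost of extra compute.
    """
    model.gradient_checkpointing_enable()
    return model
```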
Using PEFT, we prepare the model for LoRA, adding trainable adapters to each layer.
In LoraConfig, you can play with r, alpha, and dropout to obtain better results on your task. More details and options are available in the PEFT repo.
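A minimal sketch of this step with PEFT follows. The specific values (`r=8`, `lora_alpha=32`, `lora_dropout=0.05`) and the choice of `query_key_value` as the target module are assumptions for illustration, not settings confirmed by the text above; `add_lora_adapters` is a hypothetical helper name.

```python
# Assumed, illustrative LoRA hyperparameters; tune r, lora_alpha and
# lora_dropout for your own task.
LORA_SETTINGS = {
    "r": 8,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": ["query_key_value"],  # assumed attention projection name
    "bias": "none",
    "task_type": "CAUSAL_LM",
}

def add_lora_adapters(model, settings=LORA_SETTINGS):
    """Prepare a quantized model for training and wrap it with LoRA adapters."""
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

    model = prepare_model_for_kbit_training(model)
    peft_model = get_peft_model(model, LoraConfig(**settings))
    peft_model.print_trainable_parameters()  # reports trainable vs. total parameters
    return peft_model
```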
With LoRA, we add about 8 million parameters. We will only train these parameters and freeze everything else, so fine-tuning should be fast.
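The 8 million figure can be reproduced by counting adapter weights. The sketch assumes a hidden size of 6144, 44 layers, rank 8, and one adapter per layer on the fused query-key-value projection (input 6144, output 3 x 6144). These dimensions are my recollection of the GPT-NeoX-20B configuration, not something stated above.

```python
HIDDEN_SIZE = 6144   # assumed GPT-NeoX-20B hidden size
NUM_LAYERS = 44      # assumed number of transformer layers
RANK = 8             # assumed LoRA rank r

def lora_params_per_layer(d_in, d_out, r):
    """A is r x d_in and B is d_out x r, so the adapter holds r * (d_in + d_out) weights."""
    return r * (d_in + d_out)

per_layer = lora_params_per_layer(HIDDEN_SIZE, 3 * HIDDEN_SIZE, RANK)
total = per_layer * NUM_LAYERS
print(f"{per_layer:,} per layer, {total:,} in total")
```

Under these assumptions the total is 8,650,752, a tiny fraction of the 20 billion frozen parameters.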
Here the author uses the "english_quotes" dataset. This is a dataset made of famous quotes, distributed under a CC BY 4.0 license.
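Loading and tokenizing the quotes might look like the sketch below. The Hub id `Abirate/english_quotes` and the `quote` column name are assumptions about where and how the dataset is published; the helper names are hypothetical.

```python
DATASET_ID = "Abirate/english_quotes"  # assumed Hub id of the english_quotes dataset

def tokenize_batch(batch, tokenizer):
    """Tokenize the text column of one batch (assumes a 'quote' column)."""
    return tokenizer(batch["quote"])

def load_quotes(tokenizer, dataset_id=DATASET_ID):
    """Download the dataset and add tokenized columns to every split."""
    from datasets import load_dataset

    data = load_dataset(dataset_id)
    return data.map(lambda batch: tokenize_batch(batch, tokenizer), batched=True)
```

To train on your own data instead, only `DATASET_ID` and the column name in `tokenize_batch` would need to change.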
Finally, the fine-tuning with Hugging Face Transformers is very standard.
Don't forget optim="paged_adamw_8bit". It activates paging for better memory management. Without it, we get out-of-memory errors.
The VRAM consumption should peak at 15 GB.
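A training setup along these lines could look like the sketch below. Only `optim="paged_adamw_8bit"` comes from the text; the batch size, step counts, learning rate, and output directory are illustrative assumptions, and `build_trainer` is a hypothetical helper.

```python
# Only "optim" is taken from the text; every other value is an assumed,
# illustrative default for a short demonstration run.
TRAINING_SETTINGS = {
    "output_dir": "outputs",
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 8,
    "warmup_steps": 2,
    "max_steps": 20,
    "learning_rate": 2e-4,
    "logging_steps": 1,
    "optim": "paged_adamw_8bit",  # paged optimizer for better memory management
}

def build_trainer(model, tokenizer, train_dataset, settings=TRAINING_SETTINGS):
    """Create a standard causal language modeling Trainer."""
    from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments

    # GPT-style tokenizers often lack a padding token; reuse end-of-sequence.
    tokenizer.pad_token = tokenizer.eos_token
    return Trainer(
        model=model,
        train_dataset=train_dataset,
        args=TrainingArguments(**settings),
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
```

Training then starts with `build_trainer(model, tokenizer, data["train"]).train()`, where `data` is the tokenized dataset.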
The QLoRA model we fine-tuned can be used directly with standard Hugging Face Transformers inference, as follows:
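A minimal sketch of that inference step is shown here. The helper name `complete`, the default of 20 new tokens, and the prompt in the usage note are illustrative assumptions.

```python
def complete(model, tokenizer, prompt, max_new_tokens=20):
    """Generate a continuation of `prompt` with the fine-tuned model."""
    # Move the tokenized prompt to the device the model lives on.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

For example, `print(complete(model, tokenizer, "Ask not what your country"))` asks the model to finish the beginning of a famous quote.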
You should get this quote as output:
Thanks to new techniques such as LoRA, QLoRA, PEFT, and DeepSpeed, we can fine-tune models with billions of parameters without relying on cloud computing and, according to the papers behind these techniques, without a significant drop in performance.