Training with QLoRA

Fine-tuning models on consumer hardware

Thanks to Benjamin Marie's article, QLoRa: Fine-Tune a Large Language Model on Your GPU; all of the content here is from it.

Overview

I am interested in almost everything about running AI on consumer hardware, so let's take a look at this article.

It introduces the QLoRa technique.

Here are some numbers from the article. If we want to fine-tune a 65-billion-parameter model, we need more than 780 Gb of GPU memory. This is equivalent to ten A100 80 Gb GPUs.
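Where does that number come from? A rough back-of-the-envelope check (the 12-bytes-per-parameter accounting is my own assumption, not from the article): full fine-tuning with Adam keeps roughly fp32 weights, momentum, and variance for every parameter.

params = 65e9               # 65 billion parameters
bytes_per_param = 12        # fp32 weights + momentum + variance (assumed accounting)
print(params * bytes_per_param / 1e9)   # ~780 Gb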

What is QLoRa?

QLoRa goes three steps further by introducing:

4-bit NormalFloat quantization

This is a method that improves upon quantile quantization. It ensures an equal number of values in each quantization bin. This avoids computational issues and errors for outlier values.
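To make the idea concrete, here is a minimal sketch of quantile-based 4-bit quantization in PyTorch. It is only an illustration, not the bitsandbytes NF4 kernel: the real implementation works on blocks of 64 values and uses precomputed levels.

import torch

def nf4_like_quantize(weights):
    # 16 levels at evenly spaced quantiles of N(0, 1), rescaled to [-1, 1],
    # so each bin holds roughly the same number of (normally distributed) weights
    probs = torch.linspace(0.5 / 16, 1 - 0.5 / 16, 16)
    levels = torch.distributions.Normal(0.0, 1.0).icdf(probs)
    levels = levels / levels.abs().max()
    scale = weights.abs().max()  # per-tensor absmax; the real method scales per block
    idx = ((weights / scale).unsqueeze(-1) - levels).abs().argmin(dim=-1)
    return idx.to(torch.uint8), scale, levels

def dequantize(idx, scale, levels):
    return levels[idx.long()] * scale

w = torch.randn(8, 8)
idx, scale, levels = nf4_like_quantize(w)
print((w - dequantize(idx, scale, levels)).abs().mean())  # small reconstruction error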

Double quantization

The authors of QLoRa define it as follows: "the process of quantizing the quantization constants for additional memory savings."
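A quick back-of-the-envelope example of the savings, using the block sizes reported in the QLoRa paper (64 weights per quantization constant, 256 constants per second-level constant); exact figures depend on the implementation:

first_block = 64       # weights sharing one quantization constant
second_block = 256     # constants sharing one second-level constant

no_dq = 32 / first_block                                        # fp32 constant per block
with_dq = 8 / first_block + 32 / (first_block * second_block)   # 8-bit constants + fp32 second level

print(f"without double quantization: {no_dq:.3f} bits/param")   # 0.500
print(f"with double quantization:    {with_dq:.3f} bits/param") # ~0.127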

Paging with unified memory

It relies on the NVIDIA Unified Memory feature and automatically handles page-to-page transfers between the CPU and GPU. It ensures error-free GPU processing, especially in situations where the GPU may run out of memory.

Fine-tuning a GPT model with QLoRa

Hardware requirements for QLoRa

GPU

We mentioned it above.

RAM

At least 6 Gb

Hard drive

GPT-J and GPT-NeoX-20b are both very big models. I recommend at least 80 Gb of free space.

Software requirements for QLoRa

We need CUDA and other dependencies:

bitsandbytes

A library that contains all we need to quantize an LLM

Hugging Face Transformers and Accelerate

These are standard libraries used to efficiently train models from the Hugging Face Hub.

PEFT

A library that provides the implementations for various methods to only fine-tune a small number of (extra) model parameters. It supports LoRa.

Datasets

This one is not a requirement. We will only use it to get a dataset for fine-tuning. Of course, you can provide your own dataset instead.

Get all of them

pip install -q -U bitsandbytes
pip install -q -U git+https://github.com/huggingface/transformers.git 
pip install -q -U git+https://github.com/huggingface/peft.git
pip install -q -U git+https://github.com/huggingface/accelerate.git
pip install -q datasets

The Python script

Loading and Quantization of a GPT model

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

Get the model and the associated tokenizer

model_name = "EleutherAI/gpt-neox-20b"

#Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

Then, we need to detail the configuration of the quantizer, as follows:

quant_config=BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
  • load_in_4bit: The model will be loaded in memory with 4-bit precision.

  • bnb_4bit_use_double_quant: We will do the double quantization proposed by QLoRa.

  • bnb_4bit_quant_type: This is the type of quantization. "nf4" stands for 4-bit NormalFloat.

  • bnb_4bit_compute_dtype: While we store the model in 4-bit, we will partially dequantize it when needed and do all the computations with 16-bit precision (bfloat16).

Loading the model in 4-bit

model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=quant_config, device_map={"": 0})

Then, we enable gradient checkpointing:

model.gradient_checkpointing_enable()
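If you want to check how much memory the 4-bit model actually occupies, Transformers exposes a footprint helper; the exact figure will vary with the model and setup:

# Memory occupied by the quantized model, in Gb (setup-dependent)
print(f"{model.get_memory_footprint() / 1e9:.1f} Gb")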

Preprocessing the GPT model for LoRa

With PEFT, we prepare the model for LoRa, adding trainable adapters to each layer.

from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model

model=prepare_model_for_kbit_training(model)

config=LoraConfig(r=8,
                lora_alpha=32,
                target_modules=["query_key_value"],
                lora_dropout=0.05,
                bias="none",
                task_type="CAUSAL_LM"
)

model=get_peft_model(model, config)

With LoRa, we add 8 million parameters. We will only train these parameters and freeze everything else. Fine-tuning should be fast.
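PEFT can confirm this for you; a quick check (exact counts depend on the model and the LoraConfig above):

# Reports trainable vs. total parameters for the adapted model,
# e.g. roughly 8 million trainable out of about 20 billion total
model.print_trainable_parameters()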

Dataset

Here the author uses the "english_quotes" dataset. This is a dataset made of famous quotes distributed under a CC BY 4.0 license.

from datasets import load_dataset
data = load_dataset("Abirate/english_quotes")
data = data.map(lambda samples: tokenizer(samples["quote"]), batched=True)
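A quick sanity check on the tokenized data; the "quote" field comes from the dataset schema, and "input_ids" is added by the tokenizer:

print(data["train"][0]["quote"])            # the raw quote text
print(data["train"][0]["input_ids"][:10])   # first token ids produced by the tokenizer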

Fine-tuning GPT-NeoX-20B with QLoRa

Finally, the fine-tuning with Hugging Face Transformers is very standard.

import transformers

tokenizer.pad_token = tokenizer.eos_token

trainer = transformers.Trainer(
    model=model,
    train_dataset=data["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        warmup_steps=2,
        max_steps=20,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit"
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

Don't forget optim="paged_adamw_8bit". It activates the paging for better memory management. Without it, we get out-of-memory errors.

The VRAM consumption should peak at 15 Gb.

GPT Inference with QLoRa

The QLoRa model we fine-tuned can be used directly with the standard Hugging Face Transformers inference, as follows:

text = "Ask not what your country"
device="cuda:0"
inputs=tokenizer(text, return_tensors="pt").to(device)

outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0],skip_special_tokens=True))

You should get this quote as output:

Ask not what your country can do for you, ask what you can do for your country.”

– John F.
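If you want to keep the result, here is a minimal sketch for saving and reloading only the LoRa adapter; the directory name is made up for the example, and the base model is reloaded in 4-bit exactly as above:

# Save only the LoRA adapter weights (small), not the full 20B model
model.save_pretrained("gpt-neox-20b-english-quotes-adapter")

# Later: reload the 4-bit base model and attach the adapter
from peft import PeftModel
base = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=quant_config, device_map={"": 0})
model = PeftModel.from_pretrained(base, "gpt-neox-20b-english-quotes-adapter")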

Conclusion

Thanks to new techniques like LoRa, QLoRa, PEFT, and DeepSpeed, we can fine-tune models with billions of parameters without relying on cloud computing and, according to the papers behind these techniques, without a significant drop in performance.

Credit

And now, with QLoRa, we can do this with only one A100. The computing resource in that article is an Nvidia RTX 3060 12 Gb (I do not think that Google Colab has a free instance). And the model we use in the example is a GPT model with 20 billion parameters.

QLoRa stands for Quantized LLMs with Low-Rank Adapters. The article mentions that "LoRa adds a tiny amount of trainable parameters, i.e., adapters, for each layer of the LLM and freezes all the original parameters. For fine-tuning, we only have to update the adapter weights which significantly reduces the memory footprint".

Here we will fine-tune GPT NeoX, the model pre-trained by EleutherAI. This is a model with 20 billion parameters. Note: GPT NeoX has a permissive license (Apache 2.0) that allows commercial use.

In LoraConfig, you can play with r, alpha, and dropout to obtain better results on your task. More details and options can be found in the PEFT repo.

Source from the article: QLoRa: Fine-Tune a Large Language Model on Your GPU (Medium).