πŸ‰ggml

How to use quantization to democratize access to LLMs?

Overview

GGML is a C library for machine learning (ML) - the "GG" refers to the initials of its author, Georgi Gerganov. It defines:

  • Low-level ML primitives like the tensor type

  • The binary format for distributing LLMs

It makes use of a technique called quantization that allows LLMs to run on consumer hardware.

The Binary Format

GGML files consist of binary-encoded data that is laid out according to a specified format. The format specifies what kind of data is present in the file, how it is represented, and the order in which it appears.

The first piece of information present in a GGML file is the GGML version number. It is followed by the three components that define an LLM:

  • The model's Hyperparameters

  • Vocabulary

  • Weights

GGML Versions

To support backward compatibility, GGML uses versioning to introduce improvements that may change the format of the encoding.

Hyperparameters

The term "hyperparameter" describes a value that is used to configure the behavior of a large language model. This is in contrast to the model's parameters, such as the weights, which were derived during the training process that created the model.

Each model defines its own hyperparameter structure that defines the hyperparameter values accepted by that model. Valid GGML files must list these values in the correct order, and each value must be represented using the correct data type.

Although hyperparameters differ across models, most models have:

  • `n_vocab` the size of the model's vocabulary

  • `n_embd` the size of the model's embedding layer, which is used during prompt ingestion

  • `n_layer` the number of layers in the model, each layer represents a set of weights.

Vocabulary

As the name implies, a model's vocabulary comprises components that are used by the model to generate language (text).

The vocabulary of an LLM consists of "tokens". A token can be an entire word, but oftentimes it is a word fragment. An LLM uses tokens to express a large number of words from a relatively small number of components.

For example, consider a vocabulary with the tokens `whi`, `ch`, `le`, `who`, and `a`. These can be combined to form English words such as "which" (`whi` + `ch`) and "while" (`whi` + `le`). With a vocabulary like this, the model creator can tune the model's behavior and performance.

The model's hyperparameters typically contain a value that specifies the number of tokens in the vocabulary. The vocabulary itself is encoded as a list of tokens, each of which is prefixed by a 32-bit integer specifying the length of the token. Depending on the GGML version, the token may also include a 32-bit floating-point score, which represents the frequency of that token in the model's training data.
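
A minimal sketch of reading one such vocabulary entry in C (the `has_scores` flag stands in for the version-dependent presence of the score field; names are illustrative):

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Reads one vocabulary entry: a 32-bit length, the token bytes, and
 * (in versions that include it) a 32-bit float score. */
char *read_token(FILE *f, int has_scores, float *score) {
    uint32_t len = 0;
    if (fread(&len, sizeof(uint32_t), 1, f) != 1) return NULL;

    char *token = malloc(len + 1);
    if (!token) return NULL;
    if (fread(token, 1, len, f) != len) { free(token); return NULL; }
    token[len] = '\0';

    if (has_scores && fread(score, sizeof(float), 1, f) != 1) {
        free(token);
        return NULL;
    }
    return token;
}
```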

Weights

LLMs can be improved by increasing the number of weights in the model. The total number of weights in a model is referred to as the "size" of that model. For example, the StableLM implementation of the GPT-NeoX language model architecture is available in a number of sizes, like 3B and 7B, which stand for 3 billion and 7 billion, respectively.

These numbers refer to the total number of weights in the model. Weights are grouped together in sets called "layers", and within a layer, weights are grouped together in structures called "tensors".

So, for instance, both StableLM 3B and StableLM 7B use layers that comprise the same tensors, but StableLM 3B has relatively fewer layers than StableLM 7B.

Tensor

In GGML, a tensor consists of a number of components, including:

  • Name

  • The 4-element list that represents the number of dimensions in the tensor and their lengths

  • A list of the weights in that tensor

Representing a tensor in GGML

`{"tensor_a0", [2, 2, 1, 1], [1.0, 0.0, 0.1, 1.0]}` Note that the 4-element list of dimensions uses 1 as a placeholder for unused dimensions - this is because the product of the dimensions should not equal zero.

The weights in a GGML file are encoded as a list of layers, the length of which is typically specified in the model's hyperparameters; each layer is encoded as an ordered set of tensors.

Features

Integer Quantization

Integer quantization support means that the library can convert data from floating-point format (such as FP32) to integer format (such as INT8) and perform computations using integers instead of floating-point numbersΒΉ. This can reduce the memory and computing requirements of the library, as well as improve the performance on some hardware platformsΒ². However, integer quantization also introduces some errors and loss of precision, so the library has to use some techniques to recover the accuracy and maintain the quality of the resultsΒ³.
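
As a simplified illustration of the idea, the following converts FP32 values to INT8 using a single absolute-maximum scale and then dequantizes them again; GGML's actual block-wise quantization formats are more elaborate:

```c
#include <math.h>
#include <stdint.h>
#include <stdio.h>

/* Quantize FP32 values to INT8 using one absmax-derived scale.
 * Dequantizing multiplies the integers back by the scale, which is
 * where the (small) rounding error comes from. */
void quantize_int8(const float *x, int8_t *q, float *scale, int n) {
    float amax = 0.0f;
    for (int i = 0; i < n; i++) {
        float ax = fabsf(x[i]);
        if (ax > amax) amax = ax;
    }
    *scale = amax / 127.0f;
    for (int i = 0; i < n; i++) {
        q[i] = (int8_t) roundf(x[i] / (*scale > 0.0f ? *scale : 1.0f));
    }
}

int main(void) {
    float x[4] = {0.1f, -0.8f, 0.5f, 1.2f};
    int8_t q[4];
    float scale;
    quantize_int8(x, q, &scale, 4);
    for (int i = 0; i < 4; i++) {
        printf("%f -> %d -> %f\n", x[i], q[i], q[i] * scale);
    }
    return 0;
}
```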

Automatic Differentiation

Automatic differentiation means that the library can evaluate the partial derivative of a function specified by a computer programΒΉ. This is useful for machine learning because many algorithms require computing gradients of functions for optimization or learning purposes. Automatic differentiation uses the chain rule of calculus to apply symbolic rules for differentiation to elementary arithmetic operations and functionsΒ². This way, it can compute derivatives accurately and efficiently without relying on finite difference approximations or symbolic manipulationΒΉ.
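
GGML realizes this over its compute graph; as a minimal, library-independent illustration of the chain-rule mechanics, here is a forward-mode example using dual numbers (this is not GGML's API):

```c
#include <math.h>
#include <stdio.h>

/* A dual number carries a value and its derivative; arithmetic on duals
 * applies the chain rule automatically. */
typedef struct { double val, dot; } dual;

static dual d_mul(dual a, dual b) {
    return (dual){a.val * b.val, a.val * b.dot + a.dot * b.val};
}

static dual d_sin(dual a) {
    return (dual){sin(a.val), cos(a.val) * a.dot};
}

int main(void) {
    /* f(x) = x * sin(x); seed dx/dx = 1 to get df/dx at x = 2. */
    dual x = {2.0, 1.0};
    dual f = d_mul(x, d_sin(x));
    printf("f(2) = %f, f'(2) = %f\n", f.val, f.dot);
    /* Analytically: f'(x) = sin(x) + x*cos(x). */
    return 0;
}
```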

Optimizers for Loss Function

ADAM and L-BFGS are two types of optimizers that are used to find the optimal values of the parameters of a function by minimizing a loss functionΒΉ. They are both based on the gradient descent method, which updates the parameters in the opposite direction of the gradient of the loss functionΒ².

ADAM stands for adaptive moment estimation and is a popular optimizer for deep learning modelsΒ². It uses adaptive learning rates for each parameter based on the first and second moments (mean and variance) of the gradientsΒ³. It also incorporates momentum, which helps to accelerate the convergence and avoid local minimaΒ³.

L-BFGS stands for limited-memory Broyden–Fletcher–Goldfarb–Shanno and is a quasi-Newton optimizer that approximates the inverse Hessian matrix using a limited amount of memory⁴. It uses a line search algorithm to find the optimal step size along the search direction⁴. It is often faster and more accurate than gradient descent but requires more computation and memory⁴.
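
For reference, here is a minimal sketch of a single ADAM update step in its standard textbook form (this is not GGML's optimizer interface, just the update rule the ADAM paragraph above describes):

```c
#include <math.h>

/* One ADAM step: m and v are running estimates of the first and second
 * moments of the gradient g; t is the 1-based iteration counter. */
void adam_step(float *w, const float *g, float *m, float *v,
               int n, int t, float lr) {
    const float beta1 = 0.9f, beta2 = 0.999f, eps = 1e-8f;
    for (int i = 0; i < n; i++) {
        m[i] = beta1 * m[i] + (1.0f - beta1) * g[i];
        v[i] = beta2 * v[i] + (1.0f - beta2) * g[i] * g[i];
        /* Bias-corrected moment estimates. */
        float m_hat = m[i] / (1.0f - powf(beta1, (float) t));
        float v_hat = v[i] / (1.0f - powf(beta2, (float) t));
        w[i] -= lr * m_hat / (sqrtf(v_hat) + eps);
    }
}
```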

Intrinsic AVX/AVX2 on x86

x86 is a family of processor architectures with 32-bit and 64-bit instruction sets. AVX and AVX2 are extensions of the x86 instruction set that enable advanced vector operations for parallel processing of dataΒ². Intrinsics are functions that provide access to these extensions without writing assembly codeΒ². Therefore, x86 processors that support AVX/AVX2 intrinsics can perform faster and more efficient computations on large amounts of data, such as the tensors used in machine learning.
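
For example, a simple dot product written with AVX2/FMA intrinsics processes eight FP32 values per instruction (compile with `-mavx2 -mfma`; this is a generic illustration rather than one of GGML's hand-tuned kernels):

```c
#include <immintrin.h>
#include <stdio.h>

/* Dot product of two float arrays, 8 lanes at a time.
 * n is assumed to be a multiple of 8 to keep the example short. */
float dot_avx2(const float *a, const float *b, int n) {
    __m256 acc = _mm256_setzero_ps();
    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        acc = _mm256_fmadd_ps(va, vb, acc); /* acc += va * vb */
    }
    /* Horizontal sum of the 8 accumulator lanes. */
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    float sum = 0.0f;
    for (int i = 0; i < 8; i++) sum += lanes[i];
    return sum;
}

int main(void) {
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {1, 1, 1, 1, 1, 1, 1, 1};
    printf("%f\n", dot_avx2(a, b, 8)); /* prints 36.000000 */
    return 0;
}
```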

Reference
