# ggml

How to use quantization to democratize access to LLMs?

## Overview

GGML is a C library for machine learning; the "GG" refers to the initials of its author, Georgi Gerganov. It defines:

- Low-level ML primitives, such as the tensor type
- The binary format for distributing LLMs

It makes use of a technique called **quantization**, which **allows LLMs to run on consumer hardware**.

### The Binary Format

GGML files consist of **binary-encoded data that is laid out according to a specified format**. The format **specifies what kind of data is present in the file**, **how it is represented**, and **the order** in which it appears.

The **first piece** of information in a GGML file is a **GGML version number**. It is followed by the **three components** that define an LLM:

- The model's hyperparameters
- The vocabulary
- The weights

### GGML Versions

#### Support backward-compatibility

GGML uses **versioning** to introduce improvements that may change the format of the encoding.

### Hyperparameters

The term "hyperparameter" describes a value that is used to configure the behavior of a large language model. This is in contrast to the model's parameters, such as the weights, which were derived in the training process that was used to create the model.

**Each model defines its own hyperparameter structure** that **defines the hyperparameter values** accepted by that model. Valid GGML files must **list these values** in the **correct order**, and each of them must be represented using the **correct data type**.

Although hyperparameters differ across models, most models have:

- `n_vocab`: the size of the model's **vocabulary**
- `n_embd`: the size of the model's **embedding** layer, which is used during prompt ingestion
- `n_layer`: the number of layers in the model, where each layer represents a set of **weights**
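As a rough illustration of what "correct order" and "correct data type" mean in practice, the sketch below writes and reads a hyperparameter block with Python's `struct` module. The layout (three little-endian 32-bit integers in the order `n_vocab`, `n_embd`, `n_layer`), the `Hyperparameters` class, and the values are assumptions made for this example; real models define their own structures.

```python
import struct
from dataclasses import dataclass

# Hypothetical layout: three little-endian signed 32-bit integers.
HPARAMS_FORMAT = "<iii"


@dataclass
class Hyperparameters:
    n_vocab: int
    n_embd: int
    n_layer: int


def write_hparams(hp: Hyperparameters) -> bytes:
    """Encode the values in a fixed order, each with a fixed data type."""
    return struct.pack(HPARAMS_FORMAT, hp.n_vocab, hp.n_embd, hp.n_layer)


def read_hparams(data: bytes) -> Hyperparameters:
    """Decode the values, assuming the same order and types as the writer."""
    size = struct.calcsize(HPARAMS_FORMAT)
    if len(data) < size:
        raise ValueError("not enough bytes for the hyperparameter block")
    return Hyperparameters(*struct.unpack(HPARAMS_FORMAT, data[:size]))


# Illustrative values only; they do not describe a specific model.
encoded = write_hparams(Hyperparameters(n_vocab=50432, n_embd=4096, n_layer=16))
decoded = read_hparams(encoded)
print(decoded)
```

If the reader assumed a different order or width for any field, every later value would be misread, which is why the format fixes both.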

### Vocabulary

**As the name implies**, a model's vocabulary comprises components that are used by the model to generate language (text).

The **vocabulary of an LLM** **consists** of "**tokens**". A token **can be an entire word**, but oftentimes it is a **word** *fragment*.

**LLMs use tokens to express a large number of words from a relatively small number of components.**

For example, consider a vocabulary with the following tokens: `whi`, `ch`, `le`, `who`, and `a`. This vocabulary can be used to create English words such as "which" (`whi` + `ch`), "while" (`whi` + `le`), "who", and "a". By choosing which tokens the vocabulary contains, the model creator can tune the model's behavior and performance.
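The way a small vocabulary covers several words can be sketched with a simple backtracking search. The `segment` helper is hypothetical and written only for this example; real tokenizers use trained algorithms rather than this kind of search.

```python
def segment(word, vocab):
    """Return one way to split `word` into tokens from `vocab`, or None."""
    if word == "":
        return []
    # Try longer tokens first, and backtrack if the remainder cannot be split.
    for token in sorted(vocab, key=len, reverse=True):
        if word.startswith(token):
            rest = segment(word[len(token):], vocab)
            if rest is not None:
                return [token] + rest
    return None


vocab = ["whi", "ch", "le", "who", "a"]
for word in ["which", "while", "who", "leach", "what"]:
    print(word, "->", segment(word, vocab))
```

Words that cannot be built from the available tokens, such as "what" here, yield `None`.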

The model's hyperparameters typically contain a value that specifies the number of tokens in the vocabulary. The vocabulary is encoded as a list of tokens, each of which includes a 32-bit integer that specifies the length of the token. Depending on the GGML version, a token may also include a 32-bit floating-point score, which represents the frequency of that token in the model's training data.
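A minimal sketch of this encoding follows. It assumes little-endian byte order and the field order length, token bytes, then optional score; the function names and the `with_scores` flag are hypothetical stand-ins for the version-dependent behavior described above.

```python
import io
import struct


def write_vocab(tokens, with_scores):
    """Encode (token_bytes, score) pairs as length-prefixed records."""
    out = io.BytesIO()
    for raw, score in tokens:
        out.write(struct.pack("<I", len(raw)))  # 32-bit token length
        out.write(raw)                          # the token bytes themselves
        if with_scores:
            out.write(struct.pack("<f", score))  # 32-bit float score
    return out.getvalue()


def read_vocab(data, n_vocab, with_scores):
    """Decode n_vocab records; n_vocab would come from the hyperparameters."""
    stream = io.BytesIO(data)
    tokens = []
    for _ in range(n_vocab):
        (length,) = struct.unpack("<I", stream.read(4))
        raw = stream.read(length)
        score = struct.unpack("<f", stream.read(4))[0] if with_scores else None
        tokens.append((raw, score))
    return tokens


# Scores are made-up values chosen to be exactly representable as float32.
vocab = [(b"whi", -1.5), (b"ch", -2.25), (b"a", 0.0)]
blob = write_vocab(vocab, with_scores=True)
print(read_vocab(blob, n_vocab=len(vocab), with_scores=True))
```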

### Weights

LLMs can be improved by increasing the number of *weights* in the model. **The total number of weights in a model is referred to as the "size" of that model.** For example, the StableLM implementation of the GPT-NeoX language model architecture is available in a number of sizes, like 3B and 7B, which stand for 3 billion and 7 billion, respectively.

These numbers refer to the total number of weights in the model. **Weights are grouped together in sets called "layers"**, and **within a layer, weights are grouped together in structures** called "**tensors**".

So, for instance, both StableLM 3B and StableLM 7B use layers that comprise the same tensors, but StableLM 3B has relatively *fewer* **layers** when compared to StableLM 7B.

#### Tensor

In GGML, a tensor consists of a number of components, including:

- A name
- A 4-element list that represents the number of dimensions in the tensor and their lengths
- A list of the weights in that tensor

#### Representing a tensor in GGML

A simplified GGML representation of a 2 × 2 tensor named `tensor_a0` is `{"tensor_a0", [2, 2, 1, 1], [1.0, 0.0, 0.1, 1.0]}`. Note that the 4-element list of dimensions uses `1` as a placeholder for unused dimensions, because the product of the dimensions should not equal zero.
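The placeholder rule can be checked with a small sketch. The `Tensor` class and `make_tensor` helper are hypothetical and only mirror the simplified representation above; the sketch treats the product of the dimensions as the number of weights, which is why an unused dimension is stored as `1` rather than `0`.

```python
from dataclasses import dataclass
from math import prod


@dataclass
class Tensor:
    name: str
    dims: list     # always 4 entries; unused dimensions are stored as 1
    weights: list  # flat list of weights

    def __post_init__(self):
        if len(self.dims) != 4:
            raise ValueError("dims must have exactly 4 entries")
        if prod(self.dims) != len(self.weights):
            raise ValueError("product of dims must equal the number of weights")


def make_tensor(name, shape, weights):
    """Pad the shape with 1s so that it always has 4 entries."""
    dims = list(shape) + [1] * (4 - len(shape))
    return Tensor(name, dims, list(weights))


tensor_a0 = make_tensor("tensor_a0", [2, 2], [1.0, 0.0, 0.1, 1.0])
print(tensor_a0.dims, prod(tensor_a0.dims))
```

Padding with `0` instead would make the product zero and no longer match the four stored weights.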

The **weights in a GGML file are encoded as a list of layers**, the **length** of which is typically specified **in the model's hyperparameters**; each layer is encoded as an ordered set of tensors.

## Features

### Integer Quantization

Integer quantization support means that the library can convert data from a **floating-point format** (such as FP32) to an **integer format** (such as INT8) and perform computations using integers instead of floating-point numbers¹. **This can reduce the memory** and compute requirements of a model, as well as **improve performance on some hardware platforms**². **However**, **integer** quantization also **introduces rounding errors and a loss of precision**, so the library has to use techniques that limit the loss of accuracy and maintain the quality of the results³.
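The trade-off can be seen in a minimal sketch of symmetric INT8 quantization with a single scale factor. This is a generic textbook scheme chosen for illustration, not a description of GGML's own quantization formats, and the function names are hypothetical.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Recover approximate floats from the integers and the scale."""
    return [q * scale for q in quantized]


weights = [0.12, -0.5, 0.33, 1.0, -0.98]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each weight now fits in one byte instead of four, and the rounding error per weight is bounded by about half the scale factor.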

### Automatic Differentiation

Automatic differentiation means that the library can **evaluate the partial derivatives of a function** **specified by a computer program**¹. This is useful for machine learning because **many algorithms require computing gradients of functions** for optimization or learning purposes. Automatic differentiation applies the **chain rule** of **calculus** to the **elementary arithmetic operations** and functions that the program executes, combining their known derivatives step by step². This way, it can compute derivatives **accurately** and **efficiently** without relying on finite difference approximations or on symbolic manipulation of the whole expression¹.
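One compact way to see the chain rule applied operation by operation is forward-mode differentiation with dual numbers, sketched below. The `Dual` class is hypothetical and supports only addition and multiplication; it illustrates the general idea and is not meant to describe how GGML implements differentiation.

```python
class Dual:
    """A value paired with its derivative with respect to the input."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    @staticmethod
    def _lift(other):
        # Plain numbers are constants, so their derivative is zero.
        return other if isinstance(other, Dual) else Dual(other)

    def __add__(self, other):
        other = Dual._lift(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = Dual._lift(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__


def derivative(f, x):
    """Evaluate f at x while carrying the derivative along each operation."""
    return f(Dual(x, 1.0)).deriv


def f(x):
    return x * x + 3 * x + 1


slope = derivative(f, 2.0)
print(slope)
```

Since f'(x) = 2x + 3, the printed slope at x = 2 is 7.0, obtained without finite differences.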

### Optimizers for Loss Function

**ADAM** and **L-BFGS** are two **optimizers** that are used to find the *optimal values of the parameters of a function by minimizing a loss function*¹. Both are gradient-based: like **gradient descent**, which updates the parameters in the opposite direction of the gradient of the loss function², they use the gradient to decide how to change the parameters.

ADAM stands for **adaptive moment estimation** and is a popular optimizer for deep learning models². It uses **adaptive learning rates** for each parameter, based on running estimates of the first and second moments (*the mean and the uncentered variance*) of the gradients³. The first-moment estimate acts like **momentum**, which helps to **accelerate** convergence and can help the optimizer move past shallow local minima³.
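The ADAM update can be sketched for a single parameter as follows. This follows the commonly published form of the algorithm with its usual default coefficients; the function name, the learning rate, and the toy loss are assumptions for this example, not GGML's implementation.

```python
import math


def adam_minimize(grad, x0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Minimize a one-parameter function given its gradient."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g      # first moment: running mean
        v = beta2 * v + (1 - beta2) * g * g  # second moment: running mean of g^2
        m_hat = m / (1 - beta1 ** t)         # bias correction for early steps
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)  # per-parameter scaled step
    return x


def loss(x):
    return (x - 3.0) ** 2


def loss_grad(x):
    return 2.0 * (x - 3.0)


x_min = adam_minimize(loss_grad, x0=0.0)
print(x_min, loss(x_min))
```

Dividing by the square root of the second-moment estimate is what gives each parameter its own effective learning rate.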

L-BFGS stands for **limited-memory Broyden–Fletcher–Goldfarb–Shanno** and is a quasi-Newton optimizer that approximates the inverse Hessian matrix using a limited amount of memory⁴. It uses a line search algorithm to choose the step size along the search direction⁴. It **often** converges in **fewer iterations** than plain **gradient descent**, but each iteration **requires more computation and memory**⁴.

### Intrinsic AVX/AVX2 on x86

x86 is a family of instruction set architectures that includes both 32-bit and 64-bit variants. **AVX and AVX2 (Advanced Vector Extensions) are extensions of the x86 instruction set** that add SIMD instructions operating on 256-bit registers, which enables **parallel processing of data**². Intrinsics are compiler-provided functions that give access to these instructions **without writing assembly code**². Code that uses **AVX / AVX2 intrinsics can therefore perform faster and more efficient computations on large amounts of data**, such as tensors for machine learning, on processors that support them.

## Reference
