
ggml

How to use quantization to democratize access to LLMs?


Overview

GGML is a C library for machine learning (ML) - the "GG" refers to the initials of its author, Georgi Gerganov. It defines:

  • Low-level ML primitives like the tensor type

  • The binary format for distributing LLMs

It makes use of a technique called quantization that allows large language models to run on consumer hardware.

The Binary Format

GGML files consist of binary-encoded data that is laid out according to a specified format. The format specifies what kind of data is present in the file, how it is represented, and the order in which it appears.

The first piece of information present in a GGML file is a GGML version number, followed by three components that define an LLM (a minimal sketch of this read order follows the list):

  • The model's Hyperparameters

  • Vocabulary

  • Weights
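
Below is a hypothetical sketch, in C, of the first read a GGML loader would perform, mirroring the order above. The function and parameter names are illustrative and are not taken from the ggml API.

```c
// Hypothetical sketch of the start of a GGML loader; names are illustrative.
#include <stdint.h>
#include <stdio.h>

int read_ggml_prelude(FILE *f, uint32_t *version) {
    // The file begins with a GGML format version number ...
    if (fread(version, sizeof(*version), 1, f) != 1) {
        return -1; // truncated or unreadable file
    }
    // ... followed, in order, by the model's hyperparameters, the vocabulary,
    // and the weights; each of these is decoded by a later stage of the loader.
    return 0;
}
```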

GGML Versions

To support backward compatibility, GGML uses versioning to introduce improvements that may change the format of the encoding.

Hyperparameters

GGML uses the term "hyperparameter" to describe a value that is used to configure the behavior of a large language model. This is in contrast to the model's parameters, such as the weights, which were derived during the training process that created the model.

Each model defines its own hyperparameter structure that defines the hyperparameter values accepted by that model. Valid GGML files must list these values in the correct order, and each value must be represented using the correct data type.

Although hyperparameters differ across models, most models have the following (a sketch of a typical hyperparameter struct follows the list):

  • `n_vocab` the size of the model's vocabulary

  • `n_embd` the size of the model's embedding layer, which is used during prompt ingestion

  • `n_layer` the number of layers in the model, each layer represents a set of weights.
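
A hyperparameter structure for a hypothetical model might therefore look like the sketch below. Only `n_vocab`, `n_embd` and `n_layer` come from the list above; the remaining fields and exact types are assumptions that vary from model to model.

```c
#include <stdint.h>

// Hypothetical hyperparameter struct for one model; not taken from ggml.
struct example_hparams {
    int32_t n_vocab; // size of the model's vocabulary
    int32_t n_embd;  // size of the embedding layer
    int32_t n_layer; // number of layers in the model
    int32_t n_head;  // number of attention heads (assumed, model-specific)
    int32_t ftype;   // how the weights are encoded, e.g. quantization type (assumed)
};
```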

Vocabulary

As the name implies, a model's vocabulary comprises components that are used by the model to generate language (text).

The vocabulary of an LLM consists of "tokens". A token can be an entire word, but oftentimes tokens are word fragments. LLMs use tokens to express a large number of words from a relatively small number of components.

For example, consider a vocabulary with the following tokens: whi, ch, le, who, and a; this vocabulary can be used to create English words. With a vocabulary like this, the model creator can tune the model's behavior and performance.

The model's hyperparameters typically contain a value that specifies the number of tokens in the vocabulary. The vocabulary is encoded as a list of tokens, each of which includes a 32-bit integer that specifies the length of the token. Depending on the GGML version, the token may also include a 32-bit floating-point score, which represents the frequency of that token in the model's training data.
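
A rough sketch of decoding such a vocabulary is shown below, assuming the layout just described (a 32-bit length, the token bytes, and an optional 32-bit score). The function name and the has_scores flag are illustrative, not part of the ggml API, and error handling is omitted.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

// Hypothetical vocabulary reader; the layout follows the description above.
void read_vocab(FILE *f, uint32_t n_vocab, int has_scores) {
    for (uint32_t i = 0; i < n_vocab; i++) {
        uint32_t len = 0;
        fread(&len, sizeof(len), 1, f);          // 32-bit token length

        char *token = malloc(len + 1);
        fread(token, 1, len, f);                 // the token text itself
        token[len] = '\0';

        float score = 0.0f;
        if (has_scores) {
            fread(&score, sizeof(score), 1, f);  // optional per-token score
        }

        printf("%u: %s (score %.3f)\n", i, token, score);
        free(token);
    }
}
```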

Weight

LLMs can be improved by increasing the number of weights in the model. The total number of weights in a model is referred to as the "size" of that model. For example, the StableLM implementation of the GPT-NeoX language model architecture is available in a number of sizes, like 3B and 7B, which stand for 3 billion and 7 billion weights, respectively.

Weights are grouped together in sets called "layers", and within a layer, weights are grouped together in structures called "tensors". So, for instance, both StableLM 3B and StableLM 7B use layers that comprise the same tensors, but StableLM 3B has relatively fewer layers than StableLM 7B.

Tensor

In GGML, a tensor consists of a number of components, including:

  • Name

  • The 4-element list that represents the number of dimensions in the tensor and their lengths

  • A list of the weights in that tensor

Representing a tensor in GGML:

`{"tensor_a0", [2, 2, 1, 1], [1.0, 0.0, 0.1, 1.0]}` Note that the 4-element list of dimensions uses 1 as a placeholder for unused dimensions - this is because the product of the dimensions should not equal zero.

The weights in a GGML file are encoded as a list of layers, the length of which is typically specified in the model's hyperparameters; each layer is encoded as an ordered set of tensors.
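
To make the layout concrete, here is a hypothetical C mirror of the `tensor_a0` example above. It only illustrates the {name, dimensions, weights} triple described in this section; it is not ggml's actual tensor struct.

```c
#include <stdint.h>

#define EXAMPLE_MAX_DIMS 4

// Hypothetical in-memory mirror of an encoded tensor.
struct example_tensor {
    const char *name;               // e.g. "tensor_a0"
    int64_t ne[EXAMPLE_MAX_DIMS];   // dimension lengths, 1 for unused dimensions
    const float *data;              // the flattened weights of the tensor
};

// The 2x2 tensor from the example above.
static const float tensor_a0_data[] = {1.0f, 0.0f, 0.1f, 1.0f};
static const struct example_tensor tensor_a0 = {
    "tensor_a0", {2, 2, 1, 1}, tensor_a0_data,
};
```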

Features

Integer Quantization

Integer quantization support means that the library can convert data from floating-point format (such as FP32) to integer format (such as INT8) and perform computations using integers instead of floating-point numbers¹. This can reduce the memory and computing requirements of the library, as well as improve the performance on some hardware platforms². However, integer quantization also introduces some errors and loss of precision, so the library has to use some techniques to recover the accuracy and maintain the quality of the results³.
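
The sketch below shows the basic idea with a simple symmetric 8-bit scheme: pick a scale from the largest absolute value, round each float to an int8, and keep the scale so the values can be approximately recovered. This only illustrates the concept; it is not the block-wise quantization formats that ggml actually ships.

```c
#include <math.h>
#include <stdint.h>

// Quantize n floats to int8 with a single shared scale (symmetric, per-tensor).
float quantize_q8(const float *x, int8_t *q, int n) {
    float amax = 0.0f;
    for (int i = 0; i < n; i++) {
        if (fabsf(x[i]) > amax) amax = fabsf(x[i]);   // largest magnitude
    }
    const float scale = amax > 0.0f ? amax / 127.0f : 1.0f;
    for (int i = 0; i < n; i++) {
        q[i] = (int8_t) roundf(x[i] / scale);         // map to [-127, 127]
    }
    return scale;                                     // keep for dequantization
}

// Dequantize: recover approximate floats (this is where precision is lost).
void dequantize_q8(const int8_t *q, float *x, int n, float scale) {
    for (int i = 0; i < n; i++) {
        x[i] = q[i] * scale;
    }
}
```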

Automatic Differentiation

Automatic differentiation means that the library can evaluate the partial derivative of a function specified by a computer program. This is useful for machine learning because many algorithms require computing gradients of functions for optimization or learning purposes. Automatic differentiation uses the chain rule of calculus to apply symbolic rules for differentiation to elementary arithmetic operations and functions. This way, it can compute derivatives accurately and efficiently without relying on finite difference approximations or symbolic manipulation.
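
As a toy illustration of the chain-rule idea, the sketch below implements forward-mode automatic differentiation with dual numbers in C. It only demonstrates the concept; it is not how ggml differentiates its computation graphs.

```c
#include <stdio.h>

// A dual number carries a value and its derivative with respect to the input.
typedef struct { double val; double dot; } dual;

static dual d_add(dual a, dual b) {
    return (dual){ a.val + b.val, a.dot + b.dot };                  // sum rule
}

static dual d_mul(dual a, dual b) {
    return (dual){ a.val * b.val, a.dot * b.val + a.val * b.dot };  // product rule
}

int main(void) {
    dual x = { 3.0, 1.0 };               // x = 3, dx/dx = 1
    dual f = d_add(d_mul(x, x), x);      // f(x) = x*x + x
    printf("f(3) = %g, f'(3) = %g\n", f.val, f.dot);  // prints 12 and 7
    return 0;
}
```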

Optimizers for Loss Function

ADAM and L-BFGS are two types of optimizers that are used to find the optimal values of the parameters of a function by minimizing a loss function. They are both based on the gradient descent method, which updates the parameters in the opposite direction of the gradient of the loss function.

ADAM stands for adaptive moment estimation and is a popular optimizer for deep learning models. It uses adaptive learning rates for each parameter based on the first and second moments (mean and variance) of the gradients. It also incorporates momentum, which helps to accelerate the convergence and avoid local minima.

L-BFGS stands for limited-memory Broyden–Fletcher–Goldfarb–Shanno and is a quasi-Newton optimizer that approximates the inverse Hessian matrix using a limited amount of memory. It uses a line search algorithm to find the optimal step size along the search direction. It is often faster and more accurate than gradient descent but requires more computation and memory.
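
For reference, a single ADAM update step looks roughly like the sketch below, following the standard formulation with bias-corrected first and second moments. ggml exposes ADAM and L-BFGS through its own optimizer API, which is not reproduced here.

```c
#include <math.h>

// One ADAM step over n parameters; m and v are the running moment estimates,
// t is the (1-based) step counter and lr is the learning rate.
void adam_step(float *w, const float *grad, float *m, float *v,
               int n, int t, float lr) {
    const float beta1 = 0.9f, beta2 = 0.999f, eps = 1e-8f;
    for (int i = 0; i < n; i++) {
        m[i] = beta1 * m[i] + (1.0f - beta1) * grad[i];             // first moment
        v[i] = beta2 * v[i] + (1.0f - beta2) * grad[i] * grad[i];   // second moment
        const float m_hat = m[i] / (1.0f - powf(beta1, (float) t)); // bias correction
        const float v_hat = v[i] / (1.0f - powf(beta2, (float) t));
        w[i] -= lr * m_hat / (sqrtf(v_hat) + eps);                  // parameter update
    }
}
```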

Intrinsic AVX/AVX2 on x86

The x86 architectures are a family of processors that use a 32-bit or 64-bit instruction set. AVX and AVX2 are extensions of the x86 instruction set that enable advanced vector extensions for parallel processing of data. Intrinsics are functions that provide access to these extensions without writing assembly code. Therefore, x86 architectures that utilize AVX / AVX2 intrinsics can perform faster and more efficient computations on large amounts of data, such as tensors for machine learning.
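
As an illustration of what such intrinsics look like, here is a minimal AVX2/FMA dot product in C (compile with -mavx2 -mfma). It shows the style of SIMD code the text refers to; it is not taken from ggml's source.

```c
#include <immintrin.h>

// Dot product of two float arrays using 8-wide AVX2 registers.
float dot_avx2(const float *a, const float *b, int n) {
    __m256 acc = _mm256_setzero_ps();
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);  // load 8 floats from a
        __m256 vb = _mm256_loadu_ps(b + i);  // load 8 floats from b
        acc = _mm256_fmadd_ps(va, vb, acc);  // acc += va * vb (fused multiply-add)
    }
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);            // spill lanes for a horizontal sum
    float sum = 0.0f;
    for (int k = 0; k < 8; k++) sum += lanes[k];
    for (; i < n; i++) sum += a[i] * b[i];   // scalar tail for leftover elements
    return sum;
}
```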

Reference

  • https://iq.opengenus.org/basics-of-quantization-in-ml/
  • https://www.mathworks.com/company/newsletters/articles/what-is-int8-quantization-and-why-is-it-popular-for-deep-neural-networks.html
  • https://arxiv.org/abs/2004.09602
  • rustformers/llm: https://github.com/rustformers/llm
  • ggerganov/ggml: https://github.com/ggerganov/ggml