llama.cpp

Port of Facebook's LLaMA model in C/C++


The main goal of llama.cpp is to run the LLaMA model using 4-bit integer quantization on a MacBook.

Overview

It runs the LLaMA model to generate predictions or outputs from input data, and it does so without any dependencies or external libraries. This allows for faster and more efficient inference on desktop CPUs.

Why does running without external libraries or dependencies give good performance on desktop CPUs?

There are several reasons:

  • Without external libraries and dependencies, the model avoids the overhead associated with loading and managing those resources.

  • It is designed to run natively on macOS.
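
As a concrete sketch of the 4-bit workflow, here are the typical commands for a llama.cpp checkout of the same vintage as this page (script and tool names such as convert.py and quantize vary between versions, and the model paths are illustrative):

# convert the original 7B weights to ggml FP16 format
python3 convert.py models/7B/

# quantize the FP16 model to 4 bits using the q4_0 method
./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin q4_0

# run inference on the quantized model
./main -m ./models/7B/ggml-model-q4_0.bin -n 128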

Features

  • Optimized via ARM NEON, Accelerate, and Metal frameworks

  • AVX, AVX2 and AVX512 support for x86 architectures

  • Mixed F16/F32 precision

  • 4-bit, 5-bit, and 8-bit integer quantization support

  • Supports OpenBLAS/Apple BLAS/ARM Performance Lib/ATLAS/BLIS/Intel MKL/NVHPC/ACML/SCSL/SGIMATH and more in BLAS

  • cuBLAS and CLBlast support

Build

llama.cpp can be built with any of the following tools (example invocations below):

  • make

  • cmake

  • zig
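
A minimal sketch of each build path, run from the root of a llama.cpp checkout (the cmake and zig invocations are assumptions based on the conventions of that era and may differ in newer versions):

# make: in-tree build
make

# cmake: out-of-tree build
mkdir build
cd build
cmake ..
cmake --build . --config Release

# zig: uses the repository's build.zig
zig build -Doptimize=ReleaseFast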

Metal Build

Metal allows the computation to be executed on the GPU for Apple devices.

  • make

    • LLAMA_METAL=1 make

  • cmake

    • cmake -DLLAMA_METAL=ON ..

Enable GPU inference with the --gpu-layers|-ngl command-line argument. Any value larger than 0 will offload the computation to the GPU.

./main -m ./models/7B/ggml-model-q4_0.bin -n 128 -ngl 1

BLAS Build

Building the program with BLAS support may lead to some performance improvements in prompt processing when using batch sizes higher than 32 (the default is 512). There are currently several different implementations of it:

  • Accelerate Framework

• Only available on macOS, and it is enabled by default.

  • OpenBLAS

• This provides BLAS acceleration using only the CPU (OpenBLAS needs to be installed first).

  • Intel MKL

  • cuBLAS

• This provides BLAS acceleration using the CUDA cores of your Nvidia GPU (the CUDA toolkit needs to be installed).

  • CLBlast

    • OpenCL acceleration is provided by the matrix multiplication kernels from the CLBlast project and custom kernels for ggml that can generate tokens on the GPU (the OpenCL SDK needs to be installed).
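
As a sketch, the matching build switches from the same era as the LLAMA_METAL flag shown above (flag names are version-dependent assumptions):

# OpenBLAS (CPU only)
make LLAMA_OPENBLAS=1

# cuBLAS (Nvidia GPUs, requires the CUDA toolkit)
make LLAMA_CUBLAS=1
cmake -DLLAMA_CUBLAS=ON ..

# CLBlast (OpenCL, requires the OpenCL SDK and CLBlast)
make LLAMA_CLBLAST=1
cmake -DLLAMA_CLBLAST=ON ..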
