An intro to llama.cpp

A practical guide to running LLMs locally. AI summary of SteelPh0enix's blog post.

Overview

Llama.cpp enables local deployment of Large Language Models through an efficient C/C++ implementation and model quantization. It supports various hardware configurations and optimization techniques, making it possible to run substantial models on consumer hardware (e.g., running a 7B-parameter model in 8GB of RAM).

Key benefits include privacy protection, offline operation, and flexible deployment options across different hardware setups. For example, you can run smaller quantized models on a laptop for development or deploy larger models on GPUs for production use.

Core Concepts

Model Basics

  • Quantization: Reduces model size and memory usage through precision reduction

  • Context Window: Determines how much text the model can "remember" (typically 2K-8K tokens)

  • Tokenization: Converts text into tokens for model processing
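
A rough back-of-the-envelope estimate shows why quantization is the key enabler here (weights only, ignoring the KV cache; bytes-per-weight figures are approximate):

# Approximate weight memory for a 7B-parameter model
# FP16:  7B x 2.0  bytes/weight ≈ 14 GB
# Q8_0:  7B x ~1.1 bytes/weight ≈ 7.5 GB
# Q4_0:  7B x ~0.6 bytes/weight ≈ 4 GB  (fits in 8GB RAM with headroom)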

Backend Selection Guide

Choose based on your hardware:

  • NVIDIA GPU → CUDA

  • AMD/Intel GPU → Vulkan

  • CPU only → a BLAS backend such as OpenBLAS (any CPU) or Intel oneMKL (Intel CPUs)
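
The backend is selected at build time through CMake options. A sketch of the relevant flags (names as used by recent llama.cpp releases; check the repository's build documentation for your version):

# NVIDIA GPU (CUDA)
cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release

# AMD/Intel GPU (Vulkan)
cmake -B build -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release

# CPU only, with a BLAS library (OpenBLAS shown)
cmake -B build -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS -DCMAKE_BUILD_TYPE=Release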

Quick-Start Guide

System Requirements

Minimum:
- CPU with AVX2 support
- 8GB RAM
- Compatible GPU (optional)

Recommended:
- 16GB RAM
- GPU with 8GB+ VRAM
- SSD for model storage

Installation

# Build with GPU support (Vulkan example)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -G Ninja \
    -DGGML_VULKAN=ON \
    -DCMAKE_BUILD_TYPE=Release
cmake --build build
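
After the build finishes, the executables are placed in build/bin/; in recent releases the tool names carry a llama- prefix:

# List the built tools (e.g. llama-cli, llama-server, llama-quantize, llama-bench)
ls build/bin/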

Basic Usage

  1. Download and quantize a model:

# Quantize an existing GGUF model to 4-bit (q4_0)
./build/bin/llama-quantize model.gguf model-q4_0.gguf q4_0
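
The quantize step assumes you already have a GGUF file. If you start from a Hugging Face checkpoint, it first needs to be converted with the conversion script shipped in the llama.cpp repository; a minimal sketch (script name and flags may differ slightly between releases):

# Convert a downloaded Hugging Face model directory to GGUF
pip install -r requirements.txt
python convert_hf_to_gguf.py /path/to/hf-model --outfile model.gguf
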
  2. Run the server:

./build/bin/llama-server \
    -m model-q4_0.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    --n-gpu-layers 99
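
Once running, the server exposes an OpenAI-compatible HTTP API. A quick smoke test with curl (prompt and sampling values are placeholders):

curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "messages": [{"role": "user", "content": "Hello!"}],
        "temperature": 0.7
    }'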

Advanced Configuration

Sampling Parameters

Parameter     Range      Purpose
Temperature   0.1-1.0    Response creativity
Top-P         0.1-0.95   Token diversity
Top-K         20-60      Token selection
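
These parameters can be set per request through the server API or as command-line flags. A sketch using llama-cli (flag names from recent builds; adjust values to taste):

./build/bin/llama-cli \
    -m model-q4_0.gguf \
    -p "Explain quantization in one paragraph." \
    -n 256 \
    --temp 0.7 \
    --top-p 0.9 \
    --top-k 40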

Performance Optimization

  • GPU Layers: Use --n-gpu-layers 99 for full GPU offloading

  • Batch Size: Increase for better throughput (--batch-size 512)

  • Context Size: Adjust based on available memory (--ctx-size 2048)
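
Combined, a server launch tuned for a single 8GB GPU might look like the sketch below; the values are starting points to measure against, not universal recommendations:

./build/bin/llama-server \
    -m model-q4_0.gguf \
    --n-gpu-layers 99 \
    --batch-size 512 \
    --ctx-size 2048 \
    --port 8080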

Benchmarking

# Run performance test
./build/bin/llama-bench \
    -m model-q4_0.gguf \
    --n-prompt 512 \
    --n-gen 128

Troubleshooting

Common Issues:

  1. Out of Memory

    • Try lower quantization (q4_0 instead of q5_0)

    • Reduce context size

    • Enable GPU offloading

  2. Poor Performance

    • Check GPU utilization

    • Adjust batch size

    • Verify backend configuration

  3. Generation Quality

    • Tune sampling parameters

    • Check model quantization level

    • Verify context window size

Recommended Models

  1. Small/Testing: SmolLM2 1.7B

    • Fast, lightweight

    • Good for development

  2. Production: Qwen 14B

    • Strong performance

    • Reasonable resource requirements

  3. Balanced: Llama-2 7B

    • Good performance/resource ratio

    • Wide compatibility

Additional Resources

Documentation

  • llama.cpp repository (build instructions and tool docs): https://github.com/ggerganov/llama.cpp

Tools

  • llama-bench: Performance testing

  • llama-cli: Command-line interface

  • llama-server: HTTP server with an OpenAI-compatible API