An intro to llama.cpp
A practical guide to running LLMs locally. AI summary of SteelPh0enix's blog post.
Overview
Llama.cpp enables local deployment of Large Language Models through efficient C++ implementation and model quantization. It supports various hardware configurations and optimization techniques, making it possible to run substantial models on consumer hardware (e.g., running a 7B parameter model on 8GB RAM).
Key benefits include privacy protection, offline operation, and flexible deployment options across different hardware setups. For example, you can run smaller quantized models on a laptop for development or deploy larger models on GPUs for production use.
Core Concepts
Model Basics
Quantization: Reduces model size and memory usage through precision reduction (see the rough size math after this list)
Context Window: Determines how much text the model can "remember" (typically 2K-8K tokens)
Tokenization: Converts text into tokens for model processing
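As a rough illustration of what quantization buys you (approximate numbers, not from the original post; q4_0 and q8_0 use roughly 4.5 and 8.5 bits per weight once scaling factors are included), the weight-only footprint of a 7B model can be estimated directly:
# Back-of-the-envelope weight size for a 7B-parameter model
# (weights only; the KV cache and runtime overhead come on top)
awk 'BEGIN { p = 7e9
  printf "f16 : ~%.1f GB\n", p * 16  / 8 / 1e9   # full half-precision
  printf "q8_0: ~%.1f GB\n", p * 8.5 / 8 / 1e9   # 8-bit quantization
  printf "q4_0: ~%.1f GB\n", p * 4.5 / 8 / 1e9   # 4-bit quantization
}'
This is why a q4_0 7B model fits in 8GB of RAM while the same model at full precision does not.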
Backend Selection Guide
Choose based on your hardware (the backend is picked at build time via CMake flags; see the example after this list):
NVIDIA GPU → CUDA
AMD/Intel GPU → Vulkan
CPU only → OpenBLAS (any CPU, including AMD) or oneMKL (Intel)
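Each backend maps to a CMake option. As an example (the GGML_CUDA flag name matches current llama.cpp; the Vulkan equivalent is shown in the Installation section below), an NVIDIA build is configured like this:
# Example: configure a CUDA build for NVIDIA GPUs
cmake -B build -G Ninja \
    -DGGML_CUDA=ON \
    -DCMAKE_BUILD_TYPE=Release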
Quick-Start Guide
System Requirements
Minimum:
- CPU with AVX2 support
- 8GB RAM
- Compatible GPU (optional)
Recommended:
- 16GB RAM
- GPU with 8GB+ VRAM
- SSD for model storage
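On Linux you can verify the AVX2 requirement straight from the CPU flags (a quick check, not part of the original post):
# Prints "avx2" once if the CPU supports the AVX2 instruction set
grep -o -m 1 'avx2' /proc/cpuinfo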
Installation
# Build with GPU support (Vulkan example)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -G Ninja \
-DGGML_VULKAN=ON \
-DCMAKE_BUILD_TYPE=Release
cmake --build build
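The compiled binaries land in build/bin/; the commands below assume you run them from that directory or have added it to your PATH. A quick smoke test (assuming a recent llama.cpp, where the --version flag is available):
# Verify the build produced a working binary
./build/bin/llama-cli --version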
Basic Usage
Download and quantize a model:
# Convert to 4-bit quantization (q4_0)
./llama-quantize model.gguf model-q4_0.gguf q4_0
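The quantize step expects a GGUF file as input. If you are starting from a Hugging Face checkpoint, it has to be downloaded and converted first; a sketch of the full flow (the huggingface-cli usage and the convert_hf_to_gguf.py script name reflect current tooling and may differ in your version; <model-repo> is a placeholder to fill in):
# 1. Download the original weights (placeholder repository name)
huggingface-cli download <model-repo> --local-dir ./model-hf

# 2. Convert the checkpoint to an unquantized GGUF file
python convert_hf_to_gguf.py ./model-hf --outfile model.gguf

# 3. Quantize it as shown above
./llama-quantize model.gguf model-q4_0.gguf q4_0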
Run the server:
./llama-server \
-m model-q4_0.gguf \
--host 0.0.0.0 \
--port 8080 \
--n-gpu-layers 99
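The server exposes an OpenAI-compatible HTTP API, so once it is running you can query it with plain curl (a minimal sketch; the /v1/chat/completions route follows the OpenAI convention):
# Send a chat request to the local server
curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "Explain quantization in one sentence."}], "temperature": 0.7}'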
Advanced Configuration
Sampling Parameters
Parameter   | Range    | Purpose
Temperature | 0.1-1.0  | Response creativity
Top-P       | 0.1-0.95 | Token diversity
Top-K       | 20-60    | Token selection
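These parameters can be passed as CLI flags (or per request through the server API). For example, with llama-cli (flag names as in current llama.cpp):
# Generate with explicit sampling settings
./llama-cli -m model-q4_0.gguf \
    --temp 0.7 \
    --top-p 0.9 \
    --top-k 40 \
    -p "Write a haiku about local LLMs"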
Performance Optimization
GPU Layers: Use --n-gpu-layers 99 for full GPU offloading
Batch Size: Increase for better throughput (--batch-size 512)
Context Size: Adjust based on available memory (--ctx-size 2048)
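Putting the three knobs above together in a single server invocation (the values are starting points, not tuned recommendations):
# Full GPU offload, larger batch, 4K context
./llama-server -m model-q4_0.gguf \
    --n-gpu-layers 99 \
    --batch-size 512 \
    --ctx-size 4096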
Benchmarking
# Run performance test
./llama-bench \
-m model-q4_0.gguf \
--n-prompt 512 \
--n-gen 128
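llama-bench can take a comma-separated list for a parameter and benchmark each value, which makes it easy to compare GPU offload levels (assuming your build supports the -ngl option, as current versions do):
# Compare throughput with 0, 16, and 99 layers offloaded to the GPU
./llama-bench -m model-q4_0.gguf \
    --n-prompt 512 \
    --n-gen 128 \
    -ngl 0,16,99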
Troubleshooting
Common Issues:
Out of Memory
Try lower quantization (q4_0 instead of q5_0)
Reduce context size
Enable GPU offloading
Poor Performance
Check GPU utilization
Adjust batch size
Verify backend configuration
Generation Quality
Tune sampling parameters
Check model quantization level
Verify context window size
Recommended Models
Small/Testing: SmolLM2 1.7B
Fast, lightweight
Good for development
Production: Qwen 14B
Strong performance
Reasonable resource requirements
Balanced: Llama-2 7B
Good performance/resource ratio
Wide compatibility
Additional Resources
Documentation: llama.cpp repository README and docs/ directory (https://github.com/ggerganov/llama.cpp)
Tools
llama-bench: Performance testing
llama-cli: Command-line interface
Server API: OpenAI-compatible endpoint