An intro to llama.cpp
A practical guide to running LLMs locally. AI summary of SteelPh0enix's blog post.
Overview
Llama.cpp enables local deployment of Large Language Models through efficient C++ implementation and model quantization. It supports various hardware configurations and optimization techniques, making it possible to run substantial models on consumer hardware (e.g., running a 7B parameter model on 8GB RAM).
Key benefits include privacy protection, offline operation, and flexible deployment options across different hardware setups. For example, you can run smaller quantized models on a laptop for development or deploy larger models on GPUs for production use.
Core Concepts
Model Basics
Quantization: Reduces model size and memory usage through precision reduction (see the rough size math after this list)
Context Window: Determines how much text the model can "remember" (typically 2K-8K tokens)
Tokenization: Converts text into tokens for model processing
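As a rough illustration of what quantization buys you (approximate numbers, not from the original post; q4_0 and q8_0 use roughly 4.5 and 8.5 bits per weight once scaling factors are included), the weight-only footprint of a 7B model can be estimated directly:
# Back-of-the-envelope weight size for a 7B-parameter model
# (weights only; the KV cache and runtime overhead come on top)
awk 'BEGIN { p = 7e9
  printf "f16 : ~%.1f GB\n", p * 16  / 8 / 1e9   # full half-precision
  printf "q8_0: ~%.1f GB\n", p * 8.5 / 8 / 1e9   # 8-bit quantization
  printf "q4_0: ~%.1f GB\n", p * 4.5 / 8 / 1e9   # 4-bit quantization
}'
This is why a q4_0 7B model fits in 8GB of RAM while the same model at full precision does not.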
Backend Selection Guide
Choose based on your hardware (the backend is picked at build time via CMake flags; see the example after this list):
NVIDIA GPU → CUDA
AMD/Intel GPU → Vulkan
CPU only → OpenBLAS (any CPU, including AMD) or oneMKL (Intel)
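Each backend maps to a CMake option. As an example (the GGML_CUDA flag name matches current llama.cpp; the Vulkan equivalent is shown in the Installation section below), an NVIDIA build is configured like this:
# Example: configure a CUDA build for NVIDIA GPUs
cmake -B build -G Ninja \
    -DGGML_CUDA=ON \
    -DCMAKE_BUILD_TYPE=Release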
Quick-Start Guide
System Requirements
Minimum:
- CPU with AVX2 support
- 8GB RAM
- Compatible GPU (optional)
Recommended:
- 16GB RAM
- GPU with 8GB+ VRAM
- SSD for model storage
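On Linux you can verify the AVX2 requirement straight from the CPU flags (a quick check, not part of the original post):
# Prints "avx2" once if the CPU supports the AVX2 instruction set
grep -o -m 1 'avx2' /proc/cpuinfo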
Installation
# Build with GPU support (Vulkan example)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -G Ninja \
-DGGML_VULKAN=ON \
-DCMAKE_BUILD_TYPE=Release
cmake --build build
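The compiled binaries land in build/bin/; the commands below assume you run them from that directory or have added it to your PATH. A quick smoke test (assuming a recent llama.cpp, where the --version flag is available):
# Verify the build produced a working binary
./build/bin/llama-cli --version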
Basic Usage
Download and quantize a model:
# Convert to 4-bit quantization (q4_0)
./llama-quantize model.gguf model-q4_0.gguf q4_0
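The quantize step expects a GGUF file as input. If you are starting from a Hugging Face checkpoint, it has to be downloaded and converted first; a sketch of the full flow (the huggingface-cli usage and the convert_hf_to_gguf.py script name reflect current tooling and may differ in your version; <model-repo> is a placeholder to fill in):
# 1. Download the original weights (placeholder repository name)
huggingface-cli download <model-repo> --local-dir ./model-hf

# 2. Convert the checkpoint to an unquantized GGUF file
python convert_hf_to_gguf.py ./model-hf --outfile model.gguf

# 3. Quantize it as shown above
./llama-quantize model.gguf model-q4_0.gguf q4_0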
Run the server:
./llama-server \
-m model-q4_0.gguf \
--host 0.0.0.0 \
--port 8080 \
--n-gpu-layers 99
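The server exposes an OpenAI-compatible HTTP API, so once it is running you can query it with plain curl (a minimal sketch; the /v1/chat/completions route follows the OpenAI convention):
# Send a chat request to the local server
curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "Explain quantization in one sentence."}], "temperature": 0.7}'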
Advanced Configuration
Sampling Parameters
Parameter   | Range    | Purpose
Temperature | 0.1-1.0  | Response creativity
Top-P       | 0.1-0.95 | Token diversity
Top-K       | 20-60    | Token selection
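These parameters can be passed as CLI flags (or per request through the server API). For example, with llama-cli (flag names as in current llama.cpp):
# Generate with explicit sampling settings
./llama-cli -m model-q4_0.gguf \
    --temp 0.7 \
    --top-p 0.9 \
    --top-k 40 \
    -p "Write a haiku about local LLMs"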
Performance Optimization
GPU Layers: Use --n-gpu-layers 99 for full GPU offloading
Batch Size: Increase for better throughput (--batch-size 512)
Context Size: Adjust based on available memory (--ctx-size 2048)
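Putting the three knobs above together in a single server invocation (the values are starting points, not tuned recommendations):
# Full GPU offload, larger batch, 4K context
./llama-server -m model-q4_0.gguf \
    --n-gpu-layers 99 \
    --batch-size 512 \
    --ctx-size 4096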
Benchmarking
# Run performance test
./llama-bench \
-m model-q4_0.gguf \
--n-prompt 512 \
--n-gen 128
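llama-bench can take a comma-separated list for a parameter and benchmark each value, which makes it easy to compare GPU offload levels (assuming your build supports the -ngl option, as current versions do):
# Compare throughput with 0, 16, and 99 layers offloaded to the GPU
./llama-bench -m model-q4_0.gguf \
    --n-prompt 512 \
    --n-gen 128 \
    -ngl 0,16,99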
Troubleshooting
Common Issues:
Out of Memory
Try lower quantization (q4_0 instead of q5_0)
Reduce context size
Enable GPU offloading
Poor Performance
Check GPU utilization
Adjust batch size
Verify backend configuration
Generation Quality
Tune sampling parameters
Check model quantization level
Verify context window size
Recommended Models
Small/Testing: SmolLM2 1.7B
Fast, lightweight
Good for development
Production: Qwen 14B
Strong performance
Reasonable resource requirements
Balanced: Llama-2 7B
Good performance/resource ratio
Wide compatibility
Additional Resources
Documentation: llama.cpp repository README and docs/ directory (https://github.com/ggerganov/llama.cpp)
Tools
llama-bench: Performance testing
llama-cli: Command-line interface
Server API: OpenAI-compatible endpoint