
Running multiple large language models can be useful, whether for comparing model outputs, setting up a fallback in case one fails, or customizing behavior (like using one model for coding and another for technical writing). This is how we often use LLMs in practice. There are apps like poe.com that offer this kind of setup. It’s a single platform where you can run multiple LLMs. But what if you want to do it all locally, save on API costs, and keep your data private?
Well, that’s where the real problem shows up. Setting this up usually means juggling different ports, running separate processes, and switching between them manually. Not ideal.
That’s exactly the pain Llama-Swap solves. It’s an open-source proxy server that’s super lightweight (just a single binary), and it lets you switch between multiple local LLMs easily. In simple terms, it listens for OpenAI-style API calls on your machine and automatically starts or stops the right model server based on the model you request. Let’s break down how it works and walk through a step-by-step setup to get it running on your local machine.
# How Llama-Swap Works
Conceptually, Llama-Swap sits in front of your LLM servers as a smart router. When an API request arrives (e.g., a POST /v1/chat/completions call), it looks at the "model" field in the JSON payload. It then loads the appropriate server process for that model, shutting down any other model if needed. For example, if you first request model "A" and then request model "B", Llama-Swap will automatically stop the server for "A" and start the server for "B" so that each request is served by the correct model. This dynamic swapping happens transparently, so clients see the expected response without worrying about the underlying processes.
By default, Llama-Swap allows only one model to run at a time (it unloads others when switching). However, its Groups feature lets you change this behavior. A group can list several models and control their swap behavior. For example, setting swap: false in a group means all group members can run together without unloading. In practice, you might use one group for heavyweight models (only one active at a time) and another "parallel" group for small models you want running concurrently. This gives you full control over resource usage and concurrency on a single server.
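For illustration, a minimal groups section could look something like the sketch below. The group and model names are placeholders, and the exact keys are worth double-checking against the project's full configuration example:
groups:
  "heavy":
    swap: true       # only one member of this group is loaded at a time
    members:
      - "model-a"
  "parallel":
    swap: false      # members of this group can stay loaded side by side
    members:
      - "model-b"
      - "model-c"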
# Prerequisites
Before getting started, ensure your system has the following:
- Python 3 (>=3.8): Needed for basic scripting and tooling.
- Homebrew (on macOS): Makes installing LLM runtimes easy. For example, you can install the llama.cpp server with:
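# Installs llama.cpp, which includes the llama-server binary
brew install llama.cpp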
This provides the llama-server binary for hosting models locally.
- llama.cpp (llama-server): The OpenAI-compatible server binary (installed via Homebrew above, or built from source) that actually runs the LLM model.
- Hugging Face CLI: For downloading models directly to your local machine without logging into the site or manually navigating model pages. Install it using:
pip install -U "huggingface_hub[cli]"
- Hardware: Any modern CPU will work. For faster inference, a GPU is useful. (On Apple Silicon Macs, you can run on the CPU or try PyTorch’s MPS backend for supported models. On Linux/Windows with NVIDIA GPUs, you can use Docker/CUDA containers for acceleration.)
- Docker (Optional): To run the pre-built Docker images. However, I chose not to use this for this guide because these images are designed mainly for x86 (Intel/AMD) systems and don’t work reliably on Apple Silicon (M1/M2) Macs. Instead, I used the bare-metal installation method, which works directly on macOS without any container overhead.
In summary, you’ll need a Python environment and a local LLM server (like the `llama.cpp` server). We will use these to host two example models on one machine.
# Step-by-Step Instructions
// 1. Installing Llama-Swap
Download the latest Llama-Swap release for your OS from the GitHub releases page. At the time of writing, the latest release was v126. Run the following commands:
# Step 1: Download the correct file
curl -L -o llama-swap.tar.gz \
  https://github.com/mostlygeek/llama-swap/releases/download/v126/llama-swap_126_darwin_arm64.tar.gz
Output:
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 3445k 100 3445k 0 0 1283k 0 0:00:02 0:00:02 --:--:-- 5417k
Now, extract the file, make it executable, and test it by checking the version:
# Step 2: Extract it
tar -xzf llama-swap.tar.gz
# Step 3: Make it executable
chmod +x llama-swap
# Step 4: Test it
./llama-swap --version
Output:
version: 126 (591a9cdf4d3314fe4b3906e939a17e76402e1655), built at 2025-06-16T23:53:50Z
// 2. Downloading and Preparing Two or More LLMs
Choose two example models to run. We’ll use Qwen2.5-0.5B and SmolLM2-135M (small models) from Hugging Face. You need the model files (in GGUF or similar format) on your machine. For example, using the Hugging Face CLI:
mkdir -p ~/llm-models
huggingface-cli download bartowski/SmolLM2-135M-Instruct-GGUF \
  --include "SmolLM2-135M-Instruct-Q4_K_M.gguf" --local-dir ~/llm-models
huggingface-cli download bartowski/Qwen2.5-0.5B-Instruct-GGUF \
  --include "Qwen2.5-0.5B-Instruct-Q4_K_M.gguf" --local-dir ~/llm-models
This will:
- Create the llm-models directory in your user's home folder
- Download the GGUF model files into that folder

After the download, you can confirm the files are there by listing the directory:
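ls ~/llm-models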
Output:
SmolLM2-135M-Instruct-Q4_K_M.gguf
Qwen2.5-0.5B-Instruct-Q4_K_M.gguf
// 3. Creating a Llama-Swap Configuration
Llama-Swap uses a single YAML file to define models and server commands. Create a config.yaml file with contents like this:
models:
  "smollm2":
    cmd: |
      llama-server
      --model /path/to/models/llm-models/SmolLM2-135M-Instruct-Q4_K_M.gguf
      --port ${PORT}
  "qwen2.5":
    cmd: |
      llama-server
      --model /path/to/models/llm-models/Qwen2.5-0.5B-Instruct-Q4_K_M.gguf
      --port ${PORT}
Replace /path/to/models/ with your actual local path. Each entry under models: gives an ID (like "qwen2.5") and a shell cmd: to run its server. We use llama-server (from llama.cpp) with --model pointing to the GGUF file and --port ${PORT}. The ${PORT} macro tells Llama-Swap to assign a free port to each model automatically. The groups section is optional; I have omitted it for this example, so by default Llama-Swap will only run one model at a time. You can customize many options per model (aliases, timeouts, etc.) in this configuration. For more details on available options, see the Full Configuration Example File.
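To illustrate those per-model options, a model entry could be extended roughly as in the sketch below. The aliases and ttl fields are based on the project's documentation rather than this guide's setup, so verify the exact key names in the Full Configuration Example File before relying on them:
models:
  "qwen2.5":
    cmd: |
      llama-server
      --model /path/to/models/llm-models/Qwen2.5-0.5B-Instruct-Q4_K_M.gguf
      --port ${PORT}
    aliases:
      - "qwen-small"   # an extra name clients may use in the "model" field
    ttl: 300           # unload the model after 300 seconds of inactivity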
// 4. Running Llama-Swap
With the binary and config.yaml ready, start Llama-Swap pointing to your config:
./llama-swap --config config.yaml --listen 127.0.0.1:8080
This launches the proxy server on localhost:8080. It will read config.yaml and (at first) load no models until the first request arrives. Llama-Swap will now handle API requests on port 8080, forwarding them to the appropriate underlying llama-server process based on the "model" parameter.
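As a quick sanity check, you can ask the proxy which models it knows about. This assumes llama-swap exposes the OpenAI-style /v1/models listing endpoint, as OpenAI-compatible servers generally do:
curl -s http://localhost:8080/v1/models | jq '.data[].id'
If everything is wired up correctly, this should print the model IDs from your config ("smollm2" and "qwen2.5").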
// 5. Interacting with Your Models
Now you can make OpenAI-style API calls to test each model. jq is used below to extract the response text, so install it first if you don't already have it:
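# macOS; use your system's package manager on Linux/Windows
brew install jq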
// Using Qwen2.5
curl -s http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer no-key" \
  -d '{
    "model": "qwen2.5",
    "prompt": "User: What is Python?\nAssistant:",
    "max_tokens": 100
  }' | jq '.choices[0].text'
Output:
"Python is a popular general-purpose programming language. It is easy to learn, has a large standard library, and is compatible with many operating systems. Python is used for web development, data analysis, scientific computing, and machine learning.nPython is a language that is popular for web development due to its simplicity, versatility and its use of modern features. It is used in a wide range of applications including web development, data analysis, scientific computing, machine learning and more. Python is a popular language in the"
// Using SmolLM2
curl -s http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer no-key" \
  -d '{
    "model": "smollm2",
    "prompt": "User: What is Python?\nAssistant:",
    "max_tokens": 100
  }' | jq '.choices[0].text'
Output:
"Python is a high-level programming language designed for simplicity and efficiency. It's known for its readability, syntax, and versatility, making it a popular choice for beginners and developers alike.nnWhat is Python?"
Each model will respond according to its training. The beauty of Llama-Swap is that you don't have to restart anything manually; just change the "model" field, and it handles the rest. As shown in the examples above, you'll see:
- qwen2.5: a more verbose, technical response
- smollm2: a simpler, more concise answer
That confirms Llama-Swap is routing requests to the correct model!
# Conclusion
Congratulations! You’ve set up Llama-Swap to run two LLMs on one machine, and you can now switch between them on the fly via API calls. We installed a proxy, prepared a YAML configuration with two models, and saw how Llama-Swap routes requests to the correct backend.
Next steps: You can expand this to include:
- Larger models (like TinyLlama, Phi-2, Mistral)
- Groups for concurrent serving
- Integration with LangChain, FastAPI, or other frontends
Have fun exploring different models and configurations!
Kanwal Mehreen is a machine learning engineer and a technical writer with a profound passion for data science and the intersection of AI with medicine. She co-authored the ebook “Maximizing Productivity with ChatGPT”. As a Google Generation Scholar 2022 for APAC, she champions diversity and academic excellence. She’s also recognized as a Teradata Diversity in Tech Scholar, Mitacs Globalink Research Scholar, and Harvard WeCode Scholar. Kanwal is an ardent advocate for change, having founded FEMCodes to empower women in STEM fields.