
Running multiple large language models can be useful, whether for comparing model outputs, setting up a fallback in case one fails, or customizing behavior (like using one model for coding and another for technical writing). This is how we often use LLMs in practice. There are apps like poe.com that offer this kind of setup. It’s a single platform where you can run multiple LLMs. But what if you want to do it all locally, save on API costs, and keep your data private?
Well, that’s where the real problem shows up. Setting this up usually means juggling different ports, running separate processes, and switching between them manually. Not ideal.
That’s exactly the pain Llama-Swap solves. It’s an open-source proxy server that’s super lightweight (just a single binary), and it lets you switch between multiple local LLMs easily. In simple terms, it listens for OpenAI-style API calls on your machine and automatically starts or stops the right model server based on the model you request. Let’s break down how it works and walk through a step-by-step setup to get it running on your local machine.
# How Llama-Swap Works
Conceptually, Llama-Swap sits in front of your LLM servers as a smart router. When an API request arrives (e.g., a POST /v1/chat/completions call), it looks at the "model" field in the JSON payload. It then loads the appropriate server process for that model, shutting down any other model if needed. For example, if you first request model "A" and then request model "B", Llama-Swap will automatically stop the server for "A" and start the server for "B" so that each request is served by the correct model. This dynamic swapping happens transparently, so clients see the expected response without worrying about the underlying processes.
By default, Llama-Swap allows only one model to run at a time (it unloads others when switching). However, its Groups feature lets you change this behavior. A group can list several models and control their swap behavior. For example, setting swap: false in a group means all group members can run together without unloading. In practice, you might use one group for heavyweight models (only one active at a time) and another "parallel" group for small models you want running concurrently. This gives you full control over resource usage and concurrency on a single server.
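For illustration, a minimal groups section could look something like the sketch below. The group and model names are placeholders, and the exact keys are worth double-checking against the project's full configuration example:
groups:
  "heavy":
    swap: true       # only one member of this group is loaded at a time
    members:
      - "model-a"
  "parallel":
    swap: false      # members of this group can stay loaded side by side
    members:
      - "model-b"
      - "model-c"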
# Prerequisites
Before getting started, ensure your system has the following:
- Python 3 (>=3.8): Needed for basic scripting and tooling.
- Homebrew (on macOS): Makes installing LLM runtimes easy. For example, you can install the llama.cpp server with:
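# Installs llama.cpp, which includes the llama-server binary
brew install llama.cpp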
This provides the llama-server binary for hosting models locally.
- llama.cpp (llama-server): The OpenAI-compatible server binary (installed via Homebrew above, or built from source) that actually runs the LLM model.
- Hugging Face CLI: For downloading models directly to your local machine without logging into the site or manually navigating model pages. Install it using:
pip install -U "huggingface_hub[cli]"
- Hardware: Any modern CPU will work. For faster inference, a GPU is useful. (On Apple Silicon Macs, you can run on the CPU or try PyTorch’s MPS backend for supported models. On Linux/Windows with NVIDIA GPUs, you can use Docker/CUDA containers for acceleration.)
- Docker (Optional): To run the pre-built Docker images. However, I chose not to use this for this guide because these images are designed mainly for x86 (Intel/AMD) systems and don’t work reliably on Apple Silicon (M1/M2) Macs. Instead, I used the bare-metal installation method, which works directly on macOS without any container overhead.
In summary, you’ll need a Python environment and a local LLM server (like the `llama.cpp` server). We will use these to host two example models on one machine.
# Step-by-Step Instructions
// 1. Installing Llama-Swap
Download the latest Llama-Swap release for your OS from the GitHub releases page. At the time of writing, the latest release was v126. Run the following commands:
# Step 1: Download the correct file
curl -L -o llama-swap.tar.gz \
  https://github.com/mostlygeek/llama-swap/releases/download/v126/llama-swap_126_darwin_arm64.tar.gz
Output:
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 3445k 100 3445k 0 0 1283k 0 0:00:02 0:00:02 --:--:-- 5417k
Now, extract the file, make it executable, and test it by checking the version:
# Step 2: Extract it
tar -xzf llama-swap.tar.gz
# Step 3: Make it executable
chmod +x llama-swap
# Step 4: Test it
./llama-swap --version
Output:
version: 126 (591a9cdf4d3314fe4b3906e939a17e76402e1655), built at 2025-06-16T23:53:50Z
// 2. Downloading and Preparing Two or More LLMs
Choose two example models to run. We’ll use Qwen2.5-0.5B and SmolLM2-135M (small models) from Hugging Face. You need the model files (in GGUF or similar format) on your machine. For example, using the Hugging Face CLI:
mkdir -p ~/llm-models
huggingface-cli download bartowski/SmolLM2-135M-Instruct-GGUF \
  --include "SmolLM2-135M-Instruct-Q4_K_M.gguf" --local-dir ~/llm-models
huggingface-cli download bartowski/Qwen2.5-0.5B-Instruct-GGUF \
  --include "Qwen2.5-0.5B-Instruct-Q4_K_M.gguf" --local-dir ~/llm-models
This will:
- Create the llm-models directory in your user's home folder
- Download the GGUF model files into that folder

After the download, you can confirm the files are there by listing the directory:
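ls ~/llm-models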
Output:
SmolLM2-135M-Instruct-Q4_K_M.gguf
Qwen2.5-0.5B-Instruct-Q4_K_M.gguf
// 3. Creating a Llama-Swap Configuration
Llama-Swap uses a single YAML file to define models and server commands. Create a config.yaml file with contents like this:
models:
  "smollm2":
    cmd: |
      llama-server
      --model /path/to/models/llm-models/SmolLM2-135M-Instruct-Q4_K_M.gguf
      --port ${PORT}
  "qwen2.5":
    cmd: |
      llama-server
      --model /path/to/models/llm-models/Qwen2.5-0.5B-Instruct-Q4_K_M.gguf
      --port ${PORT}
Replace /path/to/models/ with your actual local path. Each entry under models: gives an ID (like "qwen2.5") and a shell cmd: to run its server. We use llama-server (from llama.cpp) with --model pointing to the GGUF file and --port ${PORT}. The ${PORT} macro tells Llama-Swap to assign a free port to each model automatically. The groups section is optional; I have omitted it for this example, so by default Llama-Swap will only run one model at a time. You can customize many options per model (aliases, timeouts, etc.) in this configuration. For more details on available options, see the Full Configuration Example File.
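To illustrate those per-model options, a model entry could be extended roughly as in the sketch below. The aliases and ttl fields are based on the project's documentation rather than this guide's setup, so verify the exact key names in the Full Configuration Example File before relying on them:
models:
  "qwen2.5":
    cmd: |
      llama-server
      --model /path/to/models/llm-models/Qwen2.5-0.5B-Instruct-Q4_K_M.gguf
      --port ${PORT}
    aliases:
      - "qwen-small"   # an extra name clients may use in the "model" field
    ttl: 300           # unload the model after 300 seconds of inactivity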
// 4. Running Llama-Swap
With the binary and config.yaml ready, start Llama-Swap pointing to your config:
./llama-swap --config config.yaml --listen 127.0.0.1:8080
This launches the proxy server on localhost:8080. It will read config.yaml and (at first) load no models until the first request arrives. Llama-Swap will now handle API requests on port 8080, forwarding them to the appropriate underlying llama-server process based on the "model" parameter.
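As a quick sanity check, you can ask the proxy which models it knows about. This assumes llama-swap exposes the OpenAI-style /v1/models listing endpoint, as OpenAI-compatible servers generally do:
curl -s http://localhost:8080/v1/models | jq '.data[].id'
If everything is wired up correctly, this should print the model IDs from your config ("smollm2" and "qwen2.5").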
// 5. Interacting with Your Models
Now you can make OpenAI-style API calls to test each model. jq is used below to extract the response text, so install it first if you don't already have it:
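# macOS; use your system's package manager on Linux/Windows
brew install jq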
// Using Qwen2.5
curl -s http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer no-key" \
  -d '{
    "model": "qwen2.5",
    "prompt": "User: What is Python?\nAssistant:",
    "max_tokens": 100
  }' | jq '.choices[0].text'
Output:
"Python is a popular general-purpose programming language. It is easy to learn, has a large standard library, and is compatible with many operating systems. Python is used for web development, data analysis, scientific computing, and machine learning.nPython is a language that is popular for web development due to its simplicity, versatility and its use of modern features. It is used in a wide range of applications including web development, data analysis, scientific computing, machine learning and more. Python is a popular language in the"
// Using SmolLM2
curl -s http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer no-key" \
  -d '{
    "model": "smollm2",
    "prompt": "User: What is Python?\nAssistant:",
    "max_tokens": 100
  }' | jq '.choices[0].text'
Output:
"Python is a high-level programming language designed for simplicity and efficiency. It's known for its readability, syntax, and versatility, making it a popular choice for beginners and developers alike.nnWhat is Python?"
Each model will respond according to its training. The beauty of Llama-Swap is that you don't have to restart anything manually; just change the "model" field, and it handles the rest. As shown in the examples above, you'll see:
- qwen2.5: a more verbose, technical response
- smollm2: a simpler, more concise answer
That confirms Llama-Swap is routing requests to the correct model!
# Conclusion
Congratulations! You’ve set up Llama-Swap to run two LLMs on one machine, and you can now switch between them on the fly via API calls. We installed a proxy, prepared a YAML configuration with two models, and saw how Llama-Swap routes requests to the correct backend.
Next steps: You can expand this to include:
- Larger models (like TinyLlama, Phi-2, Mistral)
- Groups for concurrent serving
- Integration with LangChain, FastAPI, or other frontends
Have fun exploring different models and configurations!
Kanwal Mehreen is a machine learning engineer and a technical writer with a profound passion for data science and the intersection of AI with medicine. She co-authored the ebook “Maximizing Productivity with ChatGPT”. As a Google Generation Scholar 2022 for APAC, she champions diversity and academic excellence. She’s also recognized as a Teradata Diversity in Tech Scholar, Mitacs Globalink Research Scholar, and Harvard WeCode Scholar. Kanwal is an ardent advocate for change, having founded FEMCodes to empower women in STEM fields.