
Have you ever wondered if there’s a better way to install and run llama.cpp locally? Almost every local large language model (LLM) application today relies on llama.cpp as the backend for running models. But here’s the catch: most setups are too complex, require multiple tools, or don’t give you a powerful user interface (UI) out of the box.
Wouldn’t it be great if you could:
- Run a powerful model like GPT-OSS 20B with just a few commands
- Get a modern Web UI instantly, without extra hassle
- Have the fastest and most optimized setup for local inference
That’s exactly what this tutorial is about.
In this guide, we will walk through one of the simplest and fastest ways to run the GPT-OSS 20B model locally, using the llama-cpp-python package together with Open WebUI. By the end, you will have a fully working local LLM environment that’s easy to use and efficient.
# 1. Setting Up Your Environment
If you already have the uv command installed, your life just got easier. If not, don’t worry: you can install it quickly by following the official uv installation guide.
Once uv is installed, open your terminal and install Python 3.12 with:
uv python install 3.12
Next, let’s set up a project directory, create a virtual environment, and activate it:
mkdir -p ~/gpt-oss && cd ~/gpt-oss
uv venv .venv --python 3.12
source .venv/bin/activate
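To confirm the environment is active, run which python; it should point to ~/gpt-oss/.venv/bin/python.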
# 2. Installing Python Packages
Now that your environment is ready, let’s install the required Python packages.
First, update pip to the latest version. Next, install the llama-cpp-python server package. The wheel below is built with CUDA support (for NVIDIA GPUs), so you will get maximum performance if you have a compatible GPU:
uv pip install --upgrade pip
uv pip install "llama-cpp-python[server]" --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu124
Finally, install Open WebUI and Hugging Face Hub:
uv pip install open-webui huggingface_hub
- Open WebUI: Provides a ChatGPT-style web interface for your local LLM server
- Hugging Face Hub: Makes it easy to download and manage models directly from Hugging Face
# 3. Downloading the GPT-OSS 20B Model
Next, let’s download the GPT-OSS 20B model in a quantized format (MXFP4) from Hugging Face. Quantized models are optimized to use less memory while still maintaining strong performance, which is perfect for running locally.
Run the following command in your terminal:
huggingface-cli download bartowski/openai_gpt-oss-20b-GGUF openai_gpt-oss-20b-MXFP4.gguf --local-dir models
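If you prefer to script the download in Python, huggingface_hub exposes the same functionality through hf_hub_download. Here is a minimal sketch, mirroring the repo and file names from the CLI command above:

# download_model.py - fetch the quantized GGUF file into ./models
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="bartowski/openai_gpt-oss-20b-GGUF",
    filename="openai_gpt-oss-20b-MXFP4.gguf",
    local_dir="models",  # same directory the server command expects
)
print(f"Model saved to: {path}")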
# 4. Serving GPT-OSS 20B Locally Using llama.cpp
Now that the model is downloaded, let’s serve it using the llama.cpp Python server.
Run the following command in your terminal:
python -m llama_cpp.server \
  --model models/openai_gpt-oss-20b-MXFP4.gguf \
  --host 127.0.0.1 --port 10000 \
  --n_ctx 16384
Here’s what each flag does:
- --model: Path to your quantized model file
- --host: Local host address (127.0.0.1)
- --port: Port number (10000 in this case)
- --n_ctx: Context length (16,384 tokens for longer conversations)
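If you installed the CUDA wheel, you can also pass --n_gpu_layers -1 to offload all model layers to the GPU; without it, the server runs inference on the CPU by default.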
If everything is working, you will see logs like this:
INFO: Started server process [16470]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://127.0.0.1:10000 (Press CTRL+C to quit)
To confirm the server is running and the model is available, run:
curl http://127.0.0.1:10000/v1/models
Expected output:
{"object":"list","data":[{"id":"models/openai_gpt-oss-20b-MXFP4.gguf","object":"model","owned_by":"me","permissions":[]}]}
Next, we will integrate it with Open WebUI to get a ChatGPT-style interface.
# 5. Launching Open WebUI
We have already installed the open-webui Python package. Now, let’s launch it.
Open a new terminal window (keep your llama.cpp server running in the first one) and run:
open-webui serve --host 127.0.0.1 --port 9000
This will start the WebUI server at: http://127.0.0.1:9000
When you open the link in your browser for the first time, you will be prompted to:
- Create an admin account (using your email and a password)
- Log in to access the dashboard
This admin account ensures your settings, connections, and model configurations are saved for future sessions.
# 6. Setting Up Open WebUI
By default, Open WebUI is configured to work with Ollama. Since we are running our model with llama.cpp, we need to adjust the settings.
Follow these steps inside the WebUI:
// Add llama.cpp as an OpenAI Connection
- Open the WebUI: http://127.0.0.1:9000 (or your forwarded URL).
- Click on your avatar (top-right corner) → Admin Settings.
- Go to: Connections → OpenAI Connections.
- Edit the existing connection:
  - Base URL: http://127.0.0.1:10000/v1
  - API Key: (leave blank)
- Save the connection.
- (Optional) Disable Ollama API and Direct Connections to avoid errors.
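After saving, Open WebUI queries the connection’s /v1/models endpoint (the same one you tested with curl), so the served GGUF file should now appear in the model list.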
// Map a Friendly Model Alias
- Go to: Admin Settings → Models (or under the connection you just created)
- Edit the model name to gpt-oss-20b
- Save the model
// Start Chatting
- Open a new chat
- In the model dropdown, select gpt-oss-20b (the alias you created)
- Send a test message
# Final Thoughts
I honestly didn’t expect it to be this easy to get everything running with just Python. In the past, setting up llama.cpp meant cloning repositories, running CMake builds, and debugging endless errors, a painful process many of us are familiar with.
But with this approach, using the llama.cpp Python server together with Open WebUI, the setup worked right out of the box. No messy builds, no complicated configs, just a few simple commands.
In this tutorial, we:
- Set up a clean Python environment with uv
- Installed the llama.cpp Python server and Open WebUI
- Downloaded the GPT-OSS 20B quantized model
- Served it locally and connected it to a ChatGPT-style interface
The result? A fully local, private, and optimized LLM setup that you can run on your own machine with minimal effort.
Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master’s degree in technology management and a bachelor’s degree in telecommunication engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.