
How We Reduced LLM Costs by 90% with 5 Lines of Code

You know that feeling when everything seems to be working just fine, until you look under the hood and realize your system is burning 10× more fuel than it needs to?

We had a client script firing off requests to validate our prompts, built with async Python code and running smoothly in a Jupyter notebook. Clean, simple, and fast. We ran it regularly to test our models and collect evaluation data. No red flags. No warnings.

But beneath that polished surface, something was quietly going wrong.

We weren’t seeing failures. We weren’t getting exceptions. We weren’t even noticing slowness. But our system was doing a lot more work than it needed to, and we didn’t realize it.

In this post, we’ll walk through how we discovered the issue, what caused it, and how a simple structural change in our async code reduced LLM traffic and cost by 90%, with virtually no loss in speed or functionality.

Now, fair warning, reading this post won’t magically slash your LLM costs by 90%. But the takeaway here is broader: small, overlooked design decisions, sometimes just a few lines of code, can lead to massive inefficiencies. And being intentional about how your code runs can save you time, money, and frustration in the long run.

The fix itself might feel niche at first. It involves the subtleties of Python’s asynchronous behavior, how tasks are scheduled and dispatched. If you’re familiar with Python and async/await, you’ll get more out of the code examples, but even if you’re not, there’s still plenty to take away. Because the real story here isn’t just about LLMs or Python, it’s about responsible, efficient engineering.

Let’s dig in.

The Setup

To automate validation, we use a predefined dataset and trigger our system through a client script. The validation focuses on a small subset of the dataset, so the client stops only after receiving a certain number of relevant responses.

Here’s a simplified version of our client in Python:

import asyncio
from aiohttp import ClientSession
from tqdm.asyncio import tqdm_asyncio

URL = "http://localhost:8000/example"
NUMBER_OF_REQUESTS = 100
STOP_AFTER = 10

async def fetch(session: ClientSession, url: str) -> bool:
    async with session.get(url) as response:
        body = await response.json()
        return body["value"]

async def main():
    results = []

    async with ClientSession() as session:
        # Create one coroutine per request.
        tasks = [fetch(session, URL) for _ in range(NUMBER_OF_REQUESTS)]

        # Await responses as they complete and stop once we have enough.
        for future in tqdm_asyncio.as_completed(tasks, total=NUMBER_OF_REQUESTS, desc="Fetching"):
            response = await future
            if response is True:
                results.append(response)
                if len(results) >= STOP_AFTER:
                    print(f"\n✅ Stopped after receiving {STOP_AFTER} true responses.")
                    break

asyncio.run(main())

This script fires the requests concurrently and stops once we collect enough true responses for our evaluation. In production, the requests come from a predefined dataset and the stopping logic is more complex, based on the diversity of responses we need. But the structure is the same.

Let’s use a dummy FastAPI server to simulate real behavior:

import asyncio
import fastapi
import uvicorn
import random

app = fastapi.FastAPI()

@app.get("/example")
async def example():
    sleeping_time = random.uniform(1, 2)
    await asyncio.sleep(sleeping_time)
    return {"value": random.choice([True, False])}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

Now let’s fire up that dummy server and run the client. You’ll see something like this from the client terminal:

The progress bar stopped after receiving 10 responses

Can You Spot the Problem?


Nice! Fast, clean, and… wait, is everything working as expected?

On the surface, it seems like the client is doing the right thing: sending requests, getting 10 true responses, then stopping.

But is it?

Let’s add a few print statements to our server to see what it’s actually doing under the hood:

import asyncio
import fastapi
import uvicorn
import random

app = fastapi.FastAPI()

@app.get("/example")
async def example():
    print("Got a request")
    sleeping_time = random.uniform(1, 2)
    print(f"Sleeping for {sleeping_time:.2f} seconds")
    await asyncio.sleep(sleeping_time)
    value = random.choice([True, False])
    print(f"Returning value: {value}")
    return {"value": value}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

Now re-run everything.

You’ll start seeing logs like this:

Got a request
Sleeping for 1.11 seconds
Got a request
Sleeping for 1.29 seconds
Got a request
Sleeping for 1.98 seconds
...
Returning value: True
Returning value: False
Returning value: False
...

Take a closer look at the server logs. You’ll notice something unexpected: instead of handling only the 14 requests we see in the progress bar, the server processes all 100. Even though the client stops after receiving 10 true responses, it has already sent every request up front, so the server has to process all of them.

It’s an easy mistake to miss, especially because everything appears to be working correctly from the client’s perspective: responses come in quickly, the progress bar advances, and the script exits early. But behind the scenes, all 100 requests are sent immediately, regardless of when we decide to stop listening. This results in 10× more traffic than needed, driving up costs, increasing load, and risking rate limits.

So the key question becomes: why is this happening, and how can we make sure we only send the requests we actually need? The answer turned out to be a small but powerful change.

The root of the issue lies in how the tasks are scheduled. In our original code, we create a list of 100 tasks all at once:

tasks = [fetch(session, URL) for _ in range(NUMBER_OF_REQUESTS)]

for future in tqdm_asyncio.as_completed(tasks, total=NUMBER_OF_REQUESTS, desc="Fetching"):
    response = await future

When you pass a list of coroutines to as_completed, Python immediately wraps each coroutine in a Task and schedules it on the event loop. This happens before you start iterating over the loop body. Once a coroutine becomes a Task, the event loop starts running it in the background right away.

as_completed itself doesn’t control concurrency, it simply waits for tasks to finish and yields them one by one in the order they complete. Think of it as an iterator over completed futures, not a traffic controller. This means that by the time you start looping, all 100 requests are already in progress. Breaking out after 10 true results stops you from processing the rest, but it doesn’t stop them from being sent.
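To see this eager scheduling in isolation, here’s a tiny standalone demo, separate from the client above, using plain asyncio.as_completed. All five workers start even though we stop consuming results after two:

import asyncio

async def worker(i: int) -> int:
    print(f"started {i}")  # printed for all five workers, not just the two we await
    await asyncio.sleep(0.1)
    return i

async def demo():
    coros = [worker(i) for i in range(5)]
    done = 0
    for future in asyncio.as_completed(coros):
        await future
        done += 1
        if done == 2:
            break  # we stop listening, but every worker was already scheduled

asyncio.run(demo())

Run it and you’ll see five "started" lines before the loop breaks, mirroring the 100 requests our server received.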

To fix this, we introduced a semaphore to limit concurrency. The semaphore adds a lightweight lock inside fetch so that only a fixed number of requests can start at the same time. The rest remain paused, waiting for a slot. Once we hit our stopping condition, the paused tasks never acquire the lock, so they never send their requests.

Here’s the adjusted version:

import asyncio
from aiohttp import ClientSession
from tqdm.asyncio import tqdm_asyncio

URL = "http://localhost:8000/example"
NUMBER_OF_REQUESTS = 100
STOP_AFTER = 10

async def fetch(session: ClientSession, url: str, semaphore: asyncio.Semaphore) -> bool:
    # Wait for a free slot before sending the request.
    async with semaphore:
        async with session.get(url) as response:
            body = await response.json()
            return body["value"]

async def main():
    results = []
    # Allow only a limited number of requests in flight at any moment.
    semaphore = asyncio.Semaphore(int(STOP_AFTER * 1.5))

    async with ClientSession() as session:
        tasks = [fetch(session, URL, semaphore) for _ in range(NUMBER_OF_REQUESTS)]

        for future in tqdm_asyncio.as_completed(tasks, total=NUMBER_OF_REQUESTS, desc="Fetching"):
            response = await future
            if response:
                results.append(response)
                if len(results) >= STOP_AFTER:
                    print(f"\n✅ Stopped after receiving {STOP_AFTER} true responses.")
                    break

asyncio.run(main())

With this change, we still define 100 requests upfront, but only a small group is allowed to run at the same time, 15 in this example. If we reach our stopping condition early, the tasks still waiting on the semaphore never get a slot, so their requests are never sent. This keeps the behavior responsive while reducing unnecessary calls.

Now the server logs show only around 20 “Got a request” / “Returning value” entries. On the client side, the progress bar looks identical to the original.

The progress bar stopped after receiving 10 responses

With this change in place, we saw immediate impact: 90% reduction in request volume and LLM cost, with no noticeable degradation in client experience. It also improved throughput across the team, reduced queuing, and eliminated rate-limit issues from our LLM providers.

This small structural adjustment made our validation pipeline dramatically more efficient, without adding much complexity to the code. It’s a good reminder that in async systems, control flow doesn’t always behave the way you assume unless you’re explicit about how tasks are scheduled and when they should run.

Bonus Insight: Closing the Event Loop

If we had run the original client code without asyncio.run, we might have noticed the problem earlier. 
For example, if we had used manual event loop management like this:

loop = asyncio.get_event_loop()
loop.run_until_complete(main())
loop.close()

Python would have printed warnings such as:

Task was destroyed but it is pending!

These warnings appear when the program exits while there are still unfinished async tasks scheduled in the loop. If we had seen a screen full of those warnings, it likely would’ve triggered a red flag much sooner.

So why didn’t we see that warning when using asyncio.run()?

Because asyncio.run() takes care of cleanup behind the scenes. It doesn’t just run your coroutine and exit, it also cancels any remaining tasks, waits for them to finish, and only then shuts down the event loop. This built-in safety net prevents those “pending task” warnings from showing up, even if your code quietly launched more tasks than it needed to.
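For intuition, here’s a rough sketch of that cleanup. It’s a simplified approximation, not the actual CPython implementation, but it captures the idea:

import asyncio

def run_with_cleanup(coro):
    # Simplified approximation of what asyncio.run() does on exit.
    loop = asyncio.new_event_loop()
    try:
        return loop.run_until_complete(coro)
    finally:
        # Cancel whatever is still scheduled on the loop...
        pending = asyncio.all_tasks(loop)
        for task in pending:
            task.cancel()
        # ...and let those tasks handle the cancellation before the loop
        # closes, which is why no "pending task" warnings ever appear.
        loop.run_until_complete(asyncio.gather(*pending, return_exceptions=True))
        loop.close()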

With manual loop management, by contrast, there is no such cleanup: when you close the loop with loop.close() after run_until_complete(), any leftover tasks that haven’t been awaited are still hanging around. Python detects that you’re forcefully shutting down the loop while work is still scheduled, and warns you about it.

This isn’t to say that every async Python program should avoid asyncio.run() or always use loop.run_until_complete() with a manual loop.close(). But it does highlight something important: you should be aware of what tasks are still running when your program exits. At the very least, it’s a good idea to monitor or log any pending tasks before shutdown.
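As a minimal sketch of that habit, assuming the main coroutine from the client above, you could wrap the entry point and log whatever is still pending right before exit (main_with_audit is just an illustrative name):

import asyncio

async def main_with_audit():
    await main()  # the client coroutine from the examples above
    # Anything still scheduled at this point is work we launched but never used.
    pending = [t for t in asyncio.all_tasks() if t is not asyncio.current_task()]
    if pending:
        print(f"⚠️ Exiting with {len(pending)} tasks still pending")

asyncio.run(main_with_audit())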

Final Thoughts

By stepping back and rethinking the control flow, we were able to make our validation process dramatically more efficient, not by adding more infrastructure, but by using what we already had more carefully. A change of just a few lines of code led to a 90% cost reduction with almost no added complexity. It resolved rate-limit errors, reduced system load, and allowed the team to run evaluations more frequently without causing bottlenecks.

It’s an important reminder that “clean” async code doesn’t always mean efficient code, and that being intentional about how we use system resources is crucial. Responsible, efficient engineering is about more than writing code that works. It’s about designing systems that respect time, money, and shared resources, especially in collaborative environments. When you treat compute as a shared asset instead of an infinite pool, everyone benefits: systems scale better, teams move faster, and costs stay predictable.

So, whether you’re making LLM calls, launching Kubernetes jobs, or processing data in batches, pause and ask yourself: am I only using what I really need?

Often, the answer and the improvement are just one line of code away.
