
# Introduction
With the surge of large language models (LLMs) in recent years, many LLM-powered applications have emerged, enabling features that simply were not possible before.
As time goes on, more LLM models and products become available, each with its own pros and cons. Unfortunately, there is still no single standard way to access all of these models, as each provider exposes its own API and SDK. That is why an open-source tool such as LiteLLM is useful when you need standardized access to LLMs from your apps without any additional cost.
In this article, we will explore why LiteLLM is beneficial for building LLM applications.
Let’s get into it.
# Benefit 1: Unified Access
LiteLLM’s biggest advantage is its compatibility with different model providers. The tool supports over 100 LLM services through a standardized interface, so we can call them the same way regardless of the provider. It’s especially useful if your application relies on several models that need to work interchangeably.
A few examples of the major model providers that LiteLLM supports include:
- OpenAI and Azure OpenAI, for models such as GPT-4.
- Anthropic, for Claude models.
- AWS Bedrock and SageMaker, supporting models such as Amazon Titan and Claude.
- Google Vertex AI, for Gemini models.
- Hugging Face Hub and Ollama, for open-source models such as LLaMA and Mistral.
The standardized interface follows OpenAI’s chat/completions schema, which means we can switch models easily without having to learn each provider’s native request format.
For example, here is the Python code to use Google’s Gemini model with LiteLLM.
```python
from litellm import completion

prompt = "YOUR-PROMPT-FOR-LITELLM"
api_key = "YOUR-API-KEY-FOR-LLM"

# Call Gemini through LiteLLM's unified chat/completions interface.
response = completion(
    model="gemini/gemini-1.5-flash-latest",
    messages=[{"content": prompt, "role": "user"}],
    api_key=api_key,
)

# Print the generated text from the first choice.
print(response["choices"][0]["message"]["content"])
```
You only need to obtain the model name and the respective API keys from the model provider to access them. This flexibility makes LiteLLM ideal for applications that use multiple models or for performing model comparisons.
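To illustrate how little changes when you switch providers, here is a minimal sketch that makes the same call against OpenAI and Anthropic instead. The model names and API key placeholders below are illustrative, so substitute the ones you actually use.

```python
from litellm import completion

prompt = "YOUR-PROMPT-FOR-LITELLM"

# Same unified interface, different providers: only the model string
# and the API key change. Model names here are examples.
openai_response = completion(
    model="gpt-4o",                      # OpenAI
    messages=[{"content": prompt, "role": "user"}],
    api_key="YOUR-OPENAI-API-KEY",
)

anthropic_response = completion(
    model="claude-3-5-sonnet-20240620",  # Anthropic
    messages=[{"content": prompt, "role": "user"}],
    api_key="YOUR-ANTHROPIC-API-KEY",
)

print(openai_response["choices"][0]["message"]["content"])
print(anthropic_response["choices"][0]["message"]["content"])
```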
# Benefit 2: Cost Tracking and Optimization
When working with LLM applications, it is important to track token usage and spending for every model you implement and across all integrated providers, especially when requests are made continuously in production.
LiteLLM enables users to maintain a detailed log of model API call usage, providing all the necessary information to control costs effectively. For example, the `completion` call above will have information about the token usage, as shown below.
```
usage=Usage(completion_tokens=10, prompt_tokens=8, total_tokens=18, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=None, text_tokens=8, image_tokens=None))
```
Accessing the response’s hidden parameters will also provide more detailed information, including the cost.
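As a minimal sketch, assuming the `response` object from the earlier `completion` call, the hidden parameters can be read like this:

```python
# The response object carries provider metadata and the estimated cost
# of the call in its hidden parameters.
print(response._hidden_params)
```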
The output will look similar to the following:
```
{'custom_llm_provider': 'gemini',
 'region_name': None,
 'vertex_ai_grounding_metadata': [],
 'vertex_ai_url_context_metadata': [],
 'vertex_ai_safety_results': [],
 'vertex_ai_citation_metadata': [],
 'optional_params': {},
 'litellm_call_id': '558e4b42-95c3-46de-beb7-9086d6a954c1',
 'api_base': 'https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash-latest:generateContent',
 'model_id': None,
 'response_cost': 4.8e-06,
 'additional_headers': {},
 'litellm_model_name': 'gemini/gemini-1.5-flash-latest'}
```
There is a lot of information here, but the most important piece is `response_cost`, which estimates the actual charge for that call (the real charge may still be zero if the provider offers free-tier access). Users can also define custom pricing for models (per token or per second) to calculate costs accurately.
A more advanced cost-tracking setup also lets users set spending budgets and limits, and connect LiteLLM’s cost and usage data to an analytics dashboard so the information is easier to aggregate. It’s also possible to attach custom label tags to attribute costs to specific use cases or departments.
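LiteLLM also exposes helpers for this in Python. The sketch below reuses the `response` object from earlier and shows `completion_cost` plus `register_model` for custom pricing; the per-token prices are purely illustrative, not real rates.

```python
import litellm
from litellm import completion_cost

# Estimate the USD cost of the earlier call directly from its response object.
cost = completion_cost(completion_response=response)
print(f"Estimated cost: ${cost:.8f}")

# Optionally override pricing for a model (values below are placeholders).
litellm.register_model({
    "gemini/gemini-1.5-flash-latest": {
        "input_cost_per_token": 1e-07,   # illustrative price per input token
        "output_cost_per_token": 4e-07,  # illustrative price per output token
    }
})
```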
By providing detailed cost usage data, LiteLLM helps users and organizations optimize their LLM application costs and budget more effectively.
# Benefit 3: Ease of Deployment
LiteLLM is designed for easy deployment, whether you use it for local development or in a production environment. The Python library has modest installation requirements, so we can run LiteLLM on a local laptop or host it in a containerized deployment with Docker without any complex additional configuration.
Speaking of configuration, we can set LiteLLM up more efficiently with a YAML config file that lists all the necessary information, such as model names, API keys, and any essential custom settings for your LLM apps. You can also use a backend database such as SQLite or PostgreSQL to store its state.
As for data privacy, deploying LiteLLM yourself means you are responsible for securing it, but it also means your data never leaves your controlled environment except when it is sent to the LLM providers. For enterprise users, LiteLLM additionally offers Single Sign-On (SSO), role-based access control, and audit logs if your application needs a more secure environment.
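As a rough sketch of what such a config might look like for the LiteLLM proxy, assuming the model aliases, environment variable names, and database URL below are placeholders you would replace with your own:

```yaml
# config.yaml -- a minimal LiteLLM proxy configuration sketch
model_list:
  - model_name: gemini-flash              # alias your application will call
    litellm_params:
      model: gemini/gemini-1.5-flash-latest
      api_key: os.environ/GEMINI_API_KEY  # read the key from an environment variable
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY

general_settings:
  database_url: os.environ/DATABASE_URL   # e.g. a PostgreSQL connection string for state
```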
Overall, LiteLLM provides flexible deployment options and configuration while keeping the data secure.
# Benefit 4: Resilience Features
Resilience is crucial when building LLM Apps, as we want our application to remain operational even in the face of unexpected issues. To promote resilience, LiteLLM provides many features that are useful in application development.
One feature LiteLLM provides is built-in caching, which stores LLM prompts and responses so that identical requests don’t incur repeated cost or latency. This is useful if our application frequently receives the same queries. The caching system is flexible, supporting in-memory caching as well as remote backends such as Redis, and even semantic caching backed by a vector database.
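Here is a minimal sketch of enabling LiteLLM’s built-in cache; note that the exact import path for `Cache` can vary between LiteLLM versions, and the model name is just the one used earlier in this article.

```python
import litellm
from litellm import completion
from litellm.caching import Cache  # import path may differ across LiteLLM versions

# Enable LiteLLM's cache (in-memory by default; remote backends such as
# Redis can be configured instead).
litellm.cache = Cache()

messages = [{"content": "What is LiteLLM?", "role": "user"}]

# The first call hits the provider; an identical second call can be
# served from the cache, saving cost and latency.
first = completion(model="gemini/gemini-1.5-flash-latest", messages=messages, caching=True)
second = completion(model="gemini/gemini-1.5-flash-latest", messages=messages, caching=True)
```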
Another feature of LiteLLM is automatic retries, which lets users configure requests that fail with errors such as timeouts or rate limits to be retried automatically. It’s also possible to set up fallback mechanisms, such as routing to another model once a request has exhausted its retries.
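For example, here is a small sketch of configuring retries on a single call via the `num_retries` parameter, reusing the Gemini model from earlier; the retry count is just an example value.

```python
from litellm import completion

# Retry the request up to 3 times on transient errors such as timeouts
# or rate limits before giving up.
response = completion(
    model="gemini/gemini-1.5-flash-latest",
    messages=[{"content": "Summarize LiteLLM in one sentence.", "role": "user"}],
    num_retries=3,
)
```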
Lastly, we can set rate limits in requests per minute (RPM) or tokens per minute (TPM) to cap the level of usage. It’s a good way to keep specific model integrations within provider quotas and your application’s infrastructure limits.
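One way to express such limits is through LiteLLM’s `Router`, where each deployment in the model list can carry its own RPM/TPM caps. A sketch under that assumption, with purely illustrative limit values:

```python
from litellm import Router

# Each deployment declares its own requests-per-minute and tokens-per-minute caps.
# The limits below are illustrative values, not recommendations.
router = Router(
    model_list=[
        {
            "model_name": "gemini-flash",
            "litellm_params": {
                "model": "gemini/gemini-1.5-flash-latest",
                "api_key": "YOUR-API-KEY-FOR-LLM",
                "rpm": 60,        # requests per minute
                "tpm": 100_000,   # tokens per minute
            },
        },
    ],
)

response = router.completion(
    model="gemini-flash",
    messages=[{"content": "Hello!", "role": "user"}],
)
```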
# Conclusion
In this era of LLM product growth, building LLM applications has become much easier. However, with so many model providers out there, it is hard to standardize how models are accessed, especially in multi-model system architectures. This is why LiteLLM helps us build LLM apps efficiently.
I hope this has helped!
Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves to share Python and data tips via social media and writing media. Cornellius writes on a variety of AI and machine learning topics.