Foundation models are everywhere — but are they always the right choice? In today’s AI world, it seems like everyone wants to use foundation models and agents.
From GPT to CLIP to SAM, companies are racing to build applications around large, general-purpose models. And for good reason: these models are powerful, flexible, and often easy to prototype with. But do you really need one?
In many cases — especially in production scenarios — a simpler, custom-trained model can perform just as well, if not better, with lower cost, lower latency, and more control.
This article aims to help you navigate this decision by covering:
- What foundation models are, and their pros and cons
- What custom models are, and their pros and cons
- How to choose the right approach based on your needs, with real-world examples
- A visual decision framework to wrap it all up
Let’s get into it.
Foundation Models
A foundation model is a large model pretrained on massive datasets across multiple domains. These models are designed to be flexible enough to solve a wide range of downstream tasks with little or no additional training. They can be seen as generalist models.
They come in various types:
- LLMs (Large Language Models) such as GPT-4, Claude, Gemini, LLaMA, Mistral… We have heard a lot about them since the launch of ChatGPT.
- VLMs (Vision-Language Models) such as CLIP, Flamingo, Gemini Vision… They are used more and more, including in products like ChatGPT.
- Vision-specific models such as SAM, DINO, Stable Diffusion, FLUX. They are a bit more specialized and mostly used by practitioners, yet extremely powerful.
- Video-specific models such as RunwayML, SORA, Veo… This field has made incredible progress in the last couple of years, and is now reaching impressive results.
Most are accessible through APIs or open-source libraries, and many support zero-shot or few-shot learning.
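To make the zero-shot idea concrete, here is a minimal sketch of classifying a support message without any training data. It assumes the Hugging Face transformers library and the publicly available facebook/bart-large-mnli checkpoint; the labels and example text are made up for illustration.

```python
# Zero-shot text classification: no labeled data, no training, just a pretrained model.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

result = classifier(
    "The package arrived two weeks late and the box was damaged.",
    candidate_labels=["shipping issue", "billing issue", "product quality", "other"],
)
# The pipeline returns labels sorted by score; print the most likely one.
print(result["labels"][0], round(result["scores"][0], 3))
```

A few lines like these are often all you need for a first prototype, which is exactly why foundation models are so attractive at the exploration stage.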
These models are usually trained at a scale that is just not reachable by most companies, both in terms of data and computing power. That makes them really attractive for many reasons:
- General-purpose and versatile: One model can tackle many different tasks.
- Fast to prototype with: No need for your own dataset or training pipeline.
- Pretrained on vast, diverse data: They encode world knowledge and general reasoning.
- Zero/few-shot capabilities: They work reasonably well out of the box.
- Multimodal and flexible: They can sometimes handle text, images, code, audio, and more, which can be hard to reproduce for small teams.
While they are powerful, they come with some drawbacks and limitations:
- High operational cost: Inference is expensive, especially at scale.
- Opaque behavior: Results can be hard to debug or explain.
- Latency limitations: These models tend to be very large and have high latency, which may not be ideal for real-time applications.
- Privacy and compliance concerns: Data often needs to be sent to third-party APIs.
- Lack of control: Difficult to fine-tune or optimize for specific use cases, sometimes not even an option.
To recap, foundation models are very powerful: they are trained on massive datasets and can handle text, images, video, and more, without needing to be trained on your data to work. But they are usually not cost-effective at scale, may have high latency, and may require sending your data to third parties.
The alternative is to use custom models. Let’s now see what that means.
Custom Models
A custom model is a model built and trained specifically for a defined task using your own data. This could be as simple as a logistic regression or as complex as a deep learning architecture tailored to your unique problem.
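To make this concrete, here is a deliberately minimal sketch of a custom churn model with scikit-learn. The features, data, and numbers are entirely synthetic and only illustrate the workflow, not a real churn dataset.

```python
# A simple custom model: logistic regression for churn prediction on synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 5_000
X = np.column_stack([
    rng.integers(1, 60, n),   # months since signup (synthetic)
    rng.poisson(12, n),       # logins last month (synthetic)
    rng.integers(0, 2, n),    # has premium plan (synthetic)
])
# Synthetic target: churn is more likely for new, inactive, non-premium users.
logits = 1.5 - 0.03 * X[:, 0] - 0.1 * X[:, 1] - 0.8 * X[:, 2]
y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
print("Test AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```

The whole model trains in seconds, runs anywhere, and is fully explainable, which is often all a narrow business problem needs.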
They often require more upfront work but offer greater control, lower cost, and better performance on narrow tasks. Many powerful and business-driving models are actually custom models, some famous and widely used, some addressing really niche problems:
- Netflix’s recommendation engine, used by hundreds of millions of subscribers, is a custom model
- Most churn prediction models, widely used in many subscription-based companies, are custom models (sometimes just a well-tuned logistic regression)
- Credit scoring models, ubiquitous in banking and lending, are custom models
When using custom models, you master every single step, making them really powerful for several reasons:
- Task-specific and optimized: You control the model, the training data, and the evaluation.
- Lower latency and cost: Custom models are usually smaller and less expensive to run, which is critical in edge or real-time environments.
- Full control and explainability: They are easier to debug, retrain, and monitor.
- Better for tabular or structured data: Foundation models excel with unstructured data. Custom models tend to do better on tabular data.
- Improved data privacy: No need to send data to external APIs.
On the other hand, you have to train and deploy your custom models yourself to get business value out of them. This comes with some drawbacks:
- Labeled data may be required: this can be expensive or time-consuming to obtain.
- Slower to develop: You need to train the model, implement pipelines, and deploy and maintain everything, which is time-consuming.
- Skilled resources needed: In-house ML expertise is a must.
Feel free to dig into deployment strategies and how to choose the best approach in that article:

In short, custom models give you more control and are usually less expensive to scale. But this comes at the cost of a more expensive and longer development phase — not to mention the skills required. So how do you choose wisely between a custom model and a foundation model? Let’s try to answer that question.
Foundation Model or Custom Model: How to Choose?
When to Choose a Custom Model
I would say that a custom model should be the default choice overall. But to be fair, let’s see in which specific cases it is clearly a better solution than a foundation model. It comes down to a few requirements:
- Teams & Resources: you have a machine learning engineer or data team, you can label or generate training data, and you’re able to spend time training and optimizing your model
- Business: either you have a really specific case to solve, you have privacy requirements, you need low infra cost, or you need low latency or even edge deployment
- Long-term goals: you want control, and you don’t want to rely on third-party APIs
If you find yourself in one or more of these situations, a custom model may be your best option. Here are some typical examples I faced in my career:
- Building an in-house, custom forecasting model for YouTube video revenue: you can’t compromise on privacy, and no foundation model will do well enough on such specific use cases
- Deploying a real-time video solution on smartphones: when you need to run at more than 30 frames per second, no VLM can handle the task yet
- Credit scoring for a bank: you can’t compromise on privacy, and can’t use third-party solutions
If you want to dig into it, here is an article about how to forecast YouTube video revenue:
That being said, while foundation models are not the solution in some cases, let’s see when they actually are a viable option.
When to Choose a Foundation Model
Let’s do the equivalent exercise for foundation models: first, the requirements that make them a good option, and then some typical business cases where they would thrive:
- Team & Resources: you don’t necessarily have labeled data, nor ML engineers or data scientists, but you do have AI or Software engineers
- Business: you want to test an idea quickly or ship an MVP, you’re fine with using external APIs, and latency or scaling cost aren’t major concerns
- Task Characteristics: your task is open-ended, or you’re exploring a novel or creative problem space
Here are some typical examples where foundation models have proven valuable:
- Prototyping a chatbot for internal support or knowledge management: you have an open-ended task, with low requirements on latency and scale
- Many early-stage MVPs without long-term infra concerns are good candidates
As of now, foundation models are really popular for many MVPs revolving around text and images, while custom models have proven their value in many business cases. But why not combine both? In some cases, hybrid approaches give the best solutions. Let’s see what that means.
When to Use Hybrid Solutions
In many real-world workflows, the best answer is a combination of both approaches. For example, here are a few common hybrid patterns that can leverage the best of both worlds:
- Foundation model as a labeling tool: use SAM or GPT to create labeled data, then train a smaller model.
- Knowledge distillation: train a custom model to mimic the outputs of a foundation model.
- Bootstrapping: start with foundation model to test, then switch to custom later.
- Feature extraction: use CLIP or GPT embeddings as input to a simpler downstream model.
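As an illustration of the last pattern, here is a minimal sketch that extracts CLIP image embeddings and feeds them to a small scikit-learn classifier. It assumes the Hugging Face transformers CLIP checkpoint openai/clip-vit-base-patch32, along with torch, Pillow, and scikit-learn; the image paths and labels are placeholders for your own data.

```python
# Hybrid pattern: CLIP does the representation learning, a tiny model does the task.
import torch
from PIL import Image
from sklearn.linear_model import LogisticRegression
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(image_paths):
    """Return CLIP image embeddings for a list of image file paths."""
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        features = model.get_image_features(**inputs)  # (batch, 512) embeddings
    return features.numpy()

# Placeholder dataset: a few labeled images of good vs. defective parts.
train_paths = ["good_01.jpg", "good_02.jpg", "defect_01.jpg", "defect_02.jpg"]
train_labels = [0, 0, 1, 1]

clf = LogisticRegression().fit(embed(train_paths), train_labels)
print(clf.predict(embed(["new_part.jpg"])))
```

The foundation model handles the hard part of representation learning, while the part you actually train, deploy, and maintain stays tiny.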
I used some of these approaches in past projects, and they sometimes allow you to reach state-of-the-art solutions by combining the generalist power of foundation models with the flexibility and scalability of custom models:
- In computer vision projects, I used Stable Diffusion to create diverse and realistic datasets, as well as SAM to annotate data quickly and efficiently
- Small Language Models are gaining traction, and they sometimes take advantage of knowledge distillation to get the best out of LLMs while remaining smaller, more specialized, and more scalable
- One can also use tools like ChatGPT to annotate data at scale before training custom models, as sketched below
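Here is a minimal sketch of that last idea, using the OpenAI Python client to pre-label raw support tickets before training a small custom classifier. The model name, labels, and prompt are illustrative choices, and it assumes the openai package (v1+) with an OPENAI_API_KEY set in the environment; in practice you would review a sample of these weak labels before trusting them.

```python
# Hybrid pattern: use an LLM to pre-label raw text, then train a custom model on the labels.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
LABELS = ["complaint", "question", "praise"]

def label_ticket(text: str) -> str:
    """Ask the LLM to assign one of the predefined labels to a ticket."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system",
             "content": f"Classify the support ticket into one of: {', '.join(LABELS)}. "
                        "Answer with the label only."},
            {"role": "user", "content": text},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

raw_tickets = ["My invoice is wrong again.", "How do I reset my password?"]
weak_labels = [label_ticket(t) for t in raw_tickets]
# These weak labels can then be reviewed and used to train a small custom classifier.
print(list(zip(raw_tickets, weak_labels)))
```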
Here is a concrete example of using foundation models in hybrid solutions for computer vision:
In short, when dealing with unstructured data, a hybrid approach can be powerful and give you the best of both worlds.
Conclusion: Decision Framework
Let’s now summarize with a decision chart when to go for a foundation model, when to go for a custom model, and when to explore a hybrid approach.

In a few words, it all comes down to the project and the need. Sure, foundation models are buzzing right now, and they are at the heart of the current agents revolution. Still, many very valuable business problems can be addressed with custom models, while foundation models have proven powerful on many unstructured data problems. To choose wisely, a proper analysis of the needs and requirements with stakeholders and engineers, along with a decision framework, remains a good approach.
What about you: have you faced any situation where the best solution is not what you might think?
References
- Mentioned LLMs: GPT by OpenAI, Claude by Anthropic, LLaMA by Meta, Gemini by Google, and more such as Mistral and DeepSeek
- Vision-related models: SAM by Meta, CLIP by OpenAI, DINO by Meta, Stable Diffusion by Stability AI, FLUX by Black Forest Labs
- Video-specific models: Veo by Google, Sora by OpenAI, RunwayML