Beyond Model Stacking: The Architecture Principles That Make Multimodal AI Systems Work

1. It Started with a Vision

While rewatching Iron Man, I found myself captivated by how deeply JARVIS could understand a scene. It wasn't just recognizing objects; it understood context and described the scene in natural language: "This is a busy intersection where pedestrians are waiting to cross, and traffic is flowing smoothly." That moment sparked a deeper question: could AI ever truly understand what's happening in a scene, the way humans intuitively do?

That idea became clearer after I finished building PawMatchAI. The system could accurately identify 124 dog breeds, but I began to realize that recognizing a Labrador wasn't the same as understanding what it was actually doing. True scene understanding means asking questions like Where is this? and What's going on here?, not just listing object labels.

That realization led me to design VisionScout, a multimodal AI system built to genuinely understand scenes, not just recognize objects.

The challenge wasn’t about stacking a few models together. It was an architectural puzzle:

how do you get YOLOv8 (for detection), CLIP (for semantic reasoning), Places365 (for scene classification), and Llama 3.2 (for language generation) to not just coexist, but collaborate like a team?

While building VisionScout, I realized the real challenge lay in breaking down complex problems, setting clear boundaries between modules, and designing the logic that allowed them to work together effectively.

💡 The sections that follow walk through this evolution step by step, from the earliest concept to three major architectural overhauls, highlighting the key principles that shaped VisionScout into a cohesive and adaptable system.


2. Four Critical Stages of System Evolution

2.1 First Evolution: The Cognitive Leap from Detection to Understanding

Building on what I learned from PawMatchAI, I started with the idea that combining several detection models might be enough for scene understanding. I built a foundational architecture where DetectionModel handled core inference, ColorMapper provided color coding for different categories, VisualizationHelper mapped colors to bounding boxes, and EvaluationMetrics took care of the stats. The system was about 1,000 lines long and could reliably detect objects and show basic visualizations.

But I soon realized the system was only producing detection data, which wasn’t all that useful to users. When it reported “3 people, 2 cars, 1 traffic light detected,” users were really asking: Where is this? What’s going on here? Is there anything I should be aware of?

That led me to try a template-based approach. It generated fixed-format descriptions based on combinations of detected objects. For example, if it detected a person, a car, and a traffic light, it would return: “This is a traffic scene with pedestrians and vehicles.” While it made the system seem like it “understood” the scene, the limits of this approach quickly became obvious.
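To make that concrete, here's a minimal sketch of what a template-based describer looks like in spirit; the labels and canned sentences are illustrative stand-ins, not the actual VisionScout code.

```python
# Minimal sketch of a template-based describer (illustrative, not VisionScout's code).
# It maps combinations of detected labels to fixed sentences, which is exactly why
# it cannot adapt to context such as lighting or time of day.

def describe_scene(detected_labels: set[str]) -> str:
    templates = [
        # (required labels, fixed description)
        ({"person", "car", "traffic light"}, "This is a traffic scene with pedestrians and vehicles."),
        ({"person", "dog"}, "This is an outdoor scene with a person walking a dog."),
    ]
    for required, description in templates:
        if required.issubset(detected_labels):
            return description
    return "Detected objects: " + ", ".join(sorted(detected_labels))

print(describe_scene({"person", "car", "traffic light", "bench"}))
# -> "This is a traffic scene with pedestrians and vehicles." regardless of lighting or context
```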

When I ran the system on a nighttime street photo, it still gave clearly wrong descriptions like: “This is a bright traffic scene.” Looking closer, I saw the real issue: traditional visual analysis just reports what’s in the frame. But understanding a scene means figuring out what’s going on, why it’s happening, and what it might imply.

That moment made something clear: there's a big gap between what a system can technically do and what's actually useful in practice. Bridging that gap takes more than templates; it needs deeper architectural thinking.

2.2 Second Evolution: The Engineering Challenge of Multimodal Fusion

The deeper I got into scene understanding, the more obvious it became: no single model could cover everything that real comprehension demanded. That realization made me rethink how the whole system was structured.

Each model brought something different to the table. YOLO handled object detection, CLIP focused on semantics, Places365 helped classify scenes, and Llama took care of the language. The real challenge was figuring out how to make them work together.

I broke down scene understanding into several layers: detection, semantics, scene classification, and language generation. What made it tricky was getting these parts to work together smoothly, without one stepping on another's toes.

I developed a function that adjusts each model’s weight depending on the characteristics of the scene. If one model was especially confident about a scene, the system gave it more weight. But when things were less clear, other models were allowed to take the lead.
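The sketch below shows the general shape of that idea: per-model confidence is turned into weights, and per-scene scores are blended accordingly. The model names and the exponential normalization are illustrative assumptions, not the exact rule VisionScout uses.

```python
# Simplified sketch of confidence-driven weighting across models (illustrative assumptions).
import math

def fuse_scene_scores(model_outputs: dict[str, dict[str, float]],
                      model_confidence: dict[str, float],
                      temperature: float = 2.0) -> dict[str, float]:
    # Turn per-model confidence into weights: confident models dominate,
    # uncertain ones still contribute a little.
    exp_conf = {m: math.exp(c * temperature) for m, c in model_confidence.items()}
    total = sum(exp_conf.values())
    weights = {m: v / total for m, v in exp_conf.items()}

    fused: dict[str, float] = {}
    for model, scores in model_outputs.items():
        for scene, score in scores.items():
            fused[scene] = fused.get(scene, 0.0) + weights[model] * score
    return fused

scores = fuse_scene_scores(
    model_outputs={
        "places365": {"street": 0.7, "plaza": 0.3},
        "clip": {"street": 0.4, "plaza": 0.6},
    },
    model_confidence={"places365": 0.9, "clip": 0.5},
)
print(max(scores, key=scores.get))  # -> "street", because Places365 is more confident here
```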

Once I began integrating the models, things quickly became more complicated. What started with just a few categories soon expanded to dozens, and each new feature risked breaking something that used to work. Debugging became a challenge. Fixing one issue could easily trigger two more in other parts of the system.

That’s when I realized: managing complexity isn’t just a side effect, it’s a design problem in its own right.

2.3 Third Evolution: The Design Breakthrough from Chaos to Clarity

At one point, the system’s complexity got out of hand. A single class file had grown past 2,000 lines and was juggling over ten responsibilities, from model coordination and data transformation to error handling and result fusion. It clearly broke the single-responsibility principle.

Every time I needed to tweak something small, I had to dig through that giant file just to find the right section. I was always on edge, knowing that a minor change might accidentally break something else.

After wrestling with these issues for a while, I knew patching things wouldn’t be enough. I had to rethink the system’s structure entirely, in a way that would stay manageable even as it kept growing.

Over the next few days, I kept running into the same underlying issue. The real blocker wasn't how complex the functions were; it was how tightly everything was connected. Changing anything in the lighting logic meant double-checking how it would affect spatial analysis, semantic interpretation, and even the language output.

Adjusting model weights wasn’t simple either; I had to manually sync the formats and data flow across all four models every time. That’s when I began refactoring the architecture using a layered approach.

I divided it into three levels. The bottom layer included specialized tools that handled technical operations. The middle layer focused on logic, with analysis engines tailored to specific tasks. At the top, a coordination layer managed the flow between all components.

As the pieces fell into place, the system began to feel more transparent and much easier to manage.

2.4 Fourth Evolution: Designing for Predictability over Automation

Around that time, I ran into another design challenge, this time involving landmark recognition.

The system relied on CLIP’s zero-shot capability to identify 115 well-known landmarks without any task-specific training. But in real-world usage, this feature often got in the way.
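For readers unfamiliar with the technique, here's a minimal sketch of how CLIP-style zero-shot landmark scoring works in general, using the Hugging Face transformers API; the prompts and checkpoint are illustrative and not VisionScout's exact setup.

```python
# Sketch of CLIP zero-shot landmark scoring (general pattern, not VisionScout's exact code).
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

landmark_prompts = [
    "a photo of Shibuya Crossing in Tokyo",
    "a photo of Times Square in New York",
    "a photo of an ordinary city intersection",  # "none of the above" style anchor
]

image = Image.open("intersection.jpg")
inputs = processor(text=landmark_prompts, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]

for prompt, p in zip(landmark_prompts, probs.tolist()):
    print(f"{p:.2f}  {prompt}")
# A busy aerial intersection can score surprisingly high for Shibuya,
# which is exactly the kind of false positive described below.
```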

A common issue was with aerial photos of intersections. The system would sometimes mistake them for Tokyo’s Shibuya crossing, and that misclassification would throw off the entire scene interpretation.

My first instinct was to fine-tune some of the algorithm’s parameters to help it better distinguish between lookalike scenes. But that approach quickly backfired. Reducing false positives for Shibuya ended up lowering the system’s accuracy for other landmarks.

It became clear that even small tweaks in a multimodal system could trigger side effects elsewhere, making things worse instead of better.

That’s when I remembered A/B testing principles from data science. At its core, A/B testing is about isolating variables so you can see the effect of a single change. It made me rethink the system’s behavior. Rather than trying to make it automatically handle every situation, maybe it was better to let users decide.

So I designed the enable_landmark parameter. On the surface, it was just a boolean switch. But the thinking behind it mattered more. By giving users control, I could make the system more predictable and better aligned with real-world needs. For everyday photos, users could turn off landmark detection to avoid false positives. For travel images, they could turn it on to surface cultural context and location insights.
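Conceptually, the switch just gates an entire branch of the pipeline. A minimal sketch, with hypothetical helper functions standing in for the real detection, classification, and landmark branches:

```python
# Sketch of an explicit opt-in switch for landmark recognition.
# The helper functions are hypothetical stand-ins, not VisionScout's actual components.

def detect_objects(image) -> list[str]:
    return ["person", "car", "traffic light"]   # stand-in for YOLO output

def classify_scene(image) -> str:
    return "street"                              # stand-in for Places365 output

def identify_landmark(image) -> str | None:
    return "Shibuya Crossing"                    # stand-in for the CLIP zero-shot branch

def analyze_scene(image, enable_landmark: bool = True) -> dict:
    result = {"objects": detect_objects(image), "scene": classify_scene(image)}
    # Landmark recognition only runs when the user explicitly opts in,
    # so everyday photos are never reinterpreted as famous locations.
    if enable_landmark:
        result["landmark"] = identify_landmark(image)
    return result

print(analyze_scene("street.jpg", enable_landmark=False))
# -> no "landmark" key at all, so no false positive can leak into the description
```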

This stage helped solidify two lessons for me. First, good system design doesn’t come from stacking features, it comes from understanding the real problem deeply. Second, a system that behaves predictably is often more useful than one that tries to be fully automatic but ends up confusing or unreliable.


3. Architecture Visualization: Complete Manifestation of Design Thinking

After four major stages of system evolution, I asked myself a new question:

How could I present the architecture clearly enough to justify the design and ensure scalability?

To find out, I redrew the system diagram from scratch, initially just to tidy things up. But it quickly became a full structural review. I discovered unclear module boundaries, overlapping functions, and overlooked gaps. That forced me to re-evaluate every component’s role and necessity.

Once visualized, the system’s logic became clearer. Responsibilities, dependencies, and data flow emerged more cleanly. The diagram not only clarified the structure, it became a reflection of my thinking around layering and collaboration.

The next sections walk through the architecture layer by layer, explaining how the design took shape.

Due to formatting limitations, you can view a clearer, interactive version of this architecture diagram here.

3.1 Configuration Knowledge Layer: Utility Layer (Intelligent Foundation and Templates)

When designing this layered architecture, I followed a key principle: system complexity should decrease progressively from top to bottom.

The closer to the user, the simpler the interface; the deeper into the system, the more specialized the tools. This structure helps keep responsibilities clear and makes the system easier to maintain and extend.

To avoid duplicated logic, I grouped similar technical functions into reusable tool modules. Since the system supports a wide range of analysis tasks, having modular tool groups became essential for keeping things organized. At the base of the architecture diagram sits the system’s core toolkit—what I refer to as the Utility Layer. I structured this layer into six distinct tool groups, each with a clear role and scope.

  • Spatial Tools handles all components related to spatial analysis, including RegionAnalyzer, ObjectExtractor, ZoneEvaluator and six others. As I worked through different tasks that required reasoning about object positions and layout, I realized the need to bring these functions under a single, coherent module.
  • Lighting Tools focuses on environmental lighting analysis and includes ConfigurationManager, FeatureExtractor, IndoorOutdoorClassifier, and LightingConditionAnalyzer. This group directly supports the lighting challenges explored during the second stage of system evolution.
  • Description Tools powers the system’s content generation. It includes modules like TemplateRepository, ContentGenerator, StatisticsProcessor, and eleven other components. The size of this group reflects how central language output is to the overall user experience.
  • LLM Tools and CLIP Tools support interactions with the Llama and CLIP models, respectively. Each group contains four to five focused modules that manage model input/output, preprocessing, and interpretation, helping these key AI models work smoothly within the system.
  • Knowledge Base acts as the system’s reference layer. It stores definitions for scene types, object classification schemes, landmark metadata, and other domain knowledge files—forming the foundation for consistent understanding across components.

I organized these tools with one key goal in mind: making sure each group handled a focused task without becoming isolated. This setup keeps responsibilities clear and makes cross-module collaboration more manageable.
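To give a rough sense of how that grouping plays out in practice, here's a hypothetical registry mirroring the tool groups above; the actual module layout in VisionScout may differ, and the LLM/CLIP entries are descriptive placeholders since those module names aren't listed here.

```python
# Hypothetical registry of the six tool groups and a few of their members,
# purely to illustrate the grouping described above.
TOOL_GROUPS: dict[str, list[str]] = {
    "spatial_tools": ["RegionAnalyzer", "ObjectExtractor", "ZoneEvaluator"],        # + six others
    "lighting_tools": ["ConfigurationManager", "FeatureExtractor",
                       "IndoorOutdoorClassifier", "LightingConditionAnalyzer"],
    "description_tools": ["TemplateRepository", "ContentGenerator",
                          "StatisticsProcessor"],                                    # + eleven others
    "llm_tools": ["prompt preparation", "response interpretation"],                  # placeholders
    "clip_tools": ["image preprocessing", "similarity utilities"],                   # placeholders
    "knowledge_base": ["scene types", "object taxonomies", "landmark metadata"],
}

# Each analysis engine later declares which groups it depends on,
# which keeps responsibilities visible at a glance.
```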

3.2 Infrastructure Layer: Supporting Services (Independent Core Power)

The Supporting Services layer serves as the system’s backbone, and I intentionally kept it relatively independent in the overall architecture. After careful planning, I placed five of the system’s most essential AI engines and utilities here: DetectionModel (YOLO), Places365Model, ColorMapper, VisualizationHelper, and EvaluationMetrics.

This layer reflects a core principle in my architecture: AI model inference should remain fully decoupled from business logic. The Supporting Services layer handles raw machine learning outputs and core processing tasks, but it doesn’t concern itself with how those outputs are interpreted or used in higher-level reasoning. This clear separation keeps the system modular, easier to maintain, and more adaptable to future changes.

When designing this layer, I focused on defining clear boundaries for each component. DetectionModel and Places365Model are responsible for core inference tasks. ColorMapper and VisualizationHelper manage the visual presentation of results. EvaluationMetrics focuses on statistical analysis and metric calculation for detection outputs. With responsibilities well separated, I can fine-tune or replace any of these components without worrying about unintended side effects on higher-level logic.
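A small sketch of what that boundary means for DetectionModel: it returns raw detections and nothing more. The wrapper below uses the standard Ultralytics API, but the class shape is my illustration rather than VisionScout's actual code.

```python
# "Raw inference only" boundary for the supporting-services layer (illustrative wrapper).
from dataclasses import dataclass
from ultralytics import YOLO

@dataclass
class Detection:
    label: str
    confidence: float
    box: tuple[float, float, float, float]  # x1, y1, x2, y2

class DetectionModel:
    """Wraps YOLO inference. Returns raw detections and nothing else:
    no scene reasoning, no descriptions, no weighting decisions."""

    def __init__(self, weights_path: str = "yolov8n.pt"):
        self._model = YOLO(weights_path)

    def detect(self, image_path: str) -> list[Detection]:
        result = self._model(image_path)[0]
        return [
            Detection(
                label=result.names[int(box.cls)],
                confidence=float(box.conf),
                box=tuple(float(v) for v in box.xyxy[0]),
            )
            for box in result.boxes
        ]
```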

3.3 Intelligent Analysis Layer: Module Layer (Professional Advisory Team)

The Module Layer reflects the core of how the system reasons about a scene. It contains eight specialized analysis engines, each with a clearly defined role. These modules are responsible for different aspects of scene understanding, from spatial layout and lighting conditions to semantic description and model coordination.

  • SpatialAnalyzer focuses on understanding the spatial layout of a scene. It uses tools from the Spatial Tools group to analyze object positions, relative distances, and regional configurations.
  • LightingAnalyzer interprets environmental lighting conditions. It integrates outputs from the Places365Model to infer time of day, indoor/outdoor classification, and possible weather context. It also relies on Lighting Tools for more detailed signal extraction.
  • EnhancedSceneDescriber generates high-level scene descriptions based on detected content. It draws on Description Tools to build structured narratives that reflect both spatial context and object interactions.
  • LLMEnhancer improves language output quality. Using LLM Tools, it refines descriptions to make them more fluent, coherent, and human-like.
  • CLIPAnalyzer and CLIPZeroShotClassifier handle multimodal semantic tasks. The former provides image-text similarity analysis, while the latter uses CLIP’s zero-shot capabilities to identify objects and scenes without explicit training.
  • LandmarkProcessingManager handles recognition of notable landmarks and links them to cultural or geographic context. It helps enrich scene interpretation with higher-level symbolic meaning.
  • SceneScoringEngine coordinates decisions across all modules. It adjusts model influence dynamically based on scene type and confidence scores, producing a final output that reflects weighted insights from multiple sources.

This setup allows each analysis engine to focus on what it does best, while pulling in whatever support it needs from the tool layer. If I want to add a new type of scene understanding later on, I can just build a new module for it, no need to change existing logic or risk breaking the system.
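One way to express that extension point is a small shared interface that every analysis engine satisfies. The Protocol and method names below are my assumptions, not VisionScout's actual definitions:

```python
# Sketch of a pluggable analysis-module interface (names are assumptions).
from typing import Any, Protocol

class AnalysisModule(Protocol):
    name: str
    def analyze(self, image: Any, context: dict) -> dict: ...

class WeatherAnalyzer:
    """A hypothetical new engine: it can be registered alongside the existing
    eight modules without modifying any of them."""
    name = "weather"

    def analyze(self, image: Any, context: dict) -> dict:
        brightness = context.get("lighting", {}).get("brightness", 0.5)
        return {"likely_rain": brightness < 0.3}

def run_modules(modules: list[AnalysisModule], image: Any) -> dict:
    context: dict = {}
    for module in modules:
        context[module.name] = module.analyze(image, context)
    return context

print(run_modules([WeatherAnalyzer()], image=None))  # -> {'weather': {'likely_rain': False}}
```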

3.4 Coordination Management Layer: Facade Layer (System Neural Center)

The Facade Layer contains two key coordinators: ComponentInitializer handles component initialization during system startup, while SceneAnalysisCoordinator orchestrates analysis workflows and manages data flow.

These two coordinators embody the core spirit of Facade design: external simplicity with internal precision. Users only need to interface with clean input and output points, while all complex initialization and coordination logic is properly handled behind the scenes.
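A compressed sketch of that division of labor; the constructor arguments and method bodies are illustrative, not the real implementation:

```python
# Compressed sketch of the facade's two coordinators (illustrative wiring).

class ComponentInitializer:
    """Builds and wires every model, tool group, and analysis module once at startup."""
    def __init__(self, config: dict):
        self.config = config

    def build(self) -> dict:
        # In the real system this would construct DetectionModel, Places365Model,
        # the CLIP/Llama wrappers, and the eight analysis engines.
        return {"modules": [], "models": {}}

class SceneAnalysisCoordinator:
    """Runs the analysis pipeline and shuttles data between components."""
    def __init__(self, components: dict):
        self.components = components

    def analyze(self, image) -> dict:
        results: dict = {}
        for module in self.components["modules"]:
            results[module.name] = module.analyze(image, results)
        return results

components = ComponentInitializer(config={}).build()
coordinator = SceneAnalysisCoordinator(components)
```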

3.5 Unified Interface Layer: SceneAnalyzer (The Single External Gateway)

SceneAnalyzer serves as the sole entry point for the entire VisionScout system. This component reflects my core design belief: no matter how sophisticated the internal architecture becomes, external users should only need to interact with a single, unified gateway.

Internally, SceneAnalyzer encapsulates all coordination logic, routing requests to the appropriate modules and tools beneath it. It standardizes inputs, manages errors, and formats outputs, providing a clean and stable interface for any client application.
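In spirit, the gateway looks something like this: normalize the input, delegate to the coordinator, and return errors in a uniform shape. Everything beyond the class name is an assumption on my part:

```python
# Sketch of the single-gateway idea (illustrative; only the class name comes from the article).
from pathlib import Path

class SceneAnalyzer:
    def __init__(self, coordinator=None):
        self._coordinator = coordinator  # SceneAnalysisCoordinator in the real system

    def analyze(self, image_path: str, enable_landmark: bool = True) -> dict:
        path = Path(image_path)
        if not path.exists():
            return {"status": "error", "message": f"image not found: {path}"}
        try:
            result = self._coordinator.analyze(path, enable_landmark=enable_landmark)
            return {"status": "ok", "result": result}
        except Exception as exc:  # internal failures surface as a uniform error shape
            return {"status": "error", "message": str(exc)}
```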

This layer represents the final distillation of the system’s complexity, offering streamlined access while hiding the intricate network of underlying processes. By designing this gateway, I ensured that VisionScout could be both powerful and simple to use, no matter how much it continues to evolve.

3.6 Processing Engine Layer: Processor Layer (The Dual Execution Engines)

In actual usage workflows, ImageProcessor and VideoProcessor represent where the system truly begins its work. These two processors are responsible for handling the input data, whether images or videos, and for executing the appropriate analysis pipeline.

ImageProcessor focuses on static image inputs, integrating object detection, scene classification, lighting evaluation, and semantic interpretation into a unified output. VideoProcessor extends this capability to video analysis, providing temporal insights by analyzing object presence patterns and detection frequency across video frames.
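To illustrate the kind of temporal insight involved, here's a sketch of frame sampling plus detection-frequency aggregation with OpenCV; the sampling interval and the detector call are assumptions, not VideoProcessor's actual implementation.

```python
# Sketch of temporal aggregation: sample frames, run per-frame detection,
# and summarize how often each label appears (illustrative assumptions).
from collections import Counter

import cv2  # OpenCV for frame extraction

def summarize_video(video_path: str, detector, every_n_frames: int = 30) -> Counter:
    counts: Counter = Counter()
    capture = cv2.VideoCapture(video_path)
    frame_index = 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if frame_index % every_n_frames == 0:
            labels = detector(frame)      # e.g. a YOLO wrapper returning label strings
            counts.update(set(labels))    # count presence once per sampled frame
        frame_index += 1
    capture.release()
    return counts

# Counts like {"person": 42, "car": 37} indicate how many sampled frames contained each object.
```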

From a user’s point of view, this is the entry point where results are generated. But from a system design perspective, the Processor Layer reflects the final composition of all architectural layers working together. These processors encapsulate the logic, tools, and models built earlier, providing a consistent interface for real-world applications without requiring users to manage internal complexities.

3.7 Application Interface Layer: Application Layer

Finally, the Application Layer serves as the system’s presentation layer, bridging technical capabilities with the user experience. It includes Style, which handles styling and visual consistency, and UIManager, which manages user interactions and interface behavior. This layer ensures that all underlying functionality is delivered through a clean, intuitive, and accessible interface, making the system not only powerful but also easy to use.


4. Conclusion

Through the actual development process, I realized that many seemingly technical bottlenecks were rooted not in model performance, but in unclear module boundaries and flawed design assumptions. Overlapping responsibilities and tight coupling between components often led to unexpected interference, making the system increasingly difficult to maintain or extend.

Take SceneScoringEngine as an example. I initially applied fixed logic to aggregate model outputs, which caused biased scene judgments in specific cases. Upon further investigation, I found that different models should play different roles depending on the scene context. In response, I implemented a dynamic weight adjustment mechanism that adapts model contributions based on contextual signals—allowing the system to better leverage the right information at the right time.

This process showed me that effective architecture requires more than simply connecting modules. The real value lies in ensuring that the system remains predictable in behavior and adaptable over time. Without a clear separation of responsibilities and structural flexibility, even well-written functions can become obstacles as the system evolves.

In the end, I came to a deeper understanding: writing functional code is rarely the hard part. The real challenge lies in designing a system that grows gracefully with new demands. That requires the ability to abstract problems correctly, define precise module boundaries, and anticipate how design choices will shape long-term system behavior.


📖 Multimodal AI System Design Series

This article marks the beginning of a series that explores how I approached building a multimodal AI system, from early design concepts to major architectural shifts.

In the upcoming parts, I’ll dive deeper into the technical core: how the models work together, how semantic understanding is structured, and the design logic behind key decision-making components.


Thank you for reading. Through developing VisionScout, I’ve learned many valuable lessons about multimodal AI architecture and the art of system design. If you have any perspectives or topics you’d like to discuss, I welcome the opportunity to exchange ideas. 🙌

References & Further Reading

Core Technologies

  • YOLOv8: Ultralytics. (2023). YOLOv8: Real-time Object Detection and Instance Segmentation.
  • CLIP: Radford, A., et al. (2021). Learning Transferable Visual Models From Natural Language Supervision. ICML 2021.
  • Places365: Zhou, B., et al. (2017). Places: A 10 Million Image Database for Scene Recognition. IEEE TPAMI.
  • Llama 3.2: Meta AI. (2024). Llama 3.2: Multimodal and Lightweight Models.
