In the future, a home robot could manage daily chores on its own and learn household patterns from ongoing experience. It might serve coffee in the morning without being asked, having remembered your habits over time. For a multimodal agent, this kind of intelligence depends on (a) continuously observing the world through multimodal sensors, (b) storing its experience in long-term memory, and (c) reasoning over that memory to guide its actions. Current research focuses largely on LLM-based agents, but multimodal agents process more diverse inputs and store richer, multimodal content, which poses new challenges for maintaining consistency in long-term memory. Instead of simply storing descriptive experiences, multimodal agents must build internal world knowledge, similar to how humans learn.
Existing attempts include appending raw agent trajectories, such as dialogues or execution histories, directly to memory. Some methods enhance this by adding summaries, latent embeddings, or structured knowledge representations. For multimodal agents, memory formation is closely tied to online video understanding, where early approaches such as extending context windows or compressing visual tokens often fail to scale to long video streams. Memory-based methods, which store encoded visual features, improve scalability but struggle to maintain long-term consistency. The Socratic Models framework generates language-based memory to describe videos, which scales well but has difficulty tracking evolving events and entities over time.
Researchers from ByteDance Seed, Zhejiang University, and Shanghai Jiao Tong University have proposed M3-Agent, a multimodal agent framework with long-term memory. M3-Agent processes real-time visual and auditory inputs to build and update its memory, much as humans do. Beyond episodic memory, it also develops semantic memory, allowing it to accumulate world knowledge over time. Its memory is organized in an entity-centric, multimodal structure, supporting a deeper and more consistent understanding of the environment. When given an instruction, M3-Agent engages in multi-turn reasoning and autonomously retrieves relevant information from memory. The researchers also developed M3-Bench, a long-video question-answering benchmark, to evaluate the effectiveness of M3-Agent.
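To make the entity-centric memory idea concrete, here is a minimal Python sketch of how such a store could be organized. This is an illustration under stated assumptions, not the authors' implementation: the names MemoryItem and EntityCentricMemory, and the specific fields, are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class MemoryItem:
    """One memory node: either an episodic observation or a semantic fact (hypothetical schema)."""
    item_id: str          # unique identifier
    kind: str             # "episodic" or "semantic"
    modality: str         # e.g. "vision", "audio"
    content: str          # raw description or abstracted knowledge
    embedding: List[float]                              # vector used for retrieval
    entities: List[str] = field(default_factory=list)   # entity IDs this item mentions

class EntityCentricMemory:
    """Toy entity-centric store: items are indexed by the entities they mention."""
    def __init__(self) -> None:
        self.items: Dict[str, MemoryItem] = {}
        self.by_entity: Dict[str, List[str]] = {}

    def add(self, item: MemoryItem) -> None:
        self.items[item.item_id] = item
        for ent in item.entities:
            self.by_entity.setdefault(ent, []).append(item.item_id)

    def about(self, entity_id: str) -> List[MemoryItem]:
        """All episodic and semantic memories attached to one entity."""
        return [self.items[i] for i in self.by_entity.get(entity_id, [])]
```

Because an episodic observation ("the person in the red jacket placed keys on the table") and a semantic fact ("the person in the red jacket is the homeowner") attach to the same entity ID, a later query about that person can reach both kinds of memory, which is what keeps the agent's picture of the world consistent over time.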
M3-Agent consists of a multimodal LLM and a long-term memory module, operating through two parallel processes: memorization and control. The long-term memory is an external database that stores structured, multimodal data in a memory graph, where nodes represent distinct memory items with unique IDs, modalities, raw content, embeddings, and metadata. During memorization, M3-Agent processes video streams clip by clip, generating episodic memory from raw content and semantic memory for abstract knowledge such as identities and relationships. During control, the agent conducts multi-turn reasoning, calling search functions to fetch relevant memory over up to H rounds. Reinforcement learning optimizes the framework, with separate models trained for memorization and control to achieve the best performance.
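The control process can be pictured as a simple loop; the sketch below assumes a chat-style model exposing a decide/answer interface and an embedding-based search over the memory graph. Names such as llm.decide and memory.search are placeholders, not the framework's actual API.

```python
from typing import List

H = 5  # maximum number of reasoning/retrieval rounds (an assumed budget)

def answer_with_memory(question: str, memory, llm) -> str:
    """Multi-turn control loop: alternate between reasoning and memory retrieval
    until the model is ready to answer or the round budget H is exhausted."""
    retrieved: List[str] = []
    for _ in range(H):
        # The model either emits a search query over long-term memory or a final answer.
        step = llm.decide(question=question, retrieved=retrieved)  # hypothetical interface
        if step.action == "search":
            hits = memory.search(step.query, top_k=5)  # embedding-based lookup over memory items
            retrieved.extend(hit.content for hit in hits)
        else:  # step.action == "answer"
            return step.text
    # Round budget exhausted: answer with whatever has been retrieved so far.
    return llm.answer(question=question, retrieved=retrieved)
```

Keeping memorization (which writes to the memory graph as the stream arrives) separate from control (which reads from it at question time) lets the two processes run in parallel, and it is also what allows a dedicated model to be trained for each process with reinforcement learning.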
M3-Agent and all baselines are evaluated on both M3-Bench-robot and M3-Bench-web. On M3-Bench-robot, M3-Agent achieves a 6.3% accuracy improvement over the strongest baseline, MA-LMM, while on M3-Bench-web and VideoMME-long it outperforms Gemini-GPT4o-Hybrid by 7.7% and 5.3%, respectively. Moreover, M3-Agent outperforms MA-LMM by 4.2% in human understanding and 8.5% in cross-modal reasoning on M3-Bench-robot. On M3-Bench-web, it outperforms Gemini-GPT4o-Hybrid by 15.5% and 6.7% in the same categories. These results underscore M3-Agent’s ability to maintain character consistency, enhance human understanding, and effectively integrate multimodal information.
In conclusion, researchers introduced M3-Agent, a multimodal framework with long-term memory, capable of processing real-time video and audio streams to build episodic and semantic memories. This enables the agent to accumulate world knowledge and maintain consistent, context-rich memory over time. Experimental results show that M3-Agent outperforms all baselines across multiple benchmarks. Detailed case studies highlight current limitations and suggest future directions, such as improving attention mechanisms for semantic memory and developing more efficient visual memory systems. These advancements pave the way for more human-like AI agents in practical applications.
Check out the Paper and GitHub Page.
Sajjad Ansari is a final year undergraduate from IIT Kharagpur. As a Tech enthusiast, he delves into the practical applications of AI with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible manner.