How a Research Lab Made Entirely of LLM Agents Developed Molecules That Can Block a Virus

GPT-4o-based agents, led by a human, collaborated to develop experimentally validated nanobodies against SARS-CoV-2. Join me in exploring how we are entering an era of automated reasoning with actionable outputs, and let's imagine how much further this will go once robotic lab technicians can also carry out the wet-lab parts of a research project!

In an era where artificial intelligence continues to reshape how we code, write, and even reason, a new frontier has emerged: AI conducting real scientific research, a goal several companies (from major established players like Google to dedicated spin-offs) are pursuing. And we are not talking merely about simulations, automated summarization, data crunching, or theoretical outputs, but about producing experimentally validated materials, such as the biological designs with potential clinical relevance that I bring you today.

That future just got much closer; very close indeed!

In a groundbreaking paper just published in Nature by researchers from Stanford and the Chan Zuckerberg Biohub, a novel system called the Virtual Lab demonstrated that a human researcher working with a team of large language model (LLM) agents can design new nanobodies (tiny, antibody-like proteins that bind other proteins to block their function) to target fast-mutating variants of SARS-CoV-2. This was not just a narrow chatbot interaction or a tool-assisted paper; it was an open-ended, multi-phase research process led and executed by AI agents, each with a specialized expertise and role, resulting in experimentally validated biological molecules that could readily move on to downstream studies toward actual treatment of disease (in this case, Covid-19).

Let’s delve in and see how this serious, replicable research presents a working approach to AI-human (and, indeed, AI-AI) collaborative science.

From “simple” applications to a fully staffed AI lab

Although there are some precedents, the new system is unlike anything before it. And one of the coolest things is that it is not based on a special-purpose trained LLM or tool; rather, it uses GPT-4o instructed with prompts that make it play the roles of the different kinds of people typically involved in a research team.

Until recently, the role of LLMs in science was limited to question-answering, summarizing, writing support, coding and perhaps some direct data analysis. Useful, yes, but not transformative. The Virtual Lab presented in this new Nature paper changes that by elevating LLMs from assistants to autonomous researchers that interact with one another and with a human user (who brings the research question, runs experiments when required, and eventually concludes the project) in structured meetings to explore, hypothesize, code, analyze, and iterate.

The core idea at the heart of this work was indeed to simulate an interdisciplinary lab staffed by AI agents. Each agent has a scientific role—say, immunologist, computational biologist, or machine learning specialist—and is instantiated from GPT-4o with a “persona” crafted via careful prompt engineering. These agents are led by a Principal Investigator (PI) Agent and monitored by a Scientific Critic Agent, both virtual agents.
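As a rough sketch of the persona mechanism (the role names follow the paper, but the prompt wording and the helper function below are my own invention, not the actual Virtual Lab prompts), each agent boils down to the same base model paired with a role-specific system prompt:

```python
# Illustrative only: each agent is the same base LLM plus a role "persona"
# supplied as a system prompt. All prompt text here is invented.

PERSONAS = {
    "Principal Investigator": (
        "You are the Principal Investigator of a research lab. Set agendas, "
        "synthesize the team's input, and make the final decisions."
    ),
    "Scientific Critic": (
        "You are a skeptical scientific reviewer. Challenge assumptions, "
        "pinpoint errors, and demand evidence for every claim."
    ),
    "Immunologist": (
        "You are an immunologist. Judge proposals for biological plausibility."
    ),
}

def build_messages(role: str, agenda: str) -> list[dict]:
    """Assemble a chat-style request that instantiates one agent."""
    return [
        {"role": "system", "content": PERSONAS[role]},
        {"role": "user", "content": agenda},
    ]
```

The point is that "hiring" a new specialist costs nothing more than writing another persona entry, which is what lets the PI agent spawn its own team.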

The Critic Agent challenges assumptions and pinpoints errors, acting as the lab’s skeptical reviewer. As the paper explored, this was a key element of the workflow: without it, too many errors and oversights crept in, to the detriment of the project.

The human researcher sets high-level agendas, injects domain constraints, and ultimately runs the outputs (especially the wet-lab experiments). But the “thinking” (maybe I should start considering removing the quotation marks?) is done by the agents.

How it all worked

The Virtual Lab was tasked with solving a real and urgent biomedical challenge: designing new nanobody binders for the KP.3 and JN.1 variants of SARS-CoV-2, which had evolved resistance to existing treatments. Instead of starting from scratch, the AI agents decided (yes, they made this decision entirely by themselves) to mutate existing nanobodies that were effective against the ancestral strain but no longer worked as well.

As all interactions are tracked and documented, we can see exactly how the team moved forward with the project.

First of all, the human defined only the PI and Critic agents. The PI agent then created the scientific team by spawning specialized agents: in this case, an Immunologist, a Machine Learning Specialist, and a Computational Biologist. In a team meeting, the agents debated whether to design antibodies or nanobodies, and whether to design from scratch or mutate existing ones. They chose nanobody mutation, justified by faster timelines and available structural data. They then discussed what tools to use and how to implement them, settling on the ESM protein language model coupled to AlphaFold-Multimer for structure prediction and Rosetta for binding energy calculations, all implemented in Python. Notably, the code had to be reviewed by the Critic agent multiple times and was refined through multiple asynchronous meetings.

From the meetings and multiple runs of code, an exact strategy was devised for proposing the final set of mutations to be tested. Briefly, for the bioinformaticians reading this post: the PI agent designed an iterative pipeline that uses ESM to score all point mutations of a nanobody sequence by log-likelihood ratio, selects the top mutants and predicts their structures in complex with the target protein using AlphaFold-Multimer, scores the interfaces via ipLDDT, then uses Rosetta to estimate binding energy, and finally combines the scores into a ranking of all proposed mutations. This was run in a cycle, introducing further mutations as needed.
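For intuition only, here is a minimal sketch of that final score-combination step. The mutant names and all numbers are made up, and the equal weights are illustrative, not the paper's; in the real pipeline the three scores would come from actual ESM, AlphaFold-Multimer, and Rosetta runs.

```python
# Toy sketch of the mutation-ranking step. The three scores per mutant stand
# in for ESM log-likelihood ratio, AlphaFold-Multimer interface pLDDT, and
# Rosetta binding energy; here they are hard-coded placeholders.

def normalize(values):
    """Min-max scale a list of numbers to [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]

def rank_mutants(candidates, weights=(1.0, 1.0, 1.0)):
    """Rank (name, esm_llr, iplddt, binding_energy) tuples, best first.

    Higher ESM LLR and ipLDDT are better as-is; lower (more negative)
    binding energy is better, so it is negated before normalization.
    """
    esm = normalize([c[1] for c in candidates])
    plddt = normalize([c[2] for c in candidates])
    dg = normalize([-c[3] for c in candidates])
    scored = [
        (name, weights[0] * e + weights[1] * p + weights[2] * d)
        for (name, *_), e, p, d in zip(candidates, esm, plddt, dg)
    ]
    return sorted(scored, key=lambda t: t[1], reverse=True)

# Hypothetical point mutants of one nanobody (all numbers invented):
mutants = [
    ("S56R", 1.8, 85.0, -7.2),
    ("T101Y", 0.9, 78.0, -5.1),
    ("G44D", 1.2, 90.0, -8.0),
]
ranking = rank_mutants(mutants)  # G44D comes out on top here
```

In the cyclic version described above, the top-ranked mutants from one round would simply become the starting sequences for the next.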

The computational pipeline generated 92 nanobody sequences that were synthesized and tested in a real-world lab; most of them turned out to be proteins that could actually be produced and handled. Two of them showed improved binding to the SARS-CoV-2 proteins they were designed against, covering both the recent variants and the ancestral forms.

These success rates are similar to those of analogous projects run in the traditional way (that is, executed by humans), but the Virtual Lab took far less time to conclude. And, though it hurts to say it, I’m pretty sure it also entailed much lower costs overall, as it involved far fewer people (and hence salaries).

Like in a human group of scientists: meetings, roles, and collaboration

We saw above how the Virtual Lab mimics how human science happens: via structured interdisciplinary meetings. Each meeting is either a “Team Meeting”, where multiple agents discuss broad questions (the PI starts, others contribute, the Critic reviews, and the PI summarizes and decides); or an “Individual Meeting” where a single agent (with or without the Critic) works on a specific task, e.g., writing code or scoring outputs.
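The team-meeting choreography just described can be sketched as a simple loop; here `ask(role, prompt)` is a hypothetical stand-in for a real LLM call, and the structure (PI opens, specialists contribute, Critic reviews, PI decides) follows the paper's description.

```python
# Sketch of one "Team Meeting": the PI opens, each specialist contributes
# in rounds, the Critic reviews each round, and the PI synthesizes a
# decision. ask(role, prompt) is a stand-in for a real LLM call.

def run_team_meeting(agenda, specialists, ask, rounds=2):
    transcript = [("PI", ask("PI", f"Open the meeting. Agenda: {agenda}"))]
    for _ in range(rounds):
        for role in specialists:
            # Each specialist sees the discussion so far and responds
            transcript.append((role, ask(role, repr(transcript))))
        # The Critic reviews before discussion continues
        transcript.append(("Critic", ask("Critic", repr(transcript))))
    # The PI summarizes the whole discussion and makes the decision
    decision = ask("PI", f"Summarize and decide. Transcript: {transcript!r}")
    transcript.append(("PI", decision))
    return decision, transcript
```

An "Individual Meeting" would then just be the degenerate case with one specialist and, optionally, the Critic.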

To avoid hallucinations and inconsistency, the system also uses parallel meetings; that is, the same task is run multiple times with different randomness (i.e., at high “temperature”). Interestingly, the outcomes of these several meetings are then condensed in a single low-temperature “merge” meeting, which is much more deterministic and can quite safely decide which conclusions, among all those produced, make the most sense.
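Conceptually, the explore-then-merge pattern looks like the sketch below; `run_meeting` is a hypothetical stand-in for a full multi-agent meeting, and the temperature values are illustrative, not taken from the paper.

```python
# Sketch of "parallel meetings": explore the same agenda with several
# high-temperature runs, then consolidate the candidate answers in one
# near-deterministic low-temperature merge meeting.
# run_meeting(agenda, temperature, seed) stands in for a full meeting.

def parallel_meetings(agenda, run_meeting, n=3):
    # Diverse exploration: same agenda, high temperature, different seeds
    answers = [run_meeting(agenda, temperature=1.0, seed=i) for i in range(n)]
    # Consolidation: one low-temperature meeting sees all candidate answers
    merge_agenda = (
        agenda
        + "\nPick the most sensible of these conclusions:\n"
        + "\n".join(answers)
    )
    return run_meeting(merge_agenda, temperature=0.1, seed=0)
```

This is essentially self-consistency sampling applied at the level of whole meetings rather than single completions.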

Clearly, these ideas can be applied to any other kind of multi-agent interaction, and for any purpose!

How much did the humans do?

Surprisingly little, for the computational part at least, as the experiments can’t yet be automated to the same degree; but keep reading for some reflections on robots in labs!

In this Virtual Lab round, LLM agents wrote 98.7% of the total words (over 120,000 tokens), while the human researcher contributed just 1,596 words across the entire project. The agents wrote all the scripts for ESM, AlphaFold-Multimer post-processing, and Rosetta XML workflows. The human only helped run the code and facilitated the real-world experiments. The Virtual Lab pipeline was built in 1-2 days of prompting and meetings, and the nanobody design computation ran in about a week.

Why this matters (and what comes next)

The Virtual Lab could serve as the prototype for a fundamentally new research model, and indeed for a fundamentally new way to work, where everything that can be done on a computer is automated and humans make only the most critical decisions. LLMs are clearly shifting from passive tools to active collaborators that, as the Virtual Lab shows, can drive complex, interdisciplinary projects from idea to implementation.

The next ambitious leap? Replace the hands of the human technicians who ran the experiments with robotic ones. Clearly, the next frontier is automating physical interaction with the real world, which is essentially what robots are for. Imagine, then, the full pipeline as applied to a research lab:

  • A human PI defines a high-level biological goal.
  • The team reviews existing information, scans databases, and brainstorms ideas.
  • A set of AI agents selects computational tools if required, writes and runs code and/or analyses, and finally proposes experiments.
  • Then, robotic lab technicians, rather than human technicians, carry out the protocols: pipetting, centrifuging, plating, imaging, data collection.
  • The results flow back into the Virtual Lab, closing the loop.
  • Agents analyze, adapt, iterate.

This would make the research process truly end-to-end autonomous. From problem definition to experiment execution to result interpretation, all components would be run by an integrated AI-robotics system with minimal human intervention—just high-level steering, supervision, and global vision.

Robotic biology labs are already being prototyped. Emerald Cloud Lab, Strateos, and Transcriptic (now part of Colabra) offer robotic wet-lab-as-a-service. Future House is a non-profit building AI agents to automate research in biology and other complex sciences. In academia, some autonomous chemistry labs exist whose robots can explore chemical space on their own. Biofoundries use programmable liquid handlers and robotic arms for synthetic biology workflows. Adaptyv Bio automates protein expression and testing at scale.

Such automated laboratory systems, coupled to emerging systems like the Virtual Lab, could radically transform how science and technology progress. The intelligent layer drives the project and assigns work to the robots, whose output then feeds back into the reasoning layer in a closed-loop discovery engine that could run 24/7 without fatigue or scheduling conflicts, conduct hundreds or thousands of micro-experiments in parallel, and rapidly explore vast hypothesis spaces that are simply not feasible for human labs. Moreover, the wet labs, virtual labs, and managers don’t even need to be physically together, allowing resources to be distributed optimally.

There are challenges, of course. Real-world science is messy and nonlinear. Robotic protocols must be incredibly robust. Unexpected errors still need judgment. But as robotics and AI continue to evolve, those gaps will certainly shrink.

Final thoughts

We humans were always confident that technology, in the form of smart computers and robots, would take over our highly repetitive physical jobs, while creativity and thinking would remain our domain of mastery for decades, perhaps centuries. Yet, despite considerable automation via robots, the AI of the 2020s has shown that technology can also outperform us at some of our most brain-intensive jobs.

In the near future, LLMs won’t just answer our questions or merely help us with our work. They will ask, argue, debate, decide. And sometimes, they will discover!

References and further reads

The Nature paper analyzed here:

https://www.nature.com/articles/s41586-025-09442-9

Other scientific discoveries by AI systems:

https://pub.towardsai.net/two-new-papers-by-deepmind-exemplify-how-artificial-intelligence-can-help-human-intelligence-ae5143f07d49
