How did we go from “stochastic parrots” to AI models winning math contests? While there is certainly doubt that LLMs are truly the PhD-level thinkers they are advertised as, the progress on complex reasoning tasks is undeniable.
A popular trick has been to mix and match the generative capabilities of LLMs with formal verifiers, i.e. purpose-built software that provides guaranteed solutions to certain problems when they are stated precisely. The key insight is that LLMs may be good at translating messy, ambiguous requirements into precise formal specifications, while formal verifiers excel at finding solutions that satisfy those specifications. By combining them, we get a system that can understand what you want and guarantee it delivers exactly that: AWS recently used this very trick to build “guardrails” for real-time chats.
How does this work in practice? Unfortunately, the explanation of these basic dynamics often happens within larger, complex contexts, like reinforcement learning or mathematical proofs. Today, we’ll demonstrate this hybrid approach using Alloy, a lightweight language that is trivial to read, even for beginners. Instead of the usual math-y papers and hard-to-grasp benchmarks, we are going to solve a much more relatable challenge, inspired by a weekly crossword publication:
We have 5 cars (1-5) parked in front of 5 girls (A-E), and 5 names (Laura, Giovanna, Bianca, Franca, Marta); we don’t know which girl parked which car, but each girl says something about the situation. Our task is to answer this deceptively simple question: which girl is named Marta, and what is her car?
While more beach-level than PhD-level thinking, the solution sits at a sweet spot of complexity. It can provide a primer on LLMs and formal methods that is not polluted by other themes and does not require extensive domain knowledge: we keep all the basic ingredients of real-world problems, but simplify the setup.
Prompts, screenshots, and Alloy code are available in this open source repo (all tests were run during a single week in August 2025; the main reasoning loop was done with Opus 4.1 on Claude Desktop).
AIs and humans struggle by themselves
A fun fact about our puzzle is that, even though it requires only “beach-level thinking”, top models are not obviously good at it. When we uploaded the original picture and prompted Opus 4.1 for a solution, the model incorrectly assumed that Girl C is wearing pants: how can we then trust its conclusion that Marta is Girl A, and her car is number 5?
Things get interesting when we compare models. Even after abstracting the puzzle into a textual description, LLMs still cannot reach consensus: DeepSeek’s answer (A and 2) differs from the one Opus gave above; Opus’s own answer with textual prompting (A and 2) differs from its image-based one, and ChatGPT5 has its own opinion (A and 5).
This is what makes the puzzle a great motivating example: humans struggle at this kind of combinatorial reasoning (homework question: how long did it take you to solve it?), but it’s unclear how much better frontier models are. How do we build confidence in any of the answers above? How can we reason with the AI instead of delegating the process entirely?
Reasoning as “eliminating possibilities”
Complex reasoning challenges can often be solved by following the advice of a famous detective: ‘When you have eliminated the impossible, whatever remains, however improbable, must be the truth’. Instead of trying to solve the problem all at once, we can think of our puzzle as the combination of three main things:
- An initial situation, randomly mapping girls to cars and labels.
- A set of constraints, in the form of statements by the very same girls: these statements will make certain mappings impossible.
- A final situation, in which girls are re-mapped to names and cars.
Our initial knowledge is compatible with this reality:

But also this (and many more):

We can imagine that every time we add a girl’s statement, we eliminate some arrangements from possibly being the final one. In other words, we increase our knowledge about the situation as we progressively restrict the set of feasible solutions (the same basic insight underlies epistemic logic and information theory). In fact, the very first statement, “Girl A states that Laura is not next to her, and A’s car is now in front of Bianca”, rules out our first scenario, because Laura is next to Girl A there.
Enumerating scenarios is a tedious and error-prone task, even for LLMs. The magic of Alloy is its declarative nature. Instead of writing down the reasoning code ourselves, we state what we know (the premises in a traditional proof, the girls’ statements here) and what we want to find out (a theorem there, Marta’s car here), and let Alloy do the rest: the exploration of a huge conceptual space is handled by tried-and-tested methods, so that we can focus on the faithful translation of the puzzle and (importantly!) the interpretation of the instances Alloy finds.
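To make the declarative recipe concrete, here is a tiny toy model, unrelated to the puzzle and purely illustrative: we declare what exists, state one premise as a fact, and ask Alloy to find a satisfying world.

// A toy model: declare what exists, state what we know,
// and ask Alloy to find a world that satisfies it.
sig Person { likes: set Person }

// Premise: at least one person likes themselves.
fact Premise { some p: Person | p in p.likes }

// Goal: show me any world with exactly three people.
run { #Person = 3 } for 3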
The division of labor should now be clear: instead of the LLM (or us) directly solving the problem, we translate the English requirements into Alloy code with Claude, then use Alloy to generate solutions, and finally we, as humans, check them.
From LLM to Alloy and back: the reasoning loop
Our prompting strategy is now more subtle. We no longer ask Claude for a direct solution; instead, our prompt guides it to generate Alloy code based on our initial scenario. Instead of “one-shotting” the solution, we are now in a virtuous loop, generating increasingly complex code and verifying against the Alloy output that we are getting closer:

The result is our starting code, which contains the main ingredients but no constraints yet. It is easy to scroll through the definitions now that the tedious translation has been done: Girl, Car, and Name are our main “signatures” (i.e. sets of objects), and the initial positions map Girls A-E to Cars 1-5. We don’t yet know who owns what, except that nobody owns the car currently in front of her:
fact NoGirlOwnsTheCarInFrontOfHer {
  // No girl is initially standing in front of her own car:
  // Girl A (position 1) does not own Car1, B does not own Car2, etc.
  A.owns != Car1
  B.owns != Car2
  C.owns != Car3
  D.owns != Car4
  E.owns != Car5
}
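For orientation, the declarations such constraints rely on might look like the sketch below; the names are assumptions on our part (the exact generated code is in the repo), and the full model also tracks clothing and positions, omitted here:

// Names, cars, and girls as singleton atoms; fields hold the
// mappings we reason about.
abstract sig Name {}
one sig Laura, Giovanna, Bianca, Franca, Marta extends Name {}

abstract sig Car {}
one sig Car1, Car2, Car3, Car4, Car5 extends Car {}

abstract sig Girl {
  name: one Name, // the name we want to discover
  owns: one Car   // the car she parked
}
one sig A, B, C, D, E extends Girl {}

fact OneEach {
  // distinct girls get distinct names and distinct cars
  all disj g1, g2: Girl | g1.name != g2.name and g1.owns != g2.owns
}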
We pause here to highlight two great Alloy features. First, the code maps clearly to logical statements, much like the ones found in mathematical proofs and informal reasoning: even if you have never seen Alloy’s syntax before, the statements should be obvious (code comments are your friend!). Second, the built-in UI is useful for visualizing our progress, as it depicts one instance chosen among all the possible realities that satisfy the constraints: for example, this is a possible assignment (Giovanna is C):

Executing it again, we could get another one, and then another one: since our knowledge is limited at this stage, many assignments are still possible. It is time to start eliminating some!
Let’s ask Claude to modify our initial code and add the statement from Girl A (a sketch of what this might look like follows). The great thing about this loop is that we can also encode “sanity checks” based on incomplete but sound reasoning. Not just LLMs, but human intelligence too benefits from this sort of “progressive enhancement”: incorporating “local” constraints both unit-tests the Alloy model and engages us directly with the puzzle.
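As a sketch of what Claude might produce (the helper fields `nextTo` and `parkedAt` are hypothetical names we made up for illustration; the exact encoding is in the repo):

// Assumed helper fields, added to the earlier declarations:
//   sig Girl { ..., nextTo: set Girl }   (who stands beside whom)
//   sig Car  { ..., parkedAt: one Girl } (where each car sits now)
fact GirlAStatement {
  // “Laura is not next to me”
  no g: A.nextTo | g.name = Laura
  // “My car is now in front of Bianca”
  some g: Girl | g.name = Bianca and A.owns.parkedAt = g
}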
With Girl A’s statement encoded as a constraint, we then add a check to confirm that the following mapping is no longer allowed: Franca (A, 1), Laura (B, 2). If we now run the code, Alloy searches the whole bounded space, no counterexample is found, and we have proven that we successfully excluded the undesired configuration:
pred InvalidConfiguration {
  // Girl A is named Franca and owns Car1
  A.name = Franca
  A.owns = Car1
  // Girl B is named Laura and owns Car2
  B.name = Laura
  B.owns = Car2
}
check { not InvalidConfiguration } for 5 Int
Now that we know the trick, our AI assistant can generate the script with all the girls’ statements. When we run it, this is the instance we get:

Thanks to a few iterations and interpretable, provably correct reasoning, we can now establish that ChatGPT5 got this right: Marta is Girl A in Car 5, and the full mapping provided by ChatGPT is correct (you can verify it yourself by comparing the chat result with the instance above; incidentally, this raises another interesting question: irrespective of Marta’s mapping, are the other girls uniquely determined as well?).
Reasoning out of the box
A great by-product of having independently computable representations of the concepts at hand is that we can now explore the underlying mechanics of the puzzle in Alloy’s symbolic space, instead of relying entirely on opaque mappings in latent space.
For example, we can easily confirm that the solution is unique: in the Alloy UI, if you try to get a new instance, a warning says that no other instance is available. But we could also explore outside the existing boundaries and remove all the Clothing information: does the solution change? (Try to answer before running it!) It turns out the correct solution is still a valid instance (homework question: why must this be the case?), but this time the UI can indeed produce multiple valid instances: as expected, fewer constraints, (likely) more solutions.
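Clicking through the UI works, but the same probe can be scripted. The standard idiom is to pin down the solution we found, negate it, and ask for any other world; here is a sketch (abbreviated on purpose: the full mapping should be read off the instance above):

// Uniqueness probe: describe the solution we found, negate it,
// and ask Alloy for any other valid assignment. “No instance found”
// means the solution is unique within this scope.
pred KnownSolution {
  A.name = Marta and A.owns = Car5
  // ...plus the name/car pairs of the other four girls,
  // read off from the instance above
}
run { not KnownSolution } for 5 Int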
A symbolic space that we can easily manipulate is also great for checking the work of AI, which should never be taken at face value. The first case in point is Opus’s solution from the beginning, obtained by parsing the image incorrectly. We can easily change Girl C’s clothing in the model (i.e. `C.wears = Trousers`) and try again: since there is no solution, the (sad) conclusion is that Opus’s original reasoning was incorrect; it was “right” but for the “wrong” reasons, so to speak.
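Concretely, the probe might look like this sketch, keeping all the girls’ statements and swapping only the clothing fact:

// Replaying Opus’s misreading: swap Girl C’s clothing fact for the
// one Opus assumed, keep every other constraint, and look for any world.
fact OpusMisreading { C.wears = Trousers } // instead of C’s actual clothing
run {} for 5 Int // no instance found: Opus started from a false premise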
A second example comes from what Claude added to check for uniqueness (i.e. Marta is A and 5 in all valid configurations). In theory, that’s a nice addition, but in practice this check does not do the job:
assert MartaUniqueSolution {
  // In all valid configurations, Marta is
  // always the same girl with the same car
  all g1, g2: Girl |
    (g1.name = Marta and g2.name = Marta) implies
      (g1 = g2) // Marta is always at the same position
}
The mismatch is easy to identify thanks to Alloy’s readable syntax: “in all valid configurations” quantifies over all instances (in the “meta-language”, so to speak), while “all g1, g2: Girl” quantifies over girls inside a single instance.
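A version that does test what we mean keeps the quantifier inside each instance but pins the expected assignment, letting the check command do the meta-level work of ranging over all instances (a sketch):

assert MartaForced {
  // inside each instance: the girl named Marta is A and owns Car5
  all g: Girl | g.name = Marta implies (g = A and g.owns = Car5)
}
// the check explores all instances: if it passes, Marta is
// Girl A with Car5 in every valid configuration
check MartaForced for 5 Int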
See you, space cowboys
Similarly to cutting-edge systems like AlphaGeometry, we solved a deductive problem (effectively, a proof) by reasoning with Claude, instead of delegating the process entirely.
The LLM does the mapping between English and a formal language: Alloy is easy to read but sometimes tedious to write, so Claude’s code generation capabilities come in handy. Humans, on the other hand, can focus on checking that the formal setup is correct (checking is often easier than doing in the first place!). Both Claude and humans then delegate the combinatorial reasoning to a powerful, verified solver for the actual deduction.
While our beach-level proof may seem unimportant, and the copy-paste from Claude gets tedious quickly, this simple example is a glimpse of the power of formal methods when combined with code generation and some (human or agentic) supervision. Real-world systems use more expressive languages, run tighter, self-improving loops, and target less frivolous proofs, but many of today’s intuitions carry over to them.
Of course, solving beach-or-PhD logical puzzles is not the only use case for hybrid systems such as this one. Languages like Alloy are very popular for modelling software programs, and as such, they open the door to a future in which distributed systems can be cheaply designed and verified at scale before any implementation work even begins. As very practical examples, AWS famously invests in verifying their cloud products, and Bauplan provides an Alloy model for their own data catalog primitives.
Taking a very different path from the one many would have predicted even just 50 years ago, it seems, day by day, that we are finally getting closer to Leibniz’s dream:
If controversies were to arise, there would be no more need for disputation between two philosophers than between two calculators. For it would suffice for them to take their pencils in their hands and to sit down at the abacus, and say to each other: Let us calculate.
Acknowledgments
Thanks to Federico Bianchi, Aldrin Montana, and Patrick John Chia for preliminary feedback on a previous draft of this article. No LLM was used or harmed to write the English parts of this blog post.
If you care about verification, simulations and AI in system and infrastructure design, you’ll love working at Bauplan: we are hiring!