or so, it has been impossible to deny the increase in hype around AI, especially with the rise of generative and agentic AI. As a data scientist working in a consulting firm, I have noticed considerable growth in the number of enquiries about how we can leverage these new technologies to make processes more efficient or automated. And while this interest might flatter us data scientists, it sometimes seems as if people expect magic from AI models, as if a single prompt could solve every problem. On the other hand, while I personally believe generative and agentic AI have changed (and will continue to change) how we work and live, when we modify business processes we must consider their limitations and challenges and see where they prove to be a good tool (just as we wouldn’t use a fork to cut food).
Being a nerd who understands how LLMs work, I wanted to test their performance in a logic game, the Spanish version of Wordle, against an algorithm I had built in a couple of hours some years ago (more details on that can be found here). Specifically, I had the following questions:
- Will my algorithm be better than the LLM models?
- How will reasoning capabilities in LLM models affect their performance?
Building an LLM-based solution
To get a solution from the LLM models, I built three main prompts. The first one was designed to get an initial guess:
Let’s suppose I’m playing WORDLE, but in Spanish. It’s a game where you have to guess a 5-letter word, and only 5 letters, in 6 attempts. Also, a letter can be repeated in the final word.
First, let’s review the rules of the game: Every day the game chooses a five-letter word that players try to guess within six attempts. After the player enters the word they think it is, each letter is marked in green, yellow, or gray: green means the letter is correct and in the correct position; yellow means the letter is in the hidden word but not in the correct position; while gray means the letter is not in the hidden word.
But if you place a letter twice and one shows up green and the other yellow, it means the letter appears twice: once in the green position, and once in another position that is not the yellow one.
Example: If the hidden word is “PIZZA”, and your first attempt is “PANEL”, the response would look like this: the “P” would be green, the “A” yellow, and the “N”, “E”, and “L” gray.
Since for now we don’t know anything about the target word, give me a good starting word—one that you think will provide useful information to help us figure out the final word.
Then, a second prompt was used to lay out all the game rules (it is not shown in full here for space reasons; the complete version also included example games and example reasoning):
Now, the idea is that we review the game strategy. I’ll be giving you the game results. The idea is that, given this result, you suggest a new 5-letter word. Remember also that there are only 6 total attempts. I’ll give you the result in the following format:
LETTER -> COLOR
For example, if the hidden word is PIZZA, and the attempt is PANEL, I’ll give the result in this format:
P -> GREEN (it’s the first letter of the final word)
A -> YELLOW (it’s in the word, but not in the second position—instead it’s in the last one)
N -> GRAY (it’s not in the word)
E -> GRAY (it’s not in the word)
L -> GRAY (it’s not in the word)
Let’s remember the rules. If a letter is green, it means it’s in the position where it was placed. If it’s yellow, it means the letter is in the word, but not in that position. If it’s gray, it means it’s not in the word.
If you place a letter twice and one shows green and the other gray, it means the letter only appears once in the word. But if you place a letter twice and one shows green and the other yellow, it means the letter appears twice: once in the green position, and another time in a different position (not the yellow one).
All the information I give you must be used to build your suggestion. At the end of the day, we want to “turn” all the letters green, since that means we guessed the word.
Your final answer must only contain the word suggestion—not your reasoning.
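For readers who want the feedback rules in code form, here is a minimal sketch (my own illustration, not the logic used in my experiments) of a Wordle-style scorer that handles the repeated-letter case described above:

```python
from collections import Counter

def score_guess(hidden: str, guess: str) -> list[str]:
    """Return per-letter feedback: GREEN, YELLOW, or GRAY."""
    feedback = ["GRAY"] * len(guess)
    remaining = Counter()
    # First pass: mark greens and count the hidden letters that are still unmatched.
    for i, (h, g) in enumerate(zip(hidden, guess)):
        if h == g:
            feedback[i] = "GREEN"
        else:
            remaining[h] += 1
    # Second pass: mark yellows, consuming each unmatched hidden letter at most once,
    # so a repeated guess letter only turns yellow if the word really contains another copy.
    for i, g in enumerate(guess):
        if feedback[i] != "GREEN" and remaining[g] > 0:
            feedback[i] = "YELLOW"
            remaining[g] -= 1
    return feedback

print(score_guess("PIZZA", "PANEL"))
# ['GREEN', 'YELLOW', 'GRAY', 'GRAY', 'GRAY']
```

It reproduces the PIZZA/PANEL example from the prompt: the P turns green, the A yellow, and the rest gray.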
The final prompt was used to get a new suggestion after we had the result of an attempt:
Here’s the result. Remember that the word must have 5 letters, that you must use the rules and all the knowledge of the game, and that the goal is to “turn” all the letters green, with no more than 6 attempts to guess the word. Take your time to think through your answer—I don’t need a quick response. Do not give me your reasoning, only your final result.
Something important here is that I never tried to guide the LLMs or point out mistakes or errors in their logic. I wanted a pure LLM-based result and didn’t want to bias the solution in any way, shape, or form.
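I ran the games by hand through each model’s chat interface, but the three-prompt flow could also be automated. Below is a minimal sketch, assuming the OpenAI Python client, the score_guess helper sketched earlier, and placeholder prompt strings; the model name and helper names are illustrative, not the setup I actually used:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

INITIAL_PROMPT = "..."  # first prompt shown above
RULES_PROMPT = "..."    # second prompt, with the full rules and examples
RESULT_PROMPT = "..."   # final prompt, sent alongside each game result

def ask(messages: list[dict]) -> str:
    """Send the conversation so far and return the model's one-word reply."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=messages,
    )
    return response.choices[0].message.content.strip().upper()

def play(hidden: str, max_attempts: int = 6) -> bool:
    """Play one game against the model; return True if the word is guessed."""
    messages = [{"role": "user", "content": INITIAL_PROMPT}]
    guess = ask(messages)  # attempt 1
    messages.append({"role": "assistant", "content": guess})
    follow_up = RULES_PROMPT  # the full rules go out with the first result
    for attempt in range(1, max_attempts + 1):
        feedback = score_guess(hidden, guess)  # scorer from the earlier sketch
        if all(color == "GREEN" for color in feedback):
            return True
        if attempt == max_attempts:
            break  # out of attempts
        result = "\n".join(f"{g} -> {c}" for g, c in zip(guess, feedback))
        messages.append({"role": "user", "content": f"{follow_up}\n{result}"})
        guess = ask(messages)  # next attempt
        messages.append({"role": "assistant", "content": guess})
        follow_up = RESULT_PROMPT  # later results use the shorter follow-up prompt
    return False
```

In practice, parsing the reply would need a bit more care, since models sometimes add extra words even when asked to return only the suggestion.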
Initial experiments
My initial hypothesis was that, while my algorithm would be better, the generative AI-based solution was going to do a pretty good job without much help. After some days, however, I noticed some “funny” behaviors, like the one below:
The answer was pretty obvious: it only had to swap two letters. However, ChatGPT repeated the same guess as before.
After seeing these kinds of mistakes, I started asking about them at the end of each game; the LLMs basically acknowledged the mistakes but didn’t give a clear explanation for their answers:

While these are just two examples, this kind of behavior was common with the pure LLM solution, showcasing some potential limitations in the reasoning of base models.
Results Analysis
With all this in mind, I ran an experiment for 30 days. For the first 15 days, I compared my algorithm against three base LLM models:
- ChatGPT’s 4o/5 model (after OpenAI released the GPT-5 model, I could no longer toggle between models on the free-tier version of ChatGPT)
- Gemini’s 2.5-Flash model
- Meta’s Llama 4 model
Here, I compared two main metrics: the percentage of wins, and a points-based score in which each green letter in the final guess awarded 3 points, each yellow letter 1 point, and each gray letter 0 points.
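As a quick illustration (this snippet is mine, not code from the experiment), the points metric for a game’s final guess can be computed like this:

```python
POINTS = {"GREEN": 3, "YELLOW": 1, "GRAY": 0}

def final_guess_points(feedback: list[str]) -> int:
    """Points awarded for the final guess of a game."""
    return sum(POINTS[color] for color in feedback)

# A final guess scored GREEN, GREEN, YELLOW, GRAY, GREEN earns 3 + 3 + 1 + 0 + 3 = 10 points.
print(final_guess_points(["GREEN", "GREEN", "YELLOW", "GRAY", "GREEN"]))  # 10
```

The results for these first 15 days are summarized below: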

As can be seen, my algorithm (which, while specific to this use case, only took me a day or so to build) is the only approach that wins every day. Among the LLMs, Gemini shows the worst performance, while ChatGPT and Meta’s Llama post similar numbers. However, as the figure on the right shows, each model’s performance varies considerably, and none of these alternatives is consistent for this particular use case.
However, these results wouldn’t be complete without pitting a reasoning LLM against my algorithm (and against the base LLMs). So, for the following 15 days, I compared these models:
- ChatGPT’s 4o/5 model using reasoning capability
- Gemini’s 2.5-Flash model (same model as before)
- Meta’s Llama 4 model (same model as before)
Some important comments here: I initially planned to include Grok as well, but after Grok 4 was released, the reasoning toggle for Grok 3 disappeared, which made comparisons difficult. I also tried to use Gemini 2.5-Pro, but unlike ChatGPT’s reasoning option it is not a toggle but a separate model, and it only allowed me to send 5 prompts per day, which was not enough to complete a full game. With this in mind, here are the results for the following 15 days:

The reasoning capability behind LLMs provides a huge boost to performance on this task, which requires understanding which letters can go in each position, tracking which ones have already been evaluated, remembering all previous results, and reasoning over the possible combinations. Not only are the average results better, but performance is also more consistent: in the two games that weren’t won, only one letter was missed. In spite of this improvement, the specific algorithm I built still performs slightly better, although, as I mentioned earlier, it was built for this specific task. Interestingly, over these 15 games the base LLM models (Gemini 2.5 Flash and Llama 4) didn’t win once, and their performance was worse than in the first set, which makes me wonder whether the wins they achieved before were simply luck.
Final Remarks
The intention of this exercise was to test the performance of LLMs against a purpose-built algorithm on a task that requires applying logical rules to reach a successful result. We have seen that base models don’t perform well, but that reasoning capabilities provide an important boost, producing performance similar to that of the tailored algorithm I had built. One important thing to take into account is that, while this improvement is real, in real-world applications and production systems we also have to consider response time (reasoning LLMs take longer to generate an answer than base models or, in this case, the logic I built) and cost: according to the Azure OpenAI pricing page, as of August 30, 2025, 1M input tokens cost around $0.15 for the general-purpose GPT-4o-mini model versus $1.10 for the o4-mini reasoning model, roughly a 7x difference. While I firmly believe that LLMs and generative AI will continue to change the way we work, we can’t treat them as a Swiss Army knife that solves everything, without considering their limitations and without evaluating easy-to-build tailored solutions.