Can large language models learn to reason abstractly from just a few examples? In this piece, I explore this question by testing both text-based (o3-mini) and image-capable (gpt-4.1) models on abstract grid transformation tasks. These experiments reveal the extent to which current models rely on pattern matching, procedural heuristics, and symbolic shortcuts rather than robust generalization. Even with multimodal inputs, reasoning often breaks down in the face of subtle abstraction. The results offer a window into the current capabilities and limitations of in-context meta-learning with LLMs.
Introduction
Meta-learning, the ability of a system to learn how to learn, has traditionally been explored through gradient-based optimization, memory-augmented networks, or explicit task embeddings. But with the rise of large language models (LLMs), particularly the o3 family with advanced reasoning capabilities, a new question emerges: can we use LLMs themselves as meta-learners in task-based domains like ARC? The Abstraction and Reasoning Corpus (ARC), introduced by François Chollet, is a benchmark explicitly designed to test broad generalization. It provides input-output transformation puzzles with minimal supervision, few examples per task, and often no shared surface-level structure across tasks. In other words: a playground for meta-learning. To get an understanding of typical abstraction and reasoning tasks, the reader can visit the ARC play page.
Data and Setup
To explore whether LLMs like o3-mini can perform meta-learning on abstract reasoning tasks, I used data from the ARC Prize 2025 Kaggle competition. The dataset repository can be found here (Apache 2.0 license). The dataset consists of input-output grid transformations that challenge models to infer abstract rules from just a few examples.
Each task provides:
- A few training examples (`input` and `output` 2D grids)
- A single test input grid for which the model must predict the corresponding output
A second dataset provides the solution grids for each of the test input grids. Here’s a simplified example of the data format:
# training examples - dictionary of dictionaries.
# Here is an extracted task
{'train': [{'input': [[6, 6, 0], [6, 0, 0], [0, 6, 6]],
'output': [[6, 6, 0, 6, 6, 0, 0, 0, 0],
[6, 0, 0, 6, 0, 0, 0, 0, 0],
[0, 6, 6, 0, 6, 6, 0, 0, 0],
[6, 6, 0, 0, 0, 0, 0, 0, 0],
[6, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 6, 6, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 6, 6, 0, 6, 6, 0],
[0, 0, 0, 6, 0, 0, 6, 0, 0],
[0, 0, 0, 0, 6, 6, 0, 6, 6]]},
{'input': [[4, 0, 4], [0, 0, 0], [0, 4, 0]],
'output': [[4, 0, 4, 0, 0, 0, 4, 0, 4],
[0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 4, 0, 0, 0, 0, 0, 4, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 4, 0, 4, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 4, 0, 0, 0, 0]]},...,
'test': [{'input': [[7, 0, 7], [7, 0, 7], [7, 7, 0]]}]
}
# example of solution to test input grid - dictionary of lists
# Here is the extracted solution for the only test input grid above
[[[7, 0, 7, 0, 0, 0, 7, 0, 7],
  [7, 0, 7, 0, 0, 0, 7, 0, 7],
  [7, 7, 0, 0, 0, 0, 7, 7, 0],
  [7, 0, 7, 0, 0, 0, 7, 0, 7],
  [7, 0, 7, 0, 0, 0, 7, 0, 7],
  [7, 7, 0, 0, 0, 0, 7, 7, 0],
  [7, 0, 7, 7, 0, 7, 0, 0, 0],
  [7, 0, 7, 7, 0, 7, 0, 0, 0],
  [7, 7, 0, 7, 7, 0, 0, 0, 0]]]
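The challenge and solution files are JSON dictionaries keyed by task ID. For reference, here is a minimal loading sketch; the file names follow the competition's data page and may need adjusting to your local copy, and the resulting variables (train_challenges, train_sols, id_train_challenges) are the ones reused in the training loop later on:
import json

# Load challenges and solutions from the ARC Prize Kaggle dataset
# (file names assumed from the competition's data page - adjust paths as needed)
with open('arc-agi_training_challenges.json') as f:
    train_challenges = json.load(f)  # {task_id: {'train': [...], 'test': [...]}}
with open('arc-agi_training_solutions.json') as f:
    train_sols = json.load(f)        # {task_id: [output_grid, ...]}

id_train_challenges = list(train_challenges.keys())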
Each grid is a 2D array of integers from 0–9, representing colored pixels. Grids vary in size, and a transformation may also change the grid size from input to output. To visualize the arrays, I used a custom colormap with matplotlib:
from matplotlib import colors
import matplotlib.pyplot as plt

cmap = colors.ListedColormap([
    '#8B00FF',  # Violet
    '#4B0082',  # Indigo
    '#0000FF',  # Blue
    '#FFFF00',  # Yellow
    '#00FF00',  # Green
    '#FF7F00',  # Orange
    '#FF0000',  # Red
    '#964B00',  # Golden
    '#000000',  # Black
    '#FFFFFF',  # White
])
norm = colors.Normalize(vmin=0, vmax=9)

# Function to visualize an array
def visualize_matrix(matrix, title='', cmap=cmap, norm=norm):
    plt.imshow(matrix, cmap=cmap, norm=norm)
    plt.title(title)
    plt.axis('off')  # Remove axes
    plt.show()
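As a quick check of the colormap, the helper can be pointed at any grid from a task, for example the first demonstration input from the task shown above:
# Render the first demonstration input of the example task
visualize_matrix([[6, 6, 0], [6, 0, 0], [0, 6, 6]], title='Demonstration input')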
For model interaction, I used OpenAI’s o3-mini model via LangChain. Later on, we will also use gpt-4.1:
from langchain_openai import ChatOpenAI
import getpass
import os
# Prompt for a secret input
openai_key = getpass.getpass("Enter your OpenAI API key: ")
os.environ["OPENAI_API_KEY"] = openai_key
AGENT_MODEL = "o3-mini" # reasoning model, https://platform.openai.com/docs/models
AGENT_LLM = ChatOpenAI(model=AGENT_MODEL)
# AGENT_LLM = ChatOpenAI(model=AGENT_MODEL, reasoning_effort='low')
To handle LLM responses, especially when the model returns a predicted output grid as Python code inside triple backticks, I wrote a utility:
import re, ast

def extract_python_code(response_string):
    # Grab the content of the first ```python ... ``` block and parse it as a Python literal
    match = re.search(r"```python\s*(.*?)```", response_string, re.DOTALL)
    if match:
        return ast.literal_eval(match.group(1).strip())
    return None
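For instance, a typical reply wraps the predicted grid in a fenced code block, which the utility parses back into a nested list (the reply string below is made up for illustration):
# Hypothetical model reply and the parsed grid
reply = "Answer:\n```python\n[[2, 3], [5, 6]]\n```"
grid = extract_python_code(reply)
print(grid)  # [[2, 3], [5, 6]]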
This setup allowed me to structure a full reasoning loop: prompt the model with few-shot examples, extract and apply a generated algorithm, assess its performance on new test inputs and finally use the assessment to improve the algorithm.
Testing Reasoning with o3-mini
To evaluate whether LLMs can “meta-learn” on abstract reasoning tasks, I tested the o3-mini model using a closed-loop reasoning setup inspired by how humans might approach few-shot tasks. For each ARC challenge, I provided the model with a handful of demonstration input-output grid pairs and asked it to derive a single reusable algorithm.
I defined a sequence of prompts using LangChain’s ChatPromptTemplate to simulate reasoning, application, assessment, and refinement. The process mimics an inner training loop with limited supervision:
- PROMPT_REASON: The model is given training examples and asked to infer a general algorithm in pseudocode.
- PROMPT_SOLVE: The generated algorithm is applied to new inputs (both training and test).
- PROMPT_ASSESS: When the algorithm fails, the model receives feedback comparing its predicted vs. expected outputs.
- PROMPT_SUMMARIZE_FEEDBACK: The model summarizes cumulative feedback from failed attempts to iteratively refine its approach.
from langchain_core.prompts import ChatPromptTemplate
PROMPT_REASON = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an expert in solving abstract reasoning tasks. "
            "You will be given several demonstration input-output pairs of 2D arrays. "
            "Your goal is to develop a single algorithm that maps each input array to its corresponding output array.\n\n"
            "Each input and output is a 2-dimensional array of integers between 0 and 9. "
            "Solving the task involves:\n"
            "- Analyzing the demonstration pairs\n"
            "- Identifying abstract patterns or transformations\n"
            "- Formulating a general rule or algorithm that works across all examples\n"
            "- Producing pseudocode that implements the rule\n\n"
            "If prior attempts were made, you will also receive feedback summarizing what went wrong. "
            "Carefully use this feedback to improve your solution.\n\n"
            "Return only the updated algorithm as pseudocode. Do not describe or explain it.\n\n"
            "### Feedback (summary of previous attempts):\n{attempt_history}\n\n"
            "### Demonstration Pairs:\n{train_pairs}\n"
        ),
        (
            "ai",
            "Answer:"
        )
    ]
)
PROMPT_SOLVE = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an expert in abstract reasoning. "
            "Previously, you analyzed demonstration input-output pairs and developed an algorithm "
            "to transform input arrays into output arrays.\n\n"
            "Now, use that algorithm to generate an output array for a new, unseen input array.\n\n"
            "Only return the output array, formatted as valid Python code within a code block. "
            "For example:\n```python\n[[2, 3], [5, 6]]\n```\n"
            "### Developed algorithm:\n{reasoning_template}\n\n"
            "### New input array:\n{test_input}\n"
        ),
        (
            "ai",
            "Answer:"
        )
    ]
)
PROMPT_ASSESS = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an expert in abstract reasoning. "
            "A solution array was generated by applying the algorithm to the input array. "
            "Compare the generated solution to the actual target output. "
            "Analyze why the two arrays differ, and provide **clear and concise feedback** on how to improve the algorithm.\n\n"
            "Only return your feedback - do not repeat the arrays or algorithm.\n\n"
            "### Algorithm:\n{reasoning_template}\n\n"
            "### Input array:\n{test_input}\n\n"
            "### Solution array (generated by algorithm):\n{solved_test_output}\n\n"
            "### Target output array:\n{test_output}\n"
        ),
        (
            "ai",
            "Answer:"
        )
    ]
)
PROMPT_SUMMARIZE_FEEDBACK = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an expert in summarizing feedback on algorithm development. "
            "You will be given a history of past attempts, each containing an algorithm and feedback about its performance.\n\n"
            "Your goal is to produce a **concise summary** of the most important lessons learned - "
            "focusing on how the algorithm should be improved and what mistakes should be avoided in future versions.\n\n"
            "Return only the feedback summary. Do not repeat the original attempts or feedback.\n\n"
            "### Attempt History:\n{attempt_history}\n"
        ),
        (
            "ai",
            "Answer:"
        )
    ]
)
These prompts are linked into a simple LangChain pipeline:
reasoning_chain = PROMPT_REASON | AGENT_LLM
solve_chain = PROMPT_SOLVE | AGENT_LLM
assess_chain = PROMPT_ASSESS | AGENT_LLM
summarize_feedback_chain = PROMPT_SUMMARIZE_FEEDBACK | AGENT_LLM
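With the chains defined, one reasoning step is a single dictionary invocation. Here is a minimal sketch of a first attempt, assuming `train_pairs` holds the demonstration pairs of one task; the full loop below does the same thing with feedback added:
# Format the demonstrations as plain text and ask for an algorithm in pseudocode
train_pairs_str = "\n".join(
    f"Demonstration pair {i+1}:\n input grid: {p['input']} \n output grid: {p['output']}"
    for i, p in enumerate(train_pairs)
)
res = reasoning_chain.invoke({
    "train_pairs": train_pairs_str,
    "attempt_history": "",  # empty on the first attempt
})
reasoning_template = res.content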
For each ARC challenge:
- The model receives the demonstration pairs and prior feedback;
- The model generates a new algorithm in pseudocode (`reasoning_template`);
- The algorithm is tested on all the demonstrations;
- If it fails, the model receives detailed feedback on mismatched predictions, summarizes mistakes across attempts, and refines the next version of the algorithm;
- Once the model gets all demonstrations correct, I test it on the unseen test input.
This process repeats for up to a max number of attempts per challenge. A successful algorithm generalizes across the provided examples and applies correctly to the withheld test case. This setup tests whether the model can extract abstract patterns, improve its reasoning over time, and generalize from very few examples.
reasoning_templates = {}
for i, id in enumerate(id_train_challenges):
    print(f"Training on challenge {i} ID: {id}")
    train_pairs = train_challenges[id]['train']
    test_input = train_challenges[id]['test'][0]['input']  # only pick the first test input
    test_output = train_sols[id][0]  # only pick the first test output
    train_pairs_str = ''
    for j, train_pair in enumerate(train_pairs):
        train_pairs_str += f"Demonstration pair {j+1}:\n input grid: {train_pair['input']} \n output grid: {train_pair['output']}\n"
    train_pairs_str = train_pairs_str.strip()
    # keep trying until you figure out how to solve the challenge
    right_wrong = "incorrect"
    # Start with an empty reasoning template, which will be refined over time
    reasoning_template = ''
    k = 1
    max_attempts = 5
    attempt_history = []
    attempt_history_summary = ''
    while right_wrong == "incorrect":
        print(f"Attempt {k} to solve the challenge...")
        # Build the reasoning message with the current attempt history summary.
        # This message will be used to generate a new reasoning template
        reason_message = {
            "train_pairs": train_pairs_str,
            "attempt_history": attempt_history_summary,
        }
        res = reasoning_chain.invoke(reason_message)
        reasoning_template = res.content
        # Assess reasoning template
        wrong_pairs = []
        for train_pair in train_pairs:
            demo_input = train_pair['input']
            demo_output = train_pair['output']
            # Test the reasoning template on the demonstration pair
            test_message = {
                "test_input": demo_input,
                "reasoning_template": reasoning_template,
            }
            res = solve_chain.invoke(test_message)
            solved_demo_output = extract_python_code(res.content)
            # Compare the output with the demonstration output
            if solved_demo_output != demo_output:
                wrong_pairs.append((demo_input, demo_output, solved_demo_output))
        if len(wrong_pairs) > 0:
            right_wrong = 'incorrect'
            print(f"Reasoning template failed on {len(wrong_pairs)} demonstration pairs.")
            if k >= max_attempts:
                print(f"Max attempts reached ({max_attempts}). Stopping for challenge {id}.")
                reasoning_templates[id] = ''
                break
            print("Assessing the reasoning template...")
            assessment_res = f'Algorithm failed on {len(wrong_pairs)} demonstration pairs. Here is the feedback:\n'
            for demo_input, demo_output, solved_demo_output in wrong_pairs:
                assess_chain_message = {
                    "reasoning_template": reasoning_template,
                    "test_input": demo_input,
                    "solved_test_output": solved_demo_output,
                    "test_output": demo_output,
                }
                res = assess_chain.invoke(assess_chain_message)
                assessment_res += f" - From input {demo_input} to output {demo_output}, your solution was {solved_demo_output}: {res.content.strip()}\n"
            attempt_history.append({
                "attempt": k,
                "reasoning_template": reasoning_template,
                "feedback": assessment_res
            })
            summary_message = {
                "attempt_history": attempt_history,
            }
            summary_res = summarize_feedback_chain.invoke(summary_message)
            attempt_history_summary = summary_res.content.strip()
        else:
            print("Solution is correct!")
            right_wrong = "correct"
            reasoning_templates[id] = reasoning_template
            # test it against the test input/output ... but do not give feedback (this is supposed to be unknown)
            test_message = {
                "test_input": test_input,
                "reasoning_template": reasoning_template,
            }
            res = solve_chain.invoke(test_message)
            solved_test_output = extract_python_code(res.content)
            if test_output != solved_test_output:
                print(f"Test output does not match the true output for challenge {id}.")
            else:
                print(f"Test output matches the true output for challenge {id}.")
            # visualize_matrix(test_input, "Input grid")
            # visualize_matrix(test_output, "True output")
            # visualize_matrix(solved_test_output, "Test Output")
        print("-" * 40)  # Separator between entries
        k += 1
Results: When Reasoning Works
In some cases, o3-mini was able to correctly infer a generalizable algorithm from just a few input-output demonstrations. One such example involved generating a patterned tiling based on a small 2×2 input grid.

After just one attempt, the model converged on the following pseudocode:
BEGIN
  Let input be a 2x2 grid, where:
    input[0] = [a, b]
    input[1] = [c, d]
  Initialize output as an empty list.
  FOR each row index r from 0 to 5 DO:
    Let original_row ← input[r mod 2]
    IF (FLOOR(r / 2)) mod 2 = 1 THEN
      Let base_row ← REVERSE(original_row)
    ELSE
      Let base_row ← original_row
    ENDIF
    Initialize new_row as an empty list.
    FOR repeat from 1 to 3 DO:
      Append all elements of base_row to new_row.
    ENDFOR
    Append new_row to output.
  ENDFOR
  RETURN output
END
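For readers who want to run it, here is a direct Python translation of that pseudocode (my own rendering, not the model's output):
def tile_2x2(grid):
    # Expand a 2x2 grid into a 6x6 grid: repeat each row three times horizontally,
    # reversing the rows in every other two-row band, as in the pseudocode above.
    output = []
    for r in range(6):
        original_row = grid[r % 2]
        base_row = original_row[::-1] if (r // 2) % 2 == 1 else original_row
        output.append(base_row * 3)
    return output

# e.g. tile_2x2([[3, 2], [7, 8]])[0] -> [3, 2, 3, 2, 3, 2]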
Here is the expected solution (true output) and the solution from the model pseudocode (test output).


This algorithm demonstrates several notable reasoning capabilities:
- Pattern abstraction: the model inferred a repeating tiling pattern from limited data;
- Modulo logic: it introduced modulo-based indexing (`r mod 2` and `(r // 2) mod 2`) to alternate row behavior, mimicking visual mirroring;
- Grid construction: the solution scales up the 2×2 input into a larger 6×6 grid through repeated copying and reversal.
The fact that the model discovered this structure without hardcoded rules suggests it was engaging in a form of algorithm synthesis, guided by few-shot abstraction.
Here is another successful example.

Again, after just one attempt, the model converged on the following pseudocode:
BEGIN
  Let N = 3
  Create output as a 2D array of size (N×N) × (N×N), filled with 0
  FOR each row r from 0 to N–1:
    FOR each column c from 0 to N–1:
      IF input[r][c] ≠ 0 THEN
        FOR each i from 0 to N–1:
          FOR each j from 0 to N–1:
            Set output[(r * N) + i][(c * N) + j] = input[i][j]
  RETURN output
END
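As before, a direct Python translation of the pseudocode (my rendering, not the model's output) makes the rule easy to verify against the demonstrations:
def fractal_tile(grid):
    # Paste the whole NxN grid into each NxN block whose corresponding
    # input cell is non-zero; leave all other blocks filled with zeros.
    n = len(grid)
    out = [[0] * (n * n) for _ in range(n * n)]
    for r in range(n):
        for c in range(n):
            if grid[r][c] != 0:
                for i in range(n):
                    for j in range(n):
                        out[r * n + i][c * n + j] = grid[i][j]
    return out

# e.g. fractal_tile([[6, 6, 0], [6, 0, 0], [0, 6, 6]])[0] -> [6, 6, 0, 6, 6, 0, 0, 0, 0]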
Here is the expected solution and the solution from the model pseudocode.


This algorithm effectively tiles the full input grid into the output grid at every position where the input cell is non-zero. The tiling is aligned such that each copy of the original grid is placed at an offset determined by (r * N, c * N) — the scaled-up coordinates of the non-zero input cell.
What’s impressive here is that the model:
- Learns conditional placement. It only pastes the input where the input value is non-zero;
- Uses coordinate arithmetic to scale placement locations, showing spatial understanding;
- Treats the input as both control logic and content, combining layout detection with repetition.
When Text-Based Reasoning Isn’t Enough
In the task below, the model was given a small number of demonstration input-output grid pairs and asked to infer the transformation rule.

The text-based model (o3-mini) produced a detailed pseudocode solution that was structured, plausible, and internally consistent:
BEGIN
  Let original ← input grid
  Let output ← deep copy of original
  Let R ← number of rows in original
  Let C ← number of columns in original

  // Compute ring index for every cell that is part of a non-zero region.
  // A cell's ring index is defined as:
  //   - 0 if the cell is on the boundary of the grid OR if at least one of its 4-neighbors is 0
  //   - Otherwise, 1 + min(ring index of its 4-neighbors)
  Create grid ring of size R × C, filled with −1
  For each cell (r, c) in original:
    If original[r][c] ≠ 0 then
      If r = 0 OR c = 0 OR r = R−1 OR c = C−1 then
        ring[r][c] ← 0
      Else if any 4-neighbor (up, down, left, right) of (r, c) has value 0 then
        ring[r][c] ← 0

  changed ← true
  While changed is true:
    Set changed ← false
    For each cell (r, c) in original such that original[r][c] ≠ 0 and (ring[r][c] < 0 or not finalized):
      Let candidates ← list of ring indices from any 4-neighbor (r′, c′) with a non-negative ring value
      If candidates is non-empty then
        newRing ← min(candidates) + 1
        If ring[r][c] < 0 OR newRing < ring[r][c] then
          ring[r][c] ← newRing
          changed ← true

  // Process propagation from outer ring inward.
  // For each cell with ring index ≥ 1, update its value based on a neighbor with ring index one less.
  // Selection is deterministic: "up, left, down, right"
  Let maxRing ← maximum value in ring
  For layer from 1 to maxRing:
    For each cell (r, c) such that ring[r][c] = layer:
      For direction in [(−1,0), (0,−1), (1,0), (0,1)]:
        nr ← r + direction.row
        nc ← c + direction.col
        If (nr, nc) is in bounds AND ring[nr][nc] = layer − 1 then
          Let base ← output[nr][nc]
          If base = 2 then
            newValue ← 8
          Else if base mod 2 = 0 then
            newValue ← base + 2
          Else // base is odd
            newValue ← base + 1
          output[r][c] ← newValue
          Break out of neighbor loop

  Return output
END
At first glance, this looks like a sophisticated strategy: compute a ring index for each non-zero region, then diffuse values inward using deterministic rules. It even includes update logic:
- If the “base” cell is 2 → assign 8
- If base is even → add 2
- If base is odd → add 1
But this entire algorithm, however coherent, is misguided. It completely fails to match the true underlying transformation shown in the demonstrations.


Instead of reasoning about blue-bordered regions and their nested structure, the model generated a generic flood-fill algorithm based on distance from edge and adjacency to zeros. Even the feedback keeps refining the procedural approach identified earlier:
['Key lessons are to: • Precisely compute the ring index so that only',
'true boundary (or external zero) cells get index 0, ensuring that',
'inner cells receive higher indices for proper propagation. • Use a',
'reliable, consistent method for selecting the "base" value for',
'updates-ideally by considering all adjacent lower-ring cells or using',
'a deterministic order-and use an immutable copy of the original grid',
'for these lookups. • Apply the parity‐based update rules correctly so',
'that cells with ring index ≥ 1 get the specified value increments',
'(especially the special case when the base is 2) rather than remaining',
'unchanged. • Ensure that the update logic cascades inward, allowing',
'inner cells to correctly inherit and build upon values from outer',
'rings.']
So what went wrong?
- Topological, not visual. The model focused on connectivity and edge proximity, ignoring the visually defined regions.
- Procedural, not inferential. The logic was rigid and hand-crafted, not derived from patterns in the examples.
- Demonstration-agnostic. There’s no sign the model meaningfully incorporated the few-shot examples. It likely defaulted to a familiar pattern — spatial growth using layers.
This isn’t surprising. Text-only LLMs have no visual grounding. They tokenize the grid as symbolic input — rows of digits, not enclosed figures or nested patterns. As a result, their inductive biases lean toward symbolic or graph-like algorithms, not perceptual abstractions.
In this case, the model fell into a common trap: generating something plausible-sounding but wrong. It produced a spatial propagation scheme that might work for a diffusion task but not the one at hand. This highlights a key weakness in text-based few-shot prompting for abstract visual reasoning: the model’s “reasoning” is disconnected from perceptual understanding. It invents algorithms based on internal priors, not external cues.
When Reasoning Fails: Image Models Aren’t Magic Either
To improve generalization, I transitioned from purely text-based reasoning to image-based prompting, leveraging GPT-4.1’s multimodal capabilities through LangChain. This setup encoded input-output grid examples as base64 images, which were presented alongside a natural language prompt describing the task.
from langchain_core.messages import HumanMessage
import io
import base64
AGENT_MODEL = "gpt-4.1"
# Prompt for image based reasoning
PROMPT_REASON_IMG = """You are an expert at solving abstract reasoning tasks.
These are unique reasoning tasks with limited examples. You are given demonstration input-output 2D grids.
The colormap used is as follows:
{{
'Violet': 0,
'Indigo': 1,
'Blue': 2,
'Yellow': 3,
'Green': 4,
'Orange': 5,
'Red': 6,
'Golden': 7,
'Black': 8,
'White': 9
}}
Your goal is to develop a single algorithm that maps each input grid to its corresponding output grid.
A successful solution involves:
- Analyzing the demonstration examples carefully
- Identifying underlying visual or spatial patterns
- Formulating a general transformation rule
- Translating this rule into clear pseudocode
If this is not your first attempt, a summary of previous feedback is also provided. Review it carefully and incorporate it to improve your solution.
Test your algorithm against the demonstrations to ensure it works.
Return **only the algorithm pseudocode**, formatted as plain text. Do not explain it or add extra commentary.
"""
# If your array is 10x10 and you want each cell to be 20x20 pixels (cell_px), the image will be 200x200 pixels.
import numpy as np

# Convert matrix into image
def visualize_grid_fig(matrix, cmap=cmap, norm=norm, cell_px=20, show=False):
    if type(matrix) is not np.ndarray:
        matrix = np.array(matrix)
    h, w = matrix.shape[:2]
    figsize = (w * cell_px / 100, h * cell_px / 100)  # inches
    fig, ax = plt.subplots(figsize=figsize)
    ax.imshow(matrix, cmap=cmap, norm=norm)
    ax.axis('off')
    if show:
        plt.show()
    else:
        plt.close(fig)
    return fig
# encode image for model
def fig_to_base64(fig, dpi=100):
    buf = io.BytesIO()
    fig.savefig(buf, format='png', dpi=dpi, bbox_inches='tight')
    buf.seek(0)
    img_base64 = base64.b64encode(buf.read()).decode('utf-8')
    buf.close()
    return img_base64
# In the loop, replace the reasoning code with this
# reasoning with images
reason_message = [{"type": "text", "text": PROMPT_REASON_IMG}]
for i, example in enumerate(train_pairs):
    fig_in = visualize_grid_fig(example['input'])
    fig_out = visualize_grid_fig(example['output'])
    fig_in = fig_to_base64(fig_in)
    fig_out = fig_to_base64(fig_out)
    reason_message.append({"type": "text", "text": f"### Input grid {i+1}:"})
    reason_message.append({"type": "image_url", "image_url": {"url": f"data:image/png;base64,{fig_in}"}})
    reason_message.append({"type": "text", "text": f"### Output grid {i+1}:"})
    reason_message.append({"type": "image_url", "image_url": {"url": f"data:image/png;base64,{fig_out}"}})
reason_message.append({"type": "text", "text": f"### Feedback (summary of previous attempts): {attempt_history_summary}"})
reason_message = HumanMessage(content=reason_message)
res = AGENT_LLM.invoke([reason_message])
reasoning_template = res.content
The resulting pseudocode marked a clear step forward in expressiveness. The model was able to:
- Detect blue-bordered squares using visual features rather than purely symbolic structure;
- Apply rules based on square size and nesting depth to infer interior fill colors;
- Sort identified squares by size before filling, effectively avoiding overwrite conflicts.
Here’s the generated pseudocode:
1. Let grid be the input 2D array.
2. Create output_grid as a copy of grid.
3. Identify all blue-bordered squares in the grid:
   a. For each possible top-left corner (i, j):
      i. For each possible square size s (s ≥ 3, up to min(grid height, grid width)):
         - Check if the square of size s starting at (i, j) is fully within bounds.
         - Check if all *border* cells of this square are Blue (value = 2).
         - Check that the *interior* cells (not on the border) do not contain any Blue (2).
         - If all conditions are met, record the square as (i, j, s).
4. Sort the list of detected blue-bordered squares by size in ascending order (smallest first).
5. For each detected square (i, j, s), in sorted order:
   a. Determine the fill color:
      - If the square is the smallest (no other blue-bordered square is fully inside it), set fill color = Black (8).
      - If the square is the largest (no other blue-bordered square fully contains it), fill color =
      - If there are exactly 2 blue-bordered squares, set fill color = Green (4).
      - If there are three blue-bordered squares in the grid, fill color = Yellow (3).
      - If the square is nested (not smallest or largest), fill color = Black (8).
      - (More complex rules may generalize beyond these based on demonstrations.)
   b. Fill the interior of the square:
      For each cell (x, y) strictly inside the square (i+1 ≤ x < i+s−1) and (j+1 ≤ y < j+s−1):
      - If output_grid[x][y] is not Blue (2), set it to the chosen fill color.
6. Return output_grid.
Special notes:
- Never overwrite Blue (2) border pixels.
- When filling, later (larger) squares overwrite earlier (smaller) fills in overlapping regions.
- Only process valid blue-bordered squares (minimum size 3x3, complete border).
- If there are multiple disjoint blue-bordered squares, treat each independently for fill color assignment as above matching the demonstration logic.
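To make the detection step concrete, here is a small Python sketch of steps 3 and 4 (my own rendering of the pseudocode; the fill-color rules are left out because they are precisely the brittle part):
def find_blue_bordered_squares(grid, blue=2):
    # Find all axis-aligned squares (side >= 3) whose border cells are all Blue (2)
    # and whose interior contains no Blue, then sort them smallest-first.
    h, w = len(grid), len(grid[0])
    squares = []
    for i in range(h):
        for j in range(w):
            for s in range(3, min(h - i, w - j) + 1):
                border_ok = all(
                    grid[x][y] == blue
                    for x in range(i, i + s)
                    for y in range(j, j + s)
                    if x in (i, i + s - 1) or y in (j, j + s - 1)
                )
                interior_ok = all(
                    grid[x][y] != blue
                    for x in range(i + 1, i + s - 1)
                    for y in range(j + 1, j + s - 1)
                )
                if border_ok and interior_ok:
                    squares.append((i, j, s))
    return sorted(squares, key=lambda sq: sq[2])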
The model clearly exhibits structured reasoning. It develops an internal representation of nested geometric forms and attempts to apply rule-based transformations derived from examples.

However, despite these advances, the model still fails to generalize reliably. In novel configurations, it mis-assigns fill colors, falling back on brittle heuristics like size-based priority or rigid nesting assumptions. For instance, it might assume the largest square is always filled with yellow, even when that logic no longer holds in a new context. This failure reveals a deeper limitation: even with image input, the model does not “see” in the human sense. It doesn’t build a holistic perceptual representation of spatial relationships. Instead, it converts the image into symbolic patterns and applies deterministic procedures like flood-fill, sorting, or positional indexing.
In practice, this means the model reasons from internal abstractions, not perceptual grounding. It infers that “smaller squares get black,” or “fill based on size rank,” without fully understanding why those assignments occurred in the demonstrations. As a result, any deviation from the expected layout can cause it to misfire.
This suggests that while multimodal prompting extends the expressive range of the model, it doesn’t yet provide the kind of flexible, generalizable visual reasoning that humans display. These tasks may ultimately require stronger forms of program induction, meta-learning, or hybrid systems that integrate perceptual grouping with learned rules.
Conclusions
In this study, I explored whether large language models — both text-based and multimodal — can perform meta-learning from examples on abstract reasoning tasks. Specifically, I focused on a class of problems from the ARC dataset, where solutions require identifying visual patterns, learning transformations, and generalizing them to novel test inputs.
Through direct prompting experiments, I found that:
- Text-based models (e.g., o3-mini) often hallucinate plausible algorithms that are topologically or procedurally sound but entirely disconnected from the task’s visual logic. These models rely on symbolic reasoning over tokenized grids and default to familiar heuristics like flood-fill, ring propagation, or rule-based updates, regardless of the examples provided.
- Multimodal models (e.g., gpt-4.1 with image input) showed a clear improvement in pattern detection and relational reasoning. They successfully identified blue-bordered regions and adapted behaviors based on relative size or nesting. However, their generalization remained fragile: they still applied brittle rules, such as fixed size-based assignments, and failed in novel layouts that deviated from the demonstrations.
These findings suggest that, even with visual input, current LLMs do not “see” as humans do. They process images symbolically, not perceptually. Their reasoning is driven by internally constructed rules, not a flexible, visual understanding of shapes, hierarchy, or affordance.
The limitations I observed reinforce a central tension: few-shot prompting alone, even with images, is not sufficient for robust abstraction. True generalization likely requires:
- Program induction: inferring reusable, structured transformations from examples;
- Perceptual grounding: developing architectures that parse and manipulate visual scenes compositionally;
- Meta-learning architectures: building models that adapt their reasoning strategies dynamically rather than applying pre-learned heuristics;
Today’s LLMs are astonishing in their breadth, but they are still guessing based on priors, not learning to learn in the human sense. They lack a strong inductive bias for abstraction and transformation. ARC-style tasks expose this gap clearly: success requires more than pattern recognition; it requires reasoning from examples in a structured, compositional way. These results are not discouraging so much as clarifying: we now know where the ceiling is. And the next generation of models, with hybrid architectures, persistent memory, and explicit meta-learning capabilities, might finally break through it.