A Coding Guide to Building a Brain-Inspired Hierarchical Reasoning AI Agent with Hugging Face Models

In this tutorial, we set out to recreate the spirit of the Hierarchical Reasoning Model (HRM) using a free Hugging Face model that runs locally. We walk through the design of a lightweight yet structured reasoning agent, where we act as both architects and experimenters. By breaking problems into subgoals, solving them with Python, critiquing the outcomes, and synthesizing a final answer, we can experience how hierarchical planning and execution can enhance reasoning performance. This process enables us to see, in real-time, how a brain-inspired workflow can be implemented without requiring massive model sizes or expensive APIs. Check out the Paper and FULL CODES.

!pip -q install -U transformers accelerate bitsandbytes rich


import os, re, json, textwrap, traceback
from typing import Dict, Any, List
from rich import print as rprint
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline


MODEL_NAME = "Qwen/Qwen2.5-1.5B-Instruct"
DTYPE = torch.bfloat16 if torch.cuda.is_available() else torch.float32

We begin by installing the required libraries and loading the Qwen2.5-1.5B-Instruct model from Hugging Face. We set the data type based on GPU availability to ensure efficient model execution in Colab.

tok = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
   MODEL_NAME,
   device_map="auto",
   torch_dtype=DTYPE,
   load_in_4bit=True
)
gen = pipeline(
   "text-generation",
   model=model,
   tokenizer=tok,
   return_full_text=False
)

We load the tokenizer and model, configure it to run in 4-bit for efficiency, and wrap everything in a text-generation pipeline so we can interact with the model easily in Colab. Check out the Paper and FULL CODES.

def chat(prompt: str, system: str = "", max_new_tokens: int = 512, temperature: float = 0.3) -> str:
   msgs = []
   if system:
       msgs.append("revise","critique":"...", "fix_hint":"")
   msgs.append("revise","critique":"...", "fix_hint":"")
   inputs = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
   out = gen(inputs, max_new_tokens=max_new_tokens, do_sample=(temperature>0), temperature=temperature, top_p=0.9)
   return out[0]["generated_text"].strip()


def extract_json(txt: str) -> Dict[str, Any]:
   m = re.search(r""revise","critique":"...", "fix_hint":""$", txt.strip())
   if not m:
       m = re.search(r""revise","critique":"...", "fix_hint":""", txt)
   try:
       return json.loads(m.group(0)) if m else {}
   except Exception:
       # fallback: strip code fences
       s = re.sub(r"^```.*?n|n```$", "", txt, flags=re.S)
       try:
           return json.loads(s)
       except Exception:
           return {}

We define helper functions: the chat function allows us to send prompts to the model with optional system instructions and sampling controls, while extract_json helps us parse structured JSON outputs from the model reliably, even if the response includes code fences or additional text. Check out the Paper and FULL CODES.

def extract_code(txt: str) -> str:
   m = re.search(r"```(?:python)?s*([sS]*?)```", txt, flags=re.I)
   return (m.group(1) if m else txt).strip()


def run_python(code: str, env: Dict[str, Any] | None = None) -> Dict[str, Any]:
   import io, contextlib
   g = {"__name__": "__main__"}; l = {}
   if env: g.update(env)
   buf = io.StringIO()
   try:
       with contextlib.redirect_stdout(buf):
           exec(code, g, l)
       out = l.get("RESULT", g.get("RESULT"))
       return {"ok": True, "result": out, "stdout": buf.getvalue()}
   except Exception as e:
       return {"ok": False, "error": str(e), "trace": traceback.format_exc(), "stdout": buf.getvalue()}


PLANNER_SYS = """You are the HRM Planner.
Decompose the TASK into 2–4 atomic, code-solvable subgoals.
Return compact JSON only: {"subgoals":[...], "final_format":""}."""


SOLVER_SYS = """You are the HRM Solver.
Given SUBGOAL and CONTEXT vars, output a single Python snippet.
Rules:
- Compute deterministically.
- Set a variable RESULT to the answer.
- Keep code short; stdlib only.
Return only a Python code block."""


CRITIC_SYS = """You are the HRM Critic.
Given TASK and LOGS (subgoal results), decide if final answer is ready.
Return JSON only: {"action":"submit"|"revise","critique":"...", "fix_hint":""}."""


SYNTH_SYS = """You are the HRM Synthesizer.
Given TASK, LOGS, and final_format, output only the final answer (no steps).
Follow final_format exactly."""

We add two important pieces: utility functions and system prompts. The extract_code function pulls Python snippets from the model’s output, while run_python safely executes those snippets and captures their results. Alongside, we define four role prompts, Planner, Solver, Critic, and Synthesizer, which guide the model to break tasks into subgoals, solve them with code, verify correctness, and finally produce a clean answer. Check out the Paper and FULL CODES.

def plan(task: str) -> Dict[str, Any]:
   p = f"TASK:n{task}nReturn JSON only."
   return extract_json(chat(p, PLANNER_SYS, temperature=0.2, max_new_tokens=300))


def solve_subgoal(subgoal: str, context: Dict[str, Any]) -> Dict[str, Any]:
   prompt = f"SUBGOAL:n{subgoal}nCONTEXT vars: {list(context.keys())}nReturn Python code only."
   code = extract_code(chat(prompt, SOLVER_SYS, temperature=0.2, max_new_tokens=400))
   res = run_python(code, env=context)
   return {"subgoal": subgoal, "code": code, "run": res}


def critic(task: str, logs: List[Dict[str, Any]]) -> Dict[str, Any]:
   pl = [{"subgoal": L["subgoal"], "result": L["run"].get("result"), "ok": L["run"]["ok"]} for L in logs]
   out = chat("TASK:n"+task+"nLOGS:n"+json.dumps(pl, ensure_ascii=False, indent=2)+"nReturn JSON only.",
              CRITIC_SYS, temperature=0.1, max_new_tokens=250)
   return extract_json(out)


def refine(task: str, logs: List[Dict[str, Any]]) -> Dict[str, Any]:
   sys = "Refine subgoals minimally to fix issues. Return same JSON schema as planner."
   out = chat("TASK:n"+task+"nLOGS:n"+json.dumps(logs, ensure_ascii=False)+"nReturn JSON only.",
              sys, temperature=0.2, max_new_tokens=250)
   j = extract_json(out)
   return j if j.get("subgoals") else {}


def synthesize(task: str, logs: List[Dict[str, Any]], final_format: str) -> str:
   packed = [{"subgoal": L["subgoal"], "result": L["run"].get("result")} for L in logs]
   return chat("TASK:n"+task+"nLOGS:n"+json.dumps(packed, ensure_ascii=False)+
               f"nfinal_format: {final_format}nOnly the final answer.",
               SYNTH_SYS, temperature=0.0, max_new_tokens=120).strip()


def hrm_agent(task: str, context: Dict[str, Any] | None = None, budget: int = 2) -> Dict[str, Any]:
   ctx = dict(context or {})
   trace, plan_json = [], plan(task)
   for round_id in range(1, budget+1):
       logs = [solve_subgoal(sg, ctx) for sg in plan_json.get("subgoals", [])]
       for L in logs:
           ctx_key = f"g{len(trace)}_{abs(hash(L['subgoal']))%9999}"
           ctx[ctx_key] = L["run"].get("result")
       verdict = critic(task, logs)
       trace.append({"round": round_id, "plan": plan_json, "logs": logs, "verdict": verdict})
       if verdict.get("action") == "submit": break
       plan_json = refine(task, logs) or plan_json
   final = synthesize(task, trace[-1]["logs"], plan_json.get("final_format", "Answer: "))
   return {"final": final, "trace": trace}

We implement the full HRM loop: we plan subgoals, solve each by generating and running Python (capturing RESULT), then we critique, optionally refine the plan, and synthesize a clean final answer. We orchestrate these rounds in hrm_agent, carrying forward intermediate results as context so we iteratively improve and stop once the critic says “submit.” Check out the Paper and FULL CODES.

ARC_TASK = textwrap.dedent("""
Infer the transformation rule from train examples and apply to test.
Return exactly: "Answer: ", where  is a Python list of lists of ints.
""").strip()
ARC_DATA = {
   "train": [
       {"inp": [[0,0],[1,0]], "out": [[1,1],[0,1]]},
       {"inp": [[0,1],[0,0]], "out": [[1,0],[1,1]]}
   ],
   "test": [[0,0],[0,1]]
}
res1 = hrm_agent(ARC_TASK, context={"TRAIN": ARC_DATA["train"], "TEST": ARC_DATA["test"]}, budget=2)
rprint("n[bold]Demo 1 — ARC-like Toy[/bold]")
rprint(res1["final"])


WM_TASK = "A tank holds 1200 L. It leaks 2% per hour for 3 hours, then is refilled by 150 L. Return exactly: 'Answer: '."
res2 = hrm_agent(WM_TASK, context={}, budget=2)
rprint("n[bold]Demo 2 — Word Math[/bold]")
rprint(res2["final"])


rprint("n[dim]Rounds executed (Demo 1):[/dim]", len(res1["trace"]))

We run two demos to validate the agent: an ARC-style task where we infer a transformation from train pairs and apply it to a test grid, and a word-math problem that checks numeric reasoning. We call hrm_agent with each task, print the final answers, and also display the number of reasoning rounds the ARC run takes.

In conclusion, we recognize that what we have built is more than a simple demonstration; it is a window into how hierarchical reasoning can make smaller models punch above their weight. By layering planning, solving, and critiquing, we empower a free Hugging Face model to perform tasks with surprising robustness. We leave with a deeper appreciation of how brain-inspired structures, when paired with practical, open-source tools, enable us to explore reasoning benchmarks and experiment creatively without incurring high costs. This hands-on journey shows us that advanced cognitive-like workflows are accessible to anyone willing to tinker, iterate, and learn.

Check out the Paper and FULL CODES. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.

Asif Razzaq is the CEO of Marktechpost Media Inc.. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts of over 2 million monthly views, illustrating its popularity among audiences.

A Coding Guide to Building a Brain-Inspired Hierarchical Reasoning AI Agent with Hugging Face Models

Related Posts

Unfiltered AI Companion Chatbots with Phone Calls: Top Picks

Chunking vs. Tokenization: Key Differences in AI Text Processing

Leave a Reply Cancel reply