
Exploring Prompt Learning: Using English Feedback to Optimize LLM Systems

Reinforcement learning (RL) in AI model building has been a growing topic over the past few months. From DeepSeek models incorporating RL mechanics into their training processes to other success stories of RL-based improvement, “AI Twitter” has been ablaze.

As more agents get deployed, a question emerges: can reinforcement learning control systems be built only in prompts? After all, reinforcement learning is all about using real-world feedback to optimize toward a goal, traditionally by adjusting model weights. But prompts themselves are the primary interface for guiding large language models. 

We’ve been experimenting with a new approach to optimizing LLM prompts that we’re calling “Prompt Learning” (PL). Unlike traditional optimization methods that rely on numerical scores, PL uses natural language feedback to iteratively improve prompts. The roots of this approach are in the Voyager paper by Jim Fan’s team at NVIDIA. It is also alluded to by Andrej Karpathy in several recent tweets, where he argues prompt-centric learning will be a key technique. 

Despite these early inklings, to our knowledge no one has yet rigorously researched, characterized, and measured a full implementation of a reinforcement learning based approach to prompt tuning. That’s exactly what we set out to do. 

This implementation is inspired by an idea introduced in the original Voyager paper: the iterative prompting mechanism Voyager uses as the agent acquires and refines skills forms the basis for our prompt learning approach.

What Is Prompt Learning?

Prompt learning differs from MetaPrompt prompt optimization in a couple of major ways.

First and foremost, the error term is in English rather than a numeric score. This English error term provides feedback that is used directly to tune instructions: an explanation from an eval tells you exactly why the evaluation failed, and prompt learning then adds instructions to the system prompt to help fix the problem. The English error term lets us solve a set of problems that are unsolvable by current pure prompt optimization techniques.

Secondly, prompt learning is an online approach to managing your system instructions, designed to run continually against your prompt and tune instructions back into the context. In this way, LLM-based systems can assist with context engineering your system instructions.

Because the instructions live in the prompt context in English, they can be managed in English: dealing with competing instructions, expiring instructions, or human review of instructions. In our prompt learning meta-prompt we even allow keywords so that edits are made only to a specific instruction area of the prompt. In “weights”- and “gradient”-based prompt optimization approaches, this is nearly impossible.

This implementation of prompt learning uses evaluations, explanations, and annotations on runs of an application to automatically improve your prompt.
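
To make that loop concrete, here is a minimal sketch of a single prompt learning update. The function names, meta-prompt wording, and model name are illustrative assumptions (it uses the OpenAI Python client as a stand-in), not the exact implementation described in this post:

```python
from openai import OpenAI

client = OpenAI()

META_PROMPT = """You maintain the INSTRUCTIONS section of a system prompt.

Current instructions:
{instructions}

A recent run failed. Evaluator explanation / human annotation:
{critique}

Rewrite the instructions so this failure does not recur. Keep existing
instructions unless they conflict, and return only the updated instructions."""


def propose_instruction_update(instructions: str, critique: str) -> str:
    """Feed an English critique into the meta-prompt and return revised instructions."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name; any capable chat model works
        messages=[{"role": "user", "content": META_PROMPT.format(
            instructions=instructions, critique=critique)}],
    )
    return response.choices[0].message.content


# A single annotated failure is enough to change the prompt.
updated = propose_instruction_update(
    instructions="- Respond with valid JSON for the requested webpage.",
    critique="The generated section used an http:// image URL; all external assets must use https.",
)
print(updated)
```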

The results are promising: prompt learning can deliver significant improvements with only one-tenth to one-hundredth the number of labeled examples.

Let’s dive into the mechanics of prompt learning and examine exactly why it’s working.

What’s the Difference Between Reinforcement Learning and Prompt Learning?

Traditional reinforcement learning relies on using scores or errors to generate gradient error terms, which then update your original model. Each gradient error term pushes your model slightly closer to optimal performance.

Traditional RL (image created by author)

The key here is that you need many, many examples to align your model. Over time, these myriad examples push your model towards outputting the correct values across your possible inputs. It works by accumulating error gradients and nudging your model in a certain direction.

Image created by author

Reinforcement learning is a very powerful technique. But what if you don’t have thousands of examples? What if you have a complex set of goals and those goals don’t easily express as a score? Lastly, what if someone, an annotator or human expert, has relayed to you in English what the problem actually is and how to fix it?

Prompt learning allows you to make powerful changes using individual examples. Instead of a gradient error term calculated for each example, you calculate a full-text explanation of why an example was scored a certain way. These explanations are then fed back into the optimization flow and incorporated into the prompt.

The key ideas are:

  1. The “error”, an eval explanation or annotation, is in English
  2. The modifications that change your actions are made in the prompt context, not the weights
  3. The reward function is an evaluation or human annotation
  4. The instructions are maintained and managed in the prompt context, allowing instruction management
The above shows an example of a human annotation and a metaprompt added instruction (image created by author)
The above shows an example of an evaluation and a metaprompt created instruction to fix (image created by author)

Our research data shows examples where well-known optimization libraries fall short today, namely where evals with critiques or annotations contain information, not available in the training set, on how to fix a failure. There is no easy way to take information-rich feedback in English and feed it back into a gradient update, and in general you might not want to do gradient updates at all. Having all of your instructions in English allows you to deal with things that are not easy to do in “weight land,” such as what to do with competing instructions, removal of instructions, compaction of instructions, and managing when to expire an instruction: essentially what we call instruction management.

One other advantage of prompt learning over gradient-based updates is that instead of needing tens of thousands of examples, you can change your system prompt with a single annotated example.
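
As an illustration of what instruction management could look like in practice, here is a minimal sketch; the data model, field names, and policies below are assumptions for illustration rather than a description of our implementation:

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Instruction:
    text: str
    added: datetime = field(default_factory=datetime.now)
    ttl: timedelta | None = None   # optional expiry window
    approved: bool = False         # human review gate before the instruction goes live


class InstructionSection:
    """The editable, English-only instruction block of a system prompt."""

    def __init__(self) -> None:
        self.items: list[Instruction] = []

    def add(self, text: str) -> None:
        # Skip near-duplicates instead of accumulating competing instructions.
        if not any(text.strip().lower() == i.text.strip().lower() for i in self.items):
            self.items.append(Instruction(text))

    def render(self) -> str:
        # Only approved, unexpired instructions are written back into the prompt.
        now = datetime.now()
        live = [i for i in self.items
                if i.approved and (i.ttl is None or i.added + i.ttl > now)]
        return "\n".join(f"- {i.text}" for i in live)


section = InstructionSection()
section.add("All external asset links must use https.")
section.items[0].approved = True
print(section.render())
```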

Diagram by author

How Is This Different from Prompt Optimization?

There are a lot of techniques out there for prompt optimization. Prompt optimization applies more traditional machine learning train and test approaches to optimizing prompts by gathering examples and attempting to find similarities with those examples.

The seed of the failure of all prompt optimization approaches is their focus on scores as the means of propagating failures. Not every failure expresses itself easily as a number, and a numeric value hides the reason for the failure.

Using a score as your main approach for propagating a failure disconnects the optimization fix from the reason it failed.
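
To make this concrete, here is a toy contrast (the record shapes and values are made up for illustration): a numeric score tells you only that an example failed, while an English explanation carries the reason, which can be distilled directly into a new instruction.

```python
# Illustrative feedback records for the same failed example.
score_feedback = {"example_id": 17, "score": 0.0}  # says that it failed, not why

explanation_feedback = {
    "example_id": 17,
    "label": "fail",
    "explanation": "The hero section's image link uses http://; all external assets must use https.",
}

# Only the second record carries enough information to derive a new instruction:
new_instruction = "All external asset links must use https."
```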

| Dimension | Prompt Learning | Reinforcement Learning | Prompt Optimization |
|---|---|---|---|
| Feedback Mechanism | Evaluation-based English explanations and human annotations | Numeric rewards | Numeric scores |
| Optimization | Metaprompt defines optimization approach | Updating model based on gradients | Varied, but some support metaprompts |
| Prompt Control | Can optimize only a specific section of the prompt (instruction section) | N/A | Typically optimizes the whole prompt |
| Online Setup | Designed to be always on, with human control of “prompt change” acceptance or total automation | Designed to be used online | Normally one-off |

How Does the Optimization Loop Work?

In many real-world use cases we tested with customers on real data, a single optimization run with a single-shot output worked great. In cases where multiple loops over the optimization are needed, the English explanation (or critique) output of an evaluator drives the improvement.

Image by author

The English explanation (critique) is an important feature of our evaluation library: generating an explanation allows the results to be used in a feedback loop.

In our testing, the iterative loop became more important as the model was required to add more instructions back into the context window to fix the prompt. In cases where only 1-10 instructions needed to be added, a single meta-prompt improvement loop was sufficient.
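
A minimal sketch of that iterative loop is below. It assumes the propose_instruction_update meta-prompt call from the earlier sketch and a caller-supplied evaluate function (LLM-as-a-judge or human review) that returns a pass/fail label plus an English explanation; both names are illustrative.

```python
from typing import Callable, Tuple

EvalResult = Tuple[bool, str]  # (passed, English explanation)


def optimize(instructions: str,
             examples: list[dict],
             evaluate: Callable[[str, dict], EvalResult],
             max_loops: int = 5) -> str:
    """Repeat eval -> critique -> meta-prompt update until all evals pass or max_loops is hit."""
    for _ in range(max_loops):
        critiques = []
        for example in examples:
            passed, explanation = evaluate(instructions, example)  # LLM-as-a-judge or human review
            if not passed:
                critiques.append(explanation)
        if not critiques:
            break  # every example passed; for small rulesets a single loop is often enough
        # propose_instruction_update is the meta-prompt call from the earlier sketch
        instructions = propose_instruction_update(instructions, "\n".join(critiques))
    return instructions
```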

How Did We Test Prompt Learning?

We ran a series of optimization experiments using prompt learning in order to benchmark its efficacy. To date, this has been run across a sizable production set of AI application and agent use cases.

For our demo data application, we chose a JSON generation problem where models had to generate JSON for a webpage based on natural language prompts.

We additionally generated a set of latent rules that the responses needed to follow. Things like:

  1. Every section needs a type value from a predefined list
  2. All images must include alt text
  3. All external asset links must use https

These rules were implicitly represented in feedback and explanations attached to a set of traces of our application.
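
As a stand-in for that eval stage (the real setup used LLM-as-a-judge plus human review), here is a toy checker for a few of these rules; the JSON shape and the allowed-type list are assumptions for illustration. The point is that each failure produces an English explanation rather than only a score:

```python
import json


def check_rules(output_json: str) -> list[str]:
    """Return English explanations for every latent rule the generated page violates."""
    explanations = []
    page = json.loads(output_json)
    for section in page.get("sections", []):
        if section.get("type") not in {"hero", "gallery", "footer"}:  # assumed allowed-type list
            explanations.append(f"Section type '{section.get('type')}' is not in the predefined list.")
        for image in section.get("images", []):
            if not image.get("alt"):
                explanations.append("An image is missing alt text.")
            if image.get("src", "").startswith("http://"):
                explanations.append(f"Asset link {image['src']} must use https.")
    return explanations


print(check_rules('{"sections": [{"type": "hero", "images": [{"src": "http://cdn.example.com/a.png"}]}]}'))
# ['An image is missing alt text.', 'Asset link http://cdn.example.com/a.png must use https.']
```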

We designed this test to mimic a typical evaluation cycle of an agent. Evaluation was done using a mixture of LLM-as-a-judge techniques with human review, again to mimic real world patterns.

All of this data (the application traces, feedback, and explanations) was then fed into the optimization stage.

To perform the optimization itself, we used a modified version of meta-prompting that we later dubbed prompt learning.

Diagram by author

Each prompt optimization loop was done with a single LLM call, using 100 examples.
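
A sketch of how a single optimization call might pack those examples together is below; the batch size and formatting are assumptions about the general pattern, not the exact meta-prompt we used:

```python
def build_meta_prompt(instructions: str, traces: list[dict]) -> str:
    """Pack annotated traces (input, output, critique) into a single meta-prompt string."""
    blocks = [
        f"INPUT: {t['input']}\nOUTPUT: {t['output']}\nCRITIQUE: {t['critique']}"
        for t in traces[:100]  # one optimization loop = one LLM call over the example batch
    ]
    return (
        "Current instructions:\n" + instructions
        + "\n\nAnnotated examples:\n" + "\n---\n".join(blocks)
        + "\n\nRewrite the instructions so the critiques above no longer apply."
    )
```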

How Does Prompt Learning Perform?

Prompt learning is able to uncover and address the majority of latent rules within the 5-25 ruleset range. As more rules are introduced, however, a single optimization loop is no longer enough, and additional loops are needed to maintain performance.

| Ruleset size | Accuracy: 1-Loop | Accuracy: 5-Loop | Average rules followed: 1-Loop | Average rules followed: 5-Loop |
|---|---|---|---|---|
| 10 | 15% | 100% | 71% | 100% |
| 50 | 0% | 70% | 35% | 83% |
| 100 | 0% | 55% | 14% | 68% |

The more rules the optimizer system has to learn, the more optimization iterations it takes to learn them.

Conclusion

Prompt learning presents a compelling approach for continuous improvement of AI applications, and its ability to drive results with relatively few examples makes it suitable for both early-stage and production applications.

Appendix 

Literature Review

There have been a number of relevant approaches worth noting.

Comparing Prompt Learning To PromptAgent

Here is a comparison between prompt learning and PromptAgent. Monte Carlo tree search (MCTS)-based search for optimal prompts, like that in PromptAgent, could be combined with prompt learning in future work.   

PromptAgent (ICLR ’24) vs. Prompt Learning (PL)

| Dimension | PromptAgent | Prompt Learning (PL) |
|---|---|---|
| Objective | Find a single “expert-level” prompt that maximises a numeric task score on a dev set. | Continuously maintain a production prompt so that it self-heals when evals or users uncover new failure modes. |
| Optimizer | MCTS over the space of prompt edits; each node = a prompt, each edge = an edit derived from error feedback (arXiv). | A meta-prompt controller reads the latest English critique and decides how to mutate an Instruction block (add, merge, rewrite, expire). No roll-outs or search tree. |
| Update granularity | Edits the entire task prompt during search; final prompt is frozen after the run. | Edits only the Instruction section inside a fenced region; other parts of the system prompt stay intact. |
| Use of critiques | Generates “constructive error feedback” to guide the next MCTS action, but the literal text is not kept in the final prompt (arXiv). | Primary signal. English critique (from LLM judge or human) feeds the meta-prompt; the controller extracts intent and rewrites/merges instructions. The critique itself is not stored, but its meaning is distilled into the instruction set. |
| Conflict / lifecycle management | None once search ends; the prompt can contain redundant or stale rules that an operator must prune manually. | Built-in: the controller can deduplicate, version, or expire instructions and supports human approval gates before applying changes. |
| Online vs. offline | Offline: heavy search (hundreds to thousands of roll-outs), then deployment. | Online: one extra LLM call whenever a failure appears; designed to run forever alongside the app. |
| Data requirement | Needs a moderate-sized scored dev set to evaluate roll-outs. | Works with single examples because each explanation is information-rich; leverages existing eval traces or human annotations. |
| Compute cost | Front-loaded (search); negligible at inference. | Minimal upfront, <1 extra call per optimisation; the prompt grows by only the net instruction text. |
| Interpretability | Final prompt readable, but the reasoning path is hidden in search logs. | Full audit trail: every instruction edit is plain English; easy to diff and roll back. |
| Typical sweet spot | Bootstrapping new tasks where you can afford an offline optimisation pass. | Long-lived agents that must obey evolving policy and domain rules with scarce labelled data. |
