
From Equal Weights to Smart Weights: OTPO’s Approach to Better LLM Alignment

Context

Large language models (LLMs) have evolved from basic search tools into AI assistants that code, write, and research. They are now accessible through smartphone apps and internet APIs, putting powerful AI at everyone’s fingertips, and they are becoming an integral part of our daily lives. People use AI assistants for relationship advice, for fact-checking to form opinions (even though the assistants clearly state they can make mistakes), for diet plans, and for choosing their next holiday destination.

As more and more powerful models are launched, the question of trust arises, and models are scrutinized to make sure the responses they produce are trustworthy and aligned with human values. These are not new questions. Traditionally, models are fine-tuned on human preference data (which usually contains an input, a chosen answer, and a rejected answer) before being launched for public use. Model alignment and safety have been major areas of research, and multiple algorithms have been developed to train models for alignment. Among these alignment training algorithms, the most popular is Direct Preference Optimization (DPO), thanks to its simplicity and efficiency.

But DPO has a fundamental limitation. When calculating the likelihood of a response, it gives equal weight to every word or token in the response, even though humans naturally give more importance to meaningful words. For example, consider the following user interaction with an LLM.

User: What’s the capital of France?
LLM: The capital of France is Paris, and it’s a beautiful city with many attractions.

In this interaction, humans primarily care about the accuracy of “Paris” rather than the stylistic flourishes, yet standard DPO gives equal weight to every token, allowing less relevant content to dilute the learning signal.

There have been multiple attempts to fix DPO’s problems; algorithms like SimPO and SamPO were introduced to address different issues. In this post, we look at another algorithm, published in May 2025: “Optimal Transport-Based Token Weighting scheme for Enhanced Preference Optimization” (OTPO). This post explains the core ideas behind that work and builds a foundation for understanding LLM alignment with human preferences.

Why Equal Token Weighting Fails

To understand why token weighting matters, we first need to examine how DPO actually processes tokens. Typically, a model is pre-trained on trillions of tokens, then supervised fine-tuned, and then trained further with DPO on human preference data to align it with human preferences before being released to the public.
DPO operates by computing log-likelihood differences between chosen and rejected responses at the token level. For each training example with a chosen response y_w and a rejected response y_l, DPO calculates its objective value. The core of DPO lies in its loss function:

Image from DPO paper

Here π_θ is the model being optimized (the policy), π_ref is a frozen reference model, β is a scaling coefficient, σ is the sigmoid function, and π(y|x) denotes the probability a model assigns to response y given user input x.
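Written out in plain notation, that loss is the standard DPO objective from the paper:

L_DPO(π_θ; π_ref) = −E_(x, y_w, y_l)~D [ log σ( β·log( π_θ(y_w|x) / π_ref(y_w|x) ) − β·log( π_θ(y_l|x) / π_ref(y_l|x) ) ) ]

Intuitively, the loss pushes the policy to put a larger β-scaled log-ratio (the implicit reward) on the chosen response y_w than on the rejected response y_l.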

The log-probability log π(y|x) breaks down into token-level computations. For a chosen response with tokens [t₁, t₂, ..., tₙ], the log probability becomes:

log π(y|x) = Σᵢ log π(tᵢ | x, t₁…tᵢ₋₁)

Each token contributes its individual log probability to the overall sequence probability, and there is no mechanism to weight important content more than filler. Let’s look at an example of preference data. 

Input: What is the capital of France?
Chosen: The capital of France is Paris.
Rejected: The capital of France is Italy, which is actually incorrect.

DPO computes log probabilities for every token equally.
Chosen: log P("The") + log P("capital") + log P("of") + log P("France") + log P("is") + log P("Paris") + log P(".")

Rejected: log P("The") + log P("capital") + ... + log P("Italy") + ... + log P("incorrect") + log P(".")
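To make the uniform weighting concrete, here is a minimal sketch of how such a sequence log-probability is assembled from per-token log-probabilities, assuming a Hugging Face-style causal LM (the model name, prompt handling, and helper function below are illustrative, not taken from the paper or its code):

import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # any causal LM works the same way; gpt2 is just small and convenient
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

def sequence_logprob(prompt: str, response: str) -> torch.Tensor:
    # Sum of per-token log-probabilities of `response` given `prompt`,
    # with every token weighted equally; this is the quantity DPO works with.
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    full_ids = tok(prompt + response, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits            # (1, seq_len, vocab)
    logprobs = F.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]                      # next-token targets
    token_lp = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Keep only the response tokens; each one contributes with equal weight.
    start = prompt_ids.shape[1] - 1
    return token_lp[:, start:].sum()

chosen_lp = sequence_logprob("What is the capital of France? ",
                             "The capital of France is Paris.")
rejected_lp = sequence_logprob("What is the capital of France? ",
                               "The capital of France is Italy, which is actually incorrect.")
print(chosen_lp, rejected_lp)

(Tokenizing the prompt and the concatenation separately is a simplification; boundary tokens can differ, but it keeps the sketch short.)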

The critical factual difference lies in “Paris” vs “Italy,” but DPO gives equal weight to articles, prepositions, and the factually crucial tokens. This uniform token treatment creates a mismatch between what the optimization focuses on and what humans actually care about.

The model receives an equal learning signal from semantically crucial tokens (“Paris”) and inconsequential ones (“which”, “actually”). This leads to the verbosity trap: longer sequences accumulate more log-likelihood mass through sheer token count, so DPO can inadvertently reward verbosity over quality.

When semantically crucial tokens get averaged with stylistic ones, the learning signals become unreliable, leading to suboptimal preference learning. These problems can be solved if we have a better way to give more weight to relevant tokens when calculating the probability of the response. That’s exactly what OTPO does.

Optimal Transport-Based Token Weighting (OTPO)

Now that we understand DPO’s token weighting problem, let’s see how OTPO solves it using optimal transport theory. OTPO views preference optimization as a transport problem: how much effort does it take to transform one response into another?

The key question is: what is the minimum effort needed to change “The capital of France is Paris” into “The capital of France is Italy”? Most tokens remain the same, but “Paris” → “Italy” requires a significant semantic transformation, since they are completely different concepts.

OTPO formulates this as an optimal transport problem where sources are tokens in the chosen response, targets are tokens in the rejected response, and transport costs reflect the semantic distance between token pairs. Semantically similar tokens (like “Paris” and “London”) have low transport costs, while distant tokens (like “Paris” and “apple”) have high costs.

The algorithm computes an optimal transport solution that tells us how to move probability mass between responses with minimal total cost. Token pairs that participate heavily in this transport, especially those requiring expensive semantic transformations, receive higher weights in the final loss calculation. This means OTPO automatically focuses learning on the tokens that matter most for human preferences, solving DPO’s equal weighting problem.

Math behind OTPO

Now let’s dive into the mathematical foundation of OTPO. The algorithm has three main components: constructing a cost matrix, solving the optimal transport problem, and computing weighted token losses.

Step 1: Cost Matrix Construction

OTPO starts by building a cost matrix M that measures the semantic distance between every token pair. For the i-th token in the chosen (w) response and the j-th token in the rejected (l) response, the cost is

M[i][j] = ‖ h[w][i] − h[l][j] ‖²

where h[w][i] and h[l][j] are the last-layer hidden representations of the tokens from the model. This squared Euclidean distance captures semantic dissimilarity: similar tokens like “Paris” and “London” have low cost, while distant tokens like “Paris” and “apple” have high cost.
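A minimal sketch of this step, assuming PyTorch, with random tensors standing in for the real last-layer hidden states:

import torch

# Toy stand-ins: n_w chosen tokens and n_l rejected tokens with hidden size d.
# In OTPO these would be the model's last-layer hidden states for each token.
n_w, n_l, d = 7, 10, 4096
h_w = torch.randn(n_w, d)
h_l = torch.randn(n_l, d)

# Squared Euclidean distance between every (chosen, rejected) token pair.
M = torch.cdist(h_w, h_l, p=2) ** 2   # shape: (n_w, n_l)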

Step 2: Optimal Transport Problem

OTPO formulates token weighting as an unbalanced optimal transport optimization:

Image from OTPO paper

Here Γ is the transport plan (the object we are solving for) that aligns tokens between the chosen and rejected responses, Ω controls the entropy regularization, and the KL terms keep the marginal distributions of Γ close to the uniform weights that naive DPO would use. The solution Γ* tells us how to optimally transport probability mass between chosen and rejected tokens.
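For intuition, an entropy-regularized unbalanced OT plan of this kind can be computed with the POT (Python Optimal Transport) library; the sketch below reuses the cost matrix M from the previous snippet, and the reg / reg_m values are illustrative rather than the paper’s settings:

import numpy as np
import ot  # POT: pip install pot

M_np = M.detach().cpu().numpy()
n_w, n_l = M_np.shape

# Uniform marginals, mirroring naive DPO's equal token weights.
a = np.full(n_w, 1.0 / n_w)
b = np.full(n_l, 1.0 / n_l)

# Unbalanced Sinkhorn: the KL penalty (reg_m) only keeps the plan's marginals
# close to uniform instead of enforcing them exactly; reg is the entropy term.
gamma = ot.unbalanced.sinkhorn_unbalanced(a, b, M_np, reg=0.05, reg_m=1.0)
# gamma has shape (n_w, n_l): mass transported between each token pair.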

Step 3: Computing Token Weights

From the optimal transport solution, we derive token-level weights by summing along dimensions:

Image from OTPO paper

Here, Γ*(i, j) represents the transport mass assigned to the token pair (i, j) from the chosen (w) and rejected (l) responses; summing over one dimension yields a weight for each chosen token, and summing over the other yields a weight for each rejected token. These weights then replace DPO’s uniform weighting, giving a weighted reward difference:

Image from OTPO paper
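As a rough sketch of how those weights could be derived and applied (continuing from the snippets above; the per-token log-ratios here are random placeholders rather than real model outputs, and this follows the spirit of the weighting scheme rather than the paper’s exact implementation):

import torch
import torch.nn.functional as F

gamma_t = torch.from_numpy(gamma)     # transport plan Γ* from Step 2

# Per-token weights: total transport mass each token participates in.
w_chosen = gamma_t.sum(dim=1)         # shape (n_w,), one weight per chosen token
w_rejected = gamma_t.sum(dim=0)       # shape (n_l,), one weight per rejected token

# Placeholder per-token log-ratios log(π_θ / π_ref) for each response;
# in practice these come from the policy and reference models.
logratio_w = torch.randn(n_w)
logratio_l = torch.randn(n_l)

# Weighted reward difference that replaces DPO's uniform token sum,
# plugged into the usual log-sigmoid preference loss.
beta = 0.1                            # illustrative scaling value
reward_diff = beta * ((w_chosen * logratio_w).sum() - (w_rejected * logratio_l).sum())
loss = -F.logsigmoid(reward_diff)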

Experiment Results and Limitations

OTPO was tested on a variety of tasks, but in a controlled environment. Applied to summarization, it showed roughly an 8.5% improvement over other methods. When tested for length bias on the UltraFeedback dataset with smaller models such as Llama-3-8B, OTPO produced shorter responses. These initial tests provide evidence that OTPO helps reduce verbosity and improves the quality of responses, making them more likely to be chosen by humans.

The evaluation was not exhaustive enough to report accuracy numbers across domains, and results were mixed on some datasets. OTPO also adds computational overhead, since the cost matrix and transport plan must be computed for every training pair. Finally, response quality was judged by an LLM and then manually spot-checked by a few people; such evaluations are useful, but they depend heavily on the reviewers, who may be biased toward certain datasets.

Conclusion

LLM alignment has been a major topic of research, and OTPO offers promising results in a controlled environment. While the approach is not perfect, the introduction of weighted token selection lays the groundwork for more fine-grained preference modeling in alignment tasks.

References:

  1. Direct Preference Optimization (DPO). https://arxiv.org/pdf/2305.18290
  2. Optimal Transport-Based Token Weighting scheme for Enhanced Preference Optimization (OTPO). https://arxiv.org/pdf/2505.18720
  3. Eliminating Biased Length Reliance of Direct Preference Optimization (SamPO). https://arxiv.org/pdf/2406.10957
