Overview
Paradoxes are not just optical illusions or mind-bending puzzles. They can also be logical, causing initial observations to fall apart under closer investigation. In data science, paradoxes arise when we take numbers at face value without looking into the context behind them. You can have the sharpest visuals and still walk away with the wrong story.
In this article, we discuss three logical paradoxes that serve as cautionary tales for anyone who interprets data too quickly, without applying context. We explore how paradoxes arise in Data Science & Business Intelligence (BI) use cases and then extend the insights to Retrieval-Augmented Generation (RAG) systems, where similar paradoxes can undermine the quality of both the prompt provided and the model’s output.
Simpson’s Paradox in Business Intelligence
Simpson’s Paradox describes a scenario where trends reverse when data is aggregated: the pattern you observe in subgroups flips once you combine the numbers and analyze them together. Let’s assume that we are analyzing the sales of four locations of a popular ice cream chain. When sales are analyzed location by location, chocolate comes out as the preferred flavor at three of the four outlets. But when the sales are added up, the combined results suggest that vanilla is preferred the most. This trend reversal is Simpson’s Paradox at work. We use the fictitious data below to demonstrate it.
| Location | Chocolate | Vanilla | Total Customers | Chocolate % | Vanilla % | Winner |
| --- | --- | --- | --- | --- | --- | --- |
| Suburb A | 15 | 5 | 20 | 75.0% | 25.0% | Chocolate |
| City B | 33 | 27 | 60 | 55.0% | 45.0% | Chocolate |
| Mall | 2080 | 1920 | 4000 | 52.0% | 48.0% | Chocolate |
| Airport | 1440 | 2160 | 3600 | 40.0% | 60.0% | Vanilla |
| Total | 3568 | 4112 | 7680 | 46.5% | 53.5% | Vanilla! |
Below is a visual illustration.
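To see the flip in code, here is a minimal sketch (assuming pandas is installed) that reproduces the fictitious table above: chocolate leads in three of the four locations, yet the aggregated totals favor vanilla.

```python
import pandas as pd

# Fictitious sales data from the table above
sales = pd.DataFrame({
    "location": ["Suburb A", "City B", "Mall", "Airport"],
    "chocolate": [15, 33, 2080, 1440],
    "vanilla": [5, 27, 1920, 2160],
})

# Per-location preference: chocolate leads in three of the four subgroups
sales["chocolate_share"] = sales["chocolate"] / (sales["chocolate"] + sales["vanilla"])
print(sales[["location", "chocolate_share"]])

# Aggregated preference: the trend flips and vanilla wins overall
total_choc = sales["chocolate"].sum()
total_van = sales["vanilla"].sum()
print(f"Overall chocolate share: {total_choc / (total_choc + total_van):.1%}")  # ~46.5%
```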
A data analyst who overlooks these subgroup dynamics may conclude that chocolate is underperforming. Hence, it is essential to break the numbers down by subgroup and check for the presence of Simpson’s Paradox. When a reversal in trend occurs, the next step is to identify the lurking variable: the hidden factor influencing group outcomes. In this case, the store location happens to be the lurking variable. A deeper contextual understanding is needed to interpret why vanilla sales were so high at the airport that they flipped the overall outcome. Some questions that could guide the investigation are:
• Do airport outlets stock fewer chocolate options?
• Do travelers prefer milder flavors?
• Was there a promotional campaign favoring vanilla at the airport stores?
Simpson’s Paradox in RAG Systems
Let’s suppose that you have a RAG (Retrieval-Augmented Generation) system that gauges public sentiment towards electric vehicles (EVs) and answers questions about it. The system draws on news articles from 2010 to 2024. Until 2016, EVs received mixed opinions due to their limited range, higher purchase price, and a lack of charging stations, all of which made long-distance EV travel impractical. News reports before 2017 tended to highlight these shortcomings. From 2017 onward, EVs began to be perceived in a better light thanks to improvements in performance and the growing availability of charging stations, particularly after the successful launch of Tesla’s premium EV. A RAG system that uses news reports spanning 2010 to 2024 would most likely give contradictory responses to similar questions, triggering Simpson’s Paradox.
As an example, if the RAG system is asked, "Is EV adoption in the US still low?", the answer might be "Yes, adoption remains low due to high buying costs and limited infrastructure." If it is asked, "Has EV adoption increased recently in the US?", the answer might be "Yes, adoption has increased greatly due to advancements in technology and charging infrastructure." In this case, the lurking variable is the publication date. A practical fix is to tag documents (articles) into time-based bins during the pre-processing phase. Other options include encouraging users to specify a time range in their prompt (e.g., "How has EV adoption changed over the last five years?") or fine-tuning the LLM to explicitly state the timeline it is considering in its response (e.g., "As of 2024, EV adoption has increased greatly.").
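As an illustration of the first fix, below is a minimal, library-agnostic sketch of time-based tagging. The Article class and the tag_time_bin and filter_by_range helpers are hypothetical; a production RAG pipeline would store the time bin as metadata in its vector store and filter on it at retrieval time.

```python
from dataclasses import dataclass

@dataclass
class Article:
    text: str
    year: int  # publication date is the lurking variable

def tag_time_bin(article: Article) -> dict:
    """Attach a coarse time-based bin as retrieval metadata (hypothetical scheme)."""
    bin_label = "pre-2017" if article.year < 2017 else "2017-onward"
    return {"text": article.text, "year": article.year, "time_bin": bin_label}

def filter_by_range(chunks: list, start: int, end: int) -> list:
    """Keep only chunks whose publication year falls inside the requested range."""
    return [c for c in chunks if start <= c["year"] <= end]

# Example: restrict retrieval to recent coverage before building the prompt
corpus = [
    tag_time_bin(Article("EV range anxiety persists ...", 2013)),
    tag_time_bin(Article("Charging networks expand rapidly ...", 2022)),
]
recent = filter_by_range(corpus, 2019, 2024)
print(len(recent), recent[0]["time_bin"])  # 1 2017-onward
```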

Accuracy Paradox in Data Science Problems
The crux of the Accuracy Paradox is that high accuracy is not necessarily indicative of a useful model. Let’s assume that you are building a classification model to identify whether a patient has a rare disease that affects only 1 in 100 people. The model correctly labels everyone who does not have the disease and thereby achieves 99% accuracy. However, it fails to identify the one person who has the disease and needs urgent medical attention. As a result, the model becomes useless for detecting the disease, which is its very purpose. This occurs especially with imbalanced datasets, where observations for one class are minimal. This has been illustrated in the figure below.

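To make the paradox concrete, here is a minimal sketch (assuming scikit-learn; the screening data is synthetic) in which a baseline that always predicts "healthy" reaches 99% accuracy while its recall for the sick patients is zero.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Synthetic screening data: roughly 1 in 100 patients has the disease
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))
y = np.zeros(1000, dtype=int)
y[:10] = 1  # 1% positive class

# A baseline that always predicts "healthy"
clf = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = clf.predict(X)

print("Accuracy :", accuracy_score(y, pred))                    # 0.99 -- looks great
print("Recall   :", recall_score(y, pred, zero_division=0))     # 0.0  -- misses every sick patient
print("Precision:", precision_score(y, pred, zero_division=0))  # 0.0
print("F1-score :", f1_score(y, pred, zero_division=0))         # 0.0
```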
The best way to tackle the Accuracy Paradox is to use metrics that capture the performance of the minority class, such as Precision, Recall, and F1-score. Another approach is to treat imbalanced datasets as anomaly detection problems rather than classification problems. One could also consider collecting more minority-class data (if possible), over-sampling the minority class, or under-sampling the majority class. Below is a quick guide that helps determine which metric to use depending on the use case, the objective, and the consequences of mistakes.

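Alongside the metric guide, here is a minimal sketch of random over-sampling on the same synthetic setup (assuming scikit-learn); dedicated libraries offer more sophisticated re-balancing strategies.

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = np.zeros(1000, dtype=int)
y[:10] = 1

# Separate majority and minority rows
X_min, y_min = X[y == 1], y[y == 1]
X_maj, y_maj = X[y == 0], y[y == 0]

# Randomly over-sample the minority class until the classes are balanced
X_min_up, y_min_up = resample(X_min, y_min, replace=True,
                              n_samples=len(y_maj), random_state=0)
X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.concatenate([y_maj, y_min_up])
print(np.bincount(y_bal))  # [990 990] -- balanced classes for training
```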
Accuracy Paradox in LLMs
While the Accuracy Paradox is a common issue that many data scientists tackle, its implications for LLMs are largely ignored. The accuracy metric can dangerously overpromise in use cases that involve safety, toxicity detection, and bias mitigation: high accuracy does not mean that a model is fair and safe to use. For example, an LLM-based safety filter with 98% accuracy is of no use if the 2 prompts in every 100 that it misclassifies as safe and harmless are precisely the malicious ones. Hence, in LLM evaluations, it is a good idea to use precision, recall, or PR-AUC over plain accuracy, as they indicate how well the model handles the minority classes.
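As a quick illustration, here is a minimal sketch (assuming scikit-learn and hypothetical safety-classifier scores) where accuracy looks reassuring while recall and PR-AUC expose that every malicious prompt slips through.

```python
import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score, recall_score

# Hypothetical prompt-safety labels: 1 = malicious, 0 = benign (rare positives)
y_true = np.array([0] * 98 + [1, 1])

# Hypothetical classifier scores; both malicious prompts receive low scores
rng = np.random.default_rng(1)
scores = np.concatenate([rng.uniform(0.0, 0.4, 98), [0.30, 0.35]])
y_pred = (scores >= 0.5).astype(int)  # nothing crosses the threshold

print("Accuracy          :", accuracy_score(y_true, y_pred))        # 0.98 -- looks safe
print("Recall (malicious):", recall_score(y_true, y_pred))          # 0.0  -- nothing caught
print("PR-AUC            :", average_precision_score(y_true, scores))  # low -- poor ranking of malicious prompts
```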
Goodhart’s Law in Business Intelligence
Economist Charles Goodhart is credited with the maxim, “When a measure becomes a target, it ceases to be a good measure.” This law is a gentle reminder that if you over-optimize a metric without understanding its context and implications, the effort will backfire.
A manager at a fictitious online news agency sets a KPI for his team: increase session duration by 20%. The team artificially extends video introductions and adds filler content to stretch the session duration. Session duration goes up, but video quality suffers, and the value users get from the videos diminishes.
Another example relates to customer churn. In an attempt to reduce churn, a subscription-based entertainment app places the ‘Unsubscribe’ button in a hard-to-find location on its web portal. Churn goes down, but not because of improved customer satisfaction; it is solely because of limited exit options, an illusion of customer retention. Below is a visual illustration of how efforts to meet or exceed growth targets (such as increasing session duration or user engagement) can lead to unintended consequences and a decline in user experience. When teams resort to artificial inflation tactics to drive up performance metrics, the improvement looks good on paper but is not meaningful in any way.

Goodhart’s Law in LLMs
When you train an LLM too heavily on a particular dataset (especially a benchmark), it can start memorizing patterns from that training data instead of learning to generalize. This is a classic example of overfitting: the model performs extremely well on the training data but poorly on real-world inputs.
Let’s assume that you are training an LLM to summarize news articles and use the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) metric to evaluate its performance. ROUGE rewards exact or near-exact n-gram matches with the reference summaries. Over time, the LLM starts copying large chunks of text from the input articles to boost its ROUGE score, and leans on buzzwords that appear frequently in the reference summaries. Suppose the input article contains the text “The bank increased interest rates to curb inflation, and this caused stock prices to decline sharply.” The overfit model would summarize it as “The bank increased interest rates to curb inflation,” whereas a generalizing model would summarize it as “The interest rate hike triggered a decline in the stock markets.” The illustration below demonstrates how over-optimizing for an evaluation metric can produce low-quality responses, that is, responses that look good on paper but are not helpful.

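To illustrate the metric gaming, here is a minimal sketch using the rouge_score package and a hypothetical reference summary written close to the source wording, as is common in news summarization datasets; the near-verbatim copy typically outscores the more useful paraphrase.

```python
from rouge_score import rouge_scorer

# Hypothetical reference summary, phrased close to the source article
reference = "The bank raised interest rates to curb inflation, causing stock prices to decline."

# An overfit model copies the source; a generalizing model paraphrases it
extractive_copy = "The bank increased interest rates to curb inflation"
abstractive = "The interest rate hike triggered a decline in the stock markets"

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
for name, candidate in [("extractive copy", extractive_copy), ("abstractive", abstractive)]:
    scores = scorer.score(reference, candidate)
    print(name, {k: round(v.fmeasure, 3) for k, v in scores.items()})
# The near-verbatim copy earns the higher ROUGE scores here,
# even though the paraphrase is the more useful summary.
```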
Concluding Remarks
Whether in business intelligence or LLMs, paradoxes creep in when numbers and metrics are handled without the underlying nuance and context. It is also important to remember that over-fitting to a single metric can damage the bigger picture. Combining quantitative analysis with human insight is crucial to avoid these pitfalls and to create reliable reports and powerful LLMs that truly deliver value.