
Hands On Time Series Modeling of Rare Events, with Python

Have you ever heard this:

We have those very large values in this timeseries, but those are just “outliers” and happen only [insert small number here]% of the time

During my Data Science years, I have heard that sentence a lot. In conferences, during product reviews, and on calls with clients, it has been used to reassure everyone that the very large (or very small) unwanted values that might appear are not “standard”, that they don’t belong to the “usual process” (all vague words I have actually heard), and that they do not constitute an issue for the system we are trying to build (for reasons X, Y, and Z).

In production settings, those very small or very large values (known as extreme values) are handled with guardrails that let the system “fail gracefully” whenever an extreme value is measured. This is usually enough when you just need your system to keep working at all times, even when the unwanted/non-standard/rude, crazy, and annoying extreme values show up.

Nonetheless, when analyzing a time series, we can do something more than “fixing” the extreme values with guardrails and if/else thresholds: we can actually model extreme values so that we can understand them.

In timeseries, extreme values actually represent something in the system of interest. For example, if your time series describes the energy consumption of a city, an unreasonably high energy level might indicate a worrisome energy consumption in a specific area, which may require action. If you are dealing with financial data, the highs and lows have an obvious, crucial meaning, and understanding their behavior is extremely important.

In this blog post, we will be dealing with weather data, where the time series represents temperature (in Kelvin). The dataset, as we will see, covers multiple cities, and every city yields a time series. If you select a specific city, you get a time series like the one shown below:

Image generated by author [data source]

So, in this kind of dataset, it is pretty important to model the maxima and minima, because they mean it’s either as hot as a furnace or extremely cold. Hopefully, at this point, you are asking yourself

“What do we mean by modelling the maxima and minima?”

When you are dealing with a time series dataset, it is reasonable to expect a Gaussian-like distribution, like the one you see here:

Image made by author [data source]

But if you consider only the extreme values, the distribution is far from that. And as you will see in a second, there are multiple layers of complexity associated with extracting the distribution of extreme values. Three of them are:

  1. Defining an extreme value: how do you know a value is extreme? Pinning down a working definition of “extreme” will be our first stage.
  2. Defining the distributions that describe these events: there are multiple possible distributions of extreme values. The three distributions that will be treated in this blog post are the Generalized Extreme Value (GEV), Weibull, and Gumbel (a special case of GEV) distributions.
  3. Choosing the best distribution: There are multiple metrics we can use to determine the “best fitting” distribution. We will use the Log-Likelihood, the Akaike Information Criterion, and the Bayesian Information Criterion.

All things we will talk about in this article 🥹

Looks like we have a lot of ground to cover. Let’s get started.

0. Data and Script Source

The language we will use is Python. The code source can be found on this PieroPaialungaAI/RareEvents folder. The data source can be found in this open source Kaggle Dataset. Mind you, if you clone the GitHub folder, you won’t need to download the dataset. The dataset is inside the RawData folder inside the RareEvents GitHub main folder (you’re welcome 😉).

1. Preliminary Data Exploration

In order to keep the exploration phase simple and give us maximum versatility in the notebook without writing hundreds of lines of code, I wrapped all the data handling in a dedicated class; the code that does that lives in [data.py].
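A minimal sketch of what such a loader might look like is below (the class name and CSV path are illustrative placeholders, not necessarily what data.py actually uses):

```python
import pandas as pd


class WeatherData:
    """Thin wrapper around the raw temperature CSV (illustrative sketch)."""

    def __init__(self, path="RawData/temperature.csv"):  # path is an assumption
        self.data = pd.read_csv(path)
        # Every column except "datetime" is a city-level time series
        self.city_columns = [c for c in self.data.columns if c != "datetime"]
```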

This code does all the dirty work data-wise, so we can do all the following steps in very few lines of code.

The first thing we can do is just display some of the rows of the dataset. We can do it with this code:
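Something along these lines (continuing from the loader sketch above; the subset of cities shown is arbitrary):

```python
# Peek at the first rows for a handful of cities, plus the time axis
weather = WeatherData()
print(weather.data[["datetime", "Vancouver", "San Francisco", "New York"]].head())
```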

Notice that there are 36 columns/cities, not just 4, in the dataset. I displayed 4 to have a nicely formatted table. 🙂

A few things to notice:

  • Every column except “datetime” is a city and represents a time series; the “datetime” column is the shared time axis
  • Every value in the city column represents the Kelvin temperature for the date in the corresponding datetime column. For example, index = 3 for column = ‘Vancouver’ tells us that, at datetime 2012-10-01 15:00:00, the temperature was 284.627 K

I also developed a function that allows you to plot the city column. For example, if you want to peek at what happens in New York, you can use this:
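The plotting helper lives in the repo; a rough equivalent looks like this (the function name is illustrative):

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_city(data, city="New York"):
    """Plot the raw temperature time series of a single city (illustrative helper)."""
    plt.figure(figsize=(12, 4))
    plt.plot(pd.to_datetime(data["datetime"]), data[city], lw=0.5)
    plt.xlabel("Time")
    plt.ylabel("Temperature (K)")
    plt.title(city)
    plt.show()


plot_city(weather.data, "New York")
```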

Image made by author using code above

Now, the datetime column is just a string column, but it would actually be helpful to have the specific month, day, and year in separate columns. Also, we have some NaN values that we should take care of. All these boring preprocessing steps are inside the `.clean_and_preprocess()` method.
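Conceptually, the method does something like the sketch below (the exact NaN handling in the repo may differ, e.g. interpolation instead of dropping rows):

```python
import pandas as pd


def clean_and_preprocess(raw):
    """Parse the datetime column, expose year/month/day, and handle NaNs (sketch)."""
    df = raw.copy()
    df["datetime"] = pd.to_datetime(df["datetime"])
    df["year"] = df["datetime"].dt.year
    df["month"] = df["datetime"].dt.month
    df["day"] = df["datetime"].dt.day
    return df.dropna()  # simplest choice; interpolation is a valid alternative


df = clean_and_preprocess(weather.data)
df.head()
```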

This is the output:

2. Detecting Extreme Events

Now, a crucial question:

What is an extreme event? And how are we going to detect it?

There are two main ways to define an “extreme event”. For example, if we want to identify the maxima, we can apply:

  1. The first definition: Peak Over Threshold (POT). Given a threshold, everything above it is a maximum point (an extreme event).
  2. The second definition: Extreme within a region. Given a window, we define the maximum value of the window as an extreme event.

In this blog post, we’ll use the second approach. For instance, if we use daily windows, we scan through the dataset and extract the highest value for each day. This gives us a lot of points, as our dataset spans more than 5 years. Or we could use monthly or yearly windows: this gives fewer points, but perhaps richer information.

This is exactly the power of this method: we have control over the number of points and their “quality”. For this study, arguably the best window size is the daily one. For another dataset, feel free to adjust the window based on how many points you end up with; for example, you might want to reduce the window size if your data has a very fine natural resolution (e.g., you collect a value every second), or increase it if you have a very long history (e.g., 50+ years of data, where a weekly window may be more appropriate).

This block-maximum extraction is implemented within the RareEventsToolbox class, in the [rare_events_toolbox.py] script (look at the extract_block_max function).
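In spirit, the extraction boils down to a resample-and-max, something like this simplified sketch (not the exact repo implementation):

```python
def extract_block_max(df, city, freq="D"):
    """Maximum temperature of `city` in each block of size `freq` ('D', 'W', 'M', ...)."""
    series = df.set_index("datetime")[city]
    return series.resample(freq).max().dropna()


daily_max = extract_block_max(df, "New York", freq="D")
```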

And we can quickly display the distribution of rare events at different window sizes using the following block of code:
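For example, reusing the sketch above (the window sizes are just an illustration):

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
# On newer pandas versions, monthly resampling is "ME" instead of "M"
for ax, freq, label in zip(axes, ["D", "W", "M"], ["Daily", "Weekly", "Monthly"]):
    block_max = extract_block_max(df, "New York", freq=freq)
    ax.hist(block_max, bins=40)
    ax.set_title(f"{label} maxima ({len(block_max)} points)")
    ax.set_xlabel("Temperature (K)")
plt.tight_layout()
plt.show()
```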

Image made by author using code above

3. Extreme Events Distribution

Before diving into code, let’s take a step back. In general, extreme value distributions do not exhibit the beautiful Gaussian bell shape that you saw earlier (the Gaussian distribution for San Francisco). From a theoretical perspective, the two distributions to know are the Generalized Extreme Value (GEV) distribution and the Weibull distribution.

GEV (Generalized Extreme Value)

  • The GEV is the foundation of extreme value theory and provides a family of distributions tailored for modeling block maxima or minima. A special case is the Gumbel distribution.
  • Its flexibility comes from a shape parameter that determines the “tail behavior.” Depending on this parameter, the GEV can mimic different kinds of extremes (e.g., moderate, heavy-tailed).
  • The theoretical justification of the GEV distribution is very elegant: just like the Central Limit Theorem (CLT) says, “if you average a bunch of i.i.d. random variables, the distribution of the average tends to a Gaussian”, Extreme Value Theory (EVT) says, “if you take the maximum (or minimum) of a bunch of i.i.d. random variables, the distribution of that maximum tends to a GEV.”

Weibull

  • The Weibull is one of the most widely used distributions in reliability engineering, meteorology, and environmental modeling.
  • It’s especially useful for describing data where there’s a sense of “bounded” or tapered-off extremes.
  • Unlike the GEV, whose form comes from theory, the Weibull distribution has empirical origins: Waloddi Weibull, a Swedish engineer, first proposed it in 1939 to model the breaking strength of materials.

So we have three possibilities: GEV, Gumbel, and Weibull. Now, which one is the best? The short answer is “it depends,” and another short answer is “it’s best just to try them all and see which one performs best”.

So now we have another question:

How do we evaluate how well a fitted distribution matches a set of data?

Three metrics to use are the following:


  • Log-Likelihood (LL). It measures how probable the observed data is under the fitted distribution: higher is better.
Image made by author

where f is the probability density (or mass) function of the distribution with parameters θ, and x_i is the i-th observed data point.

  • Akaike Information Criterion (AIC). AIC balances two forces: fit quality (via the log-likelihood, L) and simplicity (it penalizes models with too many parameters; k is the number of parameters).
Image made by author
  • Bayesian Information Criterion (BIC). Similar in spirit to AIC, but harsher on complexity (where n is the dataset size).
Image made by author
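Written out explicitly (these are the standard definitions, in the notation above, with n the number of data points):

```latex
\mathrm{LL}(\theta) = \sum_{i=1}^{n} \log f(x_i;\theta), \qquad
\mathrm{AIC} = 2k - 2\,\mathrm{LL}, \qquad
\mathrm{BIC} = k \log n - 2\,\mathrm{LL}
```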

The standard recommendation is to use either AIC or BIC, as they account for both the log-likelihood and the model complexity.

The implementation of the three distribution functions, and the corresponding L, AIC, and BIC values, is the following:
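The repo wraps this logic inside the toolbox class; a self-contained sketch of the core idea with scipy looks like this (the function name is illustrative):

```python
import numpy as np
from scipy import stats

CANDIDATES = {
    "gev": stats.genextreme,
    "gumbel": stats.gumbel_r,        # special case of the GEV
    "weibull_min": stats.weibull_min,
}


def fit_and_score(data, dist_type):
    """Fit one candidate family and return its parameters plus LL / AIC / BIC."""
    dist = CANDIDATES[dist_type]
    params = dist.fit(data)                      # maximum likelihood fit
    ll = np.sum(dist.logpdf(data, *params))      # log-likelihood
    k, n = len(params), len(data)
    aic = 2 * k - 2 * ll                         # penalizes the number of parameters
    bic = k * np.log(n) - 2 * ll                 # harsher penalty, grows with n
    return {"param": params, "metrics": {"log_likelihood": ll, "aic": aic, "bic": bic}}


results = {name: fit_and_score(daily_max.values, name) for name in CANDIDATES}
best = min(results, key=lambda name: results[name]["metrics"]["aic"])  # lowest AIC wins
```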

And then we can display our distribution using the following:
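For the plot, roughly (reusing `results` and `best` from the previous sketch):

```python
import matplotlib.pyplot as plt

best_dist, best_params = CANDIDATES[best], results[best]["param"]
x = np.linspace(daily_max.min(), daily_max.max(), 300)

plt.hist(daily_max, bins=40, density=True, alpha=0.5, label="Daily maxima")
plt.plot(x, best_dist.pdf(x, *best_params), "r-", label=f"Fitted {best}")
plt.xlabel("Temperature (K)")
plt.ylabel("Density")
plt.legend()
plt.show()
```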

Image made by author

Pretty good fit, right? While it visually looks good, we can be a little more quantitative and look at the Q-Q plot, which displays the quantile match between the data and the fitted distribution:
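One simple way to build it (a sketch; the repo may do this differently):

```python
# Compare empirical quantiles against the quantiles of the fitted distribution
sorted_data = np.sort(daily_max.values)
probs = (np.arange(1, len(sorted_data) + 1) - 0.5) / len(sorted_data)
theoretical = best_dist.ppf(probs, *best_params)

plt.scatter(theoretical, sorted_data, s=5)
plt.plot(theoretical, theoretical, "r--")  # perfect-fit reference line
plt.xlabel("Theoretical quantiles")
plt.ylabel("Empirical quantiles")
plt.show()
```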

Image made by author

This shows that our fitted distribution matches the data very well. Notice also that if you had tried a standard distribution (e.g., a Gaussian curve), you would surely have failed: the data is heavily skewed, as expected, because we are dealing with extreme values, and extreme values require extreme value distributions (this feels weirdly motivational 😁).

Now, the cool thing is that, since we structured the code, we can also run this for every city in the dataset using the following block of code:
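A sketch of that loop (the repo version lives in the toolbox class; here the best family is picked by lowest AIC, but BIC would work just as well):

```python
city_columns = [c for c in df.columns if c not in ["datetime", "year", "month", "day"]]

best_fits = {}
for city in city_columns:
    block_max = extract_block_max(df, city, freq="D")
    scores = {name: fit_and_score(block_max.values, name) for name in CANDIDATES}
    winner = min(scores, key=lambda name: scores[name]["metrics"]["aic"])
    best_fits[city] = {"dist_type": winner, **scores[winner]}
```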

And the output will look like this:

{'Dallas': {'dist_type': 'gev',
  'param': (0.5006578789482107, 296.2415220841758, 9.140132853556741),
  'dist': <scipy frozen distribution object>,
  'metrics': {'log_likelihood': -6602.222429209462,
   'aic': 13210.444858418923,
   'bic': 13227.07308905503}},
 'Pittsburgh': {'dist_type': 'gev',
  'param': (0.5847547512518895, 287.21064374616327, 11.190557085335278),
  'dist': <scipy frozen distribution object>,
  'metrics': {'log_likelihood': -6904.563305593636,
   'aic': 13815.126611187272,
   'bic': 13831.754841823378}},
 'New York': {'dist_type': 'weibull_min',
  'param': (6.0505720895039445, 238.93568735311248, 55.21556483095677),
  'dist': <scipy frozen distribution object>,
  'metrics': {'log_likelihood': -6870.265288196851,
   'aic': 13746.530576393701,
   'bic': 13763.10587863208}},
 'Kansas City': {'dist_type': 'gev',
  'param': (0.5483246490879885, 290.4564464294219, 11.284265203196664),
  'dist': <scipy frozen distribution object>,
  'metrics': {'log_likelihood': -6949.785968553707,
   'aic': 13905.571937107414,
   'bic': 13922.20016774352}},
 ...}

4. Summary

Thank you for spending time with me so far, it means a lot ❤️

Let’s recap what we did. Instead of hand-waving away “outliers,” we treated extremes as first-class signals. Concretely:

  • We took a dataset representing the temperature of cities around the world
  • We defined our extreme events using block maxima on a fixed window
  • We modeled city-level temperature highs with three candidate families (GEV, Gumbel, and Weibull)
  • We selected the best fit using log-likelihood, AIC, and BIC, then verified fits with Q-Q plots.

Results show that the “best” family varies by city: for example, Dallas, Pittsburgh, and Kansas City leaned toward the GEV, while New York was best described by a Weibull.

This kind of approach is very important whenever extreme values matter in your system and you need to understand how the system behaves statistically under rare, extreme conditions.

5. Conclusions

Thank you again for your time. It means a lot ❤️

My name is Piero Paialunga, and I’m this guy here:

Image made by author

I am a Ph.D. candidate at the University of Cincinnati Aerospace Engineering Department. I talk about AI and Machine Learning in my blog posts and on LinkedIn, and here on TDS. If you liked the article and want to know more about machine learning and follow my studies, you can:

A. Follow me on Linkedin, where I publish all my stories
B. Follow me on GitHub, where you can see all my code
C. For questions, you can send me an email at piero.paialunga@hotmail
