
Dynamic Inventory Optimization with Censored Demand

Every day, we make decisions under uncertainty; not just once, but in a sequence over time. We rely on our past experiences and expectations of the future to make the most informed choices possible.

Think of a business that offers several products. These products are procured at a cost and sold for a profit. However, unsold inventory may incur a restocking fee, may carry salvage value, or in some cases, must be scrapped entirely.

The business, therefore, faces a crucial question: how much to stock? This decision must be made before demand is known, and what the business learns afterwards depends on that decision. If it overstocks, it observes the full demand, since all customer requests are fulfilled. But if it understocks, it only sees that demand exceeded supply; the actual demand remains unknown, making it a censored observation.

Image Generated by Author via DALL-E

This type of problem is often referred to as a Newsvendor Model. In fields such as operations research and applied mathematics, the optimal stocking decision has been studied by framing it as a classic newspaper stocking problem; hence the name.

In this article, we explore a Sequential Decision-Making framework for the stocking problem under uncertainty and develop a dynamic optimization algorithm using Bayesian learning.

Our approach closely follows the framework laid out by Warren B. Powell in Reinforcement Learning and Stochastic Optimization (2019) and implements the approach of Negoescu, Powell, and Frazier (2011), Optimal Learning Policies for the Newsvendor Problem with Censored Demand and Unobservable Lost Sales, published in Operations Research.

Problem Setup

Following a similar setup to that of Negoescu et al., we frame the problem as optimizing the inventory level for a single item over a sequence of time steps. The cost and selling price are considered fixed. Unsold inventory is discarded with no salvage value, while each unit sold generates revenue. Demand is unknown, and when the available stock is less than actual demand, the demand observation is considered censored.

For simulation purposes, demand \( W \) in each period is drawn from an Exponential distribution with an unknown rate parameter.

\[
\begin{aligned}
x &\in \mathbb{R}_+ &&: \text{Order quantity (decision variable)} \\
W &\sim \mathrm{Exponential}(\lambda) &&: \text{Random demand with unknown rate parameter } \lambda \\
\lambda &&&: \text{Demand rate (unknown, to be estimated)} \\
c &&&: \text{Unit cost to procure or produce the item} \\
p &&&: \text{Unit selling price (assume } p > c \text{ for profitability)}
\end{aligned}
\]

The parameter \(\lambda\) in the Exponential distribution represents the rate of demand; that is, how quickly demand events occur. The average demand is given by \(\mathbb{E}[W] = \frac{1}{\lambda}\).

Image by Author

We can observe from the probability density function (PDF) of the Exponential distribution that higher values of demand \(W\) become progressively less likely. Thus, the Exponential distribution is a reasonable choice for demand modeling.
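As a quick sanity check on the demand model, the short sketch below draws simulated demand with NumPy (the rate value is illustrative, matching the one used later in the simulation) and confirms that the sample mean is close to \(1/\lambda\):

import numpy as np

rng = np.random.default_rng(0)

lambda_true = 0.25                       # illustrative demand rate, so average demand is 1 / 0.25 = 4 units
demand = rng.exponential(scale=1 / lambda_true, size=100_000)

print(demand.mean())                     # ~4.0, matching E[W] = 1 / lambda
print((demand > 8).mean())               # ~0.135: large demand values are increasingly rare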

Sequential Decision Formulation

We formulate the inventory control problem as a sequential decision process under uncertainty. The goal is to maximize total expected profit over a finite time horizon \( N \), while learning the unknown demand rate by applying Bayesian learning principles.

We define an initial state and a probabilistic model that represents our belief about future states over time. At each time step, the model makes a decision based on a policy that maps its current belief to an action. The goal is to find the optimal policy, the one that maximizes a predefined reward function.

After taking an action, the model observes the resulting state and updates its belief accordingly, continuing this cycle of decision, observation, and belief update.

1) State Variable

We model demand in each period as a random variable drawn from an Exponential distribution with an unknown rate parameter \(\lambda\). Since \(\lambda\) is not directly observable, we encode our uncertainty about its value using a Gamma prior:

\[
\lambda \sim \mathrm{Gamma}(a_0, b_0)
\]

The parameters \( a_0 \) and \( b_0 \) define the shape and rate of our initial belief about the demand rate. These two parameters serve as our state variables. At each time step, they summarize all past information and are updated as new demand observations become available.

As we collect more data, the posterior distribution over \(\lambda\) evolves from a wide and uncertain shape to a narrower and more confident one, gradually concentrating around the true demand rate.

This process is captured naturally by the Gamma distribution, which flexibly adjusts its shape based on the amount of information we’ve seen. Early on, the distribution is diffuse, signaling high uncertainty. As observations accumulate, the belief becomes sharper, allowing for more reliable and responsive decision-making. The probability density function (PDF) of the Gamma distribution is shown below:

Image by Author

We will later define a transition function that updates the state, that is, how \( (a_n, b_n) \) evolves to \( (a_{n+1}, b_{n+1}) \), based on newly observed data. This allows the model to continuously refine its belief about demand and make more informed inventory decisions over time.

Note that the expected value of the Gamma distribution is defined as:

\[
\mathbb{E}[\lambda] = \frac{a}{b}
\]
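To make the belief state concrete, the sketch below (with assumed, illustrative parameter values) evaluates the posterior mean \(a/b\) and a 95% credible interval for a weak early belief and a sharper later one, showing how the distribution tightens as \(a\) and \(b\) grow:

from scipy.stats import gamma

# Two illustrative belief states with the same mean demand rate of 0.25
for a, b in [(2.0, 8.0), (50.0, 200.0)]:
    belief = gamma(a=a, scale=1 / b)               # Gamma(shape=a, rate=b) via scale = 1/b
    low, high = belief.ppf([0.025, 0.975])
    print(f"a={a:>5}, b={b:>6} -> mean={a / b:.3f}, 95% interval=({low:.3f}, {high:.3f})")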

2) Decision Variable

The decision variable at time \( n \) is the stocking level:

\[
x_n \in \mathbb{R}_+
\]

This is the number of units to order before demand \( W_{n+1} \) is realized. The decision depends only on the current belief \( (a_n, b_n) \).

3) Exogenous Information

After selecting \( x_n \), demand \( W_{n+1} \) is revealed:

\[
W_{n+1} \sim \mathrm{Exp}(\lambda)
\]

Since \( \lambda \) is unknown, demand is random. Observations are:

  • Uncensored if \( W_{n+1} < x_n \) (we observe the actual demand)
  • Censored if \( W_{n+1} \ge x_n \) (we only know that demand was at least \( x_n \); the actual demand remains unknown)

This censoring limits the information available for belief updating. Even though the full demand isn’t observed, the censored observation still carries valuable information and should not be ignored in our modeling approach.

4) Transition Function

The transition function defines how the model’s belief, represented by the state variables, is updated over time. It maps the prior state to the expected future state, and in our case, this update is governed by Bayesian learning.

Bayesian Uncertainty Modelling

Bayes’ theorem combines prior belief with observed data to form a posterior distribution. This updated distribution reflects both prior knowledge and the newly observed information.

\[
p_{n+1}(\lambda \mid w_{n+1}) = \frac{p(w_{n+1} \mid \lambda) \cdot p_n(\lambda)}{p(w_{n+1})}
\]

Where:

\[
p(w_{n+1} \mid \lambda) : \text{ Likelihood of new observation at time } n+1
\]

\[
p_n(\lambda) : \text{ Prior at time } n
\]
\[
p(w_{n+1}) : \text{ Marginal likelihood (normalizing constant) at time } n+1
\]
\[
p_{n+1}(\lambda \mid w_{n+1}) : \text{ Posterior after observing } w_{n+1}
\]

We set up our problem such that, in each period, demand \(W\) is drawn from an Exponential distribution, and the prior belief over \(\lambda\) is modelled using a Gamma distribution.

\[
p_{n+1}(\lambda \mid w_{n+1})
=
\frac{
\underbrace{\lambda e^{-\lambda w_{n+1}}}_{\text{Likelihood}}
\cdot
\underbrace{\frac{b_n^{a_n}}{\Gamma(a_n)} \lambda^{a_n - 1} e^{-b_n \lambda}}_{\text{Prior (Gamma)}}
}{
\underbrace{
\int_0^\infty \lambda e^{-\lambda w_{n+1}} \cdot \frac{b_n^{a_n}}{\Gamma(a_n)} \lambda^{a_n - 1} e^{-b_n \lambda} \, d\lambda
}_{\text{Marginal (evidence)}}
}
\]

The Gamma and Exponential distributions form a well-known conjugate pair in Bayesian statistics. When using a Gamma prior with an Exponential likelihood, the resulting posterior is also a Gamma distribution. This property of the prior and posterior belonging to the same distributional family is what defines a conjugate prior, and it simplifies Bayesian updating significantly.

For reference, closed-form conjugate updates like this can be found in standard conjugate prior tables, such as the one on Wikipedia. Using this reference, we can formulate the posterior as:

Let:

\[
\lambda \sim \mathrm{Gamma}(a_0, b_0) \quad : \text{ Prior}
\]

\[
w \sim \mathrm{Exp}(\lambda) \quad : \text{ Likelihood}
\]

For \( n \) independent observations \( w_1, \dots, w_n \), the Gamma prior and Exponential likelihood result in a Gamma posterior:

\[
\lambda \mid w_1, \dots, w_n \sim \mathrm{Gamma}\left(a_0 + n,\ b_0 + \sum_{i=1}^n w_i\right)
\]

After observing a single (uncensored) demand \( w \), the posterior simplifies to:

\[
\lambda \mid w \sim \mathrm{Gamma}(a_0 + 1, b_0 + w)
\]

  • The shape parameter increases by 1 because one new data point has been observed.
  • The rate parameter increases by \( w \) because the Exponential likelihood contributes the term \( e^{-\lambda w} \), which combines with the prior’s exponential term and adds to the total exponent.
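As a quick numerical sanity check of this conjugate update, the sketch below applies Bayes’ rule on a discretized grid (the prior parameters and the single observed demand are assumed, illustrative values) and compares the result to the closed-form Gamma posterior:

import numpy as np
from scipy.stats import gamma

a0, b0, w = 2.0, 8.0, 3.0                      # illustrative prior parameters and one observed demand

# Conjugate update: Gamma(a0, b0) prior + Exponential likelihood -> Gamma(a0 + 1, b0 + w)
conjugate_posterior = gamma(a=a0 + 1, scale=1 / (b0 + w))

# Brute-force check: discretize lambda and apply Bayes' rule numerically
lam = np.linspace(1e-4, 5, 20_000)
likelihood = lam * np.exp(-lam * w)            # Exponential pdf at w, viewed as a function of lambda
prior = gamma(a=a0, scale=1 / b0).pdf(lam)
numeric = likelihood * prior
numeric /= numeric.sum() * (lam[1] - lam[0])   # normalize by the evidence (Riemann sum)

print(np.max(np.abs(numeric - conjugate_posterior.pdf(lam))))   # small: the grid posterior matches the conjugate one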

The Update Function

The posterior parameters (state variables) are updated based on the nature of the observation:

  • Uncensored (\( W_{n+1} < x_n \)):

\[
a_{n+1} = a_n + 1, \quad b_{n+1} = b_n + W_{n+1}
\]

  • Censored (\( W_{n+1} \ge x_n \)):

\[
a_{n+1} = a_n, \quad b_{n+1} = b_n + x_n
\]

These updates reflect how each observation, full or partial, informs the posterior belief over \( \lambda \).

We can define the transition function in Python as below:

from typing import Tuple

def transition_a_b(
    a_n: float,
    b_n: float,
    x_n: float,
    W_n1: float
) -> Tuple[float, float]:
    """
    Updates the posterior parameters (a, b) after observing demand.

    Args:
        a_n (float): Current shape parameter of Gamma prior.
        b_n (float): Current rate parameter of Gamma prior.
        x_n (float): Order quantity at time n.
        W_n1 (float): Observed demand at time n+1 (may be censored).

    Returns:
        Tuple[float, float]: Updated (a_{n+1}, b_{n+1}) values.
    """
    if W_n1 < x_n:
        # Uncensored: full demand observed
        a_n1 = a_n + 1
        b_n1 = b_n + W_n1
    else:
        # Censored: only know that W >= x
        a_n1 = a_n
        b_n1 = b_n + x_n

    return a_n1, b_n1

5) Objective Function

The model seeks a policy \( \pi \), mapping beliefs to stocking decisions, in order to maximize total expected profit.

  • Profit from ordering \( x_n \) units and facing demand \( W_{n+1} \):

\[
F(x_n, W_{n+1}) = p \cdot \min(x_n, W_{n+1}) - c \cdot x_n
\]

  • The cumulative objective is:

\[
\max_\pi \mathbb{E} \left[ \sum_{n=0}^{N-1} F(x_n, W_{n+1}) \right]
\]

  • \( \pi \) maps \( (a_n, b_n) \) to \( x_n \)
  • \( p \) is the selling price per unit sold
  • \( c \) is the unit cost of ordering
  • Unsold units are discarded with no salvage value

Note that this objective function considers only the expected immediate reward in each period, summed across the time horizon. Later, we introduce an expanded version that incorporates the value of future learning. This encourages the model to explore, accounting for the information that censored demand can reveal over time.

We can define the profit function in Python as below:

def profit_function(x: float, W: float, p: float, c: float) -> float:
    """
    Profit function defined as:

        F(x, W) = p * min(x, W) - c * x

    This represents the reward received when fulfilling demand W with inventory x,
    earning price p per unit sold and incurring cost c per unit ordered.

    Args:
        x (float): Inventory level / decision variable.
        W (float): Realized demand.
        p (float): Unit selling price.
        c (float): Unit cost.

    Returns:
        float: The profit (reward) for this period.
    """
    return p * min(x, W) - c * x

Policy Functions

We will now define several policy functions, following Negoescu et al., each of which sets the stocking level \(x_n\) based on our current belief state \((a_n, b_n)\).

1) Point Estimate Policy

Under this policy, the model estimates the unknown demand rate \(\lambda\) using the current posterior and chooses an order quantity \( x_n \) to maximize the immediate expected profit.

At time \(n\), the mean of the current posterior \(\lambda \sim \mathrm{Gamma}(a_n, b_n)\) is:

\[
\hat{\lambda}_n = \frac{a_n}{b_n}
\]

We treat this estimate as the “true” value of \(\lambda\) and assume demand \(W \sim \mathrm{Exp}(\hat{\lambda}_n)\).

Expected Value

The profit for order quantity \(x\) and realized demand \(W\) is:

\[
F(x, W) = p \cdot \min(x, W) - c \cdot x
\]

We seek to maximize the expected profit.

\[
\max_{x \geq 0} \quad \mathbb{E}_W \left[ p \min(x, W) - c x \right]
\]

The expected value of a random variable is:

\[
\mathbb{E}[X] = \int_{-\infty}^{\infty} x \cdot f(x) \, dx
\]

Thus, the objective function can be written as:

\[
\max_{x \geq 0} \left[ p \left( \int_0^x w f_W(w) \, dw + x \int_x^\infty f_W(w) \, dw \right) - c x \right]
\]

Where:

  • \(f_W(w)\): Probability density function (PDF) of demand

The PDF of \(\mathrm{Exponential}(\hat{\lambda}_n)\) is:

\[
f_W(w) = \hat{\lambda}_n e^{-\hat{\lambda}_n w}
\]

This can be solved as:

\[
\mathbb{E}[F(x, W)] = p \cdot \frac{1 - e^{-\hat{\lambda}_n x}}{\hat{\lambda}_n} - c x
\]
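If we want to double-check this closed form, a quick numerical integration gives the same number; the values of \(\hat{\lambda}_n\), \(p\), \(c\), and \(x\) below are illustrative assumptions:

import numpy as np
from scipy.integrate import quad

lam_hat, p, c, x = 0.25, 26.0, 20.0, 1.0       # illustrative rate estimate, prices, and order quantity

# Numerically integrate (p * min(x, w) - c * x) against the Exponential(lam_hat) pdf
integrand = lambda w: (p * min(x, w) - c * x) * lam_hat * np.exp(-lam_hat * w)
numeric, _ = quad(integrand, 0, np.inf)

closed_form = p * (1 - np.exp(-lam_hat * x)) / lam_hat - c * x
print(numeric, closed_form)                     # both evaluate to about 3.0 for these values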

First Order Optimality Condition

We set the derivative of the expected profit function to zero and solve for \(x\) to find the stocking level that maximizes the expected profit:

\[
\frac{d}{dx} \mathbb{E}[F(x, W)] = p e^{-\hat{\lambda}_n x} - c = 0
\]

\[
e^{-\hat{\lambda}_n x^*} = \frac{c}{p}
\quad \Rightarrow \quad
x^* = \frac{1}{\hat{\lambda}_n} \log\left( \frac{p}{c} \right)
\]

Substituting \(\hat{\lambda}_n = \frac{a_n}{b_n}\):

\[
x_n = \frac{b_n}{a_n} \log\left( \frac{p}{c} \right)
\]

Python implementation:

import math

def point_estimate_policy(
    a_n: float,
    b_n: float,
    p: float,
    c: float
) -> float:
    """
    Point Estimate Policy, chooses x_n based on posterior mean at time n.

    Args:
        a_n (float): Gamma shape parameter at time n.
        b_n (float): Gamma rate parameter at time n.
        p (float): Selling price per unit.
        c (float): Unit cost.

    Returns:
        float: Stocking level x_n
    """
    lambda_hat = a_n / b_n
    return (1 / lambda_hat) * math.log(p / c)

2) Distribution Policy

The Distribution Policy optimizes the expected immediate profit by integrating over the entire current belief distribution of the demand rate (lambda). Unlike the Point Estimate Policy, it does not collapse the posterior to a single value.

At time \(n\), the belief about \(\lambda\) is:

\[
\lambda \sim \mathrm{Gamma}(a_n, b_n)
\]

Demand is modelled as:

\[
W \sim \mathrm{Exp}(\lambda)
\]

This policy chooses the order quantity \(x_n\) by maximizing:

\[
x_n = \arg\max_{x \ge 0}\; \mathbb{E}_{\lambda \sim \mathrm{Gamma}(a_n, b_n)} \left[ \mathbb{E}_{W \sim \mathrm{Exp}(\lambda)} \left[ p \cdot \min(x, W) - c x \right] \right]
\]

This is the expected immediate profit, averaged over both the uncertainty in demand and the uncertainty in \(\lambda\).

Expected Value

From the previous policy, we know that, for a given \(\lambda\):

\[
\mathbb{E}_W[\min(x, W)] = \frac{1 - e^{-\lambda x}}{\lambda}
\]

Thus:

\[
\mathbb{E}_{\lambda} \left[ \mathbb{E}_{W \mid \lambda}[\min(x, W)] \right]
= \mathbb{E}_{\lambda} \left[ \frac{1 - e^{-\lambda x}}{\lambda} \right]
\]

If we denote the Gamma density as:

\[
f(\lambda) = \frac{b^a}{\Gamma(a)} \lambda^{a - 1} e^{-b \lambda}
\]

Then the expectation becomes:

\[
\mathbb{E}_\lambda \left[ \frac{1 - e^{-\lambda x}}{\lambda} \right]
= \int_0^\infty \frac{1 - e^{-\lambda x}}{\lambda} f(\lambda) \, d\lambda
= \frac{b^a}{\Gamma(a)} \int_0^\infty (1 - e^{-\lambda x}) \lambda^{a - 2} e^{-b \lambda} \, d\lambda
\]

Without going over the full proof, the expectation yields:

\[
\mathbb{E}[\text{Profit}] = p \cdot \mathbb{E}_{\lambda} \left[ \frac{1 - e^{-\lambda x}}{\lambda} \right] - c x
= p \cdot \frac{b}{a - 1} \left(1 - \left( \frac{b}{b + x} \right)^{a - 1} \right) - c x
\]
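Because this marginalized formula is easy to get wrong, a small Monte Carlo check is reassuring. The sketch below uses illustrative values for \((a, b, p, c, x)\), samples \(\lambda\) from the Gamma belief and then demand from the Exponential, and compares the simulated expected profit against the closed form:

import numpy as np

rng = np.random.default_rng(0)
a, b, p, c, x = 10.0, 40.0, 26.0, 20.0, 1.0     # illustrative belief state, prices, and order quantity

# Monte Carlo: sample lambda from the Gamma belief, then demand from Exp(lambda)
lam = rng.gamma(shape=a, scale=1 / b, size=1_000_000)
W = rng.exponential(scale=1 / lam)
mc_profit = np.mean(p * np.minimum(x, W) - c * x)

closed_form = p * (b / (a - 1)) * (1 - (b / (b + x)) ** (a - 1)) - c * x
print(mc_profit, closed_form)                   # the two values should agree up to Monte Carlo noise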

First Order Optimality Condition

Again, we set the derivative of the expected profit function to zero and solve for \(x\) to find the stocking level that maximizes the expected profit:

\[
\frac{d}{dx} \mathbb{E}[\text{Profit}]
= \frac{d}{dx} \left[ p \cdot \frac{b}{a - 1} \left(1 - \left( \frac{b}{b + x} \right)^{a - 1} \right) - c x \right] = 0
\]

Without going over the proof, the closed-form expression based on Negoescu et al.’s paper is:

\[
x_n = b_n \left( \left( \frac{p}{c} \right)^{1/a_n} - 1 \right)
\]

Python implementation:

def distribution_policy(
    a_n: float,
    b_n: float,
    p: float,
    c: float
) -> float:
    """
    Distribution Policy, chooses x_n by integrating over full posterior at time n.

    Args:
        a_n (float): Gamma shape parameter at time n.
        b_n (float): Gamma rate parameter at time n.
        p (float): Selling price per unit.
        c (float): Unit cost.

    Returns:
        float: Stocking level x_n
    """
    return b_n * ((p / c) ** (1 / a_n) - 1)

Knowledge Gradient (KG) Policy

The Knowledge Gradient (KG) policy is a Bayesian learning policy that balances exploitation (maximizing immediate profit) and exploration (ordering to gain information about demand for future decisions).

Instead of just maximizing today’s profit, KG chooses the order quantity that maximizes:

Profit now + Value of information gained for the future

\[
x_n = \arg\max_x\; \mathbb{E}\left[ p \cdot \min(x, W_{n+1}) - c x + V(a_{n+1}, b_{n+1}) \mid a_n, b_n, x \right]
\]

Where:

  • \(W_{n+1} \sim \mathrm{Exp}(\lambda)\) (with \(\lambda \sim \mathrm{Gamma}(a_n, b_n)\))
  • \(V(a_{n+1}, b_{n+1})\) is the value of expected future profits under the updated belief after observing \(W_{n+1}\)

We do not know \((a_{n+1}, b_{n+1})\) at time \(n\) because we haven’t yet observed demand. So, we compute their expected value under the possible observation outcomes (censored vs uncensored).

The KG policy then evaluates each candidate stocking quantity (x) by:

  • Simulating its effect on posterior beliefs
  • Computing the immediate profit
  • Computing the value of future learning based on belief updates

Objective Function

We define the total value of choosing (x) at time (n) as:

\[
F_{\text{KG}}(x) = \underbrace{\mathbb{E}[p \cdot \min(x, W) - c x]}_{\text{Immediate profit}} + \underbrace{(N - n) \cdot \mathbb{E}_{\text{posterior}} \left[ \max_{x'} \mathbb{E}_{\lambda \sim \text{posterior}}[\, p \cdot \min(x', W) - c x' \,] \right]}_{\text{Value of learning}}
\]

  • The first term is the expected immediate profit.
  • The second term accounts for how this choice improves future profit by sharpening our belief about \(\lambda\).
  • Horizon factor \((N - n)\): we will make \((N - n)\) more decisions in the future, so the value of better decisions due to learning today gets multiplied by this factor.
  • Posterior averaging \(\mathbb{E}_{\text{posterior}}[\cdot]\): we average over all the possible posterior beliefs we might end up with after observing the demand outcome; because demand is random and possibly censored, we won’t get perfect information, but we will update our belief.

The paper uses the previously discussed Distribution Policy as a proxy for estimating the future value function. Using its optimal order quantity \(x^*\), the proxy value is:

\[
x^*(a, b) = b \left( \left( \frac{p}{c} \right)^{1/a} - 1 \right),
\qquad
V(a, b) = \frac{b\, p}{a - 1} \left( 1 - \left( \frac{b}{b + x^*} \right)^{a - 1} \right) - c\, x^*
\]

Expected Value

The expected value of \(V\) is expressed below, per Negoescu et al. As the proof of this equation is quite involved, we will not go over the details.

\[
\begin{align*}
\mathbb{E}[V] &= \mathbb{E} \left[ \mathbb{E} \left[ b^{n+1} \left( \frac{p}{a^{n+1} - 1} \left( 1 - \left( \frac{c}{p} \right)^{1 - \frac{1}{a^{n+1}}} \right) - c \left( \left( \frac{c}{p} \right)^{-\frac{1}{a^{n+1}}} - 1 \right) \right) \Big| \lambda \right] \Big| a^n, b^n, x^n \right] \\
&= \mathbb{E} \left[ \int_0^{x^n} \left( b^n + y \right) \left( \frac{p}{a^n} \left( 1 - \left( \frac{c}{p} \right)^{1 - \frac{1}{a^{n+1}}} \right) - c \left( \left( \frac{c}{p} \right)^{-\frac{1}{a^{n+1}}} - 1 \right) \right) \lambda e^{-\lambda y} \, dy \right. \\
&\quad + \left. \int_{x^n}^{\infty} \left( b^n + x^n \right) \left( \frac{p}{a^n - 1} \left( 1 - \left( \frac{c}{p} \right)^{1 - \frac{1}{a^n}} \right) - c \left( \left( \frac{c}{p} \right)^{-\frac{1}{a^n}} - 1 \right) \right) \lambda e^{-\lambda y} \, dy \right].
\end{align*}
\]

Since we already know the expected value of the immediate profit function from the previous policies, the total expected value of the KG policy can be expressed as the sum of the two terms. As this expression is quite long, we will not go over the details here; it can be found in the paper.

First Order Optimality Condition

In this policy as well, we set the derivative of the expected profit function to zero and solve for \(x\) to find the stocking level that maximizes the expected profit. The closed-form solution based on the paper is:

\[
x_n = b_n \left[ \left( \frac{r}{1 + (N - n) \cdot \left( 1 + \frac{a_n r}{a_n - 1} - \frac{(a_n + 1) r}{a_n} \right)} \right)^{-1 / a_n} - 1 \right]
\]

Where:

  • \(r = \frac{c}{p}\): Cost-to-price ratio

Python implementation:

def knowledge_gradient_policy(
    a_n: float,
    b_n: float,
    p: float,
    c: float,
    n: int,
    N: int
) -> float:
    """
    Knowledge Gradient Policy, one-step lookahead policy for exponential demand
    with Gamma(a_n, b_n) posterior.

    Args:
        a_n (float): Gamma shape parameter at time n.
        b_n (float): Gamma rate parameter at time n.
        p (float): Selling price per unit.
        c (float): Unit cost per unit.
        n (int): Current period index (0-based).
        N (int): Total number of periods in the horizon.

    Returns:
        float: Stocking level x_n
    """
    # Guard against a_n <= 1, since the expressions above divide by (a_n - 1)
    a = max(a_n, 1.001)
    r = c / p  # Cost-to-price ratio

    # Simplified horizon adjustment: shrink the effective cost-to-price ratio when
    # many decision periods remain, so the policy stocks more early in the horizon
    # and is more likely to observe uncensored demand (a simplified stand-in for
    # the closed-form expression above).
    future_factor = (N - (n + 1)) / N
    adjustment = 1.0 - future_factor * (1.0 / a)
    adjusted_r = min(max(r * adjustment, 1e-4), 0.99)  # keep the ratio in (0, 1)

    return b_n * ((1 / adjusted_r) ** (1 / a) - 1)
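To get a feel for how the three policies differ, we can evaluate them at the same belief state. The values below are purely illustrative assumptions (a posterior with mean demand of about 4 units, and the price and cost used later in the article):

# Compare the three policies at an illustrative belief state
a_n, b_n, p, c = 10.0, 40.0, 26.0, 20.0   # posterior mean rate a/b = 0.25, i.e. mean demand ~ 4

print("point estimate     :", point_estimate_policy(a_n, b_n, p, c))
print("distribution       :", distribution_policy(a_n, b_n, p, c))
print("knowledge gradient :", knowledge_gradient_policy(a_n, b_n, p, c, n=0, N=100))

With these numbers, the Distribution Policy orders slightly more than the Point Estimate Policy, since it accounts for the remaining uncertainty in \(\lambda\), while the Knowledge Gradient Policy orders the most at the start of the horizon, sacrificing some immediate profit to increase the chance of observing uncensored demand.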

Monte Carlo Policy Evaluation

To evaluate a policy \(\pi\) in a stochastic environment, we simulate its performance over multiple sample demand paths.

Let:

  • \(M\) be the number of independent simulations (demand paths), each denoted \(\omega^m\) for \(m = 1, 2, \dots, M\)
  • \(N\) be the time horizon
  • \(W_{n+1}(\omega^m)\) be the simulated demand at time \(n+1\) on path \(m\)
  • \(x_n(\omega^m)\) be the decision taken at time \(n\) under policy \(\pi\) on path \(m\)

Cumulative Reward on a Single Path

For each sample path \(\omega^m\), compute the total reward:

\[
\hat{F}^\pi(\omega^m) = \sum_{n=0}^{N-1} \left[ p \cdot \min\left(x_n(\omega^m), W_{n+1}(\omega^m)\right) - c \cdot x_n(\omega^m) \right]
\]

This represents the realized value of the policy \(\pi\) along that specific trajectory.

Python implementation:

import numpy as np

def simulate_policy(
    N: int,
    a_0: float,
    b_0: float,
    lambda_true: float,
    policy_name: str,
    p: float,
    c: float,
    seed: int = 42
) -> float:
    """
    Simulates the sequential inventory decision-making process using a specified policy.

    Args:
        N (int): Number of time periods.
        a_0 (float): Initial shape parameter of Gamma prior.
        b_0 (float): Initial rate parameter of Gamma prior.
        lambda_true (float): True exponential demand rate.
        policy_name (str): One of {'point_estimate', 'distribution', 'knowledge_gradient'}.
        p (float): Selling price per unit.
        c (float): Procurement cost per unit.
        seed (int): Random seed for reproducibility.

    Returns:
        float: Total cumulative reward over N periods.
    """
    np.random.seed(seed)
    a_n, b_n = a_0, b_0
    rewards = []

    for n in range(N):
        # Choose order quantity based on specified policy
        if policy_name == "point_estimate":
            x_n = point_estimate_policy(a_n=a_n, b_n=b_n, p=p, c=c)
        elif policy_name == "distribution":
            x_n = distribution_policy(a_n=a_n, b_n=b_n, p=p, c=c)
        elif policy_name == "knowledge_gradient":
            x_n = knowledge_gradient_policy(a_n=a_n, b_n=b_n, p=p, c=c, n=n, N=N)
        else:
            raise ValueError(f"Unknown policy: {policy_name}")

        # Sample demand
        W_n1 = np.random.exponential(1 / lambda_true)

        # Compute profit and update belief
        reward = profit_function(x_n, W_n1, p, c)
        rewards.append(reward)

        a_n, b_n = transition_a_b(a_n, b_n, x_n, W_n1)

    return sum(rewards)

Estimate Expected Value by Averaging

The expected reward of policy \(\pi\) is approximated using the sample average across all \(M\) simulations:

\[
\bar{F}^\pi = \frac{1}{M} \sum_{m=1}^{M} \hat{F}^\pi(\omega^m)
\]

This \(\bar{F}^\pi\) is an unbiased estimator of the true expected reward under policy \(\pi\).

Python implementation:

import numpy as np

def policy_monte_carlo(
    N_sim: int,
    N: int,
    a_0: float,
    b_0: float,
    lambda_true: float,
    policy_name: str,
    p: float = 10.0,
    c: float = 4.0,
    base_seed: int = 42
) -> float:
    """
    Runs multiple Monte Carlo simulations to evaluate the average cumulative reward
    for a given inventory policy under exponential demand.

    Args:
        N_sim (int): Number of Monte Carlo simulations to run.
        N (int): Number of time steps in each simulation.
        a_0 (float): Initial Gamma shape parameter.
        b_0 (float): Initial Gamma rate parameter.
        lambda_true (float): True rate of exponential demand.
        policy_name (str): Name of the policy to use: {"point_estimate", "distribution", "knowledge_gradient"}.
        p (float): Selling price per unit.
        c (float): Procurement cost per unit.
        base_seed (int): Seed offset for reproducibility across simulations.

    Returns:
        float: Average cumulative reward across all simulations.
    """
    total_rewards = []

    for i in range(N_sim):
        reward = simulate_policy(
            N=N,
            a_0=a_0,
            b_0=b_0,
            lambda_true=lambda_true,
            policy_name=policy_name,
            p=p,
            c=c,
            seed=base_seed + i
        )
        total_rewards.append(reward)

    return np.mean(total_rewards)

# Parameters
N_sim = 10000 # Number of simulations
N = 100 # Number of time periods
a_0 = 10.0 # Initial shape parameter of Gamma prior
b_0 = 5.0 # Initial rate parameter of Gamma prior
lambda_true = 0.25 # True rate of exponential demand
p = 26.0 # Selling price per unit
c = 20.0 # Unit cost
base_seed = 1234 # Base seed for reproducibility

results = {
    policy: policy_monte_carlo(
        N_sim=N_sim,
        N=N,
        a_0=a_0,
        b_0=b_0,
        lambda_true=lambda_true,
        policy_name=policy,
        p=p,
        c=c,
        base_seed=base_seed
    )
    for policy in ["point_estimate", "distribution", "knowledge_gradient"]
}

print(results)

Results

The left plot shows how average cumulative profit evolves over time, while the right plot shows the average reward per time step. From this simulation, we observe that the Knowledge Gradient (KG) policy significantly outperforms the other two, as it optimizes not only immediate rewards but also the future value of cumulative rewards. Both the Point Estimate and Distribution policies perform similarly.

We can also observe that the Bayesian learning algorithm gradually converges to the true mean demand \(\mathbb{E}[W]\).
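To see this convergence directly, the sketch below reuses the functions and parameters defined above and tracks the plug-in estimate of mean demand, \(b_n / a_n\), under the Knowledge Gradient policy. Because censored observations carry less information, convergence toward the true mean demand of \(1/\lambda = 4\) is gradual:

import numpy as np

np.random.seed(base_seed)
a_n, b_n = a_0, b_0                     # start from the prior defined above
estimates = []

for n in range(N):
    x_n = knowledge_gradient_policy(a_n=a_n, b_n=b_n, p=p, c=c, n=n, N=N)
    W_n1 = np.random.exponential(1 / lambda_true)
    a_n, b_n = transition_a_b(a_n, b_n, x_n, W_n1)
    estimates.append(b_n / a_n)         # plug-in estimate of mean demand (1 / lambda)

print("true mean demand          :", 1 / lambda_true)
print("estimated mean demand (N) :", estimates[-1])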

These findings highlight the importance of incorporating the value of information in sequential decision making under uncertainty. While simpler heuristics like the Point Estimate and Distribution policies focus solely on immediate gains, the Knowledge Gradient policy leverages future learning potential, yielding superior long-term performance.
