Expected rewards
Beta prior
- If rewards are bounded (e.g., in [0,1]), we can use a Beta prior.
- If our reward is an unbounded quantity, such as profit, we can first normalize it into [0,1] and then still apply a Beta prior (see the normalization sketch below).
- Beta distr. is conjugate to the Bernoulli.
- We are trying to estimate the probability that a campaign is “good” (i.e., profitable).
- Hence, if we get feedback as a binary signal, we use Beta prior.
- Beta distr. models the probability of success in a binary outcome (profit/no profit).
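A minimal sketch of one way to do that normalization; the profit bounds here are assumptions for illustration, not values from any real campaign:

```python
import numpy as np

def normalize_reward(profit, min_profit=-100.0, max_profit=100.0):
    """Map an unbounded profit into [0, 1] via min-max scaling.

    min_profit / max_profit are assumed bounds (e.g., from domain
    knowledge or historical data); values outside them are clipped.
    """
    r = (profit - min_profit) / (max_profit - min_profit)
    return float(np.clip(r, 0.0, 1.0))

print(normalize_reward(25.0))    # 0.625
print(normalize_reward(-150.0))  # 0.0 (clipped)
```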
- Parameters of Beta distr.
- Both $\alpha, \beta > 0$ are real numbers, and are shape parameters (not scale parameters). This means they influence the shape of the curve without changing its range.
- Higher $\alpha$: shifts the peak of the distribution toward the right (closer to 1).
- Higher $\beta$: shifts the peak of the distribution toward the left (closer to 0).
- Equal $\alpha, \beta$: the distribution is symmetric about 0.5 (U-shaped when both are less than 1, bell-shaped when both are greater than 1). When $\alpha = \beta = 1$, it becomes the uniform distribution on [0,1].
- When $\alpha, \beta$ are integers, a useful intuition is:
- $\alpha$: number of “successes” seen so far.
- $\beta$: number of “failures” seen so far.
- Mean $\mu = \frac{\alpha}{\alpha + \beta}$
- Variance $\sigma^2 = \frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}$, which decreases as we gather more data (illustrated in the sketch below).
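A small sketch of the mean/variance formulas above: keeping the mean fixed while scaling up $\alpha + \beta$ (i.e., “more data”) shrinks the variance. The parameter pairs are just illustrative:

```python
def beta_mean_var(alpha, beta):
    """Mean and variance of Beta(alpha, beta) from the closed-form formulas."""
    mean = alpha / (alpha + beta)
    var = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))
    return mean, var

# Same mean (0.7), increasing "sample size" alpha + beta:
for a, b in [(7, 3), (70, 30), (700, 300)]:
    m, v = beta_mean_var(a, b)
    print(f"Beta({a}, {b}): mean={m:.3f}, var={v:.6f}")
# Variance decreases as alpha + beta grows.
```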
- Process:
- Initialize each arm with $\alpha=1$, $\beta=1$; Beta(1,1) is just a uniform distr. over [0,1].
- Each round/day, we sample an “expected reward” for each arm from its Beta distr. (e.g., `np.random.beta(alpha, beta)`), then choose the arm with the highest sampled reward.
- Then, we play that arm, observe its true reward (normalized to [0,1]), and update $\alpha, \beta$:
- $\alpha \leftarrow \alpha + r$, where $r$ is the normalized reward.
- $\beta \leftarrow \beta + (1 - r)$
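Putting the process above into code, a minimal Thompson Sampling sketch. The `true_means` and the simulated reward draw are assumptions used only to generate feedback; in practice the normalized reward would come from the live campaign:

```python
import numpy as np

rng = np.random.default_rng(0)
n_arms = 3
true_means = [0.3, 0.5, 0.7]   # assumed, only used to simulate feedback
alpha = np.ones(n_arms)        # Beta(1, 1) prior = uniform over [0, 1]
beta = np.ones(n_arms)

for day in range(1000):
    # Sample an "expected reward" per arm from its current Beta posterior.
    sampled = rng.beta(alpha, beta)
    arm = int(np.argmax(sampled))
    # Observe the (normalized) reward in [0, 1]; here it is simulated.
    reward = float(np.clip(rng.normal(true_means[arm], 0.1), 0.0, 1.0))
    # Fractional Beta update with the normalized reward.
    alpha[arm] += reward
    beta[arm] += 1.0 - reward

print("Posterior means:", alpha / (alpha + beta))
```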
- Python:
- We can generate samples from a Beta distribution using `samples_array = np.random.beta(5.6, 4.1, size=5)`, where $\alpha, \beta$ do not have to be integers.
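`np.random.beta` also broadcasts over array arguments, so one call can draw a sample per arm (the parameter values below are just examples):

```python
import numpy as np

alphas = np.array([1.0, 5.6, 2.3])
betas = np.array([1.0, 4.1, 7.8])
samples = np.random.beta(alphas, betas)  # one draw per arm
print(samples)
```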