
Thompson Sampling

03 Aug 2025 » reinforcement learning, Thompson Sampling

Expected rewards

Beta prior

  • If rewards are bounded (e.g., in [0,1]), we can use a Beta prior.
    • If our reward is an unbounded quantity, such as profit, we can first normalize it (e.g., with min-max scaling; see the sketch after this list), then still apply a Beta prior.
  • The Beta distr. is conjugate to the Bernoulli likelihood.
    • We are trying to estimate the probability that a campaign is “good” (i.e., profitable).
    • Hence, if we get feedback as a binary signal, we use a Beta prior.
    • The Beta distr. models the probability of success in a binary outcome (profit/no profit).
  • Parameters of Beta distr.
    • Both $\alpha, \beta > 0$ are real numbers and are shape parameters (not scale parameters). This means they influence the shape of the curve without changing its support, $[0,1]$.
      • Higher $\alpha$: shifts the peak of the distribution toward the right (closer to 1).
      • Higher $\beta$: shifts the peak of the distribution toward the left (closer to 0).
      • Equal $\alpha, \beta$: the distribution is symmetric about 0.5 (e.g., a U-shape when both are less than 1, or a bell shape when both are greater than 1). When $\alpha = \beta = 1$, it becomes the uniform distribution on $[0,1]$.
    • For the binary (Bernoulli) case, the following intuition helps:
      • $\alpha$: number of “successes” seen so far.
      • $\beta$: number of “failures” seen so far.
    • Mean $\mu = \frac{\alpha}{\alpha + \beta}$
    • Variance $\sigma^2 = \frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}$, which decreases as we gather more data (see the numerical check after this list).
  • Process (a runnable sketch follows this list):
    • Initialize each arm with $\alpha=1$, $\beta=1$; Beta(1,1) is just a uniform distr. over [0,1].
    • Each round/day, we sample an “expected reward” for each arm from its Beta distr. (e.g., `np.random.beta(alpha, beta)`), then choose the arm with the highest sampled reward.
    • Then, we play that arm, observe its true (normalized) reward $r$, and update $\alpha, \beta$:
      • $\alpha \leftarrow \alpha + r$
      • $\beta \leftarrow \beta + (1 - r)$
  • Python:
    • We can generate samples from a Beta using `samples_array = np.random.beta(5.6, 4.1, size=5)`; note that $\alpha, \beta$ do not have to be integers.
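As noted above, an unbounded reward such as profit can be min-max normalized into [0,1] before the Beta machinery applies. A minimal sketch; the profit figures are made up for illustration, and in practice the bounds would come from domain knowledge or historical data rather than the batch itself:

```python
import numpy as np

# Hypothetical daily profits (unbounded); the values are illustrative only.
profits = np.array([-120.0, 40.0, 310.0, 95.0])

# Min-max normalization into [0, 1] so that a Beta prior can be applied.
lo, hi = profits.min(), profits.max()
normalized = (profits - lo) / (hi - lo)
print(normalized)  # [0.  0.372  1.  0.5] (rounded)
```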
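To sanity-check the mean and variance formulas, and the claim that the variance shrinks as evidence accumulates, here is a quick numerical check (NumPy only; the parameter values are arbitrary):

```python
import numpy as np

# Analytic mean/variance of Beta(alpha, beta) vs. empirical estimates.
alpha, beta = 5.6, 4.1  # shape parameters need not be integers
samples = np.random.beta(alpha, beta, size=100_000)

mean = alpha / (alpha + beta)
var = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))
print(f"analytic:  mean={mean:.4f}, var={var:.5f}")
print(f"empirical: mean={samples.mean():.4f}, var={samples.var():.5f}")

# Same mean (0.5) in each case, but the variance shrinks as the
# pseudo-counts grow, i.e., as we see more data.
for a, b in [(2, 2), (20, 20), (200, 200)]:
    v = a * b / ((a + b) ** 2 * (a + b + 1))
    print(f"Beta({a},{b}): var={v:.5f}")
```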
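Finally, putting the whole process together: a minimal sketch of Thompson Sampling over a few simulated campaigns. The arm count, true means, and noise level are assumptions for illustration; the agent only ever sees the normalized rewards:

```python
import numpy as np

rng = np.random.default_rng(0)

n_arms = 3
true_means = np.array([0.3, 0.5, 0.7])  # hidden from the agent; illustrative
alpha = np.ones(n_arms)                 # Beta(1,1) = uniform prior per arm
beta = np.ones(n_arms)

def pull(arm):
    """Simulate a noisy reward, already normalized into [0, 1]."""
    return float(np.clip(rng.normal(true_means[arm], 0.1), 0.0, 1.0))

for t in range(1000):
    theta = rng.beta(alpha, beta)  # one posterior sample per arm
    arm = int(np.argmax(theta))    # play the arm with the highest sample
    r = pull(arm)
    alpha[arm] += r                # fractional "success" credit
    beta[arm] += 1.0 - r           # fractional "failure" credit

print("posterior means:", np.round(alpha / (alpha + beta), 3))
print("pulls per arm:  ", np.round(alpha + beta - 2).astype(int))
```

Each pull adds exactly 1 to $\alpha + \beta$ for the chosen arm, so the posterior of a frequently played arm concentrates quickly, while rarely played arms keep wide posteriors and still receive occasional exploratory pulls. Adding fractional rewards directly to $\alpha, \beta$ is a pragmatic generalization of the 0/1 Bernoulli update; an alternative with cleaner theory is to flip a coin with probability $r$ and apply the standard binary update.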