Tho Le

A Data Scientist. Looking for knowledge!

Off-Policy Evaluation

28 Jul 2025 » ai, reinforcement learning
  • Off-Policy Evaluation (OPE): a technique in RL to estimate the performance of a target policy (i.e., the policy you want to evaluate) using data collected by a behavior policy (i.e., a different policy used to generate the data).
    • In other words, suppose you come up with a new policy A (e.g., via simulation). You want to estimate how A would perform without ever executing A in the environment, using only data logged by an existing policy B.
    • A Review of OPE in RL.
    • How it works: OPE methods typically use techniques like importance sampling to adjust the contributions of different actions and states in the historical data to account for the difference between A and B.
    • Why bother?
      • You can evaluate policies without executing them in the real world (useful in high-risk or high-cost applications such as robotics and healthcare).
      • You can leverage existing data to gain insight into the performance of different strategies.
      • You can discover better policies than those used to generate the initial data.
    • 3 core methods in OPE:
      • Inverse Propensity Scoring (IPS): Reweight each logged sample by how likely the new policy would be to choose the action the behavior (historical) policy actually took, relative to how likely the behavior policy was to choose it.
        • An example of importance sampling.
        • $V(\pi)$ is the expected reward (or value) if we were to run the new policy $\pi$ in the real environment. But we cannot sample directly from $\pi$, since all our logged data came from the behavior policy $\pi_b$, so we need importance sampling.
        • Simply put, it is like replaying the historical dataset, but rescaling each datapoint by how relevant it is under the new policy: $\hat{V}_{\text{IPS}}(\pi) = \frac{1}{n} \sum_{i=1}^{n} \frac{\pi(a_i \mid s_i)}{\pi_b(a_i \mid s_i)} r_i$ (see the sketch below).
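A minimal sketch of plain IPS on bandit-style logs. The tuple format `(state, action, reward, behavior_prob)` and the `target_policy(s, a)` callable are illustrative assumptions, not something fixed by the post:

```python
import numpy as np

def ips_value(logged, target_policy):
    """Plain IPS estimate of V(pi): (1/n) * sum_i [pi(a_i|s_i) / pi_b(a_i|s_i)] * r_i

    logged:        list of (state, action, reward, behavior_prob) tuples
    target_policy: callable returning pi(action | state) for the new policy
    """
    weights = np.array([target_policy(s, a) / p_b for (s, a, _, p_b) in logged])
    rewards = np.array([r for (_, _, r, _) in logged])
    return float(np.mean(weights * rewards))

# Toy usage with made-up numbers
logged = [("s1", "up", 1.0, 0.8), ("s1", "down", 0.0, 0.2), ("s2", "up", 0.5, 0.5)]
new_policy = lambda s, a: 0.9 if a == "up" else 0.1   # hypothetical pi(a|s)
print(ips_value(logged, new_policy))                  # rewards reweighted toward what pi would do
```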
      • Weighted Importance Sampling (WIS): Similar to IPS, but normalize the importance weights (divide by their sum instead of by $n$) to reduce variance, at the cost of a small bias.
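A sketch of WIS under the same assumed log format as the IPS example above; the only change is the normalization:

```python
import numpy as np

def wis_value(logged, target_policy):
    """Weighted (self-normalized) IS: divide by the sum of the weights rather than by n."""
    weights = np.array([target_policy(s, a) / p_b for (s, a, _, p_b) in logged])
    rewards = np.array([r for (_, _, r, _) in logged])
    return float(np.sum(weights * rewards) / np.sum(weights))
```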
      • Doubly Robust (DR) Estimator: Combine model-based and importance-weighted approaches.
        • Recall that the importance-sampling approaches build the final estimate entirely from the importance-weighted rewards.
        • Here, only part of the estimate comes from the weighted rewards. The rest comes from a model that learns to predict the expected reward.
        • The two parts are then combined into the final expected-reward estimate (see the sketch below).
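A sketch of the doubly robust estimator in its contextual-bandit form. `reward_model(s, a)` is a hypothetical learned model of expected reward, and the log format matches the earlier examples:

```python
def dr_value(logged, target_policy, reward_model, actions):
    """Doubly robust estimate:
    V_DR = (1/n) * sum_i [ sum_a pi(a|s_i) * q_hat(s_i, a)       # model-based part
                           + w_i * (r_i - q_hat(s_i, a_i)) ]     # importance-weighted correction
    where w_i = pi(a_i|s_i) / pi_b(a_i|s_i) and q_hat is the learned reward model.
    """
    total = 0.0
    for (s, a, r, p_b) in logged:
        w = target_policy(s, a) / p_b
        model_part = sum(target_policy(s, b) * reward_model(s, b) for b in actions)
        total += model_part + w * (r - reward_model(s, a))
    return total / len(logged)
```

The appeal is that the estimate remains consistent if either the reward model or the behavior-policy probabilities are accurate, which is where the "doubly robust" name comes from.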
      • If your historical data did not come from a particular policy (e.g., it was generated by business rules), you don't have a behavior policy $\pi_b(a \mid s)$ to start with. But you can estimate one:
        • Build a classification model that predicts the probability of each action given a state.
        • Remember to put a softmax at the output layer so the action probabilities sum to 1.
        • Now you have an estimate of $\pi_b(a \mid s)$ for any $(s, a)$.
        • Then plug this estimate into IPS, WIS, or DR (a sketch follows below).
        • Note that if your actions are continuous (e.g., real-valued budgets), you’ll need to use kernel density estimators or discretize your action space.
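A sketch of estimating the behavior policy from rule-based logs. A multinomial logistic regression stands in for the "classification model with a softmax output"; the feature matrix, action labels, and sizes are all made up:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical logged data: state features X and the discrete actions the rules actually took.
X = np.random.rand(500, 4)                     # 500 logged states, 4 features each
actions = np.random.randint(0, 3, size=500)    # 3 possible actions

# Multinomial logistic regression applies a softmax over actions,
# giving an estimated pi_b(a | s) for every (state, action) pair.
behavior_model = LogisticRegression(max_iter=1000).fit(X, actions)

pi_b_all = behavior_model.predict_proba(X)                  # shape (500, 3); rows sum to 1
pi_b_logged = pi_b_all[np.arange(len(actions)), actions]    # pi_b(a_i | s_i), ready for IPS/WIS/DR
```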
      • If you don't want to estimate $\pi_b$, a practical alternative is to use bootstrapping (sketched after this list).
        • Split historical data into train and validation/test campaigns.
        • Train your model (e.g., a policy $\pi(a \mid s)$ or a Q-value model).
        • On the test set, use your model to select the top action $\hat{a}_i = \arg\max_a Q(s_i, a)$ or $\hat{a}_i = \arg\max_a \pi(a \mid s_i)$.
        • Compare the actual observed rewards of:
          • Actions taken by the model.
          • Actions taken historically.
        • Bootstrapping:
          • Randomly resample the test data 1000 times.
          • Compute value estimates per resample.
          • Report:
            • Mean value estimate.
            • 95% CI.
            • Value uplift vs. baseline.
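A sketch of the bootstrap step. Here `model_rewards` is assumed to be the observed rewards on test rows where the model's top action matched the logged action (one simple way to score the model's choices offline), and `baseline_rewards` are the observed rewards of the historically taken actions; the names and that scoring choice are illustrative:

```python
import numpy as np

def bootstrap_report(model_rewards, baseline_rewards, n_boot=1000, seed=0):
    """Resample the test rows n_boot times and report mean value, 95% CI, and uplift vs. baseline."""
    model_rewards = np.asarray(model_rewards, dtype=float)
    rng = np.random.default_rng(seed)

    estimates = np.array([
        rng.choice(model_rewards, size=len(model_rewards), replace=True).mean()
        for _ in range(n_boot)
    ])
    ci_low, ci_high = np.percentile(estimates, [2.5, 97.5])
    baseline = float(np.mean(baseline_rewards))

    return {
        "mean_value": float(estimates.mean()),
        "ci_95": (float(ci_low), float(ci_high)),
        "uplift_vs_baseline": float(estimates.mean() - baseline),
    }
```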