- Off-Policy Evaluation (OPE): a technique in RL to estimate the performance of a target policy (i.e., the policy you want to evaluate) using data collected by a behavior policy (i.e., a different policy used to generate the data).
- In other words, you can come up with a new policy A (e.g., via simulation) and evaluate how A would perform in a given scenario without having to execute A in the environment, using only data logged by an existing behavior policy B.
- A Review of OPE in RL.
- How it works: OPE methods typically use techniques like importance sampling to reweight the actions and states in the historical data, accounting for the difference between the target policy A and the behavior policy B that generated the data.
- Why bother?
- You can evaluate policies without executing them in the real world (useful in high-risk / high-cost applications such as robotics and healthcare).
- You can leverage existing data to gain insight into the performance of different strategies.
- You can discover better policies than those used to generate the initial data.
- 3 core methods in OPE (a code sketch covering all three follows this list):
- Inverse Propensity Scoring (IPS): Reweight each logged sample by the ratio of the probability the new policy assigns to the logged action to the probability the behavior (historical) policy assigned to it.
- An example of importance sampling.
- $V(\pi)$ is the expected reward (or value) if we were to run the new policy $\pi$ in the real environment. Since all our logged data came from $\pi_b$, we can't sample from $\pi$ directly, so we use importance sampling: $V(\pi) \approx \frac{1}{n}\sum_{i=1}^{n} \frac{\pi(a_i \mid s_i)}{\pi_b(a_i \mid s_i)}\, r_i$.
- Simply put, it is like replaying the historical dataset, but rescaling the importance of each datapoint depending on how relevant it is under the new policy.
- Weighted Importance Sampling (WIS): Similar to IPS, but normalize the weights so they sum to one, which reduces variance (at the cost of a small bias).
- Doubly Robust (DR) Estimator: Combine model-based and importance-weighted approaches.
- Recall that the importance sampling approaches build the final estimate entirely from the importance-weighted rewards.
- Here, a learned reward model supplies a baseline estimate of the expected reward, and the importance weights are applied only to the residual (observed reward minus the model's prediction).
- The two parts are summed to give the final estimate; the result stays accurate if either the reward model or the behavior-policy estimate is good (hence "doubly robust").
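A minimal NumPy sketch of the three estimators, assuming a single-step (contextual-bandit style) setup: arrays of logged rewards, the behavior policy's probability for each logged action, the target policy's probability for the same action, and, for DR, predictions from a separately fitted reward model. All names and toy numbers are illustrative, not from the source.

```python
import numpy as np

def ips_estimate(pi_target, pi_behavior, rewards):
    """Inverse Propensity Scoring: reweight each logged reward by pi(a|s) / pi_b(a|s)."""
    weights = pi_target / pi_behavior              # importance weight w_i per logged sample
    return np.mean(weights * rewards)

def wis_estimate(pi_target, pi_behavior, rewards):
    """Weighted Importance Sampling: normalize the weights so they sum to one (lower variance)."""
    weights = pi_target / pi_behavior
    return np.sum(weights * rewards) / np.sum(weights)

def dr_estimate(pi_target, pi_behavior, rewards, q_logged, v_target):
    """Doubly Robust: model-based baseline plus an importance-weighted correction.

    q_logged[i] = reward model's prediction for the logged action, q_hat(s_i, a_i)
    v_target[i] = model's value of the target policy at s_i, sum_a pi(a|s_i) * q_hat(s_i, a)
    """
    weights = pi_target / pi_behavior
    return np.mean(v_target + weights * (rewards - q_logged))

# Toy logged data (n = 5 samples), purely illustrative.
pi_t  = np.array([0.9, 0.2, 0.5, 0.7, 0.1])    # pi(a_i | s_i) under the new policy
pi_b  = np.array([0.5, 0.5, 0.5, 0.5, 0.5])    # pi_b(a_i | s_i) under the behavior policy
r     = np.array([1.0, 0.0, 1.0, 0.0, 1.0])    # observed rewards
q_hat = np.array([0.8, 0.1, 0.6, 0.3, 0.7])    # reward model's prediction for each logged action
v_hat = np.array([0.7, 0.4, 0.5, 0.6, 0.5])    # reward model's value of the target policy per state

print(ips_estimate(pi_t, pi_b, r), wis_estimate(pi_t, pi_b, r), dr_estimate(pi_t, pi_b, r, q_hat, v_hat))
```

Note how DR mirrors the bullets above: the reward model supplies the baseline, and the importance-weighted residual corrects for the model's error.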
If your historical data did not come from a particular policy (e.g., it was generated by hand-written rules), you don't have a behavior policy $\pi_b(a \mid s)$ to start with. You can estimate one (see the sketch after this block):
- Build a classification model to predict the probability of an action given a state.
- Remember to put a softmax at the output layer so the outputs form a probability distribution over actions.
Now you have an estimate of $\pi_b(a \mid s)$ for any $(s, a)$.
- Use this estimate in IPS, WIS, or DR.
- Note that if your actions are continuous (e.g., real-valued budgets), you’ll need to use kernel density estimators or discretize your action space.
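A sketch of that estimation step, assuming discrete actions and fixed-length state feature vectors; scikit-learn's LogisticRegression is used here only as one convenient classifier whose predict_proba output acts like a softmax over actions. The shapes and data are made up.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical logged data: state features X (n x d) and the discrete action taken in each state.
rng = np.random.default_rng(0)
X_states = rng.normal(size=(500, 4))
actions = rng.integers(0, 3, size=500)        # 3 discrete actions, purely illustrative

# Fit a multiclass classifier; predict_proba returns a probability distribution over actions,
# which we treat as the estimated behavior policy pi_b(a | s).
clf = LogisticRegression(max_iter=1000).fit(X_states, actions)
pi_b_hat = clf.predict_proba(X_states)        # shape (n, num_actions)

# Probability the estimated behavior policy assigns to the action that was actually logged.
pi_b_logged = pi_b_hat[np.arange(len(actions)), actions]
```

These `pi_b_logged` values are what go in the denominator of the IPS/WIS/DR estimators above.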
- If you don't want to estimate $\pi_b$, a practical alternative is to use bootstrapping.
- Split historical data into train and validation/test campaigns.
- Train your model (e.g., a policy $\pi(a \mid s)$ or a Q-value model) on the training split.
- On the test set, use your model to select the top action, $\hat{a}_i = \arg\max_a Q(s_i, a)$ or $\arg\max_a \pi(a \mid s_i)$.
- Compare the actual observed rewards of:
- Actions taken by the model.
- Actions taken historically.
- Bootstrapping (see the sketch after this list):
- Randomly resample the test data 1000 times.
- Compute value estimates per resample.
- Report:
- Mean value estimate.
- 95% CI.
- Value uplift vs. baseline.
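A sketch of the bootstrap report, assuming you already have per-sample value estimates on the test split for the model's chosen actions and for the historical baseline (e.g., matched observed rewards, or the importance-weighted values from the estimators above). The function name, inputs, and toy data are illustrative.

```python
import numpy as np

def bootstrap_report(model_values, baseline_values, n_resamples=1000, seed=0):
    """Resample the test set with replacement; summarize mean value, 95% CI, and uplift."""
    rng = np.random.default_rng(seed)
    n = len(model_values)
    model_means, uplifts = [], []
    for _ in range(n_resamples):
        idx = rng.integers(0, n, size=n)        # one bootstrap resample of the test indices
        model_means.append(model_values[idx].mean())
        uplifts.append(model_values[idx].mean() - baseline_values[idx].mean())
    model_means, uplifts = np.array(model_means), np.array(uplifts)
    return {
        "mean_value": model_means.mean(),
        "value_ci_95": np.percentile(model_means, [2.5, 97.5]),
        "mean_uplift_vs_baseline": uplifts.mean(),
        "uplift_ci_95": np.percentile(uplifts, [2.5, 97.5]),
    }

# Illustrative per-sample value estimates on the test split.
rng = np.random.default_rng(1)
model_vals = rng.random(200)
baseline_vals = rng.random(200) * 0.9
print(bootstrap_report(model_vals, baseline_vals))
```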