Terminology
Evaluation
- Off-Policy Evaluation (OPE): a technique in RL for estimating the performance of a target policy (the policy you want to evaluate) using data collected by a behavior policy (a different policy that generated the data); see the sketch below.
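One common OPE estimator is ordinary (trajectory-wise) importance sampling. Below is a minimal sketch, assuming logged episodes that include the behavior policy's action probabilities; the data layout and function names (`ois_value`, `target_prob`) are illustrative assumptions, not a standard API.

```python
import numpy as np

def ois_value(episodes, target_prob, gamma=0.99):
    """Ordinary importance sampling estimate of the target policy's value.

    episodes: list of episodes, each a list of
        (state, action, reward, behavior_prob) tuples logged under the
        behavior policy (layout assumed for illustration).
    target_prob: function (state, action) -> probability of that action
        under the target policy.
    """
    estimates = []
    for episode in episodes:
        weight, ret = 1.0, 0.0
        for t, (s, a, r, b_prob) in enumerate(episode):
            weight *= target_prob(s, a) / b_prob  # cumulative importance ratio
            ret += (gamma ** t) * r               # discounted return
        estimates.append(weight * ret)            # reweighted return
    return float(np.mean(estimates))
```

Weighted (self-normalized) IS or doubly robust estimators usually give lower variance in practice.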
Offline RL
- Conservative Q-Learning (CQL): penalizes Q-values on actions outside the data support to prevent value overestimation; see the sketch after this list.
- Implicit Q-Learning (IQL): strong, simple, and stable; avoids querying out-of-distribution actions and needs no importance sampling.
- TD3+BC / BCQ / BRAC / AWAC: suitable when the action space is continuous (e.g., a budget value).
- FQI/FQE (Fitted Q-Iteration/Fitted Q-Evaluation): classical, strong baselines.
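To make the conservative idea concrete, here is a minimal sketch of the CQL(H) penalty added to a standard TD loss, assuming discrete actions and PyTorch networks named `q_net` and `target_q_net` (all names and the batch layout are assumptions for illustration, not a reference implementation):

```python
import torch
import torch.nn.functional as F

def cql_loss(q_net, target_q_net, batch, gamma=0.99, alpha=1.0):
    """Standard TD loss plus the CQL penalty, which pushes Q-values down
    on all actions while pushing them up on dataset actions."""
    s, a, r, s_next, done = batch     # offline-dataset tensors; a: long, done: float
    q_all = q_net(s)                  # (batch, num_actions)
    q_taken = q_all.gather(1, a.unsqueeze(1)).squeeze(1)

    with torch.no_grad():             # bootstrapped TD target
        target = r + gamma * (1.0 - done) * target_q_net(s_next).max(1).values

    td_loss = F.mse_loss(q_taken, target)
    # CQL(H) penalty: logsumexp over all actions minus Q on dataset actions.
    penalty = (torch.logsumexp(q_all, dim=1) - q_taken).mean()
    return td_loss + alpha * penalty
```

Here alpha trades off conservatism against TD accuracy: larger values push Q-values down harder on actions the dataset never took.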
Limitations
- Real-time use can be limited.
- Learning effective policies typically requires many interactions with the environment.
- Slow inference or policy updates in deep RL algorithms prevent applications such as high-frequency trading or robotics.
- Rewards in RL are often delayed, which makes training computationally expensive, especially in dynamic environments where feedback loops must be rapid.
- Potential solutions:
    - Use offline RL on past data to solve the cold-start problem.
    - Use causal RL to extrapolate beyond the current policy.
    - Use simulated data from digital twins (especially in robotics).