- How to Measure the Reliability of a Large Language Model’s Response
- (DeepLearning.AI course) Evaluating AI Agents.
- Use cases: evaluating a shopping assistant, a coding agent, or a research assistant. Each needs a structured evaluation process that evaluates every component of the agent as well as its end-to-end performance.
- This helps you identify areas for improvement, similar to error analysis in supervised learning.
- Code-based evals: write explicit code to test a specific step (see the code-based sketch after this list).
- LLM-as-a-Judge evals: prompt an LLM to grade more open-ended outputs that are hard to check with explicit code (see the judge sketch after this list).
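A minimal sketch of a code-based eval for one step of a shopping assistant: a plain Python check on the agent's output. The function name and the price-extraction task are hypothetical examples, not from the course.

```python
import re

def eval_price_extraction(agent_output: str, expected_price: float) -> bool:
    """Code-based check: does the agent's text state the expected price?"""
    # Look for a dollar amount like "$42.50" in the agent's output.
    match = re.search(r"\$(\d+(?:\.\d{2})?)", agent_output)
    if match is None:
        return False
    # Compare against the expected value with a small tolerance.
    return abs(float(match.group(1)) - expected_price) < 0.01

# Example: evaluating a single step's output.
print(eval_price_extraction("The total comes to $42.50.", 42.50))  # True
```

Because the check is deterministic, it can run on every test case in CI and gives an unambiguous pass/fail signal for that step.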
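A minimal LLM-as-a-Judge sketch, assuming an OpenAI-compatible Python client; the judge prompt, scoring scale, and model name are illustrative assumptions, not the course's setup.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative judge prompt with a 1-5 scale; adapt to your own criteria.
JUDGE_PROMPT = """You are grading a research assistant's answer.
Question: {question}
Answer: {answer}
Score the answer from 1 (unhelpful) to 5 (fully addresses the question),
then explain briefly. Respond as:
SCORE: <n>
REASON: <text>"""

def judge_answer(question: str, answer: str) -> str:
    """Ask a judge LLM to grade an open-ended answer."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model; any capable judge model works
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
        temperature=0,  # keep grading as deterministic as possible
    )
    return response.choices[0].message.content

# Example usage:
# print(judge_answer("What is RAG?", "RAG retrieves documents before generating."))
```

The judge's score and reason can then be parsed and aggregated across a test set, the same way code-based pass/fail results are.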