The assistant node is an LLM that determines whether to invoke a tool based upon the input. The tool condition checks whether a tool was selected by the assistant node and, if so, routes to the tool node. The tool node executes the tool and returns the output as a tool message to the assistant node. This loop continues as long as the assistant node selects a tool. If no tool is selected, the agent directly returns the LLM response.
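In plain Python, the control flow looks roughly like the sketch below; `call_assistant` and `run_tool` are hypothetical stand-ins for the LLM call and tool execution, not a specific library API.

```python
# Minimal sketch of the assistant node / tool node loop described above.
# call_assistant and run_tool are hypothetical placeholders, not a library API.

def call_assistant(messages: list[dict]) -> dict:
    """Call the LLM; it may return a tool call or a plain answer (stubbed here)."""
    return {"role": "assistant", "content": "final answer", "tool_call": None}

def run_tool(tool_call: dict) -> dict:
    """Execute the selected tool and wrap its output as a tool message (stubbed here)."""
    return {"role": "tool", "content": f"result of {tool_call['name']}"}

def run_agent(user_input: str) -> str:
    messages = [{"role": "user", "content": user_input}]
    while True:
        response = call_assistant(messages)                  # assistant node
        messages.append(response)
        if response.get("tool_call") is None:                # tool condition
            return response["content"]                       # no tool selected: return the LLM response
        messages.append(run_tool(response["tool_call"]))     # tool node, output fed back to the assistant
```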
- Final response: Evaluate the agent's final response.
- Single step: Evaluate any agent step in isolation (e.g., whether it selects the appropriate tool).
- Trajectory: Evaluate whether the agent took the expected path (e.g., of tool calls) to arrive at the final answer (a minimal check of this kind is sketched below).
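For example, a trajectory evaluator can be as simple as comparing the sequence of tools the agent actually called against an expected sequence; the score format below is an illustrative assumption.

```python
def trajectory_match(actual_tool_calls: list[str], expected_tool_calls: list[str]) -> dict:
    """Score 1.0 only if the agent called the expected tools in the expected order."""
    exact = actual_tool_calls == expected_tool_calls
    return {"key": "trajectory_exact_match", "score": 1.0 if exact else 0.0}

# The agent was expected to search first, then use the calculator.
print(trajectory_match(["search", "calculator"], ["search", "calculator"]))  # score 1.0
print(trajectory_match(["calculator"], ["search", "calculator"]))            # score 0.0
```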
Turning to RAG: for a deeper dive, see the RAG From Scratch series. LLM-as-judge is a commonly used evaluator for RAG because it's an effective way to evaluate factual accuracy or consistency between texts.
- Offline evaluation: Use offline evaluation for any prompts that rely on a reference answer. This is most commonly used for RAG answer correctness evaluation, where the reference is a ground truth (correct) answer.
- Online evaluation: Employ online evaluation for any reference-free prompts. This allows you to assess the RAG application's performance in real-time scenarios.
- Pairwise evaluation: Utilize pairwise evaluation to compare answers produced by different RAG chains. This evaluation focuses on user-specified criteria (e.g., answer format or style) rather than correctness, which can be evaluated using self-consistency or a ground truth reference.
| Evaluator | Detail | Needs reference output | LLM-as-judge? | Pairwise relevant |
|---|---|---|---|---|
| Document relevance | Are documents relevant to the question? | No | Yes - prompt | No |
| Answer faithfulness | Is the answer grounded in the documents? | No | Yes - prompt | No |
| Answer helpfulness | Does the answer help address the question? | No | Yes - prompt | No |
| Answer correctness | Is the answer consistent with a reference answer? | Yes | Yes - prompt | No |
| Pairwise comparison | How do multiple answer versions compare? | No | Yes - prompt | Yes |
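To illustrate the one reference-based row above (answer correctness), an LLM-as-judge evaluator can prompt a grader model to compare the chain's answer with the ground truth answer. This is only a sketch; `judge_llm` is a placeholder for whatever grader model call you use.

```python
CORRECTNESS_PROMPT = """You are grading a RAG system's answer against a reference answer.

Question: {question}
Reference answer: {reference}
Student answer: {answer}

Is the student answer factually consistent with the reference answer?
Reply with a single word: CORRECT or INCORRECT."""

def judge_llm(prompt: str) -> str:
    """Placeholder: send the prompt to your grader model and return its reply."""
    raise NotImplementedError("plug in your grader model call here")

def answer_correctness(question: str, answer: str, reference: str) -> dict:
    """Reference-based (offline) evaluator: 1.0 if the judge says CORRECT, else 0.0."""
    verdict = judge_llm(CORRECTNESS_PROMPT.format(
        question=question, reference=reference, answer=answer))
    score = 1.0 if verdict.strip().upper().startswith("CORRECT") else 0.0
    return {"key": "answer_correctness", "score": score}
```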
Developer-curated examples of texts to summarize are commonly used for evaluation (see a dataset example here). However, user logs from a production (summarization) app can be used for online evaluation with any of the reference-free evaluation prompts below.
LLM-as-judge is typically used to evaluate summarization (as well as other types of writing) using reference-free prompts that follow provided criteria to grade a summary. It is less common to provide a particular reference summary, because summarization is a creative task and there are many possible correct answers.
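A reference-free criteria prompt of this kind might look like the sketch below; the criteria, prompt wording, and the judge callable are illustrative assumptions rather than a fixed API.

```python
SUMMARY_CRITERIA_PROMPT = """You are grading a summary of a source document on a 1-5 scale.
Criteria: the summary is concise, covers the key points, and adds no unsupported claims.

Source document:
{document}

Summary:
{summary}

Reply with a single integer from 1 (poor) to 5 (excellent)."""

def grade_summary(document: str, summary: str, judge) -> dict:
    """`judge` is any callable that sends a prompt to a grader LLM and returns its text."""
    raw = judge(SUMMARY_CRITERIA_PROMPT.format(document=document, summary=summary))
    digits = [c for c in raw if c.isdigit()]
    score = int(digits[0]) if digits else None   # None if the judge reply is unparseable
    return {"key": "summary_quality", "score": score}
```

Note that nothing here requires a reference summary, which is what makes the same evaluator usable both offline and online.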
Online or offline evaluation is feasible because the prompts are reference-free. Pairwise evaluation is also a powerful way to compare different summarization chains (e.g., different summarization prompts or LLMs):
| Use Case | Detail | Needs reference output | LLM-as-judge? | Pairwise relevant |
|---|---|---|---|---|
| Factual accuracy | Is the summary accurate relative to the source documents? | No | Yes - prompt | Yes |
| Faithfulness | Is the summary grounded in the source documents (e.g., no hallucinations)? | No | Yes - prompt | Yes |
| Helpfulness | Is the summary helpful relative to the user's need? | No | Yes - prompt | Yes |
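A pairwise comparison between two summarization chains follows the same pattern, with the judge asked to pick a preferred summary against user-specified criteria (again, the prompt and judge callable are placeholders).

```python
PAIRWISE_PROMPT = """Two summaries of the same document are shown below.
Criteria: prefer the summary that is more faithful to the document and easier to read.

Document:
{document}

Summary A:
{summary_a}

Summary B:
{summary_b}

Reply with exactly one letter: A or B."""

def pairwise_preference(document: str, summary_a: str, summary_b: str, judge) -> dict:
    """Ask the judge which of two summarization chains it prefers for this document."""
    verdict = judge(PAIRWISE_PROMPT.format(
        document=document, summary_a=summary_a, summary_b=summary_b)).strip().upper()
    return {"key": "pairwise_preference", "value": "A" if verdict.startswith("A") else "B"}
```

Because LLM judges can exhibit positional bias, a common design choice is to randomize which chain's output appears as A versus B.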
Classification and tagging evaluation hinges on whether you have ground truth reference labels or not. If not, users frequently want to define an evaluator that uses criteria to apply a label (e.g., toxicity) to an input (e.g., a text or user question). However, if ground truth class labels are provided, then the evaluation objective is to score a classification/tagging chain relative to the ground truth class label (e.g., using metrics such as precision and recall).
If ground truth reference labels are provided, then it's common to simply define a custom heuristic evaluator that compares the ground truth labels to the chain output. However, given the emergence of LLMs, it is increasingly common to simply use LLM-as-judge to perform the classification/tagging of an input based upon specified criteria (without a ground truth reference).
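As a sketch of the ground-truth case, the per-example heuristic evaluator can be an exact-match check, with precision and recall computed over the whole dataset afterwards; the list-of-labels data format here is an assumption.

```python
def exact_match(predicted_label: str, reference_label: str) -> dict:
    """Per-example heuristic evaluator: does the chain's label match the ground truth?"""
    return {"key": "exact_match", "score": 1.0 if predicted_label == reference_label else 0.0}

def precision_recall(predictions: list[str], references: list[str], positive_label: str) -> dict:
    """Dataset-level precision and recall for one class of a classification/tagging chain."""
    pairs = list(zip(predictions, references))
    tp = sum(1 for p, r in pairs if p == positive_label and r == positive_label)
    fp = sum(1 for p, r in pairs if p == positive_label and r != positive_label)
    fn = sum(1 for p, r in pairs if p != positive_label and r == positive_label)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"precision": precision, "recall": recall}

# Example: a toxicity tagger graded against ground truth labels.
print(precision_recall(["toxic", "ok", "toxic"], ["toxic", "toxic", "ok"], positive_label="toxic"))
```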
Online or offline evaluation is feasible when using LLM-as-judge with a reference-free prompt. In particular, this is well suited to online evaluation when a user wants to tag or classify application input (e.g., for toxicity).
| Use Case | Detail | Needs reference output | LLM-as-judge? | Pairwise relevant |
|---|---|---|---|---|
| Accuracy | Standard definition | Yes | No | No |
| Precision | Standard definition | Yes | No | No |
| Recall | Standard definition | Yes | No | No |