Skip to main content



Evaluation and benchmarking are crucial concepts in LLM development. To improve the perfomance of an LLM app (RAG, agents) you must have a way to measure it.

LlamaIndex offers key modules to measure the quality of generated results. We also offer key modules to measure retrieval quality.

  • Response Evaluation: Does the response match the retrieved context? Does it also match the query? Does it match the reference answer or guidelines?
  • Retrieval Evaluation: Are the retrieved sources relevant to the query?

Response Evaluation

Evaluation of generated results can be difficult, since unlike traditional machine learning the predicted result is not a single number, and it can be hard to define quantitative metrics for this problem.

LlamaIndex offers LLM-based evaluation modules to measure the quality of results. This uses a “gold” LLM (e.g. GPT-4) to decide whether the predicted answer is correct in a variety of ways.

Note that many of these current evaluation modules do not require ground-truth labels. Evaluation can be done with some combination of the query, context, response, and combine these with LLM calls.

These evaluation modules are in the following forms:

  • Correctness: Whether the generated answer matches that of the reference answer given the query (requires labels).

  • Faithfulness: Evaluates if the answer is faithful to the retrieved contexts (in other words, whether if there’s hallucination).

  • Relevancy: Evaluates if the response from a query engine matches any source nodes.