When evaluating the output of a large language model (LLM) in situations where there’s no clear right or wrong answer, you’ll need to define your own evaluation process. This often involves creating a scoring framework—such as a 1-10 scale—based on criteria that matter for your use case (e.g., clarity, correctness, relevance, creativity, tone, or helpfulness). This framework is then used to assign scores manually to a small set of responses to understand how well the model is performing. Additionally, providing an "ideal" or reference response can be helpful for more consistency if you are putting responses in front of multiple scorers.
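As a concrete illustration of such a scoring framework, the sketch below records manual 1-10 scores per criterion for a single response and averages them. The criteria names, dataclass, and unweighted average are assumptions for illustration, not a prescribed implementation:

```python
from dataclasses import dataclass

# Hypothetical rubric: the criteria and 1-10 scale are illustrative choices.
RUBRIC_CRITERIA = ["clarity", "correctness", "relevance", "tone"]

@dataclass
class ScoredResponse:
    response: str
    scores: dict  # criterion name -> integer score in 1..10

    def validate(self) -> None:
        # Reject scores outside the rubric or outside the 1-10 scale.
        for criterion, score in self.scores.items():
            if criterion not in RUBRIC_CRITERIA:
                raise ValueError(f"unknown criterion: {criterion}")
            if not 1 <= score <= 10:
                raise ValueError(f"score out of range for {criterion}: {score}")

    def overall(self) -> float:
        # Simple unweighted average across criteria; weighting is a design choice.
        return sum(self.scores.values()) / len(self.scores)

scored = ScoredResponse(
    response="The model's answer text...",
    scores={"clarity": 8, "correctness": 9, "relevance": 7, "tone": 8},
)
scored.validate()
print(scored.overall())  # 8.0
```

In practice you might weight criteria differently (e.g., correctness more heavily than tone) depending on your use case.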
This approach essentially converts qualitative outputs into something quantitative, allowing you to apply more standard evaluation metrics. While still vulnerable to some subjectivity, it makes model performance more objective and comparable over time.
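Once scores are quantitative, standard aggregation makes model versions comparable over time. The snippet below is a minimal sketch with made-up scores for two hypothetical model versions, averaging each criterion so the versions can be compared side by side:

```python
from statistics import mean

# Hypothetical 1-10 scores collected from human reviewers for two model versions.
scores_v1 = {"clarity": [7, 6, 6], "relevance": [5, 6, 6]}
scores_v2 = {"clarity": [8, 8, 9], "relevance": [7, 7, 8]}

def summarize(scores_by_criterion):
    # Collapse each criterion's scores into a single comparable average.
    return {c: round(mean(vals), 2) for c, vals in scores_by_criterion.items()}

print(summarize(scores_v1))  # {'clarity': 6.33, 'relevance': 5.67}
print(summarize(scores_v2))  # {'clarity': 8.33, 'relevance': 7.33}
```

Tracking these per-criterion averages across releases is one simple way to make "is the model getting better?" an answerable question.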
A common follow-up question in response to this approach is whether LLMs can perform the scoring instead of human scorers.
Responses can be scored by another LLM, a practice referred to as LLM-as-a-judge. This approach is faster and more scalable, especially when human evaluation is costly. However, it comes with trade-offs: judge models can carry their own biases (for example, favoring longer or more verbose responses), and their scores may drift unless the judging prompt and rubric are kept consistent.
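To make the LLM-as-a-judge pattern concrete, here is a minimal sketch of how a judging prompt might be assembled and the judge's reply validated. The prompt wording, criteria, JSON reply format, and sample reply are all assumptions for illustration; the actual call to a judge model is left as a comment since it depends on your provider:

```python
import json

def build_judge_prompt(question, response, reference=None):
    # Assemble an evaluation prompt; the rubric and 1-10 scale are illustrative.
    parts = [
        "You are grading a model response on a 1-10 scale for clarity, "
        "correctness, and relevance.",
        f"Question: {question}",
        f"Response to grade: {response}",
    ]
    if reference:
        # Including an ideal/reference answer tends to make judging more consistent.
        parts.append(f"Reference answer: {reference}")
    parts.append('Reply with JSON like {"clarity": 8, "correctness": 9, "relevance": 7}.')
    return "\n\n".join(parts)

def parse_judge_reply(reply_text):
    # Parse the judge's JSON scores and sanity-check the scale.
    scores = json.loads(reply_text)
    if not all(1 <= v <= 10 for v in scores.values()):
        raise ValueError("judge returned a score outside the 1-10 scale")
    return scores

prompt = build_judge_prompt("What is HTTP?", "HTTP is a protocol for the web.")
# In practice you would send `prompt` to a judge model here; we parse a sample reply.
sample_reply = '{"clarity": 8, "correctness": 9, "relevance": 9}'
print(parse_judge_reply(sample_reply))
```

Validating the reply format matters in practice: judge models sometimes return malformed or out-of-range output, and silently accepting it would corrupt your evaluation data.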
We highly recommend beginning with human evaluators, if possible, when you first build and evaluate your model.
Additional Resources:
This response has been generated by an LLM based on notes from PJMF technical consultations. All responses go through human review by our PJMF Products & Services team and are anonymized to protect our consultation participants.