The following article isn’t sponsored by any organization and reflects solely the opinions and observations of the author.
I recently attended a software seminar on generative AI (GenAI) model evaluation. The event was hosted by TrackIt and AWS. The session promised to explore “how modern AI teams are benchmarking, testing, and optimizing large language models for production environments,” but the most valuable takeaway was a discussion of how best to evaluate LLMs (large language models, the generative AI models behind products such as ChatGPT and Claude Code) for real-world business use.
I found this topic particularly interesting because there is little guidance available on how organizations should choose an LLM for a specific project or business need. While model capabilities are discussed extensively, the process of evaluating and selecting a model is often overlooked.
This slide was crucial:
It summarizes the three primary approaches to LLM evaluation:
- Algorithmic (deterministic)
- LLM-as-judge
- Operational (cost, speed, etc.)
The first and third are relatively straightforward. Operational considerations such as cost, latency, and vendor support are often the first factors engineering teams evaluate when selecting software. Similarly, there are dozens of deterministic LLM-evaluating benchmarks, such as Humanity’s Last Exam, GPQA Diamond, and others. There are numerous websites dedicated to comparing LLMs using these benchmarks.
What was novel to me was the LLM-as-judge approach: using one LLM to evaluate the output of another. In this approach, you define the evaluation criteria and have a separate LLM score the responses generated by the model under test. Although this requires some upfront effort to design effective evaluation criteria, those criteria can be reused across all of your future evaluations.
This approach is especially valuable when your intended usage has no objective ground truth, or when quality depends on subjective factors. While traditional benchmarks can provide insight into general model performance, they may not accurately predict how well a model performs on a specialized business task. For example, a benchmark score will not tell you which model is best at transforming a chef’s rough notes into polished, consumer-ready recipes for a new cooking website.
Let’s imagine you are building a cooking platform and have a chef who sketches recipe ideas in shorthand. You want an LLM to convert those rough notes into clear, complete, and professionally written recipes. To evaluate candidate models, you would create a collection of representative inputs (the chef's notes and the model instructions) and ideal outputs (finished recipes). Each model under evaluation would generate responses for the same inputs, and the judge model would compare those responses against the expected outputs using criteria you define. The resulting scores would provide a task-specific evaluation tailored to your business needs.
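To make that loop concrete, here is a minimal Python sketch of how I picture it. It is not code from the seminar. Every name in it (`Example`, `JUDGE_PROMPT`, `evaluate`, `stub_judge`, the two stand-in models) is hypothetical, and the judge is a local stub rather than a real LLM call, so the script runs without any API key. In practice you would replace the stub and the candidate functions with calls to whichever model APIs you use.

```python
import re
from dataclasses import dataclass
from typing import Callable

# All names here are hypothetical and exist only for this sketch.


@dataclass
class Example:
    chef_notes: str    # rough input handed to the model under test
    ideal_recipe: str  # reference output a human wrote or approved


# Evaluation criteria live in the judge prompt, so they can be reused later.
JUDGE_PROMPT = """You are grading a recipe rewrite.
Criteria: completeness, clarity, and faithfulness to the chef's notes.

Chef's notes:
{notes}

Ideal recipe:
{ideal}

Candidate recipe:
{candidate}

Reply with a single integer from 1 (poor) to 5 (excellent)."""


def parse_score(reply: str) -> int:
    """Extract the first standalone digit 1-5 from the judge's reply."""
    match = re.search(r"\b([1-5])\b", reply)
    if match is None:
        raise ValueError(f"no 1-5 score in judge reply: {reply!r}")
    return int(match.group(1))


def evaluate(
    candidates: dict[str, Callable[[str], str]],
    judge: Callable[[str], str],
    dataset: list[Example],
) -> dict[str, float]:
    """Run every candidate on the same inputs and average the judge's scores."""
    if not dataset:
        raise ValueError("dataset must contain at least one example")
    results = {}
    for name, generate in candidates.items():
        scores = []
        for ex in dataset:
            output = generate(ex.chef_notes)
            prompt = JUDGE_PROMPT.format(
                notes=ex.chef_notes, ideal=ex.ideal_recipe, candidate=output
            )
            scores.append(parse_score(judge(prompt)))
        results[name] = sum(scores) / len(scores)
    return results


def stub_judge(prompt: str) -> str:
    """Placeholder for a real judge LLM: rewards candidates with a 'Steps:' section."""
    candidate = prompt.split("Candidate recipe:\n", 1)[1].split("\n\nReply with", 1)[0]
    return "5" if "Steps:" in candidate else "2"


dataset = [
    Example(
        chef_notes="roast chicken, lemon, thyme, 425F 1hr",
        ideal_recipe="Ingredients: 1 chicken, 1 lemon, thyme.\nSteps: Roast at 425F for 1 hour.",
    )
]

# Stand-ins for the models under test; real ones would call an LLM API.
candidates = {
    "model_a": lambda notes: f"Ingredients: {notes}\nSteps: Follow the notes in order.",
    "model_b": lambda notes: notes,  # just echoes the notes back
}

scores = evaluate(candidates, stub_judge, dataset)
print(scores)
```

Keeping the judge as a plain function that takes a prompt and returns text means you can swap judge models without touching the rest of the harness.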
Some tips:
- Pick one or two judge models. It may seem more thorough to use many different judges, but doing so adds significant complexity while providing limited additional value. In most cases, your current production model or a capable, low-cost model is sufficient.
- Use a small scoring range, such as 1–5. Quality assessments are inherently subjective, so a graded scale captures more than a binary pass/fail. At the other extreme, a huge range (e.g., 1–100) implies a false level of precision.
- Don’t rely solely on LLM-as-judge evaluations. Combine subjective evaluations with the algorithmic and operational metrics. Those quick, easy, and deterministic methods add important color to the subjective method discussed here.
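One way to combine the three kinds of evidence, sketched in Python under my own assumptions: treat the operational limits as hard filters, then rank the remaining models by a weighted blend of the judge score and a deterministic pass rate. The `ModelReport` fields, the 0.7 weight, the budget thresholds, and all of the numbers below are illustrative placeholders, not measurements of any real model.

```python
from dataclasses import dataclass

# Hypothetical structure and made-up numbers, for illustration only.


@dataclass
class ModelReport:
    name: str
    judge_score: float      # mean LLM-as-judge score on the 1-5 scale
    check_pass_rate: float  # fraction of deterministic checks passed, 0-1
    cost_per_1k: float      # cost per 1,000 requests (placeholder unit: USD)
    p95_latency_s: float    # 95th-percentile latency in seconds


def quality(report: ModelReport, judge_weight: float = 0.7) -> float:
    """Blend subjective and algorithmic scores into one 0-1 number."""
    judge_norm = (report.judge_score - 1) / 4  # map the 1-5 scale onto 0-1
    return judge_weight * judge_norm + (1 - judge_weight) * report.check_pass_rate


def shortlist(
    reports: list[ModelReport], max_cost_per_1k: float, max_p95_latency_s: float
) -> list[ModelReport]:
    """Drop models that miss the operational budget, then rank by quality."""
    eligible = [
        r
        for r in reports
        if r.cost_per_1k <= max_cost_per_1k and r.p95_latency_s <= max_p95_latency_s
    ]
    return sorted(eligible, key=quality, reverse=True)


reports = [
    ModelReport("model_a", judge_score=4.6, check_pass_rate=0.90, cost_per_1k=12.0, p95_latency_s=3.5),
    ModelReport("model_b", judge_score=4.2, check_pass_rate=0.95, cost_per_1k=2.0, p95_latency_s=1.2),
    ModelReport("model_c", judge_score=3.1, check_pass_rate=0.99, cost_per_1k=0.5, p95_latency_s=0.6),
]

ranked = shortlist(reports, max_cost_per_1k=5.0, max_p95_latency_s=2.0)
for r in ranked:
    print(r.name, round(quality(r), 3))
```

In this made-up example the model with the highest judge score is excluded because it exceeds both the cost and latency limits, which is the kind of trade-off a benchmark ranking alone would hide.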
The key lesson I took away from the seminar is that selecting an LLM should not be based solely on benchmark rankings. The best model for a business is often the one that performs best on the organization's specific tasks while balancing algorithmic performance, subjective quality, and operational requirements.

