This is something I've struggled with for my site. I made https://aimodelreview.com/ to compare the outputs of LLMs across a variety of prompts and categories, allowing a side-by-side comparison between them. I ran each prompt 4 times for each model, with the different temperature values available as toggles.
My thinking was to just make the responses available to users and let them see how models perform. But from some feedback, it turns out users don't want to evaluate the answers themselves and would rather see a leaderboard and rankings.
The scalable solution to that would be LLM-as-judge, which some benchmarks already use, but that just feels wrong to me.
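For reference, LLM-as-judge usually boils down to something like this rough sketch (the OpenAI client, the gpt-4o-mini judge model, and the rubric are just placeholder assumptions, not what my site does):

```python
# Minimal LLM-as-judge sketch: have a separate "judge" model grade each response.
# The judge model, prompt wording, and 1-10 rubric here are illustrative only.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading a model's answer to a prompt.
Score it 1-10 for accuracy, clarity, and completeness, then reply
with only a JSON object like {{"score": 7, "reason": "..."}}.

Prompt:
{prompt}

Answer:
{answer}
"""

def judge(prompt: str, answer: str) -> str:
    """Ask the judge model to score one response; returns its raw JSON reply."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # judge model choice is an assumption
        temperature=0,        # keep grading as deterministic as possible
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(prompt=prompt, answer=answer),
        }],
    )
    return resp.choices[0].message.content

print(judge("Explain TCP slow start.", "TCP slow start ramps up the congestion window..."))
```

It scales to thousands of responses, but you're ultimately ranking models by another model's opinion, which is exactly what feels off to me.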
LM Arena tries to solve this with a crowdsourced approach, but I think the right method would be domain-expert human reviewers (think Wirecutter vs. IMDb), and that is expensive to pull off.