You are probably familiar with the long list of benchmarks that new
models are tested on and compared against. These benchmarks are supposedly
designed to assess a model’s ability in language understanding, logical
reasoning, information recall, and so on. However, while I understand the need
for an objective and scientific measurement scale, I have long felt that these
benchmarks are not particularly representative of the actual experience of
using the models. For example, people will claim that a model performs at
“some percentage of GPT-3”, and yet in my experience not one of these models
has been able to produce correctly functioning code for any non-trivial task or
follow a line of reasoning. Talking to GPT-3, I have felt that the model has an
actual in-depth understanding of the text, question, or argument, whereas the
other models that I have tried always feel as though they have only a
surface-level understanding, regardless of what the benchmarks claim.

My most recent frustration, and the one that prompted this post, concerns the
newly released OpenOrca preview 2 model. The benchmark numbers claim that it
performs better than other 13B models at the time of writing, supposedly
outperforms Microsoft’s own published benchmark results for their yet-unreleased
model, and scores an “average” result of 74.0% against GPT-3’s 75.7%, while the
LLaMA model that I was using previously apparently scores merely 63%. I’ve used
GPT-3 (text-davinci-003), and this model does not come anywhere close to it.
Even giving it as fair a chance as I can, with plenty of leeway and benefit of
the doubt, not only can it still not write correct code (or even valid code in
a lot of cases), but it is significantly worse at it than LLaMA 13B (which is
also pretty bad).

This model fails at basic reasoning tasks. It will write a long step-by-step
explanation of what it claims it will do, but the answer itself contradicts the
provided steps, or the steps themselves are wrong or illogical. The model has
only learnt to produce “step by step reasoning” as an output format, and has a
worse understanding of what that actually means than other models do when asked
to “explain your reasoning” (at least, for the other models that I have tried,
asking them to explain their reasoning produces at least a marginal improvement
in coherence).

There is something wrong with these benchmarks. They do not relate to
real-world performance. They do not appear to be measuring a model’s ability to
actually understand the prompt or task, but possibly only its ability to
provide an output that “looks correct” according to some format.
These benchmarks are not a reliable way to compare model performance, and as
long as we keep using them we will keep producing models that score higher on
benchmarks and claim to perform “almost as good as GPT-3” yet fail
spectacularly at any task or prompt that I can think of to throw at them.

(I keep using coding as an example; however, I have also tried other tasks
besides code, as I realise that code is possibly a particularly challenging
task due to requirements like exact syntax. My interpretation of the various
models’ level of understanding is based on experience across a variety of
tasks.)
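
To make the “looks correct” versus “is correct” distinction concrete, here is a minimal Python sketch. It is not how any of the benchmarks mentioned above actually score models (I don’t know their internals); the function names `looks_correct` and `is_correct` and the toy `candidate` output are made up purely for illustration.

```python
import re

def looks_correct(source: str) -> bool:
    """Format-only check: does the text resemble a Python function?"""
    has_def = re.search(r"^def \w+\(.*\):", source, re.MULTILINE) is not None
    return has_def and "return" in source

def is_correct(source: str, func_name: str, cases) -> bool:
    """Functional check: run the text and compare against expected outputs."""
    namespace = {}
    try:
        exec(source, namespace)  # only acceptable here because the input is a toy string
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in cases)
    except Exception:
        # Invalid syntax, a missing function, or a runtime error all count as failure.
        return False

# Hypothetical model output: well formatted, but the logic is inverted.
candidate = '''
def is_even(n):
    return n % 2 == 1
'''

cases = [((2,), True), ((3,), False)]

print(looks_correct(candidate))                 # True
print(is_correct(candidate, "is_even", cases))  # False
```

A scorer that behaves like the first function would happily reward this output, while anyone who actually runs the code sees it fail immediately, which is roughly the gap I keep running into.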