Falsifiability in Evaluating Language Models: Insights from Recent Research

A recent paper seems to have found something interesting — here is the tl;dr:

Taking inspiration from those problems and aiming for even simpler settings, we arrived at a very simple problem template that can be easily solved using common sense reasoning but is not entirely straightforward, of the following form: “Alice has N brothers and she also has M sisters. How many sisters does Alice’s brother have?”. The problem – we call it here “AIW problem” – has a simple common sense solution which assumes all sisters and brothers in the problem setting share the same parents. The correct response C – number of sisters – is easily obtained by calculating M + 1 (Alice and her sisters), which immediately gives the number of sisters Alice’s brother has.


Initially, we assumed that the AIW problem will pose no challenge for most of the current state-of-the-art LLMs. In our initial experiments, we started by looking at the problem instance with N = 4 and M = 1, with the correct response being C = 2 sisters, and we noticed to our surprise that on the contrary, most state-of-the-art models struggle severely to provide correct responses when confronted with the problem. We also noticed already during early experimentation that even slight variations of numbers N and M may cause strong fluctuations of correct response rate.
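
To make the template concrete, here is a minimal sketch in Python of the problem instance and its common-sense answer; the function names are mine, not the paper’s:

```python
def aiw_prompt(n_brothers: int, m_sisters: int) -> str:
    """Render one instance of the paper's AIW problem template."""
    return (f"Alice has {n_brothers} brothers and she also has {m_sisters} sisters. "
            "How many sisters does Alice's brother have?")

def aiw_answer(m_sisters: int) -> int:
    """Correct response C: Alice's sisters plus Alice herself."""
    return m_sisters + 1

# The instance from the paper's initial experiments: N = 4, M = 1.
print(aiw_prompt(4, 1))
print(aiw_answer(1))  # 2
```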

This mirrors a research approach that we have seen in many cases with LLMs: find some kind of prompt that the model cannot handle, use that to make a claim about the overall capability of models, and then suggest better benchmarks. This is not a bad approach as such, as it ensures that we are not blinded by benchmarks, but it raises the question of how we evaluate single-prompt approaches overall.

If the claim is something like:

(i) There is a prompt P that model M does not handle well, hence model M is not as good as benchmarks B(1)…B(N) suggest.

We need to think about the validity of the claim as such. And if we demand more prompts, so that there is a class of prompts that a model does not handle well, we need to determine how large such a class needs to be to be considered significant. In a way these are just regular, basic theory-of-science questions: can a “counterprompt” be seen as a falsification of something, and if so, what exactly is it a falsification of? The authors refer to Popper in their conclusion:

Facing these initial findings, we would like to call upon scientific and technological ML community to work together on providing necessary updates of current LLM benchmarks that obviously fail to discover important weaknesses and differences between the studied models. Such updates might feature sets of problems similar to AIW studied here – simple, to probe specific kind of reasoning deficiency, but customizable, thus offering enough combinatorial variety to provide robustness against potential contamination via memorization. We think that strong, trustful benchmarks should follow Karl Popper’s principle of falsifiability [55] – not trying to confirm and highlight model’s capabilities, which is tempting especially in commercial setting, but in contrast do everything to break model’s function, highlighting its deficits, and thus showing possible ways for model improvement, which is the way of scientific method.
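
As a rough sketch of what such a customizable probe might look like, one can enumerate prompt/answer pairs over many (N, M) combinations and measure the correct-response rate across instances. This is my illustration, not the paper’s code, and the query_model argument stands in for whatever model-calling function one actually uses:

```python
import itertools

def aiw_variant(n: int, m: int) -> tuple[str, int]:
    """One AIW instance and its correct answer (M sisters plus Alice herself)."""
    prompt = (f"Alice has {n} brothers and she also has {m} sisters. "
              "How many sisters does Alice's brother have?")
    return prompt, m + 1

def correct_response_rate(query_model, n_range=range(1, 7), m_range=range(1, 7)) -> float:
    """Fraction of (N, M) variants a model answers correctly.

    `query_model` is any callable that takes a prompt string and returns the
    model's textual response; the check here is a deliberately crude substring
    match on the expected number.
    """
    results = []
    for n, m in itertools.product(n_range, m_range):
        prompt, answer = aiw_variant(n, m)
        results.append(str(answer) in query_model(prompt))
    return sum(results) / len(results)
```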

The claim (i) seems to suggest that a single prompt P, or class of prompts, can falsify all benchmarks in some dimension (or completely, if we take a strong stance). This seems a bit unreasonable, but still interesting to explore. What we seem to be lacking here is a theory of evidence / significance / falsification in relation to evaluations and benchmarks, and this in turn seems to say something about the complexity of assessing general capability in a model. A stronger version of claim (i) could be something more general like:

(ii) If there is a prompt P such that a model M fails at it AND this problem is easily solved by a human being, this means the model should be considered falsified as a whole.

This obviously seems too strong, but remains intriguing to think about. What is it that we are falsifying here? Is it the claim of generality? And to what degree is a benchmark, or the use of multiple benchmarks, a claim to generality? I think the authors are right to point out that using multiple benchmarks and tests risks being interpreted as suggesting there is a general capability G that we test alongside the specific capabilities S(1)…S(N) a benchmark focuses on. But this may not be true at all, and it raises the question of whether there is any benchmark or basket of tests that can test for G here.

At the heart of this lies the question of what it means to falsify a model, I think. This is where the paper’s reference to Popper and the scientific method feels like a tease, gesturing at but not exploring a full theory of model falsification, if indeed there is such a thing.

Old hat, in a sense, as it connects to the discussion of how you would test for AGI — but still. Interesting. And a worthwhile read. See Nezhurina, M., Cipolina-Kun, L., Cherti, M. and Jitsev, J., 2024. Alice in Wonderland: Simple Tasks Showing Complete Reasoning Breakdown in State-Of-the-Art Large Language Models. arXiv preprint arXiv:2406.02061.
