I have been increasingly surprised over the past year at what I perceive to be a lack of critical thinking in response to new announcements regarding AI model capabilities. I’ve written about many of these issues before, but as AI hype accelerates, the problems seem to be getting worse. Here I want to highlight a few points that seem to be neglected in coverage of the growth of AI capabilities.
Advertising is treated as fact
Every new model release by a major tech company represents billions of dollars of expenditure and the potential for billions of dollars of additional investment. Model releases and their associated documentation should therefore be interpreted as advertising designed primarily to attract customers and investors. Their goal is not to provide an objective and thorough analysis of the strengths and weaknesses of each new model. Of course, this does not mean that everything reported is false, but it does mean that such releases should be treated with significant skepticism. It is therefore disappointing to me to see write-ups such as this one by Rob Wiblin, which consists almost entirely of uncritical restatement of claims made by Anthropic, with little to no additional analysis.
For example, in discussing Claude Mythos’s ability to identify software vulnerabilities, Wiblin says:
“Anthropic’s previous model Opus 4.6 could only successfully convert a bug it identified in the browser Firefox into an effective way to accomplish something really bad 1% of the time. Mythos could do it 72% of the time.”
This statement is selective and misleading. What Anthropic actually did was provide their models with a testing harness which mimicked Firefox 147 but lacked critical defence components. They then prompted the models to devise and implement a certain type of exploit. Mythos fully accomplished this 72% of the time, but nearly always did so using two specific bugs that have since been fixed. When Anthropic removed these two bugs, Mythos fully succeeded only 4.4% of the time. (It was unclear to me whether Mythos had knowledge of these bugs from its training data, given that they had already been fixed.) While I do not doubt that Mythos has improved capabilities relative to previous models, the reporting here unjustifiably hypes the significance of the results without providing any substantive critical analysis. This is further highlighted by the fact that an independent analysis was able to find many of the same vulnerabilities using much smaller open-source models.
Wiblin also extrapolates well beyond what is even claimed by Anthropic, such as when he argues:
“Now, Anthropic doesn’t say this directly in their reports, but I think a common-sense interpretation of the above is that in any deployment where this AI has access to the kind of tools that would make it actually useful to people — the ability [to] access some parts of the network and execute code — [it] could probably break out of whatever software box we try to put it in, because the systems that we would be trying to restrain it [with] are themselves made of software, and that software is going to have vulnerabilities nobody knows about that this model is superhumanly good at finding and taking advantage of.”
In my view, it is reasonable to think that humans armed with improved automated techniques for identifying software vulnerabilities would be better, rather than worse, at constraining the behaviour of new models. This is in fact what Anthropic argues in their report. There may be differences of opinion about this, but this is an example of where hype seems to be substituting for genuine analysis.
Wiblin also comments on the fact that Anthropic has yet to release Mythos publicly:
“And also keep in mind that on Monday — the day before Anthropic published all of this — we learned that their annualised revenue run rate had grown from $9 billion at the end of December to $30 billion just three months later…
That exploding revenue is a pretty good proxy for how much more useful the previous release, Opus 4.6, has become for real-world tasks. If the past relationship between capability measures and usefulness continues to hold, the economic impact of Mythos once it becomes available is going to dwarf everything that came before it — which is part of why Anthropic’s decision not to release it is a serious one, and actually quite a costly one for them.
They’re sitting on something that would likely push their revenue run rate into the hundreds of billions, but they’ve decided it’s simply not worth the risk.”
Wiblin does not consider the possibility that Anthropic is publishing these claims without the corresponding model in order to build hype for a model that is not actually ready for release, especially in the lead-up to Anthropic’s upcoming IPO. Wiblin does not explain where his estimate of ‘hundreds of billions’ of dollars of revenue comes from, but it reads to me like pure marketing for potential investors. Nor does it make sense to treat revenue as a measure of economic value when Anthropic, OpenAI, and others are massively subsidising usage. There is a discussion to be had about the implications of these issues, but it is not to be found in this piece, or in similar ones I’ve seen on the subject. We need to do better than uncritically repeating the advertising talking points of billion-dollar tech companies.
Benchmarks are interpreted uncritically
Much of the claimed improvement in model performance derives from rapidly increasing scores on various benchmarks, which are standardised tests designed to quantify model capabilities on tasks such as language, coding, reasoning, and image recognition. While these scores give the appearance of precise and objective measurement, in practice they often have very limited value in meaningfully assessing the rate of capability improvement.
First, most benchmarks have not been validated. Validity is an important concept in research generally, and especially in human psychometrics: it refers to the extent to which a metric has been shown to adequately measure the underlying phenomenon of interest. There are many components of validity, and validity assessments require careful research into the relationship between test performance and the target phenomenon. However, few AI benchmarks report this sort of research. Most simply assemble tasks the researchers hope are related to the target capability. This is poor research practice: whether a given set of tasks provides reliable and valid information about the capability of interest cannot be determined by intuition, but requires carefully designed research.
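To make this concrete, here is a minimal sketch (in Python, with entirely made-up numbers) of the kind of criterion-validity check that is standard in psychometrics but rarely reported for AI benchmarks: correlating benchmark scores for a set of models against an independent measure of the real-world capability the benchmark is supposed to track.

```python
# Hypothetical example: does a coding benchmark actually predict
# performance on independently assessed real-world coding tasks?
import numpy as np
from scipy.stats import spearmanr

# Benchmark scores for a set of models (made-up numbers).
benchmark_scores = np.array([0.42, 0.55, 0.61, 0.70, 0.78, 0.83])

# Independent criterion, e.g. the fraction of real tickets resolved
# to a human reviewer's satisfaction (also made up).
real_world_success = np.array([0.30, 0.33, 0.52, 0.49, 0.58, 0.60])

rho, p_value = spearmanr(benchmark_scores, real_world_success)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
# A high benchmark score is only evidence of the underlying capability
# to the extent that a relationship like this has actually been measured.
```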
Second, almost as soon as a benchmark is released, its solutions begin to contaminate the training data of new models. For instance, memorisation is known to be a major problem for SWE-Bench, a widely used benchmark of software engineering tasks. A recent analysis of visual benchmarks found that models could outperform humans on a standard X-ray question-answering benchmark without being provided with any images at all. A particularly concerning analysis found it was possible to achieve 100% on several major benchmarks without solving a single task, usually by exploiting simple vulnerabilities in the test pipeline or in the way scores are computed.
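To illustrate how basic such checks can be, here is a rough sketch (hypothetical data, simple word n-gram overlap) of the kind of contamination screen one could run between benchmark items and a training corpus. Real contamination analyses are more sophisticated than this, but even crude screens of this sort are rarely reported.

```python
# Crude contamination screen: flag benchmark items whose text shares
# long word n-grams with documents in the training corpus.
def ngrams(text: str, n: int = 8) -> set:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_fraction(item: str, corpus_docs: list, n: int = 8) -> float:
    item_grams = ngrams(item, n)
    if not item_grams:
        return 0.0
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)

# Hypothetical usage: flag items with substantial verbatim overlap.
training_docs = ["placeholder training document text"]   # stand-in corpus
benchmark_items = ["placeholder benchmark question text"]  # stand-in items
flagged = [q for q in benchmark_items
           if overlap_fraction(q, training_docs) > 0.2]
```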
Third, even when the test solutions are not publicly available, the training data often sufficiently resemble the test data that a model trained on the former will show dramatically improved performance on the test questions as well. This would not be a problem if the training and test problems constituted a representative sample of the domain of interest, but for many important domains (language, reasoning, coding, image recognition) the space of problems is so vast and hard to characterise that a representative sample cannot be constructed in this way. Sampling also tends to favour more common and simpler problems, and even very subtle changes in the sampling method can lead to the model learning radically different representations. As a result, models tend to overfit to the training data, which diminishes the value of the benchmarks for assessing out-of-distribution generalisation.
The issue of benchmark contamination is granted only a few pages of the 244 in the Mythos model card. Only a few benchmarks are assessed for contamination, with Anthropic arguing that most of the improvement on these cannot be attributed to memorisation. However, their own results show that model performance degrades significantly when scoring is restricted to the 20% of benchmark questions they assess as having the lowest probability of memorisation. This was true even for the SWE-Bench Pro benchmark, which is supposedly ‘a contamination-resistant testbed’. This highlights how much more attention these issues deserve if benchmark improvements are to be interpreted meaningfully.
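The kind of check Anthropic describes can be sketched simply: score the model on the full benchmark, then again on the subset of items judged least likely to have been memorised, and compare. A hypothetical, simulated version:

```python
# Simulated sketch: compare accuracy on the full benchmark against the
# 20% of items with the lowest estimated probability of memorisation.
import numpy as np

rng = np.random.default_rng(0)
n_items = 500
memorisation_prob = rng.uniform(0, 1, n_items)   # stand-in estimates
# Simulate a model whose accuracy rises with memorisation probability.
correct = rng.uniform(0, 1, n_items) < (0.4 + 0.5 * memorisation_prob)

overall_acc = correct.mean()
low_mem = memorisation_prob <= np.quantile(memorisation_prob, 0.2)
low_mem_acc = correct[low_mem].mean()

print(f"Accuracy on all items:           {overall_acc:.2f}")
print(f"Accuracy on least-memorised 20%: {low_mem_acc:.2f}")
# A large gap suggests much of the headline score reflects memorisation
# rather than the capability the benchmark is meant to measure.
```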
Negative results are ignored
I rarely see discussion in EA circles of the various results indicating fundamental limitations of existing LLM-based approaches. Numerous studies have found that these models often fail to learn the appropriate task structure, and instead answer questions via spurious correlations and superficial heuristics that work in some constrained domain or training task but do not generalise to variations of the task. There are also significant known limitations of the chain-of-thought approach that underpins reasoning models, with thought chains often being unfaithful to the actual computations that generate the model’s predictions. In my view, there is reason to believe these problems reflect fundamental limitations of the machine learning techniques that underpin leading models.
As a further example, Claude Opus 4.7 shows a significant regression on long-context tasks as measured by the MRCR benchmark, which, notably, is precisely a benchmark that uses adversarial methods to distract the model from the task. Anthropic’s response is:
“We kept MRCR in the system card for scientific honesty, but we’ve actually been phasing it out slowly. It’s built around stacking distractors to trick the model, which isn’t how people actually use long context.”
In my view, this response indicates that Anthropic is more interested in ensuring their model works in typical use cases than in assessing whether it actually has robust, generalisable capabilities indicative of what we might call ‘genuine intelligence’. This is particularly relevant for arguments that rely on extrapolating improvements in model capabilities to novel tasks and more complex settings.
Conclusions
There is no doubt that LLM-based models have shown significant improvements in recent years. However, it is important to assess these advances carefully and critically in order to make accurate inferences about their social, political, and economic impacts. One cannot infer AI 2027-like superintelligence takeover scenarios from recent trends and developments without making significant additional assumptions about the nature of generalised intelligence, the relevance of benchmark results, and the limitations of LLM-based models. Humans have a very bad track record of predicting which tasks require ‘general intelligence’ to accomplish, and I suspect it may be possible to develop machine learning models that can automatically perform any task with known solutions without this implying any superintelligence takeoff. These issues are complex and demand a more nuanced, informed consideration than I often see in contemporary discussions.