Originally posted on VentureBeat.
Researchers at Stanford University and the University of California, Berkeley have published an unreviewed paper on the open-access preprint server arXiv.org, which found that the “performance and behavior” of OpenAI’s ChatGPT large language models (LLMs) changed between March and June 2023. The researchers concluded that their tests revealed “performance on some tasks have gotten substantially worse over time.”
“The whole motivation for this research: We’ve seen a lot of anecdotal experiences from users of ChatGPT that the models’ behavior is changing over time,” James Zou, a Stanford professor and one of the three authors of the research paper, told VentureBeat. “Some tasks may be getting better or other tasks getting worse. This is why we wanted to do this more systematically to evaluate it across different time points.”
There are some important caveats to the findings and the paper. arXiv.org accepts nearly all user-generated papers that comply with its guidelines, and this particular paper, like many on the site, has not yet been peer-reviewed or published in a scientific journal. However, Zou told VentureBeat that the authors do plan to submit it for consideration and review by a journal.
In a tweet in response to the paper and the ensuing discussions, Logan Kilpatrick, OpenAI developer advocate, offered a general thanks to those reporting their experiences with the LLM platform and said they’re actively looking into the issues being shared. Kilpatrick also posted a link to OpenAI’s Evals framework GitHub page which is used to evaluate LLMs and LLM systems with an open-source registry of benchmarks.
VentureBeat has reached out to OpenAI for further comment but did not hear back in time for publication.
Several LLM tasks put to the test over time
Measuring both GPT-3.5 and GPT-4 across a range of different requests, the research team found that the OpenAI LLMs became worse at identifying prime numbers and at showing their “step by step” thought process, and that the code they generated contained more formatting errors.
Accuracy on “step-by-step” prime number identification fell by a dramatic 95.2 percentage points for GPT-4 over the three-month period evaluated, while it rose by 79.4 percentage points for GPT-3.5. Another task, which asked the models to find sums of a range of integers with a qualifier, also saw degraded performance in both GPT-4 and GPT-3.5, down 42% and 20%, respectively.
“GPT-4’s success rate on ‘Is this number prime? Think step by step’ fell from 97.6% to 2.4% from March to June, while GPT-3.5 improved,” tweeted co-author Matei Zaharia. “Behavior on sensitive inputs also changed. Other tasks changed less, but there are definitely significant changes in LLM behavior.”
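The size of that GPT-4 drop can be stated in two different ways, and the distinction is easy to blur. The following is a minimal sketch, not taken from the paper, that runs the two quoted success rates (97.6% in March, 2.4% in June) through both calculations. The function names are illustrative assumptions.

```python
def percentage_point_change(before: float, after: float) -> float:
    """Absolute difference between two percentages, in percentage points."""
    return round(after - before, 1)


def relative_change(before: float, after: float) -> float:
    """Change expressed as a percentage of the starting value."""
    return round((after - before) / before * 100, 1)


# Success rates quoted for GPT-4 on the prime number prompt.
march, june = 97.6, 2.4

print(percentage_point_change(march, june))  # -95.2
print(relative_change(march, june))          # -97.5
```

The 95.2 figure is therefore a drop in percentage points. Measured against the March baseline, the relative decline is slightly larger, at about 97.5%.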
However, in a change that the company is likely to see as an improvement, though it may frustrate some users, GPT-4 was more resistant in June than in March to jailbreaking, the circumvention of content protection boundaries through specific prompts.
The two LLMs did see small improvements on visual reasoning, according to the paper.
Pushback on the findings and methodology
Not everyone was convinced that the tasks Zaharia’s team selected, or the metrics applied to them, measured changes meaningful enough to declare the service “substantially worse.”
Arvind Narayanan, a computer science professor and director of the Princeton University Center for Information Technology Policy, tweeted: “We dug into a paper that’s been misinterpreted as saying GPT-4 has gotten worse. The paper shows behavior change, not capability decrease. And there’s a problem with the evaluation — on 1 task, we think the authors mistook mimicry for reasoning.”
Commenters on the ChatGPT subreddit and YCombinator similarly took issue with the thresholds the researchers considered failing, but other longtime users seemed to be comforted by evidence that perceived changes in the generative AI output weren’t merely in their heads.
This work brings to light a new issue that business and enterprise operators need to be aware of when considering generative AI products. The researchers have dubbed the change in behavior “LLM drift” and say that understanding it is critical to interpreting results from popular chat AI models.
More transparency and vigilance would help improve understanding of changes
The paper notes how little public visibility there is into closed LLMs and how they evolve over time. The researchers say that improving monitoring and transparency is key to avoiding the pitfalls of LLM drift.
“We don’t get a lot of information from OpenAI, or from other vendors and startups, on how their models are being updated,” said Zou. “It highlights the need to do these kinds of continuous external assessments and monitoring of LLMs. We definitely plan to continue to do this.”
In a previous tweet, Kilpatrick stated that the GPT APIs don’t change without OpenAI notifying its users.
Businesses incorporating LLMs in their products and internal capabilities will need to be vigilant to address the effects of LLM drift. “Because if you’re relying on the output of these models in some sort of software stack or workflow, the model suddenly changes behavior, and you don’t know what’s going on, this can actually break your entire stack, can break the pipeline,” said Zou.
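For teams that want to watch for this kind of change, one possible approach is to rerun a fixed set of prompts on a schedule and compare accuracy against an earlier baseline. The following is a minimal sketch under stated assumptions, not the researchers’ method: the `model` argument is a hypothetical stand-in for whatever call a team makes to its LLM provider, the two stub models exist only to make the example runnable, and the 0.05 tolerance is an arbitrary illustrative threshold.

```python
from typing import Callable, List, Tuple

Benchmark = List[Tuple[str, str]]  # (prompt, expected answer) pairs


def is_prime(n: int) -> bool:
    """Ground truth for the benchmark, computed by trial division."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def build_benchmark(numbers: List[int]) -> Benchmark:
    """Create a fixed set of yes/no prompts with known answers."""
    return [
        (f"Is {n} a prime number? Answer yes or no.", "yes" if is_prime(n) else "no")
        for n in numbers
    ]


def accuracy(model: Callable[[str], str], benchmark: Benchmark) -> float:
    """Fraction of prompts whose reply starts with the expected answer."""
    correct = sum(
        1
        for prompt, expected in benchmark
        if model(prompt).strip().lower().startswith(expected)
    )
    return correct / len(benchmark)


def drift_detected(baseline: float, current: float, tolerance: float = 0.05) -> bool:
    """Flag a drop in accuracy larger than the (assumed) tolerance."""
    return baseline - current > tolerance


# Hypothetical stand-ins for two snapshots of a model.
def oracle_model(prompt: str) -> str:
    n = int(prompt.split()[1])  # the number is the second word of the prompt
    return "Yes" if is_prime(n) else "No"


def always_yes_model(prompt: str) -> str:
    return "Yes"


benchmark = build_benchmark([7, 8, 9, 11, 12, 13])
baseline = accuracy(oracle_model, benchmark)     # 1.0
current = accuracy(always_yes_model, benchmark)  # 0.5
print(drift_detected(baseline, current))         # True
```

Mixing prime and composite numbers in the benchmark means a model that gives the same answer to every prompt scores only 0.5 here, so a constant reply cannot pass as correct reasoning.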