Aug 18, 2023
4 min read
It's like when your trusty old car starts making a new sound: familiar, yet undeniably changed. A group of researchers from Stanford University and the University of California-Berkeley took a close look under the hood of OpenAI's ChatGPT, a tool many have come to rely on daily. They discovered subtle yet significant changes in the model's behavior over just a few months.
The study, posted to the open-access repository arXiv.org, documents changes in the "performance and behavior" of OpenAI's ChatGPT large language models (LLMs) between March and June 2023. Based on their analysis, the researchers concluded that the model's "performance on some tasks has gotten substantially worse over time."
VentureBeat spoke with James Zou, a Stanford University professor and one of the paper's three authors:
"The whole motivation for this research: We've seen a lot of anecdotal experiences from users of ChatGPT that the models' behavior is changing over time. Some tasks may be getting better or other tasks getting worse. This is why we wanted to do this more systematically to evaluate it across different time points."
It's worth noting that arXiv.org primarily accepts user-submitted manuscripts that meet its criteria; this particular paper has not been peer-reviewed or accepted by any recognized scientific journal.
Addressing the paper and the ensuing discussion, Logan Kilpatrick, OpenAI's developer advocate, thanked the community on Twitter for its feedback on the LLM platform.
Kilpatrick confirmed that OpenAI is actively looking into the reported issues, and he shared a link to OpenAI's Evals framework on GitHub, a tool for assessing LLMs against an open-source registry of benchmarks.
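For context, the Evals framework describes each benchmark as a JSONL dataset of samples paired with ideal answers. As a rough illustration (the file name and sample contents below are invented for the example; the `input` and `ideal` fields follow the format documented in the openai/evals repository), a tiny dataset for a basic exact-match eval could be generated like this:

```python
import json

# Hypothetical samples in the openai/evals JSONL format: each line
# pairs chat-style input messages with the ideal answer the model
# should produce. 17077 is prime; 17078 is even.
samples = [
    {
        "input": [
            {"role": "system", "content": "Answer with Yes or No only."},
            {"role": "user", "content": "Is 17077 a prime number?"},
        ],
        "ideal": "Yes",
    },
    {
        "input": [
            {"role": "system", "content": "Answer with Yes or No only."},
            {"role": "user", "content": "Is 17078 a prime number?"},
        ],
        "ideal": "No",
    },
]

# The Evals registry expects one JSON object per line.
with open("primes_samples.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")
```

Once added to the framework's benchmark registry, a dataset like this can be run against any model with the project's command-line runner.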
The researchers rigorously evaluated both GPT-3.5 and GPT-4 on a diverse set of tasks. They found that the OpenAI LLMs had grown worse at identifying prime numbers and at explaining their reasoning "step by step," and that generated code contained more formatting errors.
For instance, GPT-4's accuracy at identifying prime numbers with "step-by-step" reasoning plunged by a staggering 95.2 percentage points over the three months surveyed, while GPT-3.5 improved by 79.4 percentage points on the same task. And when the models were asked to calculate the sums of series of integers under specified conditions, both deteriorated: GPT-4 by 42 percentage points and GPT-3.5 by 20.
Detailing these findings, co-author Matei Zaharia tweeted:
"GPT -4's success rate on 'Is this number prime? Think step by step' fell from 97.6% to 2.4% from March to June, while GPT-3.5 improved. Behavior on sensitive inputs also changed. Other tasks changed less, but there are definitely significant changes in LLM behavior."
On the brighter side, GPT-4 proved more resistant in June than in March to "jailbreaking," that is, to attempts to bypass its content safeguards. This is arguably a step forward for OpenAI, though it may frustrate some users.
The paper also notes minor improvements in the models' visual reasoning capabilities.
The research from Zaharia's team did not escape criticism: some academics and members of the public questioned the choice of tasks and the metrics used to quantify significant changes in service quality.
Arvind Narayanan, a computer science professor at Princeton University and director of its Center for Information Technology Policy, offered a skeptical take on Twitter:
"We dug into a paper that's been misinterpreted as saying GPT-4 has worsened. The paper shows behavior change, not capability decrease. And there's a problem with the evaluation — on one task, we think the authors mistook mimicry for reasoning."
https://twitter.com/random_walker/status/1681748271163912194
Spirited debates broke out on forums such as the ChatGPT subreddit and Y Combinator's Hacker News. Some commenters questioned the benchmarks the researchers counted as failures, while others, especially longtime users, felt vindicated: to them, the study confirmed that the changes they had observed in generative AI outputs were not merely cognitive bias.
The study highlights how little the public understands about how closed LLMs work and evolve, and it makes the case for a more transparent approach. To head off problems caused by "LLM drift," the authors stress the need for continuous monitoring and greater transparency.
"We don't get a lot of information from OpenAI — or from other vendors and startups — on how their models are being updated. It highlights the need to do these kinds of continuous external assessments and monitoring of LLMs. We definitely plan to continue to do this," stated Zou.
Kilpatrick, for his part, asserted in an earlier tweet that OpenAI notifies users of any changes to the GPT APIs.
https://twitter.com/OfficialLoganK/status/1663934947931897857
For enterprises integrating LLMs into their product lines or internal infrastructure, awareness and responsiveness to "LLM drift" become paramount. Zou elaborates: "Because if you're relying on the output of these models in some sort of software stack or workflow, the model suddenly changes behavior, and you don't know what's going on, this can actually break your entire stack, can break the pipeline."
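One practical mitigation, not prescribed by the paper but in line with the authors' call for continuous monitoring, is to pin a dated model snapshot and keep a small "golden" test set in the deployment pipeline so that drift fails loudly instead of silently. A minimal sketch, again assuming the OpenAI Python SDK; the helper name, golden-set format, and threshold are illustrative:

```python
from openai import OpenAI

# Pin a dated snapshot rather than a floating alias such as "gpt-4";
# floating aliases can change behavior when the vendor updates them.
MODEL = "gpt-4-0613"

def assert_no_drift(client: OpenAI, golden_cases, threshold: float = 0.95) -> None:
    """Raise if the pinned model's pass rate on a golden set drops.

    golden_cases: list of (prompt, expected_substring) pairs maintained
    alongside the application code. The 0.95 threshold is an assumed
    tolerance, not a recommendation from the study.
    """
    passed = 0
    for prompt, expected in golden_cases:
        resp = client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        if expected in (resp.choices[0].message.content or ""):
            passed += 1
    score = passed / len(golden_cases)
    if score < threshold:
        raise RuntimeError(
            f"LLM drift suspected: golden-set pass rate {score:.0%} "
            f"is below {threshold:.0%} for {MODEL}"
        )
```

Run from a scheduled job or CI, a guard like this turns a silent behavior change into an actionable alert before it breaks the rest of the stack.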
Consistent performance is more than a luxury; it's a lifeline. The revelations about ChatGPT's shifting behavior spotlight areas where OpenAI can refine its offerings and emphasize the dynamic nature of artificial intelligence. This journey of discovery reminds us that as technology evolves, our role in understanding, monitoring, and shaping it becomes ever more crucial.