Study: Is GPT-4 Getting Worse Over Time?

The advancements in artificial intelligence have changed various industries, making tasks more efficient and accessible. However, recent findings suggest that the performance of the highly acclaimed GPT-4 model may not be as reliable as previously believed.

GPT-4 Case Study: The Performance Of ChatGPT-4 And GPT 4 Is Getting Worse Over Time

In this article, we explore a study that sheds light on GPT-4’s performance degradation and the potential implications it could have on AI-driven applications.

A Shift in GPT-4’s Performance

Over time, many users have reported a noticeable decline in the quality of responses generated by GPT-4. Until now, these concerns were largely anecdotal. However, a recent study has provided objective evidence of this decline, comparing the performance of GPT-4 in March and June versions.

A Prime Example of Degradation

To evaluate the model’s performance, a team used a dataset of 500 problems, where GPT-4 had to determine whether a given integer was prime.

In the March version, GPT-4 correctly answered 488 out of 500 questions, showcasing an impressive 97.6% success rate. However, the June version’s performance plummeted, with only 12 correct answers out of 500, resulting in a mere 2.4% success rate.

Chain-of-Thought Failure

Another technique used to enhance model responses is Chain-of-Thought, which helps the model reason step by step. Unfortunately, the latest version of GPT-4 failed to generate intermediate steps, leading to incorrect responses like a simple “No” when asked about the primality of a number.

Code Generation Woes

For tasks involving code generation, GPT-4’s performance also showed a significant drop. A dataset comprising 50 easy problems from LeetCode was used to evaluate the model. The March version achieved a satisfactory 52% success rate in providing answers without any changes, while the June version’s success rate plummeted to a paltry 10%.

Possible Reasons Behind the Decline

The exact reasons for GPT-4’s declining performance remain unclear, but it is speculated that the quality decreases due to the safety alignment and fine-tuning. Take a look at the below-given video (more specifically, “the strange case of the unicorn” segment), where it’s shown how GPT-4’s quality decreases after fine-tuning it for safety.

The other rumors suggest that OpenAI might be using smaller and specialized GPT-4 models to handle user queries more efficiently, deciding which model to send the query to based on cost and speed considerations.

Implications for AI Applications

The deterioration in GPT-4’s performance is concerning for developers and businesses relying on this technology to power their applications. AI-driven systems that exhibit unpredictable behavior over time can lead to significant consequences in various industries, including healthcare, finance, and customer service.

Related Stories: