GPT-4 vs. ChatGPT: An Exploration of Training, Performance, Capabilities, and Limitations

GPT-4 is an improvement, but temper your expectations.

Image created by the author.

OpenAI stunned the world when it released ChatGPT in late 2022. The new generative language model is expected to transform entire industries, including media, education, law, and tech. In short, ChatGPT threatens to disrupt just about everything. And before we had time to truly envision a post-ChatGPT world, OpenAI released GPT-4.

In recent months, groundbreaking large language models have been released at an astonishing pace. If you still don’t understand how ChatGPT differs from GPT-3, let alone GPT-4, I don’t blame you.

In this article, we will cover the key similarities and differences between ChatGPT and GPT-4, including their training methods, performance and capabilities, and limitations.

ChatGPT vs. GPT-4: Similarities & differences in training methods

ChatGPT and GPT-4 both stand on the shoulders of giants, building on previous versions of GPT models while adding improvements to model architecture, employing more sophisticated training methods, and increasing the number of model parameters.

Both models are based on the transformer architecture. The original transformer uses an encoder to process input sequences and a decoder to generate output sequences, but the published GPT models use a decoder-only variant. Its attention mechanism allows the model to pay more attention to the most meaningful parts of the preceding text when generating each new token.
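
As a rough illustration of the attention mechanism described above, the following pure-Python sketch computes scaled dot-product attention over a toy sequence. The function names, the two-dimensional toy vectors, and the `causal` flag are illustrative assumptions and are not taken from OpenAI’s models. The optional causal mask reflects how GPT-style models let each position attend only to itself and earlier positions.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values, causal=True):
    """Scaled dot-product attention over lists of equal-length vectors.

    With causal=True, position i only sees positions 0..i.
    Returns (outputs, weights), one entry per query position.
    """
    d = len(keys[0])
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        limit = i + 1 if causal else len(keys)
        # Similarity of this query to every visible key, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in keys[:limit]]
        weights = softmax(scores)
        # Output is the weighted average of the visible value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values[:limit]))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical toy "token" vectors, used as queries, keys, and values alike.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(tokens, tokens, tokens)
```

Each row of `weights` sums to 1, and the last token places the most weight on itself because its query matches its own key best.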

OpenAI’s GPT-4 Technical Report offers little information on GPT-4’s model architecture and training process, citing the “competitive landscape and the safety implications of large-scale models.” What we can infer is that ChatGPT and GPT-4 were probably trained in a similar manner, one that departs from the training methods used for GPT-2 and GPT-3. We know much more about the training methods for ChatGPT than for GPT-4, so we’ll start there.

ChatGPT

To start with, ChatGPT is trained on dialogue datasets, including demonstration data, in which human annotators provide demonstrations of the expected output of a chatbot assistant in response to specific prompts. This data is used to fine-tune GPT-3.5 with supervised learning, producing a policy model, which is used to generate multiple responses when fed prompts. Human annotators then rank the responses for a given prompt from best to worst, and these rankings are used to train a reward model. The reward model is then used to iteratively fine-tune the policy model using reinforcement learning.

Image created by the author.
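
The ranking step can be sketched in code. The example below is a minimal sketch, assuming the pairwise ranking loss commonly used for reward models, in which the model is pushed to score the human-preferred response above the rejected one. The linear reward model, the two hand-made features ("helpfulness" and "rudeness"), and the training pairs are all hypothetical. A real reward model is a large neural network that scores raw text.

```python
import math

def reward(weights, features):
    """Hypothetical linear reward model: a weighted sum of response features."""
    return sum(w * x for w, x in zip(weights, features))

def pairwise_loss(weights, preferred, rejected):
    """-log(sigmoid(r_preferred - r_rejected)); small when the ranking is respected."""
    margin = reward(weights, preferred) - reward(weights, rejected)
    return math.log(1.0 + math.exp(-margin))

def train_reward_model(pairs, n_features, lr=0.5, epochs=200):
    """Fit the weights by gradient descent on the pairwise ranking loss."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for preferred, rejected in pairs:
            margin = reward(weights, preferred) - reward(weights, rejected)
            # Derivative of the loss with respect to the margin: -sigmoid(-margin).
            grad_scale = -1.0 / (1.0 + math.exp(margin))
            for j in range(n_features):
                weights[j] -= lr * grad_scale * (preferred[j] - rejected[j])
    return weights

# Each pair is (features of preferred response, features of rejected response).
# Assumed feature order: [helpfulness, rudeness].
pairs = [
    ([0.9, 0.1], [0.2, 0.8]),
    ([0.7, 0.0], [0.6, 0.9]),
    ([0.8, 0.2], [0.3, 0.3]),
]
weights = train_reward_model(pairs, n_features=2)
initial_loss = sum(pairwise_loss([0.0, 0.0], p, r) for p, r in pairs)
final_loss = sum(pairwise_loss(weights, p, r) for p, r in pairs)
```

After training, the learned weights reward helpfulness and penalize rudeness, so every preferred response scores higher than its rejected counterpart and `final_loss` falls below `initial_loss`.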

To sum it up in one sentence, ChatGPT is trained using Reinforcement Learning from Human Feedback (RLHF), a way of incorporating human feedback to improve a language model during training. This allows the model’s output to align with the task requested by the user, rather than just predicting the next word in a sentence based on a corpus of generic training data, as GPT-3 does.
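
To give a feel for the reinforcement learning step, here is a deliberately simplified sketch: a softmax policy over three canned responses is nudged toward the responses that a reward model scores highly. It uses exact policy-gradient ascent on the expected reward, which is a stand-in chosen for clarity. Production RLHF systems typically sample responses from the language model and use algorithms such as PPO. The function names and reward scores are hypothetical.

```python
import math

def softmax(logits):
    """Convert logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fine_tune_policy(reward_scores, steps=200, lr=1.0):
    """Gradient ascent on expected reward for a softmax policy.

    The policy holds one logit per candidate response. The gradient of the
    expected reward with respect to logit k is p_k * (r_k - expected_reward).
    """
    logits = [0.0] * len(reward_scores)
    for _ in range(steps):
        probs = softmax(logits)
        expected = sum(p * r for p, r in zip(probs, reward_scores))
        logits = [l + lr * p * (r - expected)
                  for l, p, r in zip(logits, probs, reward_scores)]
    return softmax(logits)

# Hypothetical reward-model scores for three candidate responses to one prompt.
reward_scores = [0.1, 0.9, 0.4]
policy = fine_tune_policy(reward_scores)
```

The policy starts out uniform and ends up putting most of its probability on the second response, the one the reward model prefers.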

GPT-4

OpenAI has yet to divulge details on how it trained GPT-4. Their Technical Report doesn’t include “details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar.” What we do know is that GPT-4 is a transformer-style generative multimodal model trained on both publicly available data and licensed third-party data and subsequently fine-tuned using RLHF. Interestingly, OpenAI did share details regarding their upgraded RLHF techniques to make the model responses more accurate and less likely to veer outside safety guardrails.

After training a policy model (as with ChatGPT), RLHF is used in adversarial training, a process that exposes a model to malicious examples designed to deceive it so that it can defend against such examples in the future. In the case of GPT-4, human domain experts across several fields rate the responses of the policy model to adversarial prompts. These ratings are then used to train additional reward models that iteratively fine-tune the policy model, resulting in a model that’s less likely to give out dangerous, evasive, or inaccurate responses.

Image created by the author.

ChatGPT vs. GPT-4: Similarities & differences in performance and capabilities

Capabilities

In terms of capabilities, ChatGPT and GPT-4 are more similar than they are different. Like its predecessor, GPT-4 interacts in a conversational style that aims to align with the user. As you can see below, the two models’ responses to a broad question are very similar.

Image created by the author.

OpenAI agrees that the distinction between the models can be subtle and claims that the “difference comes out when the complexity of the task reaches a sufficient threshold.” Given the six months OpenAI reports spending on adversarial testing and alignment of GPT-4 after pre-training, this is probably an accurate characterization.

Unlike ChatGPT, which accepts only text, GPT-4 accepts prompts composed of both images and text, returning textual responses. Unfortunately, as of this article’s publication, image input is not yet available to the public.

Performance

As referenced earlier, OpenAI reports significant improvement in safety performance for GPT-4, compared to GPT-3.5 (from which ChatGPT was fine-tuned). However, whether the reduction in responses to requests for disallowed content, reduction in toxic content generation, and improved responses to sensitive topics are due to the GPT-4 model itself or the additional adversarial testing is unclear at this time.

Additionally, GPT-4 outperforms GPT-3.5 on most academic and professional exams taken by humans. Notably, GPT-4 scores around the 90th percentile on the Uniform Bar Exam, whereas GPT-3.5 scores around the 10th percentile. On traditional language model benchmarks, GPT-4 also significantly outperforms its predecessor and beats other state-of-the-art (SOTA) models (although sometimes just barely).

ChatGPT vs. GPT-4: Similarities & differences in limitations

Both ChatGPT and GPT-4 have significant limitations and risks. The GPT-4 System Card includes insights from a detailed exploration of such risks conducted by OpenAI.

These are just a few of the risks associated with both models:

Hallucination (the tendency to produce nonsensical or factually inaccurate content)
Producing harmful content that violates OpenAI’s policies (e.g. hate speech, incitements to violence)
Amplifying and perpetuating stereotypes of marginalized people
Generating realistic disinformation intended to deceive

Although ChatGPT and GPT-4 struggle with the same limitations and risks, OpenAI has made special efforts, including extensive adversarial testing, to mitigate them for GPT-4. This is encouraging, but the GPT-4 System Card ultimately demonstrates how vulnerable ChatGPT was (and possibly still is). For a more detailed explanation of harmful unintended consequences, I recommend reading the GPT-4 System Card, which starts on page 38 of the GPT-4 Technical Report.

Conclusion

In this article, we reviewed the most important similarities and differences between ChatGPT and GPT-4, including their training methods, performance and capabilities, and limitations and risks.

While we know much less about the model architecture and training methods behind GPT-4, it appears to be a refined version of ChatGPT that now accepts image and text inputs and that OpenAI claims is safer, more accurate, and more creative. Unfortunately, we will have to take OpenAI’s word for it, as GPT-4 is only available as part of the ChatGPT Plus subscription.

The table below illustrates the most important similarities and differences between ChatGPT and GPT-4:

Image created by the author.

The race to create the most accurate and dynamic large language models has reached breakneck speed, with ChatGPT and GPT-4 released within mere months of each other. Staying informed on the advancements, risks, and limitations of these models is essential as we navigate this exciting and rapidly evolving landscape.
