On Tuesday, OpenAI announced GPT-4, a large multimodal model that can accept text and image inputs while returning text output that “exhibits human-level performance on various professional and academic benchmarks,” according to OpenAI. Also on Tuesday, Microsoft announced that Bing Chat has been running on GPT-4 all along.
If it performs as claimed, GPT-4 potentially represents the opening of a new era in artificial intelligence. “It passes a simulated bar exam with a score around the top 10% of test takers,” writes OpenAI in its announcement. “In contrast, GPT-3.5’s score was around the bottom 10%.”
OpenAI plans to release GPT-4’s text capability through ChatGPT and its commercial API, but with a waitlist at first. GPT-4 is currently available to subscribers of ChatGPT Plus. Also, the firm is testing GPT-4’s image input capability with a single partner, Be My Eyes, an upcoming smartphone app that can recognize a scene and describe it.
GPT stands for “generative pre-trained transformer,” and GPT-4 is part of a series of foundational language models extending back to the original GPT in 2018. Following the original release, OpenAI announced GPT-2 in 2019 and GPT-3 in 2020. A further refinement called GPT-3.5 arrived in 2022. In November 2022, OpenAI released ChatGPT, which at that time was a fine-tuned conversational model based on GPT-3.5.
AI models in the GPT series have been trained to predict the next token (a fragment of a word) in a sequence of tokens using a large body of text pulled largely from the Internet. During training, the neural network builds a statistical model that represents relationships between words and concepts. Over time, OpenAI has increased the size and complexity of each GPT model, and each generation has generally come closer to how a human would complete text in the same scenario, although the improvement varies by task.
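To make the idea of next-token prediction concrete, here is a deliberately tiny sketch. It is not how GPT models work internally (they use transformer neural networks, not lookup tables); it swaps in a simple bigram counter that only shows what "predict the next token from statistics of the training text" means. The corpus, the whitespace tokenization, and the function names `train_bigram` and `predict_next` are all hypothetical, chosen for illustration.

```python
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count, for each token, which tokens follow it in the training text."""
    counts = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        counts[current][following] += 1
    return counts


def predict_next(counts, token):
    """Return the most frequent follower of `token`, or None if it was never followed."""
    followers = counts.get(token)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Toy corpus (an assumption for illustration); whole words stand in for tokens.
corpus = "the cat sat on the mat and the cat slept".split()
model = train_bigram(corpus)

print(predict_next(model, "the"))  # "cat" follows "the" twice, "mat" only once
print(predict_next(model, "sat"))
```

A real language model replaces the lookup table with a neural network that assigns a probability to every possible next token given the whole preceding sequence, but the training signal is the same kind of "what came next" statistic.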
As far as tasks go, GPT-4’s performance is a doozy. As with its predecessors, it can follow complex instructions in natural language, and generate technical or creative works, but it can do so with more depth: It supports generating and processing up to 32,768 tokens (around 25,000 words of text), which allows much longer content creation or document analysis than previous models.
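For a rough sense of what that limit means in practice, the two figures quoted above (32,768 tokens, around 25,000 words) imply a ratio of roughly 0.76 words per token. The sketch below uses only that ratio to estimate whether a document of a given word count would fit. The ratio is an approximation derived from this article's numbers, real token counts depend on the tokenizer and the text, and the names `estimate_tokens` and `fits_in_context` are hypothetical.

```python
# Figures quoted in the article; the words-per-token ratio they imply is approximate.
CONTEXT_TOKENS = 32_768
APPROX_WORDS = 25_000


def estimate_tokens(word_count):
    """Estimate tokens for a word count, rounding up (integer ceiling division)."""
    return -(-word_count * CONTEXT_TOKENS // APPROX_WORDS)


def fits_in_context(word_count):
    """Rough check: would a text of this many words fit in the 32,768-token limit?"""
    return estimate_tokens(word_count) <= CONTEXT_TOKENS


print(estimate_tokens(1_000))   # roughly 1.3 tokens per word
print(fits_in_context(25_000))
print(fits_in_context(30_000))
```

Because the limit covers both generating and processing text, a long prompt leaves less room for the model's reply, so an estimate like this is best treated as an upper bound on input size.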
According to OpenAI, GPT-4 was tested on exams including the Uniform Bar Exam, the Law School Admission Test (LSAT), the Graduate Record Examination (GRE) Quantitative, and various AP subject tests. On many of the tasks, it scored at a human level. That means if GPT-4 were a person being judged solely on test-taking ability, it could get into law school, and likely many universities as well.
🤯🤯Well this is something else.
GPT-4 passes basically every exam. And doesn’t just pass…
The Bar Exam: 90%
LSAT: 88%
GRE Quantitative: 80%, Verbal: 99%
Every AP, the SAT… pic.twitter.com/zQW3k6uM6Z
Ethan Mollick (@emollick), March 14, 2023
Along with the introductory website, OpenAI also released a technical paper describing GPT-4’s capabilities and a system card describing its limitations in detail.
Microsoft’s unhinged ace in the hole
Microsoft’s simultaneous GPT-4 announcement means OpenAI has been sitting on GPT-4 since at least November 2022, when Microsoft first tested Bing Chat in India.
“We are happy to confirm that the new Bing is running on GPT-4, customized for search,” writes Microsoft in a blog post. “If you’ve used the new Bing in preview at any time in the last six weeks, you’ve already had an early look at the power of OpenAI’s latest model. As OpenAI makes updates to GPT-4 and beyond, Bing benefits from those improvements to ensure our users have the most comprehensive copilot features available.”
The Bing Chat timeline matches an anonymous tip Ars Technica heard last fall that OpenAI had GPT-4 ready internally but was reluctant to release it until better guard rails could be implemented. While the nature of Bing Chat’s alignment was debatable, GPT-4’s guard rails now come in the form of more alignment training. Using a technique called reinforcement learning from human feedback (RLHF), OpenAI used human feedback on GPT-4’s results to train the neural network to refuse to discuss topics that OpenAI thinks are sensitive or potentially harmful.
“We’ve spent 6 months iteratively aligning GPT-4 using lessons from our adversarial testing program as well as ChatGPT,” OpenAI writes on its website, “resulting in our best-ever results (though far from perfect) on factuality, steerability, and refusing to go outside of guardrails.”
This is part of a breaking news story that will be updated as new details emerge.