New ChatGPT rival, Claude 2, launches for open beta testing

On Tuesday, Anthropic introduced Claude 2, a large language model (LLM) similar to ChatGPT that can craft code, analyze text, and write compositions. Unlike the original version of Claude launched in March, Claude 2 is free for users to try on a new beta website. It’s also available as a commercial API for developers.

Anthropic says that Claude is designed to simulate a conversation with a helpful colleague or personal assistant and that the new version addresses feedback from users of the previous model: “We have heard from our users that Claude is easy to converse with, clearly explains its thinking, is less likely to produce harmful outputs, and has a longer memory.”

Anthropic claims that Claude 2 demonstrates advancements in three key areas: coding, math, and reasoning. “Our latest model scored 76.5% on the multiple choice section of the Bar exam, up from 73.0% with Claude 1.3,” the company writes. “When compared to college students applying to graduate school, Claude 2 scores above the 90th percentile on the GRE reading and writing exams, and similarly to the median applicant on quantitative reasoning.”

[Images: answers from Claude 2, ChatGPT-4, and Google Bard to the question: “Would the color be called ‘magenta’ if the town of Magenta didn’t exist?” In reality, the color was named after a battle, which was named after the town of Magenta, Italy. Credit: Ars Technica]

One of the major enhancements of Claude 2 is its expanded input and output length. As we’ve previously covered, Anthropic has been experimenting with processing prompts of up to 100,000 tokens (fragments of words), which allows the AI model to analyze long documents such as technical guides or entire books. This increased length also applies to its outputs, allowing the creation of longer documents as well.
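To get a feel for what a 100,000-token prompt means in practice, here is a rough Python sketch that estimates whether a document would fit. The ratio of 0.75 words per token is an assumption (a common rule of thumb for English text, not a figure from Anthropic's tokenizer), and the function names are hypothetical, so treat the result as a ballpark only.

```python
import math

# From the article: prompts of up to 100,000 tokens.
CONTEXT_TOKENS = 100_000

# Assumption: roughly 0.75 English words per token. Real tokenizers vary
# by language and content, so this is only a rule-of-thumb estimate.
WORDS_PER_TOKEN = 0.75


def estimate_tokens(text: str, words_per_token: float = WORDS_PER_TOKEN) -> int:
    """Estimate the token count of `text` from its whitespace-separated word count."""
    word_count = len(text.split())
    return math.ceil(word_count / words_per_token)


def fits_in_context(text: str, limit: int = CONTEXT_TOKENS) -> bool:
    """Return True if the estimated token count is within `limit`."""
    return estimate_tokens(text) <= limit


if __name__ == "__main__":
    sample = "word " * 75_000  # a stand-in for a 75,000-word book
    print(estimate_tokens(sample), fits_in_context(sample))
```

Under that assumed ratio, a document of about 75,000 words sits right at the limit, which is roughly the length of a full novel.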

In terms of coding capabilities, Claude 2 reportedly improved as well. Its score on the Codex HumanEval, a Python programming test, rose from 56 percent to 71.2 percent. Similarly, on GSM8k, a test comprising grade-school math problems, it improved from 85.2 percent to 88 percent.
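Benchmark jumps like these can be read two ways: as a gain in percentage points or as a relative improvement over the old score. The short Python sketch below works through both using only the figures reported above; the function and variable names are ours, chosen for illustration.

```python
# Scores as reported in the article (percent correct): (Claude 1.3, Claude 2).
reported_scores = {
    "Codex HumanEval": (56.0, 71.2),
    "GSM8k": (85.2, 88.0),
}


def score_gain(before: float, after: float) -> tuple[float, float]:
    """Return (gain in percentage points, relative gain in percent)."""
    point_gain = after - before
    relative_gain = point_gain / before * 100
    return point_gain, relative_gain


if __name__ == "__main__":
    for benchmark, (before, after) in reported_scores.items():
        points, relative = score_gain(before, after)
        print(f"{benchmark}: +{points:.1f} points ({relative:.1f}% relative)")
    # Codex HumanEval: +15.2 points (27.1% relative)
    # GSM8k: +2.8 points (3.3% relative)
```

The coding gain is the larger of the two by either measure, while the GSM8k score had less room to grow from its higher starting point.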

One of the primary focuses for Anthropic has been to make its language model less likely to generate “harmful” or “offensive” outputs when presented with certain prompts, although measuring those qualities is highly subjective and difficult. According to an internal red-teaming evaluation, “Claude 2 was 2x better at giving harmless responses compared to Claude 1.3.”

Claude 2 is now available for general use in the US and UK for individual users and businesses via its API. Anthropic reports that companies like Jasper, an AI writing platform, and Sourcegraph, a code navigation tool, have begun incorporating Claude 2 into their operations.

It’s important to note that while AI models like Claude 2 can analyze long and complex works, they have limitations that Anthropic acknowledges. After all, language models occasionally make things up out of thin air. Our advice is to not use them as factual references but to let them process data that you provide, and only if you are already familiar with the subject matter and can validate the results.

“AI assistants are most useful in everyday situations, like serving to summarize or organize information,” Anthropic writes, “and should not be used where physical or mental health and well-being are involved.”

https://arstechnica.com/?p=1952822