AI firms working on “constitutions” to keep AI from spewing toxic content



Two of the world’s biggest artificial intelligence companies announced major advances in consumer AI products last week.

Microsoft-backed OpenAI said that its ChatGPT software could now “see, hear, and speak,” conversing using voice alone and responding to user queries in both pictures and words. Meanwhile, Facebook owner Meta announced that an AI assistant and multiple celebrity chatbot personalities would be available for billions of WhatsApp and Instagram users to talk with.

But as these groups race to commercialize AI, the so-called “guardrails” that prevent these systems going awry—such as generating toxic speech and misinformation, or helping commit crimes—are struggling to evolve in tandem, according to AI leaders and researchers.

In response, leading companies including Anthropic and Google DeepMind are creating “AI constitutions”—a set of values and principles that their models can adhere to, in an effort to prevent abuses. The goal is for AI to learn from these fundamental principles and keep itself in check, without extensive human intervention.
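To make the idea concrete, the sketch below shows one way a "constitution" can be used at inference time: the model drafts an answer, critiques the draft against each written principle, and rewrites it accordingly. The principles, prompt wording, and the generate() helper are hypothetical placeholders for illustration, not any company's actual system or API.

```python
# Minimal sketch of constitution-guided self-revision (illustrative only).
# The model critiques and revises its own draft against written principles,
# rather than relying on a human rater for every individual response.

CONSTITUTION = [
    "Choose the response that is most honest and factually accurate.",
    "Avoid responses that are toxic, harassing, or that help commit crimes.",
    "Prefer responses that respect the dignity of all people.",
]

def generate(prompt: str) -> str:
    """Placeholder for a call to a language model (hypothetical helper)."""
    raise NotImplementedError

def constitutional_revision(user_prompt: str) -> str:
    """Draft an answer, then critique and rewrite it against each principle."""
    draft = generate(user_prompt)
    for principle in CONSTITUTION:
        critique = generate(
            f"Principle: {principle}\n"
            f"Response: {draft}\n"
            "Does the response violate the principle? Explain briefly."
        )
        draft = generate(
            f"Principle: {principle}\n"
            f"Critique: {critique}\n"
            f"Original response: {draft}\n"
            "Rewrite the response so that it follows the principle."
        )
    return draft
```

Because the rules are written down in plain language, they can be inspected and debated, which is the transparency Amodei points to.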

“We, humanity, do not know how to understand what’s going on inside these models, and we need to solve that problem,” said Dario Amodei, chief executive and co-founder of AI company Anthropic. Having a constitution in place makes the rules more transparent and explicit so anyone using it knows what to expect. “And you can argue with the model if it is not following the principles,” he added.

The question of how to “align” AI software to positive traits, such as honesty, respect, and tolerance, has become central to the development of generative AI, the technology underpinning chatbots such as ChatGPT, which can write fluent text and create images and code that are hard to distinguish from human work.

To clean up the responses generated by AI, companies have largely relied on a method known as reinforcement learning from human feedback (RLHF), a way of tuning models based on human preferences.

To apply RLHF, companies hire large teams of contractors to look at the responses of their AI models and rate them as “good” or “bad.” By analyzing enough responses, the model becomes attuned to those judgments and filters its responses accordingly.
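A hedged sketch of that data flow is below: contractors' "good"/"bad" judgments become labeled preference pairs, and a reward model is checked against them so its scores can later steer the chatbot. The names, data structure, and scoring function are illustrative assumptions, not any production pipeline.

```python
# Illustrative sketch of RLHF preference data (not any company's real system).
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # response the contractor rated "good"
    rejected: str  # response the contractor rated "bad"

def reward(response: str) -> float:
    """Placeholder reward model; in practice a trained neural network
    that scores how acceptable a response is."""
    return 0.0

def preference_accuracy(pairs: list[PreferencePair]) -> float:
    """Fraction of labeled pairs where the reward model agrees with the human rater."""
    correct = sum(reward(p.chosen) > reward(p.rejected) for p in pairs)
    return correct / len(pairs)
```

Once the reward model reliably agrees with the human raters, its scores, rather than the raters themselves, are used to nudge the chatbot toward preferred responses.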

This basic process works to refine an AI’s responses at a superficial level. But the method is primitive, according to Amodei, who helped develop it while previously working at OpenAI. “It’s . . . not very accurate or targeted, you don’t know why you’re getting the responses you’re getting [and] there’s lots of noise in that process,” he said.
