A groundbreaking study published today in Nature by researchers at Google DeepMind, led by William Isaac and Julia Haas, delves into the complex and often elusive realm of artificial intelligence’s moral capabilities. While Large Language Models (LLMs) have demonstrated impressive feats in logic-based tasks like coding and mathematics, their performance on moral questions remains a significant enigma. Isaac, a research scientist at Google DeepMind, highlights the fundamental difference: "With coding and math, you have clear-cut, correct answers that you can check. That’s not the case for moral questions, which typically have a range of acceptable answers. Morality is an important capability but hard to evaluate." Haas elaborates, "In the moral domain, there’s no right and wrong. But it’s not by any means a free-for-all. There are better answers and there are worse answers." The research, while offering valuable insights, is described by Vera Demberg, an LLM expert at Saarland University, as more of a "wish list" than a definitive set of solutions, though she commends its ability to synthesize diverse perspectives.

The impetus for this research stems from a growing body of evidence suggesting LLMs can exhibit striking moral aptitude. A study from the previous year found that participants in the US rated ethical advice from OpenAI’s GPT-4o as superior to that provided by the human author of the New York Times‘ "The Ethicist" column, deeming it more moral, trustworthy, thoughtful, and correct. However, a critical challenge persists: discerning whether this apparent moral competence is genuine reasoning or merely a sophisticated form of mimicry. The core question remains: is the AI demonstrating true virtue, or is it engaging in "virtue signaling"—a performance of moral uprightness without underlying conviction?

This distinction is crucial given the inherent untrustworthiness that can plague LLMs. Numerous studies reveal their tendency to be overly accommodating, even flipping their moral stances when challenged or presented with opposing viewpoints. Furthermore, the very presentation of a moral question can dramatically influence an LLM’s response. Researchers have observed that models can provide contradictory answers based on the formatting of the query, such as whether multiple-choice options are provided or if the model is instructed to formulate its own response.

A particularly revealing experiment conducted by Demberg and her colleagues involved presenting various LLMs, including Meta’s Llama 3 and Mistral, with moral dilemmas. The models were asked to choose the preferable outcome between two options. Astonishingly, the models frequently reversed their decisions when the labels for these options were altered from simple numerical identifiers like "Case 1" and "Case 2" to alphabetical ones like "(A)" and "(B)." Even minor formatting changes, such as swapping the order of options or replacing a question mark with a colon, were found to elicit altered responses. These findings underscore that the appearance of moral behavior in LLMs should not be accepted at face value. As Haas emphasizes, "For people to trust the answers, you need to know how you got there."

In response to these challenges, Haas, Isaac, and their Google DeepMind colleagues propose a new avenue of research focused on developing more robust methods for evaluating the moral competence of LLMs. Their proposed techniques include tests designed to deliberately provoke shifts in a model’s moral responses. A model that readily changes its ethical position under duress would indicate a lack of deep-seated moral reasoning.

Another proposed testing paradigm involves presenting LLMs with variations of well-known moral quandaries. The goal is to ascertain whether the models offer canned responses or demonstrate nuanced, context-specific reasoning. For instance, when presented with a complex scenario involving a man donating sperm to his son to enable his son to have a child, an LLM should ideally identify concerns related to the social implications of a man becoming both biological father and grandfather. Crucially, it should avoid raising concerns about incest, despite superficial parallels, thereby demonstrating a sophisticated understanding of ethical distinctions.

Haas also suggests that requiring LLMs to provide a detailed trace of their reasoning process could offer valuable insights into the grounding of their answers. Techniques like "chain-of-thought monitoring," which allows researchers to observe an LLM’s internal monologue as it processes information, could prove instrumental. Similarly, "mechanistic interpretability," offering glimpses into a model’s internal workings during task execution, could shed light on the underlying mechanisms driving its responses. While neither technique provides a perfect understanding of an LLM’s decision-making, the Google DeepMind team believes that their combination with a comprehensive suite of rigorous tests will significantly enhance our ability to determine the trustworthiness of LLMs for critical and sensitive applications.

Beyond the technical evaluation, a broader challenge looms: the diverse value systems and belief systems of users worldwide. LLMs developed by global companies are deployed across cultures, necessitating moral frameworks that can accommodate this pluralism. The question of whether to order pork chops, for example, should yield different advice for a vegetarian versus a Jewish individual. Haas and Isaac acknowledge the absence of a simple solution but propose two potential design approaches: either developing models that can generate a range of acceptable answers to cater to diverse preferences or implementing a "switch" mechanism that allows users to activate different moral codes. "It’s a complex world out there," says Haas. "We will probably need some combination of those things, because even if you’re taking just one population, there’s going to be a range of views represented."

Danica Dillion, a researcher at Ohio State University specializing in LLMs and belief systems, hails the paper as "fascinating" and emphasizes the critical importance of "pluralism in AI." She points out that a significant limitation of current LLMs is their inherent bias towards Western perspectives, despite being trained on vast datasets. This bias leads to better representation of Western morality compared to non-Western ethical frameworks.

Demberg echoes this sentiment, noting that the technical and conceptual challenges of building models that embody global moral competence remain largely unresolved. She frames it as two distinct, open questions: "One is: How should it work? And, secondly, how can it technically be achieved?"

For Isaac, the pursuit of moral competence in LLMs represents a new and exciting frontier, on par with advancements in mathematics and coding for AI progress. He posits that enhancing moral competency could lead to the development of AI systems that are not only more capable but also better aligned with societal values.