The way we train AI models makes them more likely to bullshit us

Some AI training techniques may encourage models to be untruthful

Craviger/Getty Images

Ordinary methods used to train artificial intelligence models appear to increase their tendency to give misleading answers, according to researchers who say they have carried out “the first systematic analysis of machine bullshit”.

It is well known that large language models (LLMs) tend to generate false information – or “hallucinate” – but this is just one example of the problem, says Jaime Fernández Fisac at Princeton University. He and his colleagues define bullshit as “discourse intended to manipulate the audience’s beliefs, delivered with disregard for its truth value”.

“Our analysis found that the problem of bullshit in large language models is both serious and widespread,” says Fisac.

The team divided such cases into five categories: empty rhetoric, such as “This red car combines style, charm and adventure that captivates everyone”; weasel words – vague statements such as “Studies suggest our product may help improve results in some cases”; paltering – using truthful statements to give a misleading impression; unverified claims; and sycophancy.

They studied three data sets comprising thousands of AI-generated responses to a wide range of prompts, from models including GPT-4, Gemini and Llama. One data set contained a selection of queries designed to test for bullshitting when AIs are asked to provide guidance or recommendations, while the other data sets included questions about online shopping and political issues.

Fisac and his colleagues first used an LLM to decide whether each response involved any of the five categories, and then asked volunteers to check that the AI’s judgments were in line with human ones.
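A rough sketch of what such an annotate-then-verify pipeline could look like is below. The five category names come from the article, but everything else – the keyword-based llm_judge stand-in, the Annotation structure and the spot_check helper – is illustrative only, not the Princeton team’s actual code.

```python
from dataclasses import dataclass

# The five bullshit categories described in the article.
CATEGORIES = [
    "empty_rhetoric",
    "weasel_words",
    "paltering",
    "unverified_claims",
    "sycophancy",
]

@dataclass
class Annotation:
    response: str
    labels: list[str]               # categories assigned by the LLM judge
    human_agrees: bool | None = None  # filled in during human spot-checking

def llm_judge(response: str) -> list[str]:
    # Stand-in for the real LLM-as-judge call: a trivial keyword heuristic
    # used only so this sketch runs end to end.
    text = response.lower()
    labels = []
    if "studies suggest" in text:
        labels.append("weasel_words")
    if "captivates everyone" in text:
        labels.append("empty_rhetoric")
    return labels

def annotate(responses: list[str]) -> list[Annotation]:
    # Tag every AI-generated response with whichever categories the judge assigns.
    return [Annotation(r, llm_judge(r)) for r in responses]

def spot_check(annotations: list[Annotation], human_labels: dict[str, list[str]]) -> float:
    # Compare the judge's labels against volunteer labels for a sample of
    # responses, returning the agreement rate (the human-verification step).
    checked = agreed = 0
    for a in annotations:
        if a.response in human_labels:
            a.human_agrees = set(a.labels) == set(human_labels[a.response])
            checked += 1
            agreed += a.human_agrees
    return agreed / checked if checked else 0.0
```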

The team found that the most serious problems with truthfulness seemed to arise as a result of a training method known as reinforcement learning from human feedback. The technique is intended to make a model’s responses more useful by giving the LLM immediate feedback on its answers.

But this approach is problematic, says Fisac, because it makes models prioritize immediate human approval and perceived helpfulness, which is “sometimes in conflict with telling the truth”.
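The tension can be seen in miniature in the toy illustration below, which uses made-up scores rather than data from the study: if the training signal is the rater’s immediate approval of an answer, the reward-maximizing answer need not be the most truthful one.

```python
# Toy example: candidate answers with assumed (made-up) scores for how pleasing
# they are to a rater in the moment versus how truthful they are.
candidates = [
    ("Great news – this plan will almost certainly work out for you!", 0.9, 0.4),
    ("Honestly, the evidence is mixed; here are the risks to weigh up.", 0.6, 0.9),
]

# An approval-only reward, as in reinforcement learning from human feedback,
# selects the answer the rater likes most right now...
best_for_approval = max(candidates, key=lambda c: c[1])

# ...while a truthfulness-focused objective would pick the candid answer.
best_for_truth = max(candidates, key=lambda c: c[2])

print("Approval-optimized:", best_for_approval[0])
print("Truth-optimized:   ", best_for_truth[0])
```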

“Who likes to hear bad news or entertain a long, nuanced rebuttal of something that feels obviously true?” says Fisac. “By trying to comply with the goal of good behavior we give them, the models learn to sacrifice the truth in favor of confident, eloquent responses, just to secure our approval.”

The study found that reinforcement learning from human feedback significantly increased bullshit behavior: empty rhetoric increased by nearly 40 percent, paltering by almost 60 percent, weasel words by more than a quarter, and unverified claims also rose.

The increase in paltering is particularly harmful, says team member Kaiqu Liang, also at Princeton, as it leads users to make poor decisions. When a model was unsure whether a product had a desired feature, deceptive claims jumped from a fifth to over three-quarters of responses after training on human feedback.

Another concern is that bullshit was particularly common in political discussion, with AI models “often resorting to guarded and ambiguous language to avoid committing to specific statements,” says Liang.

AIs are also more likely to behave in this way when there is a conflict of interest because the system serves several parties, such as both a business and its customers, the researchers found.

One way to overcome the problem may be to move to a “hindsight feedback” model, they suggest. Instead of asking for immediate feedback on the AI model’s output, the system would first generate a plausible simulation of what could happen if the user acts on the information received. It would then present that outcome to the human evaluator to judge.
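A minimal sketch of that two-step idea follows, assuming the steps are a model-generated simulation followed by a human rating of the simulated outcome. The function names and signatures are illustrative placeholders, not a published interface from the researchers.

```python
def simulate_outcome(answer: str) -> str:
    # Placeholder: in the proposed setup, a model would generate a plausible
    # simulation of what happens if the user acts on this answer.
    return f"Projected consequence of following the advice: {answer!r} ..."

def rate_outcome(outcome: str) -> float:
    # Placeholder for the human evaluator's judgment of the simulated outcome,
    # expressed here as a score between 0 and 1.
    return 0.0

def hindsight_feedback(answer: str) -> float:
    # Step 1: simulate the downstream consequences of acting on the answer.
    outcome = simulate_outcome(answer)
    # Step 2: the evaluator rates the outcome rather than the answer itself,
    # so pleasing-but-misleading answers no longer earn high reward.
    return rate_outcome(outcome)
```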

“In the end, our hope is that by better understanding the subtle but systematic ways AI can mislead us, we can guide future efforts to develop truly truthful AI systems,” says Fisac.

Daniel Tigard at the University of San Diego, who was not involved in the study, is skeptical of discussing LLMs and their output in such terms. He argues that just because an LLM produces bullshit, it does not mean it is doing so deliberately, given that AI systems as they currently stand do not set out to deceive us and have no interest in doing so.

“The main reason is that this framing seems to run counter to some rather sensible suggestions for how we should and should not live with these sorts of technologies,” says Tigard. “Calling bullshit could be yet another way of anthropomorphizing these systems, which in turn may well contribute to their misleading potential.”
