On the "solutionless dilemma" of users being misled by chatbots
“Nothing is certain, except death and taxes,” went the cliché—but that was before the advent of LLMs. Now, one of the giveaways (at least temporarily) of LLM-generated text is the introductory “Certainly, here is…” And we might need a new saying: Nothing is certain, except death, taxes, and the fact that LLM chatbots will generate certain-sounding but sometimes inaccurate content.
Oh, the ambiguity embedded in “certain”! Because to say that something is certain is to say that it’s “known for sure” or “established beyond doubt” (or so Google says, drawing on the Oxford Languages dictionary)—but “certain” can also be used to mean “sure” or “confident” in one’s views. And while most LLMs are designed and trained to sound very certain, they also come with caveats about the accuracy of what they generate.
There’s a growing need to amplify those caveats for most users—to shout them from the digital rooftops, explaining the issue much more clearly whenever non-experts come into contact with chatbots and other applications of generative LLMs.
Instead, people are being encouraged to use chatbots as more “accessible” means of getting health information, information about voting, even information about taxes. And lo, researchers then discover and reveal that some of that health-related, voting-related, and tax-related information was wrong: see “An eating disorders chatbot offered dieting advice, raising fears about AI in health”; “AI doesn’t have all the answers—especially this election season”; “AI tax-prep chatbots are giving bad advice.” Popular chatbots are also bad at providing law-related answers (even as they are touted as a means to help those who can’t otherwise afford legal advice).
Do we really have to point out separately, for every category of information, that chatbots sound certain but shouldn’t?
Back in January 2019, the MIT Tech Review published an article by Karen Hao, titled "Giving algorithms a sense of uncertainty could make them more ethical." It was not about LLMs in particular, and it was long before chatbots were treated as a feature to be added to any and all websites—but it focused on the complexity of the real world and on researchers who were interested in "solutionless dilemmas." Purveyors of chatbots now present us with the opposite: readily findable, readily summarizable, readily draftable answers, even though LLM "hallucinations" remain solutionless.
Last September, journalist Casey Newton wrote about Google’s efforts to address this issue, quoting a senior director of product who said, "We may have created the first language model that admits it has made a mistake." But what he was describing was not an admission per se: Google’s Bard would now come with a "Google It" button that would help check Bard’s responses. As Newton explained,
Double-checking a query will turn many of the sentences within the response green or brown. Green-highlighted responses are linked to cited web pages; hover over one and Bard will show you the source of the information. Brown-highlighted responses indicate that Bard doesn’t know where the information came from, highlighting a likely mistake.
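Stripped of the interface, the double-check Newton describes amounts to a grounding pass layered on top of the model’s output: take each generated sentence, look for a source that supports it, and flag the rest as suspect. A minimal sketch of that idea in Python follows; every name, the stubbed "search" backend, and the green/brown labels are hypothetical stand-ins, not Google’s actual implementation.

```python
# Toy sketch of a "double-check" grounding pass -- NOT Google's code.
# All names, the stand-in "search index," and the labels are hypothetical.
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CheckedSentence:
    text: str
    status: str             # "green" = a supporting source was found, "brown" = none
    source: Optional[str]   # URL of the supporting page, if any


def double_check(sentences, find_support: Callable[[str], Optional[str]]):
    """Look up support for each generated sentence and label it."""
    checked = []
    for sentence in sentences:
        source = find_support(sentence)            # a URL, or None if nothing found
        status = "green" if source else "brown"
        checked.append(CheckedSentence(sentence, status, source))
    return checked


if __name__ == "__main__":
    # Stand-in "search index" so the sketch runs on its own.
    fake_index = {"Paris is the capital of France.": "https://example.com/paris"}
    results = double_check(
        ["Paris is the capital of France.",
         "France has a population of about 9 million."],
        find_support=fake_index.get,        # dict.get returns None when unsupported
    )
    for item in results:
        print(f"[{item.status}] {item.text} -> {item.source}")
```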
“If you’re wondering why Google doesn’t double-check answers like this before showing them to you,” Newton added, “so did I. [The senior director of product] told me that, given the wide variety of ways people use Bard, double-checking is frequently unnecessary. (You wouldn’t typically ask it to double-check a poem you wrote, or an email it drafted, and so on.)”
This, however, points to a problem with bundling such a variety of uses into a single product, as LLM chatbots do. The fact that double-checking isn’t necessary for generated poems doesn’t mean it’s not critically important for other queries—especially when they’re made to a chatbot operated by a company whose name is synonymous with internet searches for information.
Hundreds of millions of people are now using chatbots; in November, for example, OpenAI’s CEO announced that a hundred million people were using ChatGPT on a weekly basis. How many of them know that they shouldn’t rely on it for consistently accurate answers?
The blog post announcing the launch of ChatGPT, in November 2022, opened with an exultant paragraph: “We’ve trained a model called ChatGPT which interacts in a conversational way. The dialogue format makes it possible for ChatGPT to answer followup questions, admit its mistakes, challenge incorrect premises, and reject inappropriate requests.” If you scroll down OpenAI’s announcement to the “Limitations” section, the first bulleted point reads “ChatGPT sometimes writes plausible-sounding but incorrect or nonsensical answers.” The authors of the post didn’t address how this claim intersected with the one about the chatbot admitting its mistakes.
A couple of days ago, Wharton professor Ethan Mollick, who studies the effects of AI and often writes about his own uses of it, summarized (on X) something that has become clear over the past year: “To most users, it isn't clear that LLMs don't work like search engines. This can lead to real issues when using them for vital, changing information. Frontier models make less mistakes, but they still make them. Companies need to do more to address users being misled by LLMs.”
It's certainly, painfully obvious by now that this is true.