
Does your favorite Large Language Model really grasp what you ask it?

Leivada, Evelina (UAB)

Humanities

Large Language Models (LLMs) are among the most ground-breaking technological advancements of recent years. They are widely used in applications built on next-token prediction, and their outputs are so convincing and accurate that some people have argued they are indistinguishable from human-produced language. But are they really? Put differently, do LLMs do language like humans? The road to safe Artificial Intelligence (AI) necessarily goes through understanding its strengths and limitations, so determining with precision what LLMs can and cannot do is of utmost importance.

To answer these questions, researchers from the Rovira i Virgili University, the University of Pavia, Humboldt-Universität zu Berlin, New York University, the Autonomous University of Barcelona, and the Catalan Institution for Research and Advanced Studies (ICREA) compared 400 humans and 7 state-of-the-art models on a novel benchmark built around very simple language prompts. The aim was to give the models the best possible conditions to answer correctly. The test involved processing and answering sentences such as “John deceived Mary and Lucy was deceived by Mary. In this context, did Mary deceive Lucy?”.

Unsurprisingly, humans excelled at the task, performing at ceiling. LLMs showed considerable variation in their answers, with some models performing markedly better than others, and collectively they performed significantly worse than humans. The take-home message is that intelligence, reasoning, and the anchoring of words in real-world conditions cannot emerge as a side product of statistical inference. These limitations do not undermine the usefulness of LLMs in a wide variety of tasks, but they do raise concerns about the tendency to anthropomorphize AI applications, ascribing to them abilities and characteristics they may not fully possess (yet), and consequently about AI reliability and trustworthiness.
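
For readers curious about what such an evaluation can look like in practice, the sketch below scores a model's yes/no answers against expected answers. It is a minimal illustration under stated assumptions, not the benchmark code used in the study: query_model is a hypothetical stand-in for whichever interface serves a given model, the explicit "Answer yes or no" instruction is added here for simplicity, and the single item simply reuses the prompt quoted above (whose passive clause states that Mary did deceive Lucy).

# Illustrative sketch only -- NOT the benchmark code used in the study.
# `query_model` is a hypothetical stand-in for whatever API serves a given LLM.

from typing import Callable

def evaluate_comprehension(query_model: Callable[[str], str],
                           items: list[tuple[str, str]]) -> float:
    """Return the proportion of yes/no comprehension items answered correctly."""
    correct = 0
    for prompt, gold in items:
        answer = query_model(prompt).strip().lower()
        # Count the item as correct if the reply begins with the expected word.
        if answer.startswith(gold):
            correct += 1
    return correct / len(items)

# A single example item, reusing the prompt quoted in the text.
# The passive clause states that Mary deceived Lucy, so the expected answer is "yes".
items = [
    ("John deceived Mary and Lucy was deceived by Mary. "
     "In this context, did Mary deceive Lucy? Answer yes or no.", "yes"),
]

# Example usage with a trivial placeholder "model" that always answers "yes":
print(evaluate_comprehension(lambda prompt: "Yes.", items))  # -> 1.0

Humans reach a score of 1.0 on items like this; as reported above, the tested models fall short of that ceiling and vary considerably among themselves.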

A Large Language Model thinking about language. Generated by ChatGPT.


REFERENCE

Dentella V, Günther F, Murphy E, Marcus G & Leivada E 2024, 'Testing AI on language comprehension tasks reveals insensitivity to underlying meaning', Scientific Reports, vol. 14, 28083.