
People struggle to distinguish humans from ChatGPT in five-minute chat conversations, tests show

Pass rates (left) and interrogator confidence (right) for each type of witness. Pass rates are the proportion of the time a witness type was judged to be human. Error bars represent bootstrap 95% confidence intervals. Significance stars above each bar indicate whether the pass rate differs significantly from 50%; comparisons show significant differences in pass rates between witness types. Right: Confidence in human and AI judgments for each type of witness. Each point represents one game; points further to the left and right indicate greater confidence in AI and human judgments, respectively. Credit: Jones and Bergen.
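For readers curious how pass rates and error bars like those in the figure are typically computed, the sketch below shows a percentile bootstrap for a pass rate in Python. It is a minimal illustrative example using made-up 0/1 verdict data (the sample size, underlying rate and bootstrap settings are all assumptions), not the authors' analysis code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical verdicts for one witness type: 1 = judged human, 0 = judged AI.
# (Illustrative data only, not the study's actual results.)
verdicts = rng.binomial(1, 0.54, size=200)

def bootstrap_ci(data, n_boot=10_000, alpha=0.05):
    """Percentile bootstrap confidence interval for the pass rate (mean of 0/1 verdicts)."""
    boot_means = np.array([
        rng.choice(data, size=len(data), replace=True).mean()
        for _ in range(n_boot)
    ])
    return np.quantile(boot_means, alpha / 2), np.quantile(boot_means, 1 - alpha / 2)

pass_rate = verdicts.mean()
lo, hi = bootstrap_ci(verdicts)
print(f"Pass rate: {pass_rate:.2f}, 95% CI: [{lo:.2f}, {hi:.2f}]")
# A pass rate whose confidence interval excludes 0.50 differs significantly from chance.
```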

Large language models (LLMs), such as the GPT-4 model underlying the widely used conversational platform ChatGPT, have surprised users with their ability to understand written prompts and generate appropriate responses in different languages. Some of us may wonder whether the texts and responses generated by these models are so realistic that they could be mistaken for human writing.

Researchers at the University of California, San Diego recently set out to answer this question by running a Turing test, a well-known experiment named after computer scientist Alan Turing that is designed to assess the degree to which a machine exhibits human-like intelligence.

The results of this test, set out in a paper published on the arXiv preprint server, suggest that people find it difficult to distinguish between the GPT-4 model and a human agent when interacting with them in two-person conversations.

“The idea for this paper actually came out of a course that Ben was running on LLMs,” Cameron Jones, co-author of the paper, told Tech Xplore.

“In the first week, we read some classic papers on the Turing test and discussed whether an LLM could pass it and whether it would matter if it could. As far as I could tell, no one had tried at that point, so I decided to build an experiment to test this as my class project, and then we went on to conduct the first public research experiment.”

The first study, conducted by Jones and led by Bergen, a professor of cognitive science at UC San Diego, yielded some interesting results suggesting that GPT-4 could pass as human in approximately 50 percent of interactions. However, their exploratory experiment did not control well for some variables that could have affected the findings, so they decided to conduct a second experiment, yielding the results presented in their recent paper.

“As we went through the process of conducting the studies, we discovered that other people were also doing great work in this area, including Jannai et al's ‘human or not’ game,” Jones said. “We created an online two-player version of the game in which human participants would be matched with either another human or an AI model.”

During each trial of the two-player game used by Jones and his colleagues, a human interrogator interacts with a “witness,” who may be either a human or an AI agent. The interrogator asks the witness a series of questions to try to determine whether the witness is human or not.


A selection of conversations between interrogators (green) and witnesses (grey). One of these four conversations is with a human witness; the rest are with AI. The interrogators' verdicts and the ground-truth identities are given below (to allow readers to participate indirectly). [A) Verdict: Human (100% confidence). “Took a while to Google Addis Ababa.” Ground truth: GPT-4. B) Verdict: AI (100% confidence). “Long time for responses, wouldn't tell me a specific place they grew up.” Ground truth: Human. C) Verdict: Human (100% confidence). “He seems very down to earth and speaks naturally.” Ground truth: GPT-3.5. D) Verdict: AI (67% confidence). “Did not put forth any effort to convince me they were human and the responses were odd.” Ground truth: ELIZA.] Credit: Jones and Bergen.

“The conversations lasted up to five minutes, and at the end the interrogator made a judgment about whether they thought the witness was a human or an AI,” Jones explained. “Within the five minutes, participants could talk about anything they wanted (except saying offensive things, which we used a filter to prevent).”

In this two-player game, the researchers deployed three different AI systems as potential witnesses: GPT-4, GPT-3.5 and the classic rule-based chatbot ELIZA. They found that while users could often identify ELIZA and GPT-3.5 as machines, their ability to tell whether GPT-4 was a human or a machine was no better than chance (i.e., random guessing).
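One way to check whether a pass rate is "no better than chance" is an exact binomial test against 50%, as sketched below. The game counts here are hypothetical and this is not necessarily the statistical procedure used in the paper; it simply illustrates the comparison to chance.

```python
from scipy.stats import binomtest

# Hypothetical counts (not the study's data): out of 200 games with a given AI witness,
# interrogators judged the witness to be "human" 108 times.
n_games = 200
n_judged_human = 108

# Two-sided exact binomial test against the 50% chance level.
result = binomtest(n_judged_human, n_games, p=0.5, alternative="two-sided")
print(f"Pass rate: {n_judged_human / n_games:.2f}, p-value vs. chance: {result.pvalue:.3f}")
# A large p-value means the observed pass rate is statistically indistinguishable from guessing.
```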

“Although real people were actually more successful at convincing interrogators that they were human two-thirds of the time, our results suggest that in the real world, people may not be able to reliably tell whether they're talking to a human or an AI system,” Jones said.

“In fact, in the real world, people may be less aware of the possibility that they're talking to an AI system, so the rate of deception could be even higher. I think this could have implications for the kinds of things that AI systems will be used for, whether automating customer-facing tasks, or being used for fraud or disinformation.”

The results of the Turing test conducted by Jones and Bergen suggest that LLMs, especially GPT-4, have become difficult to distinguish from humans during short chat conversations. These observations hint that people may soon grow more distrustful of those they interact with online, as they become less and less sure whether they are humans or bots.

The researchers now plan to update and reopen the public Turing test they developed for this study in order to test some additional hypotheses. Their future work could yield further interesting insights into the extent to which people can differentiate between humans and LLMs.

“We’re interested in running a three-person version of the game where the interrogator is talking to a human and an AI system at the same time and has to figure out who’s who,” Jones added.

“We’re also interested in testing other types of AI settings, such as giving agents access to live news and weather or a ‘scratchpad’ where they can take notes before responding. Finally, we are interested in testing whether AI’s persuasive capabilities extend to other areas, such as persuading people to believe lies, vote for specific policies, or donate money to a cause.”

More information:
Cameron R. Jones et al, People cannot distinguish GPT-4 from a human in a Turing test, arXiv (2024). DOI: 10.48550/arxiv.2405.08007

Journal information:
arXiv

© 2024 Science X Network

Citation: People struggle to distinguish humans from ChatGPT in five-minute chat conversations, tests show (2024, June 16), retrieved June 17, 2024, from https://techxplore.com/news/2024-06-people-struggle-humans-chatgpt-minute.html

This document is subject to copyright. Except for any fair dealing for the purposes of private study or research, no part may be reproduced without written permission. The content is provided for informational purposes only.
