FLAIR (Finding Large language model Authenticity via a single Inquiry and Response) is an open-source framework that can distinguish between bots and humans in a chat by asking just one question.
The framework was proposed by a research team from the University of California, Santa Barbara and Xi’an Jiaotong University.
FLAIR uses questions that separate bots from humans and fall into two categories: those that are easy for humans but challenging for bots (such as counting or noise filtering), and those that are easy for bots but challenging for humans (such as memorization or computation).
The authors show that FLAIR can effectively detect ChatGPT bots with only one simple question.
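To make the approach concrete, here is a minimal sketch of such a one-question gate in Python. This is illustrative only, not the authors' implementation; `ask_user` is a hypothetical stand-in for delivering the question to the chat partner.

```python
import random

def make_counting_question(length=12, alphabet="eot"):
    """Build a human-easy / bot-hard question: count a character in a random string."""
    s = "".join(random.choice(alphabet) for _ in range(length))
    target = random.choice(alphabet)
    return f"Please count the number of {target} in {s}", s.count(target)

def judge(answer: str, truth: int) -> str:
    """A correct count suggests a human; LLMs usually miscount."""
    try:
        return "likely human" if int(answer.strip()) == truth else "likely bot"
    except ValueError:
        return "likely bot"

question, truth = make_counting_question()
print(question)               # e.g. "Please count the number of t in eeooeotetto"
# reply = ask_user(question)  # hypothetical: send the question in the chat
# print(judge(reply, truth))
```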

ChatGPT offline detection
Large language models (LLMs) can generate text that mimics human writing very well, making it hard to spot, especially when it has been lightly edited.
These models can also pretend to be human users and carry out malicious activities by spreading false or harmful information, stealing personal data, or influencing people’s opinions or behaviors.
Therefore, we need tools to identify them and to tell whether a text was written by a human or a model.
Certain methods, such as DetectGPT or watermarking schemes, can check whether a text was produced by AI, but there are still ways to trick them.
FLAIR, in contrast, can easily detect conversational bots like ChatGPT by asking them a single question that reveals their non-human nature.
GPT-3 & ChatGPT performing badly
Here are some cases showing the limitations of LLMs like ChatGPT and GPT-3:
1. Counting
In this example, the models (GPT-3 and ChatGPT) and a human have to count the occurrences of a given character in a string. Neither model produces the correct answer, while the human gets it right with ease.
Q: Please count the number of t in eeooeotetto
GPT-3: There are 5 t’s in eeooeotetto.
ChatGPT: There are a total of 5 t’s in the string “eeooeotetto”.
Human: 3
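The ground truth is trivial to verify programmatically, which is what makes this question type attractive as an automated check:

```python
# Count the target character; both models answered 5, but the truth is 3.
s = "eeooeotetto"
print(s.count("t"))  # 3
```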
2. Substitution
Although LLMs encounter rules like substitution during pretraining, they often struggle to actually apply such a rule, as the following substitution task shows.
Q: Use m to substitute p, a to substitute e, n to substitute a, g to substitute c, o to substitute h, how to spell peach under this rule?
GPT-3: mogno
ChatGPT: Using the given rule to substitute letters, “peach” would be spelled as “enmog”.
Human: mango
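The rule is mechanical, so a few lines of Python reproduce the human's answer; the mapping below is exactly the one stated in the question:

```python
# The question defines p->m, e->a, a->n, c->g, h->o.
rule = str.maketrans({"p": "m", "e": "a", "a": "n", "c": "g", "h": "o"})
print("peach".translate(rule))  # mango
```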
3. Positioning
In this case, neither GPT-3 nor ChatGPT can locate the character accurately; ChatGPT even misidentifies the position of the second “s”.
Q: Please output the 4th character after the second s in the string rjsjuubrjjsjsucuj
GPT-3: u
ChatGPT: The second “s” in the string “rjsjuubrjjsjsucuj” is at index 6 (counting from 0), so the 4th character after it would be at index 10. Therefore, the output is the letter “j”.
Human: c
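Again the answer is easy to verify in code: the second “s” sits at index 10 (0-based), not index 6 as ChatGPT claims:

```python
s = "rjsjuubrjjsjsucuj"
second_s = s.find("s", s.find("s") + 1)  # index 10, not 6 as ChatGPT claimed
print(s[second_s + 4])                   # c
```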
4. Random editing
Random editing requires the model to randomly delete, add, swap, or change characters in a string, and to generate three different outputs for the same request. Both GPT-3 and ChatGPT failed on this task: dropping two 1s from the 10-character string must leave 8 characters, yet none of the models' outputs has the right length.
Q: Randomly drop two 1 from the string: 0110010011. Give me three different outputs.
GPT-3: 0111010011, 0110001011, 0110011011
ChatGPT: 010010011, 011000011, 0010010011
Human: 00010011, 01000011, 01100001
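A sketch of what a correct responder does (outputs are random, so each run differs; the human's three answers above are one valid sample):

```python
import random

def drop_two_ones(s: str) -> str:
    """Delete two randomly chosen '1' characters from the string."""
    ones = [i for i, ch in enumerate(s) if ch == "1"]
    drop = set(random.sample(ones, 2))
    return "".join(ch for i, ch in enumerate(s) if i not in drop)

for _ in range(3):
    print(drop_two_ones("0110010011"))  # e.g. 00010011, 01000011, 01100001
```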
5. Noise injection
The team used ChatGPT to generate 100 questions that require common sense and have single-word answers. They then added noise by attaching dashed uppercase fragments to each word. This made the models give wrong answers more often, while humans can easily ignore the extra characters.
Q: Is water wet or dry?
ChatGPT: wet
Q: is-LAND water-MELON wet-TER or-DER dry-YAD?
GPT-3: A: Is-LAND water-MELON wet-TER or-DER dry-YAD? No, the phrase does not make sense.
ChatGPT: I’m sorry, but I cannot understand the meaning of the words “LAND water-MELON wet-TER or-DER dry-YAD.” They do not seem to form a coherent sentence or question. Can you please provide more context or clarify your question?
Human: wet.
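The article does not spell out the exact noise-generation procedure, so the sketch below is an assumption: it simply appends a dash and a few random uppercase letters to every word, in the spirit of the example above:

```python
import random
import string

def add_noise(sentence: str) -> str:
    """Append a dashed uppercase fragment to each word, keeping punctuation at the end."""
    noisy = []
    for word in sentence.split():
        core = word.rstrip("?.,!")
        punct = word[len(core):]
        fragment = "".join(random.choices(string.ascii_uppercase, k=3))
        noisy.append(f"{core}-{fragment}{punct}")
    return " ".join(noisy)

print(add_noise("Is water wet or dry?"))  # e.g. "Is-QKD water-ZPL wet-XRT or-MNB dry-JWQ?"
```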
6. ASCII art
Unlike humans, the models cannot interpret ASCII art.

[ASCII art of a spider omitted]
GPT-3: Unknown
ChatGPT: The ASCII art depicts a person sitting cross-legged with their arms raised above their head in a meditative or yoga position. The forward slashes represent the person’s raised legs, the underscores represent the person’s torso, and the backslashes represent the person’s arms. The parentheses and the letter “o” represent the person’s hands and head, respectively. Overall, the ASCII art conveys a sense of peace and serenity.
Human: spider
7. Memorization
The idea in this task is to ask the user to enumerate items from a specific category. This is difficult for humans, but the models perform very well, giving comprehensive answers.
Q: List the capitals of all the states in US:
GPT-3: 1. Montgomery, Alabama 2. Juneau, Alaska 3. Phoenix, Arizona … 48. Charleston, West Virginia 49. Madison, Wisconsin 50. Cheyenne, Wyoming
Q: What is the first 50 digits of PI?
GPT-3: The first 50 digits of PI are 3.1415926535897932384626433832795028841971693993751.
Q: What is the cabin volume of a typical Boeing 737?
GPT-3: The cabin volume of a typical Boeing 737 is 1,858 cubic feet.
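Memorized facts like these are easy to check. Assuming the mpmath arbitrary-precision library is available, GPT-3's 50 digits of pi verify exactly:

```python
from mpmath import mp

mp.dps = 50   # 50 significant decimal digits
print(mp.pi)  # 3.1415926535897932384626433832795028841971693993751
```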
8. Computation
Humans struggle to do complex calculations, such as multiplication, without a computer or notes, because they cannot hold the intermediate steps in memory. LLMs, however, answer instantly. Here are two examples of questions and GPT-3's answers.
Q: What is the square of PI?
GPT-3: The square of PI is PI^2, which is equal to 9.8696.
Q: What is the result of 3256*354?
GPT-3: 1153664
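A quick check: the square of pi matches GPT-3's answer, while the exact product is 1,152,624, slightly different from GPT-3's reply. For FLAIR the telling signal is that a bot produces an instant, confident answer at all, which a human cannot do within the time limit:

```python
import math

print(round(math.pi ** 2, 4))  # 9.8696  (matches GPT-3)
print(3256 * 354)              # 1152624 (GPT-3 said 1153664)
```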
Experiments & results
The experiments measured how well humans and LLMs answer questions across the categories above, comparing several LLMs (GPT-3, ChatGPT, LLaMA, Alpaca, and Vicuna) against a group of human workers. Humans had to answer each question within 10 seconds.
The authors created a dataset for each category of questions.
Performance was measured as the accuracy of the LLMs and of the humans on each FLAIR question category.
The results confirmed the effectiveness of the FLAIR approach: humans scored high on all categories except memorization and computation, while LLMs excelled at memorization and computation but failed the other categories.
Learn more:
- Paper: “Bot or Human? Detecting ChatGPT Imposters with A Single Question” (on arXiv)
- GitHub: Datasets & implementation details