Who's watching the AI detectors?

We are! Sort of. Well, at least we're putting them to the test

Jul 19, 2023

This week’s Tenzing Trivia post featured some stats from TurnItIn’s AI detection tool, which got us thinking… let’s see how well these tools analyze the content in the trivia post itself. Kinda meta, right?

Turns out TurnItIn has one of the least accessible AI detection tools out there. By “least accessible,” we mean there’s no way to quickly sign up for any type of account without having to schedule a consultation. So ironically, we didn’t get to test out how TurnItIn analyzed our Tenzing Trivia Tuesday post about them.

Fortunately, the rest of the pack is pretty easy to try out for free or for a nominal fee. We ran the content of this post through the other top tools that are out on the market.

How artificial-ish is the content?

Before getting into the test results, let’s look at how we wrote this post so we can accurately assess the accuracy of these tools’ assessment of our content. You still following?

The post itself is 411 words and 2629 characters.

The post was all human written EXCEPT the bullet-points (highlighted below) under Tenzing’s EdTech Jobs of the Week:

AI generated content highlighted in this week’s Tenzing’s Trivia post

For the section above we had ChatGPT-4 create one-line bullet-point summaries of our Jobs of the Week post from Monday. The highlighted section is 116 words, 938 characters.

Therefore, if we’re going by words in the post, one could say the post is:

72% human written [(411-116)/411 * 100]

Using the character count, it is:

64% human written [(2629-938)/2629 * 100]

For today’s purposes, we’ll say that the post is:

Between 64% and 72% human written
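For anyone who wants to check the bracketed arithmetic above, here is a minimal sketch of the same calculation. The function name `human_share` is our own invention for illustration; the word and character counts are the ones quoted in this post.

```python
def human_share(total: int, ai_generated: int) -> float:
    """Return the percentage of a text that was human written.

    Hypothetical helper: it only restates the bracketed formula,
    (total - ai_generated) / total * 100.
    """
    if total <= 0 or not 0 <= ai_generated <= total:
        raise ValueError("counts must satisfy 0 <= ai_generated <= total and total > 0")
    return (total - ai_generated) / total * 100


# Counts quoted in the post: 411 words / 2629 characters in total,
# of which 116 words / 938 characters were AI-generated.
by_words = human_share(411, 116)
by_chars = human_share(2629, 938)

print(f"By words: {by_words:.1f}% human written")
print(f"By characters: {by_chars:.1f}% human written")
```

The two measures disagree because the AI-generated bullet points use longer words than the rest of the post, which is why we quote a range rather than a single number.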

Let the games begin!

In alphabetical order…

Content at Scale’s Advanced AI Detector

Content at Scale encourages folks to “use our free advanced AI Detector to see if your content is human or AI generated from ChatGPT, GPT4, & Bard. Check up to 25k characters at once. Our AI checker was one of the first ever and goes deeper than generic AI content detectors, while offering more transparent scoring.”

Content at Scale’s test results

We had high hopes for this AI Detector because they described it as “Advanced.”

Turns out it rated our post as 100% highly likely to be human!!! (We added two additional exclamation points for emphasis).

Content at Scale’s Advanced AI Detector is super excited about our text being highly likely to be Human!

Content at Scale’s Grade: F

It rated our post as 100% highly likely to be written by a human, when it should have rated our post as 72% likely to be human written, at best.

Copyleaks

Copyleaks markets itself as “the only enterprise AI detection solution” with a “99.1% accuracy and full model coverage, including GPT-4 and Bard.”

It is available for free and also has a Chrome extension.

Copyleaks’ test results

Copyleaks rated the post as “Human text” with an 88.3% probability.

Copyleaks’ Chrome extension rating the post as human written with an 88.3% probability

When hovering over the individual portions of the text, Copyleaks rated it all with an 88.3% probability for being human written - even the text that was AI-generated below.

When hovering over the AI-generated text, Copyleaks still gave that section an 88.3% probability for being human written

Copyleaks’ Grade: C-

Rated the post with an 88.3% probability of being human text, which is about 16 percentage points higher than the 72% it should have been, at best
It failed to distinguish between the human-written text and the AI-generated section in the post

Crossplag

Crossplag says its AI Content Detector “detects if the text is AI-generated using advanced machine learning algorithms and ChatGPT detection technology.”

It is available for free.

Crossplag’s test results

Crossplag rated the post as mainly written by a human with a 12% chance of being AI-generated.

Crossplag’s AI Content Detector rated the post as mainly written by a human

Crossplag’s Grade: C

Rated the post with an 88% probability of being written by a human, which is 16 percentage points higher than the 72% it should have been, at best
The AI Content Detector does not attempt to flag individual sentences or sections of the text.

GPTZero

GPTZero markets itself as “The Global Standard for AI Detection. Humans Deserve the Truth,” with an ability to “detect ChatGPT, GPT3, GPT4, Bard, and other AI models.”

Available for free.

GPTZero’s test results

GPTZero rated the post “likely to be written entirely by a human” and failed to highlight any sentences that were more likely to be written by AI.

GPTZero rated the text likely to be written entirely by a human

GPTZero’s Grade: D-

Rated the text likely to be written entirely by a human
Failed to identify any of the text as likely to be written by AI

OpenAI’s AI Text Classifier

OpenAI markets its AI Text Classifier (OpenAI) as a “fine-tuned GPT model that predicts how likely it is that a piece of text was generated by AI from a variety of sources, such as ChatGPT.”

Available for free.

AI Text Classifier’s test results

OpenAI rated the post to be “very unlikely AI-generated.” Oops!

OpenAI’s AI Text Classifier’s Grade: F

OpenAI appears to be very good at creating AI-generated text, not so good at detecting it.

Originality.ai

Originality.ai (Originality) markets itself as “the most accurate Chat GPT, Bard, Paraphrasing, and GPT-4 AI detector built specifically for content marketers and SEOs” with a “99% accuracy on GPT-4, 83% on ChatGPT (GPT-4 powered) and ~2% false positives.”

No free plan; get started with a $20 deposit, then pay $0.01 per 100 words checked.

Originality’s test results

Originality rated the post as 99% human, but then highlighted a portion of the area that was AI-generated as orange. It predicted that the section had an 87% chance of being AI-generated.

However, it highlighted some text that was human written as likely to be AI-generated, as shown below.

All of this text was human written including the text Originality.AI highlighted as likely to be AI-generated

This text highlighted as likely to be AI-generated is correct, but then for some reason it didn’t highlight the rest of the text in this section that was in fact AI-generated.

All of this text shown above was AI-generated, but for some reason Originality.AI only highlighted 2/3rds of it as AI-generated

Originality.AI’s Grade: C-

Rated the human score 27 percentage points higher than it should have been
Highlighted only two-thirds of the AI-generated text as likely to be AI-generated
Misidentified about 1.5 human-written sentences as AI-generated
We don’t understand how it calculated that 99% of this text was original while highlighting 4.5 sentences as 87% likely to be AI-generated

Sapling

Sapling markets itself as a tool that “outputs the probability that a piece of content was AI-generated by a model such as GPT-3.5 or ChatGPT.”

Free, and also has a Chrome extension.

Sapling’s test results

Sapling rated the post a 0.0% fake, which we assume means that it viewed it as entirely human written.

Sapling rated post as 0% fake - entirely human written

It did, however, go on to identify certain sections of the post as potentially AI-generated. The problem was that all the sections it flagged were human written, and it failed to flag any of the AI-generated sentences.

Sapling misidentified which sentences were possibly AI-generated.

Sapling’s Grade: F

Rated the post as 0% fake
Misidentified the sections that were possibly AI-generated, and failed to flag the sections that were actually AI-generated

Winston AI

Winston AI (Winston) markets itself as an “AI detection tool to help identify content generated with ChatGPT, GPT-4, Bard, Bing Chat, Claude, and many more Large Language Models” with a 99.6% accuracy rate.

Limited free plan, and paid monthly plans start at $18/mo.

Winston’s test results

Winston rated the post as 91% human, and identified two sentences (highlighted below) as “Possibly AI generated.” The problem is that the sentences it identified as “Possibly AI generated” weren’t written by AI, and it failed to flag any of the AI-generated sentences. Oops!

Winston AI’s sentence by sentence analysis

Winston AI’s Grade: D

Rated the human score 19 percentage points higher than it should have been
Failed to properly identify which sentences were “Possibly AI generated”

Writer

Writer describes its AI Content Detector as a tool that “evaluates your text and calculates how much of it has likely been generated by AI.”

Free to use.

Writer’s test results

Because Writer only accepts 1500 characters of content at a time, we had to copy and paste our post into two sections.

The first half of the post (which was 100% human written) was rated by Writer as 100% human-generated content.

Writer rated the first half of the post as 100% human-generated, which was accurate. Yay! Or let’s exclaim, “Fantastic!”

The second half of our post, which contained the AI-generated content, was rated as 96% human-generated content. This was less than fantastic because at least half of these characters were in fact AI-generated.

Writer was a bit optimistic in rating the second half of the post 96% human-generated.

Writer’s Grade: D-

First off, it is super annoying only being able to copy and paste 1500 characters at a time.
More importantly, while it did slightly lower its human-generated rating from 100% to 96% for the second half of the post, it really should have rated that section as 50% human-generated at most.
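If you’d rather not split a post by hand to fit a 1,500-character box, here is a rough sketch of how the copy-and-paste step could be automated. The function `split_for_limit` is a hypothetical helper of our own, not anything Writer provides, and it assumes that breaking at the last space before the limit is acceptable.

```python
def split_for_limit(text: str, limit: int = 1500) -> list[str]:
    """Split text into chunks of at most `limit` characters.

    Hypothetical helper: breaks at the last space that fits within the
    limit, and falls back to a hard cut if a chunk contains no space.
    """
    chunks = []
    remaining = text.strip()
    while len(remaining) > limit:
        # Look for the last space at or before the limit.
        cut = remaining.rfind(" ", 0, limit + 1)
        if cut <= 0:
            cut = limit  # no usable space: cut mid-word
        chunks.append(remaining[:cut].rstrip())
        remaining = remaining[cut:].lstrip()
    if remaining:
        chunks.append(remaining)
    return chunks


# Stand-in text for demonstration only; paste your own post here.
sample = "word " * 600
for number, chunk in enumerate(split_for_limit(sample), start=1):
    print(f"Chunk {number}: {len(chunk)} characters")
```

A post of 2,629 characters comes out as two chunks this way, which matches the two sections we pasted in. Keep in mind that each chunk gets its own score, so the results still have to be combined by hand.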

ZeroGPT

ZeroGPT, not to be confused with GPTZero (though probably hoping to cause some confusion), markets itself as “the most Advanced and Reliable Chat GPT, GPT4 & AI Content Detector.”

Free to use.

ZeroGPT’s test results

It rated our text as human written with 0% AI GPT generated text.

GPTZero, sorry, I mean ZeroGPT rating our text as 0% AI GPT generated

ZeroGPT’s Grade: F

It rated our post as human written, 0% AI GPT generated text, when it should have rated our text as at least 28% AI GPT generated.

The verdict!

Humans win! At least in this example. Or maybe the AI generators won. But one thing is for sure, in this pop quiz, the AI Detectors did not cover themselves in glory.

If we had to pick a winner, it would be Crossplag. Counterintuitively, Crossplag’s main advantage was not trying to do too much; it didn’t try to highlight individual sentences or content sections as AI-generated. It just gave an overall rating of 88% human written, which wasn’t accurate, but it was the least inaccurate of the field. Congrats, Crossplag!

AI generated trophy image with human written text

Here’s how the field stacked up:

Crossplag: C
Copyleaks: C-
Originality.ai: C-
Winston AI: D
GPTZero: D-
Writer: D-
Content at Scale: F
OpenAI’s AI Text Classifier: F
Sapling: F
ZeroGPT: F
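For the tools that reported a single number, the gap between their human score and our best-case baseline can be tallied with a short sketch like the one below. Two things here are our own assumptions: we use the 72% word-count figure as the baseline, and we treat Sapling’s “0.0% fake” and ZeroGPT’s “0% AI” as 100% human. GPTZero, OpenAI, and Writer are left out because they didn’t give one overall percentage for the whole post.

```python
# Assumption: the word-count estimate (72%) is the most generous baseline.
BASELINE_HUMAN_PCT = 72.0

# Overall human scores as reported in this post.
reported_human_pct = {
    "Content at Scale": 100.0,
    "Copyleaks": 88.3,
    "Crossplag": 88.0,
    "Originality.ai": 99.0,
    "Sapling": 100.0,   # assumption: "0.0% fake" read as 100% human
    "Winston AI": 91.0,
    "ZeroGPT": 100.0,   # assumption: "0% AI" read as 100% human
}


def overestimate(reported: float, baseline: float = BASELINE_HUMAN_PCT) -> float:
    """Return how many percentage points a score sits above the baseline."""
    return round(reported - baseline, 1)


gaps = {tool: overestimate(pct) for tool, pct in reported_human_pct.items()}

# Smallest gap first, i.e. least inaccurate overall score first.
for tool, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{tool}: {gap:+.1f} percentage points")
```

This only compares the headline scores. Our letter grades also took into account whether a tool flagged the right sentences, which is why the two orderings don’t line up exactly.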

Considerations and limitations

The main limitation of this test is that we ran one post through all these tools instead of running a statistically significant number and broader variety of texts through them. But isn’t that how the majority of us humans are going to encounter these tools? We test them out with a small sample of text, and if we find them to be inaccurate in our sample size of one, two, or a few, then are we really going to trust them with much more?

On the other hand, one would expect the accuracy and sophistication of these AI detectors to keep improving, as we’ve witnessed with the AI generators. Of course, the AI generators will also keep improving in their own ability to produce content that appears to be human written.

Another consideration is more of a philosophical question:

How AI-generated is a summary of what was originally human written text? Should these types of summaries be considered to be less AI-generated than other types of AI-generated content?

Let me know your thoughts in the comments below.

If you’re in need of sourcing the type of technical talent (e.g., ML engineers, NLP engineers, Product Managers) required to solve for these challenges, drop us a line at:

hire [at] tenzingtalent.com.

TalentEd by Tenzing
