Why AI Content Detectors Disagree on the Same Text and What Their Scores Actually Measure

Paste the same paragraph into three different AI content detectors and you will get three different answers. One will call it 92% AI. Another will call it 30%. A third will refuse to give a number at all and just hedge. The text did not change in between.

This is not a bug in one of the tools. It is what AI content detection actually is, and most of the marketing copy around these tools obscures that fact. A detector score is not a verdict. It is a statistical estimate, with a wide confidence band, based on a handful of measurable surface features of the writing. Treating the number as a yes-or-no answer is the most common mistake people make with these tools.

This guide walks through what the score actually measures, why two detectors will disagree, what they catch reliably, what they get wrong, and how to read a score honestly when the answer matters.

What a Detector Is Actually Measuring

AI detectors do not read text the way a human reader does. They look at statistical features of the writing and compare those features against profiles of known AI output and known human output. The features they look at are concrete and limited:

Sentence uniformity is one of the strongest signals. Human writing varies sentence length unevenly, sometimes by a lot. A six-word sentence next to a thirty-word sentence is a normal human pattern. AI output, especially from earlier generations of language models, tends to produce sentences clustered around the same length, because the model is optimizing toward a statistically average rhythm.
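A minimal sketch of how sentence uniformity could be measured, assuming a naive split on sentence-ending punctuation and using the coefficient of variation of sentence length as the statistic. The function names here are hypothetical, and a real detector likely uses a more careful tokenizer and a different statistic.

```python
import re
import statistics

def sentence_lengths(text: str) -> list[int]:
    """Split on ., !, ? (a naive assumption) and count the words in each sentence."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]

def length_variation(text: str) -> float:
    """Coefficient of variation of sentence length; 0.0 means perfectly uniform."""
    lengths = sentence_lengths(text)
    if len(lengths) < 2:
        return 0.0  # not enough sentences to measure variation
    return statistics.stdev(lengths) / statistics.mean(lengths)

uniform = "The tool reads the text. The score shows the result. The user checks the page."
varied = ("It failed. Nobody on the team could explain why the same paragraph "
          "scored so differently an hour later.")

print(length_variation(uniform))  # 0.0
print(length_variation(varied))   # about 1.1
```

A low value here is only a weak hint on its own. Tightly edited human prose can also produce uniform sentence lengths.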

Vocabulary diversity measures how often the writer uses unusual or low-frequency words. Human writers reach for distinctive vocabulary more often than they probably realize. AI output, particularly at default temperature settings, tends to gravitate toward common, safe word choices because those are the most statistically likely next tokens.
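As a rough illustration, vocabulary diversity can be approximated with a type-token ratio: distinct words divided by total words. This is a simpler proxy than the low-frequency-word measure described above, which would need a word-frequency table, and the function name is hypothetical.

```python
import re

def type_token_ratio(text: str) -> float:
    """Distinct words divided by total words; higher means more varied vocabulary."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    return len(set(words)) / len(words)

print(type_token_ratio("the cat and the dog and the bird"))  # 0.625
```

The ratio drops naturally as text gets longer, so it is only meaningful when comparing passages of similar length.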

Burstiness is a term borrowed from network analysis. It describes whether interesting words and ideas cluster together or are spread evenly through the text. Human writing tends to be bursty - moments of dense, original phrasing surrounded by more ordinary connective tissue. AI output tends to be smoother, with a more uniform distribution of complexity across the passage.

Common AI phrases are recognizable patterns that current models lean on heavily. "It is important to note that..." "In conclusion..." "Furthermore..." "Delve into..." These are not forbidden phrases for humans, but their frequency in AI-generated text is much higher than in casual human writing. Researchers publishing on Hugging Face have catalogued model-specific phrase frequencies across major language models, and the overlap between flagged text and these published frequency profiles is one of the better-grounded parts of the detection picture.
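A sketch of phrase-density counting, assuming a small hand-picked phrase list taken from the examples in this article. A real detector would presumably use a much larger, model-specific list. The names `AI_PHRASES` and `phrase_hits_per_100_words` are hypothetical.

```python
import re

# Illustrative list only, taken from the phrases quoted in this article.
AI_PHRASES = (
    "it is important to note that",
    "in conclusion",
    "furthermore",
    "delve into",
)

def phrase_hits_per_100_words(text: str, phrases=AI_PHRASES) -> float:
    """Count whole-phrase matches and normalize by text length."""
    lowered = text.lower()
    word_count = len(lowered.split())
    if word_count == 0:
        return 0.0
    hits = sum(
        len(re.findall(r"\b" + re.escape(p) + r"\b", lowered)) for p in phrases
    )
    return 100 * hits / word_count

sample = ("Furthermore, it is important to note that we delve into the data. "
          "In conclusion, it works.")
print(phrase_hits_per_100_words(sample))  # 25.0
```

Normalizing by length matters: four hits in a 16-word passage is a very different signal from four hits in a 2,000-word article.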

Hedging frequency measures how often the text adds qualifiers like "may," "might," "could potentially," and "generally speaking." Modern AI systems are heavily trained to hedge, both for safety and for plausibility, so heavy hedging at the sentence level is a strong AI signal.

A detector score is a weighted blend of these features. Different detectors weight them differently, which is the first reason two detectors disagree on the same text.
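To make the weighting point concrete, here is a sketch in which two hypothetical detectors score the same feature values with different weights. All numbers are invented for illustration, and actual detectors are generally trained models rather than hand-set weights.

```python
def blend(features: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of feature values, each scaled 0.0 (human-like) to 1.0 (AI-like)."""
    total = sum(weights.values())
    return sum(weights[name] * features[name] for name in weights) / total

# Same text, same measured features (made-up values).
features = {"uniformity": 0.9, "common_vocab": 0.5, "ai_phrases": 0.2, "hedging": 0.4}

# Two hypothetical detectors that weight those features differently.
detector_a = {"uniformity": 0.6, "common_vocab": 0.2, "ai_phrases": 0.1, "hedging": 0.1}
detector_b = {"uniformity": 0.1, "common_vocab": 0.2, "ai_phrases": 0.5, "hedging": 0.2}

print(round(blend(features, detector_a), 2))  # 0.7
print(round(blend(features, detector_b), 2))  # 0.37
```

Neither number is wrong. Detector A cares mostly about uniformity, which this text has, while detector B cares mostly about stock phrases, which it lacks.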

Why Two Detectors Give Different Scores

Beyond the feature weights, detectors differ in what they were trained on. A detector built when GPT-3 was the dominant model will recognize the statistical fingerprint of GPT-3 output well, but it may not recognize newer models that produce more varied output. A detector built primarily on academic essays will misjudge marketing copy. A detector calibrated on long-form blog posts will be unreliable on tweet-sized snippets.

Research published through the Association for Computational Linguistics has documented this calibration problem across multiple detector studies: detection accuracy depends heavily on how similar the input text is to the training distribution, and accuracy falls off sharply on out-of-distribution writing.

There is also a length problem. Short text is hard to score reliably because the statistical signals need length to stabilize. A detector that gives a confident 85% AI score on a single sentence is overclaiming. The same tool on a 600-word passage will be substantially more reliable, because the statistics have more data to average over.
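One way to see why length matters, assuming a detector averages some per-sentence statistic and that sentences are roughly independent: the standard error of an average shrinks with the square root of the number of samples. The spread value of 6.0 below is invented for illustration.

```python
import math

def standard_error(stdev: float, n_sentences: int) -> float:
    """Standard error of a mean estimated from n independent samples."""
    return stdev / math.sqrt(n_sentences)

# Assumed per-sentence spread of 6.0 (in whatever unit the feature uses).
for n in (1, 4, 36):
    print(n, standard_error(6.0, n))
# 1 6.0
# 4 3.0
# 36 1.0
```

With one sentence the estimate is as noisy as the underlying statistic itself. It takes 36 sentences to cut that noise to a sixth.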

Finally, post-editing breaks the signal. A passage that started as AI output and was then edited by a human - even modestly - looks different to the detector than the original AI output. Some detectors catch this kind of mixed text and flag it as "partially AI." Others get fooled completely and call it human. Neither answer is fully right, because the text genuinely is partially AI.

What Detectors Catch Reliably

Despite the disagreement problem, AI detectors are not useless. They catch certain patterns reliably:

Unedited model output at default settings. Text that came straight out of a chat interface, was copied without modification, and has the model's characteristic length and hedging patterns - this is the easy case, and most detectors will flag it correctly.

Bulk content farm output. Sites generating hundreds of articles a day from low-effort AI prompts produce text with very consistent statistical signatures. Detectors trained on similar material tend to catch it with high accuracy.

Specific overused phrases. "Delve into," "tapestry of," "in the realm of," "navigate the complexities of" - these phrases have become so closely associated with current model output that even readers without detector tools notice them. A detector catches them mechanically.

Heavy hedging in declarative content. A how-to guide that hedges every assertion ("you may potentially want to consider..." "this could possibly help...") is producing a strong AI fingerprint, because that hedging pattern is not how confident human writers express themselves about practical topics.

If a detector is being used to surface text in these categories for human review, it can be a useful screening tool. The mistake is treating the screening output as the final answer.

What They Get Wrong

Detectors also have predictable failure modes that bias their output in misleading directions:

Formal academic writing scores high as AI. A well-written academic paragraph, especially in fields with strong stylistic conventions, looks statistically similar to AI output. Detectors will flag dissertations, journal articles, and formal essays at high AI rates - even when the writing predates the existence of modern language models. Research summaries from organizations like Stanford's Human-Centered AI Institute note that the false positive rate on formal writing is one of the most persistent detector failures.

Non-native English writing scores high as AI. Writers working in English as a second language often use more standardized vocabulary, less idiomatic phrasing, and more careful sentence structure. These features overlap heavily with the AI signal, so non-native writers are falsely flagged at rates substantially above the general population. This is well-documented enough that some universities have stopped using detectors as evidence in academic misconduct cases.

Translated text scores high as AI. Machine translation, whether AI-driven or older statistical translation, produces output with the same uniformity and vocabulary patterns that detectors associate with AI generation. Translated articles get flagged as AI even when the original human-written source predates any AI involvement.

Tightly edited copy scores high as AI. A piece of writing that has been edited and tightened until it is clean, clear, and compact looks more like AI output than a rougher first draft. Good editing converges toward statistical patterns the detector reads as AI.

The common thread in these failure modes is that good or constrained human writing often looks more like AI than messy first-draft human writing does. This is a problem if the detector is being used to decide whether a person did their own work.

How to Read a Detector Score Honestly

Given all of this, what is the right way to use a detector score? A few rules help:

Treat the score as a flag for review, not a verdict. A high AI score means "something about this text matches the AI pattern, look closer." It does not mean "this was written by AI." The decision still requires a human reader.

Always check multiple detectors. If two of three detectors flag the text at 80%+, the signal is stronger than any one tool. If three detectors give wildly different scores, the text is in the disagreement zone and none of the scores is reliable.
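The comparison can be sketched as a small function. The 80% cutoff comes from the rule above. The 0.30 spread used to define "wildly different" is an assumption, and the function name is hypothetical.

```python
def compare_detectors(scores: list[float], high: float = 0.80,
                      max_spread: float = 0.30) -> str:
    """Summarize agreement across detector scores given on a 0.0 to 1.0 scale."""
    if len(scores) < 2:
        raise ValueError("need at least two detector scores to compare")
    flagged = sum(s >= high for s in scores)
    spread = max(scores) - min(scores)
    if flagged >= 2:
        return "strong flag"
    if spread > max_spread:  # assumed cutoff for "wildly different"
        return "disagreement zone"
    return "no strong signal"

print(compare_detectors([0.85, 0.90, 0.40]))  # strong flag
print(compare_detectors([0.92, 0.30, 0.55]))  # disagreement zone
print(compare_detectors([0.20, 0.30, 0.25]))  # no strong signal
```

Even a "strong flag" result is still only a reason to look closer, not a finding.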

Read the actual sub-scores when available. A tool that breaks down "high hedging," "low burstiness," "AI phrase density" gives you more usable information than one that reports a single number. The free AI content detector from EvvyTools reports each underlying signal separately, so you can see whether the flag is driven by phrase patterns, by uniformity, or by both.

Account for the author's writing style. If the writer normally produces tightly edited, conventional prose, a high AI flag means less. If they normally produce loose, bursty, idiomatic writing, a sudden shift toward AI-like patterns is more significant.

Length matters. Anything under about 300 words is too short to score reliably. Anything over 800 words gives the statistical signals room to stabilize. Mid-length text is the noisy zone.
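A quick pre-check along these lines, using the rough 300 and 800 word marks above and a whitespace word count. The function name and labels are hypothetical.

```python
def length_band(text: str) -> str:
    """Classify a passage by word count using the approximate cutoffs above."""
    n_words = len(text.split())
    if n_words < 300:
        return "too short"
    if n_words > 800:
        return "stable"
    return "noisy"

print(length_band("word " * 100))   # too short
print(length_band("word " * 500))   # noisy
print(length_band("word " * 1000))  # stable
```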

"Detection tools are pattern matchers, not lie detectors. The honest use is to surface text that needs human judgment, not to pretend the tool itself is making the judgment." - Dennis Traina, founder of 137Foundry

The Confession Problem

There is one additional honest issue with detectors: a high score does not tell you whether the writer used AI at all or how much. The text could be 100% AI. It could be 80% AI edited by a human. It could be 0% AI written by a competent writer whose style happens to match the statistical pattern. The detector score does not distinguish among these.

This matters most when consequences are attached. Academic disputes, freelance disputes, employer accusations - all of these have used detector scores as evidence, and in many cases the writers were producing original work that simply tripped the detector's patterns. A score is not proof, and treating it as such has led to documented unjust outcomes.

If the answer matters, the score is the start of the conversation, not the end.

A Practical Workflow

For routine editorial use - checking content from contributors, screening submissions, auditing existing pages - a sensible workflow looks like this:

  1. Run the text through two or three detectors and compare. If they agree on a high score, the signal is strong. If they disagree widely, the score is not reliable.
  2. Look at the sub-scores when the tool reports them. A flag driven by phrase patterns is different from a flag driven by uniformity.
  3. Read the text itself. Does it have the rhythm of a person thinking through a problem, or the rhythm of a model averaging across training data? A reader who has been editing copy for years can usually tell.
  4. Consider the author and context. A formal academic writer producing clean prose, a non-native English writer, or a tightly edited piece of marketing copy all score high without any AI involvement.
  5. If the answer matters - because consequences depend on it - use the score as one input among several, not as the final word.
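The steps above can be sketched as a checklist function that produces reviewer notes rather than a verdict. Everything here is a hypothetical illustration: the `Review` fields, the 0.30 spread cutoff, and the wording of the notes are assumptions, and sub-scores (step 2) are left out because they vary by tool.

```python
from dataclasses import dataclass

@dataclass
class Review:
    scores: list[float]              # one score per detector, 0.0 to 1.0
    formal_or_edited: bool = False   # academic, marketing, or tightly edited copy
    non_native_author: bool = False
    high_stakes: bool = False        # consequences depend on the answer

def triage(r: Review, high: float = 0.80) -> list[str]:
    """Return notes for a human reviewer; the function never decides by itself."""
    notes = []
    flagged = sum(s >= high for s in r.scores)
    if flagged >= 2:
        notes.append("detectors agree on a high score")
    elif max(r.scores) - min(r.scores) > 0.30:  # assumed cutoff
        notes.append("detectors disagree widely; scores unreliable")
    if r.formal_or_edited or r.non_native_author:
        notes.append("known false-positive context; discount the score")
    if r.high_stakes:
        notes.append("use the score as one input among several")
    notes.append("read the text before deciding")
    return notes

for note in triage(Review([0.90, 0.85, 0.40], non_native_author=True, high_stakes=True)):
    print(note)
```

The last note is always appended, since reading the text is the one step in the workflow that no score replaces.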

The EvvyTools tools directory has additional writing and content tools that pair well with a detector check during routine editorial review. The EvvyTools blog covers related screening approaches and follow-on tooling for content quality.

A Tool, Not a Verdict

The most useful framing for AI content detectors is the one most marketing copy avoids: they are screening tools with known calibration limits, useful for flagging text that may deserve closer review, unreliable as standalone evidence of anything.

Treated that way, they earn their keep. Run a piece through, look at the score, look at the sub-scores, and bring human judgment to the verdict. Treated as oracles, they generate false confidence in both directions: false-positive flagging of legitimate human writing, and false-negative misses on competently disguised AI output.

The tool is honest about what it measures. The score is honest about what it represents. The dishonesty enters when a reader treats either of them as more than they are.
