After two decades as an analytics executive across tech-enabled services companies, plus an MBA concentrated in enterprise analytics, I’ve developed a reflex: when a model’s output looks too clean, I check what it’s actually measuring. A study released this month gave me a fresh reason to trust that reflex.
Researchers from the University of Maryland, the National University of Singapore, and Ohio State took 2,245 human-written resumes from before ChatGPT existed, had seven major LLMs rewrite each one, then asked each AI to pick the better version.
Every model picked itself. GPT-4o did it 97.6% of the time. LLaMA-3.3-70B: 96.3%. Qwen-2.5-72B: 95.9%. DeepSeek-V3: 95.5%. When the researchers controlled for actual writing quality – bringing in human judges who rated the originals as clearer, more coherent, and more effective – the AIs still preferred the AI version.
This is the AI wrapper problem made visible.
A lot of what’s being sold right now as “AI hiring” or “AI screening” is a thin layer wrapped around a general-purpose LLM, asked to make qualitative judgments it was never built to make. When you ask an LLM to evaluate a resume, it isn’t really evaluating substance. It’s pattern-matching against its own statistical fingerprint – the analytical-but-confident tone, the specific word distributions, the cadence of “spearheaded” and “leveraged” and “drove cross-functional alignment.” And when the evaluator is itself a word-prediction algorithm, it picks up on those stylistic patterns with far more consistency than any human reviewer ever could.
So the wrapper looks intelligent. The output reads coherently. But under the hood, you’re not getting an evaluation of qualifications. You’re getting an evaluation of how AI-shaped the writing is.
That’s the wrong question to be asking. And as someone who has spent a career designing measurement systems and machine learning pipelines, I can tell you the failure mode here is a familiar one: when the metric and the evaluator share the same underlying generator, you stop measuring the thing you care about and start measuring the generator’s preferences. It’s a textbook leakage problem dressed up in a friendlier interface.
When we built Marovi, we accounted for this bias from the start. Our matching engine doesn’t ask “which resume is better written?” It doesn’t care whether your bullets were polished by ChatGPT or written longhand. It evaluates the substance: does this candidate actually have multiple instances of doing Y, in a Z-level role, at an ABC-type company, in the last X years, with accomplishments that map to what the target role requires?
That’s a fundamentally different question from the one an LLM wrapper answers. “Tell me who has the better experience” collapses into phrase similarity and stylistic pattern-matching – exactly what the Maryland study just exposed. “Does this candidate have experience in role Y, at company type Z, with frequency F and recency R, weighted against the target like this…” is structured judgment built on aggregated data points. It’s how an expert recruiter actually thinks, and it’s how any defensible evaluation system needs to be designed.
You can see it in our match justifications. Every score traces back to specific facts: years of relevant experience, recency, frequency of similar accomplishments, company and industry context. Not “this candidate reads as senior.” Not “the writing suggests strong leadership.” The actual qualifications, weighted the way a domain expert would weight them.
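To make the contrast concrete, here is a minimal sketch of what a structured, fact-based score could look like. It is not Marovi’s engine: the `Experience` and `Target` schemas, the `WEIGHTS` values, the three-role saturation point, and the `score` function are all assumptions invented for illustration. The point it shows is narrow: the resume’s wording is never an input, and every component of the score comes back with a justification tied to a specific fact.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Experience:
    """One structured fact extracted from a resume (hypothetical schema)."""
    role_family: str    # e.g. "analytics"
    level: str          # e.g. "director"
    company_type: str   # e.g. "tech-enabled services"
    end_year: int       # last year the candidate held the role


@dataclass(frozen=True)
class Target:
    """What the open role requires (hypothetical schema)."""
    role_family: str
    level: str
    company_type: str
    lookback_years: int  # the "last X years" window


# Illustrative weights only; a real system would calibrate these.
WEIGHTS = {"frequency": 0.5, "recency": 0.3, "context": 0.2}


def score(experiences, target, current_year):
    """Return (score in [0, 1], list of fact-based justifications)."""
    if target.lookback_years <= 0:
        raise ValueError("lookback_years must be positive")

    cutoff = current_year - target.lookback_years
    matches = [
        e for e in experiences
        if e.role_family == target.role_family
        and e.level == target.level
        and e.end_year >= cutoff
    ]
    if not matches:
        return 0.0, ["no matching roles inside the lookback window"]

    # Frequency: number of matching roles, saturating at three (assumption).
    frequency = min(len(matches), 3) / 3
    # Recency: 1.0 if the latest match ended this year, falling linearly
    # to 0.0 at the edge of the lookback window.
    latest = max(e.end_year for e in matches)
    recency = min(1.0, (latest - cutoff) / target.lookback_years)
    # Context: share of matching roles held at the target company type.
    same_type = sum(e.company_type == target.company_type for e in matches)
    context = same_type / len(matches)

    total = (
        WEIGHTS["frequency"] * frequency
        + WEIGHTS["recency"] * recency
        + WEIGHTS["context"] * context
    )
    reasons = [
        f"{len(matches)} matching {target.level} role(s) since {cutoff}",
        f"most recent matching role ended in {latest}",
        f"{same_type} of {len(matches)} at a {target.company_type} company",
    ]
    return total, reasons


if __name__ == "__main__":
    history = [
        Experience("analytics", "director", "tech-enabled services", 2024),
        Experience("analytics", "director", "software", 2021),
        Experience("analytics", "manager", "tech-enabled services", 2015),
    ]
    role = Target("analytics", "director", "tech-enabled services", 10)
    total, reasons = score(history, role, current_year=2025)
    print(round(total, 3))
    for line in reasons:
        print("-", line)
```

Because the function only ever sees structured fields, rewriting the same career history in a more AI-shaped voice cannot move the number. Only a change in the underlying facts can.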
This matters more by the month. As more candidates use AI to polish their materials, and more recruiters deploy AI wrappers to screen them, we’re heading toward a closed feedback loop where the resume that sounds most like ChatGPT gets picked by the tool that thinks like ChatGPT. Actual qualifications drift further from the signal.
The way out isn’t smarter prompting on top of a general LLM. It’s purpose-built evaluation that treats experience matching as the structured, multi-variable problem it actually is.
Jin Ro is a former analytics executive with deep experience in data and machine learning across tech-enabled services companies, and holds an MBA with a concentration in enterprise analytics.
