Why You Should Never Ask a General AI to Fact-Check Science
At Siensmetrica, we see it differently. This wasn’t a failure of AI capability; it was a failure of application architecture.
The “Generalist” Trap
We need to stop thinking of Large Language Models (LLMs) like ChatGPT, Claude, or Gemini as “all-knowing oracles.” In reality, an LLM is a powerful, flexible middleware layer. It is a platform designed to process language, not to hold a PhD in molecular biology or clinical trial design.
Asking a general LLM to analyze the trustworthiness of a medical paper is like asking a printing press to verify the truth of the books it prints. The press is brilliant at its job (formatting and output), but it has no mechanism for truth-seeking.
Rules, Prompts, and Guardrails: The Real “Intelligence”
The reason generic LLMs failed the Bixonimania test is simple: they prioritize plausibility over veracity. If a text looks like a scientific paper, the model’s default behavior is to treat it as one.
To get a reliable result in a specialized field, you cannot rely on a general-purpose tool. You need a specialized engine that provides the LLM with three things it lacks out of the box:
- Context: The specific domain knowledge of what “good” science looks like.
- Prompts: Deep, iterative instructions on what to look for (e.g., checking whether 500 nm is actually blue light).
- Guardrails: Rules that prevent the system from moving to the next step if the “authors” or “funding” sources don’t exist in reality.
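The guardrail idea above can be sketched in code. This is a minimal, hypothetical illustration (the registries, function names, and messages are invented for this sketch, not part of any real Siensmetrica API): each check must pass before the pipeline is allowed to advance to the next analysis step.

```python
# Hypothetical guardrail pipeline: every check must pass before the
# analysis moves on. All names and registries here are illustrative.
from dataclasses import dataclass


@dataclass
class Paper:
    authors: list
    funding_sources: list


KNOWN_AUTHORS = {"J. Rivera", "M. Chen"}    # stand-in identity registry
KNOWN_FUNDERS = {"NIH", "Wellcome Trust"}   # stand-in funder registry


def authors_exist(paper: Paper) -> bool:
    """Guardrail: every listed author must resolve to a real identity."""
    return all(a in KNOWN_AUTHORS for a in paper.authors)


def funders_exist(paper: Paper) -> bool:
    """Guardrail: every funding source must resolve to a real organization."""
    return all(f in KNOWN_FUNDERS for f in paper.funding_sources)


GUARDRAILS = [authors_exist, funders_exist]


def run_pipeline(paper: Paper) -> str:
    # Halt at the first failed guardrail instead of analyzing further.
    for check in GUARDRAILS:
        if not check(paper):
            return f"HALT: guardrail '{check.__name__}' failed"
    return "PASS: proceed to domain analysis"


fake = Paper(authors=["Dr. Bixoni"], funding_sources=["NIH"])
print(run_pipeline(fake))  # the fictional author fails the first check
```

The design point is the early exit: a generalist LLM has no such stop condition, so a fabricated author list never blocks it from "analyzing" the rest of the paper.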
Tessa: The Specialist Engine
When we ran the Bixonimania papers through Tessa, our platform didn’t just “read” the words. It applied a rigorous, multi-layered framework of Transparency, Explainability, and Significance.
Because Tessa is a specialized tool sitting on top of the LLM architecture, she caught the errors immediately, assigning the “research” a failing TScore of 21. Tessa wasn’t distracted by the professional formatting, because her rules forced her to verify the Fitzpatrick skin types (fictional) and the exposure parameters (physically impossible).
A New Standard for AI Interaction
The takeaway from the Bixonimania scandal shouldn’t be “don’t use AI.” It should be “don’t use a generalist for a specialist’s job.”
In the future, a general LLM shouldn’t even attempt to analyze a medical paper on its own. Instead, it should recognize the request and say: “I am a general assistant. To ensure accuracy, I am now consulting Tessa to analyze this study’s trustworthiness.”
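That delegation pattern can be sketched as a simple router. This is a hypothetical illustration only (the `tessa_analyze` stub, the routing table, and the messages are invented for this sketch): the generalist recognizes a specialist request and hands it off rather than answering directly.

```python
# Hypothetical routing sketch: the general assistant delegates
# specialist requests instead of guessing. Names are illustrative.

def tessa_analyze(paper_text: str) -> str:
    # Stand-in for the specialist engine; a real system would apply
    # domain checks (transparency, explainability, significance).
    return "trustworthiness report from specialist tool"


SPECIALISTS = {"medical_paper_analysis": tessa_analyze}


def general_assistant(request_type: str, payload: str) -> str:
    specialist = SPECIALISTS.get(request_type)
    if specialist:
        # Delegate rather than improvise: the generalist stays a router.
        return ("I am a general assistant. To ensure accuracy, "
                "I am consulting a specialist: " + specialist(payload))
    return "Handled directly by the general model."


print(general_assistant("medical_paper_analysis", "...paper text..."))
```

The key design choice is that the routing decision happens before any analysis: the generalist's only job is to recognize that the request is out of its depth.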
Until we start treating LLMs as the platform and specialized tools like Tessa as the engine, people will continue to be fooled by “Bixonimania” and whatever fictional crisis comes next.