Why You Should Never Ask a General AI to Fact-Check Science
At Siensmetrica, we see it differently. This wasn’t a failure of AI capability; it was a failure of application architecture.
The “Generalist” Trap
We need to stop thinking of Large Language Models (LLMs) like ChatGPT, Claude, or Gemini as “all-knowing oracles.” In reality, an LLM is a powerful, flexible middleware layer. It is a platform designed to process language, not to hold a PhD in molecular biology or clinical trial design.
Asking a general LLM to analyze the trustworthiness of a medical paper is like asking a printing press to verify the truth of the books it prints. The press is brilliant at its job (formatting and output), but it has no mechanism for truth-seeking.
Rules, Prompts, and Guardrails: The Real “Intelligence”
The reason generic LLMs failed the Bixonimania test is simple: they prioritize plausibility over veracity. If a text looks like a scientific paper, the model’s default behavior is to treat it as one.
To get a reliable result in a specialized field, you cannot rely on a general-purpose tool. You need a specialized engine that provides the LLM with three things it lacks out of the box:
- Context: The specific domain knowledge of what “good” science looks like.
- Prompts: Deep, iterative instructions on what to look for (e.g., checking whether 500 nm is actually blue light).
- Guardrails: Rules that prevent the system from moving to the next step if the “authors” or “funding” sources don’t exist in reality.
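The guardrail idea above can be sketched in code. This is a minimal, hypothetical illustration (the registries, function names, and messages are invented for this sketch, not part of any real Siensmetrica API): each check must pass before the pipeline is allowed to advance to the next analysis step.

```python
# Hypothetical guardrail pipeline: every check must pass before the
# analysis moves on. All names and registries here are illustrative.
from dataclasses import dataclass


@dataclass
class Paper:
    authors: list
    funding_sources: list


KNOWN_AUTHORS = {"J. Rivera", "M. Chen"}    # stand-in identity registry
KNOWN_FUNDERS = {"NIH", "Wellcome Trust"}   # stand-in funder registry


def authors_exist(paper: Paper) -> bool:
    """Guardrail: every listed author must resolve to a real identity."""
    return all(a in KNOWN_AUTHORS for a in paper.authors)


def funders_exist(paper: Paper) -> bool:
    """Guardrail: every funding source must resolve to a real organization."""
    return all(f in KNOWN_FUNDERS for f in paper.funding_sources)


GUARDRAILS = [authors_exist, funders_exist]


def run_pipeline(paper: Paper) -> str:
    # Halt at the first failed guardrail instead of analyzing further.
    for check in GUARDRAILS:
        if not check(paper):
            return f"HALT: guardrail '{check.__name__}' failed"
    return "PASS: proceed to domain analysis"


fake = Paper(authors=["Dr. Bixoni"], funding_sources=["NIH"])
print(run_pipeline(fake))  # the fictional author fails the first check
```

The design point is the early exit: a generalist LLM has no such stop condition, so a fabricated author list never blocks it from "analyzing" the rest of the paper.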
Tessa: The Specialist Engine
When we ran the Bixonimania papers through Tessa, our platform didn’t just “read” the words. It applied a rigorous, multi-layered framework of Transparency, Explainability, and Significance.
Because Tessa is a specialized tool sitting on top of the LLM architecture, she caught the errors immediately, assigning the “research” a failing TScore of 21. Tessa wasn’t distracted by the professional formatting, because her rules forced her to verify the Fitzpatrick skin types (fictional) and the exposure parameters (physically impossible).
A New Standard for AI Interaction
The takeaway from the Bixonimania scandal shouldn’t be “don’t use AI.” It should be “don’t use a generalist for a specialist’s job.”
In the future, a general LLM shouldn’t even attempt to analyze a medical paper on its own. Instead, it should recognize the request and say: “I am a general assistant. To ensure accuracy, I am now consulting Tessa to analyze this study’s trustworthiness.”
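That delegation pattern can be sketched as a simple router. This is a hypothetical illustration only (the `tessa_analyze` stub, the routing table, and the messages are invented for this sketch): the generalist recognizes a specialist request and hands it off rather than answering directly.

```python
# Hypothetical routing sketch: the general assistant delegates
# specialist requests instead of guessing. Names are illustrative.

def tessa_analyze(paper_text: str) -> str:
    # Stand-in for the specialist engine; a real system would apply
    # domain checks (transparency, explainability, significance).
    return "trustworthiness report from specialist tool"


SPECIALISTS = {"medical_paper_analysis": tessa_analyze}


def general_assistant(request_type: str, payload: str) -> str:
    specialist = SPECIALISTS.get(request_type)
    if specialist:
        # Delegate rather than improvise: the generalist stays a router.
        return ("I am a general assistant. To ensure accuracy, "
                "I am consulting a specialist: " + specialist(payload))
    return "Handled directly by the general model."


print(general_assistant("medical_paper_analysis", "...paper text..."))
```

The key design choice is that the routing decision happens before any analysis: the generalist's only job is to recognize that the request is out of its depth.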
Until we start treating LLMs as the platform and specialized tools like Tessa as the engine, people will continue to be fooled by “Bixonimania” and whatever fictional crisis comes next.