P-lausible Deniability: Why Flawed Statistics Are Undermining Scientific Research

Every year, thousands of scientific breakthroughs depend on a single number: the p-value. But how often is that number wrong? According to our analysis, far more often than you might think. Luckily, we have tools to check.
Picture this: A groundbreaking medical study claims a new drug reduces blood pressure by 30%. The result hinges on a fragment of a single sentence: "statistically significant, t(38) = 2.21, p = 0.0332". Journals accept it. Headlines announce it. High-risk patients get their hopes up.
But what do the t and the p actually mean? And what if the statistics are wrong?
A study by John, Loewenstein, & Prelec (2012) found that roughly 1 in 5 psychologists admitted to the questionable research practice of fudging p-values to make them appear smaller, and thus make the results more likely to be published.
Based on our internal analysis of millions of articles from PubMed, we have found that approximately 1 in 3 papers with statistical claims contains at least one detectable statistical inconsistency, where the reported numbers simply don't add up. We have also found that most statistical reporting is not self-contained: the pieces needed to determine the p-value are scattered across paragraphs and figures, making them difficult to verify.
How can this be?
Anatomy of a Statistical Claim: The P-Value Explained
The p-value (the p in the sentence above) is often treated like a magical verdict, but it's the final step in a logical process. To understand its power and fragility, we need to see how it's made.
In the hypothetical study above, imagine that researchers gathered a group of patients, gave half the new drug and half a placebo, and measured the difference in blood pressure between the two groups.
Step 1: Start with Skepticism (The Null Hypothesis)
Science begins with a default, skeptical stance called the null hypothesis (H₀). In our case, H₀ states: "The drug has no effect. Any difference we see between the groups is just random chance." The goal of the experiment is to see if we can gather enough evidence to confidently reject this skeptical view.
Step 2: Measure the Evidence (The Test Statistic)
Researchers can't just say, "the drug group's blood pressure was lower." They need to quantify it. They use a test statistic—a single number that summarizes how far their data deviates from the world of the null hypothesis.
Think of it as a "signal-to-noise" ratio. For a simple comparison like ours, a t-test is common. The resulting t-statistic measures the difference between the groups (the signal) and standardizes it by the variability within the groups (the noise). A large t-statistic means the signal is strong compared to the background noise.
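To make this concrete, here is a minimal R sketch of how such a t-statistic might be computed; the group sizes and blood-pressure changes below are made-up numbers chosen purely for illustration:
set.seed(42)                                  # reproducible fake data
drug    <- rnorm(20, mean = -12, sd = 10)     # hypothetical change in blood pressure, drug group
placebo <- rnorm(20, mean = -2,  sd = 10)     # hypothetical change in blood pressure, placebo group
# Student's t-test: difference in means (signal) scaled by pooled variability (noise)
t.test(drug, placebo, var.equal = TRUE)
With 20 patients per group, this test has 38 degrees of freedom, the same df as in the hypothetical result above.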
Step 3: Define "Surprising" (The Distribution)
This is the most crucial—and often misunderstood—part. How do we know if our t-statistic is "large" enough to be meaningful?
We compare it to a theoretical distribution. This is a mathematical curve, like the famous bell curve, that shows us what t-statistics we would expect to get if the null hypothesis were true and we ran this exact experiment thousands of times. The shape of this curve (e.g., a t-distribution) is precisely determined by the degrees of freedom, which are directly related to the sample size.
A larger sample size leads to a narrower, more "confident" distribution, making it easier to spot a real effect. This is why a test statistic from a study with 1,000 patients means more than the same statistic from a study with 20.
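You can see the effect of sample size directly in R: qt gives the cutoff a t-statistic must exceed to be declared significant at the usual two-sided 5% level (the patient counts are the ones from the comparison above):
qt(0.975, df = 18)    # 20 patients in total (two groups of 10): cutoff is about 2.10
qt(0.975, df = 998)   # 1,000 patients (two groups of 500): cutoff drops to about 1.96
The larger study needs a smaller signal-to-noise ratio to clear the bar, which is exactly what a narrower distribution means.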
Step 4: Calculate the Probability (The P-Value)
Finally, we arrive at the p-value.
Definition: The p-value is the probability of observing a test statistic as extreme as (or more extreme than) the one we found, assuming the null hypothesis is true.
It answers the question: "In a world where the drug does nothing, how often would we see a result this impressive just by luck?"
If our p-value is 0.0332, it means that a result this strong would only happen 3.32% of the time if the drug were useless. Since that's quite unlikely, we feel justified in rejecting the null hypothesis and concluding the drug likely has a real effect.
If, on the other hand, the p-value comes out above 0.05, we fail to reject the null hypothesis: we don't have enough evidence to say the drug has a real effect. The results become much harder to publish, and the drug doesn't get approved (it is a lot more complicated than that, but that's the gist of it).
Notice how the entire chain of logic depends on just a few reported numbers, which, as shown below, are all you need to recompute the p-value:
- The test statistic (e.g., t = 2.21)
- The degrees of freedom (e.g., df = 38)
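Given just those two numbers, anyone can redo the calculation; a one-line check in R (using pt, the t-distribution's tail probability) is enough:
t_stat   <- 2.21
deg_free <- 38
2 * pt(t_stat, df = deg_free, lower.tail = FALSE)   # about 0.033, matching the reported p = 0.0332
This is exactly the kind of recalculation that the checks described below automate.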
Common Statistical Inconsistencies
The chain above is vulnerable to several common errors, both unintentional and intentional. Based on our internal analyses, the most common are:
1. The Vanishing Details Problem
The most common issue isn't calculation errors—it's missing information. Studies report conclusions without providing the statistical details needed to verify them:
- P-values without test statistics
- Test statistics without degrees of freedom
- Conclusions without sample sizes
- Results buried in figures without extractable numbers
2. P-Hacking and The Temptation to Publish
This isn't always an innocent mistake. The immense pressure to publish creates a powerful incentive to cheat, a practice known as p-hacking. It can range from selectively reporting only significant results and trying different statistical tests until one produces p < 0.05 to, in the most blatant cases, simply falsifying the numbers.
This temptation is real. In a landmark survey of over 2,000 psychologists, 22% admitted to the questionable research practice of "rounding down" a p-value to make it cross the 0.05 threshold (John, Loewenstein, & Prelec, 2012). When the difference between publication and rejection comes down to a few thousandths in a p-value, the pressure to fudge the numbers can be immense.
3. The Calculation Cascade
Sometimes the numbers are all there, but they don't add up. Common errors include:
- P-values that don't match the reported test statistics
- Degrees of freedom that don't align with sample sizes (see the quick check after this list)
- Test statistics that are mathematically impossible given the data
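As a quick illustration of the second kind of error: the degrees of freedom of a two-sample t-test should equal n1 + n2 - 2, so the sample sizes reported in the methods section pin down what df is allowed (the numbers here are hypothetical):
n1 <- 15
n2 <- 15                 # methods section: two groups of 15 patients
n1 + n2 - 2              # expected degrees of freedom: 28
# If the results section then reports t(38) = 2.21, the degrees of freedom don't line up.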
4. The Multiple Testing Mirage
When researchers run dozens of statistical tests but only report the significant ones, they create a false impression of certainty. With the standard significance level of 0.05, you'd expect 1 in 20 tests to be "significant" just by chance. Run enough tests, and you're guaranteed to find "significant" results—even in completely random data.
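A small simulation makes the point; it runs 20 two-group comparisons on pure noise, so every "significant" result is a false positive (the exact count varies with the random seed):
set.seed(123)
# 20 t-tests comparing two groups drawn from the same distribution (no real effect anywhere)
p_values <- replicate(20, t.test(rnorm(20), rnorm(20))$p.value)
sum(p_values < 0.05)     # on average about 1 of the 20 comes out "significant" by chance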
Free Tools to Check for Statistical Inconsistencies
Thankfully, the research community has developed tools to address these issues. Integrity checkers like statcheck can parse APA-style statistics and check them for inconsistencies. Here is a short tutorial on how to use statcheck in R.
Let's say you copied the following text from a paper:
the effect was very significant (F(2, 65) = 3.02, p < .05)
We can use the statcheck function to check the statistical claims in a text.
# Install statcheck if it is not already available, then load it
if (!require("statcheck")) install.packages("statcheck")
library(statcheck)
# Run statcheck on the copied sentence: it extracts the APA-style result and recomputes the p-value
txt <- "the effect was very significant (F(2, 65) = 3.02, p < .05)"
stat <- statcheck(txt)
stat
The output of the code above is:
##   Source Statistic df1 df2 Test.Comparison Value Reported.Comparison Reported.P.Value
## 1      1         F   2  65               =  3.02                   <             0.05
##     Computed                      Raw Error DecisionError
## 1 0.05569781 F(2, 65) = 3.02, p < .05  TRUE          TRUE
As you can see, the computed p-value is 0.05569781, which would change the conclusion from "statistically significant" to "not statistically significant"; hence DecisionError is TRUE.
There are other free tools to check for statistical inconsistencies. The GRIM test checks whether the reported means of integer data, such as Likert-type scales, are consistent with the given sample size and number of items. GRIM and other tests are included in an ambitious R package called scrutiny, which aims to be a comprehensive toolkit for integrity checks.
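The idea behind GRIM is simple enough to sketch in a few lines of base R; this is only an illustration of the principle, not the scrutiny implementation, and the example values are made up:
# With n whole-number responses, the reported mean must equal (some integer) / n.
grim_consistent <- function(reported_mean, n, max_item = 10, digits = 2) {
  possible_sums  <- 0:(n * max_item)                 # every achievable integer total
  possible_means <- round(possible_sums / n, digits)
  any(abs(possible_means - reported_mean) < 1e-8)
}
grim_consistent(5.19, n = 28)   # FALSE: no 28 whole-number responses average to 5.19
grim_consistent(5.21, n = 28)   # TRUE: 146 / 28 rounds to 5.21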
ReviewerZero AI: An Integrated Solution
While these tools are valuable, they impose strict formatting requirements and are time-consuming to apply by hand. Based on our internal evaluation, we find 43% more statistical reports than statcheck can parse.
ReviewerZero AI automatically runs a whole suite of checks:
- Extract All Statistical Claims: Using AI, we find statistical tests, even when they don't follow a strict format.
- Verify P-Values: We recalculate p-values from the reported test statistic (e.g., t, F, χ²) and degrees of freedom to check for inconsistencies.
- Check for Completeness: Our system also flags statistical claims that are missing key information needed for verification.
- Analyze Rounding Patterns: We account for rounding conventions that might otherwise lead to false positives.
While statistical analysis is a core feature, our platform also offers tools for Image Integrity, Citation Verification, and Author Verification. You can learn more about all our capabilities on our features page.
Interested in learning more about research integrity? Join Our Closed Beta or Schedule a call to learn more about our platform.