
Statistical Checks Update: Effect Sizes and Confidence Intervals

statistical-integrity, effect-sizes, confidence-intervals
Daniel Acuna

We have expanded ReviewerZero's statistical checks beyond in-text p-values. The system now verifies effect sizes, confidence intervals, and the values reported in tables, including odds ratios, hazard ratios, and regression coefficients. This post lists what is now covered.

The regression table above looks unremarkable. Three of its rows contain a statistical error, but none of the errors is visible at a glance. Confirming that a table like this is correct means checking every cell by hand: recomputing the test statistic, the p-value, and the interval, and matching each against the reported significance. That is slow, easy to get wrong, and a single paper can contain many such tables.

We have expanded ReviewerZero's statistical checks to do that checking automatically. In addition to recomputing in-text p-values, the system now verifies effect sizes and confidence intervals, including the values reported in tables: odds ratios, hazard ratios, risk ratios, and regression coefficients with their standard errors. This post lists each new capability, and at the end returns to the table above.

In an earlier post on recomputing reported p-values, we described how the system recomputes a p-value from a reported test statistic and degrees of freedom, so that an error such as "t(38) = 2.21, p = 0.0332" is caught before it reaches the published literature. Many statistical claims, however, are not reported as a single in-text sentence; they appear in tables, as the effect sizes and confidence intervals the checks below now cover.

The status ladder

For every statistic it can reconstruct, the system returns one of four statuses:

  • ✅ Verified. The reported value was recomputed and the numbers agree.
  • 🟠 Worth a look. The reported value does not match the recomputed one, but the conclusion is unchanged. This is often a typo or rounding slip, so it is raised as a suggestion.
  • 🟠 Unable to verify. The paper did not report enough information to check this value. The system reports that rather than guessing.
  • 🔴 Issue. A genuine contradiction: the conclusion does not hold, or the numbers are impossible.

A false flag against a correct result is more costly than a missed flag, so the threshold for the red status is high and the underlying computation is conservative. Most results are verified.

1. Effect sizes are now testable

Previously, the checks targeted in-text test statistics and did not evaluate effect sizes such as odds ratios, hazard ratios, or regression coefficients, because an effect size is not itself a test statistic. This update adds those checks. The principle is that an effect size reported with a confidence interval, or with a standard error, already carries the information needed to check it: the interval shows how precise the estimate is, and that precision is exactly what a p-value depends on.

So for these table rows the system reconstructs the implied p-value and compares it against the significance the authors report. When the two agree, the row is verified. When a result is marked significant but its interval is too wide to support that claim, or marked non-significant when its interval clearly excludes no effect, the discrepancy is flagged.
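The reconstruction described above can be sketched in a few lines. This is a minimal illustration, not ReviewerZero's implementation: it assumes a normal (Wald) approximation, a 95% interval, and ratio measures that are symmetric on the log scale. The function names `two_sided_p`, `p_from_coef_se`, and `p_from_ratio_ci` are hypothetical.

```python
import math

Z_95 = 1.959964  # two-sided 95% critical value of the standard normal


def two_sided_p(z):
    """Two-sided p-value for a standard normal z statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def p_from_coef_se(coef, se):
    """Implied p-value for a regression coefficient and its standard error
    (assumes a normal approximation rather than a t distribution)."""
    return two_sided_p(coef / se)


def p_from_ratio_ci(ratio, lower, upper):
    """Implied p-value for an odds, hazard, or risk ratio with a 95% CI.
    The standard error is recovered from the interval width on the log scale."""
    se_log = (math.log(upper) - math.log(lower)) / (2 * Z_95)
    return two_sided_p(math.log(ratio) / se_log)


# Hypothetical rows, not taken from any paper:
p_coef = p_from_coef_se(0.50, 0.25)        # z = 2, p a little under 0.05
p_ratio = p_from_ratio_ci(1.5, 0.8, 2.8)   # interval spans 1.0, p well above 0.05
```

The implied p-value can then be compared with the significance the row reports. A t-based version would need the degrees of freedom, which tables do not always provide.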

2. Confidence interval versus reported p-value

Earlier versions of our statistical checks did not cross-check a confidence interval against the p-value reported in the same row. This update adds that comparison: when both the confidence interval and the reported p-value are available, the system checks that they agree. Consider an odds ratio of 1.56 with a 95% CI of (1.22, 2.08) and a reported p-value of 0.042. A lower bound of 1.22 sits well above the no-effect value of 1.0. Under the usual normal approximation on the log scale, that interval implies a p-value near 0.001, a much stronger result than a p-value of 0.042 would suggest. The two values are inconsistent and should be reconciled.

Because the result is significant either way, this is a reporting discrepancy rather than a wrong conclusion. It is classed as 🟠 worth a look.
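As a rough sketch of this comparison, the snippet below recomputes the p-value implied by the odds ratio example and assigns a status. The status mapping, the `alpha` default, and the `rel_tol` tolerance are assumptions made for illustration; the post does not state the thresholds the system actually uses, and the function names are hypothetical.

```python
import math


def implied_p_from_ratio_ci(ratio, lower, upper, z_crit=1.959964):
    """p-value implied by a ratio and its 95% CI (normal approximation, log scale)."""
    se_log = (math.log(upper) - math.log(lower)) / (2 * z_crit)
    z = math.log(ratio) / se_log
    return math.erfc(abs(z) / math.sqrt(2))


def compare_ci_with_p(ratio, lower, upper, reported_p, alpha=0.05, rel_tol=0.5):
    """Return (status, implied_p). Assumed rule: a flipped conclusion is an issue,
    a large gap with the same conclusion is worth a look, otherwise verified."""
    implied = implied_p_from_ratio_ci(ratio, lower, upper)
    if (implied < alpha) != (reported_p < alpha):
        return "issue", implied
    if abs(implied - reported_p) > rel_tol * max(implied, reported_p):
        return "worth a look", implied
    return "verified", implied


# The example from the text: OR 1.56, 95% CI (1.22, 2.08), reported p = 0.042.
status, implied = compare_ci_with_p(1.56, 1.22, 2.08, reported_p=0.042)
# implied is roughly 0.001; both values are below 0.05, so the conclusion holds
# and only the size of the gap is flagged.
```

Under these assumed thresholds the example row comes back as "worth a look", matching the classification above.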

3. Non-standard significance asterisk legends

The meaning of an asterisk in a table is not standardized. In one paper * means p < 0.05; in another, the footnote reads "*, **, *** denote significance at the 10%, 5%, and 1% levels," so a single asterisk means only p < 0.10. Earlier, the checks assumed a fixed 0.05 cutoff for every asterisk, which would incorrectly flag a coefficient that is significant at the level the authors stated. This update replaces that fixed cutoff: the system now reads each table's own legend and evaluates every starred cell against that table's threshold.
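To make the idea concrete, here is a small sketch of reading a legend and checking a starred cell against it. It handles only two legend shapes chosen for illustration, and `parse_star_legend` and `star_is_supported` are hypothetical names; real footnotes vary far more than this and the post does not describe how the system parses them.

```python
import re
from typing import Dict, Optional


def parse_star_legend(legend: str) -> Dict[str, float]:
    """Map each star marker to its p-value threshold for two assumed legend shapes."""
    # Shape 1: "* p < 0.05, ** p < 0.01, *** p < 0.001"
    pairs = re.findall(r"(\*+)\s*p\s*<\s*(0?\.\d+)", legend)
    if pairs:
        return {stars: float(p) for stars, p in pairs}
    # Shape 2: "*, **, *** denote significance at the 10%, 5%, and 1% levels"
    stars = re.findall(r"\*+", legend)
    percents = re.findall(r"(\d+(?:\.\d+)?)\s*%", legend)
    if stars and len(stars) == len(percents):
        return {s: float(pct) / 100 for s, pct in zip(stars, percents)}
    return {}  # legend not understood


def star_is_supported(stars: str, p_value: float, legend: str) -> Optional[bool]:
    """True if the p-value meets the threshold the legend assigns to this marker.
    Returns None when the legend does not define the marker."""
    thresholds = parse_star_legend(legend)
    if stars not in thresholds:
        return None
    return p_value < thresholds[stars]


econ_legend = "*, **, *** denote significance at the 10%, 5%, and 1% levels"
ok = star_is_supported("*", 0.08, econ_legend)  # supported under this table's legend
```

A coefficient with p = 0.08 and one star is consistent with the second legend, although a fixed 0.05 cutoff would have flagged it.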

4. Logical impossibilities

Some contradictions can be detected with arithmetic and logic alone, without reference to any sampling distribution. These sit in the 🔴 red status:

  • A point estimate outside its own confidence interval. "0.14 (95% CI [0.20, 0.25])" is impossible, because the estimate must fall inside its interval. A value was likely transposed.
  • A confidence interval that crosses the null yet is marked significant. A risk ratio of 0.614 with a 95% CI of (0.336, 1.120) carrying a star is inconsistent, because an interval that contains 1.0 cannot be significant at that level.
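Both checks in the list above reduce to simple comparisons, sketched here with hypothetical function names. The second check assumes the star refers to the significance level that matches the interval (5% for a 95% CI) and that the null value is 1.0 for ratios and 0.0 for differences or coefficients.

```python
def estimate_inside_ci(estimate, lower, upper):
    """A point estimate must lie within its own confidence interval."""
    return lower <= estimate <= upper


def starred_despite_null_in_ci(lower, upper, null_value, starred):
    """True when a row is marked significant although its interval contains
    the no-effect value (assumes the star matches the interval's level)."""
    return starred and lower <= null_value <= upper


# The two examples from the list above:
inside = estimate_inside_ci(0.14, 0.20, 0.25)                       # False: impossible row
contradiction = starred_despite_null_in_ci(0.336, 1.120, 1.0, True)  # True: flagged
```

Neither check needs a standard error or a distribution, which is why a failure can be reported with high confidence.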

5. Reporting limits and rounding reconciliation

Two behaviors keep the checks precise.

Reporting insufficient inputs. If a cell reports only a bare asterisk, or a value that cannot be mapped to a coefficient and its standard error, the system marks it 🟠 unable to verify rather than produce a verdict it cannot support.

Reconciling within rounding. Before flagging a value, the system checks whether the reported number falls inside the range that rounding of the statistic and its degrees of freedom could produce. Only a value outside that range is surfaced. A reported p = 0.665 next to a recomputed 0.6672, for example, is consistent with rounding and is not flagged.
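A simplified sketch of the rounding check follows. The post does not say which statistic produced the 0.665 example, so this illustration assumes a hypothetical z statistic reported as 0.43 to two decimals; a z statistic is used instead of a t statistic with degrees of freedom only to keep the code within the standard library. The function names are also hypothetical.

```python
import math


def two_sided_p(z):
    """Two-sided p-value for a standard normal z statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def p_range_from_rounded_z(z_reported, decimals):
    """Range of p-values compatible with a z statistic rounded to `decimals` places."""
    half = 0.5 * 10 ** (-decimals)
    z_low = max(abs(z_reported) - half, 0.0)
    z_high = abs(z_reported) + half
    # A larger |z| gives a smaller p, so the bounds swap.
    return two_sided_p(z_high), two_sided_p(z_low)


def consistent_with_rounding(z_reported, z_decimals, p_reported, p_decimals):
    """True if the reported p (itself rounded) overlaps the compatible range."""
    p_low, p_high = p_range_from_rounded_z(z_reported, z_decimals)
    p_half = 0.5 * 10 ** (-p_decimals)
    return p_reported + p_half >= p_low and p_reported - p_half <= p_high


# Assumed statistic: z = 0.43 recomputes to p of about 0.6672, yet a reported
# 0.665 still lies inside the range that rounding of z allows.
not_flagged = consistent_with_rounding(0.43, 2, 0.665, 3)
```

With these assumed inputs the reported value reconciles and would not be surfaced, while a reported p of 0.600 for the same statistic would fall outside the range.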

Back to the table

The table at the top of this post is the kind the checks above are built to read. Three of its rows do not hold up.

The opening table, reviewed: three rows flagged, the rest verified.
  • Household income (log): 🔴 Issue. The coefficient, 0.41, falls outside its own 95% confidence interval of [0.48, 0.69]. A point estimate cannot lie outside its interval.
  • Female: 🔴 Issue. The 95% confidence interval, [-0.03, 0.39], includes zero, so the coefficient cannot be significant at the 5% level, yet it is starred and reported at p = 0.041.
  • Anxiety score (GAD-7): 🟠 Worth a look. The confidence interval, [-0.128, -0.050], excludes zero, and the standard error implies a small p-value, but the row reports p = 0.27 as non-significant. The reported p-value was likely mistyped.

The other rows reconcile and are marked ✅ verified, including the model fit statistics. Each verdict is a direct recomputation from the values in the table, returned in seconds rather than the minutes it takes to check every cell by hand.

Summary

These additions extend the statistical checks from recomputing a single in-text p-value to reviewing the prose and the tables together: test statistics and effect sizes, confidence intervals, and asterisk legends. The result is closer to a second statistician reading the whole paper.



Authors

Daniel Acuna


Founder & CEO

Daniel Acuna is the founder of ReviewerZero, dedicated to using AI to detect and prevent research integrity issues.