
METHODS article

Front. Psychol., 24 April 2018
Sec. Quantitative Psychology and Measurement

Hypothesis-Testing Demands Trustworthy Data—A Simulation Approach to Inferential Statistics Advocating the Research Program Strategy

  • 1Geneva School of Economics and Management, University of Geneva, Geneva, Switzerland
  • 2Institute for Psychology, University of Hamburg, Hamburg, Germany
  • 3Department of Philosophy, Lund University, Lund, Sweden

In psychology as elsewhere, the main statistical inference strategy to establish empirical effects is null-hypothesis significance testing (NHST). The recent failure to replicate allegedly well-established NHST-results, however, implies that such results lack sufficient statistical power, and thus feature unacceptably high error-rates. Using data-simulation to estimate the error-rates of NHST-results, we advocate the research program strategy (RPS) as a superior methodology. RPS integrates Frequentist with Bayesian inference elements, and leads from a preliminary discovery against a (random) H0-hypothesis to a statistical H1-verification. Not only do RPS-results feature significantly lower error-rates than NHST-results, RPS also addresses key-deficits of a “pure” Frequentist and a standard Bayesian approach. In particular, RPS aggregates underpowered results safely. RPS therefore provides a tool to regain the trust the discipline had lost during the ongoing replicability-crisis.

Introduction

Like all sciences, psychology seeks to establish stable empirical hypotheses, and only "methodologically well-hardened" data provide such stability (Lakatos, 1978). In analogy, data we cannot replicate are "soft." Recent attempts to replicate allegedly well-established results of null-hypothesis significance testing (NHST), however, have broadly failed, as did the five preregistered replications, conducted between 2014 and 2016, reported in Perspectives on Psychological Science (Alogna et al., 2014; Cheung et al., 2016; Eerland et al., 2016; Hagger et al., 2016; Wagenmakers et al., 2016). This implies that the error-proportions of NHST-results generally are too large, for many more replication attempts should otherwise have succeeded.

We can partially explain the replication failure of NHST-results by citing questionable research practices that inflate the Type-I error probability (false positives), as signaled by a large α-error (Nelson et al., 2018). If researchers collect undersized samples, moreover, then this raises the Type-II error probability (false negatives), as signaled by a large β-error (The latter implies a lack of sufficient test-power, i.e., 1 − β). Ceteris paribus, as these errors increase, the replication-probability of a true hypothesis decreases, thus lowering the chance that a replication attempt obtains a data-pattern similar to that of the original study. Since NHST remains the main statistical inference strategy in empirical psychology, many today (rightly) view the field as undergoing a replicability-crisis (Erdfelder and Ulrich, 2018).

It adds severity that this crisis extends beyond psychology—to medicine and health care (Ioannidis, 2014, 2016), genetics (Alfaro and Holder, 2006), sociology (Freese and Peterson, 2017), and political science (Clinton, 2012), among other fields (Fanelli, 2009)—and affects each field as a whole. A 50% replication-rate in cognitive psychology vs. a 25% replication-rate in social psychology (Open Science Collaboration, 2015), for instance, merely makes the first subarea appear less crisis-struck than the second. Since all this keeps us from having too much trust in our published empirical results, the term "confidence-crisis" is rather apt (Baker, 2015; Etz and Vandekerckhove, 2016).

That the details of how researchers standardly employ NHST have come under doubt has sparked renewed interest in statistical inference. Indeed, many researchers today self-identify as either Frequentists or Bayesians, and align with a "school" (Fisher, Neyman-Pearson, Jeffreys, or Wald). However, statistical inference as a whole offers no more (nor less) than a probabilistic logic to estimate the support that a hypothesis, H, receives from data, D (Fisher, 1956; Hacking, 1965; Edwards, 1972; Stigler, 1986). This estimate is technically an inverse probability, known as the likelihood, L(H|D), and (rightly) remains central to Bayesians.

An important precondition for calculating L(H|D) is the probability of D given H, p(D,H). Unlike L(H|D), we cannot determine p(D,H) other than by induction over data. This (rightly) makes p(D,H) central to Frequentists. Testing H against D—in the sense of estimating L(H|D)—thus presupposes induction, but nevertheless remains conceptually distinct from it. Indeed, the term "test" in "NHST" misleads, for NHST tests only p(D,H), but not L(H|D). This may explain why publications regularly over-report an NHST-result as supporting a hypothesis. Indeed, many researchers appear to misinterpret NHST as the statistical hypothesis-testing method it emphatically is not.

To clarify why testing p(D,H) conceptually differs from testing L(H|D), this article compares NHST with the research program strategy (RPS), a hybrid-approach that integrates Frequentist with Bayesian statistical inference elements (Witte and Zenker, 2016a,b, 2017a,b). As "stand-ins" for real empirical data, we here simulate the distribution of a (dependent) variable in hypothetical treatment- and control-groups, and thereby that variable's arithmetic mean in both groups. Our simulated data are sufficiently similar to data that actual studies would collect for purposes of assessing whether an independent, categorical variable (e.g., an experimental manipulation) significantly influences a dependent variable. Therefore, simulating the parameter-range over which hypothetical data are sufficiently replicable does estimate whether actual data are stable, and hence trustworthy.
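
To illustrate the kind of simulated data we have in mind, the following minimal R sketch (not the supplementary R-code referenced below; the sample size, seed, and the assumed effect-size of d = 0.5 are arbitrary, illustrative choices) draws one hypothetical treatment- and control-group and compares their means with a one-sided t-test.

```r
# Minimal illustration: one hypothetical dataset with a standardized mean
# difference of d = 0.5, compared with a one-sided two-sample t-test.
set.seed(1)
n <- 100                               # participants per group (illustrative)
d <- 0.5                               # assumed true effect-size (illustrative)
control   <- rnorm(n, mean = 0, sd = 1)
treatment <- rnorm(n, mean = d, sd = 1)
t.test(treatment, control, alternative = "greater", var.equal = TRUE)
```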

We outline RPS [section The Research Program Strategy (RPS)], detail three statistical measures (section Three Measures), explain purpose, method, and the key-result of our simulations (section Simulations), offer a critical discussion (section Discussion), then compare RPS to a “pure” Frequentist and a standard Bayesian approach (section Frequentism Vs. Bayesianism Vs. RPS), and finally conclude (section Conclusion). As supplementary material, we include the R-code, a technical appendix, and an online-app to verify quickly that a dataset is sufficiently stable1.

The Research Program Strategy (RPS)

With the construction of empirical theories as its main aim, RPS distinguishes the discovery context from the justification context (Reichenbach, 1938). The discovery context readily lets us induce a data-subsuming hypothesis without requiring reference to a theoretical construct. Rather, discerning a non-random data-pattern, as per p(D,H0) < α ≤ 0.05, here sufficiently warrants accepting, as a data-abbreviation, the H1-hypothesis that best fits D. Focusing on non-random effects alone, then, discovery context research is fully data-driven.

In the justification context, by contrast, data shall firmly test a theoretical H1-hypothesis, i.e., verify or falsify the H1 probabilistically. A hypothesis-test must therefore pitch a theoretical H1-hypothesis either against the (random) H0-hypothesis, or against some substantial hypothesis besides the H1 (i.e., H2, …, Hn−1, Hn). Were the H1-hypothesis we are testing indistinct from the data-abbreviating H1, however, then data once employed to induce the H1 would now confirm it, too. As this would level the distinction between theoretical and inductive hypotheses, it would make "hypothesis-testing" an empty term. Hence, justification context research must postulate a theoretical H1.

Having described and applied RPS elsewhere (Witte and Zenker, 2016a,b, 2017a,b), we here merely list the six (individually necessary and jointly sufficient) RPS-steps to a probabilistic hypothesis-verification2.

Preliminary discovery

The first step discriminates a random fluctuation (H0) from a systematic empirical observation (H1), measured either by the p-value (Fisher) or the α-error (Neyman-Pearson). Under accepted errors, we achieve a preliminary H1-discovery if the empirical effect sufficiently deviates from a random event.

Substantial discovery

Neyman-Pearson test-theory (NPTT) states the probability that a preliminary discovery is replicable as 1 − β, aka test-power. If we replicate a preliminary discovery while α- and β-error (hereafter: α, β) remain sufficiently small, a preliminary H1-discovery turns into a substantial H1-discovery.

Preliminary falsification

A substantial H1-discovery may entail that we thereby preliminarily falsify the H0 (or another point-hypothesis). As the falsification criterion, we propose that the likelihood-ratio of the theoretical effect-size d > 0, postulated by the H1, and of a null-effect d = 0, postulated by the H0, i.e., L(d > 0|D)/L(d = 0|D), must exceed Wald's criterion (1 − β)/α (Wald, 1943).

Substantial falsification

A preliminary H0-falsification turns into a substantial H0-falsification if the likelihood-ratio of all theoretical effect-sizes that exceed the minimum theoretical effect-size d > δ = dH1 − dH0, and of the H0 (d = 0), i.e., L(d > δ|D)/L(d = 0|D), exceeds the same criterion, i.e., (1 − β)/α.

Preliminary verification

We achieve a preliminary H1-verification if the likelihood ratio of the point-valued H1 (d = δ) and the H0 (d = 0) exceeds, again, (1 − β)/α.

Substantial verification

Having preliminarily verified the H1(d=δ) against the H0(d=0), we now test how similar δ is to the empirical (“observed”) effect-size's maximum-likelihood-estimate, MLE(demp). As our verification criterion, we propose the ratio of both likelihood-values (i.e., the maximal ordinate of the normal distribution divided by its ordinate at the 95% interval-point), which is approximately 4 (see next section). If δ's likelihood falls within the 95%-interval centered on MLE(demp), then we achieve a substantial H1-verification. This means we now accept “H1(d=δ)” as shorthand for the effect-size our data corroborate statistically.

RPS thus starts in the discovery context by using p-values (Fisher), proceeds to an optimal3 test of a non-zero effect-size against either a random-model or an alternative model (Neyman-Pearson), and reaches—entering into the justification context—a statistical verification of a theoretically specified effect-size based on probably replicable data4 (see Figure 1). All along, of course, we must assume accepted α- and β-error.

FIGURE 1

Figure 1. The six steps of the research program strategy (RPS).
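
To make the six steps concrete, the following R sketch (our illustration, not the supplementary code) computes, for one simulated dataset, the quantities behind steps 1, 2, 5, and 6 under a normal approximation to the sampling distribution of the empirical effect-size; the interval-based falsification ratios of steps 3 and 4 could be obtained analogously, e.g., by maximizing the likelihood over d > 0, respectively d > δ. All numerical settings (α = β = 0.05, δ = 0.5) are illustrative.

```r
# Sketch of the quantities behind RPS-steps 1, 2, 5, and 6 for one dataset,
# using a normal approximation to the sampling distribution of d_emp.
set.seed(2)
alpha <- 0.05; beta <- 0.05; delta <- 0.5        # illustrative settings

# Step 2 presupposes N_MIN per group for the chosen alpha, beta, and delta:
n <- ceiling(power.t.test(delta = delta, sd = 1, sig.level = alpha,
                          power = 1 - beta, type = "two.sample",
                          alternative = "one.sided")$n)

treatment <- rnorm(n, delta, 1)
control   <- rnorm(n, 0, 1)

# Steps 1-2: (preliminary/substantial) discovery via the p-value, given N = N_MIN.
p_value <- t.test(treatment, control, alternative = "greater",
                  var.equal = TRUE)$p.value

# Empirical effect-size and its approximate standard error.
sd_pool <- sqrt(((n - 1) * var(treatment) + (n - 1) * var(control)) / (2 * n - 2))
d_emp   <- (mean(treatment) - mean(control)) / sd_pool
se      <- sqrt(2 / n)

# Step 5: preliminary verification if L(d = delta|D)/L(d = 0|D) >= (1 - beta)/alpha.
LR_step5  <- dnorm(d_emp, delta, se) / dnorm(d_emp, 0, se)
criterion <- (1 - beta) / alpha                  # Wald-type criterion (= 19 here)

# Step 6: substantial verification if the ratio of the maximal ordinate (at the
# MLE, d_emp) to the ordinate at delta stays below ~4, i.e., delta lies inside
# the 95%-interval centered on d_emp.
ratio_step6 <- dnorm(d_emp, d_emp, se) / dnorm(delta, d_emp, se)

c(p_value = p_value, LR_step5 = LR_step5, criterion = criterion,
  ratio_step6 = ratio_step6)
```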

In what we call the data space, RPS-steps 1 and 2 thus evaluate probabilities; RPS-steps 3–5 evaluate likelihoods in the hypotheses space; and RPS-step 6 returns to the data space. For data alone determine if the point-hypothesis from RPS-step 5 is substantially verified at RPS-step 6, or not. As if in a circle, then, RPS balances three steps in the data space (1, 2, 6) with three steps in the hypotheses space (3, 4, 5).

Importantly, individual research groups rarely command sufficient resources to collect a sufficiently large sample that achieves the desirably low error-rates a well-powered study requires (see note 3). To complete all RPS-steps, therefore, groups must coordinate joint efforts, which requires a method to aggregate underpowered studies safely (We return to this toward the end of our next section).

Since RPS integrates Frequentist with Bayesian statistical inference-elements, the untrained eye might discern an arbitrary "hodgepodge" of methods. Of course, the Frequentist and Bayesian schools both harbor advantages and disadvantages (Witte, 1994; Witte and Kaufman, 1997; Witte and Zenker, 2017b). For instance, Bayesian statistics allows us to infer hypotheses from data, but normally demands greater effort than using Frequentist methods. The simplicity and ubiquity of Frequentist methods, by contrast, facilitates the application and communication of research results. But it also risks neglecting assumptions that affect the research process, or misinterpreting such statistical magnitudes as confidence intervals or p-values (Nelson et al., 2018). Decisively, however, narrowly sticking to any one school would simply avoid attempting to integrate each school's best statistical inference-elements into an all-things-considered best strategy. RPS does just this.

RPS motivates the selection of these elements by its main goal: to construct informative empirical theories featuring precise parameters and hypotheses. As RPS-step 1 exhausts the utility of α, or the p-value (preliminary discovery), for instance, β additionally serves at RPS-step 2 (substantial discovery). In general, RPS deploys inference elements at any subsequent step (e.g., the effect-size at RPS-steps 2–5; confidence intervals at RPS-step 6) to sequentially increase the information of a preceding step's focal result.

Unlike what RPS may suggest, of course, the actual research process is not linear. Researchers instead stipulate both the hypothesis-content and the theoretical effect-size freely. Nevertheless, a hypothesis-test deserving its name—one estimating L(H|D), that is—requires replicable rather than “soft” data, for such data alone can meaningfully induce a stable effect-size.

RPS therefore measures three qualities: induction quality of data, as well as falsification quality and verification quality of hypotheses, to which we now turn.

Three Measures

This section defines three measures and their critical values in RPS. The first measure estimates how well data sustain an induced parameter; the second and third measure estimate how well replicable data undermine and, respectively, support a hypothesis5.

Def. induction quality: Based on NPTT, we measure induction quality as α and β, given a fixed sample size, N, and two point-valued hypotheses, H0 and H1, yielding the effect-size difference dH1 − dH0 = δ.

The measure presupposes the effect-size difference dH1 − dH0 = δ, for otherwise we could not determine test-power (1 − β).

Since induction quality pertains to the (experimental) conditions under which one collects data, the measure qualifies an empirical setting's sensitivity. Whether a setting is acceptable, or not, rests on convention, of course. RPS generally promotes α = β = 0.05, or even α = β = 0.01, as the right sensitivity (see section Frequentism Vs. Bayesianism Vs. RPS). By contrast, α = 0.05 and β = 0.20 are normal today. Since β/α = 0.20/0.05 = 4, this makes it four times more important to discover an effect than to replicate it—an imbalance that counts toward explaining the replicability-crisis.

A decisive reason to instead equate both errors (α = β) is that this avoids a bias pro detection (α) and contra replicability (1 − β). Given acceptable induction quality, a substantial discovery thus arises if the probability of data passes the critical value (1 − β)/α. Under α = β = 0.05, for instance, we find that (1 − β)/α = 0.95/0.05 = 19. Hence, for the H1 to be statistically significantly more probable than the H0, we require that p(D,H1) = 19 × p(D,H0).
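
As a design-stage illustration (our own sketch; the effect-size δ = 0.5 is an arbitrary choice), the following R lines contrast the conventional allocation α = 0.05, β = 0.20 with the balanced allocation α = β = 0.05, showing how NMIN and the criterion (1 − β)/α change before any data are collected.

```r
# Design-stage comparison: conventional vs. balanced error allocation for an
# assumed effect-size delta = 0.5 (one-sided two-sample t-test).
conventional <- power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.80,
                             type = "two.sample", alternative = "one.sided")$n
balanced     <- power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.95,
                             type = "two.sample", alternative = "one.sided")$n
c(N_per_group_conventional = ceiling(conventional),   # alpha = 0.05, beta = 0.20
  N_per_group_balanced     = ceiling(balanced),       # alpha = beta = 0.05
  criterion_conventional   = 0.80 / 0.05,             # (1 - beta)/alpha = 16
  criterion_balanced       = 0.95 / 0.05)             # (1 - beta)/alpha = 19
```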

Thus, for hypothetical data, we evidently can fully determine induction quality prior to data-collection. The measure therefore says nothing about the focal outcome of a hypothesis-test. As we evaluate L(H|D) in the justification context, by contrast, the same measure nevertheless quantifies the trust that actual data deserve or—as the case may be—require.

Def. falsification quality: Based on Wald's theory, we measure falsification quality as the likelihood-ratio of all hypotheses the effect-size of which exceeds either the H0 (preliminary falsification) or δ (substantial falsification), and the point-valued H0, i.e., L(d > 0|D)/L(d = 0|D). Our proposed falsification-threshold (1−β)/α thus depends on induction quality of data.

The falsification quality measure rests on both the H1 and a fixed amount of actual data. It comparatively tests the point-valued H0 against all point-alternative hypotheses that exceed dH1 − dH0 = δ. For instance, α = β = 0.05 obviously yields the threshold 19 (or log 19 = 2.94); α = β = 0.01 yields 99 (log 99 = 4.59), etc6. Since it is normally unrealistic to set α = β = 0, "falsification" here demands a statistical sense, rather than one grounded in an observation that a deterministic law cannot subsume. Thus, a statistical falsification is fallible rather than final.

The same holds for verification:

Def. verification quality: Again based on Wald's theory, we measure verification quality as the likelihood-ratio of a point-valued H1 and a substantially falsified H0. The threshold for a preliminary verification is again (1 − β)/α (and thus, too, depends on induction quality of data). As the threshold for a substantial verification, we propose the value 4.

To explain this value: the ratio of a normal curve's maximal ordinate, viz., 0.3989, to its ordinate at the 95%-interval point, viz., 0.10, is approximately 4, and serves as our confirmation threshold. A H1-verification thus remains merely preliminary if the likelihood-ratio of the maximum-likelihood-estimate (MLE) of data and the theoretical parameter exceeds ≈4. Hence, a ratio < 4 sees the theoretical parameter lie inside the 95%-interval, and RPS thus achieves a substantial verification.
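
A one-line check in R reproduces this threshold from the ordinates just mentioned (a sketch; the one-sided 95% point is used here, which is what yields the stated ordinate of roughly 0.10):

```r
# Maximal ordinate of the standard normal and its ordinate at the 95% point.
dnorm(0)                          # 0.3989...
dnorm(qnorm(0.95))                # ~0.103
dnorm(0) / dnorm(qnorm(0.95))     # ~3.87, i.e., approximately 4
```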

Following Popper (1959), many take hypothesis-verification to be impossible in a deterministic sense. Understood probabilistically, by contrast, even a substantial verification of one point-valued hypothesis against another such hypothesis is error-prone (Zenker, 2017). The non-zero proportion of false negative decisions thus keeps us from verifying even the best-supported hypothesis absolutely. We can therefore achieve at most relative verification.

Assume we have managed to verify a parameter preliminarily. If the MLE now deviates sufficiently from that parameter's original theoretical value, then we must either modify the parameter accordingly, or may otherwise (deservedly) be admonished for ignoring experience. The MLE thus acts as a stopping-rule, signaling when we may (temporarily) accept a theoretical parameter as substantially verified.

The six RPS steps thus obtain a parameter we can trust to the extent that we accept the error probabilities. Unless strong reasons motivate doubt that our data are faithful, indeed, the certainty we invest into this parameter ought to mirror (1–β), i.e., the replication-probability of data closely matching a true hypothesis (Miller and Ulrich, 2016; Erdfelder and Ulrich, 2018).

Before sufficient amounts of probably replicable data arise in praxis, however, we must normally integrate various studies that each fail the above thresholds. RPS's way of integration is to add, across studies, the log-likelihood-ratios of two point-hypotheses, each of which is "loaded" with the same prior probability, p(H1) = p(H0) = 0.50. This method, known as log-likelihood-addition, lets RPS aggregate data of insufficient induction quality by relying on the well-known equation:

L(H1|D)/L(H0|D) = [p(H1) × p(D,H1)] / [p(H0) × p(D,H0)]    (1)
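
A brief numerical illustration of equation (1), with arbitrary made-up data probabilities: under equal priors the prior ratio cancels, and for independent datasets the combined ratio equals the exponentiated sum of the individual log-LRs.

```r
# Equation (1) with equal priors: the prior ratio cancels, leaving the
# likelihood-ratio; independent datasets combine by adding log-LRs.
pH1 <- pH0 <- 0.50
pD_H1 <- c(0.030, 0.012)     # hypothetical p(D, H1) for two independent datasets
pD_H0 <- c(0.010, 0.008)     # hypothetical p(D, H0)
(pH1 * prod(pD_H1)) / (pH0 * prod(pD_H0))   # combined ratio as in (1)
exp(sum(log(pD_H1 / pD_H0)))                # identical: summed log-LRs, exponentiated
```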

We proceed to simulate select values from the full parameter-range of possible RPS-results. These values are diverse enough to extrapolate to implicit values safely. The subsequent sections offer a discussion and then compare RPS to alternative methodologies.

Simulations

Overview

Using R-software, we simulate data for hypothetical treatment- and control-groups, calculate the group-means, and then compare these means with a t-test. While varying both induction quality of data and the effect-size, we simulate the resulting error rates. Since the simulated error-proportions of a t-test approximate the error-probability of data, this determines the parameter-range over which empirical results (such as those that RPS's six steps obtain) are stable, and hence trustworthy.

In particular, we estimate:

(i) the necessary sample size, NMIN, in order to register, under (1–β), the effect-size δ as a statistically significant deviation from random7;

(ii) the p-value, as the most commonly used indicator in NHST;

(iii) the likelihood that the empirical effect-size d(emp) exceeds the postulated effect-size δ, i.e., L(d > δ|D), as a measure of substantial falsification;

(iv) the likelihood of the H0, i.e., L(d = 0|D), as a measure of Type-I and Type-II errors;

(v) the likelihood of the H1, i.e., of the true effect-size, L(d = δ|D), as a measure of preliminary verification;

(vi) the maximum-likelihood-estimate of data, MLE(x), when compared to the likelihood of the H1, as a measure of substantial verification.

We conduct five simulations. Simulations 1 and 2 estimate the probability of true positive and false negative results as a function of the effect-size and test-power. Our significance level is set to α = 0.05, respectively to α = 0.01. Simulation 3 estimates the probability of false positive results. The remaining two simulations address engaging with data in post-hoc fashion. Simulation 4 evaluates shedding 10% of data that least support the focal hypothesis. To address research groups' individual inability to collect the large samples that RPS demands, Simulation 5 mimics collaborative research by adding the log-likelihood-ratios of underpowered studies.

Simulation 1

Purpose

Simulation 1 manipulates the test-power and the true effect-size to estimate the false negative error-rates (respectively the true positive rate) throughout RPS's six steps.

Method

We simulate 16 datasets that each contain 100 samples of identical size and variance. We represent a sample by the mean of a normally distributed variable in two independent groups (treatment and control), summarized with the test-statistic t. Between these 16 datasets, we vary the effect-size δ = [0.01, 0.2, 0.5, 0.8], and thus vary the difference between the group-means. We also vary test-power (1–β) = [0.4, 0.5, 0.8, 0.95], and thus let induction quality range from "very poor," i.e., (1–β) = 0.4, to "medium," i.e., (1–β) = 0.95. Under α = 0.05 (one-sided), we estimate NMIN to meet the respective test-power (Simulation 2 tightens the significance level to α = 0.01).
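
A rough sketch of this design in R (not the supplementary code; δ = 0.01 is omitted and only 100 samples per cell are drawn here, to keep the illustration's runtime small):

```r
# For each combination of delta and test-power, draw 100 samples of size N_MIN
# per group and record the proportion of one-sided p-values below alpha.
set.seed(10)
alpha <- 0.05
grid  <- expand.grid(delta = c(0.2, 0.5, 0.8), power = c(0.4, 0.5, 0.8, 0.95))
grid$P <- mapply(function(delta, power) {
  n <- ceiling(power.t.test(delta = delta, sd = 1, sig.level = alpha,
                            power = power, type = "two.sample",
                            alternative = "one.sided")$n)
  mean(replicate(100, t.test(rnorm(n, delta), rnorm(n, 0),
                             alternative = "greater",
                             var.equal = TRUE)$p.value < alpha))
}, grid$delta, grid$power)
grid   # the proportions should roughly track the nominal test-power
```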

Results and Discussion

For both the experimental and the control group, Table 1 lists NMIN to register the effect-size δ as a statistically significant deviation from random (substantial discovery). Generally, given constant test-power (1–β), the smaller (respectively larger) δ is, the larger (smaller) is NMIN. This shows how NMIN depends on δ and on (1–β).

TABLE 1

Table 1. The estimated minimum sample size for a two sample t-test as a function of test-power (1–β) and effect size δ, given α = 0.05.

For the sample sizes in Table 1, moreover, Table 2 states the proportion of p-values that fall below α = 0.05, given a test-power value. This estimates the probability of a substantial discovery. As the standard deviation of the p-value here indicates, we retain a large variance across samples especially for data of low induction quality.

TABLE 2

Table 2. The proportion P of substantial discoveries, indicated by p-values below the significance level α = 0.05, as a function of the effect-size δ and test-power (1–β).

TABLE 3

Table 3. The proportion of substantial falsifications and preliminary verifications, as indicated by the respective likelihood ratio (LR) meeting or exceeding the threshold (1 − β)/α.

As with Table 1, Table 2 shows that the larger the test-power value is, the larger is the proportion of substantial discoveries, ceteris paribus. We obtain a similar result when estimating the probability of a substantial falsification or a preliminary verification, as per the likelihood-ratios L(d > 0|D)/L(d = 0|D) and L(d = δ|D)/L(d = 0|D) meeting the threshold (1 − β)/α.

In case of a preliminary verification, however, we obtain a larger proportion of false negative results than in case of a substantial falsification, for in verification we narrowly test a point-valued H0 against a point-valued H1, whereas in falsification we test a point-valued H0 against an interval H1. The verification criterion is therefore "less forgiving" than the falsification criterion.

Using bar plots to illustrate the distribution of likelihood-ratios (LRs) for a preliminary verification, Figure 2 shows that LRs often fall below the threshold (1 − β)/α. However, if data are only of medium induction quality (α = β = 0.05), we find a large proportion of LRs > 3. We should therefore not immediately reject the H1 if 3 < LR < (1 − β)/α, because LR > 3 indicates some evidence for the H1. Instead, we should supply additional data before evaluating the LR. If we increase the sample by 50% of its original size, N/2, for instance, but the LR still falls below the threshold, then we may add yet another N/2, and only then sum the log-LRs (see the sketch after Figure 2). If this too fails to yield a preliminary H1-verification (or a H0-verification), then we may still use this empirical result as a parameter-estimate which future studies might test.

FIGURE 2

Figure 2. Illustration of true positives. Bar plots indicate the frequencies of likelihood ratios (L(d > δ|D)/L(d = 0|D) set in light gray, and L(d = δ|D)/L(d = 0|D) in dark gray) that, respectively, fall above the criterion (1 − β)/α (two leftmost bars), between this criterion and three (two middle bars), and below three (two rightmost bars), as a function of induction quality of data, provided the H1 is true, under α = 0.05 [itself defined via d and (1–β), the latter here abbreviated as "pow"].
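
The sketch below gives one possible reading of this augmentation rule (our illustration under the same normal approximation as before; the batch size, δ, and the cap of two extra batches follow the description above but remain assumptions):

```r
# One reading of the augmentation rule: if 3 < LR < (1 - beta)/alpha, collect a
# further N/2 observations per group (at most twice) and add the batches' log-LRs.
log_lr_batch <- function(treat, ctrl, delta) {
  n  <- length(treat)
  sd_pool <- sqrt(((n - 1) * var(treat) + (n - 1) * var(ctrl)) / (2 * n - 2))
  d  <- (mean(treat) - mean(ctrl)) / sd_pool
  se <- sqrt(2 / n)
  log(dnorm(d, delta, se)) - log(dnorm(d, 0, se))
}
set.seed(4)
alpha <- 0.05; beta <- 0.05; delta <- 0.5; n <- 40    # illustrative settings
crit  <- log((1 - beta) / alpha)                      # log(19)
total <- log_lr_batch(rnorm(n, delta), rnorm(n, 0), delta)
extra <- 0
while (total < crit && total > log(3) && extra < 2) {
  total <- total + log_lr_batch(rnorm(n / 2, delta), rnorm(n / 2, 0), delta)
  extra <- extra + 1
}
total >= crit   # preliminary H1-verification reached after (at most) two additions?
```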

An important caveat is that the likelihood-ratio measures the distance between data and hypothesis only indirectly. Even though the likelihood steadily increases as the mean of data approaches the effect-size that the H1 postulates, we cannot infer this distance from the LR alone, but must study the distribution itself. For otherwise, even if LR ≥ (1 − β)/α, we would risk verifying the H1 although the observed mean of data does not originate with the H1-distribution, but with a distinct distribution featuring a different mean.

Moving beyond RPS-step 5, we can only address this caveat adequately by constraining the data-points that substantially verify the H1 to those lying in an acceptable area of variance around the H1. Table 4 reports the proportion of preliminarily H1-verifying samples that now fail the criterion for a substantial H1-verification, and thus amount to additional false negatives. We can reduce these errors by increasing the sample size, which generally reduces the error-probabilities.

TABLE 4

Table 4. The proportion of preliminary verifications as per LR ≥ (1 − β)/α, given the empirical effect-size d lies outside the interval comprising 95% of expected values placed around the H1, where L(d|D)/L(d = δ|D) > pdf(P50|d)/pdf(P95|d) ≈ 4.

To account for the decrease in β after constraining the sample in RPS-step 5, of course, the value of the threshold (1 − β)/α now is higher, too. Hence, meeting it becomes more demanding. RPS-step 6 nevertheless increases our certainty that the data-mean originates with the hypothesized H1-distribution, and so increases our certainty in the theoretical parameter.

Table 5 states the proportion of datasets that successfully complete RPS's six steps, i.e., preliminary and substantial discovery (steps 1, 2) as well as preliminary and substantial falsification and verification (steps 3–6). For data of low to medium induction quality, we retain a rather large proportion of false negatives.

TABLE 5

Table 5. The proportion of substantial verifications, after substantial discoveries and subsequent preliminary verifications were obtained, given the H0 had been substantially falsified.

Simulation 2

Purpose

To reduce the proportion of false negatives, as we saw, we must increase induction quality of data. Simulation 2 illustrates this by lowering the error-rates.

Method

Repeating the procedure of Simulation 1, but having tightened the error-rates from α = β = 0.05 to α = β = 0.01, we consequently obtain test-power (1–β) = 0.99. This also tightens the threshold from LR > 19 to LR > 99. We drop the smallest effect-size of Simulation 1 (δ = 0.01), for (1–β) = 0.99, after all, makes NMIN = 432,952 unrealistically large (see note 3). Simulation 2 therefore comprises three datasets (each with 100 samples) and manipulates the effect-size as δ = [0.2, 0.5, 0.8].
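
The NMIN just cited can be checked with a standard a priori power analysis in R (a sketch; we assume a one-sided two-sample t-test with unit variance):

```r
# A priori power analysis for delta = 0.01 under alpha = beta = 0.01.
power.t.test(delta = 0.01, sd = 1, sig.level = 0.01, power = 0.99,
             type = "two.sample", alternative = "one.sided")$n
# per group; roughly the 432,952 cited above
```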

Results and Discussion

For these three effect sizes, Table 6 states NMIN under α = β = 0.01. Again, the larger (smaller) the effect is, the smaller (larger) is NMIN. Simulated p-values continue to reflect the test-power value almost perfectly (see Table 7). Further, the proportion of preliminary verifications and substantial falsifications (see Table 8) approaches the proportion of substantial discoveries (see Table 7).

TABLE 6

Table 6. Sample size for a t-test as a function of δ, given α = β = 0.01.

TABLE 7

Table 7. The proportion of substantial discoveries (indicated by the p-value) as a function of δ, given α = β = 0.01.

TABLE 8

Table 8. The proportion of substantial falsifications and preliminary verifications, indicated by the respective LR, as a function of δ under α = β = 0.01.

Under high induction quality of data, the proportion of false negative verifications now is acceptable, too. When applying the corroboration criterion for a substantial verification, we thus retain only a very small number of additional false negative verifications (see Table 9).

TABLE 9

Table 9. The proportion of preliminary verifications as per LR ≥ (1 − β)/α, where the empirical effect-size d, however, lies outside the area spanned by the 95%-interval of expected values centered on the H1, and where L(d|D)/L(d = δ|D) > pdf(P50|d)/pdf(P95|d) ≈ 4.

Table 10 reports the proportion of simulated datasets that successfully complete RPS-steps 3–6 in the justification context (preliminary H0-falsification to substantial H1-verification). As before, increasing induction quality of data decreases the proportion of false negative results.

TABLE 10

Table 10. The proportion of substantial verifications (subsequent to achieving substantial discoveries and preliminary verifications), given that the H0 was substantially falsified under α = β = 0.01.

Simulation 3

Purpose

We have so far estimated the probability of true positive and false negative results as per the LR and the p-value. To estimate also the probability of false positive results, Simulation 3 assumes hypothetical effect-sizes and sufficiently large samples to accord with simulated test-power values.

Method

Simulating four datasets (100 samples each), Simulation 3 matches the sample-size to the test-power values (1–β) = [0.4, 0.5, 0.8, 0.95] for a hypothetical effect-size δ = 0.2. In all datasets, the simulated true effect-size is δ = 0.
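
A rough sketch of this design (our illustration, with 100 samples per cell):

```r
# True effect delta = 0, but N per group is sized for a hypothetical delta = 0.2
# at each test-power value; record the proportion of false positive discoveries.
set.seed(11)
alpha <- 0.05
sapply(c(0.4, 0.5, 0.8, 0.95), function(power) {
  n <- ceiling(power.t.test(delta = 0.2, sd = 1, sig.level = alpha, power = power,
                            type = "two.sample", alternative = "one.sided")$n)
  mean(replicate(100, t.test(rnorm(n, 0), rnorm(n, 0),
                             alternative = "greater",
                             var.equal = TRUE)$p.value < alpha))
})
# each proportion should hover around the nominal alpha = 0.05
```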

Results and Discussion

Table 11 shows that simulated p-values reflect our predefined significance level α = 0.05. At this level, a substantial falsification leads to a similar proportion of false positive results as a substantial discovery. By contrast, a preliminary verification decreases the proportion of false positive results to almost zero (see Table 12). Applying the substantial verification-criterion even further decreases the probability of false positive results (see Figure 3).

TABLE 11

Table 11. The proportion of false positives, where the sample size, N, is obtained by a priori power analysis, given δ = 0.2 and where (1–β) = [0.4, 0.5, 0.8, 0.95].

TABLE 12

Table 12. The proportion of false substantial falsifications and false preliminary verifications using LR ≥ (1 − β)/α.

FIGURE 3

Figure 3. Illustration of false positives. Bar plots indicate the frequency of likelihood ratios (L(d > δ|D)/L(d = 0|D) in light gray and L(d = δ|D)/L(d = 0|D) in dark gray) repeatedly falling above the criterion (1 − β)/α, between the criterion and three, and below three, as a function of the sample size, provided the H0 is true.

The preceding simulations suggest that, given the threshold LR ≥ (1 − β)/α, the proportion of false negative results remains too large. One might therefore lower the threshold to 3 < LR < (1 − β)/α, which still indicates some evidence for the H1 (see Figure 2). Whether this lower threshold reduces the proportion of false negative results unproblematically depends directly on the proportion of false positives. Compared to the case of falsification, however, we now retain a larger proportion of false positives (see Table 13 and Figure 3).

TABLE 13

Table 13. The proportion of false preliminary verifications using LR = 3.

When we combine the threshold LR ≥ (1 − β)/α with the substantial verification-criterion, the previous simulations retained a rather large proportion of false negative results. However, this occurs only if data are of low to medium induction quality. If induction quality approaches α = β = 0.01, by contrast, then the proportion of both false positive and false negative results decreases to an acceptable minimum. Hence, we may falsify the H0 and simultaneously verify the H1.

Simulation 4

Purpose

Simulations 1–3 confirmed a simple relation: increasing induction quality of data decreases the proportion of false positive results. Where an actual experimental manipulation fails to produce its expected result, this relation may now tempt researchers to post-hoc manipulate induction quality of data, by shedding some of the “failing” data-points. Simulation 4 investigates the consequences of this move.

Method

Using the samples from Simulation 3, we remove from each sample the 10% of data that score lowest on the dependent variable, and thus least support the H1, and then re-assess the proportion of false positive findings.
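
The following sketch implements one reading of this manipulation (our illustration; here the lowest-scoring 10% are removed from the treatment group, where they least support the H1, and the H0 is true by construction):

```r
# Effect of shedding the lowest 10% of treatment-group scores when the H0 is true.
set.seed(12)
alpha <- 0.05
n <- ceiling(power.t.test(delta = 0.2, sd = 1, sig.level = alpha, power = 0.95,
                          type = "two.sample", alternative = "one.sided")$n)
false_positive <- replicate(1000, {
  treatment <- rnorm(n, 0)                                  # H0 is true
  control   <- rnorm(n, 0)
  kept <- treatment[treatment > quantile(treatment, 0.10)]  # shed the lowest 10%
  t.test(kept, control, alternative = "greater", var.equal = TRUE)$p.value < alpha
})
mean(false_positive)   # far above the nominal alpha = 0.05
```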

Results and Discussion

Rather than increase induction quality of data, this post-hoc manipulation produces the opposite result: it raises the proportion of false positive results. On all of our criteria, indeed, shedding the 10% of data that least support the focal hypothesis increases the error-rates profoundly (see Table 14).

TABLE 14

Table 14. The proportion of false substantial falsifications and false preliminary verifications, given one had obtained a preliminary discovery (as per the p-value and LR), after the 10% of data that least support the hypothesis were removed.

Published data, of course, do not reveal whether someone shed parts of them. Where this manipulation occurs but one cannot trace it reliably, this risks that others draw invalid inferences. For this reason alone, sound inferences should rely on the aggregate results of independent studies (This assumes that data shedding is not ubiquitous). As RPS's favored aggregation method, we therefore simulate a log-likelihood-addition of such results.

Simulation 5

Purpose

We generally advocate high induction quality of data. Collecting the sizable NMIN (that particularly laboratory studies require) to meet test-power = 0.99 (or merely 0.95), however, can quickly exhaust an individual research group's resources (see Lakens et al., 2018)8. In fact, we often have no other choice but to aggregate comparatively “soft” (underpowered) data from multiple studies. Aggregate data, of course, must reflect the trust each dataset deserves individually. We therefore simulate the addition of logarithmic LRs (log-LRs) for data of low to medium test-power.

Method

We add the log-LRs as per the low-powered samples of Simulation 1, then assess the proportions of samples that meet the criteria of each RPS-step. Notice that this is the only way to conduct a global hypothesis-test that combines individual studies safely (It is nevertheless distinct from a viable meta-analytic approach; see Birnbaum, 1954).
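
A compact sketch of this aggregation (our illustration under the same normal approximation as before; the true effect is set to δ = 0.2 and each study's N follows from an a priori power analysis):

```r
# Three individually underpowered studies of the same effect contribute their
# log-LRs, which are summed and compared against log[(1 - beta)/alpha].
set.seed(13)
alpha <- 0.05; delta <- 0.2
log_lr_study <- function(n) {
  treat <- rnorm(n, delta); ctrl <- rnorm(n, 0)
  sd_pool <- sqrt(((n - 1) * var(treat) + (n - 1) * var(ctrl)) / (2 * n - 2))
  d  <- (mean(treat) - mean(ctrl)) / sd_pool
  se <- sqrt(2 / n)
  log(dnorm(d, delta, se)) - log(dnorm(d, 0, se))
}
n_per_study <- sapply(c(0.4, 0.5, 0.8), function(pw)
  ceiling(power.t.test(delta = delta, sd = 1, sig.level = alpha, power = pw,
                       type = "two.sample", alternative = "one.sided")$n))
sum(n_per_study)                        # close to N_MIN for (1 - beta) = 0.95
sum(sapply(n_per_study, log_lr_study))  # aggregated log-LR
log(0.95 / 0.05)                        # criterion log(19) under alpha = beta = 0.05
```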

Results and Discussion

Table 15 shows that the log-LRs of three low-powered studies under (1–β) = [0.4, 0.5, 0.8] aggregate to one medium-powered study under (1–β) = 0.95 (see Table 3), because the three samples sum to NMIN for a substantial discovery under (1–β) = 0.95. The probability of correctly rejecting the H0 thus approaches 1, whereas the proportion of preliminary verifications is not much larger than for each individual study (see Table 3; last row). This means individual research groups can collect fewer data points than NMIN. Thus, log-LR addition indeed optimizes a substantial H0-falsification.

TABLE 15

Table 15. The proportions of LR ≥ (1 − β)/α when adding the log-LRs of individually underpowered studies featuring (1–β) = [0.4, 0.5, 0.8].

Simulations 1–5 recommend RPS primarily for its desirably low error-rates; achieving these made induction quality of data and likelihood-ratios central. Particularly Simulation 5 shows why log-likelihood-ratio addition of individually under-powered studies can meet the rigorous test-power demands of the justification context, viz. (1–β) = 0.95, or better yet (1–β) = 0.99.

Discussion

As an alternative to testing the H1 against H0 = 0, we may pitch it against H0 = random. Following a reviewer's suggestion, we therefore also simulated testing the mean-difference between the treatment- and the control-group against the randomly varying mean-difference between the control-group and zero. Compared to pitching the H1 against H0 = 0, this yields a reduced proportion of false negatives, but also generates a higher proportion of false positives.

Since our sampling procedure lets the mean-difference between control group and zero vary randomly around zero, the increase in false positives (negatives) arises from the control group's mean-difference falling below (above) zero in roughly 50% of all samples. This must increase the LR in favor of the H1 (H0). With respect to comparing group-means, however, testing the H1 against H0 = random does not prove superior to testing it against H0 = 0, as in RPS.

In view of RPS, if induction quality of data remains low (α = β > 0.05), then we cannot hope to either verify or falsify a hypothesis. This restricts us to two discovery context-activities: making a preliminary or a substantial discovery (RPS-step 1, 2). After all, since both discovery-variants arise from estimating p(D,H), this rules out hypothesis-testing research, which instead estimates L(H|D).

By contrast, achieving medium induction quality (α = β ≤ 0.05) meets a crucial precondition for justification context-research. RPS can now test hypotheses against "hard" data by estimating L(H|D). Specifically, RPS tests a preliminary, respectively a substantial H0-falsification (RPS-steps 3, 4), by testing whether L(d > 0|D)/L(d = 0|D), respectively L(d > δ|D)/L(d = 0|D), exceeds (1 − β)/α. If the latter holds true, then we can test a preliminary verification of the theoretical effect-size H1-hypothesis (RPS-step 5) as to whether L(d = δ|D)/L(d = 0|D) exceeds (1 − β)/α. If so, then we finally test a substantial H1-verification (RPS-step 6)—here using the ratio of the MLE of data and the likelihood of the H1(d = δ)—as to whether δ falls within the 95%-interval centered on the MLE [If not, we may adapt H1(d = δ) accordingly, provided both theoretical and empirical considerations support this].

As we saw, RPS almost eliminates the probability of a false positive H1-verification. If data are of medium induction quality, moreover, then the probability of falsely rejecting the H1 lies in an acceptable range, too (This range is even slightly smaller than that for false positive verifications). However, lowering the threshold (1 − β)/α to decrease the probability of false negatives will increase the probability of false positives. In balancing false positive with false negative H1-verifications, then, we face an inevitable trade-off.

Increasing the probability of false positives generally harms a study's global outcome more than decreasing the probability of false negatives benefits it. After all, since editors and reviewers typically prefer significant results (p < α = 0.05), non-significant results more often fail the review process, or are not written up (Franco et al., 2014). This risks that the community attends to more potentially false positive than potentially false negative results9. That researchers should reduce this risk thus speaks decisively against lowering the threshold. To control the risk, moreover, it suffices to increase induction quality of data by adding additional samples until N = NMIN.

In psychology as elsewhere today, the standard mode of empirical research clearly differs from what RPS recommends; particularly induction quality (test-power) appears underappreciated. Yet, what besides a substantial H1-verification can provide a statistical warrant to accept a H1 that aptly pre- or retrodicts a phenomenon? Likewise, only a substantial H0-falsification can warrant us in rejecting the H0 (For reasons given earlier, p-values alone will not do).

With the discovery context as its origin and the justification context as its end, RPS employs empirical knowledge to gain theoretical knowledge. A theory is generally more informative the more possible states-of-affairs it rules out. The most informative kind of theory, therefore, lets us deduce hypotheses predicting precise (point-specified) empirical effects—effects we can falsify statistically10. Obvious candidates for such point-values are those effects that "hard" data support sufficiently. RPS's use of statistical inference toward constructing improved theories thus reflects that, rather than one's statistical school determining the most appropriate inference-element, this choice primarily depends on the prior state of the empirical knowledge we seek to develop.

Such prior knowledge we typically gain via meta-analyses that aggregate the samples and effect-sizes of topically related object-level studies. These studies either estimate a parameter or test a hypothesis against aggregated data, but typically are individually underpowered. A meta-analysis now tends to join an estimated combined effect-size of several studies, on one hand, with the estimated sum of their confidence intervals deviating from the H0, on the other. This aggregate estimate, however, rests on data of variable induction quality. Such an aggregation method, therefore, can facilitate only a parameter-estimation; it will not estimate L(H|D) safely.

A typical meta-analysis indeed ignores the replication-probability of object-level studies, instead considering only the probability of data, p(D,H)11. This makes it an instance of discovery context-research. By contrast, log-likelihood-addition is per definition based on trustworthy data (of high induction quality), does estimate L(H|D) safely, and hence is an instance of justification context-research (see sections Three Measures and Discussion).

RPS furthermore aligns with the registered replication reports-initiative (RRR), which aims at more realistic empirical effect-size estimates by counteracting p-hacking and publication bias (Bowmeester et al., 2017). Indeed, RPS complements RRR. Witte and Zenker's (2017a) re-analysis of Hagger et al.'s (2016) RRR of the ego-depletion effect, for instance, strengthens the authors' own conclusions, showing that their data lend some 500 times more support to the H0(d=0.00) than to the H1(d=0.20).

Both RRR and RPS obviously advocate effortful research. Though we could coordinate such efforts across several research groups, current efforts are broadly individualistic and tend to go into making preliminary discoveries. This may yield a more complex view upon a phenomenon. Explaining, predicting, and intervening, however, all require theories with substantially verified H1-hypotheses as their deductive consequences. Again, constructing a more precise version of such a theory is RPS's main aim. Indeed, we need something like RPS anyway. For we can statistically test hypotheses by induction [see section The Research Program Strategy (RPS)], but we cannot outsource theory-construction to induction.

Frequentism vs. Bayesianism vs. RPS

A decisive evaluative criterion is whether an inference strategy leads to a rigorously validated, informative theory. Researchers can obviously support this end only if their individual actions relate to what the research community does as a whole. At the same time, each researcher must balance her own interests with those of others. Hence, we exercise “thrift” when collecting small samples, but also publish the underpowered results this generates to further our careers.

Reflecting the research community's need for informative theories, most journals require that a submitted manuscript report at least one statistically significant effect—that is, a preliminary discovery à la NHST (For an exception, see Trafimow, 2014). Given this constraint, the favored strategy to warrant our publication activities seemingly entails conducting “one-shot”-experiments, leading to many papers without integrating their results theoretically.

That strategy's probably best defense offers three supporting reasons: (i) the strategy suffices to discover non-random effects; (ii) non-random effects matter in constructing informative theories; (iii) the more such discoveries the merrier. However, (i) is a necessary (rather than a sufficient) reason that the strategy is apt; (ii) is an insufficient supporting reason, for non-randomness matters but test-power counts (Witte and Zenker, 2018); and (iii) obviously falls with (ii). Therefore, this defense cannot sufficiently support that the strategy balances the interests of all concerned parties. Indeed, the status quo strongly favors the individual's career aspirations over the community's need for informative theories.

The arguably best statistical method to make a discovery remains a Fisher-test (For other methods, see, e.g., Woodward, 1989; Haig, 2005). It estimates the probability of an empirical effect given uncontrollable, but non-negligible influences. This probability meeting a significance-threshold such as p(D,H0) < α = 0.05, as we saw, is a necessary and sufficient condition for a preliminary discovery (RPS-step 1). Though this directs our attention to an empirical object, it also exhausts what NHST by itself can deliver. Subsequent RPS-steps therefore employ additional induction quality measures, namely the effect-size (steps 2–5), and offer a new way of using confidence intervals (step 6).

Recent critiques of NHST give particular prominence to Bayesian statistics. As an alternative to a classical t-test, for instance, many promote a Bayesian t-test. This states the probability-ratio of data given a hypotheses-pair, p(D|H1)/p(D|H0), a ratio that is known as the “Bayes factor” (Rouder et al., 2009; Wetzels et al., 2011). If the prior probabilities are identical, p(H1) = p(H0) = 0.50, then the Bayes factor is the likelihood-ratio of two point-hypotheses, L(H1|D)/L(H0|D). Indeed, RPS largely is coextensive with a Bayesian approach as concerns the hypothesis space.

But Bayesians must also operate in the data space, particularly when selecting data-distributions as priors for an unspecified H1. Such substantial assumptions obviously demand a warrant. For the systematic connection between the Bayes-factor and the p-value of a classical t-test is that “default Bayes factors and p-values largely covary with each other” (Wetzels et al., 2011, 295). The main difference is their calibration: “p-values accord more evidence against the null [hypothesis] than do Bayes factors” (ibid).

The keyword here is "default." For the default prior probabilities one assumes matter when testing hypotheses. In fact, not only do Bayesians tend to assign different default priors to the focal H0 and the H1; they also tend to distribute (rather than point-specify) these priors. As Rouder et al. (2009, 229) submit, for instance, "[…] we assumed that the alternative [hypothesis] was at a single point"—an assumption, however, which allegedly is "too restrictive to be practical" (ibid). Rather, it would be "more realistic to consider an alternative [hypothesis] that is a distribution across a range of outcomes" (ibid), although "arbitrarily diffuse priors are not appropriate for hypothesis testing" (p. 230) either. This can easily suggest that modeling a focal parameter's prior probability distributively would be the innocent choice it is not.

After all, computing a Bayesian t-test necessarily incurs not only a specific prior data-distribution, but also a point-specified scaling factor. This factor is given by the prior distributions of the focal hypotheses, i.e., as the ratio p(H1)/p(H0) [see our formula (1), section Three Measures]. Prior to collecting empirical data, therefore, p(H1)/p(H0) < 1 reflects a (subjective) bias pro the H0–which lets data raise the ratio's denominator—while p(H1)/p(H0) > 1 reflects a preference contra the H0.

If the priors on the H0 and the H1 are unbiased, by contrast, then the scaling factor “drops out.” It thus qualifies as a hidden parameter. Alas, unbiased priors are the exception in Bayesian statistics. A default Bayesian t-test, for instance, normally assumes both a Cauchy distribution and a scaling factor of 0.707. Both assumptions are of the same strength as the assumptions that RPS incurs to point-specify the H1. The crucial difference, however, is that the two Bayesian assumptions concern the data space, whereas RPS's assumptions pertain to the hypotheses space.

Unlike RPS's assumptions, the two Bayesian assumptions thus substantially influence the shape of possible data. For the scaling factor's value grounds in the type of the chosen prior-distribution, which hence lets the Bayes factor vary noticeably. Different default priors can thus lead to profound differences as to whether data corroborate the H0- or the H1-hypothesis.
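
To illustrate how the chosen prior-distribution shapes the evidence, the following sketch contrasts a point-alternative likelihood-ratio with a Bayes factor that places a Cauchy prior of scale 0.707 on the effect-size. It uses a normal approximation to the likelihood of the empirical effect-size, so it is not the exact default Bayesian t-test; the study values (d_emp = 0.35, n = 50) and the point-alternative δ = 0.5 are made up for illustration.

```r
# Point-alternative LR vs. a Cauchy-prior Bayes factor (normal approximation).
d_emp <- 0.35; n <- 50; se <- sqrt(2 / n)        # hypothetical study
lik <- function(d) dnorm(d_emp, mean = d, sd = se)
LR_point <- lik(0.5) / lik(0)                    # H1: d = 0.5 vs. H0: d = 0
marginal_H1 <- integrate(function(d) lik(d) * dcauchy(d, location = 0, scale = 0.707),
                         lower = -Inf, upper = Inf)$value
BF_cauchy <- marginal_H1 / lik(0)                # H1: d ~ Cauchy(0, 0.707) vs. H0
c(LR_point = LR_point, BF_cauchy = BF_cauchy)    # the two can differ noticeably
```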

Moreover, a Bayesian t-test's result continues to depend on the sample size, and lacks information on the replication-probability of data given a true hypothesis.

The most decisive reason against considering a standard Bayesian approach an all-things-considered best inference strategy, finally, is that it remains unclear how to sufficiently justify this or that scaling factor, or distribution, not only “prior to analysis[, but also] without influence from [sic] data” (Rouder et al., 2009, 233; italics added). Indeed, the need to fix a Bayesian t-test's prior-distribution alone already fully shifts the decision—as to the elements an inference strategy should (not) specify—from the hypotheses space to the data space. This injects into the debate a form of subjectivity that point-specifying the H1 would instead make superfluous.

One should therefore treat a Bayesian t-test with utmost caution. For rather than render hypothesis testing simple and transparent, a Bayesian t-test demands additional efforts to bring its hidden parameters and default priors back into view. We would hence do well to separate our data exploration-strategy clearly from our hypothesis-testing machinery. The Bayesian approach, however, either would continue not to mark a clear boundary or soon look similar to RPS's hybrid-approach12.

To summarize the advantages RPS offers over both a pure Frequentist and a standard Bayesian approach:

(i) RPS uses NPTT to determine the minimum sample size, NMIN, that suffices to conduct research under at least medium induction quality of data (α = β ≤ 0.05);

(ii) the RPS hypothesis corroboration-threshold is sensitive to both errors (α, β);

(iii) to facilitate an aggregate hypothesis-evaluation (balancing resource restrictions with career aspirations), RPS uses log-likelihood-addition to integrate individually underpowered studies.

RPS thus makes explicit why a statistical result depends on the sample-size, N. Using a point-alternative hypothesis particularly shows that the Bayes-factor varies with N, which otherwise remains “hidden” information. Throughout RPS's six steps, the desirably transparent parameter to guide the acceptance or rejection of a hypothesis (as per Wald's criterion) is induction quality of data (test-power).

Finally, notice that the “new statistics” of Cumming (2013) only pertains to the data space. As does Benjamin et al.'s (2018) proposal to lower α drastically. For it narrowly concerns a preliminary discovery (RPS-step 1), but leaves hypothesis-testing unaddressed (also see Lakens et al., 2018). To our knowledge, no equally appropriate and comprehensive strategy currently matches the inferential capabilities that RPS offers (Wasserstein and Lazar, 2016).

Conclusion

RPS is a hybrid-statistical approach using tools from several statistical schools. Its six hierarchical steps lead from a preliminary H1-discovery to a substantial H1-verification. Not only does each step make the empirical result of an earlier step more precise; our simulations also show that completing RPS's six steps nearly eliminates the probability of false positive H1-verifications. If data are of medium induction quality, moreover, then also the probability of falsely rejecting the H1 lies in an acceptable range.

Having simulated a broad range of focal parameters (α, β, d, N), we may extrapolate to implicit ranges safely. This lets us infer the probable error-rates of studies that were conducted independently of RPS, and thus allows estimating how trustworthy a given such result is. The online-tool we supply indeed makes this easy.

We advocate RPS primarily for the very low error-rates of its empirical results (Those feeling uncertain about such RPS-results may further increase the sample, to obtain yet lower error-rates). Moreover, an integration of individually underpowered studies via log-likelihood-addition not only is meaningful, it can also meet the test-power demands of the justification context. Therefore, research groups may cooperate such that each group collects fewer than the minimum number of data points.

Null-hypothesis significance testing by itself can at most deliver a preliminary discovery (RPS-step 1). This may motivate new research questions, which for RPS is merely an intermediate goal; the aim is to facilitate theory development and testing. Since most current research in psychology as elsewhere stops at RPS-step 1, however, it cannot suffice to construct well-supported and informative theories. Indeed, the idea that an accumulation of preliminary discoveries could ever lead to a well-supported theory remains deeply flawed.

Author Contributions

The idea for RPS originates with EW. FZ and EW jointly developed its presentation. AK-S programed and ran the simulations. EW wrote the first draft of the manuscript; all authors edited it and approved the final version.

Conflict of Interest Statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Acknowledgments

We thank Paul T. Barrett and Aristides Moustakas for constructive comments that helped improve an earlier version of this manuscript. We also thank Holmes Finch for overseeing the review process. AK-S acknowledges a research grant from the Swiss National Science Foundation, and open access funding from the University of Geneva. FZ acknowledges funding from the Ragnar Söderberg Foundation, the Volkswagen Foundation, the HANBAN institute, and a European Union MSC-COFUND fellowship, as well as open access funding from Lund University.

Footnotes

1. ^See https://osf.io/pwc26/ for the R-code; find the online-tool at https://antoniakrefeldschwalb.shinyapps.io/ResearchProgramStrategy/

2. ^Our focus here is on the quantitative evaluation of hypotheses by empirical data. The current presentation of RPS therefore both excludes qualitative research processes preceding data-collection like conjecturing phenomena or constructing experimental designs (see Flick, 2014) as well as subsequent processes like embedding data into an informative theory. Both process kinds employ observation and interpretation, but also rely on scholarly argument referencing more than statistical data alone. Nonetheless, insofar as empirical data are independent of a researcher's prior belief, such data are necessary to run a research program.

3. ^The smallest sample, NMIN, sufficing in NPTT to identify a point-specified effect as a statistically significant deviation from random, is a function of α, β, and d. Under conventional errors (e.g., α = β ≤ 0.05), therefore, given any sample, N, a significance-test is optimal if N = NMIN. With both hypothesis-verification and -falsification alike, however, if N > NMIN, then the utility of additional data decreases. Under α = 0.05 (one-tailed), for instance, already N = 500 lets the very small effect d = 0.10 become statistically significant, even though it "explains" but 0.002% of data-variance. Once N > 60,000, this utility vanishes. Almost any way of partitioning a very large sample now makes virtually the smallest effect statistically significant (Bakan, 1966).

This may seem paradoxical, because the law of large numbers states that, ceteris paribus, enlarging N increases the validity of a parameter-estimate. At N > 60,000, however, measuring virtually any variable "reveals" that it significantly deviates from some predicted value. In a statistical sense, all unknown influences can now sufficiently manifest themselves, which lets any parameter-value become equally admissible. But if every parameter could become statistically significant, then none would be particularly important. As concerns hypothesis-testing, then, the claim "more data are always better" is thus reduced ad absurdum in the hypothesis space. It nevertheless holds that increasing the sample yields an ever more precise parameter-estimate in the data space.
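As an arithmetic illustration of footnote 3, the following sketch uses base R's power.t.test, assuming a one-sample, one-tailed t-test (the design and the effect-size d = 0.35 are assumptions for illustration, not values fixed in the text), and then checks the N = 500, d = 0.10 example.

# Sketch: N_MIN as a function of alpha, beta, and d, assuming a
# one-sample, one-tailed t-test (an illustrative design choice).
power.t.test(delta = 0.35, sd = 1, sig.level = 0.05, power = 0.95,
             type = "one.sample", alternative = "one.sided")$n

# The N = 500 example: an observed d = 0.10 exceeds the one-tailed
# critical value, since t = 0.10 * sqrt(500) = 2.24 > qt(0.95, df = 499).
0.10 * sqrt(500) > qt(0.95, df = 499)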

4. ^Fixing the H0-parameter as d = 0, as a random-model has it, is merely a convention, of course. In fact, RPS can alternatively base the H0-parameter on a control-group (as is typical), on a simpler model (one that eliminates elements of a more complex model), or on a rival theoretical model. In any case, lest we remain ignorant of α, β, N, and d, we must specify the H0.

5. ^Witte and Zenker (2017b) presented the second and third measure as if they were one.

6. ^Here, "log" abbreviates the logarithmus naturalis (ln), as per the corresponding command in R; in previous work, we used "log" to abbreviate the logarithmus decimalis (Witte and Zenker, 2017b). Results are independent of this nomenclature, of course, except that the critical values there came to 1.28 and 2.00.
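Assuming that 1.28 and 2.00 are simply the decimal-log forms of the natural-log criteria used here, the conversion is a plain change of base:

# Sketch: change of base between the decimal-log critical values of
# earlier work and their natural-log counterparts.
log10_crit <- c(1.28, 2.00)
log10_crit * log(10)   # roughly 2.95 and 4.61 on the natural-log scale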

7. ^The term "random" is shorthand for a normalized mean, irrespective of whether we assume random influences, work with a control group, with a simpler model (featuring fewer parameters), or with a theoretical alternative model.

8. ^Governing the praxis of enlarging the sample until we reach sufficient test-power is the assumption that the focal theoretical parameter is the mean at the group level, whereas participant behavior at the individual level fluctuates randomly. If we instead focus on the potential non-random variation at the individual level, of course, then it is not the size of the sample (number of participants) that counts, but the number of repeated measurements we perform on a single participant. With a “small-N design,” indeed, the population is the single participant (see Smith and Little, 2018). Provided our indicators are valid and reliable, repeated measurements on a single participant may in fact detect idiosyncratic influences that averaging at the group level could distort. But rather than offer an alternative to large sample research, a small-N design serves a distinct purpose, and so complements large sample research.

9. ^Things might look different if, next to a truth-criterion (based on error-probabilities), we also employ external utilities (Miller and Ulrich, 2016). Even where we can motivate such utilities unproblematically, we must still compare the empirical proportions of simulated false positive results vs. false negative substantial verifications. Under medium induction quality (α = β = 0.05), this odds-ratio is roughly 1:5; under high induction quality (α = β = 0.01), it is 1:10. Comparing instead the proportions of substantial discoveries and substantial falsifications, the odds-ratio decreases to about 1:2 under medium induction quality, and to nearly 1:1 under high induction quality. As we saw in the previous section, the asymmetry itself arises from comparing a point-parameter in the case of false negatives with a distributed parameter (an interval) in the case of false positives.

10. ^If we are uncertain which point-hypothesis best specifies a theoretical parameter, then we may generalize the parameter from a point- to an interval-hypothesis. The interval's end-points thus state distinct (hypothetical) effect-sizes; the middle point qualifies as a theoretical assumption. To achieve constant induction quality, of course, we must confront each end-point with its appropriate sample-size. To this end, log-likelihood-addition lets us increase the sample associated with the larger effect-size until we reach the appropriate sample-size for the smaller effect-size.
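The following sketch illustrates this for two end-point effect-sizes under constant induction quality (α = β = 0.05); the values d = 0.2 and d = 0.5 and the one-sample, one-tailed design are assumptions for illustration only.

# Sketch: an interval-hypothesis' end-points demand different minimum
# sample-sizes at constant induction quality (alpha = beta = 0.05);
# d = 0.2 and d = 0.5 are illustrative end-point effect-sizes.
sapply(c(lower = 0.2, upper = 0.5), function(d)
  ceiling(power.t.test(delta = d, sd = 1, sig.level = 0.05, power = 0.95,
                       type = "one.sample", alternative = "one.sided")$n))

The end-point with the smaller effect-size thus sets the sample-size that the aggregation must reach.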

11. ^Here, we can neither discuss meta-analysis as a method, nor adequately address the replication of empirical studies. We show elsewhere how to statistically establish hypotheses by integrated efforts, particularly addressing Bem's psi-hypothesis (Witte and Zenker, 2017b) and the ego-depletion effect (Witte and Zenker, 2017a).

12. ^Schönbrodt and Wagenmakers's (2018) recent Bayes factor design analysis (BFDA), for instance, clearly recognizes the need to first plan an empirical setting, and only then to evaluate the degree to which actual data falsify or verify a hypothesis statistically. The same need lets RPS characterize the setting via the induction quality of the data. While the planning stage is independent of the analysis stage, RPS's Wald-criterion not only provides a bridge between them, it also functions as a threshold with known consequences. Unlike BFDA and similar Bayesian approaches, however, RPS avoids setting subjective priors and relies solely on the likelihood-function.

References

Alfaro, M. E., and Holder, M. T. (2006). The posterior and the prior in Bayesian phylogenetics. Annu. Rev. Ecol. Evol. Syst. 37, 19–42. doi: 10.1146/annurev.ecolsys.37.091305.110021

Alogna, V. K., Attaya, M. K., Aucoin, P., Bahnik, S., Birch, S., Birt, A. R., et al. (2014). Registered replication report: Schooler and Engstler-Schooler (1990). Perspect. Psychol. Sci. 9, 556–578. doi: 10.1177/1745691614545653

Bakan, D. (1966). The test of significance in psychological research. Psychol. Bull. 66, 423–437. doi: 10.1037/h0020412

Baker, M. (2015). First results from psychology's largest reproducibility test. Nature. doi: 10.1038/nature.2015.17433

Benjamin, D. J., Berger, J., Johannesson, M., Nosek, B. A., Wagenmakers, E.-J., Berk, R., et al. (2018). Redefine statistical significance. Nat. Human Behav. 2, 6–10. doi: 10.1038/s41562-017-0189-z

Birnbaum, A. (1954). Combining independent tests of significance. J. Am. Statist. Assoc. 49, 559–574.

Bouwmeester, S., Verkoeijen, P. P. J. L., Aczel, B., Barbosa, F., Bègue, L., Brañas-Garza, P., et al. (2017). Registered replication report: Rand, Greene, and Nowak (2012). Perspect. Psychol. Sci. 12, 527–542. doi: 10.1177/1745691617693624

Cheung, I., Campbell, L., LeBel, E., and Yong, J. C. (2016). Registered replication report: study 1 from Finkel, Rusbult, Kumashiro, & Hannon (2002). Perspect. Psychol. Sci. 11, 750–764. doi: 10.1177/1745691616664694

Clinton, J. D. (2012). Using roll call estimates to test models of politics. Ann. Rev. Pol. Sci. 15, 79–99. doi: 10.1146/annurev-polisci-043010-095836

Cumming, G. (2013). The new statistics: why and how. Psychol. Sci. 20, 1–23. doi: 10.1177/0956797613504966

Edwards, A. W. F. (1972). Likelihood. Cambridge: Cambridge University Press (expanded edition, 1992, Baltimore: Johns Hopkins University Press).

Eerland, A., Sherrill, A. M., Magliano, J. P., Zwaan, R. A., Arnal, J. D., Aucoin, P., et al. (2016). Registered replication report: Hart & Albarracín (2011). Perspect. Psychol. Sci. 11, 158–171. doi: 10.1177/1745691615605826

Erdfelder, E., and Ulrich, R. (2018). Zur Methodologie von Replikationsstudien [On a methodology of replication studies]. Psychol. Rundsch. 69, 3–21. doi: 10.1026/0033-3042/a000387

Etz, A., and Vandekerckhove, J. (2016). A Bayesian perspective on the reproducibility project: psychology. PLoS ONE 11:e0149794. doi: 10.1371/journal.pone.0149794

Fanelli, D. (2009). How many scientists fabricate and falsify research? A systematic review and meta-analysis of survey data. PLoS ONE 4:e5738. doi: 10.1371/journal.pone.0005738

Fisher, R. A. (1956). Statistical Methods and Scientific Inference. New York, NY: Hafner.

Flick, U. (ed.). (2014). The SAGE Handbook of Qualitative Data Analysis. London: Sage.

Franco, A., Malhotra, N., and Simonovits, G. (2014). Publication bias in the social sciences: unlocking the file drawer. Science 345, 1502–1505. doi: 10.1126/science.1255484

Freese, J., and Peterson, D. (2017). Replication in social science. Annu. Rev. Sociol. 43, 147–165. doi: 10.1146/annurev-soc-060116-053450

Hacking, I. (1965). Logic of Statistical Inference. Cambridge: Cambridge University Press.

Hagger, M. S., Chatzisarantis, N. L. D., Alberts, H., Anggono, C. O., Batailler, C., Birt, A. R., et al. (2016). A multilab preregistered replication of the ego-depletion effect. Perspect. Psychol. Sci. 11, 546–573. doi: 10.1177/1745691616652873

Haig, B. (2005). An abductive theory of scientific method. Psychol. Methods 10, 371–388. doi: 10.1037/1082-989X.10.4.371

Ioannidis, J. P. A. (2014). How to make more published research true. PLoS Med. 11:e1001747. doi: 10.1371/journal.pmed.1001747

Ioannidis, J. P. A. (2016). The mass production of redundant, misleading, and conflicted systematic reviews and meta-analyses. Milbank Q. 94, 485–514. doi: 10.1111/1468-0009.12210

Lakatos, I. (1978). The Methodology of Scientific Research Programmes, Vol. I, eds J. Worrall and G. Currie. Cambridge, UK: Cambridge University Press.

Lakens, D., Adolfi, F. G., Albers, C., Anvari, F., Apps, M. A. J., Argamon, S. E., et al. (2018). Justify your alpha. Nat. Hum. Behav. 2, 168–171. doi: 10.1038/s41562-018-0311-x

Miller, J., and Ulrich, R. (2016). Optimizing research payoff. Perspect. Psychol. Sci. 11, 661–691. doi: 10.1177/1745691616649170

Nelson, L. D., Simmons, J., and Simonsohn, U. (2018). Psychology's Renaissance. Annu. Rev. Psychol. 69, 511–534. doi: 10.1146/annurev-psych-122216-011836

Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science 349:aac4716. doi: 10.1126/science.aac4716

Popper, K. R. (1959). Logic of Scientific Discovery. London: Basic Books.

Reichenbach, H. (1938). Experience and Prediction. Chicago, IL: University of Chicago Press.

Rouder, J. N., Speckman, P. L., Sun, D., Morey, R. D., and Iverson, G. (2009). Bayesian t-tests for accepting and rejecting the null hypothesis. Psychon. Bull. Rev. 16, 225–237. doi: 10.3758/PBR.16.2.225

Schönbrodt, F., and Wagenmakers, E.-J. (2018). Bayes factor design analysis: planning for compelling evidence. Psychon. Bull. Rev. 25, 128–142. doi: 10.3758/s13423-017-1230-y

Smith, P. L., and Little, D. R. (2018). Small is beautiful: In defense of the small-N design. Psychon. Bull. Rev. doi: 10.3758/s13423-018-1451-8. [Epub ahead of print].

Stigler, S. (1986). The History of Statistics: The Measurement of Uncertainty before 1900. Cambridge, MA: Harvard University Press.

Trafimow, D. (2014). Editorial. Basic Appl. Soc. Psychol. 36, 1–2. doi: 10.1080/01973533.2014.865505

Wagenmakers, E.-J., Beek, T., Dijkhoff, L., Gronau, Q. F., Acosta, A., Adams, R. B., et al. (2016). Registered replication report: Strack, Martin, and Stepper (1988). Perspect. Psychol. Sci. 11, 917–928. doi: 10.1177/1745691616674458

Wald, A. (1943). Tests of statistical hypotheses concerning several parameters when the number of observations is large. Trans. Am. Math. Soc. 54, 426–482. doi: 10.1090/S0002-9947-1943-0012401-3

Wasserstein, R. L., and Lazar, N. A. (2016). The ASA's statement on p-values: context, process, and purpose. Am. Stat. 70, 129–133. doi: 10.1080/00031305.2016.1154108

Wetzels, R., Matzke, D., Lee, M. D., Rouder, J. N., Iverson, G. J., and Wagenmakers, E.-J. (2011). Statistical evidence in experimental psychology: an empirical comparison using 855 t-tests. Perspect. Psychol. Sci. 6, 291–298. doi: 10.1177/1745691611406923

Witte, E. H. (1994). A Statistical Inference Strategy (FOSTIS): A Non-Confounded Hybrid Theory. HAFOS, 9. Available online at: http://hdl.handle.net/20.500.11780/491 (Accessed August 8, 2016).

Witte, E. H., and Kaufman, J. (1997). The Stepwise Hybrid Statistical Inference Strategy: FOSTIS. HAFOS, 18. Available online at: http://hdl.handle.net/20.500.11780/502 (Accessed April 6, 2018).

Witte, E. H., and Zenker, F. (2016a). Reconstructing recent work on macro-social stress as a research program. Basic Appl. Soc. Psychol. 38, 301–307. doi: 10.1080/01973533.2016.1207077

Witte, E. H., and Zenker, F. (2016b). Beyond schools—reply to Marsman, Ly & Wagenmakers. Basic Appl. Soc. Psychol. 38, 313–317. doi: 10.1080/01973533.2016.1227710

Witte, E. H., and Zenker, F. (2017a). Extending a multilab preregistered replication of the ego-depletion effect to a research program. Basic Appl. Soc. Psychol. 39, 74–80. doi: 10.1080/01973533.2016.1269286

Witte, E. H., and Zenker, F. (2017b). From discovery to justification. Outline of an ideal research program in empirical psychology. Front. Psychol. 8, 1847. doi: 10.3389/fpsyg.2017.01847

Witte, E. H., and Zenker, F. (2018). Data replication matters, replicated hypothesis-corroboration counts. (Commentary on "Making Replication Mainstream" by Rolf A. Zwaan, Alexander Etz, Richard E. Lucas, and M. Brent Donnellan). Behav. Brain Sci. (forthcoming).

Woodward, J. (1989). Data and phenomena. Synthese 79, 393–472. doi: 10.1007/BF00869282

Zenker, F. (2017). “Falsification,” in The Wiley Encyclopedia of Social Theory, ed B. Turner (Chichester: Wiley Blackwell), 1–3.

Keywords: Bayes' theorem, inferential statistics, likelihood, replication, research program strategy, t-test, Wald criterion

Citation: Krefeld-Schwalb A, Witte EH and Zenker F (2018) Hypothesis-Testing Demands Trustworthy Data—A Simulation Approach to Inferential Statistics Advocating the Research Program Strategy. Front. Psychol. 9:460. doi: 10.3389/fpsyg.2018.00460

Received: 18 October 2017; Accepted: 19 March 2018;
Published: 24 April 2018.

Edited by:

Holmes Finch, Ball State University, United States

Reviewed by:

Paul T. Barrett, Advanced Projects R&D Ltd., Australia
Aristides (Aris) Moustakas, Universiti Brunei Darussalam, Brunei

Copyright © 2018 Krefeld-Schwalb, Witte and Zenker. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Frank Zenker, frank.zenker@fil.lu.se
