Edited by: Nikolaus Kriegeskorte, Medical Research Council Cognition and Brain Sciences Unit, UK
Reviewed by: Nikolaus Kriegeskorte, Medical Research Council Cognition and Brain Sciences Unit, UK; Alexander Walther, Medical Research Council Cognition and Brain Sciences Unit, UK
*Correspondence: Joshua K. Hartshorne, Department of Psychology, Harvard University, 33 Kirkland Street, Cambridge, MA 02138, USA. e-mail:
This is an open-access article distributed under the terms of the
Recent reports have suggested that many published results are unreliable. To increase the reliability and accuracy of published papers, multiple changes have been proposed, such as changes in statistical methods. We support such reforms. However, we believe that the incentive structure of scientific publishing must change for such reforms to be successful. Under the current system, the quality of individual scientists is judged on the basis of their number of publications and citations, with journals similarly judged via numbers of citations. Neither of these measures takes into account the replicability of the published findings, as false or controversial results are often particularly widely cited. We propose tracking replications as a means of post-publication evaluation, both to help researchers identify reliable findings and to incentivize the publication of reliable results. Tracking replications requires a database linking published studies that replicate one another. As any such database is limited by the number of replication attempts published, we propose establishing an open-access journal dedicated to publishing replication attempts. Data quality of both the database and the affiliated journal would be ensured through a combination of crowd-sourcing and peer review. As reports in the database are aggregated, ultimately it will be possible to calculate replicability scores, which may be used alongside citation counts to evaluate the quality of work published in individual journals. In this paper, we lay out a detailed description of how this system could be implemented, including mechanisms for compiling the information, ensuring data quality, and incentivizing the research community to participate.
The current system of conducting, reviewing, and publishing scientific findings – while enormously successful – is by no means perfect. Peer review, the primary vetting procedure for publication, is often slow, contentious, and uneven (Mahoney,
There are many considerations that go into determining research quality, but perhaps the most fundamental is replicability. Recently, numerous reports have suggested that many published results across a range of scientific disciplines do not replicate (Ioannidis et al.,
In the present paper, we first discuss evidence that the rate of replicability of published studies is low, including novel data from a survey of researchers in psychology and related fields. We propose that this low replicability stems from the current incentive structure, in which replicability is not systematically considered in measuring paper, researcher, and journal quality. As a result, the current incentive structure rewards the publication of non-replicable findings, complicating the adoption of needed reforms. Thus, we outline a proposal for tracking replications as a form of post-publication evaluation, and using these evaluations to calculate a metric of replicability. In doing so, we aim not only to enable researchers to easily find and identify reliable results, but also to improve the incentive structure of the current system of scientific publishing, leading to widespread improvements in scientific practice and increased replicability of published work.
Many aspects of current accepted practice in psychology, neuroscience, and other fields necessarily decrease replicability. Some of the most common issues include a lack of documentation of null findings; a tendency to conduct low-powered studies; failure to account for multiple comparisons; data-peeking (with continuation of data collection contingent on current significance level); and a publication bias in favor of surprising (“newsworthy”) results.
Null results are less likely to be published than statistically significant findings. This has been extensively documented in the medical literature (Dickersin et al.,
Preferential publication of significant effects necessarily biases the record. Consider cases in which multiple labs all test the same question, or in which the same lab repeatedly tests the same question while iteratively refining the method. By chance alone, some of the experiments will result in publishable statistically significant effects; the likelihood that a finding may be spurious is masked by the fact that the null results are not published.
The significance-bias also leads to the overestimation of real effects. Measurement is probabilistic: the measured effect size in a given experiment is a function of the true effect size plus some random error. In some experiments, the measured effect will be larger than the true effect, and in some it will be smaller. Suppose the statistical power of the experiment is 0.8 (a particularly high level of power for studies in psychology; see below). This means that the effect will be statistically significant only if it is in the top 80% of its sampling distribution. Twenty percent of the time, when the effect is – by chance – relatively small, the results will be non-significant. Thus, given that an effect was significant, the measured effect size is probably larger than the actual effect size, and subsequent measurements will find smaller effects due to the familiar phenomenon of regression to the mean. The lower the statistical power, the more the effect size will be inflated.
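The effect-size inflation described here can be illustrated with a short simulation (ours, not the paper’s; the true effect size, sample size, and one-tailed test below are hypothetical choices):

```python
# Simulate many experiments measuring a true effect of d = 0.5 (hypothetical),
# then condition on statistical significance. The average significant effect
# overestimates the true effect, because "unlucky" small measurements never
# reach significance.
import random
import statistics

random.seed(0)
true_effect = 0.5            # hypothetical true effect size (Cohen's d)
n = 25                       # hypothetical per-group sample size
se = (2 / n) ** 0.5          # approximate standard error of d

measured = [random.gauss(true_effect, se) for _ in range(100_000)]
significant = [d for d in measured if d / se > 1.96]   # upper-tail z-test
power = len(significant) / len(measured)

print(f"power: {power:.2f}")
print(f"mean effect, all experiments:         {statistics.mean(measured):.2f}")
print(f"mean effect, significant experiments: {statistics.mean(significant):.2f}")
```

In this configuration the significant experiments overestimate the true effect by roughly 50%; lowering the power widens the gap further.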
A number of findings suggest that the statistical power in psychology and neuroscience experiments is typically low. According to multiple meta-analyses, the statistical power of a typical psychology or neuroscience study to detect a medium-sized effect (defined variously as
All else being equal, low statistical power would increase the proportion of significant results that are spurious. For instance, suppose researchers are investigating a hypothesis that is equally likely to be true or false (the prior likelihood of the null hypothesis is 50%), using methods with statistical power = 0.8. In this case, 6% of significant results will be false positives (True positives: 0.5 × 0.8 = 0.4; False positives: 0.5 × 0.05 = 0.025; Ratio: 0.025/0.425 = 0.059). If Power = 0.2, this increases to 20%. If the prior likelihood of the null hypothesis is instead 90% (i.e., if an effect would be surprising, or when data-mining) and power remains 0.2, the false positive rate will be 69% (for additional discussion, see Yarkoni,
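The arithmetic above can be reproduced in a few lines. The function below simply tallies the expected fractions of true and false positives among significant results:

```python
# Fraction of significant results that are false positives, given the
# statistical power of the test and the prior probability that the null
# hypothesis is true.
def false_positive_fraction(power, p_null, alpha=0.05):
    true_positives = (1 - p_null) * power    # real effects detected
    false_positives = p_null * alpha         # null effects "detected"
    return false_positives / (true_positives + false_positives)

# The three cases from the text:
print(round(false_positive_fraction(power=0.8, p_null=0.5), 3))  # 0.059
print(round(false_positive_fraction(power=0.2, p_null=0.5), 3))  # 0.2
print(round(false_positive_fraction(power=0.2, p_null=0.9), 3))  # 0.692
```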
If one tests for 10 different possible effects in each experiment, the chance of finding at least one significant at the p < 0.05 level is roughly 40% (1 − 0.95^10 ≈ 0.40).
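This familywise error rate follows directly from the independence assumption:

```python
# Probability of at least one false positive among k independent tests,
# each conducted at alpha = 0.05.
alpha, k = 0.05, 10
p_at_least_one = 1 - (1 - alpha) ** k
print(round(p_at_least_one, 3))  # 0.401
```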
Many researchers compile and analyze data prior to testing a full complement of subjects. There is nothing wrong with this, so long as the decision to stop data collection is made independent of the results of these preliminary analyses, or so long as the final result is then replicated with the same number of subjects. Unfortunately, the temptation to stop running participants once significance is reached – or to run additional participants if it has not been reached – is difficult to resist. This data-peeking and contingent stopping has the potential to significantly increase the false positive rate (Feller,
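The cost of contingent stopping is easy to demonstrate by simulation (a hypothetical design, not drawn from any of the cited studies): even when the null hypothesis is true, checking for significance after every batch of subjects inflates the false positive rate well above the nominal 5%.

```python
# Data-peeking simulation: test after every 10 subjects, stop as soon as
# p < .05 (|z| > 1.96), up to a maximum of 100 subjects. The null is true
# throughout, so every "significant" result is a false positive.
import math
import random
import statistics

def one_experiment(batch=10, max_n=100, rng=random):
    data = []
    while len(data) < max_n:
        data.extend(rng.gauss(0, 1) for _ in range(batch))  # null is true
        n = len(data)
        z = statistics.mean(data) * math.sqrt(n)            # one-sample z-test
        if abs(z) > 1.96:
            return True   # "significant" -- stop and publish
    return False

random.seed(2)
false_positives = sum(one_experiment() for _ in range(5_000)) / 5_000
print(f"false positive rate with peeking: {false_positives:.3f}")
```

With ten looks at the data, the realized Type I error rate lands far above the nominal 5%.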
Researchers are more likely to submit – and editors more likely to accept – “newsworthy” or surprising results. Spurious results are likely to be surprising, and thus are likely to be over-represented in published reports. Consistent with this claim, there is some evidence that highly cited papers are less likely to replicate (Ioannidis,
Several studies have found low rates of replicability across multiple scientific fields. Ioannidis (
Likewise, several studies have found that initial reports of effect size are often exaggerated. This has been noted in medicine (Ioannidis et al.,
Less is known about replication rates in psychology and neuroscience. In a series of five meta-analyses of fMRI studies, Wager and colleagues estimated that between 10 and 40% of activation peaks are false positives (Wager et al.,
In order to add to our knowledge of replicability rates in psychology and related disciplines, we surveyed 49 researchers in these disciplines, who reported a total of 257 attempted replications of published studies (for details, see
As reviewed above, a number of factors promote low replicability rates across a range of fields. These problems are reasonably well known, and in many cases solutions have been proposed, such as use of different statistical methods and self-replication prior to publication. However, in spite of these solutions, evidence suggests that replicability remains low and thus that the proposed solutions have not been widely adopted. Why would this be the case? We propose that the incentive structure of the current system diminishes the ability and tendency of researchers to adopt these solutions. Namely, current methods of judging paper, researcher, and journal quality fail to take replicability into account, and in effect incentivize publishing spurious results.
There are three primary drawbacks, under the current incentive structure, to eliminating spurious results from one’s publications.
First, eliminating false positives means publishing fewer papers, since null results are difficult to publish. Second, ensuring that effect sizes are not inflated means reporting results with smaller effect sizes, which may be seen as less interesting or less believable. Third, as discussed above, spurious results are more likely to be surprising and newsworthy. Thus, eliminating spurious results disproportionately eliminates publications that would be widely cited and published in top journals.
These drawbacks are compounded by the fact that many of the improved practices that ensure replicability take time and resources. Learning to use new statistical methods often requires substantial effort. Increasing an experiment’s statistical power may require testing more participants. Eliminating stopping of data collection contingent on significance level (data-peeking) also means erring on the side of testing more participants. Perhaps the best insurance against false positives is pre-publication replication by the authors. All these strategies take time.
In addition, there is relatively little cost associated with publishing unreliable results, as failures to replicate are rarely published and not systematically tracked. As a result, knowledge of the replicability of results mainly travels via word-of-mouth, through specific personal interactions at conferences and meetings. There are obvious concerns about the reliability of such a system, and there is little evidence that this system is particularly effective. We are aware of several cases in which a researcher invested months or years into unsuccessfully following up on a well-publicized effect from a neighboring subfield, only to later be told that it is “well-known” that the effect does not replicate.
Moreover, even when a failure-to-replicate is published, the results often go unnoticed. For example, a meta-analysis by Maraganore et al. (
It follows that researchers who take additional steps to ensure the quality of their data will ultimately spend more time and resources on each publication and, all else equal, will end up with fewer, less-often-cited papers in lower-quality journals. In the same way, journals that adopt more stringent publication standards may drive away submissions, particularly of the surprising, newsworthy findings that are likely to be widely cited. Certainly, the vast majority of researchers and editors are internally motivated to publish real, reliable results. However, we also cannot continue practicing science without jobs, grants, and tenure. This situation sets up a classic Tragedy of the Commons (Hardin,
Individuals can solve the Tragedy of the Commons by adopting common rules or changing incentive structures. To give a recent example, Jaeger (
In a similar way, collective action is needed to solve the problem of low replicability: Because the incentive structure of the current system penalizes any member of the community who is an early adopter of reforms, an organized community change is needed. Instead of maintaining a system in which individual incentives (publish as often as possible) run counter to the goals of the group (maintain the integrity of the scientific literature), we can change the incentives by placing value on replicability directly. To do this, we propose tracking the replicability of published studies, and evaluating the quality of work post-publication partly on this basis. By tracking replicability, we hope to provide concrete incentives for improvements in research practice, thus allowing the widespread adoption of these improved practices.
Below, we lay out a proposal for how replications might be tracked via an online open-access system tentatively named
In a system such as Google Scholar, each paper’s reference is presented alongside the number of times that paper has been cited, and each paper is linked to a list of the papers citing that target paper. Replication Tracker would function in a similar manner, except that it would be additionally indexed by specialized citations that link papers based on one attempting to replicate the other. Thus, each paper’s reference would appear alongside not only a citation count, but an attempted replication count and information about the paper’s replicability.
Replication Tracker’s attempted replication citations are termed RepLinks.
For replications to be tracked, they must be reported. As discussed above, many replication attempts remain unpublished. Thus, Replication Tracker would be paired with an online, open-access journal devoted to publishing Brief Reports of replication attempts. After a streamlined peer review process, these Brief Reports would be published and connected to the papers they replicate via RepLinks in the Replication Tracker.
This system will ultimately form a rich dataset, consisting of RepLinks between attempted replications and the original findings. Each RepLink’s ratings would indicate the type and strength of evidence of the findings. These ratings would be aggregated, and used to compute statistics on replicability. For instance, the system could summarize the data for each paper in terms of a Replicability Score and an accompanying Strength of Evidence Score.
RepLinks must, minimally, link a replication attempt with its target paper, note whether the finding was a replication or a non-replication, and note the strength of evidence for this finding.
There are many factors that enter into these decisions. For instance, a particular attempted replication may have investigated all of the findings in the target paper, or may have only attempted to replicate some subset. The findings may be more similar or less similar as well: All effects may have successfully replicated, or none; or some findings may have replicated while others did not. In addition, whether a replication serves as strong evidence of the replicability or non-replicability of the original finding depends on the extent of similarity of the methods used, and whether the attempt had high or low statistical power.
We propose capturing these issues in two ratings. The first rating, termed the Type of Finding rating, would indicate whether the attempt constitutes a replication or a non-replication of the target paper’s findings, in whole or in part.
The second rating would be a Strength of Evidence rating, reflecting how strongly the attempt bears on the replicability of the original finding – for example, taking into account the similarity of the methods and the statistical power of the attempt.
The ratings described above involve a number of difficult determinations. Given that no two studies can have exactly identical methods, how similar is similar enough? How does one determine whether a study has sufficient statistical power, given that the effect’s size is itself under investigation?
To make these determinations, we turn to those individuals most qualified to make them: researchers in the field. Crowd-sourcing has proven a highly effective mechanism for making empirical determinations in a variety of domains (Giles,
The system also utilizes multiple moderators. These moderators would take joint responsibility for tending the RepLinks and Brief Reports (see below) on papers in their subfields. Moderators would be scientists, and could be invited (e.g., by the founding members), although anyone with publications in the field could apply to be a moderator.
In submitting and rating RepLinks, researchers may disagree with one another as to the correct Type of Finding or Strength of Evidence ratings for a given RepLink, or may disagree as to whether two papers are sufficiently similar to qualify as a replication attempt. Users who agree with an existing rating could easily second it with a thumbs-up, while users who disagree with the existing ratings could submit their own additional ratings. Users who believe that the papers in question do not qualify as replications could flag the RepLink as irrelevant (RepLinks that have been flagged a sufficient number of times would no longer be used to calculate Replicability Scores, though these suppressed RepLinks would remain visible under certain search options). These ratings would be combined using crowd-sourcing techniques to determine the aggregate Type of Finding and Strength of Evidence scores for a given RepLink (see below).
Data must be aggregated by this system at multiple levels. First, multiple ratings for a given RepLink must be combined into aggregate Type of Finding and Strength of Evidence ratings for that RepLink. Second, where a single target paper has been the subject of multiple replication attempts, the different RepLinks must be aggregated into a single Replicability Score and Strength Score for that target paper. In the same way, scores may be combined across multiple papers to determine aggregate replicability across a literature, an individual researcher’s publications, or a journal.
Aggregates need not be mere averages. How to best aggregate ratings across multiple raters is an active area of research in machine learning (Albert and Dodd,
In addition, ratings from certain users would be weighted more heavily than others, as is done in many rating aggregation algorithms (e.g., Snow et al.,
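As a minimal sketch of what such weighted aggregation might look like (the paper does not specify an algorithm; the rater-reliability weights and function below are hypothetical illustrations):

```python
# Aggregate RepLink ratings, weighting each rater by an estimated
# reliability score, as in common rating-aggregation schemes.
def aggregate_ratings(ratings):
    """ratings: list of (value, rater_reliability) pairs.
    value: +1 = replication, -1 = non-replication (Type of Finding),
    or a 0-1 Strength of Evidence rating."""
    total_weight = sum(w for _, w in ratings)
    if total_weight == 0:
        return None          # no usable ratings yet
    return sum(v * w for v, w in ratings) / total_weight

# Three raters score a RepLink as a replication; one low-reliability
# rater dissents. The aggregate stays close to "replication" (+1).
type_of_finding = aggregate_ratings([(+1, 0.9), (+1, 0.8), (+1, 0.7), (-1, 0.3)])
print(round(type_of_finding, 3))  # 0.778
```

In practice the reliability weights themselves would be learned from each rater’s agreement history, as in the machine-learning literature cited above.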
Only strict replications, not convergent data from different methods, will be tracked in the proposed system. This may seem counter-intuitive, since tracking converging results is crucial for determining which theories are most predictive. However, the goal of the proposed system is not to directly evaluate which theories are correct, but to track whether individual findings are replicable.
Registering for the system and submitting RepLinks would not require authenticating one’s identity. However, authors of papers could choose to have their identities authenticated in order to have comments on their own papers be marked as author commentaries (many RepLinks will almost certainly be submitted by authors, as they are most invested in the issues involved in replication of their own studies).
Identity authentication could be accomplished in multiple ways. For instance, a moderator could use the departmental website to verify the author’s email address and send a unique link to that email address. Clicking on that link would enable the user to set up an authenticated account under the user’s own name. Moderators’ identities could be authenticated in a similar manner.
Although any user can contribute to Replication Tracker, moderators play several additional key roles. First, they evaluate submitted Brief Reports, and submit the initial RepLinks for any accepted Brief Report. Similarly, when new RepLinks are submitted, moderators are notified and can flag irrelevant RepLinks or submit their own ratings. Thus, it is important that (a) there are enough moderators, and (b) the moderators are sufficiently qualified. In the case of moderator error, the Replication Tracker contains numerous ways by which other moderators and users can override the erroneous submission (submitting additional RepLink scores; flagging the erroneous RepLink, etc.). In order to recruit a sufficient number of moderators, we suggest allowing existing moderators to invite additional moderators as well as allowing researchers to apply to be moderators. Moderators could be selected based on objective considerations (number of publications, years of service, etc.), subjective considerations (by a vote of existing moderators), or both.
The Replication Tracker system is also ideally suited to tracking retractions. Retractions may be submitted by users as a specially marked type of RepLink, which would require moderator approval before posting. Retracted studies would appear with the tag Retracted.
The efficacy of Replication Tracker is limited by the number of published replication attempts. As discussed above, both successful replications and null results are difficult to publish, and often remain undocumented. Thus, we propose launching an open-access journal that publishes any and all replication attempts of suitable quality.
Unlike full papers elsewhere, these Brief Reports would consist of a method and results section only, omitting the extended introduction and discussion sections of a standard article.
Review of Brief Reports would be handled by moderators. When a Brief Report is submitted, all moderators of that subfield would be automatically emailed with a request to review the proposed post. The review could then be “claimed” by any moderator. If no one claims the post for review within a week, the system would then automatically choose one of the relevant moderators, and ask if they would accept the request to review; if they decline, further requests would be made until someone agreed to review. Authors would not be able to be the sole moderator/reviewer for replications of their own work. As in the PLoS model, the moderator could evaluate the technical soundness of the report, rather than its novelty or likely impact.
The presumption of the review process would be acceptance. Brief Reports would be returned for revision when appropriate, as in the case of using inappropriate statistical tests; but would only be rejected if the paper does not actually qualify as a replication attempt (based on the criteria discussed above). In the latter case, authors of Brief Reports could appeal the decision, which would then be reviewed by two other moderators. On acceptance, the Brief Report would be published online in static form with a DOI, much like any other publication, and thus be part of the citable, peer reviewed record. The appropriate RepLinks would be likewise added to Replication Tracker. As with any RepLink, these could be suppressed if flagged as irrelevant a sufficient number of times (see above). Thus, while publication in Brief Reports is permanent (barring retractions), incorporation into Replication Tracker is always potentially in flux – as is appropriate for a post-review evaluation process.
As in any literature database, users would begin by using a search function (either simple or advanced) to locate a paper of interest (Figure
The user would then click on a reference from the list to bring up more detailed information about that target paper (Figure
Information about each RepLink could be expanded, to show each individual rating along with that user’s associated comments, if any (Figure
The Replication Tracker would serve several functions. First, it would enable a new way of navigating the literature. Second, we believe it would motivate researchers to conduct and report attempted replications, helping correct biases in the literature such as the file-drawer problem. Third, it will vastly improve access to and communication regarding replication attempts. Perhaps most importantly, it would help incentivize and reward costly efforts to ensure replicability pre-publication, helping to mitigate a Tragedy of the Commons in scientific publishing.
However, in addition to these potential benefits, tracking and publishing replication attempts raises non-trivial issues, and has the potential for unintended consequences. We consider several such concerns below and discuss how they may be addressed or allayed.
The usefulness of the database for tracking replicability will be a function of the amount of replication information added to it in the form of RepLinks, metadata information, and Brief Reports. This will require considerable participation by a broad swath of the research community. Because researchers are more likely to contribute to a system that they already find useful, an important determinant of success will be the ability to achieve a critical mass of such information. We have considered several ways of increasing the likelihood that the system quickly reaches critical mass.
First, there should be a considerable number of founding members, so that a wide range of researchers are engaged in the project prior to launch. This will not only help with division of labor, but will also help clarify the many design decisions that go into creating the details of the system. The more diverse the founding group is, the more likely the final system will be acceptable to researchers in multiple fields and disciplines. This paper serves as a first step in starting the needed dialog.
Second, we suggest concentrating on first reaching critical mass for a few select subfields of psychology and neuroscience, instead of simultaneously attempting to obtain critical mass in all fields of science at once. In order to reach critical mass within the first few subfields, we suggest that prior to the public launch of Replication Tracker, founding members conduct targeted replicability reviews of specific literatures within those subfields, writing RepLinks and soliciting Brief Reports during the process. These data would be used to write review papers, which would be published in traditional journals. These review papers would be useful publications in and of themselves and would help demonstrate the empirical value of tracking replications. This would help recruit additional founders, moderators, and funding – all while major components are added to the database. Only once enough coverage of the literatures within those subfields has been achieved would Replication Tracker be publicly launched.
In addition to tracking published replications, the proposed system attempts to ameliorate the file-drawer problem by allowing researchers to submit Brief Reports of attempted replications. Several previous attempts have been made to publish null results and replication attempts (e.g., Journal of Articles in Support of the Null Hypothesis; Journal of Negative Results in Biomedicine), often with low rates of participation (JASNH has published 32 papers since its launch in 2002). Nonetheless, we believe several aspects of our system would motivate increased participation. First, the format of Brief Reports significantly decreases the time commitment of preparation, as the Reports consist of the method and results section only. Second, these Brief Reports will not only be citable, but will also be highly findable, as they will be RepLinked to the relevant published papers. Thus, we expect these Reports to have some value, perhaps equivalent to a conference paper or poster. We believe that the combination of lesser time investment and increased value will lead to increased rates of submission.
Because each paper may include multiple findings that differ in replicability, there is a good argument to be made that what should be tracked is the replicability of a given result. We propose tracking the replicability of papers instead, for several reasons.
The first reason is one of feasibility. We believe that tracking each finding separately would be infeasible: what counts as an individual finding may be subjective, and the sheer number of units of analysis, even within a single paper, would be prohibitive. An intermediate level would be to track individual experiments. However, publication formats do not always include separate headings for each individual experiment (e.g.,
Second, even organizing the system at the level of the experiment would not allow an aggregated replicability score to capture every nuance of the scientific literature. It will always be necessary for the reader to examine written information for more detail, including the full text of the RepLinked papers. For these detail-oriented readers, the proposed system provides a novel way to navigate through published work (by following RepLinks to find and read papers with attempted replications) and an efficient way to view comments on each of these papers (Figure
The rate of published replications appears to be low: For instance, over a 20-year period, only 5.3% of 701 publications in nine management journals included attempts to replicate previous findings (Hubbard et al.,
We do not believe these issues undermine the utility of Replication Tracker, for several reasons. First, the findings that are of broadest interest to the community are likely the very same findings for which the most replications are attempted. Thus, while many low-impact papers may lack replication data, the system will be most useful for the papers where it is most needed. Second, even low numbers of replications are often sufficient: because spurious results are unlikely to replicate, even a handful of successful replications significantly increases the likelihood that a given finding is real (Moonesinghe et al.,
Commenters on the present paper have suggested that since new fields may still be designing the details of their methods, and may be less sure of what aspects of the method are necessary to correctly measure the effects under investigation, their initial results may appear less replicable. In this case, using replicability scores as a measure of paper, researcher, and journal quality – one of our explicit aims – could potentially stifle new fields of enquiry.
If true, this is an important concern. We do not know of any systematic empirical data that would adjudicate the issue. However, we suspect that other factors may systematically increase replicability in new lines of inquiry. For example, young fields may focus on larger effects, with established fields focusing on increasingly subtle effects over time (cf. Taubes and Mann,
We additionally note that it is not our intention that replicability become the sole criterion by which research quality is measured, nor do we think that is likely to happen. New fields are likely to generate excitement and citations, which will produce their own momentum. The goal is for replicability rates to be considered in addition.
Commenters on the present paper have also suggested several ways in which Replication Tracker might underestimate replicability. Underestimating the replicability of a field could undermine both scientists’ and the public’s confidence in the field, leading to decreased interest and funding.
Researchers may be more motivated to submit non-replications to the system as Brief Reports, while successful replications would languish in file-drawers. We suspect that this problem would disappear as the system gains popularity: Researchers typically attempt replications of effects that are crucial to their own line of work and will find it useful to report those replications in order to have their own work embedded in a well-supported framework. Moreover, many replication attempts are conducted by the authors of the original study, who will be intrinsically motivated to report successful replications in support of their own work. Nonetheless, this is an issue that should be evaluated and monitored as Replication Tracker is introduced, so that adjustments can be made as necessary.
Another concern is that if on average the researchers that tend to conduct large numbers of strict replications are less skilled than the original researchers, this could lead to non-replications due to unknown errors. If this is the case, this issue could be compensated for in two ways. First, as Replication Tracker and Brief Reports raise the profile of replication, more skilled researchers may begin to conduct and report more replications. Second, as discussed above, there are numerous machine learning techniques to identify the most reliable sources of information. These techniques could be applied to mitigate this issue, by discounting replication data from users that have not been reliable sources of information in the past.
Since the statistical power to detect an effect is never 1.0, even true effects sometimes do not replicate. High-profile papers in particular will be much more likely to be subject to replication attempts; since some replications even of real effects will fail, high-profile papers may be unfairly denigrated. This issue is compounded if typical statistical power in that literature is low, making replication improbable.
These issues can be dealt with directly in Replication Tracker by appropriately weighting this probabilistic information. Recall that Replication Tracker provides both a Replicability Score, indicating whether existing evidence suggests that the target paper replicates, and a Strength of Evidence Score. A single non-replication – particularly one with only moderate power – is not strong evidence for non-replicability, and this should be reflected in the Strength Score. Replication attempts with low power should not be RepLinked at all. If 8 of 10 replication attempts succeed – consistent with a statistical power of 0.8 – that should count as strong evidence of replicability.
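One way to formalize this weighting – an assumed sketch, not part of the proposal itself – is a likelihood ratio: model replication outcomes as binomial, with success probability equal to the studies' statistical power if the effect is real, and to the false-positive rate α if it is not. The power and α values below are illustrative.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def evidence_ratio(k, n, power=0.8, alpha=0.05):
    """Likelihood ratio for k successful replications out of n attempts:
    P(k | effect is real, success rate = power) versus
    P(k | no effect, spurious successes occur at rate alpha)."""
    return binom_pmf(k, n, power) / binom_pmf(k, n, alpha)

# 8 of 10 successes under power 0.8 is overwhelming evidence of a real
# effect, whereas a single failed replication (k = 0, n = 1) shifts the
# evidence only modestly against it.
strong = evidence_ratio(8, 10)
weak = evidence_ratio(0, 1)
```

On this sketch, a Strength of Evidence Score could grow with the number and power of RepLinked attempts, while a lone low-power failure barely moves the Replicability Score.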
Finally, we must consider whether the changes people will make to their work will actually lead to an increased d′ (ability to detect true effects) or whether these changes will simply result in a tradeoff: researchers may eliminate some false positives (Type I error) only at the expense of increasing the false negative rate (Type II error). It is an open question whether fields like psychology and neuroscience are currently at an optimal balance between Type I and Type II error, and Replication Tracker would help provide data to adjudicate this issue. Moreover, some of the potential reforms would almost certainly increase d′, like conducting studies with greater statistical power.
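The tradeoff can be made concrete with the standard signal-detection formula, d′ = z(hit rate) − z(false-alarm rate). The following sketch uses illustrative numbers of our own choosing, not data from the paper:

```python
from statistics import NormalDist

def d_prime(hit_rate, false_alarm_rate):
    """Signal-detection sensitivity: d' = z(H) - z(FA)."""
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(false_alarm_rate)

# Merely raising the evidential bar shifts the response criterion:
# (H = 0.80, FA = 0.05) and (H ~= 0.564, FA = 0.01) yield nearly the same
# d', so the reduction in false positives is bought with more misses, not
# greater sensitivity. Reforms that increase statistical power, by
# contrast, raise d' itself.
loose = d_prime(0.80, 0.05)
strict = d_prime(0.564, 0.01)
```

In these terms, the open question in the text is whether current practice sits at the right point on the ROC curve, and which reforms move the field along the curve versus onto a better one.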
Replicability is a crucial measure of research quality; however, certain types of errors cannot be detected by such a system. For instance, data may be misinterpreted, or a flawed method of analysis may be repeatedly used. Thus, while tracking replicability is an important component of post-publication assessment, it is not the only one needed. We have suggested presenting replicability metrics side-by-side with citation counts (Figure
While it is tempting to try to build a single system to track multiple aspects of research quality, we believe that constructing such a system will be extremely difficult, as different data structures are required to track each aspect of research quality. The Replication Tracker system, as currently envisioned, is optimized for tracking replications: The basic data structure is the RepLink, a connection between a published paper and a replication attempt of its findings. In contrast, to determine the truth value of a particular idea or theory, papers should be rated on how well the results justify the conclusions and linked to one another on the basis of theoretical similarity, not just strict methodological similarity. As such, we think that such information is likely best tracked by an independent system, which can be optimized accordingly. Ultimately, results from these multiple systems may then be aggregated and presented together on a single webpage for ease of navigation.
In conclusion, we propose tracking replication attempts as a key method of identifying high-quality research post-publication. We argue that tracking and incentivizing replicability directly would allow researchers to escape the current Tragedy of the Commons in scientific publishing, thus helping to speed the adoption of reforms. In addition, by tracking replicability, we will be able to determine whether any adopted reforms have successfully increased replicability.
No measure of research quality can be perfect; instead, we aim to create a measure that is robust enough to be useful. Citation counts have proven very useful in spite of their many flaws as a measure of a paper’s quality (for instance, papers that are widely criticized in subsequent literature are often highly cited). We do not propose replacing citation counts with replicability measures, but rather augmenting the one with the other. Tracking replicability and tracking citations have complementary strengths and weaknesses: influential results may not be replicable, and replicable results may not be influential. Other post-publication evaluations, such as those described in other papers in this Special Topic, could be presented alongside these quantitative metrics. Assembling replicability data alongside other metrics in an open-access Web system should allow users to identify results that are both influential and replicable, thus more accurately identifying high-quality empirical work.
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
We contacted 100 colleagues directly as part of an anonymous Web-based survey. Colleagues of the authors from different institutions were invited to participate, as well as the entire faculty of one research university and one liberal arts college. Forty-nine individuals completed the survey: 26 faculty members, 9 post-docs, and 14 graduate students. Thirty-eight of these participants worked at national research universities. Respondents represented a wide range of sub-disciplines: clinical psychology (2), cognitive psychology (11), cognitive neuroscience (5), developmental psychology (10), social psychology (6), school psychology (2), and various inter-subdisciplinary areas.
The survey was presented using Google Forms. Participants filled out the survey at their leisure during a single session. The full text of the survey, along with summaries of the results, is included below. All research was approved by the Harvard University Committee on the Use of Human Subjects, and informed consent was obtained.
(11 cognitive psychology, 10 developmental psychology, 6 social psychology, 5 cognitive neuroscience, 2 school psychology, 2 clinical psychology, 13 multiple/other).
Total: 257; Mean: 6; Median: 2; SD: 11
(3 excluded: “NA,” “too many to count,” “50+”)
Excluding those excluded in (1):
Total: 127; Mean: 4; Median: 1; SD: 7
Excluding those excluded in (1):
Total: 77; Mean: 2; Median: 1; SD: 5
Excluding those excluded in (1):
Total: 79; Mean: 2; Median: 1; SD: 4
[comments]
Total: 48; Mean: 1; Median: 0; SD: 3
[3 excluded: “a few,” “countless,” (lengthy discussion)]
Excluding those excluded in (1):
Total: 38; Mean: 2; Median: 0.5; SD: 4
[comments]
[comments]
Total: 1312 (one participant reported “1000”); Mean: 31; Median: 3.5; SD: 154
(6 excluded: “many,” “ton,” “countless,” “30–50%?,” and 2 unreadable/corrupted responses)
Excluding those excluded in (1):
Total: 656 (one participant reported “500”); Mean: 17; Median: 2; SD: 81
[comments]
The first author was supported through the National Defense Science and Engineering Graduate Fellowship (NDSEG) Program. Many thanks to Tim O’Donnell, Manizeh Khan, Tim Brady, Roman Feiman, Jesse Snedeker, and Susan Carey for discussion and feedback.