Edited by: Jessica A. Turner, Mind Research Network, USA
Reviewed by: Angela R. Laird, Research Imaging Institute, The University of Texas Health Science Center, USA; Jason Steffener, Columbia University, USA; David N. Kennedy, University of Massachusetts Medical School, USA
*Correspondence: Michael P. Milham, Center for the Developing Brain, Child Mind Institute, 445 Park Avenue, New York, NY 10022, USA. e-mail:
This article was submitted to Frontiers in Brain Imaging Methods, a specialty of Front. Neurosci.
This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in other forums, provided the original authors and source are credited and subject to any copyright notices concerning any third-party graphics etc.
The neuroimaging community has been increasingly called upon to openly share data. Although data sharing has been a cornerstone of large-scale data consortia, the incentive for the individual researcher remains unclear. Other fields have benefited from embracing a data publication form – the data paper – that allows researchers to publish their datasets as a citable scientific publication. Such publishing mechanisms both give credit that is recognizable within the scientific ecosystem and ensure the quality of the published data and metadata through the peer review process. We discuss the specific challenges of adapting data papers to the needs of the neuroimaging community, and we propose guidelines for their structure as well as for the review process.
Recent years have witnessed renewed calls for data sharing
Given the continued challenges of “publish or perish,” tenure, and funding in a worsening economic and funding environment, the cost-benefit ratio has become central to any discussion of sharing (Fienberg et al.,
Numerous models have attempted to incentivize participation in data sharing, with varying success (NIH,
Central to this discussion is whether the data generator receives credit through authorship in publications by data users. In this regard, the field has noticed a clear divide, with consortia such as ADNI requiring explicit co-authorship on any manuscript generated with their data (ADNI,
In the search for solutions, some imaging researchers are calling for consideration of “data papers” as a means for increasing the potential benefits of data sharing for the individual investigator (Breeze et al.,
Despite the fact that this form of publishing data has been common in other fields (Ecological Archives: Data Papers, Supplements, and Digital Appendices for ESA Journals
In sum, while the concept of data papers is of clear interest to the imaging community, the infrastructure for supporting their publication and dissemination is lacking. We feel this need from the experiential vantage of both data generators and users, and from having played key roles in the FCP/INDI consortia. In bringing these perspectives to bear on the proposed aims of data papers, the goals of the present work are twofold. First, we explore the mechanics of data papers and provide minimum standards, with the hope of making the data paper concept more concrete and tangible for the community. Second, we review open questions and potential pitfalls for data papers, and offer solutions toward making data sharing appropriately rewarding at all levels of participation.
Comprehensive and detailed specification of data samples is a prerequisite for the utility of data papers to be realized. Data papers should provide the data user with the information required to understand the data set at a level of detail known to the data generators. Without such requirements for thoroughness in description of the data generation process, we run the risk of misunderstood, and thus misused data sets (Gardner et al.,
Study Overview
  Explicit goals for creation of data sample
  Guiding principles in study design
Participants
  Sample size
  Recruitment strategy
  Inclusion and exclusion criteria
  Sample demographics and characteristics
  Informed consent methodology
Experimental Design
  Study type (e.g., cross-sectional vs. longitudinal)
  Study timeline
  Study workflow
  Outline of scan session(s)
  Task and stimulus descriptions, code used for presentation (when applicable)
  Instructions given to the subjects
  Description of data not included in shared data sample
Phenotypic Assessment Protocol
  Demographics
  Phenotypic assessment protocol (when applicable)
  Diagnostic assessments protocols (when applicable)
  Qualifications of research staff collecting data (including measures of rater reliability when appropriate)
Scan Session Details
  MR protocol specification describing the order, type, purpose, and acquisition parameters for each scan
  Conditions for each scan [e.g., eyes open/closed for resting state fMRI (R-fMRI), watching movie, or listening to music for structural, stimuli for task fMRI – see experimental design]
Data Distribution
  Distribution site
  Distribution type (database, repository, local ftp)
  Imaging data formats (e.g., NIFTI, DICOM, MINC)
  Imaging data conventions (e.g., neurological vs. radiological)
  Phenotypic data key
  Missing data
  License
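As an illustration only, the checklist above could be encoded in machine-readable form so that authors and journals can flag omissions before review. The sketch below is not part of the proposal itself: the shorthand keys, the `REQUIRED_ITEMS` layout, and the `missing_items` function are all hypothetical names, and items marked "(when applicable)" are left out for simplicity.

```python
from typing import Dict, List

# Hypothetical shorthand keys; each stands for one item in the checklist above.
# Items marked "(when applicable)" in the checklist are omitted here.
REQUIRED_ITEMS: Dict[str, List[str]] = {
    "Study Overview": ["goals", "design_principles"],
    "Participants": [
        "sample_size", "recruitment_strategy", "inclusion_exclusion_criteria",
        "sample_characteristics", "informed_consent",
    ],
    "Experimental Design": [
        "study_type", "timeline", "workflow", "scan_session_outline",
        "subject_instructions", "data_not_shared",
    ],
    "Phenotypic Assessment Protocol": ["demographics", "staff_qualifications"],
    "Scan Session Details": ["mr_protocol", "scan_conditions"],
    "Data Distribution": [
        "site", "distribution_type", "imaging_formats", "imaging_conventions",
        "phenotypic_data_key", "missing_data", "license",
    ],
}


def missing_items(description: Dict[str, Dict[str, str]]) -> Dict[str, List[str]]:
    """Return, per section, the checklist items that are absent or blank."""
    gaps: Dict[str, List[str]] = {}
    for section, items in REQUIRED_ITEMS.items():
        provided = description.get(section, {})
        missing = [item for item in items if not provided.get(item, "").strip()]
        if missing:
            gaps[section] = missing
    return gaps


if __name__ == "__main__":
    # Placeholder draft: values only point to where the text would live.
    draft = {
        "Study Overview": {
            "goals": "see manuscript section 1",
            "design_principles": "see manuscript section 1",
        },
        "Participants": {"sample_size": "see manuscript section 2"},
    }
    for section, items in missing_items(draft).items():
        print(f"{section}: missing {', '.join(items)}")
```

A check of this kind only confirms that each item has been addressed; judging whether the description is adequate remains a task for peer review.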
Successful use of shared data requires a clear and comprehensive description of the study’s motivation and design. Similarly, it is important to be aware of alternative designs that were considered, and the rationale for their rejection. Without such knowledge, user analyses and interpretations may be susceptible to biases (e.g., recruitment and sampling) that limit the validity or generalizability of findings. For example, in studies of psychiatric populations characterized by heterogeneity (e.g., Attention Deficit Hyperactivity Disorder, schizophrenia, autism), some investigators may bias their samples in favor of one behavioral profile or clinically defined subtype over another due to the specific hypotheses to be tested, or relative ease of recruitment (e.g., when one subtype has a higher prevalence than another). Others may take all volunteers (i.e., opportunistic sampling) or use an equal representation of subtypes, either treating the population as unitary or analytically accounting for the heterogeneity.
Increasingly, researchers are sharing data that were obtained as part of a larger study. Although this practice is well justified and encouraged, biases can be introduced by the other data obtained as part of the effort, and these need to be taken into account. An obvious example comes from R-fMRI studies, where scans are often “tacked onto” the end of ongoing task-based activation studies. Failure to provide users with information about task-activation scans included prior to the occurrence of a resting scan in a given study can result in systematic biases of R-fMRI metrics related to the nature of the task performed (McIntosh et al.,
Data sharing and open science are two related but distinct phenomena. Individuals can choose to share their data with a limited set of collaborators or the broader community. Even for datasets intended for open access, it is possible that data usage agreements must first be used to protect participant privacy. Importantly, publication of a data paper regarding a restricted dataset could share useful insights into experimental design and/or potentially motivate members of the scientific community to approach the generating authors and seek collaboration (Gardner et al.,
In the past, authors in journals such as “Science”
Despite calls for consensus among ethics boards regarding standards for sharing, marked variation remains. The FCP/INDI efforts provide a clear example of the variability among ethics boards in their decision-making. Built upon the premise of full anonymization of data in compliance with the 18 protected health identifiers specified by the Health Insurance Portability and Accountability Act (HIPAA) privacy rules, the FCP/INDI efforts do not require prior consent by participants for sharing. While the vast majority of ethics boards supported this decision, either deeming the de-identified data to no longer be human research data, or arguing that the benefits outweigh the minimal risks of sharing, some did require re-consenting. Importantly, not all data can meet the standards of the HIPAA privacy rules (e.g., community-ascertained samples such as the NKI-Rockland Sample, where the county of residence is known for all participants). Even if fully de-identified, concerns can exist for the sharing of data sets with comprehensive medical information or potentially incriminating information (e.g., substance use), increasing the risks associated with data sharing (Ohm,
A major motivating force for data papers is to ensure that all parties involved in a research endeavor receive proper credit. Unfortunately, a common tension in the preparation of any research report is the determination of individuals meriting authorship (Pearson,
Central to the success of data papers is high-quality peer review on par with that of more traditional research report formats. Failure to ensure adequate reporting of the goals, design, and description of the sample, or of the quality control and fidelity checks employed during acquisition, will rapidly undermine the data sharing process and the credibility of data papers. Similarly, if a reviewer is not able to directly review the data distribution prepared by the author and verify the readiness of the shared sample for distribution, the process will be compromised. Reviewers should not be expected to manually check each data item to be shared; such an expectation would likely discourage reviewers from participating in the data paper process. Reviewers can be expected to assess the appropriateness of the described data preparation and distribution mechanisms, as well as the venue(s) selected for sharing and the readiness of the reported data for sharing upon acceptance of the data paper. The reviewers selected for a data paper should include individuals capable of evaluating the study design and collection procedures for a given sample, as well as a person versed in sharing procedures who can evaluate the venue. Journals should provide reviewers with explicit data paper standards and checklists to ensure a minimum standard for data papers.
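One way to help a reviewer verify the readiness of a distribution without inspecting every item by hand is an automated comparison of the shared files against a manifest supplied by the authors. The sketch below is a minimal illustration under stated assumptions, not a mechanism described in this article: the manifest format (relative path mapped to a SHA-256 digest) and the names `sha256_of` and `check_distribution` are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_distribution(root: Path, manifest: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare files under `root` with a manifest of {relative path: sha256}.

    The manifest format is an assumption made for this sketch.
    """
    report: Dict[str, List[str]] = {"missing": [], "mismatched": [], "unlisted": []}

    # Files the manifest promises: flag those absent or altered.
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file():
            report["missing"].append(rel_path)
        elif sha256_of(target) != expected:
            report["mismatched"].append(rel_path)

    # Files present on disk that the manifest does not mention.
    for path in sorted(root.rglob("*")):
        if path.is_file():
            rel_path = path.relative_to(root).as_posix()
            if rel_path not in manifest:
                report["unlisted"].append(rel_path)
    return report
```

A report with three empty lists indicates only that the distribution matches what the authors declared; it says nothing about the scientific quality of the data, which still depends on the reviewer's assessment of the described procedures.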
As demonstrated by the various FCP and INDI releases, the need for data updates and corrections post-release is a reality
A relevant question is whether publication in a subscription journal works against the collaboration-promoting goals of data papers. Although a valid point, such an assertion likely minimizes the complexities of publication, including cost-sharing. For example, while open access journals are increasingly touted for the removal of barriers to readers, the cost is shifted to authors, which can be prohibitive for some. Also, many subscription journals have open access options, equating them to open access journals if authors are willing to pay the fee. Additionally, as already noted, data papers should not necessarily be limited to open access datasets. In the end, it would appear that open access publications would gain maximal exposure and citations (Eysenbach,
Currently there is no centralized mechanism for hosting data, and it is unclear whether or not the field would be willing to accept such an entity. Efforts are currently underway to create the possibility of a federation among data sharing resources (e.g., INCF dataspace)
Although they are not a necessity for initiating the data paper process, neuroinformatics tools hold great potential, over time, for facilitating rapid generation and review of data papers, as well as the preparation of datasets for sharing. With respect to the generation of data papers, sophisticated database systems such as COINS (Scott et al.,
While we have aimed to describe how data papers could benefit the neuroimaging research community, they are not immune to more general problems that are present across the current publishing landscape. For instance, a common challenge for any large-scale effort is determining whether to report the work as one single project, or to break it into two or more smaller studies for the purpose of increasing the speed or number of publications. Similar concerns may arise in data papers, where researchers may be tempted to divide a design into pieces, and to report on sub-cohorts separately. Such practices should be discouraged at all levels of the publication process, as only complete data publication provides a clear understanding of the intended experimental design and decisions made.
A key concern is that data papers may fall outside what is typically valued by the academic community as “scholarly” contributions. Appropriate citation of data papers can make it possible for researchers to receive credit for data sharing via publication-based metrics of impact. However, creation of alternative metrics that directly assess data generation productivity via data paper publication and citation will likely prove to be more effective in drawing attention to the value of data sharing on biosketches and curriculum vitae, which are currently deficient in this regard. In the bigger picture, funding and academic institutions will have to work on the development of formal policies for promoting and rewarding investigators who generate data of use to the scientific community. Some mechanisms do currently exist. For example, the “broader impact” portion of National Science Foundation (NSF) applications allows reviewers to reward those proposing designs of value to the larger community, beyond their own goals. Data papers can help provide a track record to support the promises of investigators in sharing the data generated. Additionally, inclusion of funding acknowledgments in data papers provides researchers an opportunity to gain proper recognition for their efforts, and will help researchers document compliance with data sharing mandates. Although helpful, these mechanisms are not sufficient to fully reward researchers for their data sharing efforts in funding decisions, and they do little to assist with major areas of concern, such as tenure decisions and fulfillment of departmental obligations.
Data papers hold the potential to provide data-generating researchers the much deserved recognition for their efforts in study design, execution, and maintenance, while at the same time increasing their motivation to share and collaborate with others through the publication process. The present work detailed minimum requirements to ensure the completeness and utility of data papers, and provided initial insights into controversies or questions that may arise for authors, reviewers, publishers, and data users in the process. A key remaining challenge is the reform of academic and funding institution practices to increase recognition of the need for, and scientific merits of, the generation of high quality data for the field by those best suited for the task. We hope that the development and refinement of neuroinformatics tools will facilitate and further motivate the data paper process in the future.
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
We would like to thank Blaise Frederick, Satrajit Ghosh, Caitlin Hinz, David Kennedy, Nolan Nichols, Zarrar Shehzad, Jessica Turner, and Sebastian Urchs for comments and discussion. Financial support for the work by Michael P. Milham was provided by NIMH awards to Michael P. Milham (BRAINS R01MH094639, R03MH096321) and gifts from Joseph P. Healey and the Stavros Niarchos Foundation to the Child Mind Institute.
1All data sharing efforts discussed in this article concern raw data; derived data poses unique issues beyond the scope of the current proposal.