Edited by: John Van Horn, University of California, Los Angeles, USA
Reviewed by: Daniel Gardner, Weill Cornell Medical College, USA; Andrei Irimia, University of Southern California, USA
*Correspondence: Rhodri Cusack, Brain and Mind Institute, Western University, London, ON N6A 1W8, Canada e-mail:
This article was submitted to the journal Frontiers in Neuroinformatics.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
Recent years have seen neuroimaging data sets becoming richer, with larger cohorts of participants, a greater variety of acquisition techniques, and increasingly complex analyses. These advances have made data analysis pipelines complicated to set up and run (increasing the risk of human error) and time consuming to execute (restricting what analyses are attempted). Here we present an open-source framework, automatic analysis (
The last two decades have seen enormous growth in the use of magnetic resonance imaging (MRI) as a tool to understand brain function, and in the size and complexity of the datasets acquired. The number of participants in individual studies has grown for many reasons, including: the increasing availability of MRI scanners; a move from fixed- to random-effects designs (Friston et al.,
Furthermore, the neuroimaging data acquired from each participant have become richer. Whereas in the past, researchers frequently collected data using a single method, many now acquire diverse MRI protocols, including structural (e.g., T1, T2, PD), functional (echoplanar imaging; EPI), connectivity (diffusion-weighted imaging; DWI), fieldmaps (multi-echo; gradient echo) and myelination (magnetization transfer ratio; MTR) measurements in single studies. Accelerated sequences using parallel imaging (SENSE, GRAPPA, and multiband EPI) have allowed for finer temporal or spatial resolution and increased the size of datasets by up to an order of magnitude.
Alongside the increasing quantity of data, the palette of analysis methods has also grown. In functional MRI (fMRI), in addition to the standard preprocessing stages of motion correction, slice-timing correction, warping-to-template (normalization) and smoothing, denoising is now possible using tools based upon independent components analysis (Calhoun et al.,
The increasing quantity of raw data and greater number of computationally intensive analysis methods have led to two challenges. The first is an increase in the complexity of the workflows required: There are a greater number of individual “chunks” of processing, and more complex dependencies between these chunks. Furthermore, even the best-run neuroimaging study does not always proceed exactly according to plan, and there are often idiosyncrasies that result from technical glitches, operator error, or participant non-compliance. Manual intervention in this complex workflow leads to the potential for human error.
The second challenge is an increase in computation time per study. Many neuroimagers are already stretched by the need to become multidisciplinary experts in the physics of neuroimaging, the mathematics for analysis, the psychology of cognitive function, and the biology of the brain. They do not all necessarily relish the additional challenge of becoming a programmer and computer scientist so that they can make the most efficient use of computing resources.
The many stages of analysis required to draw conclusions from MRI data were once almost universally accomplished using point-and-click interfaces, a practice many continue. However, as the field matures, this sort of “manual” analysis is becoming increasingly impractical and unattractive. Here, we present a software package, automatic analysis (
Once the decision is made to use a processing pipeline, there are a number of options. Although the best solution depends a great deal on individual preferences and priorities, we have engineered
Neuroimaging benefits enormously from a dynamic software development community, with new analysis tools frequently disseminated by large teams. However, these packages focus primarily on implementing specific tools, rather than managing efficient workflows.
As neuroimaging pipelines become increasingly complicated, it becomes important to develop elegant ways of describing them. With
To make it easier to identify the code that is responsible for a given task, and to facilitate parallel computing, each stage of processing is described by an encapsulated “module.”
A separation is enforced between the algorithms that should be applied and the data (i.e., participants and sessions) on which they should operate. This separation ensures that modules are re-useable: once written in the context of one analysis, modules may usually be re-used without modification in another analysis of different data.
Modules are never called directly by the user; instead, their execution is handled by the
Checking for previously-completed stages also facilitates complex pipelines with multiple analysis pathways. For example, in the case where all processing stages save one are identical (e.g., to compare preprocessing with and without slice-timing correction),
As analyses become more computationally intensive, being able to easily accelerate them across a cluster of machines is increasingly important. Often, execution time determines which analyses a user is prepared to attempt. For example, even if an analysis runs in a single-threaded manner in a practical amount of time (say 5 days), a user will be strongly discouraged from running it again to fix some small issue.
A precise record of everything that has happened in an
One of the drawbacks of batch analysis is that a user may be tempted to only look at the final results, and not inspect the data at each stage of processing. However, complex analysis pipelines can fail in a greater number of ways than simpler pipelines. Some failures can be obvious (e.g., activation outside the brain due to imperfect registration), while others are harder to track down (e.g., weaker group activation detected due to high between-subject variability caused by motion). Consequently, inspection of data is as important as ever. Several existing solutions generate some diagnostic data during the analysis (e.g., FSL's FEAT Pre-stats and Registration reports); however, the information provided is limited, sometimes complicated to reach, and almost never submitted to between-subject analysis (important for the measurement of between-subject variance and outlier detection).
To address this problem, many
As a processing pipeline,
This manuscript describes
The core of an
This script executes a typical fMRI processing pipeline (discussed more in the next section) on a single subject (CBU110000) for a single session (imaging series 6, labeled “movie”).
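A minimal user script of this kind might look roughly as follows; the function names (aarecipe, aas_addsubject, aas_addsession, aa_doprocessing) come from the aa distribution, but the exact argument forms shown here are illustrative assumptions rather than a verbatim listing:

    % Build the aap structure from a tasklist recipe
    aap = aarecipe('aap_tasklist_typical_fmri.xml');

    % One subject (CBU110000) with one fMRI session (series 6, labeled "movie")
    aap = aas_addsubject(aap, 'CBU110000', [6]);
    aap = aas_addsession(aap, 'movie');

    % Run the pipeline
    aa_doprocessing(aap);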
The user script can set parameters, such as output paths, or settings for modules. Here, three dummy scans are specified to be ignored in the analysis by the line:
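The field name shown here is an assumption based on aa's acquisition-details settings:

    % Discard the first three (dummy) EPI volumes of each session
    aap.acq_details.numdummies = 3;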
Note that the entire analysis—comprising the set of tasks to be run and the data they are to be run on—is described in a single structure (the “aap” variable). It is initially constructed by the
The tasklist is an XML format file that describes what should be done. A number of tasklists are available, many of which are useful without modification (Table
Tasklist | Description |
aap_tasklist_typical_fmri.xml | fMRI preprocessing and first/second level statistics |
aap_tasklist_fmri.xml | fMRI preprocessing and first/second level statistics—variant using fieldmaps, realignunwarp. |
aap_tasklist_dartelvbm8.xml | VBM with SPM8 and DARTEL |
aap_tasklist_diffusion.xml | Diffusion tractography with FSL |
aap_tasklist_diffusion2.xml | Nonlinear DTI and DKI |
aap_tasklist_freesurfer.xml | Structural processing with Freesurfer |
Each tasklist describes a series of modules that should be executed. In the example user script given above, the tasklist specified was
There are two sections to this simple tasklist. The “initialisation”
Note also that the dependencies (that is, which pieces of data act as the input to each module) are not usually explicitly specified in the tasklist. Instead, the pipeline is automatically connected up at the start of processing using information in each module's interface. This simplifies specification of tasklists, and allows modules to be reordered with reduced potential for error. The dependencies are reported at the start of an analysis.
The example of an output file tree for an
The name of the directory for the analysis is specified in:
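A sketch of the relevant settings, with field names assumed from the aa parameter defaults:

    % Root under which analyses are written, and the name of this analysis
    aap.acq_details.root = '/imaging/myproject';
    aap.directory_conventions.analysisid = 'movie_analysis';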
Each module operates on data stored in a separate directory (e.g.,
These ease-of-use and aesthetic advantages come along with more fundamental benefits. Partitioning the workspace of modules into separate directories facilitates the encapsulation of data. The
Here, both modules had the suffix _00001. If either module were present more than once in a tasklist (e.g.,
Note that this architecture does not restrict the level at which a module can operate. That is, if data for all sessions and subjects are needed to complete an analysis, they will all be copied to the appropriate directory. However, as this is more often the exception than the rule, on the whole
At the heart of every
Scan input DICOM files to get series and acquisitions irrespective of filenames, which are typically site-specific. Identify structural and fieldmap series numbers. |
Find all DICOM files corresponding to the structural acquisition. |
Coregister an individual's structural to a standard space template using a rigid body transformation, which improves robustness of later normalization stage. |
Estimate nonlinear warp that will transform an individual subject's space into a standard template space (SPM normalization). |
Apply normalization parameters derived from structural to EPIs. |
Run New Segment (introduced in SPM 8) and save bias-corrected image (e.g., for segmenting). |
Tissue class segmentation using New Segment (SPM 8). |
Retrieve total tissue class volume and TIV from segmented images. |
Use DARTEL to create a template. |
Write DARTEL-warped images to MNI space. |
Divide segmented images by total gray matter (proportional scaling). |
Apply normalization parameters derived using DARTEL to other modalities (e.g., EPI, contrasts, DWI, ROIs). |
Transform images in standard MNI space (e.g., atlas labels) into native space based on normalization parameters derived using DARTEL (multimodal). |
Prepare for a Freesurfer analysis. |
Defaces structural (T1) and produces a mask. |
Apply defacing mask to a co-registered image. |
Runs a Freesurfer pipeline with recon-all. |
Use FAST (FSL) for segmentation. |
Use FIRST (FSL) to characterize structure shape. |
Create transformation matrix for ANTS normalization to study template. |
Apply inverse warp to ROIs. |
Apply warp to first level contrasts. |
Find all DICOM files corresponding to the EPI acquisitions. |
Convert the DICOM files to NIfTI format. Handles multi-echo EPI with various weighting schemes. |
Perform motion correction with SPM. |
Slice timing correction with SPM. |
Applies to the EPIs the transformation derived from coregistering the structural to a standard-space template (in aamod_coreg_extended_1). Then, fine-tunes the registration of the EPI to the structural with a further coregistration. |
Coregisters structural to mean EPI using SPM. |
Smooth data. |
Use fieldmap (with phase and magnitude) to correct EPI distortions. |
Realign and unwarp from SPM. |
Constrained nonlinear coregistration. |
Run first level statistical model. Simple specification of events in user script. |
Run first level contrasts. Simple specification of contrasts. |
Run a |
Run repeated measures (across subjects) one-way ANOVA. |
Calculate seed-to-seed connectivity matrix from relationship of time-courses across seed regions. |
Extract ROI timeseries after first level analysis. |
Prepare PPI regressors based on ROI timeseries. |
Run individual or group tensor ICA. |
High-pass filter fMRI time series using discrete cosine model, like SPM. |
Calculate mean time course for each voxel across subjects. |
Calculate correlation of each subject's timecourse with mean. |
Statistics to find which correlations are significant across subjects. |
Get a list of all of the DICOM files that correspond to the diffusion series (typically, as identified by aamod_autoidentifyseries_timtrio). |
Convert diffusion images from DICOM to NIfTI. |
Convert diffusion images from 3D to 4D. The XML file is 'aamod_3dto4d_diffusion.xml', which refers to the Matlab file (using mfile_alias) 'aamod_3dto4d.m'. |
Use eddy_correct (FSL) to correct image distortions, head movements using affine registration to a reference volume (T2 image). |
Use FSL to extract the reference image(s) (T2 image with |
Use FSL's brain extraction tool (BET) to extract the brain from the nodif image. Its “mfile” is aamod_bet. |
Use FSL to fit a diffusion tensor model at each voxel. Note that dtifit is not necessary in order to run probabilistic tractography (which depends on the output of BEDPOSTX). |
Fit diffusion kurtosis parameters using linear model. |
Fit diffusion tensor parameters using nonlinear model. |
Coregister structural to diffusion image (dti_FA). |
Use SPM to “unnormalize” the seeds (i.e., apply the inverse matrix to transform the seed (MNI space) to diffusion space). |
Use SPM to “unnormalize” the targets (i.e., apply the inverse matrix to transform the targets (MNI space) to diffusion space). |
Use FSL to apply bedpostx Monte Carlo modeling of PDFs of diffusion parameters. |
Use FSL to apply probtrackx, which repetitively samples from the distributions on voxel-wise principal diffusion directions, each time computing a streamline through these local samples to generate a probabilistic streamline or a sample from the distribution on the location of the true streamline. |
Get the results of probtrackx (diffusion space) of each participant, merge the different splits and transform them to the MNI space. |
Averages the seed-to-target connectivity images across subjects, which we have used for visualization. |
Runs an MVPA searchlight on a set of beta or |
Convert results from searchlight into NIfTI images readable in SPM. |
Set ROIs from standard space into subject space. |
Runs an MVPA analysis within an ROI, using a set of beta or |
Each module requires two files: an XML interface (e.g.,
One of a module's most important properties, specified in this XML interface, is the “domain” at which it operates. Modules with a domain of “study” are called just once (i.e., a single instance is created each time the module occurs in the processing pipeline). Modules with a domain of “subject” are called once for each subject, while modules with a domain of “session” are called once for each session of each subject. These are the three most common module domains; others include diffusion_session, meg_session, and hyperalignment_searchlight_package. However, new domains can be easily added to the
Instances of a module should restrict their processing to a particular set of input data (i.e., for a given session-domain module, there might be an instance for subject 3, session 2). This instance should take care to only attempt to process this portion of the data, and should never attempt to write data outside its domain (in this example, to another session).
Other important properties of a module are the type of data (e.g.,
An example interface file,
The domain is specified in the attributes of the “currenttask” line, along with a description (which is displayed to the user) and the modality of the data—here “MRI.”
The next two sections are of less relevance here. The “qsub” fields are estimates of the resources used by this module, for use by some parallel schedulers. The “permanenceofoutput” field is used by the garbage collection tool to delete less important, intermediate data prior to archiving. Higher numbers correspond to more important data.
More central to the function of this particular module, the “FWHM” field describes a setting of this module—in this case, the full-width half maximum of the smoothing kernel, in millimeters. There is then a description of the sorts of input data (or “streams”) that this module requires, here only “epi” data, and the output data, again just “epi” for this module. The operation of these is discussed more in the next section. The Matlab code for a module implements the function.
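As a rough illustration (a simplified sketch rather than a verbatim module from the toolbox), the Matlab side of a smoothing-style, session-domain module follows a standard pattern: it receives the aap structure, a task command, and indices identifying the instance, retrieves its input stream, does the work, and then describes its outputs:

    % Illustrative skeleton of a session-domain module (simplified)
    function [aap, resp] = aamod_smooth_example(aap, task, subj, sess)
    resp = '';
    switch task
        case 'doit'
            % Fetch this instance's input stream (EPI images for this subject/session)
            imgs = aas_getfiles_bystream(aap, subj, sess, 'epi');

            % Settings declared in the XML interface are available to the module
            fwhm = aap.tasklist.currenttask.settings.FWHM;

            % ... smooth imgs with a kernel of width fwhm (e.g., via spm_smooth) ...

            % Register the resulting files as this module's output stream
            aap = aas_desc_outputs(aap, subj, sess, 'epi', imgs);
    end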
In the
The values in this
The file
The tasklist XML file (here
The XML interface files for each of the modules in the tasklist.
The values returned by the
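For example, a module setting can be overridden directly in the user script; the tasksettings field path shown here is an assumption about how aa exposes per-module parameters:

    % Override the smoothing kernel (in mm) for the aamod_smooth stage;
    % repeated instances of a module would be indexed, e.g., aamod_smooth(2)
    aap.tasksettings.aamod_smooth.FWHM = 8;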
Alternatively, it is sometimes more convenient to create modified XML files. XML tasklists may set parameters for an individual instance of a module, with syntax like this:
It is also possible to create XML files that inherit the parameters from the standard files and override a few of them. For example, one can create a site- or study-specific version of
in which most of the settings are imported from
SPM defaults are a special case. These can be modified in the
For users who wish to analyze fMRI data with
where:
For example,
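a call along the following lines (the aas_addevent argument order and the onset values shown are illustrative assumptions):

    % Block design: regressor "VisualStimulus", onsets every 15 scans,
    % duration 7.5 scans, for every subject ('*') and every session ('*')
    aap = aas_addevent(aap, 'aamod_firstlevel_model', '*', '*', ...
                       'VisualStimulus', 0:15:270, 7.5);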
This specifies that every session of every subject was a block design, with a regressor titled “VisualStimulus,” onsets every 15 scans, and a duration of 7.5 scans.
Using the “subject” and “session” fields, customized designs for each subject and/or session may be specified.
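For instance, a hedged sketch of an event specified for a single subject and session (the subject identifier and onsets here are hypothetical):

    % Event onsets specific to subject CBU110000, session "movie"
    aap = aas_addevent(aap, 'aamod_firstlevel_model', 'CBU110000', 'movie', ...
                       'VisualStimulus', [0 30 60 90], 7.5);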
A contrast may then be specified with
where:
For example,
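a contrast specified along these lines (the aas_addcontrast argument order and the contrast name shown are assumptions):

    % Contrast the first column against the second, applied identically
    % to every session of every subject
    aap = aas_addcontrast(aap, 'aamod_firstlevel_contrasts', '*', ...
                          'sameforallsessions', [1 -1], 'Stim_gt_Baseline', 'T');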
This contrasts the first vs. the second column of every session in every subject.
If the desired second level model is to run a simple
All data into and out of an instance of a module are managed by the
A module's interface (XML file) describes the data streams that it requires as an input:
and what it produces as an output:
This information is then used to connect up the pipelines of data from one module to the next. So, for example, if a tasklist contains:
The module
A complexity that is largely hidden from the user is that dependencies are calculated at the level of particular instances of a module, and are affected by the domains at which the source and target modules operate. Consider this fragment of a tasklist:
Both
The
This executes an
To test whether an instance of a module needs to be executed,
Ultimately, regardless of the scheduling mechanism, instances of modules are run by calls to the
Neuroimaging studies frequently require data to be analyzed in different ways. This might be because there is some uncertainty on the ideal parameters or analysis strategy (for example, whether motion correction should be performed before or after slice timing correction, or what smoothing kernel should be used). Alternatively, it might be because the data are to be analyzed in a number of different ways—with ICA, with conventional univariate fMRI, with MVPA, and with functional connectivity
Traditionally, these scenarios would probably involve either creating entirely independent pipelines, or processing to the branch point, making a copy of the analyzed data in a different directory, and then taking the new analysis forwards. By contrast,
In this command,
By default, the input for a stream to a module comes from the last module in the tasklist that outputs that kind of data. Often, this is the desirable behavior. However, sometimes, an explicit earlier reference may be desired. This can be achieved with a fully qualified stream reference comprising
There are at least two ways a user may customize
Another way is to create a customized
A user must prepare raw data in a form acceptable for input to
In a user's tasklist (or later, as a site-specific configuration) the dicomfilter can be set, typically to one of:
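One such setting, with the field name assumed from the aa parameter defaults, might be:

    % Match Siemens-style DICOM filenames; '*.dcm' is another common choice
    aap.directory_conventions.dicomfilter = '*.IMA';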
For any tasklist, setting the first main module to
Provided researchers use a consistent name for their structural scans, these scans can be automatically identified by setting:
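Two settings along these lines would do it (field names assumed from the aa defaults):

    % Automatically identify the structural series, and name the protocol to match
    aap.options.autoidentifystructural = 1;
    aap.directory_conventions.protocol_structural = 'MPRAGE';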
The first line requests automatic scanning for the structural (the default), and the second, which protocol should be sought. If a user sometimes acquires more than one structural (for example, if a subject moves) but always stops once they have a good one, it is possible to specify that in this circumstance the last structural is the one to be used:
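Again as a hedged sketch of the relevant option name:

    % If several structurals were acquired, keep only the last one
    aap.options.autoidentifystructural_chooselast = 1;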
A second alternative is to use data already converted into NIfTI format. This is possible, either by using the
It is often the case that a researcher will want to analyze a subset of data from a larger database, or continue an analysis that exists in a different location (i.e., a remote location). For example, a lab might store and preprocess all their subject MRI data—fMRI, structural images, and diffusion images—on a central server, but one user might want to only analyze the fMRI data from a few subjects on their local machine.
Authors Rhodri Cusack, Annika C. Linke, Conor J. Wild and colleagues at the Brain and Mind Institute are actively developing for
In addition to authors Tibor Auer and Daniel J. Mitchell, a handful of other coders in the Unit also actively participate in developing
Author Alejandro Vicente-Grabovetsky and colleagues in the Doeller laboratory are actively developing for
Author Jonathan E. Peelle and his laboratory are developing structural and functional MRI analysis for Siemens 3T data.
The codebase is maintained at:
There are two main branches: the
A website (
Our software provides access to most functions of SPM, one of the most commonly used neuroimaging tools worldwide, for analyses such as fMRI modeling and voxel-based morphometry. For several diagnostics in general and DWI analysis we use the well-established FSL functions, and for cortical-surface based measures, Freesurfer.
Every processing approach has limitations, and
Like most pipelines that serve as interfaces to other tools,
Another consequence of automated pipelines such as
Finally, there is always the danger when using automated batch analysis pipelines that the researcher might try every possible combination of analysis tools and parameters—so-called “experimenter degrees of freedom”—to obtain the desired results. This is not a new problem in neuroimaging, but
Despite these possible limitations, we believe that
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
The
1. British spellings are used throughout
2. Of course, care must be taken when trying out multiple analysis options, and exploration is best done on independent data so as not to bias the results. Our point is that there are many instances in which researchers might reasonably want to compare analysis strategies in a systematic way, which