Edited by: Jae-Ho Shin, Kyungpook National University, South Korea
Reviewed by: Haeyoung Jeong, Korean Bioinformation Center, South Korea; Rup Lal, University of Delhi, India
*Correspondence: Vineet K. Sharma
This article was submitted to Evolutionary and Genomic Microbiology, a section of the journal Frontiers in Microbiology
†These authors have contributed equally to this work.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
Several metagenomic projects have been completed or are in progress. However, in most cases, it is not feasible to generate complete genomic assemblies of species from the metagenomic sequencing of a complex environment. Only a few studies have reported the reconstruction of bacterial genomes from complex metagenomes. In this work, a Binning-Assembly approach is proposed and demonstrated for the reconstruction of bacterial and viral genomes from 72 human gut metagenomic datasets. A total of 1156 bacterial genomes belonging to 219 bacterial families and 279 viral genomes belonging to 84 viral families could be identified. Draft genome sequences that were more than 80% complete could be reconstructed for a total of 126 bacterial and 11 viral genomes. Selected draft assembled genomes could be validated with 99.8% accuracy using their ORFs. The study provides useful information on the assembly expected for a species given its number of reads and abundance. This approach, along with spiking, was also shown to be useful in improving the draft assembly of a bacterial genome. The Binning-Assembly approach can be successfully used to reconstruct bacterial and viral genomes from multiple metagenomic datasets obtained from similar environments.
Complete genome sequences of the bacteria abundant in different environmental systems are essential for uncovering the genetic diversity present on this planet. However, the fact that 98% of bacteria cannot be cultured using traditional methodologies limits their genomic sequencing with the available sequencing technologies. In this scenario, metagenomics has emerged as a culture-independent methodology to directly sequence microbial genomes from their environments. The main objective of metagenomic projects is to access the genetic information of the inherent microbes, irrespective of whether the individual complete genomic sequences are achievable. In most cases, it is not feasible to generate complete genomic assemblies of species from the metagenomic sequencing of a sample obtained from a complex environment. This is mainly due to the enormous inherent microbial diversity, which requires massive sequencing and involves substantial cost to obtain reasonable coverage for each species.
In the last decade, several large and small-scale metagenomic projects have been completed and a large number of projects are currently in progress. Due to the unprecedented improvements in next generation sequencing (NGS) technologies, the amount of data generated from recent metagenomic projects has shown a logarithmic upward trend compared to the initial metagenomic projects. As a result, sequence data from multiple samples sequenced from the same environment or from similar environments have been gradually accumulating. In this scenario, the reconstruction of bacterial genomes from a mixture of metagenomic reads obtained from multiple samples of similar origin appears feasible. At present, only a few studies have reported the reconstruction of genomes from complex metagenomic samples (Luo et al.,
The sequence data obtained from a metagenomic project contain a mixture of short reads derived from the microbial species present in that environment, but lack information on their taxonomic origin. Therefore, the first step is the taxonomic binning of metagenomic reads into their respective genomic bins. The factors influencing precise taxonomic binning include the read length (Patil et al.,
The taxonomic binning is commonly carried out using a homology-based or a composition-based approach, or a combination of these two approaches (Sharma et al.,
After carrying out the taxonomic binning, the next step is the reconstruction of genomes for which the currently used methods either perform the alignment of reads against the available reference genomes or carry out
Deep sequencing data have been used to partially assemble multiple genomes from a rumen metagenome with varying levels of completeness (Hess et al.,
Wrighton et al. reconstructed 49 genomes with varying levels of completeness from at least five phyla for which no genomic information was previously available (Wrighton et al.,
Nielsen et al. used a strategy based on binning co-abundant genes across various human gut metagenomic samples without requiring reference genome sequences (Nielsen et al.,
The objective of the present study is to propose and evaluate a combined binning and assembly approach, termed “Binning-Assembly” (BA), to construct individual bacterial genomes from metagenomic reads. Since the genomic coverage expected for a single species in a single metagenome is generally insufficient to construct a reasonable draft, the taxonomically assigned reads from multiple metagenomes have been used cumulatively. In this work, therefore, a two-step methodology is followed in which the reads are first assigned to taxonomic bins and then assembled against reference genomes to reconstruct draft genome sequences of related bacterial species present in the human gut.
High quality human gut metagenomic data for 124 individuals, generated using Illumina sequencing technology, were retrieved from the BGI website (
Assembly of reads was performed using Genovo with 50 iterations (Laserson et al.,
The next-generation sequencing reads were simulated using ART Software (Huang et al.,
BLAST Ring Image Generator (BRIG) (Alikhan et al.,
In the current study, a draft genome with 80% genomic assembly is termed as standard draft and a draft genome with 90% genomic assembly is termed as high quality draft as defined by Chain et al. (
The overall methodology of the present work is shown in Figure
For the reconstruction of genome sequences, the first task in this approach is the assignment of reads to different taxonomic bins to estimate the diversity and abundance of the various species present in the metagenome. The taxonomic assignment of the reads was performed using Kraken (with a k-mer length of 24), since it is a fast and accurate classifier for small read lengths (>100 bp) compared to the other available binning tools. Kraken could classify an average of 32.56% of the total number of reads from each metagenome at different levels of taxonomy (Figure
A total of 219 bacterial families could be identified from the 72 metagenomic datasets (Tables
Though viruses are an important component of human gut flora, their abundance and distribution in human gut has been comparatively less explored (Mokili et al.,
After the taxonomic classification of reads, different strategies for the reconstruction of bacterial genomic sequences were examined. If the reconstruction of a bacterial genome is attempted from a single metagenome, in most cases a reasonable draft assembly may not be achieved because that genome is represented by too few reads (insufficient coverage). An apparently better strategy would therefore be to combine multiple metagenomes of similar origin to increase the representation of reads of the different species present in the metagenomes. However, the resulting mixture of reads would greatly increase the data size to be handled by the computationally intensive assemblers currently in use, as well as the genomic complexity and the time required for assembly. Therefore, in the present study, instead of combining all the reads from all metagenomes, the reads belonging to a single genus were pooled from the 72 metagenomes to create a pool of reads specific to each genus, referred to as a “genus-pool” in the manuscript. The genus-pool for each genus is likely to facilitate the assembly of genomes belonging only to that genus. This reduces the chance of errors that may result from the inclusion of reads from other genera, and it also reduces the data size and the time required for the assembly process. It should be noted that assembly using a genus-pool of reads belonging to closely related species might also result in chimeric assembly. However, this can be countered by the additional step of verifying the completeness of the ORFs to validate the accuracy of the reconstructed genomes.
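The genus-pool construction described above can be sketched as follows. This is a minimal illustration and not the authors' scripts: it assumes that the classifier output of each metagenome has already been reduced to (read identifier, genus) pairs, and the function name `build_genus_pools` and the sample data are hypothetical.

```python
from collections import defaultdict


def build_genus_pools(assignments_per_metagenome):
    """Pool read identifiers by genus across several metagenomes.

    assignments_per_metagenome: dict mapping a metagenome name to a list of
    (read_id, genus) pairs; genus is None for reads that were not classified.
    Returns a dict mapping each genus to a list of (metagenome, read_id).
    """
    pools = defaultdict(list)
    for metagenome, assignments in assignments_per_metagenome.items():
        for read_id, genus in assignments:
            if genus is None:
                continue  # unclassified reads are left out of every pool
            # Keep the metagenome name so identical read IDs stay distinct.
            pools[genus].append((metagenome, read_id))
    return dict(pools)


# Hypothetical toy input: two metagenomes with a handful of classified reads.
example = {
    "sample_A": [("r1", "Bacteroides"), ("r2", None), ("r3", "Prevotella")],
    "sample_B": [("r1", "Bacteroides"), ("r2", "Bacteroides")],
}
pools = build_genus_pools(example)
```

Each resulting pool can then be handled independently, which is what keeps the per-genus alignment and assembly inputs small.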
The genus-pool of each individual genus was aligned using BWA-MEM against the complete bacterial genomes known in the respective genus (Table
99.17 | 97 | 99.94 | |||
98.89 | 92 | 99.93 | |||
98.71 | 93 | 99.95 | |||
98.69 | 97 | 99.81 | |||
97.79 | 94 | 99.56 | |||
97.14 | 94 | 99.80 | |||
95.72 | 95 | 99.84 | |||
95.07 | 93 | 99.84 |
This suggests that standard draft genomes and high quality draft genomes (as defined by Chain et al.,
In the above section, the genus-pool from 72 metagenomes was used to generate 1156 bacterial genomic assemblies. However, it would be interesting to estimate the minimum number of reads that can yield a similar percentage of assembly for a particular genome. A metagenome harbors several species whose relative abundances may vary from sample to sample. The number of reads required from each metagenome to achieve a reasonable (≥85%) assembly for a particular species is therefore likely to depend on the abundance of that species in the metagenomes. Hence, for the scenario in which multiple metagenomes of similar origin are available, the following two strategies were evaluated for the reconstruction of a genome.
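The relation between read count and genomic coverage that underlies this question follows the standard definition, coverage = (number of reads × read length) / genome size. The sketch below is illustrative only: it assumes a uniform read length, and the function names, the 100 bp read length and the 5 Mb genome size are hypothetical values rather than figures from this study.

```python
import math


def coverage_from_reads(n_reads, read_length, genome_size):
    """Average fold-coverage given a read count (uniform read length assumed)."""
    return n_reads * read_length / genome_size


def reads_for_coverage(target_coverage, read_length, genome_size):
    """Smallest number of reads that reaches the target average coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


# Hypothetical example: 100 bp reads and a 5 Mb reference genome.
needed = reads_for_coverage(25, 100, 5_000_000)
achieved = coverage_from_reads(needed, 100, 5_000_000)
```

Average coverage says nothing about how evenly the reads are distributed, so the fraction of the genome actually reconstructed at a given depth still has to be measured empirically.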
To evaluate the first strategy, incremental assembly was performed for the eight selected bacterial genomes (Figure
However, in the case of
The previous strategy demonstrated the assembly of an abundant species from a single metagenome or by pooling multiple metagenomes sequentially. In this strategy, by contrast, the average number of reads required to attain a reasonable (>85%) assembly, independent of the genus abundance, is estimated. For the selected top eight genomes, sets of reads representing 5–50 × genomic coverage, calculated according to the size of the selected reference genome, were created from the corresponding genus-pool. The sets of reads for each coverage were aligned against the reference genomes and the percentage of reconstructed reference genome was calculated (Figure
Since the number of reads assigned to any viral genus-pool was too low for performing assembly, all the reads that belonged to viruses were pooled together from all 72 datasets and were aligned against the available viral genomes (Table
Assembly of the reads for the eight selected bacterial genomes was carried out using Genovo because of its high accuracy, ability to use maximum number of reads to generate large assembled contigs, along with high N50 values (Vázquez-Castellanos et al.,
Similarly, for the top five viral genomes (one from each genus), assembly of the viral reads from all 72 metagenomic datasets showed that the percentage assembly of the respective viral genome was lower when the contigs were used for the alignment (10.5–94.5%) than when the reads were aligned directly (49.6–97.9%) (Figure
In public databases, the number of draft assemblies far exceeds the number of completely sequenced bacterial species identified from human gut and from other metagenomic datasets. The main reasons for the inability to achieve a complete genomic sequence in most cases are the absence of reads from some genomic regions after sequencing, which remain as gaps in the draft genomes, and repeated regions in a genome that are longer than the read length. The latter problem cannot be resolved without longer read lengths along with high sequencing depths. For the former scenario, however, it would be interesting to see if the human gut metagenomic reads could be used to fill these gaps and improve the draft assemblies of genomes available from human gut. To examine this hypothesis, a simulated draft genome was constructed using the complete genome sequence of
While the current work was in progress, a different approach of cumulating metagenomes to assemble new microbial species from multiple metagenomes was reported by Nielsen et al. (as described in the Introduction Section). The present work demonstrates a different Binning-Assembly (BA) approach to reconstruct bacterial genomes from multiple metagenomes. Using the BA approach, a total of 31 phyla, 219 families, 584 genera and 446 bacterial species, and 279 viral species, were identified from 72 human gut datasets, whereas the MGS approach reported the presence of 741 MGS, including bacterial and viral species, from 396 datasets. The number of reported species is higher in the latter study as it was carried out using a much larger number of datasets. The major difference between the two approaches is that in the present study reference genomes have been used to reconstruct the genomic assemblies, whereas in the study by Nielsen et al. the MGS were constructed from gut without using any reference genomes.
Out of the 1156 bacterial genomes identified in this study, >50% assembly could be achieved for 181 genomes. Furthermore, 126 bacterial genomes and 11 viral genomes could be reconstructed with >80% assembly, which supports the usability of this approach for reconstructing genomes from a metagenomic mixture of reads. The acceptance of metagenome-derived genomes may be debatable because regions of a bacterial species are assembled from a metagenomic mix of reads obtained from multiple samples of the same environment, such as the human gut in this study. Therefore, in this work, multiple steps were taken to ensure high accuracy of the reconstructed genomes. As the first step, only the reads belonging to the bacterial kingdom were selected after the taxonomic assignment of all reads by Kraken, which ensures that eukaryotic and viral reads are removed before proceeding to assembly. Furthermore, considering reads from only a single genus, by constructing the genus-pool for each genus, removes the possibility that reads from other genera are present, which makes the assembly process more specific and less complex.
It could be argued that mixing reads from multiple metagenomes to form a genus-pool might result in some chimerism during the assembly. However, for all eight resultant draft genomes in this manuscript, the high (>95%) identity of the assembled genomes with the respective reference genome attests to the accuracy of the assembly. Further verification of the assembled genomes was performed by examining the completeness of the ORFs of the reconstructed genomes, and this analysis revealed 99.8% of the ORFs to be completely present with 99.8% identity. In case of
It is apparent that the achievable percentage assembly of a genome depends upon its abundance (number of reads) in the metagenome. Strategy I reveals that an abundant genomic species can be assembled up to 91% with as little as 8 × coverage using reads from a single metagenome. Therefore, for abundant species, the reconstruction should first be attempted using only a single metagenome. In general, however, for the assembly of any genomic species (irrespective of its abundance), strategy II shows that a sequencing depth of 25 × –30 × of that species is sufficient to achieve >85% assembly of that genome, which also concurs with previous reports (Chitsaz et al.,
Promising results were also achieved by spiking the “genus-pool” of reads with the reads of a simulated draft genome belonging to that genus. The gap regions (“n”) in the simulated draft genome could be replaced by nucleotides with 98.5% accuracy, improving the assembly from 90.9 to 96.9%. This appears to be a useful strategy for improving the assembly of the incomplete draft genomes, which outnumber the completed genomes in the public databases.
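One simple way to quantify the effect of such gap filling is to compare the fraction of non-gap bases in a draft sequence before and after filling. The sketch below assumes that gaps are written as the character n or N, as in the text. The function name `percent_resolved` and the toy sequences are hypothetical, and this is not necessarily how percentage assembly was computed in the study.

```python
def percent_resolved(sequence):
    """Percentage of positions in a draft sequence that are not gap characters."""
    if not sequence:
        raise ValueError("empty sequence")
    # Count gap positions, accepting both lower- and upper-case N.
    gaps = sum(1 for base in sequence if base in "nN")
    return 100.0 * (len(sequence) - gaps) / len(sequence)


# Hypothetical toy sequences: the same 20-base draft before and after filling.
draft = "ACGTnnnnACGTACGTNNAC"
filled = "ACGTACnnACGTACGTGTAC"
before = percent_resolved(draft)
after = percent_resolved(filled)
```

This measure only reports how many gap positions were replaced; whether the inserted nucleotides are correct has to be checked separately against a known reference.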
An apparent limitation of the current approach is its dependence on the classification accuracy and efficiency of the binning algorithm. These are currently limited mainly by short read lengths and by the unavailability of reference genomic sequences in the public databases, both of which are expected to improve with time. It is to be noted that only one-third of the total reads (32.56%) could be classified into taxonomic groups using Kraken and, using only these reads, 90 high quality draft genomes with >90% assembly could be reconstructed with the BA approach. Furthermore, no improvement was observed in the assemblies after the addition of leftover reads (Text
Another limitation of this approach is its dependence on reference genomes for alignment to reconstruct the genome sequences. This limits its usability to those bacterial genomes for which a closely related reference genome is available. However, bacterial genomes are being sequenced worldwide at a rapid rate, and in this scenario the main advantage of this approach is the rapid and reliable reconstruction of strains of known species, or of closely related members of known species, which are likely to be present in different populations or environments.
The analysis presented in this study demonstrates the merits and limitations of binning and assembly based approach and thus it is likely to act as a reference for the reconstruction of bacterial genomes from metagenomic reads.
SK, AG, and VS developed the idea. AG, SK, VP, and KH performed the analysis. AG and AS developed the scripts. SK, AG, and VS wrote the manuscript.
We thank the intramural funding received from IISER Bhopal for carrying out this work. AG is a recipient of DST-INSPIRE Fellowship and thanks the Department of Science and Technology for the fellowship.
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
We thank the MHRD, Govt of India, funded Centre for Research on Environment and Sustainable Technologies (CREST) at IISER Bhopal for its support. However, the views expressed in this manuscript are those of the authors alone and no approval of the same, explicit or implicit, by MHRD should be assumed.
The Supplementary Material for this article can be found online at: