Edited by:
Reviewed by:
*Correspondence:
This article was submitted to Genomic Assay Technology, a section of the journal Frontiers in Genetics.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
Next generation sequencing (NGS) technologies, primarily based on massively parallel sequencing, have touched and radically changed almost all aspects of research worldwide. These technologies have allowed for the rapid analysis, to date, of the genomes of more than 2,000 different species. In humans, NGS has arguably had the largest impact. Over 100,000 genomes of individual humans (based on various estimates) have been sequenced allowing for deep insights into what makes individuals and families unique and what causes disease in each of us. Despite all of this progress, the current state of the art in sequence technology is far from generating a “perfect genome” sequence and much remains to be understood in the biology of human and other organisms’ genomes. In the article that follows, we outline why the “perfect genome” in humans is important, what is lacking from current human whole genome sequences, and a potential strategy for achieving the “perfect genome” in a cost effective manner.
Next generation sequencing (NGS) technologies, primarily based on massively parallel sequencing (MPS;
Massively parallel sequencing technologies continue to advance in efficiency with recent improvements allowing “single pixel per spot” imaging on patterned DNA arrays to generate many terabases of read data per sequencing run (
Within the approximately three billion DNA letters inherited from each parent and the sixteen thousand maternally inherited mitochondrial letters (about 6 billion total) exists a program for the development and adaptive functioning of all of our tissues. As such, it is reasonable to believe that perfect analysis of the human genome will become the ultimate genetic test (
Perfect WGS or the “perfect genome” is free of errors, meaning there is no need to validate medically relevant variants using some orthogonal sequencing method. DNA bases are not skipped nor are additional bases added in a perfectly read genome and we can be confident that everything that could affect the health of the individual, including all
Over half of the human genome is composed of repetitive sequence elements (
Most individuals have more than 2 million locations in their genome where the 22 autosomal chromosomes inherited from each parent differ (
Next generation sequencing has begun to become more prevalent in the clinical setting, but false positive error rates are still too high [one false positive error in every 1 million bases (
Thousands of cells are not always available in a clinical setting, but that is the amount of DNA required by most WGS technologies. Ideally WGS could be performed on 10 cells or less. This would allow for the application of WGS to
Our “perfect genome” solution employs advanced massively parallel DNA sequencing of “co-barcoded” reads from long genomic DNA molecules, and efficient
First, a sample providing sufficient (e.g., >10X) genome coverage in long (e.g., 30–300 kb) DNA fragments is needed. This redundancy allows the differentiation of true sequences from errors and artifacts and the separation of long clustered repeats (e.g., full length LINEs or segmental duplications) as discussed below. Implicit in this requirement is DNA isolated from at least several cells from one individual and the preservation of long DNA molecules. This eliminates formalin-fixed paraffin-embedded (FFPE) samples, in which DNA is typically degraded, and single cells as sources for our “perfect genome.”
Second, nearly all reads from a long fragment must share the same unique barcode (“co-barcoding”) and that barcode must be different from the barcodes used to tag reads from the majority of (i.e., >80%) related long DNA fragments (e.g., overlapped DNA fragments covering a genomic region or copies of long repeats). However, reads from many unrelated long fragments can have the same barcode.
Third, sufficient base coverage from each long DNA fragment (at least ∼30% and preferably >60% of bases with read coverage per fragment between 0.5X and 2X) is needed to link most of the sequence belonging to a long fragment and achieve almost all the benefits of very long reads (i.e., >50 kb). In the case of ideal unbiased WGS, a total read coverage of 40X would be sufficient for 10 diploid cells with co-barcoding, although in practice over 100X would be preferred.
Fourth, a read structure, as discussed below, that allows for the resolution of frequent short repeats (e.g., Alu and similar repeats) and the ability to measure the length of homopolymers and tandem repeats is important.
It is critical to satisfy all four listed requirements in this optimally designed arrangement in such a manner that DNA tagging does not add significantly to the total cost.
The principles behind co-barcoded reads (
Co-barcoding, as previously demonstrated (
How many barcodes are needed? The more the better, but to be able to measure common long repeats such as telomeres from 20 cells (∼2,000 telomeres) we would need at least 4,000 compartments (providing >75% isolated telomeres). This number of compartments is also needed to accurately count mitochondrial genomes and assess heteroplasmy as there are often more than 100 mitochondria per cell. Therefore, 5,000–10,000 is a good starting point. Using 10,000 individual compartments with 10,000 unique barcodes and 20 cells would result in only one hundred and twenty 100 kb molecules (∼12 Mb of DNA) per barcode. This small number of bases per each barcode dramatically reduces the chance of co-barcoding related long fragments. Additionally, 10,000 barcodes reduces the number of highly similar Alu repeats and LINEs with the same barcode to <1,000 for each repeat type. However, while 10,000 barcodes should be sufficient this suggests a larger number would be beneficial for analyzing these highly redundant DNA elements.
With more barcodes it may be more optimal to use more cells, ideally 50 if available, to increase the number of staggered long DNA fragments covering each genomic region. This would provide more power to separate long proximate repeats as discussed in
Using 10,000 barcodes and 50 or more cells allows for the possibility to skip amplification of long DNA [e.g., MDA (
Long DNA fragments, over 1 Mb in some cases, are needed to phase across long genomic regions with no or a few heterozygous loci commonly found in the genomes of non-African populations (
Five thousand to 10,000 individual DNA aliquots is achievable with current technologies using our multi-step no-purification biochemical protocol (
Current MPS read lengths of about 50 bases or more are suitable for our proposed “perfect genome” strategy. However as discussed above, long DNA amplification will most likely be necessary if this read architecture is used. If an approach without long DNA amplification proves to be beneficial, a feasible near-term solution could be improved paired-end MPS with ∼300 base reads from each end of 500–1,500 bp subfragments (including the barcode sequence). Cost-effective and scalable single molecule sequencing of more than 500 bases would be even better if and when it becomes available. These longer paired-end or single reads could improve the resolution of repeats in close proximity to each other that are longer than current reads but too short to be resolved by the mate-pair read structure due to variation in the subfragment size (generally varying from 500 to 1,500 bp). Furthermore, longer reads reduce the negative effects of non-randomness from DNA fragmenting. 500 base reads would also provide a more precise measure of long triple and other tandem repeats. In our proposed co-barcoding based strategy, longer continuous reads (e.g., 5 kb) are less beneficial, especially if they are error prone and/or more expensive.
Our low cost “perfect genome” solution requires 10–100 cells and thus has to have a low read coverage per each long fragment (usually less than 2X), otherwise it would be very expensive to generate 600X read coverage (10 cells × 2 parental sets of chromosomes × 30X standard read coverage). Therefore, the trivial approach of collecting all reads with the same barcode to assemble long DNA fragments is not applicable. Instead, integrating low read coverage from 10 or more original long overlapped fragments with different barcodes per each genomic region is required. We previously described an approach for variant calling disregarding barcodes and then using barcodes for phasing heterozygous variants by calculating the number of shared compartments for alleles in the neighboring heterozygous loci (
The above approaches are very effective but use a reference-biased process in the initial variant calling. Using a large number of barcodes allows implementation of a computationally and cost efficient (less than $200 by our estimates) reference-free or reference-unbiased assembly. Cycle sequencing used in MPS generates reads without indel errors. This allows for the non-gapped read-to-read alignments necessary for efficient
In this manuscript we have outlined a proposal for achieving the “perfect genome” sequence using current and yet to be developed technologies. In
The most important part of our proposal that is not currently available is a high quality, affordable 300 base MPS paired-end read length (a total of 600 bases read per subfragment). However, given the dramatic improvement in NGS read lengths over the past few years, we believe 300 base reads will be available soon. As discussed, a 10–100 fold low-bias amplification of long DNA coupled with 100 base reads is an alternative solution if 300 base reads are not feasible or cost-effective. It should be noted that while we focused on human genome sequencing for this article, the strategy outlined here is a viable solution for analyzing the genomes of any species.
Improvements to MPS-based NGS continue to advance our ability to efficiently analyze a large number of individual human genomes with high sensitivity and specificity, including separate sequence assembly of parental chromosomes (haplotyping) with recently reported false positive SNV errors of less than 10 per human genome sequenced from a 10-cell sample [Peters et al., under review
The authors are employed by Complete Genomics, Inc., a whole human genome sequencing company. Complete Genomics, Inc. is a wholly owned subsidiary of BGI-Shenzhen, a DNA sequencing company. The authors have shares in BGI-Shenzhen. Complete Genomics, Inc. has filed patents on many of the topics discussed in this article.
We would like to acknowledge the ongoing contributions and support of all Complete Genomics, Inc. employees, in particular the many highly skilled individuals that work in the libraries, reagents, and sequencing groups that make it possible to generate high quality whole genome data.