Assembly of a pan-genome from deep sequencing of 910 humans of African descent

Rachel M. Sherman, Juliet Forman, Valentin Antonescu, Daniela Puiu, Michelle Daya, Nicholas Rafaels, Meher Preethi Boorgula, Sameer Chavan, Candelaria Vergara, Victor E. Ortega, Albert M. Levin, Celeste Eng, Maria Yazdanbakhsh, James G. Wilson, Javier Marrugo, Leslie A. Lange, L. Keoki Williams, Harold Watson, Lorraine B. Ware, Christopher O. Olopade, Olufunmilayo Olopade, Ricardo R. Oliveira, Carole Ober, Dan L. Nicolae, Deborah A. Meyers, Alvaro Mayorga, Jennifer Knight-Madden, Tina Hartert, Nadia Hansel, Marilyn G. Foreman, Jean G. Ford, Mezbah U. Faruque, Georgia M. Dunston, Luis Caraballo, Esteban G. Burchard, Eugene R. Bleecker, Maria I. Araujo, Edwin F. Herrera-Paz, Monica Campbell, Cassandra Foster, Margaret Anne Taub, Terri L Beaty, Ingo Ruczinski, Rasika Mathias, Kathleen C. Barnes, Steven L Salzberg

Research output: Contribution to journal, Letter

Abstract

We used a deeply sequenced dataset of 910 individuals, all of African descent, to construct a set of DNA sequences that is present in these individuals but missing from the reference human genome. We aligned 1.19 trillion reads from the 910 individuals to the reference genome (GRCh38), collected all reads that failed to align, and assembled these reads into contiguous sequences (contigs). We then compared all contigs to one another to identify a set of unique sequences representing regions of the African pan-genome missing from the reference genome. Our analysis revealed 296,485,284 bp in 125,715 distinct contigs present in the populations of African descent, demonstrating that the African pan-genome contains ~10% more DNA than the current human reference genome. Although the functional significance of nearly all of this sequence is unknown, 387 of the novel contigs fall within 315 distinct protein-coding genes, and the rest appear to be intergenic.
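The workflow in the abstract (align, keep what fails to align, assemble, then compare contigs to one another) can be illustrated with a toy sketch on short strings. This is not the authors' pipeline: the abstract does not name the aligner or assembler, so exact substring matching stands in for alignment, a greedy overlap merge stands in for assembly, and containment checks stand in for the all-against-all contig comparison. Reverse complements, sequencing errors, and contaminant filtering are ignored, and all function names and sequences here are hypothetical.

```python
def unaligned_reads(reads, reference):
    """Stand-in for alignment: a read 'aligns' only if it is an exact substring."""
    return [r for r in reads if r not in reference]


def overlap(a, b, min_len):
    """Length of the longest suffix of a that is a prefix of b (>= min_len), else 0."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the largest suffix/prefix overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicate reads, keep order
    while True:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


def nonredundant(contigs):
    """Stand-in for contig-vs-contig comparison: drop contigs contained in a longer one."""
    keep = []
    for c in sorted(set(contigs), key=len, reverse=True):
        if not any(c in k for k in keep):
            keep.append(c)
    return keep


# Hypothetical toy data: one reference and reads from two individuals.
reference = "ACGTACGTTTGACCA"
individual_1 = ["ACGTAC", "GGGCATT", "CATTCAG", "TTCAGG"]
individual_2 = ["GTTTGA", "GGGCATT"]

pooled = []
for reads in (individual_1, individual_2):
    pooled.extend(greedy_assemble(unaligned_reads(reads, reference)))

novel = nonredundant(pooled)
print(novel)                      # sequences absent from the reference
print(sum(len(c) for c in novel)) # total novel bases in this toy example
```

In this toy run the shorter contig from the second individual is absorbed into the longer one from the first, which mirrors the role of the redundancy-removal step: the pooled set counts each missing region once rather than once per individual.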

Original language: English (US)
Pages (from-to): 30-35
Number of pages: 6
Journal: Nature Genetics
Volume: 51
Issue number: 1
DOI: https://doi.org/10.1038/s41588-018-0273-y
State: Published - Jan 1 2019

Fingerprint

High-Throughput Nucleotide Sequencing
Genome
Human Genome
DNA
Population
Proteins

ASJC Scopus subject areas

  • Genetics

Cite this

Assembly of a pan-genome from deep sequencing of 910 humans of African descent. / Sherman, Rachel M.; Forman, Juliet; Antonescu, Valentin; Puiu, Daniela; Daya, Michelle; Rafaels, Nicholas; Boorgula, Meher Preethi; Chavan, Sameer; Vergara, Candelaria; Ortega, Victor E.; Levin, Albert M.; Eng, Celeste; Yazdanbakhsh, Maria; Wilson, James G.; Marrugo, Javier; Lange, Leslie A.; Williams, L. Keoki; Watson, Harold; Ware, Lorraine B.; Olopade, Christopher O.; Olopade, Olufunmilayo; Oliveira, Ricardo R.; Ober, Carole; Nicolae, Dan L.; Meyers, Deborah A.; Mayorga, Alvaro; Knight-Madden, Jennifer; Hartert, Tina; Hansel, Nadia; Foreman, Marilyn G.; Ford, Jean G.; Faruque, Mezbah U.; Dunston, Georgia M.; Caraballo, Luis; Burchard, Esteban G.; Bleecker, Eugene R.; Araujo, Maria I.; Herrera-Paz, Edwin F.; Campbell, Monica; Foster, Cassandra; Taub, Margaret Anne; Beaty, Terri L; Ruczinski, Ingo; Mathias, Rasika; Barnes, Kathleen C.; Salzberg, Steven L.

In: Nature Genetics, Vol. 51, No. 1, 01.01.2019, p. 30-35.

@article{d6da2929688948b38358e1c4313c5a50,
title = "Assembly of a pan-genome from deep sequencing of 910 humans of African descent",
abstract = "We used a deeply sequenced dataset of 910 individuals, all of African descent, to construct a set of DNA sequences that is present in these individuals but missing from the reference human genome. We aligned 1.19 trillion reads from the 910 individuals to the reference genome (GRCh38), collected all reads that failed to align, and assembled these reads into contiguous sequences (contigs). We then compared all contigs to one another to identify a set of unique sequences representing regions of the African pan-genome missing from the reference genome. Our analysis revealed 296,485,284 bp in 125,715 distinct contigs present in the populations of African descent, demonstrating that the African pan-genome contains ~10{\%} more DNA than the current human reference genome. Although the functional significance of nearly all of this sequence is unknown, 387 of the novel contigs fall within 315 distinct protein-coding genes, and the rest appear to be intergenic.",
author = "Sherman, {Rachel M.} and Juliet Forman and Valentin Antonescu and Daniela Puiu and Michelle Daya and Nicholas Rafaels and Boorgula, {Meher Preethi} and Sameer Chavan and Candelaria Vergara and Ortega, {Victor E.} and Levin, {Albert M.} and Celeste Eng and Maria Yazdanbakhsh and Wilson, {James G.} and Javier Marrugo and Lange, {Leslie A.} and Williams, {L. Keoki} and Harold Watson and Ware, {Lorraine B.} and Olopade, {Christopher O.} and Olufunmilayo Olopade and Oliveira, {Ricardo R.} and Carole Ober and Nicolae, {Dan L.} and Meyers, {Deborah A.} and Alvaro Mayorga and Jennifer Knight-Madden and Tina Hartert and Nadia Hansel and Foreman, {Marilyn G.} and Ford, {Jean G.} and Faruque, {Mezbah U.} and Dunston, {Georgia M.} and Luis Caraballo and Burchard, {Esteban G.} and Bleecker, {Eugene R.} and Araujo, {Maria I.} and Herrera-Paz, {Edwin F.} and Monica Campbell and Cassandra Foster and Taub, {Margaret Anne} and Beaty, {Terri L} and Ingo Ruczinski and Rasika Mathias and Barnes, {Kathleen C.} and Salzberg, {Steven L}",
year = "2019",
month = "1",
day = "1",
doi = "10.1038/s41588-018-0273-y",
language = "English (US)",
volume = "51",
pages = "30--35",
journal = "Nature Genetics",
issn = "1061-4036",
publisher = "Nature Publishing Group",
number = "1",

}

TY - JOUR

T1 - Assembly of a pan-genome from deep sequencing of 910 humans of African descent

AU - Sherman, Rachel M.

AU - Forman, Juliet

AU - Antonescu, Valentin

AU - Puiu, Daniela

AU - Daya, Michelle

AU - Rafaels, Nicholas

AU - Boorgula, Meher Preethi

AU - Chavan, Sameer

AU - Vergara, Candelaria

AU - Ortega, Victor E.

AU - Levin, Albert M.

AU - Eng, Celeste

AU - Yazdanbakhsh, Maria

AU - Wilson, James G.

AU - Marrugo, Javier

AU - Lange, Leslie A.

AU - Williams, L. Keoki

AU - Watson, Harold

AU - Ware, Lorraine B.

AU - Olopade, Christopher O.

AU - Olopade, Olufunmilayo

AU - Oliveira, Ricardo R.

AU - Ober, Carole

AU - Nicolae, Dan L.

AU - Meyers, Deborah A.

AU - Mayorga, Alvaro

AU - Knight-Madden, Jennifer

AU - Hartert, Tina

AU - Hansel, Nadia

AU - Foreman, Marilyn G.

AU - Ford, Jean G.

AU - Faruque, Mezbah U.

AU - Dunston, Georgia M.

AU - Caraballo, Luis

AU - Burchard, Esteban G.

AU - Bleecker, Eugene R.

AU - Araujo, Maria I.

AU - Herrera-Paz, Edwin F.

AU - Campbell, Monica

AU - Foster, Cassandra

AU - Taub, Margaret Anne

AU - Beaty, Terri L

AU - Ruczinski, Ingo

AU - Mathias, Rasika

AU - Barnes, Kathleen C.

AU - Salzberg, Steven L

PY - 2019/1/1

Y1 - 2019/1/1

N2 - We used a deeply sequenced dataset of 910 individuals, all of African descent, to construct a set of DNA sequences that is present in these individuals but missing from the reference human genome. We aligned 1.19 trillion reads from the 910 individuals to the reference genome (GRCh38), collected all reads that failed to align, and assembled these reads into contiguous sequences (contigs). We then compared all contigs to one another to identify a set of unique sequences representing regions of the African pan-genome missing from the reference genome. Our analysis revealed 296,485,284 bp in 125,715 distinct contigs present in the populations of African descent, demonstrating that the African pan-genome contains ~10% more DNA than the current human reference genome. Although the functional significance of nearly all of this sequence is unknown, 387 of the novel contigs fall within 315 distinct protein-coding genes, and the rest appear to be intergenic.

UR - http://www.scopus.com/inward/record.url?scp=85057069294&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85057069294&partnerID=8YFLogxK

U2 - 10.1038/s41588-018-0273-y

DO - 10.1038/s41588-018-0273-y

M3 - Letter

C2 - 30455414

AN - SCOPUS:85057069294

VL - 51

SP - 30

EP - 35

JO - Nature Genetics

JF - Nature Genetics

SN - 1061-4036

IS - 1

ER -