Optimized distributed systems achieve significant performance improvement on sorted merging of massive VCF files

CAAPA consortium

Research output: Contribution to journal › Comment/debate

Abstract

Background: Sorted merging of genomic data is a common data operation necessary in many sequencing-based studies. It involves sorting and merging genomic data from different subjects by their genomic locations. In particular, merging a large number of variant call format (VCF) files is frequently required in large-scale whole-genome sequencing or whole-exome sequencing projects. Traditional single-machine-based methods become increasingly inefficient when processing large numbers of files due to excessive computation time and input/output (I/O) bottlenecks. Distributed systems and more recent cloud-based systems offer an attractive solution. However, carefully designed and optimized workflow patterns and execution plans (schemas) are required to take full advantage of the increased computing power while overcoming bottlenecks to achieve high performance. Findings: In this study, we custom-design optimized schemas for three Apache big data platforms, Hadoop (MapReduce), HBase, and Spark, to perform sorted merging of a large number of VCF files. These schemas all adopt a divide-and-conquer strategy to split the merging job into sequential phases/stages consisting of subtasks that are conquered in an ordered, parallel, and bottleneck-free way. In two illustrative examples, we test the performance of our schemas on merging multiple VCF files into either a single TPED or a single VCF file, benchmarked against traditional single/parallel multiway-merge methods, a message passing interface (MPI)-based high-performance computing (HPC) implementation, and the popular VCFTools. Conclusions: Our experiments suggest that all three schemas either deliver a significant improvement in efficiency or exhibit much better strong and weak scalability than traditional methods. Our findings provide generalized, scalable schemas for performing sorted merging of genetics and genomics data using these Apache distributed systems.
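To give a concrete sense of the location-keyed, divide-and-conquer merging the abstract describes, the following is a minimal PySpark sketch: each single-sample VCF file is parsed into records keyed by (chromosome, position), the records are merged per genomic location, and the result is globally sorted before being written out. The file names, the simplified VCF parsing, and the TPED-like output layout are illustrative assumptions; this is not the authors' published implementation, which is more heavily optimized.

```python
# Minimal sketch of sorted merging of single-sample VCF files with Spark.
# Hypothetical inputs and simplified parsing; for illustration only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("vcf-sorted-merge").getOrCreate()
sc = spark.sparkContext

# Hypothetical single-sample VCF files, one per subject.
vcf_files = ["subject1.vcf", "subject2.vcf", "subject3.vcf"]

def parse_vcf(path):
    """Return an RDD of ((chrom, pos), [(path, genotype)]), skipping header lines."""
    def parse_line(line):
        fields = line.rstrip("\n").split("\t")
        chrom, pos = fields[0], int(fields[1])
        genotype = fields[9].split(":")[0]  # assumes GT is the first FORMAT field
        return ((chrom, pos), [(path, genotype)])
    return (sc.textFile(path)
              .filter(lambda line: not line.startswith("#"))
              .map(parse_line))

# "Divide": parse every file into genomic-location-keyed records in parallel.
per_file = [parse_vcf(p) for p in vcf_files]

# "Conquer": union all records, merge genotypes that share a location, then
# sort globally by (chrom, pos). Chromosomes sort lexicographically here;
# a real pipeline would map them to a numeric order first.
merged = (sc.union(per_file)
            .reduceByKey(lambda a, b: a + b)
            .sortByKey())

# Emit one tab-separated line per variant site (a simplified, TPED-like layout).
def to_line(record):
    (chrom, pos), genotypes = record
    calls = "\t".join(gt for _, gt in sorted(genotypes))  # fixed sample order
    return f"{chrom}\t{pos}\t{calls}"

merged.map(to_line).saveAsTextFile("merged_output")
```

The point this sketch tries to illustrate is the one the abstract emphasizes: the global ordering is produced by Spark's range partitioning during sortByKey, so no single node ever has to hold or merge all of the input files at once.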

Original language: English (US)
Article number: giy052
Journal: GigaScience
Volume: 7
Issue number: 6
DOI: 10.1093/gigascience/giy052
State: Published - Jun 1 2018

Keywords

  • Hadoop
  • HBase
  • MapReduce
  • Sorted merging
  • Spark
  • Whole-genome sequencing

ASJC Scopus subject areas

  • Health Informatics
  • Computer Science Applications

Cite this

Optimized distributed systems achieve significant performance improvement on sorted merging of massive VCF files. / CAAPA consortium.

In: GigaScience, Vol. 7, No. 6, giy052, 01.06.2018.

Research output: Contribution to journal › Comment/debate

@article{7c51980206d3450f82f64f990896ec67,
title = "Optimized distributed systems achieve significant performance improvement on sorted merging of massive VCF files",
abstract = "Background: Sorted merging of genomic data is a common data operation necessary in many sequencing-based studies. It involves sorting and merging genomic data from different subjects by their genomic locations. In particular, merging a large number of variant call format (VCF) files is frequently required in large-scale whole-genome sequencing or whole-exome sequencing projects. Traditional single-machine based methods become increasingly inefficient when processing large numbers of files due to the excessive computation time and Input/Output bottleneck. Distributed systems and more recent cloud-based systems offer an attractive solution. However, carefully designed and optimized workflow patterns and execution plans (schemas) are required to take full advantage of the increased computing power while overcoming bottlenecks to achieve high performance. Findings: In this study, we custom-design optimized schemas for three Apache big data platforms, Hadoop (MapReduce), HBase, and Spark, to perform sorted merging of a large number of VCF files. These schemas all adopt the divide-and-conquer strategy to split the merging job into sequential phases/stages consisting of subtasks that are conquered in an ordered, parallel, and bottleneck-free way. In two illustrating examples, we test the performance of our schemas on merging multiple VCF files into either a single TPED or a single VCF file, which are benchmarked with the traditional single/parallel multiway-merge methods, message passing interface (MPI)-based high-performance computing (HPC) implementation, and the popular VCFTools. Conclusions: Our experiments suggest all three schemas either deliver a significant improvement in efficiency or render much better strong and weak scalabilities over traditional methods. Our findings provide generalized scalable schemas for performing sorted merging on genetics and genomics data using these Apache distributed systems.",
keywords = "Hadoop, HBase, MapReduce, Sorted merging, Spark, Whole-genome sequencing",
author = "{CAAPA consortium} and Xiaobo Sun and Jingjing Gao and Peng Jin and Celeste Eng and Burchard, {Esteban G.} and Beaty, {Terri H.} and Ingo Ruczinski and Mathias, {Rasika Ann} and Kathleen Barnes and Fusheng Wang and Qin, {Zhaohui S.} and Barnes, {Kathleen C.} and Boorgula, {Meher Preethi} and Monica Campbell and Sameer Chavan and Ford, {Jean G.} and Cassandra Foster and Li Gao and Hansel, {Nadia N.} and Edward Horowitz and Lili Huang and Romina Ortiz and Joseph Potee and Nicholas Rafaels and Scott, {Alan F.} and Taub, {Margaret A.} and Candelaria Vergara and Yijuan Hu and Johnston, {Henry Richard} and Levin, {Albert M.} and Badri Padhukasahasram and Williams, {L. Keoki} and Dunston, {Georgia M.} and Faruque, {Mezbah U.} and Kenny, {Eimear E.} and Kimberly Gietzen and Mark Hansen and Rob Genuario and Dave Bullis and Cindy Lawley and Aniket Deshpande and Grus, {Wendy E.} and Locke, {Devin P.} and Foreman, {Marilyn G.} and Avila, {Pedro C.} and Leslie Grammer and Kim, {Kwang Youn A.} and Rajesh Kumar and Robert Schleimer and Genevieve Wojcik",
year = "2018",
month = "6",
day = "1",
doi = "10.1093/gigascience/giy052",
language = "English (US)",
volume = "7",
journal = "GigaScience",
issn = "2047-217X",
publisher = "BioMed Central",
number = "6",

}

TY - JOUR

T1 - Optimized distributed systems achieve significant performance improvement on sorted merging of massive VCF files

AU - CAAPA consortium

AU - Sun, Xiaobo

AU - Gao, Jingjing

AU - Jin, Peng

AU - Eng, Celeste

AU - Burchard, Esteban G.

AU - Beaty, Terri H.

AU - Ruczinski, Ingo

AU - Mathias, Rasika Ann

AU - Barnes, Kathleen

AU - Wang, Fusheng

AU - Qin, Zhaohui S.

AU - Barnes, Kathleen C.

AU - Boorgula, Meher Preethi

AU - Campbell, Monica

AU - Chavan, Sameer

AU - Ford, Jean G.

AU - Foster, Cassandra

AU - Gao, Li

AU - Hansel, Nadia N.

AU - Horowitz, Edward

AU - Huang, Lili

AU - Ortiz, Romina

AU - Potee, Joseph

AU - Rafaels, Nicholas

AU - Scott, Alan F.

AU - Taub, Margaret A.

AU - Vergara, Candelaria

AU - Hu, Yijuan

AU - Johnston, Henry Richard

AU - Levin, Albert M.

AU - Padhukasahasram, Badri

AU - Williams, L. Keoki

AU - Dunston, Georgia M.

AU - Faruque, Mezbah U.

AU - Kenny, Eimear E.

AU - Gietzen, Kimberly

AU - Hansen, Mark

AU - Genuario, Rob

AU - Bullis, Dave

AU - Lawley, Cindy

AU - Deshpande, Aniket

AU - Grus, Wendy E.

AU - Locke, Devin P.

AU - Foreman, Marilyn G.

AU - Avila, Pedro C.

AU - Grammer, Leslie

AU - Kim, Kwang Youn A.

AU - Kumar, Rajesh

AU - Schleimer, Robert

AU - Wojcik, Genevieve

PY - 2018/6/1

Y1 - 2018/6/1

N2 - Background: Sorted merging of genomic data is a common data operation necessary in many sequencing-based studies. It involves sorting and merging genomic data from different subjects by their genomic locations. In particular, merging a large number of variant call format (VCF) files is frequently required in large-scale whole-genome sequencing or whole-exome sequencing projects. Traditional single-machine based methods become increasingly inefficient when processing large numbers of files due to the excessive computation time and Input/Output bottleneck. Distributed systems and more recent cloud-based systems offer an attractive solution. However, carefully designed and optimized workflow patterns and execution plans (schemas) are required to take full advantage of the increased computing power while overcoming bottlenecks to achieve high performance. Findings: In this study, we custom-design optimized schemas for three Apache big data platforms, Hadoop (MapReduce), HBase, and Spark, to perform sorted merging of a large number of VCF files. These schemas all adopt the divide-and-conquer strategy to split the merging job into sequential phases/stages consisting of subtasks that are conquered in an ordered, parallel, and bottleneck-free way. In two illustrating examples, we test the performance of our schemas on merging multiple VCF files into either a single TPED or a single VCF file, which are benchmarked with the traditional single/parallel multiway-merge methods, message passing interface (MPI)-based high-performance computing (HPC) implementation, and the popular VCFTools. Conclusions: Our experiments suggest all three schemas either deliver a significant improvement in efficiency or render much better strong and weak scalabilities over traditional methods. Our findings provide generalized scalable schemas for performing sorted merging on genetics and genomics data using these Apache distributed systems.

AB - Background: Sorted merging of genomic data is a common data operation necessary in many sequencing-based studies. It involves sorting and merging genomic data from different subjects by their genomic locations. In particular, merging a large number of variant call format (VCF) files is frequently required in large-scale whole-genome sequencing or whole-exome sequencing projects. Traditional single-machine based methods become increasingly inefficient when processing large numbers of files due to the excessive computation time and Input/Output bottleneck. Distributed systems and more recent cloud-based systems offer an attractive solution. However, carefully designed and optimized workflow patterns and execution plans (schemas) are required to take full advantage of the increased computing power while overcoming bottlenecks to achieve high performance. Findings: In this study, we custom-design optimized schemas for three Apache big data platforms, Hadoop (MapReduce), HBase, and Spark, to perform sorted merging of a large number of VCF files. These schemas all adopt the divide-and-conquer strategy to split the merging job into sequential phases/stages consisting of subtasks that are conquered in an ordered, parallel, and bottleneck-free way. In two illustrating examples, we test the performance of our schemas on merging multiple VCF files into either a single TPED or a single VCF file, which are benchmarked with the traditional single/parallel multiway-merge methods, message passing interface (MPI)-based high-performance computing (HPC) implementation, and the popular VCFTools. Conclusions: Our experiments suggest all three schemas either deliver a significant improvement in efficiency or render much better strong and weak scalabilities over traditional methods. Our findings provide generalized scalable schemas for performing sorted merging on genetics and genomics data using these Apache distributed systems.

KW - Hadoop

KW - HBase

KW - MapReduce

KW - Sorted merging

KW - Spark

KW - Whole-genome sequencing

UR - http://www.scopus.com/inward/record.url?scp=85050892395&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85050892395&partnerID=8YFLogxK

U2 - 10.1093/gigascience/giy052

DO - 10.1093/gigascience/giy052

M3 - Comment/debate

C2 - 29762754

AN - SCOPUS:85050892395

VL - 7

JO - GigaScience

JF - GigaScience

SN - 2047-217X

IS - 6

M1 - giy052

ER -