A methodology for exploring biomarker - phenotype associations

Application to flow cytometry data and systemic sclerosis clinical manifestations

Hongtai Huang, Andrea Fava, Tara Guhr, Raffaello Cimbro, Antony Rosen, Francesco Boin, Hugh Ellis

Research output: Contribution to journal (Article)

Abstract

Background: This work seeks to develop a methodology for identifying reliable biomarkers of disease activity, progression and outcome by identifying significant associations between high-throughput flow cytometry (FC) data and interstitial lung disease (ILD), a clinical phenotype of systemic sclerosis (SSc, or scleroderma) and the leading cause of morbidity and mortality in SSc. A specific aim is to develop a clinically useful screening tool that could yield accurate assessments of disease state, such as the risk or presence of SSc-ILD, the activity of lung involvement and the likelihood of response to therapeutic intervention. Ultimately, this instrument could facilitate a refined stratification of SSc patients into clinically relevant subsets at the time of diagnosis and subsequently during the course of the disease, and thus help prevent poor outcomes from disease progression or unnecessary treatment side effects. The methods used in the work involve: (1) clinical and peripheral blood flow cytometry data (Immune Response In Scleroderma, IRIS) from consented patients followed at the Johns Hopkins Scleroderma Center; (2) machine learning (Conditional Random Forests, CRF) coupled with Gene Set Enrichment Analysis (GSEA) to identify subsets of FC variables that are highly effective in classifying ILD patients; and (3) stochastic simulation to design, train and validate ILD risk screening tools.

Results: Our hybrid analysis approach (CRF-GSEA) predicted the ILD status of SSc patients with >82% correct classification in validation (79 patients in the training data set, 40 patients in the validation data set).

Conclusions: IRIS flow cytometry data provide useful information for assessing the ILD status of SSc patients. Our new approach combining Conditional Random Forests and Gene Set Enrichment Analysis identified a subset of flow cytometry variables from which to build a screening tool that correctly identified ILD patients in the training and validation data sets. More broadly, identifying subsets of flow cytometry variables that exhibit coordinated movement (i.e., multi-variable up- or down-regulation) may lead to insights into possible effector pathways and thereby improve knowledge of systemic sclerosis pathogenesis.
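
The enrichment step named in the abstract can be illustrated with a minimal sketch of the GSEA-style running-sum statistic. This is not the authors' implementation: it assumes that each flow cytometry variable already has a numeric score (for example, a variable importance from a random forest), and the function name, the variable names and the scores are hypothetical. Significance testing by permutation, which a full GSEA would add, is not shown.

```python
from typing import Dict, List, Set, Tuple


def enrichment_score(
    scores: Dict[str, float],
    variable_set: Set[str],
    weight: float = 1.0,
) -> Tuple[float, List[float]]:
    """Return the GSEA-style enrichment score and the running-sum profile.

    scores       -- hypothetical per-variable scores (e.g. importances)
    variable_set -- the subset of variables tested for enrichment
    weight       -- exponent applied to |score| for variables in the set
    """
    # Rank all variables from highest to lowest score.
    ranked = sorted(scores, key=lambda name: scores[name], reverse=True)
    hits = [name in variable_set for name in ranked]
    n_miss = len(ranked) - sum(hits)
    if not any(hits) or n_miss == 0:
        raise ValueError("variable_set must contain some, but not all, variables")

    # Normalising constant: total weighted score of the variables in the set.
    hit_norm = sum(
        abs(scores[name]) ** weight for name, hit in zip(ranked, hits) if hit
    )
    if hit_norm == 0:
        raise ValueError("all variables in the set have a zero score")

    running = 0.0
    profile: List[float] = []
    for name, hit in zip(ranked, hits):
        if hit:
            # Step up in proportion to the variable's weighted score.
            running += abs(scores[name]) ** weight / hit_norm
        else:
            # Step down by a constant amount for each variable outside the set.
            running -= 1.0 / n_miss
        profile.append(running)

    # The enrichment score is the largest deviation of the profile from zero.
    return max(profile, key=abs), profile


if __name__ == "__main__":
    # Hypothetical scores for six placeholder flow cytometry variables.
    importance = {
        "fc_var_1": 0.9, "fc_var_2": 0.7, "fc_var_3": 0.5,
        "fc_var_4": 0.3, "fc_var_5": 0.2, "fc_var_6": 0.1,
    }
    es, _ = enrichment_score(importance, {"fc_var_1", "fc_var_2"})
    print(f"enrichment score: {es:.3f}")
```

A score near +1 indicates that the tested subset is concentrated at the top of the ranking, and a score near -1 that it is concentrated at the bottom, which is one way to quantify the coordinated movement of a variable subset described above.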

Original language: English (US)
Article number: 293
Journal: BMC Bioinformatics
Volume: 16
Issue number: 1
DOI: 10.1186/s12859-015-0722-x
State: Published - Sep 15 2015

Fingerprint

Systemic Sclerosis
Flow Cytometry
Pulmonary diseases
Systemic Scleroderma
Biomarkers
Interstitial Lung Diseases
Phenotype
Lung
Methodology
Scleroderma
Screening
Genes
Disease Progression
Subset
Random Forest
Immune Response
Gene
Progression
Learning systems

Keywords

  • Conditional random forests
  • Flow cytometry
  • Gene set enrichment analysis
  • Interstitial lung disease
  • Scleroderma

ASJC Scopus subject areas

  • Applied Mathematics
  • Structural Biology
  • Biochemistry
  • Molecular Biology
  • Computer Science Applications

Cite this

A methodology for exploring biomarker - phenotype associations: Application to flow cytometry data and systemic sclerosis clinical manifestations. / Huang, Hongtai; Fava, Andrea; Guhr, Tara; Cimbro, Raffaello; Rosen, Antony; Boin, Francesco; Ellis, Hugh.

In: BMC Bioinformatics, Vol. 16, No. 1, 293, 15.09.2015.

Huang, Hongtai; Fava, Andrea; Guhr, Tara; Cimbro, Raffaello; Rosen, Antony; Boin, Francesco; Ellis, Hugh. / A methodology for exploring biomarker - phenotype associations: Application to flow cytometry data and systemic sclerosis clinical manifestations. In: BMC Bioinformatics. 2015; Vol. 16, No. 1.

@article{07bac3288c84483b932f1a04a54c6e09,
title = "A methodology for exploring biomarker - phenotype associations: Application to flow cytometry data and systemic sclerosis clinical manifestations",
keywords = "Conditional random forests, Flow cytometry, Gene set enrichment analysis, Interstitial lung disease, Scleroderma",
author = "Hongtai Huang and Andrea Fava and Tara Guhr and Raffaello Cimbro and Antony Rosen and Francesco Boin and Hugh Ellis",
year = "2015",
month = "9",
day = "15",
doi = "10.1186/s12859-015-0722-x",
language = "English (US)",
volume = "16",
journal = "BMC Bioinformatics",
issn = "1471-2105",
publisher = "BioMed Central",
number = "1",

}

TY - JOUR

T1 - A methodology for exploring biomarker - phenotype associations

T2 - Application to flow cytometry data and systemic sclerosis clinical manifestations

AU - Huang, Hongtai

AU - Fava, Andrea

AU - Guhr, Tara

AU - Cimbro, Raffaello

AU - Rosen, Antony

AU - Boin, Francesco

AU - Ellis, Hugh

PY - 2015/9/15

Y1 - 2015/9/15

KW - Conditional random forests

KW - Flow cytometry

KW - Gene set enrichment analysis

KW - Interstitial lung disease

KW - Scleroderma

UR - http://www.scopus.com/inward/record.url?scp=84941636848&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84941636848&partnerID=8YFLogxK

U2 - 10.1186/s12859-015-0722-x

DO - 10.1186/s12859-015-0722-x

M3 - Article

VL - 16

JO - BMC Bioinformatics

JF - BMC Bioinformatics

SN - 1471-2105

IS - 1

M1 - 293

ER -