Predicting the impact of genetic variation on human health remains an important and difficult challenge. Often, algorithmic classifiers are tasked with predicting binary traits (e.g., positive or negative for a disease) from missense variation. Though useful, this arrangement is limiting and contrived, because human diseases often comprise a spectrum of severities rather than a discrete partitioning of patient populations. Furthermore, labeling variants as causal or benign can be error-prone, which is problematic for training supervised learning algorithms (the so-called garbage in, garbage out phenomenon). We explore the potential value of training classifiers using continuous-valued quantitative measurements rather than binary traits. Using 20 variants from the nucleotide-binding domains of the cystic fibrosis transmembrane conductance regulator (CFTR) and six quantitative measures of cystic fibrosis (CF) severity, we trained classifiers to predict CF severity from CFTR variants. Under cross-validation, classifier predictions and measured clinical/functional values were significantly correlated for four of the six quantitative traits (correlation P-values from 1.35 × 10⁻⁴ to 4.15 × 10⁻³). Classifiers were also able to stratify variants into three clinically relevant risk categories with 85-100% accuracy, depending on which of the six quantitative traits was used for training. Finally, we characterized 11 additional CFTR variants using clinical sweat chloride testing, two functional assays, or all three diagnostics, and validated our classifier using blind prediction. Predictions were within the measured sweat chloride range for seven of eight variants, and captured the differential impact of specific variants on the two functional assays. This work demonstrates a promising and novel framework for assessing the impact of genetic variation.
ASJC Scopus subject areas
- Molecular Biology