Our vector of 2236 features likely includes features that are irrelevant to the
prediction of DFLs and features that have high mutual correlations. We used
a two-step empirical feature selection to select a subset of features
characterized by high predictive value and low mutual correlations.
In the first step we removed low-quality features that have low correlation
with the annotation of the DFLs. We have two types of features: real-valued
(e.g. features computed as an average over the sliding window) and binary
(e.g. disordered versus ordered status of the residue in the center of the
window). Inspired by (Disfani et al., 2012; Yan et al., 2015), we used the
point-biserial correlation coefficient (r pb) and the φ coefficient (φ),
respectively, for these two feature types:
$$r_{pb} = \frac{M_{DFL} - M_{NDFL}}{S_n} \times \sqrt{\frac{n_{DFL} \times n_{NDFL}}{n^2}}$$
(3)
$$\varphi = \frac{count_{F1A_{DFL}} \times count_{F0A_{NDFL}} - count_{F1A_{NDFL}} \times count_{F0A_{DFL}}}{\sqrt{count_{F1} \times count_{F0} \times count_{A_{DFL}} \times count_{A_{NDFL}}}}$$
(4)
In formula (3), M DFL and M NDFL (n DFL and n NDFL) are the means (numbers)
of values of a given real-valued feature for the residues annotated as DFLs
and NDFLs, respectively; n = n DFL + n NDFL and S n is the standard deviation
of all values of that feature. In formula (4), count FiAk is the number of
values i = {0, 1} of binary feature F corresponding to residues with
values k = {NDFL, DFL} of the annotation A; count Fi and count Ak are the
number of values i = {0, 1} of binary feature F and the number of residues
with values k = {NDFL, DFL} of the annotation A, respectively. We
calculated average r pb (for the real-valued features) and φ (for the binary
features) for all considered features from four correlations computed on the
training folds from the 4-fold cross-validation on the training dataset. We
normalized the values of the average r pb and φ correlations to the −1 to 1
range using min–max normalization, and removed the features for which the
absolute normalized r pb or φ value is less than threshold T step1. Next, we
ranked the remaining features by their absolute normalized r pb or φ values.
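The two per-feature scores used in the first step can be sketched in NumPy directly from formulas (3) and (4); the function names and array layout below are our own illustration, not the authors' code:

```python
import numpy as np

def point_biserial(feature, is_dfl):
    """r_pb between a real-valued feature and the binary DFL annotation, Eq. (3)."""
    feature = np.asarray(feature, dtype=float)
    is_dfl = np.asarray(is_dfl, dtype=bool)
    n = feature.size
    n_dfl = is_dfl.sum()
    n_ndfl = n - n_dfl
    m_dfl = feature[is_dfl].mean()      # mean over DFL residues
    m_ndfl = feature[~is_dfl].mean()    # mean over NDFL residues
    s_n = feature.std()                 # std of all values of the feature
    return (m_dfl - m_ndfl) / s_n * np.sqrt(n_dfl * n_ndfl / n**2)

def phi_coefficient(feature, is_dfl):
    """phi between a binary feature and the binary DFL annotation, Eq. (4)."""
    feature = np.asarray(feature, dtype=bool)
    is_dfl = np.asarray(is_dfl, dtype=bool)
    c11 = np.sum(feature & is_dfl)      # count_F1,A_DFL
    c10 = np.sum(feature & ~is_dfl)     # count_F1,A_NDFL
    c01 = np.sum(~feature & is_dfl)     # count_F0,A_DFL
    c00 = np.sum(~feature & ~is_dfl)    # count_F0,A_NDFL
    numerator = c11 * c00 - c10 * c01
    denominator = np.sqrt((c11 + c10) * (c01 + c00) * (c11 + c01) * (c10 + c00))
    return numerator / denominator
```

With the population standard deviation in the denominator, r pb equals the Pearson correlation between the feature and the 0/1 annotation, which gives a quick sanity check of the implementation.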
In the second step, inspired by (Disfani et al., 2012; Yan et al., 2015), we
eliminated mutually correlated features using the Pearson correlation
coefficient (r pc). First, the set of selected features is initialized with the
feature that ranked highest in the first step. Next, we calculate r pc between
the next-ranked feature and every feature already in the selected set. If the
absolute values of all these correlations are less than threshold T step2, we
add the next-ranked feature to the set of selected features; otherwise we
discard it. We apply this procedure through the entire list of ranked features
passed from the first step.
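The greedy second-step filter can be sketched as follows; this is a minimal illustration assuming a samples × features matrix and a ranking of column indices produced by the first step:

```python
import numpy as np

def greedy_decorrelate(X, ranked_idx, t_step2):
    """Walk the ranked features and keep one only if its |Pearson r| with
    every already-selected feature stays below t_step2.
    X: samples x features matrix; ranked_idx: columns ordered by step-1 score."""
    selected = [ranked_idx[0]]  # initialize with the top-ranked feature
    for j in ranked_idx[1:]:
        r_abs = [abs(np.corrcoef(X[:, j], X[:, s])[0, 1]) for s in selected]
        if max(r_abs) < t_step2:
            selected.append(j)
    return selected
```

Because features are examined in rank order, a redundant feature is always dropped in favor of the higher-ranked (more predictive) one it correlates with.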
We vary the value of each of the two thresholds, T step1 and T step2, between
0.1 and 0.9 with a step of 0.05, to obtain 17 × 17 = 289 different feature sets. The
corresponding feature sets vary in size between 1 and 884 features. Each
feature set is used with three classifiers: logistic regression, naive Bayes and
k-nearest neighbor, in the 4-fold cross-validation on the training dataset to
select the design that offers the highest AUC value. We also parameterized
the logistic regression and k-nearest neighbor classifiers for each of these
experiments by selecting the parameter values that correspond to the highest
AUC in the 4-fold cross-validation on the training dataset. Naive Bayes has
no parameters. For the logistic regression, we considered ridge = 10^x,
where x ranges from −4 to 4 with a step of 1. For the k-nearest neighbor, we
considered the number of neighbors k ranging from 50 to 800 with a step of
50. Supplementary Table S1 summarizes the results with the highest AUC
value for each of the three classifiers, selected from across the experiments
that correspond to 7514 combinations of the two thresholds and classifier
parameters (289 combinations for naïve Bayes + 9 × 289 for logistic
regression + 16 × 289 for k-nearest neighbor).
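The size of this grid search can be confirmed by enumerating the parameter combinations described above:

```python
import numpy as np
from itertools import product

# Threshold grid: 0.1 to 0.9 in steps of 0.05 -> 17 values per threshold.
thresholds = np.arange(0.1, 0.9 + 1e-9, 0.05)
threshold_pairs = list(product(thresholds, thresholds))  # 289 feature sets

ridge_values = [10.0 ** x for x in range(-4, 5)]   # 9 logistic-regression settings
k_values = list(range(50, 801, 50))                # 16 k-nearest-neighbor settings

n_experiments = (len(threshold_pairs)                     # naive Bayes: no parameters
                 + len(threshold_pairs) * len(ridge_values)
                 + len(threshold_pairs) * len(k_values))
print(n_experiments)  # 289 + 9*289 + 16*289 = 7514
```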
We selected the logistic regression classifier with four features that gives the
highest values of AUC, AUC lowFPR and ratio. The differences in these
three measures of predictive quality between the logistic regression and the
other two classifiers are statistically significant. The ratio reveals that the
selected design is 3.3 times better than a random predictor when predicting
with low FPR, i.e. when a high fraction of the residues predicted as DFLs
(predicted positives) is correct. The architecture of this model is
shown in Figure 1 . Given an input AA sequence, it uses putative
annotations of structured and long disordered regions generated with IUPred
and two physicochemical properties of residues that quantify the propensity
to form helices and turns.
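At prediction time, the selected design reduces to a logistic function over the four selected features. A minimal sketch follows; the weights and bias are placeholders for illustration, not the published coefficients:

```python
import numpy as np

def predict_dfl_propensity(features, weights, bias):
    """Score one residue with a logistic regression over its four
    window-derived features (IUPred-based annotations of structured and
    long disordered regions plus two physicochemical propensities).
    Weights/bias are illustrative placeholders."""
    z = np.dot(features, weights) + bias
    return 1.0 / (1.0 + np.exp(-z))  # per-residue DFL propensity in [0, 1]
```

Thresholding this propensity trades off recall against the low-FPR behavior emphasized above.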