List of considered features for the predictive model
Features from the amino acid (AA) sequence (40 features):
• CENT_AA{AA type}: binary coding for the type of AA of the residue in the center
(CENT) of the window (20 features).
• WIN_AA_content{AA type}: number of residues of a given type of AA in the sliding
window (WIN), divided by the length of the window (20 features).
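The two sequence-derived feature groups above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `aa_window_features`, the 20-letter alphabet ordering, and the convention that the center is the middle position of an odd-length window are assumptions made for the example.

```python
# Assumed alphabet: the 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_window_features(window: str) -> dict:
    """Compute CENT_AA and WIN_AA_content features for one window.

    Assumption: the window has odd length, so the central residue
    sits at index len(window) // 2.
    """
    center = window[len(window) // 2]
    feats = {}
    for aa in AMINO_ACIDS:
        # Binary coding of the AA type of the central residue.
        feats[f"CENT_AA{aa}"] = 1 if center == aa else 0
        # Count of this AA type divided by the window length.
        feats[f"WIN_AA_content{aa}"] = window.count(aa) / len(window)
    return feats
```

Calling `aa_window_features("ACDAA")` returns 40 values, with `CENT_AAD` set to 1 and `WIN_AA_contentA` equal to 3/5.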
Features based on physicochemical properties of AAs, quantified with the 531 amino
acid indices from the AAindex database (AAind, 2124 features):
• CENT_AAind_val{index name}: value of a given AAindex for the type of AA of the
residue in the center of the window (531 features).
• WIN_AAind_avg{index name}: average value of a given AAindex for all residues in
the sliding window (531 features).
• WIN_AAind_std{index name}: standard deviation of values of a given AAindex for
all residues in the sliding window (531 features).
• WIN_AAind_dif{index name}: difference between the average value of a given AAindex
for all residues in the sliding window and the average value for residues in the
segments that flank the window on both sides; the number of these flanking residues
equals half of the window size (i.e., eight residues that extend the original window
on each side are used). These features were inspired by Disfani, et al. (2012) (531
features).
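For a single amino acid index, the four AAindex-based features might be computed as in the sketch below. Several details are assumptions that the text does not specify: the helper name `aaindex_features`, the use of the population standard deviation, truncation of flanks at the sequence termini, and a difference of 0.0 when no flanking residues exist. The values in `TOY_INDEX` are placeholders for illustration and are not taken from the AAindex database.

```python
from statistics import mean, pstdev

# Placeholder values for illustration only; NOT real AAindex entries.
TOY_INDEX = {"A": 1.0, "G": 0.0, "L": 2.0}


def aaindex_features(seq, center, half_window, flank, index, name):
    """Compute the four AAind features for the window centered at `center`.

    `flank` is the number of residues taken on each side of the window
    (an assumption about how the flanking segments are defined).
    """
    start, end = center - half_window, center + half_window + 1
    if start < 0 or end > len(seq):
        raise ValueError("window must lie inside the sequence")
    window = [index[aa] for aa in seq[start:end]]
    # Flanking residues on both sides, truncated at the sequence termini.
    flank_residues = seq[max(0, start - flank):start] + seq[end:end + flank]
    flanks = [index[aa] for aa in flank_residues]
    win_avg = mean(window)
    return {
        f"CENT_AAind_val{name}": index[seq[center]],
        f"WIN_AAind_avg{name}": win_avg,
        # Population standard deviation (assumed; sample std is also plausible).
        f"WIN_AAind_std{name}": pstdev(window),
        f"WIN_AAind_dif{name}": (win_avg - mean(flanks)) if flanks else 0.0,
    }
```

Repeating this computation for each of the 531 indices would yield the 4 × 531 = 2124 features listed above.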
Features from the putative secondary structure (SS) derived from the input
sequence using PSIPRED (SS, 22 features):
• CENT_SS_is{H, E, C}: binary coding for the type of SS of the residue in the center
(CENT) of the window (3 features).
• WIN_SS_content{H, E, C}: number of helix, strand and coil residues in the sliding
window divided by the length of the window (3 features).
• WIN_SS_sum{HE, HC, EC}: sum of the numbers of helix and strand residues, helix
and coil residues, and strand and coil residues in the sliding window, normalized by
the length of the window (3 features).
• WIN_SS_num_region{H, E, C}: number of helix, strand and coil regions in the
sliding window, normalized by the length of the window. Each region consists of a
segment of consecutive helix/strand/coil residues; the minimal region length is 3/1/2
residues, which is the size of the shortest helix/strand (beta bridge)/coil (3 features).
• WIN_SS_sum_regionHEC: sum of the number of helix, strand and coil regions in
the sliding window, normalized by the length of the window (1 feature).
• WIN_SS_{longest, shortest, avg}_region{H, E, C}: longest, shortest and average
length of helix, strand and coil regions in the sliding window, normalized by the
length of the window (3 × 3 = 9 features).
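The region-based secondary structure features can be illustrated with the sketch below, which splits a window of predicted states into runs and keeps only runs that meet the minimal lengths of 3/1/2 for helix/strand/coil. The function name `ss_region_features` and the choice to report 0 when a state has no qualifying region are assumptions; the text does not say how empty cases are handled. The sketch covers the content, region-count and region-length features only.

```python
from itertools import groupby

# Minimal region lengths for helix (H), strand (E) and coil (C), as listed above.
MIN_REGION_LEN = {"H": 3, "E": 1, "C": 2}


def ss_region_features(ss_window: str) -> dict:
    """Compute content and region features from a string of H/E/C states."""
    n = len(ss_window)
    # Runs of consecutive identical states, e.g. "HHHCC" -> [("H", 3), ("C", 2)].
    runs = [(state, len(list(group))) for state, group in groupby(ss_window)]
    feats = {}
    total_regions = 0
    for state in "HEC":
        lengths = [
            length for s, length in runs
            if s == state and length >= MIN_REGION_LEN[state]
        ]
        total_regions += len(lengths)
        feats[f"WIN_SS_content{state}"] = ss_window.count(state) / n
        feats[f"WIN_SS_num_region{state}"] = len(lengths) / n
        # Assumed convention: 0 when the state has no qualifying region.
        feats[f"WIN_SS_longest_region{state}"] = max(lengths, default=0) / n
        feats[f"WIN_SS_shortest_region{state}"] = min(lengths, default=0) / n
        avg_len = sum(lengths) / len(lengths) if lengths else 0.0
        feats[f"WIN_SS_avg_region{state}"] = avg_len / n
    feats["WIN_SS_sum_regionHEC"] = total_regions / n
    return feats
```

For the window `"HHHHCCEECH"`, the trailing single C and single H fall below their minimal lengths, so each state contributes one region.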
Features from the putative intrinsically disordered and structured regions derived
from the input sequence using IUPred (IUP, 40 features):
• CENT_IUP_is{L, S, D}: binary encoding of the prediction of long disordered
regions with IUPred_long, short disordered regions with IUPred_short and structured
regions with IUPred_struct for the residue in the center of the window (3 features).
• CENT_IUP_val{L, S}: propensity score for disorder predicted with IUPred_long and
IUPred_short for the residue in the center of the window (2 features).
• WIN_IUP_content{L, S}_{0, 1}: number of ordered and disordered residues
predicted with IUPred_long and IUPred_short in the sliding window, divided by the
length of the window (2 × 2 = 4 features).
• WIN_IUP_num_region{L, S}_{0, 1}: number of ordered and disordered regions
predicted with IUPred_long and IUPred_short in the sliding window, normalized by
the length of the window. Each region consists of a segment of consecutive
disordered or ordered residues; the minimal length of disordered regions is 4
(Monastyrskyy, et al., 2011; Monastyrskyy, et al., 2014) (2 × 2 = 4 features).
• WIN_IUP_sum_region_{L, S}_01: sum of the number of ordered and disordered
regions predicted with IUPred_long and IUPred_short in the sliding window,
normalized by the length of the window (2 features).
• WIN_IUP_{longest, shortest, avg}_region{L, S}_{0, 1}: longest, shortest and
average length of ordered and disordered regions predicted with IUPred_long and
IUPred_short in the sliding window, normalized by the length of the window (3 × 2
× 2 = 12 features).
• WIN_IUP_{avg, std}{L, S}: average and standard deviation of propensity scores
predicted with IUPred_long and IUPred_short for residues in the sliding window (2
× 2 = 4 features).
• WIN_IUP_fractionD{0, 1}: number of residues in structured regions and other
regions (not located in structured regions) predicted with IUPred_struct in the sliding
window, divided by the length of the window (2 features).
• WIN_IUP_{longest, shortest, avg}_regionD{0, 1}: longest, shortest and average
length of structured regions and other regions (not located in structured regions)
predicted with IUPred_struct in the sliding window, normalized by the length of the
window. Each region consists of a segment of consecutive structured or non-
structured residues (3 × 2 = 6 features).
• WIN_IUP_sum_regionD01: sum of the number of structured regions and other
regions (not located in structured regions) predicted with IUPred_struct in the sliding
window, normalized by the length of the window (1 feature).
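A subset of the IUPred-based window features (score statistics, content, and region counts) could be derived from per-residue propensity scores roughly as sketched below. The following are assumptions not stated in the text: the function name `iupred_window_features`, a cutoff of 0.5 for calling a residue disordered, the mapping of label 0 to ordered and 1 to disordered, a minimal length of 1 for ordered regions, and the use of the population standard deviation. Only the minimal disordered region length of 4 comes from the list above.

```python
from itertools import groupby
from statistics import mean, pstdev


def iupred_window_features(scores, tag="L", threshold=0.5, min_disordered_len=4):
    """Compute selected WIN_IUP features from propensity scores in one window.

    `tag` is "L" for IUPred_long or "S" for IUPred_short scores.
    `threshold` is an assumed cutoff; scores above it count as disordered (1).
    """
    n = len(scores)
    labels = [1 if s > threshold else 0 for s in scores]
    # Runs of consecutive identical labels with their lengths.
    runs = [(label, len(list(group))) for label, group in groupby(labels)]
    feats = {
        f"WIN_IUP_avg{tag}": mean(scores),
        f"WIN_IUP_std{tag}": pstdev(scores),  # population std (assumed)
    }
    for label in (0, 1):
        # Only disordered regions have a stated minimal length (4 residues).
        min_len = min_disordered_len if label == 1 else 1
        regions = [length for l, length in runs if l == label and length >= min_len]
        feats[f"WIN_IUP_content{tag}_{label}"] = labels.count(label) / n
        feats[f"WIN_IUP_num_region{tag}_{label}"] = len(regions) / n
    return feats
```

With the scores `[0.1, 0.2, 0.6, 0.7, 0.8, 0.9, 0.3, 0.6, 0.2, 0.1]`, the four consecutive residues above the cutoff form one disordered region, while the isolated residue scoring 0.6 counts toward the disordered content but not toward the region count.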
Features based on the sequence complexity derived from the input sequence using
SEG (SEG, 10 features):
• CENT_SEG_isH: binary encoding of the high vs. low complexity computed with SEG
for the residue in the center of the window (1 feature).
• WIN_SEG_content{L, H}: number of residues in the sliding window in low and high
complexity regions, divided by the length of the window (2 features).
• WIN_SEG_{longest, shortest, avg}_region{L, H}: longest, shortest and average
length of low and high complexity regions in the sliding window, normalized by the
length of the window (3 × 2 = 6 features).
• WIN_SEG_sum_regionLH: sum of the number of low and high complexity regions
in the sliding window, normalized by the length of the window (1 feature).