Section 3.1 Case Study: Cell Segmentation in High-Content Screening
Medical researchers often seek to understand the effects of medicines or diseases on the size, shape,
development status, and number of cells in a living organism or plant.
To do this, experts can examine the target serum or tissue under a microscope and manually assess
the desired cell characteristics.
This work is tedious and requires expert knowledge of the cell type and characteristics.
Library and Data Loading
For this research, Hill et al. (2007) assembled a data set consisting of 2,019 cells.
Of these cells, 1,300 were judged to be poorly segmented (PS) and 719 were well segmented (WS);
1,009 cells were reserved for the training set.
For a particular type of cell, the researchers used different stains that would be visible to different
optical channels.
Channel one was associated with the cell body and can be used to determine the cell perimeter, area,
and other qualities.
Channel two interrogated the cell nucleus by staining the nuclear DNA (shown in blue shading in this
figure).
Channels three and four were stained to detect actin and tubulin, respectively.
These are two types of filaments that traverse the cells in scaffolds and are part of the cell's
cytoskeleton.
Section 3.2 Data Transformations for Individual Predictors
knitr::opts_chunk$set(echo = TRUE)
library(AppliedPredictiveModeling)
data(segmentationOriginal)
help(segmentationOriginal)
head(segmentationOriginal)
names(segmentationOriginal)
## Retain the original training set
segTrain <- subset(segmentationOriginal, Case == "Train")
## Remove the first three columns (identifier columns)
segTrainX <- segTrain[, -(1:3)]
segTrainClass <- segTrain$Class
length(segTrainClass) # 1009
The most straightforward and common data transformation is to center and scale the predictor variables.
To center a predictor, the average value is subtracted from all of its values, giving the predictor a mean of zero.
Scaling the data coerces the values to have a common standard deviation of one.
These manipulations are generally used to improve the numerical stability of some calculations.
The only real downside to these transformations is a loss of interpretability of the individual values
since the data are no longer in the original units.
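As a quick illustration on simulated data (not the case study predictors), base R's scale() performs exactly this centering and scaling:

```r
# Simulated predictor; any numeric vector works the same way
set.seed(42)
x <- rexp(100, rate = 0.5)

# Center (subtract the mean) and scale (divide by the standard deviation)
z <- (x - mean(x)) / sd(x)

# Equivalent built-in: scale() returns a matrix with the same values
z2 <- as.numeric(scale(x))

mean(z)  # effectively 0 (up to floating-point error)
sd(z)    # 1
```

After this transformation the values are in standard-deviation units, which is exactly the loss of interpretability noted above.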
Another common reason for transformations is to remove distributional skewness.
An un-skewed distribution is roughly symmetric:
the probability of falling on either side of the distribution's mean is roughly equal.
A right-skewed distribution has a larger number of points on the left side of the distribution (smaller
values) than on the right side (larger values).
A general rule of thumb to consider is that skewed data whose ratio of the highest value to the lowest
value is greater than 20 have significant skewness.
If the predictor distribution is roughly symmetric, the skewness values will be close to zero.
As the distribution becomes more right skewed, the skewness statistic becomes larger.
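The sample skewness statistic described above can be computed directly from the central moments; a minimal base-R sketch (e1071's skewness(), used later in this section, additionally applies a small-sample adjustment by default):

```r
# Moment-based sample skewness: g1 = m3 / m2^(3/2),
# where m_k is the k-th central moment, sum((x - mean(x))^k) / n
skew_g1 <- function(x) {
  n  <- length(x)
  m2 <- sum((x - mean(x))^2) / n
  m3 <- sum((x - mean(x))^3) / n
  m3 / m2^(3/2)
}

set.seed(1)
skew_g1(rlnorm(1000))  # clearly positive: lognormal data are right skewed
skew_g1(rnorm(1000))   # near zero: normal data are roughly symmetric
```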
statusColNum <- grep("Status", names(segTrainX))
statusColNum
# Confirm these are categorical: each takes only one, two, or three distinct values
sapply(segTrainX[, statusColNum], table)
segTrainXNC <- segTrainX[, -statusColNum] # NC: no categorical vars.
## The column VarIntenCh3 measures the standard deviation of the intensity
## of the pixels in the actin filaments (channel 3)
max(segTrainX$VarIntenCh3)/min(segTrainX$VarIntenCh3) # 870.8872 is much greater than 20!
library(e1071) # for skewness()
skewness(segTrainX$VarIntenCh3) # 2.391624
library(caret)
## Use caret's preProcess function to estimate a transformation for skewness
?preProcess
segPP <- preProcess(segTrainX, method = "BoxCox") # A kind of power transformation
str(segPP)
names(segPP)
class(segPP)
## Apply the transformations
segTrainTrans <- predict(segPP, segTrainX) # predict(<preProcess object>, <new data>)
## Results for a single predictor
segPP$bc$VarIntenCh3
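For intuition about what preProcess is estimating here: the Box-Cox family transforms a positive predictor as x* = (x^lambda - 1)/lambda (and log(x) when lambda = 0), with lambda chosen by maximum likelihood. A self-contained base-R sketch of that estimation on simulated data (a simplified illustration, not caret's actual implementation):

```r
# Profile log-likelihood Box-Cox: pick the lambda on a grid that maximizes
# the normal log-likelihood of the transformed data (illustrative only)
boxcox_lambda <- function(x, grid = seq(-2, 2, by = 0.1)) {
  stopifnot(all(x > 0))  # Box-Cox requires strictly positive data
  n <- length(x)
  ll <- sapply(grid, function(lam) {
    y <- if (abs(lam) < 1e-8) log(x) else (x^lam - 1) / lam
    # -n/2 * log of the MLE variance, plus the Jacobian term
    # (lam - 1) * sum(log(x)) that makes likelihoods comparable across lambdas
    -n / 2 * log(sum((y - mean(y))^2) / n) + (lam - 1) * sum(log(x))
  })
  grid[which.max(ll)]
}

set.seed(123)
x   <- rlnorm(500)        # strongly right-skewed simulated predictor
lam <- boxcox_lambda(x)   # estimated lambda lands near 0, i.e. a log transform
y   <- if (abs(lam) < 1e-8) log(x) else (x^lam - 1) / lam
```

Because the simulated data are lognormal, the estimated lambda is close to zero and the transformed values are far less skewed than the original ones, which is the same effect preProcess achieves on predictors such as VarIntenCh3.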