TASK: Please see the dataset ('ELP.txt'; see the attachment). Using that dataset, A) come up
with your own idea for an analysis (i.e., make a hypothesis), B) conduct the analysis using the
dataset, and C) report the results.
How did I make 'ELP.txt'?
This file is part of a large dataset created by the "English Lexicon Project".
In this project, the researchers collected behavioral data (time and accuracy for lexical decision
and naming) as well as characteristics of the stimuli, such as frequency, neighborhood,
grammatical category, etc. We can download (at least a part of) the whole dataset to investigate
whether a certain characteristic of a word can affect the processing of that word. If you want to
see the details of the project, visit the web page (http://elexicon.wustl.edu/).
To make the dataset, I chose some of the characteristics and the lexical decision
performance. As a result, the data file contains 17 variables. Descriptions of the variables are as
follows:
ID: ID represents the case number, which is arbitrarily set and meaningless for us.
Sub_ID: Sub_ID represents the participant ID number, which ranges from 1 to 50.
Trial: Trial shows the order of trials for each participant, ranging from 1 to 3374.
Type: Type shows the stimulus type. 1 means a meaningful word. 0 means a nonsense word.
D_Accuracy: D_Accuracy represents the success/failure of the trial. 1 means that the participant
answered correctly. 0 means an incorrect trial.
D_RT: D_RT is the time for a response in milliseconds.
D_word: D_word shows what was presented on a screen as a stimulus.
Outlier: Outlier flags too-fast and too-slow trials. 0 means a non-outlier trial. 1 means an
outlier trial, where the RT was less than 200 msec or more than 2000 msec.
D_Zscore: D_Zscore is a normalized RT (whose mean is 0 and SD is 1). Since I have not
explained normalization in any analysis, you do not need to worry about this variable.
The following variables contain values only in meaningful word trials.
Word: Word shows the stimulus word.
Length: the number of letters
Freq_HAL: Freq_HAL is defined as the frequency of the word as reported by the HAL study,
based on the HAL corpus, which consists of approximately 131 million words gathered
across 3,000 Usenet newsgroups during February 1995.
Lund, K., & Burgess, C. (1996). Producing high-dimensional semantic
spaces from lexical co-occurrence. Behavior Research Methods, Instruments,
& Computers, 28, 203-208.
Log_Freq_HAL: Log-transformed Freq_HAL above
Ortho_N: Ortho_N is the number of words that can be obtained by changing
one letter while preserving the identity and positions of the other
letters (i.e., Coltheart’s N; Coltheart, Davelaar, Jonasson, & Besner,
1977). For example, the ELP returns the following orthographic
neighbors of CAT: OAT, COT, VAT, CAB, MAT, CAM, BAT, RAT,
CAD, HAT, CAP, PAT, FAT, SAT, EAT, CAR, CUT, CAN.
Phono_N: Phono_N is defined as the number of phonological neighbors that a word has. This
statistic excludes homophones.
Phono_N_H: Phono_N_H is defined as the number of phonological neighbors that a word
has. This statistic includes homophones.
POS: POS is the part of speech of the word.
JJ adjective ("beautiful")
NN noun ("beauty")
RB adverb ("beautifully")
VB verb ("beautify")
encl enclitic group ("beauty's")
minor all other ("the", "in", "what", "uh")
? unknown
| separates alternatives: "can" VB|NN
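The Ortho_N definition above (Coltheart's N) can be sketched in R. This is a toy illustration only: the six-word lexicon and the function name coltheart_n are my own inventions, not part of the ELP, which computes this statistic over its full word list.

```r
# Sketch: counting Coltheart's N within a small toy lexicon.
lexicon <- c("cat", "cot", "cut", "bat", "dog", "cab")

coltheart_n <- function(word, lexicon) {
  # candidates must have the same length and differ from the word itself
  same_len <- lexicon[nchar(lexicon) == nchar(word) & lexicon != word]
  target <- strsplit(word, "")[[1]]
  # a neighbor differs in exactly one letter position
  diffs <- sapply(strsplit(same_len, ""), function(x) sum(x != target))
  sum(diffs == 1)
}

coltheart_n("cat", lexicon)  # cot, cut, bat, cab -> 4
```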
Please note that the variables (Length, Freq_HAL, Log_Freq_HAL, Ortho_N, Phono_N,
Phono_N_H) are on an interval scale (i.e., they are continuous variables). In particular, Length and
Log_Freq_HAL are approximately normally distributed (you can check that), while the others are not.
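One way to check a distribution yourself is with a histogram, a Q-Q plot, and a Shapiro-Wilk test. The sketch below uses simulated stand-in values, since this handout does not bundle ELP.txt; with the real file you would pass worddata$Length (or Log_Freq_HAL) instead.

```r
# Sketch: checking whether a variable looks normally distributed
# (simulated placeholder values standing in for worddata$Length).
set.seed(1)
length_vals <- rnorm(500, mean = 8, sd = 2)
hist(length_vals, main = "Length", xlab = "letters")
qqnorm(length_vals); qqline(length_vals)
shapiro.test(length_vals)   # formal normality test; requires 3 <= n <= 5000
```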
How do you write your paper?
I recommend reporting the following:
1. hypothesis (and prediction)
2. R codes and their outputs (including graphs)
3. Summary of the analysis
What analyses can be conducted for this dataset?
I think you can conduct any analysis that we have learned in this course (i.e., t-tests,
ANOVAs, regression analyses). Please conduct at least one analysis (of course, you can do as
many as you want!).
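As a sketch of what one such analysis could look like, here is a regression of decision RT on log frequency. The data are simulated placeholders (the negative slope is built in by construction); with the real file you would use worddata$D_RT and worddata$Log_Freq_HAL instead.

```r
# Sketch of one possible analysis: regressing RT on log frequency
# (simulated stand-in data, not the actual ELP values).
set.seed(2)
log_freq <- runif(200, 0, 12)
d_rt <- 900 - 25 * log_freq + rnorm(200, sd = 80)  # slope of -25 built in
fit <- lm(d_rt ~ log_freq)
summary(fit)
plot(log_freq, d_rt); abline(fit)
```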
If you want to test your idea with ANOVAs, you can make extra variables by coding the continuous
variables. For instance, if you want to test the difference between words that have orthographic
neighbors and those that have none, you can use Ortho_N.
After reading the dataset, you can cut the nonword trials. Then, make an extra variable (e.g., ortho)
and assign 1 to words having any number of orthographic neighbors (i.e., Ortho_N >= 1) and 0
to those having no orthographic neighbors (i.e., Ortho_N == 0).
In R, the code looks like the following.
#cut the nonsense word trials
worddata <- rawdata[rawdata$Type == 1, ]
#code the ortho variable: 1 for words with any orthographic neighbors, 0 otherwise
worddata$ortho <- 0
worddata[worddata$Ortho_N >= 1, "ortho"] <- 1
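Once the 0/1 variable exists, a t-test is one way to use it. Below is a self-contained sketch on invented data: the column names mirror the handout, but the values (and the object name toy) are made up purely for illustration.

```r
# Toy data standing in for the word trials (invented values).
set.seed(3)
toy <- data.frame(D_RT    = rnorm(100, mean = 700, sd = 100),
                  Ortho_N = rep(c(0, 0, 1, 2, 5), 20))
# code the dichotomous variable as described above
toy$ortho <- ifelse(toy$Ortho_N >= 1, 1, 0)
# compare RT between words with and without orthographic neighbors
t.test(D_RT ~ ortho, data = toy)
```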
Note for reading the data file:
You can use read.table() to read a text file, but this time, please add an extra argument, quote
= "", because apostrophes are included in the D_word and Word variables in the file, and an
apostrophe can be regarded as the edge of a string. If you set quote = "", you can avoid this problem.
Sample code is as follows:
rawdata <- read.table("ELP.txt", sep = "\t", fill = TRUE, header = TRUE, quote = "")
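To see why quote = "" matters, here is a small self-contained demo using a temporary file (not the real ELP.txt) that contains an apostrophe:

```r
# Write a tiny tab-delimited file containing an apostrophe, then read it back.
tmp <- tempfile(fileext = ".txt")
writeLines(c("ID\tWord", "1\tbeauty's", "2\tcat"), tmp)

# with default quoting, the apostrophe opens a quote that never closes,
# so rows typically get swallowed (hence the suppressed warning)
bad  <- suppressWarnings(read.table(tmp, sep = "\t", header = TRUE))
good <- read.table(tmp, sep = "\t", header = TRUE, quote = "")
nrow(good)   # 2 -- both rows survive, apostrophe kept as a plain character
```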
Deadline: 23:59, 28th January. This deadline is the final one; I cannot postpone it any further
because of the deadline for the assessment.
Questions: Please let me know by e-mail: