
Assignment 6: Relation Extraction & Classification


Introduction
Relation extraction is the task of “finding and classifying semantic relations among the text entities” [J+M 3rd ed., Ch. 18, p. 1]. Given a sentence containing two entities (called the head and the tail), the goal is to classify the relation between the head entity and the tail entity. For example, from the sentence “Newton served as the president of the Royal Society”, the relation “is a member of” between the head entity “Newton” and the tail entity “the Royal Society” can be extracted.
 
Newton served as the president of the Royal Society. (head: Newton, tail: the Royal Society, relation: member_of)
 
In this assignment, you will build a Naive Bayes model based on bag-of-words (BoW) features to classify the relation expressed in a sentence. Your program should process each sentence and classify it as indicating one of the following relations: publisher, director, performer, and characters.
Task
Input
Three CSV files, train.csv, dev.csv, and test.csv, adapted from the FewRel dataset, will be provided. You may need to perform some preprocessing on the data to facilitate classification. Each file contains five columns:
 
row_id (e.g., 435): The unique row id.
tokens (e.g., “Trapped and Deceived ) is a 1994 television film directed by Robert Iscove .”): The tokenized sentence, with tokens separated by a single space.
relation (e.g., director): The correct relation (original label).
head_pos (e.g., 0 1 2): Positions of the head entity tokens (“Trapped and Deceived”). Indices start at 0 and are separated by a single space.
tail_pos (e.g., 11 12): Positions of the tail entity tokens (“Robert Iscove”).
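As a concrete illustration, the sketch below (assuming UTF-8 files, the column names above, and the hypothetical helper names load_rows and entity_tokens) loads one of the CSV files with Python’s built-in csv module and recovers the head and tail entity tokens from their positions.

import csv

def load_rows(path):
    """Read one of the provided CSV files into a list of dicts keyed by column name."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

def entity_tokens(row, pos_column):
    """Recover the entity tokens pointed to by head_pos or tail_pos."""
    tokens = row["tokens"].split(" ")
    positions = [int(i) for i in row[pos_column].split(" ")]
    return [tokens[i] for i in positions]

rows = load_rows("data/train.csv")
example = rows[0]
print(entity_tokens(example, "head_pos"), entity_tokens(example, "tail_pos"))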
 
Part 1: Classify Text
Write and document a program that trains a Naive Bayes classifier (Chapter 4) for the 4-class classification problem at hand. Your program will process texts and classify them as belonging to one of the following classes: publisher, director, performer, and characters. The NB classifier must be implemented from scratch, rather than by calling libraries such as scikit-learn.
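One possible from-scratch implementation is a multinomial Naive Bayes over bag-of-words counts with add-one smoothing, as sketched below. The class name, the decision to ignore unknown words, and the tokenization are illustrative assumptions, not the required design; these are exactly the choices you must document and justify.

import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes over bag-of-words counts (add-one smoothing)."""

    def fit(self, docs, labels):
        # docs: list of token lists; labels: list of relation strings.
        self.classes = sorted(set(labels))
        self.log_prior = {c: math.log(labels.count(c) / len(labels)) for c in self.classes}
        self.word_counts = {c: Counter() for c in self.classes}
        self.total_counts = defaultdict(int)
        self.vocab = set()
        for tokens, c in zip(docs, labels):
            for w in tokens:
                self.word_counts[c][w] += 1
                self.total_counts[c] += 1
                self.vocab.add(w)

    def predict(self, tokens):
        best_class, best_score = None, float("-inf")
        v = len(self.vocab)
        for c in self.classes:
            score = self.log_prior[c]
            for w in tokens:
                if w not in self.vocab:   # ignoring unknown words is one possible choice
                    continue
                score += math.log((self.word_counts[c][w] + 1) /
                                  (self.total_counts[c] + v))
            if score > best_score:
                best_class, best_score = c, score
        return best_class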
 
Your program should take train.csv as input, train the model, and print the training accuracy (using 3-fold cross-validation; see Section 4.8). After training, the program should take dev.csv and test.csv as input, write the predictions to files, and print the accuracy on the test set.
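A minimal 3-fold cross-validation sketch is shown below. It reuses the NaiveBayes sketch above and splits the data into contiguous folds; whether and how you shuffle the data first is another design choice to justify.

def cross_validate(docs, labels, k=3):
    """Average accuracy over k contiguous folds."""
    fold_size = len(docs) // k
    accuracies = []
    for i in range(k):
        start = i * fold_size
        end = (i + 1) * fold_size if i < k - 1 else len(docs)
        train_docs = docs[:start] + docs[end:]
        train_labels = labels[:start] + labels[end:]
        model = NaiveBayes()          # the from-scratch classifier sketched above
        model.fit(train_docs, train_labels)
        correct = sum(model.predict(d) == y
                      for d, y in zip(docs[start:end], labels[start:end]))
        accuracies.append(correct / (end - start))
    return sum(accuracies) / k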
 
The output should be CSV files (named output_[inputFilename].csv) containing the following 3 columns: original_label, classifier_assigned_label, and row_id. As an example, the output file should be named output_test.csv when running the test data through your classifier. dev.csv contains no original labels, so that column should be left blank.
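The following sketch writes an output file in the required format using the csv module; the helper name write_predictions and the has_labels flag are illustrative assumptions.

import csv

def write_predictions(rows, predictions, out_path, has_labels=True):
    """Write output_[inputFilename].csv with the three required columns."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["original_label", "classifier_assigned_label", "row_id"])
        for row, pred in zip(rows, predictions):
            original = row["relation"] if has_labels else ""   # blank for dev.csv
            writer.writerow([original, pred, row["row_id"]])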
 
For the test set, report a confusion matrix with precision and recall as in Fig 4.5. Also, report the aggregated (pooled) micro-averaged and macro-averaged precision as in Fig 4.6.
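One way to compute these metrics from scratch is sketched below; the function names are illustrative. The confusion matrix is a dict indexed by (gold, predicted); micro-averaged precision pools the per-class counts, while macro-averaged precision averages the per-class values.

from collections import Counter

def confusion_matrix(gold, predicted, classes):
    """counts[(gold_label, predicted_label)] -> number of test items."""
    counts = Counter(zip(gold, predicted))
    return {(g, p): counts[(g, p)] for g in classes for p in classes}

def precision_recall(matrix, classes):
    per_class = {}
    for c in classes:
        tp = matrix[(c, c)]
        fp = sum(matrix[(g, c)] for g in classes if g != c)
        fn = sum(matrix[(c, p)] for p in classes if p != c)
        per_class[c] = (tp / (tp + fp) if tp + fp else 0.0,   # precision
                        tp / (tp + fn) if tp + fn else 0.0)   # recall
    macro_p = sum(p for p, _ in per_class.values()) / len(classes)
    pooled_tp = sum(matrix[(c, c)] for c in classes)
    pooled_fp = sum(matrix[(g, p)] for g in classes for p in classes if g != p)
    micro_p = pooled_tp / (pooled_tp + pooled_fp) if pooled_tp + pooled_fp else 0.0
    return per_class, micro_p, macro_p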
 
Document and justify your design decisions. The textbook leaves several options for building an NB classifier and selecting features open (e.g., tokenization, model parameters, handling stop words, and handling unknown words). It is also up to you to decide what to do (if needed) with the head and tail entities. You are expected to make choices on these issues, explain those choices, and justify them in your report.
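For example, one possible feature extractor (purely illustrative; every option here is a design decision you would need to justify) lowercases tokens and marks the head and tail entity tokens with prefixes:

def features(row, lowercase=True, mark_entities=True):
    """One possible BoW feature extractor built from a CSV row."""
    tokens = row["tokens"].split(" ")
    if lowercase:
        tokens = [t.lower() for t in tokens]
    if mark_entities:
        head = {int(i) for i in row["head_pos"].split(" ")}
        tail = {int(i) for i in row["tail_pos"].split(" ")}
        tokens = ["HEAD_" + t if i in head else "TAIL_" + t if i in tail else t
                  for i, t in enumerate(tokens)]
    return tokens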
Part 2: Error analysis
Take a look at text excerpts that are incorrectly classified. Which classes of text have a tendency to be misclassified as another (e.g., is performer misclassified as characters)? Do the misclassified texts share common attributes? If so, what are they?
 
Take a look at the output from running dev.csv through your classifier. Do the assigned labels make sense? When they don’t, do the texts share attributes with the test.csv cases that were misclassified?
 
Note: the questions listed above are not exhaustive.
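To gather material for the error analysis, you might filter the output file for rows whose assigned label differs from the original label, as in the sketch below (the helper name misclassified is illustrative).

import csv

def misclassified(output_path):
    """Collect rows whose assigned label differs from the original label."""
    with open(output_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    return [r for r in rows
            if r["original_label"] and r["original_label"] != r["classifier_assigned_label"]]

for r in misclassified("output/output_test.csv")[:10]:
    print(r["row_id"], r["original_label"], "->", r["classifier_assigned_label"])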
 
 
Output: CSV files, code, report, and documentation
Your final repository should have this structure:
code/
data/
output/output_dev.csv
output/output_test.csv
.gitignore
README.md
report.pdf
Predictions
Include the output from running your model on dev.csv and test.csv. Each sample should be output with its predicted label and true label. The true label for dev.csv will be blank, as the input file does not contain one.
Report
● The report should be in PDF format.
● All of the decisions that you make should be justified within the report.
● It should contain a table describing model performance (e.g., accuracy).
● It should contain any other information requested above.
● Please use appropriate sub-headings so that it is easier for the marker to find key information.
Code and README
Include all source files in the code/ folder, along with a README file (in Markdown or plain text) describing the steps necessary to run the code. You should also acknowledge all sources consulted in the README file.
Suggestions
● Chapter 4 of the textbook provides detailed instructions for building an NB text classifier, as well as the methods for evaluation.
● Chapter 18 describes the relation extraction task, which you can read if you are interested.
● For input and output, use Python’s built-in csv module or pandas to ensure the format is correct.
 
 
Description of the relations
 
director: director(s) of a film, TV series, stage play, video game, or similar
performer: actor, musician, band, or other performer associated with this role or musical work
publisher: organization or person responsible for publishing books, periodicals, games, or software
characters: characters which appear in this item (e.g., plays, operas, operettas, books, comics, films, TV series, video games)
 
Further reading
The method for relation extraction described in this assignment, i.e., training a text classifier to classify relations, requires a substantial amount of training data for each relation, and it does not scale well because large amounts of labelled data are hard to obtain. To alleviate this issue, some clever ideas have emerged. Distant supervision generates a “silver-standard” training set by first finding many pairs of entities that belong to the relation of interest, and then using all sentences containing those entity pairs as training data for that relation. Few-shot learning tries to train ML models that generalize to unseen classes from only a few examples of each.