
Assignment 6: Relation Extraction & Classification


Introduction
Relation extraction is the task of “finding and classifying semantic relations among the text entities” [J+M 3rd ed., Ch. 18, p. 1]. Given a sentence containing two entities (called the head and the tail), the goal is to classify the relation between the head entity and the tail entity. For example, from the sentence “Newton served as the president of the Royal Society”, the relation “is a member of” between the head entity “Newton” and the tail entity “the Royal Society” can be extracted.
 
Newton served as the president of the Royal Society. (head: Newton, tail: the Royal Society, relation: member_of)
 
In this assignment, you will build a Naive Bayes model based on bag-of-words (BoW) features to classify the relation expressed in a sentence. Your program should process each sentence and classify it as indicating one of the following relations: publisher, director, performer, and characters.
Task
Input
Three CSV files, train.csv, dev.csv, and test.csv, adapted from the FewRel dataset, will be provided. You may need to perform some preprocessing on the data to facilitate classification. Each file contains five columns:
 
row_id (e.g., 435): The unique row id.
tokens (e.g., “Trapped and Deceived ) is a 1994 television film directed by Robert Iscove .”): The tokenized sentence, with tokens separated by a single space.
relation (e.g., director): The correct relation (original label).
head_pos (e.g., 0 1 2): Positions of the head entity tokens (“Trapped and Deceived”). Indices start at 0 and are separated by a single space.
tail_pos (e.g., 11 12): Positions of the tail entity tokens (“Robert Iscove”).
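As a concrete illustration, the sketch below (assuming UTF-8 files, the column names above, and the hypothetical helper names load_rows and entity_tokens) loads one of the CSV files with Python’s built-in csv module and recovers the head and tail entity tokens from their positions.

import csv

def load_rows(path):
    """Read one of the provided CSV files into a list of dicts keyed by column name."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

def entity_tokens(row, pos_column):
    """Recover the entity tokens pointed to by head_pos or tail_pos."""
    tokens = row["tokens"].split(" ")
    positions = [int(i) for i in row[pos_column].split(" ")]
    return [tokens[i] for i in positions]

rows = load_rows("data/train.csv")
example = rows[0]
print(entity_tokens(example, "head_pos"), entity_tokens(example, "tail_pos"))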
 
Part 1: Classify Text
Write and document a program that trains a Naive Bayes classifier (Chapter 4) for the 4-class classification problem at hand. Your program will process texts and classify them as belonging to one of the following classes: publisher, director, performer, and characters. The NB classifier must be implemented from scratch, rather than by calling libraries such as scikit-learn.
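One possible from-scratch implementation is a multinomial Naive Bayes over bag-of-words counts with add-one smoothing, as sketched below. The class name, the decision to ignore unknown words, and the tokenization are illustrative assumptions, not the required design; these are exactly the choices you must document and justify.

import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes over bag-of-words counts (add-one smoothing)."""

    def fit(self, docs, labels):
        # docs: list of token lists; labels: list of relation strings.
        self.classes = sorted(set(labels))
        self.log_prior = {c: math.log(labels.count(c) / len(labels)) for c in self.classes}
        self.word_counts = {c: Counter() for c in self.classes}
        self.total_counts = defaultdict(int)
        self.vocab = set()
        for tokens, c in zip(docs, labels):
            for w in tokens:
                self.word_counts[c][w] += 1
                self.total_counts[c] += 1
                self.vocab.add(w)

    def predict(self, tokens):
        best_class, best_score = None, float("-inf")
        v = len(self.vocab)
        for c in self.classes:
            score = self.log_prior[c]
            for w in tokens:
                if w not in self.vocab:   # ignoring unknown words is one possible choice
                    continue
                score += math.log((self.word_counts[c][w] + 1) /
                                  (self.total_counts[c] + v))
            if score > best_score:
                best_class, best_score = c, score
        return best_class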
 
Your program should take train.csv as input, train the model, and print the training accuracy (using 3-fold cross-validation; see Section 4.8). After training, the program should take dev.csv and test.csv as input, write the predictions to files, and print the accuracy on the test set.
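A minimal 3-fold cross-validation sketch is shown below. It reuses the NaiveBayes sketch above and splits the data into contiguous folds; whether and how you shuffle the data first is another design choice to justify.

def cross_validate(docs, labels, k=3):
    """Average accuracy over k contiguous folds."""
    fold_size = len(docs) // k
    accuracies = []
    for i in range(k):
        start = i * fold_size
        end = (i + 1) * fold_size if i < k - 1 else len(docs)
        train_docs = docs[:start] + docs[end:]
        train_labels = labels[:start] + labels[end:]
        model = NaiveBayes()          # the from-scratch classifier sketched above
        model.fit(train_docs, train_labels)
        correct = sum(model.predict(d) == y
                      for d, y in zip(docs[start:end], labels[start:end]))
        accuracies.append(correct / (end - start))
    return sum(accuracies) / k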
 
The output should be CSV files (named output_[inputFilename].csv) containing the following 3 columns: original_label, classifier_assigned_label, and row_id. As an example, the output file should be named output_test.csv when running the test data through your classifier. dev.csv contains no original labels, so that column should be left blank.
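The following sketch writes an output file in the required format using the csv module; the helper name write_predictions and the has_labels flag are illustrative assumptions.

import csv

def write_predictions(rows, predictions, out_path, has_labels=True):
    """Write output_[inputFilename].csv with the three required columns."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["original_label", "classifier_assigned_label", "row_id"])
        for row, pred in zip(rows, predictions):
            original = row["relation"] if has_labels else ""   # blank for dev.csv
            writer.writerow([original, pred, row["row_id"]])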
 
For the test set, report a confusion matrix with precision and recall as in Fig 4.5. Also, report the aggregated (pooled) micro-averaged and macro-averaged precision as in Fig 4.6.
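One way to compute these metrics from scratch is sketched below; the function names are illustrative. The confusion matrix is a dict indexed by (gold, predicted); micro-averaged precision pools the per-class counts, while macro-averaged precision averages the per-class values.

from collections import Counter

def confusion_matrix(gold, predicted, classes):
    """counts[(gold_label, predicted_label)] -> number of test items."""
    counts = Counter(zip(gold, predicted))
    return {(g, p): counts[(g, p)] for g in classes for p in classes}

def precision_recall(matrix, classes):
    per_class = {}
    for c in classes:
        tp = matrix[(c, c)]
        fp = sum(matrix[(g, c)] for g in classes if g != c)
        fn = sum(matrix[(c, p)] for p in classes if p != c)
        per_class[c] = (tp / (tp + fp) if tp + fp else 0.0,   # precision
                        tp / (tp + fn) if tp + fn else 0.0)   # recall
    macro_p = sum(p for p, _ in per_class.values()) / len(classes)
    pooled_tp = sum(matrix[(c, c)] for c in classes)
    pooled_fp = sum(matrix[(g, p)] for g in classes for p in classes if g != p)
    micro_p = pooled_tp / (pooled_tp + pooled_fp) if pooled_tp + pooled_fp else 0.0
    return per_class, micro_p, macro_p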
 
Document and justify your design decisions. The textbook leaves several options for building an NB classifier and selecting features open (e.g., tokenization, model parameters, handling stop words, and handling unknown words). It is also up to you to decide what to do (if needed) with the head and tail entities. You are expected to make choices on these issues, explain those choices, and justify them in your report.
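For example, one possible feature extractor (purely illustrative; every option here is a design decision you would need to justify) lowercases tokens and marks the head and tail entity tokens with prefixes:

def features(row, lowercase=True, mark_entities=True):
    """One possible BoW feature extractor built from a CSV row."""
    tokens = row["tokens"].split(" ")
    if lowercase:
        tokens = [t.lower() for t in tokens]
    if mark_entities:
        head = {int(i) for i in row["head_pos"].split(" ")}
        tail = {int(i) for i in row["tail_pos"].split(" ")}
        tokens = ["HEAD_" + t if i in head else "TAIL_" + t if i in tail else t
                  for i, t in enumerate(tokens)]
    return tokens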
Part 2: Error analysis
Take a look at text excerpts that are incorrectly classified. Which classes of text have a tendency to be misclassified as another (e.g., is performer misclassified as characters)? Do the misclassified texts share common attributes? If so, what are they?
 
Take a look at the output from running dev.csv through your classifier. Do the assigned labels make sense? When they don’t, do the texts share attributes with the test.csv cases that were misclassified?
 
Note: the questions listed above are not exhaustive.
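To gather material for the error analysis, you might filter the output file for rows whose assigned label differs from the original label, as in the sketch below (the helper name misclassified is illustrative).

import csv

def misclassified(output_path):
    """Collect rows whose assigned label differs from the original label."""
    with open(output_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    return [r for r in rows
            if r["original_label"] and r["original_label"] != r["classifier_assigned_label"]]

for r in misclassified("output/output_test.csv")[:10]:
    print(r["row_id"], r["original_label"], "->", r["classifier_assigned_label"])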
 
 
Output: CSV files, code, report, and documentation
Your final repository should have this structure:
code/
data/
output/output_dev.csv
output/output_test.csv
.gitignore
README.md
report.pdf
Predictions
Include the output from running your model on dev.csv and test.csv. Each sample should be output with its predicted label and true label. The true label for dev.csv will be blank, as the input file does not contain one.
Report
● The report should be in PDF format.
● All of the decisions that you make should be justified within the report.
● It should contain a table describing model performance (e.g., accuracy).
● It should contain any other information requested above.
● Please use appropriate sub-headings so that it is easier for the marker to find key information.
Code and README
Include all source files in the code/ folder, along with a README file (in Markdown or plain text) describing the steps necessary to run the code. You should also acknowledge all sources consulted in the README file.
Suggestions
● Chapter 4 of the textbook provides detailed instructions for building an NB text classifier, as well as the methods for evaluation.
● Chapter 18 describes the relation extraction task, which you can read if you are interested.
● For input and output, use Python’s built-in csv module or pandas to ensure the format is correct.
 
 
Description of the relations
 
director: director(s) of a film, TV series, stage play, video game, or similar
performer: actor, musician, band, or other performer associated with this role or musical work
publisher: organization or person responsible for publishing books, periodicals, games, or software
characters: characters which appear in this item (e.g., plays, operas, operettas, books, comics, films, TV series, video games)
 
Further reading
The method for relation extraction described in this assignment, i.e., training a text classifier to classify relations, requires a substantial amount of training data for each relation, and it does not scale well because large amounts of labelled data are hard to obtain. To alleviate this issue, some clever ideas have emerged. Distant supervision generates a “silver-standard” training set by first finding many pairs of entities that belong to the relation of interest, and then using all sentences containing those entity pairs as training data for that relation. Few-shot learning tries to train ML models that generalize to unseen classes from only a few examples of each.