CC 03
The game rock, paper, scissors is a classic tool used to make important decisions between two friends (read https://www.wikihow.com/Play-Rock,-Paper,-Scissors).
1. Create a function named winner_RPS. The function winner_RPS takes two parameters, p1 and p2.
It returns p1 (the parameter) if p1 wins; p2 if p2 wins; and None if there is no winner.
Try to write your solution such that winner_RPS has a single return statement.
This code needs to be in the lesson.py module.
2. p1 and p2 can only be one of the following values:
"rock", "paper", "scissors"
3. Test your code. You can write your own tests and not rely on the testing framework. In main.py write a test function named test_RPS to verify that your code is working. The function doesn't have any parameters and should test your winner_RPS at least a few times (more than twice!). Try to figure out how to return True if it passes all your tests (False otherwise).
Use the import statement in main.py (i.e. import lesson)
For example the following could be a test:
import lesson
t1 = 'rock'
t2 = 'paper'
if lesson.winner_RPS(t1, t2) != t2:
print('Test FAIL')
You can also use random.choice to model the selection part of the gameplay:
values = "rock,paper,scissors".split(',')
p1 = random.choice(values)
print(p1)
How would you test all possible cases? Is that even possible?
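If you get stuck structuring test_RPS, here is one possible shape for an exhaustive test. The winner_RPS below is only a stand-in with the expected behavior; in your own main.py you would `import lesson` and call lesson.winner_RPS instead:

```python
import itertools

# Stand-in with the expected behavior; in main.py you would
# `import lesson` and call lesson.winner_RPS instead.
BEATS = {'rock': 'scissors', 'paper': 'rock', 'scissors': 'paper'}

def winner_RPS(p1, p2):
    # single return statement, as the challenge suggests
    return p1 if BEATS[p1] == p2 else (p2 if BEATS[p2] == p1 else None)

def test_RPS():
    # only 9 ordered pairs exist, so it is possible to test every one
    values = ['rock', 'paper', 'scissors']
    for p1, p2 in itertools.product(values, repeat=2):
        expected = p1 if BEATS[p1] == p2 else (p2 if BEATS[p2] == p1 else None)
        if winner_RPS(p1, p2) != expected:
            return False
    return True

print(test_RPS())  # → True
```

With only 9 ordered pairs, testing every case is entirely possible.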
CC 04
🐍 Coding Challenges: Lucky 777
Prerequisites:
Python Regular Expressions, Parts 1,2,3
Python Remote I/O
DSP: Jupyter Lesson
Hapax: one more thing...
When a word occurs only once in a body of work or an entire written record, it's called a hapax. However, there are disagreements on how narrow the set of works can be. Usually, a hapax can only appear once in an author's entire collection rather than just within a specific piece.
For example, Hamlet has a famous hapax, 'Hebenon', a poison. It is said that this is Shakespeare's only use of the word. However, if you look for hapaxes (aka hapax legomena) in a single piece of text, there are many: Hamlet has over 2700 words that occur only once. Let's extend this fun fact to find a unique set of words within a body of text that do share some very specific attributes.
Let's classify all words in a body of text by how often they occur. A body of text is a lucky winner if it contains 7 words each that occur 7 times and each word is 7 characters long. For this project you will create a notebook, import some text and determine if the text is a 'winner'.
However you will write your solution to be generic so that any number could be passed in (e.g. 4 letter words that only occur 4 times and there are a total of 4 of them).
All code will be in your Colaboratory Notebook and it will be graded using Gradescope.
Step 0: Starting Point, New Notebook
https://colab.research.google.com/notebooks/welcome.ipynb
Be sure you are logged into Google using your @gmail address so your notebook will be saved to your drive.
Open a new colab notebook via File->New Python 3 notebook
Name it INFO490-777
Your notebook will be saved to your google drive in a special folder:
Step 1: Paste in Starter Code
In lesson.py there is some starter code for this project. Put this code into a new code cell in your notebook.
Step 2: Make Hamlet's text available via Google Drive
This step is a bit superfluous in that we're moving data from Project Gutenberg to your Google Drive and then accessing Hamlet from there. Why? Because it's useful to know the steps involved to make data accessible via Google. You can also use this method to access any data (csv files, images, etc) that are located in your personal drive.
Many versions
A previous lesson also used a specific text of Hamlet (RemoteIO); however, there are many editions/versions of this famous play (you can even take classes that study the different versions). On Project Gutenberg you can see different versions:
http://www.gutenberg.org/ebooks/search/?query=hamlet
For this project we will use this version.
Please read the Director's and Scanner's note to learn some of the details of this specific version of Hamlet.
Here's the easiest workflow (you are free to use any other method as well) to move that document into your Google Drive space:
Open a new tab in your browser, go to http://www.gutenberg.org/ebooks/2265
Save the UTF-8 version to your computer:
Save to your computer (name it hamlet.txt)
Go to your Google Drive account and select the New Button and the 'File upload' option to upload hamlet.txt from your computer.
Get the share link.
The main thing is you need the ID of the document. For example,
https://drive.google.com/open?id=19pOCDIXak04cTs7TLiEA3TKUCESU10ZM
Note that this is NOT the url you can use to fetch via the remote I/O in Python. It is a 'browser' friendly URL.
Step 3: Define the following function which returns the ID of Hamlet on your Google Drive:
def get_book_id():
# replace this with your resource Id
return '19pOCDIXak04cTs7TLiEA3TKUCESU10ZM'
Step 4: finish the implementation for build_google_drive_url() (see lesson.py for the code)
This function builds the url to fetch a document saved on Google Drive. You will then add to the baseurl the request parameters properly encoded. If this sounds difficult, go back to the RemoteIO lesson -- the answer is there.
You can use the current implementation (which returns the Project Gutenberg url) for partial credit.
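As a sketch of what this step is asking for (the base URL and parameter names here are assumptions; verify them against the RemoteIO lesson), the function might look like:

```python
from urllib.parse import urlencode

def build_google_drive_url(doc_id):
    # ASSUMPTION: this base url + query layout is one known way to
    # fetch a shared Drive file; confirm it against the RemoteIO lesson
    base = 'https://drive.google.com/uc'
    params = urlencode({'export': 'download', 'id': doc_id})
    return base + '?' + params
```

Note that urlencode handles the request-parameter encoding for you, which is the "properly encoded" part of the step.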
Step 5: TEST it
test your solution by downloading and reading your novel:
def get_hamlet():
g_id = get_book_id()
url = build_google_drive_url(g_id)
return read_remote(url)
hamlet = get_hamlet()
print(hamlet[0:100])
Step 6: Take a break to find the answer to life.
It's common? knowledge that the answer to life is 42: https://www.independent.co.uk/life-style/history/42-the-answer-to-life-the-universe-and-everything-2205734.html.
Shakespeare must have known this as well. Run the following code (get_hamlet needs to be working):
ANSWER_TO_LIFE = 42
def answer_to_life():
text = get_hamlet()
idx = text.find('To be,')
ans = text[idx:idx+ANSWER_TO_LIFE]
return ans
print(answer_to_life())
Step 7: Implement the following:
def clean_hamlet(text):
return text
Remove everything before the start of the play (i.e. the play starts with the line: The Tragedie of Hamlet)
Remove everything after the end of the play (i.e. the play ends after the final line (hint: the final line starts with FINIS)
Hint: use the search method for regular expressions (see Regular Expressions Part 3)
Remove any leading or trailing whitespace
Be sure to test your code before moving on
Do not hard code indices (e.g. return text[2345:4509])
If you find yourself using \n\r\t, you're on the wrong path. The auto-grader uses the same version of Hamlet, but the whitespace is not the same as what's on Project Gutenberg -- and this was not done on purpose, it's the result of what happens when you download/upload text documents between different architectures.
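A minimal sketch of clean_hamlet using re.search. It assumes the start marker and the FINIS line each appear where expected; in the real file the title line may occur more than once, so test carefully:

```python
import re

def clean_hamlet(text):
    # locate the line that starts the play and the final FINIS line,
    # then slice between them -- no hard-coded indices
    start = re.search(r'The Tragedie of Hamlet', text)
    end = re.search(r'FINIS.*', text)
    return text[start.start():end.end()].strip()
```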
Step 8: Implement the following:
def find_lucky(text, num):
lucky = []
return sorted(lucky)
The function find_lucky parses/tokenizes text (see rules below) and returns a sorted list of words if the text is 'lucky' (see above definition). Otherwise, return the empty list.
The following rules apply to tokenize and classify words:
Use the re module to tokenize the text
a token is a word that contains only letters and/or apostrophes (e.g. who, do's, wrong'd).
normalize the token to lower case. For this lesson you can keep quoted words (it won't affect the answer) but ideally, you would remove them (e.g. 'happy' would become happy).
For example, if the parameter num is 7 then it returns an array of words ONLY if all the following conditions are true:
each word has 7 characters
each word occurs 7 times in the text
there are 7 of these words
For example:
text = """
A boy, a cat, a rat and a dog were friends.
But the cat ate the rat. The dog ate the cat.
The boy? The boy and dog were friends.
"""
print(find_lucky(text, 3))
Should return 3 words ('boy', 'cat', 'dog')
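A sketch of one way to implement find_lucky with re and collections.Counter. The token pattern is a simplification (it keeps quoted words, as the rules above permit); adjust it to match your tokenization rules:

```python
import re
from collections import Counter

def find_lucky(text, num):
    # tokens: letters and apostrophes only, normalized to lower case
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    counts = Counter(tokens)
    lucky = [word for word, count in counts.items()
             if count == num and len(word) == num]
    # the text is 'lucky' only if exactly num such words exist
    return sorted(lucky) if len(lucky) == num else []
```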
Step 9: Test your function:
def test_777():
hamlet = clean_hamlet(get_hamlet())
print(find_lucky(hamlet, 7))
# comment me out before submitting!!
test_777()
See if Hamlet has any lucky numbers: (put this code inside the function test_777):
for n in range(2,10):
print(n, find_lucky(hamlet, n))
Step 10: Submit notebook to Gradescope:
Go to gradescope.com and signup (or login for those who have already signed up). You MUST use your @illinois.edu address. Hit the "Sign Up" button at the top of the page.
The class code is 9YGP8E
Comment out any testing code that exists outside of any function.
Download your notebook as .py file
Rename the file as solution.py
submit that file to gradescope assignment named Lucky777
Final Submission:
Submission Process for Repl.it Credit:
To submit this repl.it lesson, the ONLY code that needs to pass is get_book_id (lesson.py). The testing framework will attempt to download and read it.
def get_book_id():
# return the id of Hamlet stored on your Google Drive
You can tag any question you have with py777 on Piazza
10.25.2019
All rights reserved
Addendum:
As part of working out the details for the 777 assignment (the idea came from reading about finding a reference to the hapax Hebenon), the following fun fact was found: William Shakespeare had a fascination with the number 7. So the question to ponder (after you finish this assignment) is: did Shakespeare hide this fun fact inside of Hamlet, or is it purely coincidental?
Readings and References:
https://books.google.com/books?id=rn18DwAAQBAJ&pg=PT154&lpg=PT154&dq=William+Shakespeare++%22number+7
https://books.google.com/books?id=MwBNel_aX0wC&pg=PA67&lpg=PA67&dq=shakespeare+numerology
http://www.richardking.net/numart-ws.htm
https://www.celebrities-galore.com/celebrities/william-shakespeare/lucky-number/ 😉
CC 05
🐍 Coding Challenges: Finding Characters
Prerequisites:
UP: Regular Expressions
DSP: Jupyter
DSP: Ngrams
One of the goals of the Cliff Note Generator was to generate a list of characters that are in a novel. We can actually use our current skill set and include the techniques discussed in the nGrams lesson to extract (with a good level of accuracy) the main characters of a novel. We will also make some improvements with some of the parsing, cleaning, and preparation of the data. It would be best to read this entire lesson before doing any coding. Also note that this lesson is a bit different in that you will be responsible for more of the code writing. What is being specified is a minimum. I highly recommend that you decompose any complex processes into multiple functions.
Step 0: Start a New Colab Notebook and name it INFO490-FindingCharacters
Step 1: Copy your working solution from the DSP Jupyter Lesson into a new code cell. Test it. The required functions are also given in lesson.py.
Step 2: Copy your working solution from the DSP Ngrams Lesson into a new code cell. Test it. These functions are also given in lesson.py. Note that load_stop_words is already finished.
Step 3: Finding the Characters
With this machinery in place, we are ready to find characters in a novel (I hope you are reading this with great anticipation) using different strategies. Each of the strategies has a function to implement that strategy.
Attempt #1
One attribute (or feature) of the text we are analyzing is that proper nouns are capitalized. Let’s capitalize on this and find all single words in the text whose first character is an uppercase letter and the word is NOT a stop word.
Create and define the function find_characters_v1(text, stoplist, top):
Tokenize and clean the text using the function split_text_into_tokens
Filter the tokens so it has no stop words in it (regardless of case). The parameter stoplist is the array returned from load_stop_words
Create a new list of tokens (keep the order) of words that are capitalized. You can test the first character of the token.
Return the top words as a list of tuples (the first element is the word, the second is the count)
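The steps above might be sketched like this. The split_text_into_tokens here is a simplified stand-in for your own tokenizer from the Jupyter lesson, and Counter.most_common stands in for your top_n:

```python
import re
from collections import Counter

def split_text_into_tokens(text):
    # simplified stand-in -- use your own tokenizer from the Jupyter lesson
    return re.findall(r"[A-Za-z']+", text)

def find_characters_v1(text, stoplist, top):
    tokens = split_text_into_tokens(text)
    stops = set(word.lower() for word in stoplist)
    # keep capitalized tokens that are not stop words (regardless of case)
    caps = [t for t in tokens
            if t[0].isupper() and t.lower() not in stops]
    # most_common stands in for your top_n function
    return Counter(caps).most_common(top)
```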
For Huck Finn, you should get the following (the output is formatted for clarity):
HUCK_ID = "13F68-nA4W-0t3eNuIodh8fxTMZV5Nlpp"
text = read_google_doc(HUCK_ID)
stop = load_stop_words()
v1 = find_characters_v1(text, stop, 15)
print(v1)
You should see:
('Jim', 341),
('Well', 318),
('Tom', 217),
('Huck', 70),
('Yes', 68),
('Oh', 65),
('Miss', 63),
('Mary', 60),
('Aunt', 53),
('Now', 53),
('Sally', 46),
('CHAPTER', 43),
('Sawyer', 43),
('Jane', 43),
('Buck', 38),
Notice that with this very simple method we found 8 characters in the top 15. You found an Aunt and a Miss too. You might be inclined to start fiddling with the stop-words. The ones you could add are 'CHAPTER' and 'Well' -- the interjection -- since we know those words do not provide much content in this context. But as we mentioned in the nGrams lesson, that's a dangerous game, since other novels might use some of these words meaningfully:
Attempt #2
Another feature of characters in a novel is that many of them have two names (Tom Sawyer, Aunt Polly, etc).
Create and define the function find_characters_v2(text, stoplist, top):
Tokenize and clean the text using the function split_text_into_tokens
Convert the list of tokens into a list of bigrams (using your bi_grams method)
Keep only the bigrams in which both words are capitalized (you only need to check the first character of each).
Neither word (in either lower or upper case) should be in stoplist (remember, stoplist could be the empty list)
Return the top bigrams as a list of tuples: The first element is the bigram tuple, the second is the count
Note that we are NOT removing the stopwords from the text (see lesson on ngrams). We are now using the stopwords to make decisions on the text.
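A rough sketch of this strategy, again with stand-ins for your own bi_grams and top_n. Note the index-based bi_grams avoids zip, which the submission guidelines disallow:

```python
import re
from collections import Counter

def bi_grams(tokens):
    # index-based stand-in (no zip) for your nGrams-lesson version
    return [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]

def find_characters_v2(text, stoplist, top):
    tokens = re.findall(r"[A-Za-z']+", text)
    stops = set(word.lower() for word in stoplist)
    # keep bigrams where both words are capitalized and neither is a stop word
    pairs = [(a, b) for a, b in bi_grams(tokens)
             if a[0].isupper() and b[0].isupper()
             and a.lower() not in stops and b.lower() not in stops]
    return Counter(pairs).most_common(top)
```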
With the text of Huckleberry Finn, the following is the output with stopwords being the empty list:
v2 = find_characters_v2(text, [], 15)
print(v2)
(('Mary', 'Jane'), 41),
(('Tom', 'Sawyer'), 40),
(('Aunt', 'Sally'), 39),
(('Miss', 'Watson'), 20),
(('Miss', 'Mary'), 19),
(('Mars', 'Tom'), 16),
(('Huck', 'Finn'), 15),
(('Uncle','Silas'), 15),
(('Aunt', 'Polly'), 11),
(('Judge','Thatcher'), 10),
(('But', 'Tom'), 9),
(('Ben', 'Rogers'), 8),
(('So', 'Tom'), 8),
(('St', 'Louis'), 7),
(('Miss', 'Sophia'), 7)
That found 11 characters in the top 15 bigrams frequency table. This method is pretty good and the method didn't need to consider stop words. What happens if you consider stop words?
Note: in order to match these outputs, use the collections.Counter class. Otherwise, it's possible that your version of sorting will handle those tuples with equal counts differently (unstable sorting).
Titles
Another feature of characters is that many of them have a title (also called honorifics) precede them (Dr. Mr. Mrs. Miss. Ms. Rev. Prof. Sir. etc). We will look for bi-grams that have these titles. However, we will NOT hard code the titles. We will let the data tell us what the 'titles' are.
Here's the process to use to self discover titles:
Let's define a title as a capital letter followed by 1 to 3 lower case letters followed by a period. This is not perfect, but it captures a good majority of them.
create a list named title_tokens of every token in the text that matches the above criteria (hint: use regular expressions)
you now have to remove words that might have ended a sentence with those same title characteristics (e.g. Tom. Bill. Pat. Etc. ). Use the same definition as above but instead of ending with a period, the token must end with whitespace. The idea is that hopefully somewhere in the text the same name will appear but without a period. It’s very likely that you would encounter 'Tom' somewhere in the text without a period, but it’s unlikely that Mr., Mrs., Dr., etc would appear without a period. Let's call this list pseudo_titles.
the set of titles is essentially the first list of tokens, title_tokens with all the tokens in the second set (pseudo_titles) removed. For example, the first list might have 'Dr.', 'Tom.' and 'Mr.' in it and the second set might have 'Tom' and 'Ted' in it. The final title list would include 'Dr' and 'Mr'
Write a function named get_titles that encapsulates the above logic; it should return a list of titles.
Once you have get_titles working, the following should work:
titles = get_titles(text)
print(titles)
You should get 7 computed titles in Huckleberry Finn:
['Col', 'Dr', 'Mr', 'Mrs', 'Otto', 'Rev', 'St']
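For reference, the title-discovery logic above can be sketched in a few lines. The regexes implement the imperfect definition given earlier (capital letter, 1 to 3 lowercase letters, then a period or whitespace), so expect an occasional oddity like 'Otto':

```python
import re

def get_titles(text):
    # candidate titles: Capital + 1-3 lowercase letters + period
    title_tokens = set(re.findall(r'\b([A-Z][a-z]{1,3})\.', text))
    # same shape followed by whitespace: words that merely ended a
    # sentence somewhere else in the text (pseudo titles)
    pseudo_titles = set(re.findall(r'\b([A-Z][a-z]{1,3})\s', text))
    return sorted(title_tokens - pseudo_titles)
```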
Attempt #3
Create and define the function find_characters_v3(text, stoplist, top):
Tokenize and clean the text
Convert the list of tokens into a list of bigrams
Filter out all bigrams such that the first word in the bigram is a title and the second word is capitalized (hint: use the output of get_titles)
the second word (either lower or upper) should not be in stoplist
Return the top bigrams as a list of tuples: The first element is the bigram tuple, the second is the count
v3 = find_characters_v3(text, load_stop_words(), 15)
print(v3)
For Huck Finn, you should get the following:
(('St', 'Louis'), 7),
(('Mr', 'Lothrops'), 6),
(('Mrs', 'Phelps'), 4),
(('St', 'Petersburg'), 3),
(('Dr', 'Robinson'), 3),
(('Mr', 'Garrick'), 2),
(('Mr', 'Kean'), 2),
(('Mr', 'Wilks'), 2),
(('Mr', 'Mark'), 1),
(('Mrs', 'Judith'), 1),
(('Mr', 'Parker'), 1),
(('Dr', "Gunn's"), 1),
(('Col', 'Grangerford'), 1),
(('Dr', 'Armand'), 1),
(('St', 'Jacques'), 1)
Clearly, that yields a lot of good information. Although looking at the counts, none of them are that prominent. We also found a few places as well as people.
Machine Learning?
You may have heard of the NLTK Python library, a popular choice for processing text. We will use both the NLTK and SpaCy NLP libraries to do something similar in another lesson. These libraries include models that were built from large data sets to extract entities (this is called NER, for named entity recognition). These entities include organizations, people, places, and money.
The models that were built essentially learned what features (like capitalization or title words) were important when analyzing text and came up with a model that attempts to do the same thing we did here. However, we hard coded the rules (use bigrams, remove stop words, look for capital letters, etc). This is sometimes referred to as a rule-based system. The analysis is built on manually crafted rules.
In machine learning (sometimes referred to as an automatic system), some of the algorithms essentially learn what features are important (or can learn how much weight to apply to each feature) to build a model and then uses the model to classify tokens as named entities. The biggest issue is that these models could be built with a very different text source (e.g. journal articles or twitter feed) than what you are processing. Also the models themselves require a large set of resources (memory, cpu) that you may not have available. What you built in this lesson is efficient, fast and fairly accurate.
Submission Guidelines:
You will upload your notebook to Gradescope.com for grading.
do NOT use any external Python library other than collections and re (nothing else).
do NOT use the zip function (we will soon though)
try to solve all of these problems by yourself with your own brain and a piece of paper. Surely there are solutions available, but copying will not make you a better programmer. This is not the time to copy or share code.
You should test the code you are writing against sample sentences instead of the full text; once you have it working, then try the full data set
you are free to write as many helper functions as you need. The following functions will be tested:
• get_titles
• find_characters_v[1-3]
each of the find_characters_v functions should use your top_n function
the output of find_characters_v should always be a list of tuples AND match the example output before you 'run tests'
Before you submit:
Be sure to comment out all print statements -- especially those inside of loops.
To speed up the grading process, comment out any testing code/cells
When you download your notebook (as Python code), you must name it solution.py before you upload it.
Replit Credit:
Once you submit, return the URL of your shared Google notebook via the jupyter function in lesson.py
You can tag any question on Piazza with FindingChars.
CC 06
🐍 Coding Challenges:
Harry Potter and the Plotting of Characters (part 1)
Prerequisites:
CC 05: Finding Characters
Named Parameters
NLP
Numpy (part 1)
Matplotlib Introduction
Do not start this lesson until all of the above lessons have been submitted successfully.
This project builds on the finding characters project. You will create a new notebook, but you can copy all of the working code from the previous challenge.
Plotting Characters across Chapters
This lesson will bring together your Numpy skills with what you learned about finding characters from the ngrams lesson to build a visualization like the one below. It shows the main characters of The Adventures of Huckleberry Finn and the cumulative count of their occurrences throughout the novel.
Lesson Assignment
We will build a similar graph for Harry Potter and the Sorcerer's Stone. This is part 1 of that process.
1. Create a New Notebook
Be sure you are logged into your Google Account using your @gmail email. Go to https://colab.research.google.com and create a new Python 3 notebook. Name it INFO490HP-P1. Be sure to save it into your personal drive space.
2. Access Remote Resource
In lesson.py is the Google Drive ID for the text for Harry Potter and the Sorcerer's Stone.
Write a function named get_harry_potter() that returns the text of that remote resource. You should use good coding conventions of writing and using single task helper functions (note that you have already done this in previous lessons). The following should work:
hp = get_harry_potter()
print(len(hp))
You must use valid Python code to gain access to remote resources. You cannot use any Jupyter-specific code (e.g. wget, curl, etc.)
3. Clean Data
Write a function named clean_hp that does the following to its incoming string parameter:
remove all header information up until the title of the book
remove all leading and trailing whitespace
you can keep 'THE END' as well as all the page numbers
return the cleaned text
hp = clean_hp(get_harry_potter())
print(len(hp))
4. Find Characters
Copy your working solution for load_stop_words, bi_grams, top_n, find_characters_v1 and find_characters_v2 as well as any helper functions they depend upon.
Make the following changes:
def load_stop_words
use spacy to load its stopwords
add a named parameter (called add_pronouns) to the function with a default value of False. If add_pronouns is True, add the pronouns (found in lesson.py) to the returned list
def bi_grams
use the nltk ngrams function inside your bi_grams function to turn the incoming list of tokens into a list of tuples
remove your original implementation (or rename it to bi_grams_v1)
def split_text_into_tokens
keep the same solution (using regular expressions to tokenize)
augment the normalization step to strip off the possessive of any token that ends with 's (e.g. Harry's becomes Harry)
def find_characters_v1
change the parameter stopwords to have a default value of an empty list
change the parameter top to have a default value of 15
def find_characters_v2
change the parameter stopwords to have a default value of an empty list
change the parameter top to have a default value of 15
return a two-element tuple where the first element is the combined elements of the bigram. So instead of returning, for example,
(('Uncle', 'Vernon'), 97) you would return
('Uncle Vernon', 97).
The following code should now work:
hp = clean_hp(get_harry_potter())
stop1 = load_stop_words(True)
stop2 = load_stop_words()
print(find_characters_v1(hp, stop1, 10))
print(find_characters_v2(hp, stop2, 10))
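The possessive-stripping change to split_text_into_tokens can be as small as this helper (a sketch; it removes a trailing 's from any token, which is what the step asks for):

```python
import re

def strip_possessive(token):
    # drop a trailing 's so that "Harry's" normalizes to "Harry"
    return re.sub(r"'s$", '', token)
```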
5. NLP for four
Write the function find_characters_nlp that has two parameters: the text to process, and top that has a default value of 15. It does the following:
use spacy's Named Entity recognizer to pull out all people.
return the top list of characters found (just like v1 and v2)
this is the only place you should be using spacy to tokenize text
You should now run the following code and carefully analyze the results:
print(find_characters_nlp(hp))
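A sketch of find_characters_nlp, assuming en_core_web_sm is the small English model from the NLP lesson. The counting is split into a helper so it can be reasoned about without loading the model:

```python
from collections import Counter

def top_people(person_names, top=15):
    # count PERSON entity strings and keep the most frequent
    return Counter(person_names).most_common(top)

def find_characters_nlp(text, top=15):
    # ASSUMPTION: the same small English model used in the NLP lesson
    import spacy
    nlp = spacy.load('en_core_web_sm')
    doc = nlp(text)
    people = [ent.text for ent in doc.ents if ent.label_ == 'PERSON']
    return top_people(people, top)
```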
A few questions for which you should remember the answers:
What did you notice about the running time for v1, v2 and the nlp version?
Which version found Hermione?
Which version found Voldemort?
How much do you have to increase the top parameter to find them?
You can use the time module for simple timing if you want to know the exact time spent on each algorithm:
import time
start = time.time()
print("hello")
end = time.time()
print(end - start)
I think we can agree that, without any human intervention (other than writing and running code), we could build an algorithm that uses the results from above and decides that the following characters are central to Harry Potter and the Sorcerer's Stone:
Harry
Ron
Hagrid
Hermione
Note that we would probably miss Voldemort even though that character is important to the novel (we might miss that question on our 9th grade English test if we relied on our code to "read" the book for us). Can you think of any analysis that might bring Voldemort to the forefront? Post any ideas you have on Piazza (try to keep all ideas in a single threaded post). This is a conversation starter, not a requirement.
6. Data By Chapter
Looking at the graph that we need to build for this lesson, it's clear that we are going to need to get occurrence counts for the four characters for each chapter. Ideally, our data would look like the following (numbers are made up):
harry_by_chapter = [20, 79, 68, ...] # 17 numbers
ron_by_chapter = [ 0, 73, 14, ...] # 17 numbers
hagrid_by_chapter = [14, 0, 0, ...] # 17 numbers
Note that each column is the data for each chapter.
Write a function named split_into_chapters that uses a regular expression to split the parameter text into an array of chapters
def split_into_chapters
return an array whose elements are the text for each chapter
each element is trimmed of leading and trailing whitespace
each element can start with the title of the chapter or the first word of the chapter (ideally it would be the latter, but the regex is a bit more complicated)
Note: if you had to split a novel where no one pattern uniquely captures each of the chapters, you would need to match the opening words of each chapter, something like the following (example shows only the first 2 chapters):
def split_into_chapters(text):
# this is not the way you should solve this
m1 = re.search(r"^YOU don't know about me", text, re.M)
m2 = re.search(r"^WE went tiptoeing along", text, re.M)
m3 = re.search(r"^WELL, I got a good going", text, re.M)
chp1 = text[m1.span()[0]: m2.span()[0]]
chp2 = text[m2.span()[0]: m3.span()[0]]
return [chp1, chp2]
The ^ anchors (with re.M) insist that you are uniquely capturing the correct text. Clearly this is a last-resort solution.
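For the Harry Potter text, a pattern-based split might look like the sketch below. The 'CHAPTER' heading pattern is an assumption; inspect your cleaned text and adjust the regex to whatever actually separates the chapters:

```python
import re

def split_into_chapters(text):
    # ASSUMPTION: chapters begin with a heading line like 'CHAPTER ONE';
    # adjust the pattern to match your actual cleaned text
    parts = re.split(r'^CHAPTER\s+\w+\s*$', text, flags=re.M)
    # drop anything before the first chapter heading, trim whitespace
    return [part.strip() for part in parts[1:]]
```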
7. Character Counts.
Use Numpy to easily get the counts for the four main characters for each chapter. Create a new cell (we won't use a function for now) to create an array that has a count for the total number of occurrences the name appeared in each chapter.
The Numpy lesson has the function to get the counts from a string (and an example). As mentioned previously, after you are done your arrays should look like this (but not Python lists):
harry_by_chapter = [20, 79, 68, ...] # 17 numbers
ron_by_chapter = [ 0, 73, 14, ...] # 17 numbers
hagrid_by_chapter = [14, 0, 0, ...] # 17 numbers
Note:
If a character is referenced using multiple names or nicknames (something our analysis has not done), you could combine them:
harry = np.array([20, 79, 68])
potter = np.array([21, 2, 4])
harry_potter = np.array([5, 2, 0])
hp_counts = harry + potter - harry_potter
# [36, 79, 72]
This adds all 'Harry' and 'Potter' references together but adjusts for the double counting when the full reference to "Harry Potter" is made. Think of properly counting characters in the following sentences: "Harry?" "Is that you Potter?" "I'm not kidding Harry Potter, I need to see you NOW." There are 3 references to the same character (not counting pronouns). Do not do this, but it is something to keep in mind.
Finish this implementation
def get_character_counts_v1(chapters):
harry = ...
ron = ...
hagrid = ...
hermione = ...
return np.array([harry, ron, hagrid, hermione])
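A possible sketch of that implementation. Here str.count is a simple stand-in for the counting function from the Numpy lesson; note it will also match names embedded inside longer words:

```python
import numpy as np

def count_in_chapters(chapters, name):
    # occurrences of `name` in each chapter, as a numpy array;
    # str.count also matches names inside longer words (e.g. 'Ronan')
    return np.array([chapter.count(name) for chapter in chapters])

def get_character_counts_v1(chapters):
    names = ['Harry', 'Ron', 'Hagrid', 'Hermione']
    # one row per character, one column per chapter
    return np.array([count_in_chapters(chapters, name) for name in names])
```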
8. Plotting.
Using the same set up as in the Matplotlib lesson, plot each character:
def simple_graph_v1(plots):
fig = plt.figure()
subplot = fig.add_subplot(1,1,1)
subplot.plot(plots[0])
subplot.plot(plots[1])
subplot.plot(plots[2])
subplot.plot(plots[3])
# this is important for testing
return fig
Note that we are now calling the plot method on the returned subplot (a.k.a axes object) from the add_subplot method. In a previous lesson we called subplot.bar(x_pos, counts) to generate a bar graph. The plot method generates a line graph.
Once that is done the following should work (be sure to test this):
def pipeline_v1():
hp = clean_hp(get_harry_potter())
chapters = split_into_chapters(hp)
plots = get_character_counts_v1(chapters)
fig = simple_graph_v1(plots)
return fig
You should see something like the following:
This doesn't really look like the graph for which we are aiming. But it's a good start. We can see the counts for the four main characters of the novel and we found the characters without reading a single word!! (maybe we shouldn't celebrate this?)
Part 2 of this assignment will use Numpy and Matplotlib to do some data wrangling, fix the visualization, make the pipeline generic, and add some details.
Submission Credit
Notebook Prep:
Before submitting your notebook, be sure to comment out any print statements that print out a significant amount of text. Also, comment out any calls to find_characters_* (this will speed up the autograder):
#print(find_characters_v1(hp, stopwords, 10))
#print(find_characters_v2(hp, stopwords, 10))
#print(find_characters_nlp(hp, 10))
1. Save your notebook as a .py file, upload that file to gradescope for grading (be sure you name the saved file solution.py). The gradescope assignment name is HarryPotter-Part1.
Use the same spacy english model used in the NLP lesson.
2. Be sure to share your notebook and have the function jupyter (lesson.py) return the full url. Once that is done, you can hit submit.
You can tag any question with HPP1 on Piazza