CC 03
The game rock, paper, scissors is a classic tool used to make important decisions between two friends (read https://www.wikihow.com/Play-Rock,-Paper,-Scissors).
1. Create a function named winner_RPS. The function winner_RPS takes two parameters, p1 and p2.
It returns p1 (the parameter) if p1 wins; p2 if p2 wins; and None if there is no winner.
Try to write your solution such that winner_RPS has a single return statement.
This code needs to be in the lesson.py module.
2. p1 and p2 can only be one of the following values:
"rock", "paper", "scissors"
3. Test your code. You can write your own tests and not rely on the testing framework. In main.py write a test function named test_RPS to verify that your code is working. The function doesn't have any parameters and should test your winner_RPS at least a few times (more than twice!). Try to figure out how to return True if it passes all your tests (False otherwise).
Use the import statement in main.py (i.e. import lesson)
For example the following could be a test:
import lesson
t1 = 'rock'
t2 = 'paper'
if lesson.winner_RPS(t1, t2) != t2:
print('Test FAIL')
You can also use random.choice to model the selection part of the gameplay:
values = "rock,paper,scissors".split(',')
p1 = random.choice(values)
print(p1)
How would you test all possible cases? Is that even possible?
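If you get stuck structuring test_RPS, here is one possible shape for an exhaustive test. The winner_RPS below is only a stand-in with the expected behavior; in your own main.py you would `import lesson` and call lesson.winner_RPS instead:

```python
import itertools

# Stand-in with the expected behavior; in main.py you would
# `import lesson` and call lesson.winner_RPS instead.
BEATS = {'rock': 'scissors', 'paper': 'rock', 'scissors': 'paper'}

def winner_RPS(p1, p2):
    # single return statement, as the challenge suggests
    return p1 if BEATS[p1] == p2 else (p2 if BEATS[p2] == p1 else None)

def test_RPS():
    # only 9 ordered pairs exist, so it is possible to test every one
    values = ['rock', 'paper', 'scissors']
    for p1, p2 in itertools.product(values, repeat=2):
        expected = p1 if BEATS[p1] == p2 else (p2 if BEATS[p2] == p1 else None)
        if winner_RPS(p1, p2) != expected:
            return False
    return True

print(test_RPS())  # → True
```

With only 9 ordered pairs, testing every case is entirely possible.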
CC 04
🐍 Coding Challenges: Lucky 777
Prerequisites:
Python Regular Expressions, Parts 1,2,3
Python Remote I/O
DSP: Jupyter Lesson
Hapax: one more thing...
When a word occurs only once in a body of work or an entire written record, it's called a hapax. However, there are disagreements on how narrow the set of works can be. Usually, a hapax can only appear once in an author's entire collection rather than just within a specific piece.
For example, Hamlet has a famous hapax, 'Hebenon', a poison. It is said that this is Shakespeare's only use of the word. However, if you look for hapaxes (aka hapax legomena) in a single piece of text, there are many: Hamlet has over 2700 words that occur only once. Let's extend this fun fact to find a unique set of words within a body of text that do share some very specific attributes.
Let's classify all words in a body of text by how often they occur. A body of text is a lucky winner if it contains 7 words each that occur 7 times and each word is 7 characters long. For this project you will create a notebook, import some text and determine if the text is a 'winner'.
However you will write your solution to be generic so that any number could be passed in (e.g. 4 letter words that only occur 4 times and there are a total of 4 of them).
All code will be in your Colaboratory Notebook and it will be graded using Gradescope.
Step 0: Starting Point, New Notebook
https://colab.research.google.com/notebooks/welcome.ipynb
Be sure you are logged into Google using your @gmail address so your notebook will be saved to your drive.
Open a new colab notebook via File->New Python 3 notebook
Name it INFO490-777
Your notebook will be saved to your google drive in a special folder:
Step 1: Paste in Starter Code
In lesson.py there is some starter code for this project. Put this code into a new code cell in your notebook.
Step 2: Make Hamlet's text available via Google Drive
This step is a bit superfluous in that we're moving data from Project Gutenberg to your Google Drive and then accessing Hamlet from there. Why? Because it's useful to know the steps involved to make data accessible via Google. You can also use this method to access any data (csv files, images, etc) that are located in your personal drive.
Many versions
A previous lesson also used a specific text of Hamlet (RemoteIO); however, there are many editions/versions of this famous play (you can even take classes that study the different versions). On Project Gutenberg you can see different versions:
http://www.gutenberg.org/ebooks/search/?query=hamlet
For this project we will use this version.
Please read the Director's and Scanner's note to learn some of the details of this specific version of Hamlet.
Here's the easiest workflow (you are free to use any other method as well) to move that document into your Google Drive space:
Open a new tab in your browser, go to http://www.gutenberg.org/ebooks/2265
Save the UTF-8 version to your computer:
Save to your computer (name it hamlet.txt)
Go to your Google Drive account and select the New Button and the 'File upload' option to upload hamlet.txt from your computer.
Get the share link.
The main thing is you need the ID of the document. For example,
https://drive.google.com/open?id=19pOCDIXak04cTs7TLiEA3TKUCESU10ZM
Note that this is NOT the url you can use to fetch via the remote I/O in Python. It is a 'browser' friendly URL.
Step 3: Define the following function which returns the ID of Hamlet on your Google Drive:
def get_book_id():
# replace this with your resource Id
return '19pOCDIXak04cTs7TLiEA3TKUCESU10ZM'
Step 4: finish the implementation for build_google_drive_url() (see lesson.py for the code)
This function builds the url to fetch a document saved on Google Drive. You will then add to the baseurl the request parameters properly encoded. If this sounds difficult, go back to the RemoteIO lesson -- the answer is there.
You can use the current implementation (which returns the Project Gutenberg url) for partial credit.
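As a sketch of what this step is asking for (the base URL and parameter names here are assumptions; verify them against the RemoteIO lesson), the function might look like:

```python
from urllib.parse import urlencode

def build_google_drive_url(doc_id):
    # ASSUMPTION: this base url + query layout is one known way to
    # fetch a shared Drive file; confirm it against the RemoteIO lesson
    base = 'https://drive.google.com/uc'
    params = urlencode({'export': 'download', 'id': doc_id})
    return base + '?' + params
```

Note that urlencode handles the request-parameter encoding for you, which is the "properly encoded" part of the step.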
Step 5: TEST it
test your solution by downloading and reading your novel:
def get_hamlet():
g_id = get_book_id()
url = build_google_drive_url(g_id)
return read_remote(url)
hamlet = get_hamlet()
print(hamlet[0:100])
Step 6: Take a break to find the answer to life.
It's common? knowledge that the answer to life is 42: https://www.independent.co.uk/life-style/history/42-the-answer-to-life-the-universe-and-everything-2205734.html.
Shakespeare must have known this as well. Run the following code (get_hamlet needs to be working):
ANSWER_TO_LIFE = 42
def answer_to_life():
text = get_hamlet()
idx = text.find('To be,')
ans = text[idx:idx+ANSWER_TO_LIFE]
return ans
print(answer_to_life())
Step 7: Implement the following:
def clean_hamlet(text):
return text
Remove everything before the start of the play (i.e. the play starts with the line: The Tragedie of Hamlet)
Remove everything after the end of the play (i.e. the play ends after the final line (hint: the final line starts with FINIS)
Hint: use the search method for regular expressions (see Regular Expressions Part 3)
Remove any leading or trailing whitespace
Be sure to test your code before moving on
Do not hard code indices (e.g. return text[2345:4509])
If you find yourself using \n\r\t, you're on the wrong path. The auto-grader uses the same version of Hamlet, but the whitespace is not the same as what's on Project Gutenberg -- and this was not done on purpose, it's the result of what happens when you download/upload text documents between different architectures.
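A minimal sketch of clean_hamlet using re.search. It assumes the start marker and the FINIS line each appear where expected; in the real file the title line may occur more than once, so test carefully:

```python
import re

def clean_hamlet(text):
    # locate the line that starts the play and the final FINIS line,
    # then slice between them -- no hard-coded indices
    start = re.search(r'The Tragedie of Hamlet', text)
    end = re.search(r'FINIS.*', text)
    return text[start.start():end.end()].strip()
```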
Step 8: Implement the following:
def find_lucky(text, num):
lucky = []
return sorted(lucky)
The function find_lucky parses/tokenizes text (see rules below) and returns a sorted list of words if the text is 'lucky' (see above definition). Otherwise, return the empty list.
The following rules apply to tokenize and classify words:
Use the re module to tokenize the text
a token is a word that contains only letters and/or apostrophes (e.g. who, do's, wrong'd).
normalize the token to lower case. For this lesson you can keep quoted words (it won't affect the answer) but ideally, you would remove them (e.g. 'happy' would become happy).
For example, if the parameter num is 7 then it returns an array of words ONLY if all the following conditions are true:
each word has 7 characters
each word occurs 7 times in the text
there are 7 of these words
For example:
text = """
A boy, a cat, a rat and a dog were friends.
But the cat ate the rat. The dog ate the cat.
The boy? The boy and dog were friends.
"""
print(find_lucky(text, 3))
Should return 3 words ('boy', 'cat', 'dog')
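A sketch of one way to implement find_lucky with re and collections.Counter. The token pattern is a simplification (it keeps quoted words, as the rules above permit); adjust it to match your tokenization rules:

```python
import re
from collections import Counter

def find_lucky(text, num):
    # tokens: letters and apostrophes only, normalized to lower case
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    counts = Counter(tokens)
    lucky = [word for word, count in counts.items()
             if count == num and len(word) == num]
    # the text is 'lucky' only if exactly num such words exist
    return sorted(lucky) if len(lucky) == num else []
```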
Step 9: Test your function:
def test_777():
hamlet = clean_hamlet(get_hamlet())
print(find_lucky(hamlet, 7))
# comment me out before submitting!!
test_777()
See if Hamlet has any lucky numbers: (put this code inside the function test_777):
for n in range(2,10):
print(n, find_lucky(hamlet, n))
Step 10: Submit notebook to Gradescope:
Go to gradescope.com and signup (or login for those who have already signed up). You MUST use your @illinois.edu address. Hit the "Sign Up" button at the top of the page.
The class code is 9YGP8E
Comment out any testing code that exists outside of any function.
Download your notebook as .py file
Rename the file as solution.py
submit that file to gradescope assignment named Lucky777
Final Submission:
Submission Process for Repl.it Credit:
To submit this repl.it lesson, the ONLY code that needs to pass is get_book_id (lesson.py). The testing framework will attempt to download and read it.
def get_book_id():
# return the id of Hamlet stored on your Google Drive
You can tag any question you have with py777 on Piazza
10.25.2019
All rights reserved
Addendum:
As part of working out the details for the 777 assignment (the idea came from reading about finding a reference to the hapax Hebenon), the following fun fact was found: William Shakespeare had a fascination with the number 7. So the question to ponder (after you finish this assignment) is: did Shakespeare hide this fun fact inside of Hamlet, or is it purely coincidental?
Readings and References:
https://books.google.com/books?id=rn18DwAAQBAJ&pg=PT154&lpg=PT154&dq=William+Shakespeare++%22number+7
https://books.google.com/books?id=MwBNel_aX0wC&pg=PA67&lpg=PA67&dq=shakespeare+numerology
http://www.richardking.net/numart-ws.htm
https://www.celebrities-galore.com/celebrities/william-shakespeare/lucky-number/ 😉
CC 05
🐍 Coding Challenges: Finding Characters
Prerequisites:
UP: Regular Expressions
DSP: Jupyter
DSP: Ngrams
One of the goals of the Cliff Note Generator was to generate a list of characters that are in a novel. We can actually use our current skill set and include the techniques discussed in the nGrams lesson to extract (with a good level of accuracy) the main characters of a novel. We will also make some improvements with some of the parsing, cleaning, and preparation of the data. It would be best to read this entire lesson before doing any coding. Also note that this lesson is a bit different in that you will be responsible for more of the code writing. What is being specified is a minimum. I highly recommend that you decompose any complex processes into multiple functions.
Step 0: Start a New Colab Notebook and name it INFO490-FindingCharacters
Step 1: Copy your working solution from the DSP Jupyter Lesson into a new code cell. Test it. The required functions are also given in lesson.py.
Step 2: Copy your working solution from the DSP Ngrams Lesson into a new code cell. Test it. These functions are also given in lesson.py. Note that load_stop_words is already finished.
Step 3: Finding the Characters
With this machinery in place, we are ready to find characters in a novel (I hope you are reading this with great anticipation) using different strategies. Each of the strategies has a function to implement that strategy.
Attempt #1
One attribute (or feature) of the text we are analyzing is that proper nouns are capitalized. Let’s capitalize on this and find all single words in the text whose first character is an uppercase letter and the word is NOT a stop word.
Create and define the function find_characters_v1(text, stoplist, top):
Tokenize and clean the text using the function split_text_into_tokens
Filter the tokens so it has no stop words in it (regardless of case). The parameter stoplist is the array returned from load_stop_words
Create a new list of tokens (keep the order) of words that are capitalized. You can test the first character of the token.
Return the top words as a list of tuples (the first element is the word, the second is the count)
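The steps above might be sketched like this. The split_text_into_tokens here is a simplified stand-in for your own tokenizer from the Jupyter lesson, and Counter.most_common stands in for your top_n:

```python
import re
from collections import Counter

def split_text_into_tokens(text):
    # simplified stand-in -- use your own tokenizer from the Jupyter lesson
    return re.findall(r"[A-Za-z']+", text)

def find_characters_v1(text, stoplist, top):
    tokens = split_text_into_tokens(text)
    stops = set(word.lower() for word in stoplist)
    # keep capitalized tokens that are not stop words (regardless of case)
    caps = [t for t in tokens
            if t[0].isupper() and t.lower() not in stops]
    # most_common stands in for your top_n function
    return Counter(caps).most_common(top)
```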
For Huck Finn, you should get the following (the output is formatted for clarity):
HUCK_ID = "13F68-nA4W-0t3eNuIodh8fxTMZV5Nlpp"
text = read_google_doc(HUCK_ID)
stop = load_stop_words()
v1 = find_characters_v1(text, stop, 15)
print(v1)
You should see:
('Jim', 341),
('Well', 318),
('Tom', 217),
('Huck', 70),
('Yes', 68),
('Oh', 65),
('Miss', 63),
('Mary', 60),
('Aunt', 53),
('Now', 53),
('Sally', 46),
('CHAPTER', 43),
('Sawyer', 43),
('Jane', 43),
('Buck', 38),
Notice that with this very simple method we found 8 characters in the top 15. You found an Aunt and a Miss too. You might be inclined to start fiddling with the stop-words. The ones you could add are 'CHAPTER' and 'Well' -- the interjection -- since we know those words do not provide much content in this context. But as we mentioned in the nGrams lesson, that's a dangerous game, since other novels might use some of these words meaningfully:
Attempt #2
Another feature of characters in a novel is that many of them have two names (Tom Sawyer, Aunt Polly, etc).
Create and define the function find_characters_v2(text, stoplist, top):
Tokenize and clean the text using the function split_text_into_tokens
Convert the list of tokens into a list of bigrams (using your bi_grams method)
Keep only the bigrams in which both words are capitalized (you only need to check the first character of each).
Neither word (in either lower or upper case) should be in stoplist (remember, stoplist could be the empty list)
Return the top bigrams as a list of tuples: The first element is the bigram tuple, the second is the count
Note that we are NOT removing the stopwords from the text (see lesson on ngrams). We are now using the stopwords to make decisions on the text.
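A rough sketch of this strategy, again with stand-ins for your own bi_grams and top_n. Note the index-based bi_grams avoids zip, which the submission guidelines disallow:

```python
import re
from collections import Counter

def bi_grams(tokens):
    # index-based stand-in (no zip) for your nGrams-lesson version
    return [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]

def find_characters_v2(text, stoplist, top):
    tokens = re.findall(r"[A-Za-z']+", text)
    stops = set(word.lower() for word in stoplist)
    # keep bigrams where both words are capitalized and neither is a stop word
    pairs = [(a, b) for a, b in bi_grams(tokens)
             if a[0].isupper() and b[0].isupper()
             and a.lower() not in stops and b.lower() not in stops]
    return Counter(pairs).most_common(top)
```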
With the text of Huckleberry Finn, the following is the output with stopwords being the empty list:
v2 = find_characters_v2(text, [], 15)
print(v2)
(('Mary', 'Jane'), 41),
(('Tom', 'Sawyer'), 40),
(('Aunt', 'Sally'), 39),
(('Miss', 'Watson'), 20),
(('Miss', 'Mary'), 19),
(('Mars', 'Tom'), 16),
(('Huck', 'Finn'), 15),
(('Uncle','Silas'), 15),
(('Aunt', 'Polly'), 11),
(('Judge','Thatcher'), 10),
(('But', 'Tom'), 9),
(('Ben', 'Rogers'), 8),
(('So', 'Tom'), 8),
(('St', 'Louis'), 7),
(('Miss', 'Sophia'), 7)
That found 11 characters in the top 15 bigrams frequency table. This method is pretty good and the method didn't need to consider stop words. What happens if you consider stop words?
Note: in order to match these outputs, use the collections.Counter class. Otherwise, it's possible that your version of sorting will handle those tuples with equal counts differently (unstable sorting).
Titles
Another feature of characters is that many of them have a title (also called honorifics) precede them (Dr. Mr. Mrs. Miss. Ms. Rev. Prof. Sir. etc). We will look for bi-grams that have these titles. However, we will NOT hard code the titles. We will let the data tell us what the 'titles' are.
Here's the process to use to self discover titles:
Let's define a title as a capital letter followed by 1 to 3 lower case letters followed by a period. This is not perfect, but it captures a good majority of them.
create a list named title_tokens of every token in the text that matches the above criteria (hint: use regular expressions)
you now have to remove words that might have ended a sentence with those same title characteristics (e.g. Tom. Bill. Pat. Etc. ). Use the same definition as above but instead of ending with a period, the token must end with whitespace. The idea is that hopefully somewhere in the text the same name will appear but without a period. It’s very likely that you would encounter 'Tom' somewhere in the text without a period, but it’s unlikely that Mr., Mrs., Dr., etc would appear without a period. Let's call this list pseudo_titles.
the set of titles is essentially the first list of tokens, title_tokens with all the tokens in the second set (pseudo_titles) removed. For example, the first list might have 'Dr.', 'Tom.' and 'Mr.' in it and the second set might have 'Tom' and 'Ted' in it. The final title list would include 'Dr' and 'Mr'
Write a function named get_titles that encapsulates the above logic; it should return a list of titles.
Once you have get_titles working, the following should work:
titles = get_titles(text)
print(titles)
You should get 7 computed titles in Huckleberry Finn:
['Col', 'Dr', 'Mr', 'Mrs', 'Otto', 'Rev', 'St']
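For reference, the title-discovery logic above can be sketched in a few lines. The regexes implement the imperfect definition given earlier (capital letter, 1 to 3 lowercase letters, then a period or whitespace), so expect an occasional oddity like 'Otto':

```python
import re

def get_titles(text):
    # candidate titles: Capital + 1-3 lowercase letters + period
    title_tokens = set(re.findall(r'\b([A-Z][a-z]{1,3})\.', text))
    # same shape followed by whitespace: words that merely ended a
    # sentence somewhere else in the text (pseudo titles)
    pseudo_titles = set(re.findall(r'\b([A-Z][a-z]{1,3})\s', text))
    return sorted(title_tokens - pseudo_titles)
```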
Attempt #3
Create and define the function find_characters_v3(text, stoplist, top):
Tokenize and clean the text
Convert the list of tokens into a list of bigrams
Filter out all bigrams such that the first word in the bigram is a title and the second word is capitalized (hint: use the output of get_titles)
the second word (either lower or upper) should not be in stoplist
Return the top bigrams as a list of tuples: The first element is the bigram tuple, the second is the count
v3 = find_characters_v3(text, load_stop_words(), 15)
print(v3)
For Huck Finn, you should get the following:
(('St', 'Louis'), 7),
(('Mr', 'Lothrops'), 6),
(('Mrs', 'Phelps'), 4),
(('St', 'Petersburg'), 3),
(('Dr', 'Robinson'), 3),
(('Mr', 'Garrick'), 2),
(('Mr', 'Kean'), 2),
(('Mr', 'Wilks'), 2),
(('Mr', 'Mark'), 1),
(('Mrs', 'Judith'), 1),
(('Mr', 'Parker'), 1),
(('Dr', "Gunn's"), 1),
(('Col', 'Grangerford'), 1),
(('Dr', 'Armand'), 1),
(('St', 'Jacques'), 1)
Clearly, that yields a lot of good information. Although looking at the counts, none of them are that prominent. We also found a few places as well as people.
Machine Learning?
You may have heard of the NLTK Python library, a popular choice for processing text. We will use both the NLTK and SpaCy NLP libraries to do something similar in another lesson. These libraries include models that were built from large data sets to extract entities (this is called NER, for named entity recognition). These entities include organizations, people, places, and money.
The models that were built essentially learned what features (like capitalization or title words) were important when analyzing text and came up with a model that attempts to do the same thing we did here. However, we hard coded the rules (use bigrams, remove stop words, look for capital letters, etc). This is sometimes referred to as a rule-based system. The analysis is built on manually crafted rules.
In machine learning (sometimes referred to as an automatic system), some of the algorithms essentially learn what features are important (or can learn how much weight to apply to each feature) to build a model and then uses the model to classify tokens as named entities. The biggest issue is that these models could be built with a very different text source (e.g. journal articles or twitter feed) than what you are processing. Also the models themselves require a large set of resources (memory, cpu) that you may not have available. What you built in this lesson is efficient, fast and fairly accurate.
Submission Guidelines:
You will upload your notebook to Gradescope.com for grading.
do NOT use any external Python library other than collections and re (nothing else).
do NOT use the zip function (we will soon though)
try to solve all of these problems by yourself with your own brain and a piece of paper. Surely there are solutions available, but copying will not make you a better programmer. This is not the time to copy or share code.
You should test the code you are writing against sample sentences instead of the full text; once you have it working, then try the full data set
you are free to write as many helper functions as you need. The following functions will be tested:
• get_titles
• find_characters_v[1-3]
each of the find_characters_v functions should use your top_n function
the output of find_characters_v should always be a list of tuples AND match the example output before you 'run tests'
Before you submit:
Be sure to comment out all print statements -- especially those inside of loops.
To speed up the grading process, comment out any testing code/cells
When you download your notebook (as Python code), you must name it solution.py before you upload it.
Replit Credit:
Once you submit, return the URL of your shared Google notebook via the jupyter function in lesson.py
You can tag any question on Piazza with FindingChars.
CC 06
🐍 Coding Challenges:
Harry Potter and the Plotting of Characters (part 1)
Prerequisites:
CC 05: Finding Characters
Named Parameters
NLP
Numpy (part 1)
Matplotlib Introduction
Do not start this lesson until all of the above lessons have been submitted successfully.
This project builds on the finding characters project. You will create a new notebook, but you can copy all of the working code from the previous challenge.
Plotting Characters across Chapters
This lesson will bring together your Numpy skills with what you learned about finding characters from the ngrams lesson to build a visualization like the one below. It shows the main characters of The Adventures of Huckleberry Finn and the cumulative count of their occurrences throughout the novel.
Lesson Assignment
We will build a similar graph for Harry Potter and the Sorcerer's Stone. This is part 1 of that process.
1. Create a New Notebook
Be sure you are logged into your Google Account using your @gmail email. Go to https://colab.research.google.com and create a new Python 3 notebook. Name it INFO490HP-P1. Be sure to save it into your personal drive space.
2. Access Remote Resource
In lesson.py is the Google Drive ID for the text for Harry Potter and the Sorcerer's Stone.
Write a function named get_harry_potter() that returns the text of that remote resource. You should use good coding conventions of writing and using single task helper functions (note that you have already done this in previous lessons). The following should work:
hp = get_harry_potter()
print(len(hp))
You must use valid Python code to gain access to remote resources. You cannot use any Jupyter-specific code (e.g. wget, curl, etc.)
3. Clean Data
Write a function named clean_hp that does the following to its incoming string parameter:
remove all header information up until the title of the book
remove all leading and trailing whitespace
you can keep 'THE END' as well as all the page numbers
return the cleaned text
hp = clean_hp(get_harry_potter())
print(len(hp))
4. Find Characters
Copy your working solution for load_stop_words, bi_grams, top_n, find_characters_v1 and find_characters_v2 as well as any helper functions they depend upon.
Make the following changes:
def load_stop_words
use spacy to load its stopwords
add a named parameter (called add_pronouns) to the function with a default value of False. If add_pronouns is True, add the pronouns (found in lesson.py) to the returned list
def bi_grams
use the nltk ngrams function inside your bi_grams function to turn the incoming list of tokens into a list of tuples
remove your original implementation (or rename it to bi_grams_v1)
def split_text_into_tokens
keep the same solution (using regular expressions to tokenize)
augment the normalization step to strip off the possessive of any token that ends with 's (e.g. Harry's becomes Harry)
def find_characters_v1
change the parameter stopwords to have a default value of an empty list
change the parameter top to have a default value of 15
def find_characters_v2
change the parameter stopwords to have a default value of an empty list
change the parameter top to have a default value of 15
return a two-element tuple where the first element is the combined elements of the bigram. So instead of returning, for example,
(('Uncle', 'Vernon'), 97) you would return
('Uncle Vernon', 97).
The following code should now work:
hp = clean_hp(get_harry_potter())
stop1 = load_stop_words(True)
stop2 = load_stop_words()
print(find_characters_v1(hp, stop1, 10))
print(find_characters_v2(hp, stop2, 10))
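The possessive-stripping change to split_text_into_tokens can be as small as this helper (a sketch; it removes a trailing 's from any token, which is what the step asks for):

```python
import re

def strip_possessive(token):
    # drop a trailing 's so that "Harry's" normalizes to "Harry"
    return re.sub(r"'s$", '', token)
```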
5. NLP for four
Write the function find_characters_nlp that has two parameters: the text to process, and top that has a default value of 15. It does the following:
use spacy's Named Entity recognizer to pull out all people.
return the top list of characters found (just like v1 and v2)
this is the only place you should be using spacy to tokenize text
You should now run the following code and carefully analyze the results:
print(find_characters_nlp(hp))
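A sketch of find_characters_nlp, assuming en_core_web_sm is the small English model from the NLP lesson. The counting is split into a helper so it can be reasoned about without loading the model:

```python
from collections import Counter

def top_people(person_names, top=15):
    # count PERSON entity strings and keep the most frequent
    return Counter(person_names).most_common(top)

def find_characters_nlp(text, top=15):
    # ASSUMPTION: the same small English model used in the NLP lesson
    import spacy
    nlp = spacy.load('en_core_web_sm')
    doc = nlp(text)
    people = [ent.text for ent in doc.ents if ent.label_ == 'PERSON']
    return top_people(people, top)
```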
A few questions for which you should remember the answers:
What did you notice about the running time for v1, v2 and the nlp version?
Which version found Hermione?
Which version found Voldemort?
How much do you have to increase the top parameter to find them?
You can use the time module for simple timing if you want to know the exact time spent on each algorithm:
import time
start = time.time()
print("hello")
end = time.time()
print(end - start)
I think we can agree that, without any human intervention (other than writing and running code), we could build an algorithm that uses the results from above and decides that the following characters are central to Harry Potter and the Sorcerer's Stone:
Harry
Ron
Hagrid
Hermione
Note that we would probably miss Voldemort even though that character is important to the novel (we might miss that question on our 9th grade English test if we relied on our code to "read" the book for us). Can you think of any analysis that might bring Voldemort to the forefront? Post any ideas you have on Piazza (try to keep all ideas in a single threaded post). This is a conversation starter, not a requirement.
6. Data By Chapter
Looking at the graph that we need to build for this lesson, it's clear that we are going to need to get occurrence counts for the four characters for each chapter. Ideally, our data would look like the following (numbers are made up):
harry_by_chapter = [20, 79, 68, ...] # 17 numbers
ron_by_chapter = [ 0, 73, 14, ...] # 17 numbers
hagrid_by_chapter = [14, 0, 0, ...] # 17 numbers
Note that each column is the data for each chapter.
Write a function named split_into_chapters that uses a regular expression to split the parameter text into an array of chapters
def split_into_chapters
return an array whose elements are the text for each chapter
each element is trimmed of leading and trailing whitespace
each element can start with the title of the chapter or the first word of the chapter (ideally it would be the latter, but the regex is a bit more complicated)
Note: if you had to split a novel where no one pattern uniquely captures each of the chapters, you would need to match the opening words of each chapter, something like the following (example shows only the first 2 chapters):
def split_into_chapters(text):
# this is not the way you should solve this
m1 = re.search(r"^YOU don't know about me", text, re.M)
m2 = re.search(r"^WE went tiptoeing along", text, re.M)
m3 = re.search(r"^WELL, I got a good going", text, re.M)
chp1 = text[m1.span()[0]: m2.span()[0]]
chp2 = text[m2.span()[0]: m3.span()[0]]
return [chp1, chp2]
The ^ anchors (with re.M) insist that you are uniquely capturing the correct text. Clearly this is a last-resort solution.
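For the Harry Potter text, a pattern-based split might look like the sketch below. The 'CHAPTER' heading pattern is an assumption; inspect your cleaned text and adjust the regex to whatever actually separates the chapters:

```python
import re

def split_into_chapters(text):
    # ASSUMPTION: chapters begin with a heading line like 'CHAPTER ONE';
    # adjust the pattern to match your actual cleaned text
    parts = re.split(r'^CHAPTER\s+\w+\s*$', text, flags=re.M)
    # drop anything before the first chapter heading, trim whitespace
    return [part.strip() for part in parts[1:]]
```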
7. Character Counts.
Use Numpy to easily get the counts for the four main characters for each chapter. Create a new cell (we won't use a function for now) to create an array that has a count for the total number of occurrences the name appeared in each chapter.
The Numpy lesson has the function to get the counts from a string (and an example). As mentioned previously, after you are done your arrays should look like this (but not Python lists):
harry_by_chapter = [20, 79, 68, ...] # 17 numbers
ron_by_chapter = [ 0, 73, 14, ...] # 17 numbers
hagrid_by_chapter = [14, 0, 0, ...] # 17 numbers
Note:
If a character is referenced using multiple names or nicknames (something our analysis has not done), you could combine them:
harry = np.array([20, 79, 68])
potter = np.array([21, 2, 4])
harry_potter = np.array([5, 2, 0])
hp_counts = harry + potter - harry_potter
# [36, 79, 72]
This adds all 'Harry' and 'Potter' references together but adjusts for the double counting when the full reference to "Harry Potter" is made. Think of properly counting characters in the following sentences: "Harry?" "Is that you Potter?" "I'm not kidding Harry Potter, I need to see you NOW." There are 3 references to the same character (not counting pronouns). Do not do this, but it is something to keep in mind.
Finish this implementation
def get_character_counts_v1(chapters):
harry = ...
ron = ...
hagrid = ...
hermione = ...
return np.array([harry, ron, hagrid, hermione])
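A possible sketch of that implementation. Here str.count is a simple stand-in for the counting function from the Numpy lesson; note it will also match names embedded inside longer words:

```python
import numpy as np

def count_in_chapters(chapters, name):
    # occurrences of `name` in each chapter, as a numpy array;
    # str.count also matches names inside longer words (e.g. 'Ronan')
    return np.array([chapter.count(name) for chapter in chapters])

def get_character_counts_v1(chapters):
    names = ['Harry', 'Ron', 'Hagrid', 'Hermione']
    # one row per character, one column per chapter
    return np.array([count_in_chapters(chapters, name) for name in names])
```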
8. Plotting.
Using the same set up as in the Matplotlib lesson, plot each character:
def simple_graph_v1(plots):
fig = plt.figure()
subplot = fig.add_subplot(1,1,1)
subplot.plot(plots[0])
subplot.plot(plots[1])
subplot.plot(plots[2])
subplot.plot(plots[3])
# this is important for testing
return fig
Note that we are now calling the plot method on the returned subplot (a.k.a axes object) from the add_subplot method. In a previous lesson we called subplot.bar(x_pos, counts) to generate a bar graph. The plot method generates a line graph.
Once that is done the following should work (be sure to test this):
def pipeline_v1():
hp = clean_hp(get_harry_potter())
chapters = split_into_chapters(hp)
plots = get_character_counts_v1(chapters)
fig = simple_graph_v1(plots)
return fig
You should see something like the following:
This doesn't really look like the graph for which we are aiming. But it's a good start. We can see the counts for the four main characters of the novel and we found the characters without reading a single word!! (maybe we shouldn't celebrate this?)
Part 2 of this assignment will use Numpy and Matplotlib to do some data wrangling, fix the visualization, make the pipeline generic, and add some details.
Submission Credit
Notebook Prep:
Before submitting your notebook, be sure to comment out any print statements that print out a significant amount of text. Also, comment out any calls to find_characters_* (this will speed up the autograder):
#print(find_characters_v1(hp, stopwords, 10))
#print(find_characters_v2(hp, stopwords, 10))
#print(find_characters_nlp(hp, 10))
1. Save your notebook as a .py file, upload that file to gradescope for grading (be sure you name the saved file solution.py). The gradescope assignment name is HarryPotter-Part1.
Use the same spacy english model used in the NLP lesson.
2. Be sure to share your notebook and have the function jupyter (lesson.py) return the full url. Once that is done, you can hit submit.
You can tag any question with HPP1 on Piazza