CITS1401语言辅导、讲解Python程序、Python语言编程调试讲解留学生Processing|调试Matlab程序

CITS1401 Computational Thinking with Python
Project 2 Semester 2 2020
Page 1 of 11
Project 2: How Good (Positive and Patriotic) is Australia?
Submission deadline: 5:00 pm, Friday 23rd October 2020
Value: 20% of CITS1401
To be completed individually.
You should construct a Python 3 program containing your solution to the following
problem and submit your program electronically on Moodle. No other method of
submission is allowed. Your program will be automatically tested on Moodle. Remember
your first two checks against the tester on Moodle will not have any penalty. However
any further check will carry 10% penalty per check.
You are expected to have read and understood the University's guidelines on academic
conduct. In accordance with this policy, you may discuss with other students the general
principles required to understand this project, but the work you submit must be the
result of your own effort. Plagiarism detection, and other systems for detecting potential
malpractice, will therefore be used. Besides, if what you submit is not your own work
then you will have learnt little and will therefore, likely, fail the final exam.
You must submit your project before the submission deadline listed above. Following
UWA policy, a late penalty of 5% will be deducted for each day (or part day), after the
deadline, that the assignment is submitted. No submissions will be allowed after 7 days
following the deadline except approved special consideration cases.
Context:
For this project, imagine for a moment that you have successfully completed your UWA
course and recently taken up a position for the Department of Prime Minister and
Cabinet in Canberra with the Australian Federal Government. At first you were quite
reluctant to leave Perth to move ‘over east’ and, more generally, wondered what use a
new graduate with a heavy focus on computing, programming and data could be to this
department. Regardless, the opportunity to gain experience in the ‘real world’ was too
good, and although it is not quite your own multi-million dollar technology start-up,
there was no way you weren’t taking up the offer.
Your first few weeks of orientation was a mostly blur. However, one thing you noticed
was that any time you mentioned your skills in programming, and with Python1 in
particular, to any senior bureaucrat, or even some of the savvier politicians, their eyes
seemed to ‘light up’ and they suddenly became much more interested in whatever you
1 Actually their eyes are more likely to light up if / when you mention your skills in data science and machine
learning and big data, for all of which Python is basically the foundational tool for.
CITS1401 Computational Thinking with Python
Project 2 Semester 2 2020
Page 2 of 11
were saying to them. After reflecting on these experiences, maybe there would be some
even more interesting opportunities for you in the near future?
However, for now you decide to put aside these, as it’s not like the work that you have
been doing already has not been interesting, and this is what you need to focus on for
today. At an early morning meeting with your immediate supervisor, you were told that
the Government is very interested in reducing its spend on trying to understand what
(and how) the Australian population currently thinks about it. Instead of spending
millions of dollars calling randomised groups of Australian residents every quarter to
ask about their opinions on various Government services, many senior bureaucrats have
wondered for a while now whether there was any way to use the masses of freely
available data on the internet to provide similar insights at a fraction of the cost.
It is within this context that your supervisor has asked you to develop a program, as a
proof-of-concept, to demonstrate that it is possible to provide some of these insights at
a much lower cost. At your meeting your supervisor noted that, for the proof-of-concept
stage, the use of any ‘live’ internet data will not be possible without approval from the
legal team (as well as possibly many others). This seemed like quite an obstacle until
you thought back to one of your early Python units (maybe this one?) and remembered
that there is an open source, freely available corpus collection of billions of recently
crawled websites called the Common Crawl (http://commoncrawl.org/). More
specifically the Common Crawl corpus consists of tens of thousands of files saved in a
certain format (the WARC format, see below), each of which contains the raw HTML of
tens of thousands of web pages from a web ‘crawl’ performed in the recent past. Being
open source this data is free for you to use so with it you can immediately begin building
your proof-of-concept.
The Project:
As your program is to be a proof-of-concept, both you and your supervisor decided that
its scope should be kept as narrow as possible (but, of course, it must be broad enough
so that it can successfully demonstrate some really good insights). For this reason, it
was decided that your program is to focus only on providing four insights only:
1. How ‘positive’ is Australia generally?
2. How ‘positive’ does Australia feel towards their Government specifically?
3. How ‘patriotic’ is Australia compared with two other major English speaking
countries – UK and Canada?
4. What are the most referred-to websites (domains) by all Australian websites
(your team may want to use this information in the future to better understand
how ‘influential’ each Australian web result is to your insights, i.e. highly-referred
to web domains should be counted as more influential, and lowly-referred to web
domains should be counted as less influential).
As outlined in the ‘context’ section, in order to generate these insights (which will be
discussed in greater detail later in this document), your program will need to examine
CITS1401 Computational Thinking with Python
Project 2 Semester 2 2020
Page 3 of 11
the raw HTML from large quantities of Australian web pages, and such information is
available in WARC format from the Common Crawl.
The Common Crawl and WARC format:
The WARC (Web ARChive) format is a standard format for mass storage of large
amounts of ‘web pages’ within a single file. The Common Crawl makes the results of
their crawl freely available for download in this format (as well as the WAT and WET
formats, which will not be used for this project). For this project we will use WARC files
from the August 2020 crawl (https://commoncrawl.org/2020/08/august-2020-crawlarchive-now-available/).
In order to access these files you need to download the “WARC
files” list – which you can access by clicking on the “CC-MAIN-2020-34/warc.paths.gz”
hyperlink in the table in the August 2020 crawl homepage.
Clicking on this link will download an archive, which, when opened, will contain a text
file. Once you open the text file you can download any of the WARC files from the
common crawl by appending https://commoncrawl.s3.amazonaws.com/ to the front of
any of the lines of this file and pasting this full address into your browser.
A couple of notes about the Common Crawl WARC files as discussed so far:
• The file list and all Common Crawl WARC files are compressed using gzip. These files
can be unzipped automatically if you are using Linux or Mac OSX. For Windows you
will have to download a free application to do this - try 7-Zip: https://www.7-
zip.org/.
• The Common Crawl WARC files are very large – approximately 900MB compressed
and up to 5GB uncompressed. Each file contains approximately 45,000 individual
crawl results.
Due to the size of the files above, this project has made available a massively cut down
sample Common Crawl WARC file on LMS as well as Moodle server. It is expected you
will use this file to get familiar with the format and for your (initial) testing of your
project. However, your submission will be tested with other WARC files.
To start getting familiar with WARC files, it is recommended you download the sample
file and open it in a text editor (for Windows, Wordpad performs better; you can also
use Thonny). You will see that a WARC file consists of an overall file header, beginning
with the text “WARC/1.0”, and the next time you see this text is to describe either a
request (“WARC/1.0\r\nWARC-Type: request”), a response (“WARC/1.0\r\nWARCType:
response”) or possibly a metadata or other type of WARC category (e.g.
“WARC/1.0\r\nWARC-Type: metadata”). For this project we are only interested in
WARC responses (“WARC/1.0\r\nWARC-Type: response”), as these are the only
categories that contains the raw HTML data of the web page we are analysing.2
Looking into more detail at WARC responses, you can see that these are further broken
down into three sections, which are separated by blank lines. The first is the WARC
2 Note the use of ‘\r’ with ‘\n’ to signify a line ending in the WARC (and HTTP) headers. This is a standard line
ending code for text files saved with Microsoft Windows and some other scenarios. You will need to account for
this when processing these headers.
CITS1401 Computational Thinking with Python
Project 2 Semester 2 2020
Page 4 of 11
response header (beginning with “WARC/1.0”). The second is the HTTP header (usually
beginning with “HTTP/1.1 200”) and the third is the raw HTML data (usually but not
necessarily beginning with “”). For the purposes of this project, you
can assume that the first block of text (before the first blank line) is the WARC header,
the second block of text (after the first blank line) is always the HTTP header, and the
third block of text (i.e. anything after the second blank line and before the next
“WARC/1.0” heading) is the raw HTML that we need to analyse.
Taking into account the above, your program will need to be able to open a WARC file,
discard or ignore the overall WARC file header, and then for each result:
1. Extract the URL from the WARC response header (this is stored in the line starting
with “WARC-Target-URI”)
2. Extract the “Content-Type” from the HTTP header. For this project we are only
interested in responses that are of “Content-Type: text/html”. Any other types
of HTTP responses can be ignored.
3. Extract the raw HTML for this result and store it in a data structure so that it is
associated with the URL you extracted (in point 1).
Extracting Raw Text from HTML:
If you were to have a look at the raw HTML you have extracted in detail, you would see
that it doesn’t quite (yet) look like nice words and sentences that you will be able to
analyse to determine its “positivity” and “patriotism” as you are required to do for
insights 1 - 3. In order to get your text to this point, you are going to have to perform
some transformations on it, namely:
Removal of any HTML tags – any text between a ‘<’ character and a ‘>’ character
you can assume is a HTML tag and needs to be removed before completing your analysis
for insights 1, 2 and 3.
Removal of JavaScript code – before you remove your HTML tags above, you will
also need to remove any text that is between the ‘

’ tags
(again only for completing insights 1 - 3).
The Insights Themselves:
Some more details about what is required for each insight is below:
1. How ‘positive’ is Australia generally?
For this insight, both you and your supervisor are keen to understand how much
Australian websites use ‘positive’ words compared to how much Australian websites use
‘negative’ words. It was decided that, for this insight, your program should produce a
list with five items. The first and second items in this list are the total count of positive
words and negative words respectively within the raw text for all Australian web pages
that were in the WARC file provided to your program. The third item of the list should
be the ratio of positive words to negative, which can be calculated by dividing the former
CITS1401 Computational Thinking with Python
Project 2 Semester 2 2020
Page 5 of 11
by the latter. The fourth and fifth items should be the average number of positive words
and negative words respectively found in the typical Australian web page.
To assist you in this duty, your supervisor has provided you with a list of common
positive English words, and a list of common negative English words. You can find these
lists as text files on LMS and Moodle Server. For this project you can assume that any
words that are not in either of these lists should not be included as part of the positive
or negative counts.
Note in order to produce accurate results here, you will have to make sure that your
program counts the appearance of any positive and negative words in your text
regardless of the word’s case (uppercase / lowercase or a combination of the both) and
any punctuation at the start, end or within each word itself (e.g. commas, full stops,
quotation marks, etc.) - "- in fact it is recommended you remove all punctuation from
your text before performing this step (for now assume every non-alphanumeric
character visible on a standard ANSI keyboard is a punctuation character that needs to
be removed, if you have bought your computer in Australia or the US then it is very
likely you will have an ANSI keyboard, if not then you can easily find out what
punctuation an ANSI keyboard contains through a Google image search).
Note the above analysis should be performed on Australian websites only. For this
project assume that a website will always an Australian website if, and only if its domain
name ends in a ‘.au’ (domain names are discussed in more detail in insight 4).
2. How ‘positive’ is Australia towards its Government?
As well as calculating how ‘positive’ Australia is in general, the second outcome of this
project is to determine how ‘positive’ Australia is towards its Government. In order to
determine this your program should examine every sentence that contains the word
‘government’ for any positive or negative words. You and your supervisor decided on
the following rules for any sentences containing the word ‘government’:
• If the sentence has only one or more positive words then it should be counted as
a ‘positive’ sentence.
• If the sentence contains one negative word then it should be counted as a
‘negative’ sentence however if the sentence contains two negative words then
it should be counted as a ‘positive’ sentence (i.e. it is likely that the writer has
used a double negative). If the sentence contains three or more negative words
then it should be counted as a negative sentence.
• If the sentence contains a combination of positive and negative words (or no
positive or negative words) then it should not be counted as either positive or
negative.
As with insight 1 your results should be provided in a list with the first and second items
being your raw positive and negative counts, and the third item being the ratio of
positive to negative counts, and the fourth and fifth items being your average number
of positive sentences and negative sentences per web page respectively.
Also as with insight 1, the above analysis should be performed on Australian websites
only. In addition, the same directions with regards to the word’s case and punctuation
CITS1401 Computational Thinking with Python
Project 2 Semester 2 2020
Page 6 of 11
applies also, but you may wish to delay removing any sentence-ending punctuation
characters (see below) to ensure you are able to split your raw text for the result into
individual sentences first.
For this section you can assume that a sentence is any number of words ended by a
sentence-ending character (a full stop, a question mark or an exclamation mark).
3. How ‘patriotic’ is Australia compared with the other major English speaking
countries
For this insight you are required to determine how often the word “australia” appears
in the raw text of your Australian websites compared with how often the other country’s
names appear in their web sites, specifically focussing on two other major English
speaking countries – Canada and the United Kingdom (who both have their own unique
TLDs):
• For Canada your program should determine how often the word “canada” appears
in the raw text for any URLs whose domain name ends in “.ca”.
• For the United Kingdom your program should determine how often the word “uk”
and the phrases “united kingdom” and “great britain” appear in the raw text
for any URLs whose domain name ends in “.uk”.
All of the insights are to be calculated as percentages with the following formula:
(total number of occurrences of all words / phrases for the country) / (aggregate
number of words of every web result’s raw text for that country) * 100.
These percentages are then to be provided in a list in the order of [Australia, Canada,
United Kingdom].
The same directions from insight 1 with regards to the word’s case and punctuation
removal applies to the words you will examine in this insight also.
4. Web domain links and counts
For every Australian web page in your WARC file, your program should count every
domain name that it links to. A domain name is the part of a URL that refers to the root
site, for example the domain name for the link
“https://www.google.com.au/example/testing.html” is “www.google.com.au”. For
this project you can consider that a link in to web page only exists when it appears
within a