Computer Analysis
and Visualisation
Assignment 1
Tweet Analysis
Worth: 5% of the unit
Submission: Answer the questions on the quiz server.
Deadline: 11 March 2021 5pm
Late submissions: Late submissions attract a 5% raw penalty per day for up to 7 days (i.e., until 18 March 2021 5pm). After that, the
mark will be 0 (zero). Any plagiarised work will also be marked zero.
1. Outline
Natural language processing (NLP) is a useful yet difficult task. Our UWA Cybersecurity Research Group
has been focusing on rumour detection and generation in order to prevent rumours from causing harm to
society. As a first step, we built an Automated Rumour Generation Hub (ARGH) that uses various machine
learning (ML) and NLP techniques to generate rumours that are difficult for both humans and machines
to identify. In particular, Twitter has been used as the source dataset, as we often observe different
rumours circulating on this social media platform. However, Twitter does not provide analytical functions for us to
summarise the data, so we have to do that ourselves.
In this assignment, you will carry out simple data analysis tasks using tweets, as outlined in the
Tasks section below, mostly to test your basic Excel competency. More complex tasks will be carried
out in other assignments (stay tuned!).
Note 1: This is an individual assignment. Please don't share your solution/code/files with others (only
high-level discussion is allowed, e.g., the syntax of a formula, use of array formulas with other examples,
etc.). If your submission is found not to be your original work, you may be penalised.
Note 2: You may use intelligent formatting and colour combinations to display your worksheet in an
understandable manner. However, don't "pimp up" the worksheet.
Note 3: You can find ARGH here: https://github.com/argh-rumor-detection/ARGH-Rumor-Generation,
where you can run ARGH yourself using Google Colab.
CITS 2401
Computer Analysis
and Visualisation
2. Tasks
Task 1
Import original.txt into Excel word by word. Here the term "word" refers to any sequence of letters
separated by a space. Note that the text qualifier should be set to {none} when you import the text. This
sheet should be named words_data. Finally, the whole data range should be named words. Figure 1
shows an example of what the output looks like if this task is done correctly.
Figure 1. words_data sheet snippet.
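The splitting rule in Task 1 can be illustrated outside Excel. The following is a minimal Python sketch, not the Excel import itself: the function name split_into_rows and the sample text are made up for illustration, and it assumes that each line of the file becomes one row and that runs of consecutive spaces produce no empty cells. It also shows why the text qualifier matters: with no qualifier, quotation marks stay attached to the words instead of grouping several words into one cell.

```python
def split_into_rows(text: str) -> list[list[str]]:
    """Split text into rows of words, using a single space as the only delimiter."""
    rows = []
    for line in text.splitlines():
        # No text qualifier: a double quote is treated as an ordinary character.
        # Empty strings from consecutive spaces are dropped (an assumption).
        rows.append([word for word in line.split(" ") if word])
    return rows


# Hypothetical sample, not taken from original.txt
sample = 'RT @user: "breaking news" is #fake\nstay  safe'
rows = split_into_rows(sample)
for row in rows:
    print(row)
```

In this sketch the quoted phrase is split into two separate words, each keeping its quotation mark, which is the behaviour expected when the text qualifier is set to {none}.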
Task 2
Create a new sheet named uniques_data. Import the list of unique words from the uniques.txt file
provided. The words should start at cell A1. The whole range should be named uniques.
Task 3
1. In Column B: Calculate the frequency of the unique words from the words_data sheet. You must
use an array formula to do this. Name the cell range as freq.
2. In Column C: Calculate the total number of letters each word contributes to words_data. This can
be calculated by simply multiplying the number of letters in the word by its frequency count. Name the cell
range as letters.
3. In Column D: Calculate the rank based on the frequency values. You must use an array formula
to do this. Name the cell range as rank.
In addition, apply conditional formatting on rank so that the bottom 10 ranked values (i.e., the 10
smallest values) are formatted with Light Red Fill with Dark Red Text.
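The three columns in Task 3 can be sketched in Python to show the logic behind them. This is an illustration only, not the required array formulas. The function name word_table and the sample words are hypothetical. It assumes case-sensitive exact matching (Excel's counting functions may treat case differently), counts every character of a word as a letter, and assumes that rank 1 goes to the most frequent word, with tied words sharing the same rank.

```python
from collections import Counter


def word_table(words, uniques):
    """Return (freq, letters, rank) lists aligned with the order of `uniques`."""
    counts = Counter(words)
    # Column B: how often each unique word occurs in the full word list
    freq = [counts[u] for u in uniques]
    # Column C: word length multiplied by its frequency
    letters = [len(u) * f for u, f in zip(uniques, freq)]
    # Column D: 1 + number of strictly larger frequencies (ties share a rank)
    rank = [1 + sum(other > f for other in freq) for f in freq]
    return freq, letters, rank


# Hypothetical sample data standing in for the words and uniques ranges
words = ["the", "cat", "saw", "the", "dog", "the", "cat"]
uniques = ["the", "cat", "saw", "dog"]
freq, letters, rank = word_table(words, uniques)
print(freq, letters, rank)
```

Under the descending-rank assumption, the 10 smallest rank values highlighted by the conditional formatting correspond to the most frequent words.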
Task 4
Create a new sheet named stats. Add the following labels in cells A2 to A7:
1. Average
2. Max
3. Min
4. Median
5. Mode
6. SD
Note, SD stands for standard deviation. Also, Average and SD should be rounded to 2 decimal places.
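As a cross-check of the six statistics, here is a minimal Python sketch using the standard statistics module. The function name summarise and the sample values are hypothetical. It assumes SD means the sample standard deviation (dividing by n - 1), which the assignment does not specify. Python's round can also differ from Excel's rounding on exact halves, so treat the result as a sanity check rather than a reference answer.

```python
import statistics


def summarise(values):
    """Compute the six summary statistics; Average and SD are rounded to 2 decimals."""
    return {
        "Average": round(statistics.mean(values), 2),
        "Max": max(values),
        "Min": min(values),
        "Median": statistics.median(values),
        "Mode": statistics.mode(values),
        # Assumption: sample standard deviation (n - 1 in the denominator)
        "SD": round(statistics.stdev(values), 2),
    }


# Hypothetical frequency values, not taken from the assignment data
sample_freq = [3, 2, 1, 1]
summary = summarise(sample_freq)
for label, value in summary.items():
    print(label, value)
```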
Next, add labels as follows (Cell: Value):
1. B1: Frequency
2. C1: Letters
3. D1: >Average
4. E1: