7AAVBCS1: Social and Cultural Analytics
To what extent are cultural analytics techniques useful to the literary criticism of novel series? A case study of the Harry Potter series.
Social and Cultural Analytics, 28 April 2017
I. INTRODUCTION
The emergence of the discipline of digital humanities has often been associated with the shift from reading a
single book 'on paper' to the possibility of browsing thousands of digital texts (Jänicke, 2015), creating
multiple opportunities for the field of literary criticism but also new challenges.
Using digital analytics in the field of literary criticism would represent the promise of new research paths.
Assisted content analysis could make possible 'the systematic analysis of large-scale text collections without
massive funding support' (Grimmer and Stewart, 2013, p. 267), facilitating critical inferences on large
collections and allowing researchers to understand underlying cultural phenomena.
Novel series are a potential object of research in the digital humanities, and should be considered as a cultural
object worth being investigated by scholars. However, novel series are usually large documents and cannot be
analysed as a whole entity using current close reading practices.
This study addresses the use of computer assisted content analysis methods for the literary criticism of novel
series, and more specifically for the analysis of the narrative structure of novel series. I aim to identify the
limitations of these techniques in the context of literary analysis and therefore to understand to what extent
they can be used as a criticism tool.
Using R scripts and computer assisted content analysis techniques, I will investigate the narrative structure of
the Harry Potter series, written by J.K. Rowling between 1997 and 2007, looking more specifically at the
evolution of topics, of the narrative arcs and of the characters throughout the whole series. I will finally critically
engage with the findings of the analysis, assessing how the techniques applied and their results can help us
better understand the work of J.K. Rowling.
II. METHODOLOGICAL BACKGROUND
The analysis of the Harry Potter series will rely on three main techniques:
Topic modelling to uncover the evolution of topics and the evolution of the characters throughout the series;
Sentiment analysis to analyse the evolution of narrative arcs;
Visualisations to convey efficiently the results.
1. TOPIC MODELLING
Topic modelling is defined by Alexander et al. as 'a type of text processing that determines major themes of a
collection of texts through statistical analysis' (2015, p. 174). This technique can help us identify the main
themes of a novel series, but most importantly it can uncover how these topics relate to each other. The
approach of Rule et al. (2015) to topic modelling is able to account for the fluidity of discursive categories over
time, by identifying and analysing dynamic discursive streams in large textual corpora. Chui et al. (2011) take
this idea further by designing TextFlow, a program that not only differentiates topics but also identifies critical
events within the text and their causes.
Both these studies acknowledge the limitations of topic modelling, as the output can often seem arbitrary and
be difficult to interpret. However, Alexander et al. (2015) show with their program Serendip, a topic model
driven exploration tool for text analysis, that by focusing on the evolution of topics rather than the topics
themselves, and by always looking back to the original text, topic modelling has the potential to be a useful
tool for investigating the narrative structure of large collections of texts.
2. SENTIMENT ANALYSIS
According to Zhu et al., sentiment analysis 'aims at user’s attitude and opinions by investigating, analysing and
extracting subjective texts involving users’ opinions, preferences and sentiment' (2012, p. 572), and is a
technique currently applied mostly for marketing purposes.
Jockers (2015) shows that sentiment analysis can support the analysis of the narrative structure of a plot. Drawing
on Vonnegut's early theory ('There is no good reason why the simple shapes of stories cannot be fed into
computers'), he created the R package 'Syuzhet', designed to extract sentiment and plot information from
texts. Jockers then developed a systematic way of extracting plot arcs from fiction, arguing that
plot structure is informed by the ratio of positive and negative events, a method echoed by Silge's
sentiment analysis of Jane Austen's novels (Silge, 2016).
While these methods provide an efficient way of visualising narratives, it is important to acknowledge their
limitations (Bannister, 2015):
It is difficult to apprehend contextual understanding and tone, such as sarcasm;
Linguistic evolution should be taken into account when analysing documents that span over a long period of
time;
Sentiment cannot always be classified in set categories.
3. VISUALISATIONS
All the studies referenced above promote the use of visualisation when applying digital analytics to literary
criticism. As defined by Manovich (2009), visualisations rely on the principle of both reduction of information
and the use of spatial variables to represent key differences in the data and reveal patterns and structures.
Visualisations can therefore be a very important tool in text mining as they will synthesise and convey the
results of the analysis in both a meaningful and convincing way.
III. METHODS
1. AIM OF THE STUDY
This study aims to investigate, analyse and visualise the narrative structure of the Harry Potter novel series.
More specifically, I will focus on the analysis of the perception of kernel events, and on the evolution of
characters and topics throughout the seven novels.
2. DATASET USED
To pursue my analysis, I used the ebooks available online in EPUB format and converted the seven
novels to TXT format, using the following online tool: http://www.epubconverter.com/epub-to-txt-converter/.
Before loading the text files into R Studio, I manually deleted the metadata and edition-specific
information (publisher and copyright information) present at the beginning and end of each novel. The
result is seven text files containing only the text written by J.K. Rowling.
3. GENERAL METHODOLOGY
To critically assess the validity of text mining techniques in the context of literary criticism, and the results of the
analysis, it was important to use a diverse range of techniques, as a means of both comparison and
validation:
I first followed tidy text mining principles, as laid out by Wickham (2009), and Silge and Robinson (2017).
Tidy data has a specific structure, where each variable is a column, each observation is a row and each type
of observational unit is a table (Wickham, 2009). Tidy text should therefore be in a table with a one-token-
per-row format — a token being a unit of meaning, e.g. a word, an n-gram or a string (Silge and Robinson,
2017). Using the R package tidytext, Silge and Robinson (2017) argue that tidy data principles make it easy to
manipulate, summarise and visualise the characteristics of text while also optimising existing text
mining techniques.
I also used the tm package and the topicmodels package to carry out a complementary analysis and to
process topic models, following Blanke's lecture notes (2017).
This methodology was adapted in R Studio as two different projects:
In the first project, the data was formatted according to tidy data principles;
In the second project, the data was enclosed in a corpus and later redefined as a document term matrix,
and a term document matrix.
This organisation allowed me to compare the results of text mining queries performed on different data formats.
4. DATA PREPARATION
A. Data tidy up using tidy data principles
To comply with tidy data principles, I loaded the text files in R Studio in a one-line-per-row format:
I noticed that the data inherited symbols during the format conversion, such as the pattern "\f". I performed a
string transformation using the function sub() to clean the texts:
I created the original_books data frame, containing the seven novels and their title:
Following Silge and Robinson's recipe, I added more information to this data frame:
• the line number for each line of text;
• the chapter number for each line of text.
To convert this data frame to tidy format, I used the function unnest_tokens(), available in the tidytext
package. The output is a data frame with one word per row for each document. It has the following advantages:
The other columns, here the line number and the chapter number, are retained;
The punctuation is automatically stripped;
The words are converted to lowercase.
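The one-token-per-row idea can be illustrated outside R. The block below is a minimal sketch in Python, not the tidytext code used in the study: the function name tidy_tokens, the rule that a line starting with "chapter" opens a new chapter, and the sample lines are all assumptions made for illustration.

```python
import re

def tidy_tokens(lines):
    """Return one row per word, keeping the line and chapter it came from."""
    # Assumption: a chapter starts on any line beginning with the word "chapter".
    chapter_heading = re.compile(r"^chapter\b", re.IGNORECASE)
    # Lowercase words, optionally with an apostrophe part; punctuation is dropped.
    word = re.compile(r"[a-z0-9]+(?:'[a-z]+)?")
    rows, chapter = [], 0
    for line_number, line in enumerate(lines, start=1):
        if chapter_heading.match(line.strip()):
            chapter += 1
        for token in word.findall(line.lower()):
            rows.append({"line": line_number, "chapter": chapter, "word": token})
    return rows

# Invented sample text, not taken from the novels.
sample = ["CHAPTER ONE", "The owl arrived at dawn.", "CHAPTER TWO", "Nobody saw it leave!"]
rows = tidy_tokens(sample)
```

Each row keeps its line and chapter number, so later steps can group words by chapter or by section without returning to the raw text.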
To complete the data tidying and preparation process, it was necessary to remove common English stop
words. Silge and Robinson suggest using the anti_join() function for this purpose:
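An anti-join keeps only the rows whose key has no match in a second table. The following Python sketch shows that logic on a toy example; it is not the R call from the study, and the helper anti_join and the four-word stop list are hypothetical.

```python
def anti_join(rows, stop_words, key="word"):
    """Keep the rows whose key value does not appear in stop_words."""
    stop = set(stop_words)  # set membership makes the lookup fast
    return [row for row in rows if row[key] not in stop]

rows = [{"line": 1, "word": w} for w in ["the", "owl", "arrived", "at", "dawn"]]
stop_words = ["the", "at", "of", "and"]  # toy list, far shorter than a real one
kept = anti_join(rows, stop_words)
```

The remaining rows still carry their other columns, which is what makes the tidy format convenient for the later per-book and per-section counts.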
B. Creation of the corpus, using the tm package
In my other R project, I first loaded the seven text files into seven character vectors, using the function
read_file():
I had to create another data frame containing the seven novels and related information, also called
original_books for consistency:
The corpus hp_corpus was created, and I proceeded to remove punctuation, white spaces and numbers
using tm_map():
To remove common stop words, I extended the list already loaded in the tm package by creating a larger
custom list of common stop words, containing over a thousand entries. This list was loaded as a character
vector, custom_stopwords:
5. EXPLORING THE DOCUMENTS
The data now being formatted, it is possible to start the analysis. In this section, I will explore word frequencies
in the Harry Potter series.
A. Exploiting the tidy data format
Tidy format allows us to easily find the most common words in the entire corpus, by counting word
frequencies:
To get a more relevant result, I decided to investigate the most common words, per book, and to compare this
number to the total number of words in each volume:
The hp_words data frame shows that the last three novels are much longer than the first volumes of the
series, and therefore that words in these novels have higher raw counts. It also shows that the most frequent
words are not relevant to the analysis of the narrative structure, as they refer only to the three main characters,
Harry, Ron and Hermione.
One way to remedy this limitation is to calculate the tf-idf statistic, intended to measure the importance
of a word to a document or corpus, depending on how rarely this word is used (Silge and Robinson, 2017).
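To make the statistic concrete, here is a small Python sketch of tf-idf using the common definition, which tidytext also follows: term frequency is a word's count divided by the document's total word count, and inverse document frequency is the natural log of the number of documents divided by the number of documents containing the word. The function name tf_idf and the two toy documents are assumptions for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs maps a document name to its list of tokens."""
    counts = {name: Counter(tokens) for name, tokens in docs.items()}
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for c in counts.values() for term in c)
    n_docs = len(docs)
    scores = {}
    for name, c in counts.items():
        total = sum(c.values())
        for term, n in c.items():
            tf = n / total
            idf = math.log(n_docs / doc_freq[term])
            scores[(name, term)] = tf * idf
    return scores

# Toy documents standing in for two novels.
docs = {
    "book_a": ["harry", "owl", "harry", "quirrell"],
    "book_b": ["harry", "dobby", "dobby", "car"],
}
scores = tf_idf(docs)
```

A word that appears in every document, such as "harry" here, receives an idf of zero and therefore a tf-idf of zero, which is why the main characters' names drop out of the ranking.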
Using the arrange() function, it was possible to determine the 10 most characteristic words of the entire
series. These 10 words all refer to important characters who shaped the narrative structure of the novels.
B. Validating the results
To confirm these results, I created a document term matrix in my other project. A document term matrix
contains the documents in rows, and the words in columns:
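The layout of such a matrix can be sketched as follows. This is a plain Python illustration, not the tm package call used in the study; the function document_term_matrix and the toy documents are hypothetical.

```python
from collections import Counter

def document_term_matrix(docs):
    """Return (document names, vocabulary, count matrix) with documents as rows."""
    vocabulary = sorted({term for tokens in docs.values() for term in tokens})
    names = list(docs)
    matrix = []
    for name in names:
        counts = Counter(docs[name])
        matrix.append([counts[term] for term in vocabulary])
    return names, vocabulary, matrix

docs = {"book_a": ["owl", "harry", "owl"], "book_b": ["harry", "car"]}
names, vocabulary, dtm = document_term_matrix(docs)

# A term document matrix is the transpose: terms as rows, documents as columns.
tdm = [list(column) for column in zip(*dtm)]
```

The same counts feed both layouts, so any weighting applied to one can be compared directly with results from the other.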
Following Blanke's lecture notes (2017), I used a weighting function to calculate the tf-idf:
We notice that the two queries returned similar results, with the exception of four words ("luna", "kreacher",
"eater" and "dobby").
To investigate this discrepancy, I plotted the highest tf-idf values from the tidy data frame hp_books, grouping the
words by book title (Figure 1). Figure 1 shows that the words not represented in the first ten words of
the data frame hp_words (which contains the entire corpus) all appear in the list of words with the highest tf-idf
per novel. The results can therefore be considered validated.
Figure 1: Words with highest tf-idf per novel, from the tidy data frame hp_words
6. SENTIMENT ANALYSIS
A. Sentiment analysis using the bing lexicon
Jockers (2015) argues that sentiment analysis can represent the narrative structure of a text, and therefore can
highlight critical patterns and relationships within the text. I followed Silge and Robinson's detailed instructions
on how to process sentiment analysis using tidy text mining principles (Silge and Robinson, 2017).
I used the bing lexicon, which categorises words as positive or negative. To calculate the
sentiment progression of the novels, I first divided the texts into sections of 80 lines (here called index). In
each section, words are categorised as positive or negative, then the average number of words in
the two categories is used to determine the overall sentiment score for the section.
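The sectioning step can be sketched in Python as below. This is an illustration under stated assumptions, not the study's R code: the score is computed as positive minus negative word counts per section, as in Silge and Robinson's recipe, which may differ from the exact aggregation used here; the four-word lexicon is a toy stand-in for the bing lexicon, and section_sentiment is a hypothetical name.

```python
from collections import defaultdict

def section_sentiment(rows, lexicon, lines_per_section=80):
    """Net sentiment (positive minus negative matches) per block of lines."""
    scores = defaultdict(int)
    for row in rows:
        polarity = lexicon.get(row["word"])
        if polarity is None:
            continue  # words absent from the lexicon do not affect the score
        index = row["line"] // lines_per_section  # integer division gives the section
        scores[index] += 1 if polarity == "positive" else -1
    return dict(sorted(scores.items()))

# Toy lexicon, an assumption standing in for the bing lexicon.
toy_lexicon = {"happy": "positive", "brave": "positive",
               "dark": "negative", "fear": "negative"}
rows = [
    {"line": 3, "word": "happy"},
    {"line": 40, "word": "brave"},
    {"line": 79, "word": "dark"},
    {"line": 80, "word": "fear"},
    {"line": 95, "word": "dark"},
    {"line": 120, "word": "owl"},
]
scores = section_sentiment(rows, toy_lexicon)
```

The section length controls the trade-off between detail and noise: shorter sections follow the text more closely but contain too few lexicon words to give stable scores.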
Using ggplot2, it is possible to plot the overall sentiment progression for each novel. Figure 2 shows that
most sections of the seven novels have been associated with negative sentiments. As the results appeared
unbalanced, I decided to apply another method for sentiment analysis to the text before validating the output
of the sentiment analysis.
B. Validating the results
As a comparison, I applied a different method to the first volume of the series, The Sorcerer's Stone.
This method, also designed by Silge (2017), uses character vectors with one line of text per row and a more
detailed function to process sentiment.
Basing my analysis on Silge's instructions (2017), I first created a function to process sentiment and generate
a sentiment score for each section of text (process_sentiment()) and a function to plot the overall
sentiment structure (plot_sentiment()).
Figure 4 represents the overall sentimental structure of Harry Potter and the Sorcerer's Stone:
Figure 4: Sentiment in Harry Potter and the Sorcerer's Stone
Although it is possible to interpret a narrative structure with this plot, it remains noisy, which makes
narrative patterns challenging to identify. Silge applies a low pass Fourier transformation to the Jane Austen
corpus to scale the novels and optimise the visualisations.
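The idea behind low pass Fourier smoothing is to transform the sequence of section scores into frequency components, keep only the lowest few, and transform back. The Python sketch below uses a direct discrete Fourier transform from the standard library; it is an illustration, not the R routine used by Silge or in this study, and the function low_pass, the parameter keep and the sample scores are assumptions.

```python
import cmath

def low_pass(values, keep=3):
    """Smooth a series by keeping only its `keep` lowest frequency components."""
    n = len(values)
    # Forward discrete Fourier transform (direct O(n^2) form, fine for short series).
    spectrum = [
        sum(values[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    # Keep frequencies 0..keep-1 and their mirrored counterparts, zero the rest.
    filtered = [c if (k < keep or k > n - keep) else 0 for k, c in enumerate(spectrum)]
    # Inverse transform; the imaginary parts cancel because mirrors were kept.
    return [
        (sum(filtered[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]

# Invented, noisy section scores.
noisy = [2, 5, 1, 4, -3, 0, -6, -2, -5, -1, 3, 0]
smooth = low_pass(noisy, keep=3)
```

With keep=1 only the mean survives, so every smoothed value equals the average score; larger values of keep let progressively finer movements of the plot back in.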
I applied the same method to our corpus, resulting in Figure 6, which shows that it is possible to identify the
overall narrative structure of a text using sentiment analysis. Therefore it can enable the analysis of the
narrative structure of a novel series as a whole, using communicative visualisations.
8. TOPIC MODELLING
Another way to investigate the narrative structure of the novels is to look at the evolution of topics throughout
the novels, using topic modelling.
I determined that limiting the study to 7 topics provided more relevant results:
I then used the posterior() function from the topicmodels package to calculate the contribution of each
topic to the novels:
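The contribution of each topic to a document is a row of proportions that sums to one. As a rough illustration only, the Python sketch below derives such proportions from counts of words assigned to each topic, with a smoothing constant alpha. This is one common estimate for topic models and is an assumption here: it is not necessarily what posterior() computes for the model fitted in this study, and the function name and counts are hypothetical.

```python
def topic_proportions(counts, alpha=0.1):
    """Turn per-document topic assignment counts into smoothed proportions."""
    n_topics = len(counts[0])
    proportions = []
    for row in counts:
        # The denominator includes the smoothing mass so each row sums to 1.
        denominator = sum(row) + n_topics * alpha
        proportions.append([(n + alpha) / denominator for n in row])
    return proportions

# Invented counts: two documents, two topics.
counts = [[8, 2], [0, 10]]
theta = topic_proportions(counts, alpha=1.0)
```

A matrix of this shape, with one row per novel and one column per topic, is what a heat map of topic distribution displays.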
It was then possible to plot the topic evolution throughout the seven novels using a heat map, as displayed in
Figure 7:
Figure 7: Topic distribution in the Harry Potter series
I then identified the following topics:
IV. FINDINGS
1. THE EVOLUTION OF THE CHARACTERS
The text mining techniques used in this study not only highlighted the centrality of the main characters (Harry,
Ron and Hermione), but most importantly put forward important characters that are often considered as
secondary (Ludo Bagman, Gilderoy Lockhart or even Xenophilius Lovegood). Characters such as Winky, the
house elf present mostly in the fourth volume, are often dismissed as unnecessary to the story line and do not
even feature in the film adaptation of the novel. However, the tf-idf of Winky's name, ranked in the top ten
words of the corpus, proves otherwise.
Figure 8: Words with highest tf-idf in the novel series
Moreover, digital analytics can be used to understand the influence of some characters throughout the
narrative. By composing a simple comparative word cloud (Figure 9), I was able to identify three characters who
played an important role in the development of Harry's character.
Figure 9 shows that Hagrid was a central figure in the first two novels. Sirius Black then became a form of
father figure in the third, fourth and fifth books. Finally, after Sirius' death, Dumbledore becomes one of
Harry's last allies, and the centrality of his name in the word cloud highlights the critical influence this character had
over the story.
It is also worth noting that no major character is highlighted in the last section of Figure 9, dedicated to the
last novel, with the exception of Voldemort. This element illustrates the loneliness Harry experiences in The
Deathly Hallows and contrasts sharply with the many character names present in The Sorcerer's Stone.
2. THE EVOLUTION OF TOPICS
Topic modelling enables us to visualise the evolution of topics in the novel series. Figure 7 shows that the
novels evolve towards darker themes, with two convergent topics. Indeed, topic 7, associated with Harry's
muggle life, is more present in the first two novels and gradually fades. Conversely, topic 6, which represents the
threat of Voldemort, is gradually more present throughout the series, reaching its peak in the last volume,
where Harry finally faces his enemy.
Topic modelling can also help to identify narrative periods in the series. Indeed, there is a differentiation
between the life at Hogwarts before Voldemort's return (topic 2, which is characterised by a vocabulary
revolving around school, and mostly present in the first three novels), and the life at Hogwarts after Voldemort's
return (topic 5, which uses the same vocabulary but with darker connotations, and is mostly present in the
sixth volume).
Topic modelling therefore enables us to identify a general narrative structure of the series:
• The first three novels are characterised as a time of innocence (topics 2 and 7);
• The fourth and fifth novels represent a tipping point, just before and after Voldemort's return (topics
1 and 3);
• The return to life at Hogwarts, under the threat of Lord Voldemort (topic 5), essentially in the sixth volume;
• Harry's last fight against Voldemort in the last novel (topics 4 and 6).
3. THE PERCEPTION OF NARRATIVE EVENTS
The sentiment analysis performed in this study confirms this initial narrative structure, by representing the
tipping point moment of Voldemort's return, in The Goblet of Fire (cf. Figure 6).
In addition, it enables us to identify narrative patterns common to the novels: the novels all start within the
positive sphere until Harry has to face a form of challenge, where the sentiment scores drop. Sentiment scores
rise at the end of the first three books but stay low for the rest of the series. We also notice that the sentiment
scores of each novel reach lower values as the series progresses.
This analysis confirms that the story evolves towards darker themes and topics, therefore challenging the
popular categorisation of Harry Potter as children's literature.
V. CONCLUSION
Text mining techniques have proven to be useful in the context of narrative analysis of novel series. We were
able to visualise and analyse the narrative structure of the Harry Potter series, and more specifically to
investigate the evolution of topics and characters throughout the series.
However, I identified three main limitations to these techniques: