---
title: 'Lab 2: Visualizing Data with ggplot2'
output:
  pdf_document: default
  html_document: default
editor_options:
  chunk_output_type: inline
---

# Introduction to ggplot2

This lab will introduce you to **ggplot** (the **ggplot2** package), a package for building plots layer by layer. You'll also notice that the source code for this lab looks different. This code is written in *R Markdown*, a simple layout language that helps you combine text and code. Markdown is designed to create text that's readable in a text editor but can also be transformed into HTML or a PDF with section headings, bold, italics, etc.

Most of our labs will be written in R Markdown from here on out. The important thing to note is that actual R code is contained within sections (called *chunks*) delimited by three back-quotes. Outside of those sections, you'll write regular text to document what you're doing.

You can compile a whole R Markdown file at once using the **Knit** command (knit text and R output into a single document), or you can run a single code section at a time in the console by hitting **CTRL-SHIFT-ENTER**. The menu attached to the green arrow at the right of each code chunk also gives you options for how to run the code.

This first section of code loads in Hadley Wickham's **tidyverse** package, which draws in many of the graphing and data-manipulation commands we'll need. Try running it with **CTRL-SHIFT-ENTER** to see the output appear below the code block.

You should also try compiling the entire document by clicking the **Knit** command at the top of the editor window. Does your output appear in a separate window, or does it appear in the **Viewer** tab at right? You can set that option in the gear drop-down menu. I like the **Viewer** tab myself.

```{r}
library(tidyverse)
```

# The Data We'll Be Using

Dr. Hyun-Joo Kim has given a short survey to her STAT 190 classes here at Truman for the past few years, and she then uses that data throughout the semester. We'll be using **ggplot** to take a look at some of this data. First, we'll load it into R and take a look at what variables it contains.

```{r}
# The read.csv command reads the data file into memory. Don't worry too much
# about it now. The path below points to the class file area; if you have
# copied Clean-KimData.csv into the same directory as this markdown file,
# use read.csv("Clean-KimData.csv") instead.

Clean.KimData <- read.csv("U:/_MT Student File Area/Alberts/STAT 220/Clean-KimData.csv")
stat190_raw <- Clean.KimData

# It's not a bad idea to keep one pristine copy of your data (stat190_raw in this
# case), then make a copy to use (stat190). If you ever need to revert to the
# original data, you have it available.

stat190 <- stat190_raw

# Some variables should really be factor variables (categorical), not numeric.
# We'll change them here. I'll talk about this in class, but it's not the main
# thrust of this lab.

stat190$Politically.Liberal <- as.factor(stat190$Politically.Liberal)
stat190$Religiously.C.or.L <- as.factor(stat190$Religiously.C.or.L)
stat190$Socially.C.or.L <- as.factor(stat190$Socially.C.or.L)

summary(stat190)
```
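
To see why the factor conversion matters, here is a small sketch using made-up values (the `scores` vector is hypothetical and not taken from the survey). `summary()` reports quartiles for a numeric vector but counts per category for a factor.

```{r}
# Made-up responses on a 1-5 scale; not taken from Clean-KimData.csv.
scores <- c(1, 3, 3, 5, 2, 3)

# As a numeric variable, summary() reports min, quartiles, mean, and max.
summary(scores)

# As a factor, summary() counts how many times each category appears.
scores_factor <- as.factor(scores)
summary(scores_factor)
levels(scores_factor)
```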

# Relationship Between Height and Weight

Suppose we want to investigate the relationship between height and weight. Here's a simple scatter plot using R's base graphics commands:

```{r}
plot(stat190$Height, stat190$Weight)
```

Unsurprisingly, taller people tend to weigh more. For a first pass, this is fine, but with ggplot we can layer more information onto the same plot, such as color, shape, and facets.

Remember that **ggplot** needs three things: the *data*, a *geom*, and a *mapping* of variables to aesthetics. The same scatter plot looks like this in **ggplot**:

```{r}
ggplot(data=stat190) +
  geom_point(mapping = aes(x=Height, y=Weight))
```

We can add a color aesthetic to indicate **Gender**:

```{r}
ggplot(data=stat190) +
  geom_point(mapping = aes(x=Height, y=Weight, color=Gender))
```

The default colors are not ideal (they can be hard to tell apart for people with color-blindness, for instance), but you can see one person specified "other" (in blue).
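
If the default colors bother you, one option (not the only one) is the viridis palette that ships with ggplot2, which is designed to stay distinguishable under common forms of color-blindness. The sketch below uses a made-up `toy` data frame so it runs on its own; with the lab data you would add the same `scale_color_viridis_d()` line to the plot above.

```{r}
library(ggplot2)

# Made-up data for illustration only; the real lab uses stat190.
toy <- data.frame(
  Height = c(62, 65, 68, 70, 72, 66),
  Weight = c(120, 140, 155, 170, 185, 150),
  Gender = c("Female", "Female", "Male", "Male", "Male", "Other")
)

# scale_color_viridis_d() swaps the default discrete colors for viridis.
p <- ggplot(data = toy) +
  geom_point(mapping = aes(x = Height, y = Weight, color = Gender)) +
  scale_color_viridis_d()
p
```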

**Question 1**

> How does the relationship between height and weight change by gender?
> **Answer below:**

**Question 2**

> Do you think whether a student is right or left handed (the **Handed**
> variable) is related to height and weight? Try adding a **shape** aesthetic to
> the command below to find out. Can you see much in this graph? **Answer
> below:**

```{r}
ggplot(data=stat190) +
  geom_point(mapping = aes(x=Height, y=Weight, color=Gender))
```

Perhaps we can create separate graphs for each group instead. *Faceting* is the aspect of the grammar of graphics that groups data points according to the values of one or more faceting variables, then creates a graph for each group. The **facet_wrap** command in ggplot is one of two ways to create faceted graphs.

**Question 3**

> Run this code once as written, then add the appropriate code to indicate
> gender by color. Does it look like handedness is related to height and weight?
> **Answer below:**

```{r}
ggplot(data = stat190) +
  geom_point(mapping = aes(x=Height, y=Weight)) +
  facet_wrap(facets = c("Handed"))
```
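
The other way to create faceted graphs is **facet_grid**, which arranges panels in rows and columns defined by two variables. Here is a minimal sketch, assuming a made-up `toy` data frame that borrows the lab's column names (the values are invented for illustration):

```{r}
library(ggplot2)

# Made-up data for illustration only; the real lab uses stat190.
toy <- data.frame(
  Height = c(62, 65, 68, 70, 72, 66, 64, 71),
  Weight = c(120, 140, 155, 170, 185, 150, 130, 175),
  Gender = c("Female", "Female", "Male", "Male",
             "Male", "Female", "Female", "Male"),
  Handed = c("Right", "Left", "Right", "Right",
             "Left", "Right", "Right", "Right")
)

# One row of panels per value of Handed, one column per value of Gender.
p_grid <- ggplot(data = toy) +
  geom_point(mapping = aes(x = Height, y = Weight)) +
  facet_grid(rows = vars(Handed), cols = vars(Gender))
p_grid
```

**facet_wrap** is usually the better fit for a single faceting variable, while **facet_grid** is handy when you want to cross two of them.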

# Adding a **Stat** to a Graph

A **stat** is the connection between your data and a graph's geom: a mathematical rule that converts your data into the numbers that determine how the geom gets plotted. Here are some examples:

- For a scatter plot, the stat sends x and y coordinates unchanged to the plot.
- For a bar graph, the stat is a _count_ of data points in a certain group.
- For a box plot, the stat calculates _quartiles_ from the data to plot the box.
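
To see what a stat actually computes, ggplot2's `layer_data()` function returns the numbers that get handed to the geom. Here is a minimal sketch with made-up data (the `toy` data frame and its `group` and `value` columns are invented for illustration):

```{r}
library(ggplot2)

# Made-up data for illustration only.
toy <- data.frame(
  group = c("a", "a", "b", "b", "b"),
  value = c(1, 2, 3, 4, 10)
)

# Bar graph: the stat counts the rows in each group.
p_bar <- ggplot(data = toy) +
  geom_bar(mapping = aes(x = group))
layer_data(p_bar)[, c("x", "count")]

# Box plot: the stat computes the quartiles used to draw each box.
p_box <- ggplot(data = toy) +
  geom_boxplot(mapping = aes(x = group, y = value))
layer_data(p_box)[, c("x", "lower", "middle", "upper")]
```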

Each stat has a default geom that goes with it, and each geom has a default stat that goes with it. You can, however, add more stats to a graph. The code below fits a line to the scatter plot of height and weight. With `method=lm` ("lm" stands for linear model), the stat fits a straight line by least squares rather than a flexible smooth curve, and the shaded band around the line shows a confidence interval.

```{r}
ggplot(data=stat190) +
  geom_point(mapping = aes(x=Height, y=Weight, color=Gender)) +
  stat_smooth(mapping = aes(x=Height, y=Weight, color=Gender), method=lm)
```
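
If you are curious what the smoothing stat is doing, the sketch below compares it with a direct call to `lm()`. The `toy` data frame is made up for illustration; the comparison itself relies only on `layer_data()`, `lm()`, and `predict()`.

```{r}
library(ggplot2)

# Made-up data for illustration only; the real lab uses stat190.
toy <- data.frame(
  Height = c(60, 62, 64, 66, 68, 70, 72, 74),
  Weight = c(115, 122, 135, 141, 158, 163, 180, 188)
)

p_fit <- ggplot(data = toy, mapping = aes(x = Height, y = Weight)) +
  geom_point() +
  stat_smooth(method = lm)

# Layer 2 is the smooth; its y column holds the fitted values.
smooth_data <- layer_data(p_fit, 2)

# Fit the same line directly and predict at the x values the stat used.
fit <- lm(Weight ~ Height, data = toy)
direct <- predict(fit, newdata = data.frame(Height = smooth_data$x))

# Compare the two sets of fitted values.
all.equal(unname(direct), smooth_data$y)
```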

## A Short-Cut

Notice how we had to repeat the mapping specification in the code above? There is a short-cut. Any parameters specified in the first ggplot command are _inherited_ by the following geoms and stats. So, you _could_ write commands for the same graph like this:

```{r}
ggplot(data=stat190, mapping = aes(x=Height, y=Weight, color=Gender)) +
  geom_point() +
  stat_smooth(method=lm)
```

This flexibility is a double-edged sword. Different sources will write their ggplot commands differently, and it took me a while to realize why that was. Other sources will drop the "data" and "mapping" labels as another short-cut, which I feel also makes commands harder to understand.
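
So that you can recognize the unlabeled style when you see it elsewhere, here is a sketch of the same plot written both ways. The `toy` data frame is made up for illustration. The shorthand works because `data` and `mapping` are the first two arguments of `ggplot()`, and `x` and `y` are the first two arguments of `aes()`.

```{r}
library(ggplot2)

# Made-up data for illustration only; the real lab uses stat190.
toy <- data.frame(
  Height = c(62, 65, 68, 70),
  Weight = c(120, 140, 155, 170)
)

# Fully labeled form, as used throughout this lab.
p_labeled <- ggplot(data = toy, mapping = aes(x = Height, y = Weight)) +
  geom_point()

# Shorthand: data, mapping, x, and y are matched by position.
p_short <- ggplot(toy, aes(Height, Weight)) +
  geom_point()

# Compare the data each plot sends to its geom.
all.equal(layer_data(p_labeled), layer_data(p_short))
```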

**Question 4**

> Here's another way to see if you can see a difference between left and
> right-handed people. Add a **color** mapping to the stat in the code
> below. Does the line for left-handed people look like the line for
> right-handed people? **Answer Below:**

```{r}
ggplot(data=stat190) +
  geom_point(mapping = aes(x=Height, y=Weight)) +
  stat_smooth(mapping = aes(x=Height, y=Weight), method=lm)
```

# Conservative or Liberal? (and the **position** attribute)

We've looked at the effects of _mapping_ and _faceting_ on the graphs we can make. _Position_ is another aspect: it controls how geoms that would otherwise occupy the same space are arranged. Let's look at a comparison of categorical variables to see the effect of the **position** argument.

We'll first make a bar graph of a single variable: **Religiously.C.or.L**. Note that the y axis is determined by the **stat** attribute of the graph, which by default just counts the number of data points. You can read more about the **stat** attribute in our text, but we'll leave it alone here.

```{r}
ggplot(data = stat190) +
  geom_bar(mapping = aes(x=Religiously.C.or.L))
```

It's pretty easy to see that people in the middle are the most numerous, with either "liberal" or "conservative" being relatively less common.

Now if we want to compare **Religiously.C.or.L** to another variable, say **Socially.C.or.L**, we can add the second variable as a **fill** mapping. We don't add it as a **y** mapping because y is determined by the count of data points.

```{r}
ggplot(data = stat190) +
  geom_bar(mapping = aes(x=Religiously.C.or.L, fill=Socially.C.or.L))
```

Stacked bar graphs are often a good choice, but another option is to create a clustered bar graph, using the **position = "dodge"** option. This may make it slightly easier to compare the prevalence of each social category within each religious category.

```{r}
ggplot(data = stat190) +
  geom_bar(mapping = aes(x=Religiously.C.or.L, fill=Socially.C.or.L), position="dodge")
```

Neither of these graphs is necessarily best if the goal is to compare the percentage of each "social" category within each "religious" category. Often we care less about the _number_ of data points within each group, and more about the _relative percentages_ within each group. In this case, the **position = "fill"** option is what we want.

```{r}
ggplot(data = stat190) +
  geom_bar(mapping = aes(x=Religiously.C.or.L, fill=Socially.C.or.L), position="fill")
```

Now it becomes pretty easy to see that the percentage of students who are socially liberal is higher among religiously liberal students than among religiously conservative students--a result that isn't too surprising.
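
The bar segments in a **position = "fill"** graph are within-group proportions. As a rough check on that idea, the sketch below does the same arithmetic with base R on a made-up `toy` data frame (its `Religious` and `Social` columns and their values are invented for illustration):

```{r}
# Made-up categories for illustration only; not from the survey.
toy <- data.frame(
  Religious = c("C", "C", "C", "C", "L", "L", "L", "L"),
  Social    = c("C", "C", "C", "L", "C", "L", "L", "L")
)

# Cross-tabulate: rows are Religious, columns are Social.
counts <- table(toy$Religious, toy$Social)
counts

# margin = 1 divides each row by its row total, so each row sums to 1.
props <- prop.table(counts, margin = 1)
props
```

Each row of `props` corresponds to one bar in the filled graph, and each entry to the height of one colored segment.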

**Question 5**

> Now, see if you can make a similar graph to answer this question: "Is there
> evidence that students who have been here more semesters have a higher
> percentage who are socially liberal?" Note: This graph will treat
> **Semester** as a categorical variable. You can ignore the 9- and 10-semester
> bars because they represent very few students. **Answer and code below:**
