Introduction to ggplot2

An introduction to using ggplot2 to generate effective data visualizations in R: identify the key components of a graph, such as geometry, aesthetic mapping, and scales, and use them to create visually appealing and informative plots.
Data Visualization
GGplot2
Author

NING LI

Published

Dec 03, 2022

Overview

After completing this introduction to ggplot2, you will:

  • be able to use ggplot2 to create data visualizations in R.

  • be able to explain what the data component of a graph is.

  • be able to identify the geometry component of a graph and know when to use which type of geometry.

  • be able to explain what the aesthetic mapping component of a graph is.

  • be able to understand the scale component of a graph and select an appropriate scale component to use.

ggplot2

Key Points:

  • Throughout the series, we will create plots with the ggplot2 package. ggplot2 is part of the tidyverse suite of packages, which you can load with library(tidyverse).

  • Note that you can also load ggplot2 alone using the command library(ggplot2), instead of loading the entire tidyverse.

  • ggplot2 uses a grammar of graphics to break plots into building blocks that have intuitive syntax, making it easy to create relatively complex and aesthetically pleasing plots with relatively simple and readable code.

  • ggplot2 is designed to work exclusively with tidy data (rows are observations and columns are variables).
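
To make the tidy-data requirement concrete, the sketch below reshapes a small made-up table so that each row is one observation. The `wide` table and its numbers are invented purely for illustration (they are not from dslabs), and the example assumes the tidyr function pivot_longer(), which is loaded with the tidyverse.

```r
library(tidyr)
library(tibble)

# hypothetical toy data: one column per year, so the table is NOT tidy
wide <- tibble(state  = c("A", "B"),
               `2009` = c(10, 20),
               `2010` = c(12, 18))

# tidy version: one row per state-year observation, one column per variable
long <- pivot_longer(wide, cols = c(`2009`, `2010`),
                     names_to = "year", values_to = "total")
long
```

In the tidy form, year and total are ordinary columns, so they can be used directly inside aes().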

Graph Components

Key Points:

  • Plots in ggplot2 consist of 3 main components:
    • Data: The dataset being summarized
    • Geometry: The type of plot (scatterplot, boxplot, barplot, histogram, qqplot, smooth density, etc.)
    • Aesthetic mapping: Variables mapped to visual cues, such as x-axis and y-axis values and color.

Code:

library(dslabs)
data(murders)
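
As a rough illustration of how the three components show up in code, the sketch below builds one plot and then inspects the pieces stored in the resulting object. The name `p_components` is made up for this example, and peeking at `$mapping` and `$layers` is just one convenient way to look inside a ggplot object, not something you need in everyday use.

```r
library(ggplot2)
library(dslabs)
data(murders)

# data: murders; aesthetic mapping: aes(...); geometry: geom_point()
p_components <- ggplot(data = murders,
                       mapping = aes(x = population / 10^6, y = total)) +
  geom_point()

names(p_components$mapping)   # aesthetics that were mapped
length(p_components$layers)   # number of geometry layers added so far
```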

Creating a New Plot

Key Points:

  • You can associate a dataset x with a ggplot object with any of the 3 commands:

    • ggplot(data = x)
    • ggplot(x)
    • x %>% ggplot()
  • You can assign a ggplot object to a variable. If the object is not assigned to a variable, it will automatically be displayed.

  • You can display a ggplot object assigned to a variable by printing that variable.

Code:

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dslabs)
data(murders)

ggplot(data = murders)

murders %>% ggplot()

p <- ggplot(data = murders)

class(p)
[1] "gg"     "ggplot"
print(p) # this is equivalent to simply typing p

p 

Layers

Key Points:

  • In ggplot2, graphs are created by adding layers to the ggplot object: DATA %>% ggplot() + LAYER_1 + LAYER_2 + … + LAYER_N

  • The geometry layer defines the plot type and takes the format geom_x, where x is the plot type.

  • Aesthetic mappings describe how properties of the data connect with features of the graph (axis position, color, size, etc.). Define aesthetic mappings with the aes() function.

  • aes() uses variable names from the object component (for example, total rather than murders$total).

  • geom_point() creates a scatterplot and requires x and y aesthetic mappings.

  • geom_text() and geom_label() add text to a scatterplot and require x, y, and label aesthetic mappings.

  • To determine which aesthetic mappings are required for a geometry, read the help file for that geometry.

  • You can add layers with different aesthetic mappings to the same graph.

Code: Adding layers to a plot

library(tidyverse)
library(dslabs)
data(murders)

murders %>% ggplot() +
  geom_point(aes(x = population/10^6, y = total))
# add points layer to predefined ggplot object
p <- ggplot(data = murders)
p + geom_point(aes(population/10^6, total))

# add text layer to scatterplot
p + geom_point(aes(population/10^6, total)) +
  geom_text(aes(population/10^6, total, label = abb))

Code: Example of aes behavior

# no error from this call
p_test <- p + geom_text(aes(population/10^6, total, label = abb))

# error - "abb" is not a globally defined variable and cannot be found outside of aes
p_test <- p + geom_text(aes(population/10^6, total), label = abb)
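
The difference comes from when each expression is evaluated. The sketch below is one way to see it: aes() only records the expressions, while an argument outside aes() is evaluated immediately as an ordinary R variable. The names `m` and `res` are made up for this example, and it assumes no object called abb exists in your workspace.

```r
library(ggplot2)
library(dslabs)
data(murders)

# aes() only records the expressions; nothing is looked up in murders yet
m <- aes(population / 10^6, total, label = abb)
names(m)

# outside aes(), abb is evaluated right away as a regular R variable,
# so the call fails; tryCatch() captures the error instead of stopping
res <- tryCatch(
  geom_text(aes(population / 10^6, total), label = abb),
  error = function(e) e
)
conditionMessage(res)
```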

Tinkering

Key Points:

  • You can modify arguments to geometry functions other than aes() and the data.

  • These arguments are not aesthetic mappings: they affect all data points the same way.

  • Global aesthetic mappings apply to all geometries and can be defined when you initially call ggplot(). All the geometries added as layers will default to this mapping. Local aesthetic mappings add additional information or override the default mappings.

Nudge points a fixed distance

position_nudge(x = 0, y = 0) is generally useful for adjusting the position of items on discrete scales by a small amount. Nudging is built in to geom_text() because it’s so useful for moving labels a small distance from what they’re labeling.

Code:

# change the size of the points
p + geom_point(aes(population/10^6, total), size = 3) +
    geom_text(aes(population/10^6, total, label = abb))

# move text labels slightly to the right
p + geom_point(aes(population/10^6, total), size = 3) +
    geom_text(aes(population/10^6, total, label = abb), nudge_x = 1)

# simplify code by adding global aesthetic
p <- murders %>% ggplot(aes(population/10^6, total, label = abb))
p + geom_point(size = 3) +
    geom_text(nudge_x = 1.5)

# local aesthetics override global aesthetics
p + geom_point(size = 3) +
  geom_text(aes(x = 10, y = 800, label = "Hello there!"))
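
The nudge_x argument used above is a shortcut for position_nudge(). As a small sketch (the name `p_nudge` is made up for this example), the same shift can be written with an explicit position argument:

```r
library(ggplot2)
library(dslabs)
data(murders)

p_nudge <- ggplot(murders, aes(population / 10^6, total, label = abb)) +
  geom_point(size = 3) +
  # same effect as geom_text(nudge_x = 1.5)
  geom_text(position = position_nudge(x = 1.5))
p_nudge
```

Use one form or the other: geom_text() does not accept nudge_x together with a position argument.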

Scales, Labels, and Colors

Key Points:

  • Convert the x-axis to log scale with scale_x_continuous(trans = "log10") or scale_x_log10(). Similar functions exist for the y-axis.

  • Add axis titles with the xlab() and ylab() functions. Add a plot title with the ggtitle() function.

  • Add a color mapping that colors points by a variable by defining the col argument within aes(). To color all points the same way, define col outside of aes().

  • Add a line with the geom_abline() geometry. geom_abline() takes the arguments slope (default = 1) and intercept (default = 0). Change the color with col or color and the line type with lty.

  • Placing the line layer after the point layer will overlay the line on top of the points. To overlay points on the line, place the line layer before the point layer.

  • There are many additional ways to tweak your graph that can be found in the ggplot2 documentation, cheat sheet or on the internet. For example, you can change the legend title with scale_color_discrete.

Code: Log-scale the x-axis and y-axis

# define p
library(tidyverse)
library(dslabs)
data(murders)

p <- murders %>% ggplot(aes(population/10^6, total, label = abb))

# log base 10 scale the x-axis and y-axis
p + geom_point(size = 3) +
    geom_text(nudge_x = 0.05) +
    scale_x_continuous(trans = "log10") +
    scale_y_continuous(trans = "log10")
# efficient log scaling of the axes
p + geom_point(size = 3) +
    geom_text(nudge_x = 0.05) +
    scale_x_log10() +
    scale_y_log10()

Code: Add labels and title

p + geom_point(size = 3) +
    geom_text(nudge_x = 0.05) +
    scale_x_log10() +
    scale_y_log10() +
    xlab("Population in millions (log scale)") +
    ylab("Total number of murders (log scale)") +
    ggtitle("US Gun Murders in 2010")

Code: Change color of the points

# redefine p to be everything except the points layer
p <- murders %>% 
     ggplot(aes(population/10^6, total, label = abb)) +
     geom_text(nudge_x = 0.075) +
     scale_x_log10() +
     scale_y_log10() +
     xlab("Population in millions (log scale)") +
     ylab("Total number of murders (log scale)") +
     ggtitle("US Gun Murders in 2010")
# make all points blue
p + geom_point(size = 3, color = "blue")

# color points by region
p + geom_point(aes(col = region), size = 3)

Code: Add a line with average murder rate

r <- murders %>% 
     summarize(rate = sum(total) / sum(population) * 10^6) %>%
     pull(rate)

# add the average-rate line on top of the points (slope is default of 1)
p + geom_point(aes(col = region), size = 3) +
    geom_abline(intercept = log10(r))

# change line to dashed and dark grey, line under points
# (both axes are log10-scaled, so the intercept must be log10(r), not log(r))
p <- p + geom_abline(intercept = log10(r), lty = 2, color = "darkgrey") +
         geom_point(aes(col = region), size = 3)
p
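
Why log10(r) is the right intercept: a state at exactly the average rate has total = r × (population/10^6). Taking log10 of both sides gives log10(total) = log10(r) + log10(population/10^6), which on log-log axes is a line with slope 1 and intercept log10(r). The sketch below checks this numerically; the state of 5 million people is hypothetical and chosen only for illustration.

```r
library(dplyr)
library(dslabs)
data(murders)

# average rate in murders per million people
r <- murders %>%
  summarize(rate = sum(total) / sum(population) * 10^6) %>%
  pull(rate)

# a hypothetical state of 5 million people at exactly the average rate
x <- 5
y <- r * x

# on log10 axes the point sits on the line with slope 1 and intercept log10(r)
all.equal(log10(y), log10(r) + 1 * log10(x))
```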

Line types in R: lty

The different line types available in R are shown in the figure hereafter. The argument lty can be used to specify the line type. To change line width, the argument lwd can be used.
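
A quick way to see the named line types for yourself is to draw one horizontal line per type. This is only a sketch: the data frame `line_types` and the plot name `p_lty` are made up for the example, and it relies on scale_linetype_identity() to use the names in the data directly as line types.

```r
library(ggplot2)

# the six named line types that base R and ggplot2 both understand
line_types <- data.frame(
  name = c("solid", "dashed", "dotted", "dotdash", "longdash", "twodash"),
  y    = 6:1
)

p_lty <- ggplot(line_types) +
  geom_hline(aes(yintercept = y, linetype = name)) +
  scale_linetype_identity() +   # use the names themselves as line types
  scale_y_continuous(breaks = line_types$y, labels = line_types$name) +
  ylab("lty")
p_lty
```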

Code: Change legend title

# capitalize legend title
p <- p + scale_color_discrete(name = "Region")
p
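
As an alternative sketch, the labs() function from ggplot2 can set the axis titles, the plot title, and the legend title in a single call. The plot below is rebuilt from scratch under the made-up name `p_labs` so that it runs on its own; it is not meant to replace the step-by-step approach above.

```r
library(ggplot2)
library(dslabs)
data(murders)

p_labs <- ggplot(murders, aes(population / 10^6, total, col = region)) +
  geom_point(size = 3) +
  scale_x_log10() +
  scale_y_log10() +
  labs(x = "Population in millions (log scale)",
       y = "Total number of murders (log scale)",
       title = "US Gun Murders in 2010",
       color = "Region")   # legend title for the color aesthetic
p_labs
```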

Add-on packages

Key Points:

  • The style of a ggplot graph can be changed using the theme() function.

  • The ggthemes package adds additional themes.

  • The ggrepel package includes a geometry that repels text labels, ensuring they do not overlap with each other: geom_text_repel().

Code: Adding themes

# theme used for graphs in the textbook and course
library(dslabs)
ds_theme_set()
# themes from ggthemes
library(ggthemes)
p + theme_economist()    # style of the Economist magazine

p + theme_fivethirtyeight()    # style of the FiveThirtyEight website
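
The theme() function mentioned in the key points adjusts individual style elements, and it can be combined with a complete theme. The sketch below uses theme_minimal(), which ships with ggplot2, and then moves the legend; the plot name `p_theme` and the particular tweaks are arbitrary choices for illustration.

```r
library(ggplot2)
library(dslabs)
data(murders)

p_theme <- ggplot(murders, aes(population / 10^6, total, col = region)) +
  geom_point(size = 3) +
  theme_minimal() +                      # a complete theme shipped with ggplot2
  theme(legend.position = "bottom",      # then adjust individual elements
        plot.title = element_text(face = "bold"))
p_theme
```

Add the theme() call after the complete theme, because a complete theme resets any element set before it.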

Code: Putting it all together to assemble the plot

# load libraries
library(tidyverse)
library(ggrepel)
library(ggthemes)
library(dslabs)
data(murders)
# define the intercept
r <- murders %>%
    summarize(rate = sum(total) / sum(population) * 10^6) %>%
    .$rate
    
# make the plot, combining all elements
murders %>%
    ggplot(aes(population/10^6, total, label = abb)) +
    geom_abline(intercept = log10(r), lty = 2, color = "darkgrey") +
    geom_point(aes(col = region), size = 3) +
    geom_text_repel() +
    scale_x_log10() +
    scale_y_log10() +
    xlab("Population in millions (log scale)") +
    ylab("Total number of murders (log scale)") +
    ggtitle("US Gun Murders in 2010") +
    scale_color_discrete(name = "Region") +
    theme_economist()

Other Examples

Key Points:

  • geom_histogram() creates a histogram. Use the binwidth argument to change the width of bins, the fill argument to change the bar fill color, and the col argument to change bar outline color.

  • geom_density() creates smooth density plots. Change the fill color of the plot with the fill argument.

  • geom_qq() creates a quantile-quantile plot. This geometry requires the sample argument. By default, the data are compared to a standard normal distribution with a mean of 0 and standard deviation of 1. This can be changed with the dparams argument, or the sample data can be scaled.

  • Plots can be arranged adjacent to each other using the grid.arrange() function from the gridExtra package. First, create the plots and save them to objects (p1, p2, …). Then pass the plot objects to grid.arrange().

Code: Histograms in ggplot2

library(tidyverse)
library(dslabs)
data(heights)

# define p
p <- heights %>% 
  filter(sex == "Male") %>% 
  ggplot(aes(x=height))
# basic histograms
p + geom_histogram() + ggtitle("binwidth is default")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

p + geom_histogram(binwidth = 1) + ggtitle("binwidth is 1")

# histogram with blue fill, black outline, labels and title
p + geom_histogram(binwidth = 1, fill ="blue", col = "black") + 
  xlab("Male heights in inches") +
  ggtitle("histogram")

Code: Smooth density plots in ggplot2

p + geom_density(fill = "blue", col = "red") +
  xlab("Male heights in inches") +
  ylab("proportion of Male heights") +
  ggtitle("Male heights distribution")

Code: Quantile-quantile plots in ggplot2

# basic QQ-plot
p <- heights %>% filter(sex == "Male") %>% 
  ggplot(aes(sample = height))
p + geom_qq()

# QQ-plot against a normal distribution with same mean/sd as data
params <- heights %>% 
  filter(sex == "Male") %>% 
  summarize(mean = mean(height), sd = sd(height))
p + geom_qq(dparams = params) +
  geom_abline()

# QQ-plot of scaled data against the standard normal distribution
heights %>% 
  ggplot(aes(sample = scale(height))) +
  geom_qq() +
  geom_abline()
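
To see roughly what geom_qq() with dparams computes, the sketch below builds the same kind of plot by hand: theoretical normal quantiles (using the sample mean and standard deviation) are paired with the sorted observations. The names `male` and `qq` are made up for this example, and the use of ppoints() to choose the probabilities is an assumption about a reasonable default rather than a statement about ggplot2 internals.

```r
library(dplyr)
library(ggplot2)
library(dslabs)
data(heights)

male <- heights %>% filter(sex == "Male") %>% pull(height)

# theoretical normal quantiles with the sample mean and sd,
# paired with the sorted observations
qq <- data.frame(
  theoretical = qnorm(ppoints(length(male)), mean = mean(male), sd = sd(male)),
  sample      = sort(male)
)

ggplot(qq, aes(theoretical, sample)) +
  geom_point() +
  geom_abline()
```

If the heights are approximately normal, the points fall close to the identity line drawn by geom_abline().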

Code: Grids of plots with the gridExtra package

# define plots p1, p2, p3
p <- heights %>% filter(sex == "Male") %>% ggplot(aes(x = height))
p1 <- p + geom_histogram(binwidth = 1, fill = "blue", col = "black")
p2 <- p + geom_histogram(binwidth = 2, fill = "blue", col = "black")
p3 <- p + geom_histogram(binwidth = 3, fill = "blue", col = "black")
# arrange plots next to each other in 1 row, 3 columns
library(gridExtra)
grid.arrange(p1, p2, p3, ncol = 3)
