Lemme introduce myself a little bit. I'm currently doing my PhD at Monash, working with Di and Rob. My research is about visualising temporal data.

This is what I'm going to cover for the next hour and a half. First, I'll talk about why the tidyverse is great. The tidyverse is a collection of R packages that centre around the "tidy data" concept. We'll learn some verbs/functions that do the data wrangling to get tidy temporal data. As data analysts ... Second, we'll learn about ggplot2, which handles plotting. You may wonder why it's called ggplot2 and what gg means. It's the grammar of graphics for plots. I'll talk about how the ggplot2 functions, powered by the grammar, help us visualise time series data. In the final bit, I'm going to show you some demos of interactive graphics using plotly and shiny.

Tidy data and statistical visualisation
to support exploration of temporal data with R

Earo Wang

May 29, 2017

Slides on http://bit.ly/tsvis2017

1 / 45

Agenda

  • Why tidyverse? Tidy data
  • What does gg in ggplot2 stand for? Grammar of graphics
  • Interactive data visualisation on the web
    • plotly: ggplotly(), plot_ly, animation
    • shiny app: wanderer4melb
2 / 45


R packages

  • tidyverse: a collection of R packages surrounding "tidy data"
  • stringr: handle string manipulation
  • forcats: handle categorical variables
  • lubridate: lubricate the date-time process
  • plotly: create web-based visualisation
  • shiny: build interactive web applications
  • knitr: provide tools for dynamic report generation
  • devtools: help with R packages development
3 / 45

Here's a list of the R packages we're going to use in this part. As I said before, the tidyverse is a set of packages including ggplot2, dplyr, readr, etc. stringr is for strings and forcats for categorical variables. lubridate makes dealing with dates easier. plotly and shiny are for interactive graphics on the web.

Tidy data

4 / 45

Okay. Now, let's start with tidy data.

What is tidy data?

  • Each variable forms a column.
  • Each observation forms a row.
  • Each type of observational unit forms a table.
5 / 45

What is tidy data? What makes a dataset tidy?

The data structure is a rectangular cases-by-variables layout that underlies the tidyverse.

I'll use three datasets to explain what tidy data actually means, and how important tidy data is for further data analysis and visualisation.
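Before the real datasets, here is a toy illustration of the first two principles — hypothetical data, not from the talk, just to show the same measurements in an untidy (wide) and a tidy (long) layout:

```r
library(tidyr)

# Untidy: the sensor identity is hidden in the column *names*
untidy <- data.frame(
  day = c("Mon", "Tue"),
  sensor_a = c(140, 64),
  sensor_b = c(36, 17)
)

# Tidy: each variable (day, sensor, count) forms a column,
# and each observation (one sensor on one day) forms a row
tidy <- gather(untidy, key = sensor, value = count, sensor_a, sensor_b)
tidy
```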

Melbourne pedestrian data 🚶‍♀️

- sensor data

ped_loc <- read_csv("data/sensor_locations.csv")
ped_loc %>%
  select(
    `Sensor ID`, `Sensor Description`,
    Longitude, Latitude
  )
#> # A tibble: 43 x 4
#> `Sensor ID` `Sensor Description` Longitude
#> <int> <chr> <dbl>
#> 1 22 Flinders St-Elizabeth St (East) 144.9651
#> 2 34 Flinders St-Spark La 144.9742
#> 3 11 Waterfront City 144.9396
#> 4 8 Webb Bridge 144.9472
#> 5 7 Birrarung Marr 144.9714
#> 6 13 Flagstaff Station 144.9566
#> 7 15 State Library 144.9645
#> 8 27 QV Market-Peel St 144.9566
#> 9 12 New Quay 144.9429
#> 10 24 Spencer St-Collins St (North) 144.9545
#> # ... with 33 more rows, and 1 more variables:
#> # Latitude <dbl>

* source: the city of Melbourne

6 / 45

The first dataset we're going to look at is ...

The dataset is sourced from the Melbourne Open Data Portal. You can actually click here to check out the web page that hosts the dataset. Since 2009, the City of Melbourne has been installing sensors that capture the foot traffic every hour. This data can be used for urban planning or business management. For example, if you're a cafe owner, you could look at the hourly traffic to decide your trading hours. To date, 43 sensors have been installed across the city.

Here, I read the sensor locations data into R using read_csv from the readr package. Rob has talked about the pipe operator. I pass the data to the select function and select four columns, that is ...

Melbourne pedestrian data 🚶‍♀️

- sensor data

- sensor map

7 / 45

Since we know each sensor's longitude and latitude, we can plot the sensors on a map of Melbourne. It clearly shows where they are located.

Melbourne pedestrian data 🚶‍♀️

- sensor data

- sensor map

- hourly counts

ped_2017 <- read_csv("data/pedestrian_03_2017.csv")
ped_2017
#> # A tibble: 744 x 45
#> Date Hour `State Library` `Collins Place (South)`
#> <chr> <int> <int> <int>
#> 1 01/03/2017 0 140 36
#> 2 01/03/2017 1 64 17
#> 3 01/03/2017 2 29 11
#> 4 01/03/2017 3 13 9
#> 5 01/03/2017 4 13 10
#> 6 01/03/2017 5 31 84
#> 7 01/03/2017 6 92 252
#> 8 01/03/2017 7 327 767
#> 9 01/03/2017 8 908 1997
#> 10 01/03/2017 9 775 1319
#> # ... with 734 more rows, and 41 more variables: `Collins
#> # Place (North)` <int>, `Flagstaff Station` <int>,
#> # `Melbourne Central` <int>, `Town Hall (West)` <int>,
#> # `Bourke Street Mall (North)` <int>, `Bourke Street Mall
#> # (South)` <int>, `Australia on Collins` <int>, `Southern
#> # Cross Station` <int>, `Victoria Point` <int>, `New
#> # Quay` <int>, `Waterfront City` <int>, `Webb
#> # Bridge` <int>, `Princes Bridge` <int>, `Flinders St
#> # Station Underpass` <int>, `Sandridge Bridge` <int>,
#> # `Birrarung Marr` <int>, `QV Market-Elizabeth
#> # (West)` <int>, `Flinders St-Elizabeth St (East)` <int>,
#> # `Spencer St-Collins St (North)` <int>, `Spencer
#> # St-Collins St (South)` <int>, `Bourke St-Russell St
#> # (West)` <int>, `Convention/Exhibition Centre` <int>,
#> # `Chinatown-Swanston St (North)` <int>, `Chinatown-Lt
#> # Bourke St (South)` <int>, `QV Market-Peel St` <int>,
#> # `Vic Arts Centre` <int>, `Lonsdale St (South)` <int>,
#> # `Lygon St (West)` <int>, `Flinders St-Spring St
#> # (West)` <int>, `Flinders St-Spark Lane` <int>, `Alfred
#> # Place` <int>, `Queen Street (West)` <int>, `Lygon
#> # Street (East)` <int>, `Flinders St-Swanston St
#> # (West)` <int>, `Spring St-Lonsdale St (South)` <int>,
#> # `City Square` <int>, `St. Kilda-Alexandra
#> # Gardens` <int>, `Grattan St-Swanston St (West)` <int>,
#> # `Monash Rd-Swanston St (West)` <int>, `Tin
#> # Alley-Swanston St (West)` <int>, Southbank <int>
8 / 45

Besides the locations, we're more interested in the hourly pedestrian counts at every sensor. I read in the second csv file, which contains the pedestrian counts for March.

For this data, there are 744 observations and 45 columns. ...

Date is read in as character ...

Let me refer to this kind of format as wide format, in contrast to long form. I'll explain what I mean by wide and long later.

Melbourne pedestrian data 🚶‍♀️

- sensor data

- sensor map

- hourly counts

- tidy data

9 / 45

The remedy is converting the wide format to the long format.

The top data table is what we have: wide.

gather (from tidyr) collects the headers into one key variable, and the counts into a value variable.

Instead of more than 40 columns, we now have four variables to work with. Looking at this long form, it is much clearer what the variables are, compared to the wide format. Each variable forms a column.

These are the same data organised in two different ways. To be consistent, the tidy long form is used.

Melbourne pedestrian data 🚶‍♀️

- sensor data

- sensor map

- hourly counts

- tidy data

ped_long <- ped_2017 %>%
  gather(
    key = Sensor_Name, value = Counts,
    `State Library`:Southbank
  ) %>%
  mutate(
    Date_Time = dmy_hms(paste(Date, Hour, "00:00")),
    Date = dmy(Date)
  )
ped_long
#> # A tibble: 31,992 x 5
#> Date Hour Sensor_Name Counts
#> <date> <int> <chr> <int>
#> 1 2017-03-01 0 State Library 140
#> 2 2017-03-01 1 State Library 64
#> 3 2017-03-01 2 State Library 29
#> 4 2017-03-01 3 State Library 13
#> 5 2017-03-01 4 State Library 13
#> 6 2017-03-01 5 State Library 31
#> 7 2017-03-01 6 State Library 92
#> 8 2017-03-01 7 State Library 327
#> 9 2017-03-01 8 State Library 908
#> 10 2017-03-01 9 State Library 775
#> # ... with 31,982 more rows, and 1 more variables:
#> # Date_Time <dttm>
10 / 45

Any questions so far?

Otway weather data ☀️

- what variables?

otway_weather <- read_csv("data/weather_2016.csv")
head(otway_weather)
#> # A tibble: 6 x 35
#> ID YEAR MONTH ELEMENT VALUE1 VALUE2 VALUE3
#> <chr> <int> <chr> <chr> <int> <int> <int>
#> 1 ASN00090015 2016 01 TMAX 209 195 193
#> 2 ASN00090015 2016 01 TMIN 175 145 162
#> 3 ASN00090015 2016 01 PRCP 0 0 0
#> 4 ASN00090015 2016 01 TAVG 166 174 175
#> 5 ASN00090015 2016 02 TMAX 217 239 185
#> 6 ASN00090015 2016 02 TMIN 120 146 149
#> # ... with 28 more variables: VALUE4 <int>, VALUE5 <int>,
#> # VALUE6 <int>, VALUE7 <int>, VALUE8 <int>, VALUE9 <int>,
#> # VALUE10 <int>, VALUE11 <int>, VALUE12 <int>,
#> # VALUE13 <int>, VALUE14 <int>, VALUE15 <int>,
#> # VALUE16 <int>, VALUE17 <int>, VALUE18 <int>,
#> # VALUE19 <int>, VALUE20 <int>, VALUE21 <int>,
#> # VALUE22 <int>, VALUE23 <int>, VALUE24 <int>,
#> # VALUE25 <int>, VALUE26 <int>, VALUE27 <int>,
#> # VALUE28 <int>, VALUE29 <int>, VALUE30 <int>,
#> # VALUE31 <int>

* source: global historical climatology network

11 / 45

Otway weather data ☀️

- what variables?

- tidying 🛠

otway_weather %>%
  gather(DAY, VALUE, VALUE1:VALUE31)
#> # A tibble: 1,488 x 6
#> ID YEAR MONTH ELEMENT DAY VALUE
#> <chr> <int> <chr> <chr> <chr> <int>
#> 1 ASN00090015 2016 01 TMAX VALUE1 209
#> 2 ASN00090015 2016 01 TMIN VALUE1 175
#> 3 ASN00090015 2016 01 PRCP VALUE1 0
#> 4 ASN00090015 2016 01 TAVG VALUE1 166
#> 5 ASN00090015 2016 02 TMAX VALUE1 217
#> 6 ASN00090015 2016 02 TMIN VALUE1 120
#> 7 ASN00090015 2016 02 PRCP VALUE1 2
#> 8 ASN00090015 2016 02 TAVG VALUE1 187
#> 9 ASN00090015 2016 03 TMAX VALUE1 243
#> 10 ASN00090015 2016 03 TMIN VALUE1 172
#> # ... with 1,478 more rows
12 / 45

Otway weather data ☀️

- what variables?

- tidying 🛠

otway_weather %>%
  gather(DAY, VALUE, VALUE1:VALUE31) %>%
  mutate(
    DAY = str_sub(DAY, start = 6),
    DATE = ymd(paste(YEAR, MONTH, DAY, sep = "-"))
  ) %>%
  arrange(DATE) %>%
  select(ID, DATE, ELEMENT, VALUE) %>%
  filter(!(is.na(DATE)))
#> # A tibble: 1,464 x 4
#> ID DATE ELEMENT VALUE
#> <chr> <date> <chr> <int>
#> 1 ASN00090015 2016-01-01 TMAX 209
#> 2 ASN00090015 2016-01-01 TMIN 175
#> 3 ASN00090015 2016-01-01 PRCP 0
#> 4 ASN00090015 2016-01-01 TAVG 166
#> 5 ASN00090015 2016-01-02 TMAX 195
#> 6 ASN00090015 2016-01-02 TMIN 145
#> 7 ASN00090015 2016-01-02 PRCP 0
#> 8 ASN00090015 2016-01-02 TAVG 174
#> 9 ASN00090015 2016-01-03 TMAX 193
#> 10 ASN00090015 2016-01-03 TMIN 162
#> # ... with 1,454 more rows
13 / 45

Otway weather data ☀️

- what variables?

- tidying 🛠

otway_weather %>%
  gather(DAY, VALUE, VALUE1:VALUE31) %>%
  mutate(
    DAY = str_sub(DAY, start = 6),
    DATE = ymd(paste(YEAR, MONTH, DAY, sep = "-"))
  ) %>%
  arrange(DATE) %>%
  select(ID, DATE, ELEMENT, VALUE) %>%
  filter(!(is.na(DATE))) %>%
  mutate(
    VALUE = if_else(VALUE < -999, NA_integer_, VALUE),
    VALUE = VALUE / 10
  ) %>%
  spread(ELEMENT, VALUE)
#> # A tibble: 366 x 6
#> ID DATE PRCP TAVG TMAX TMIN
#> * <chr> <date> <dbl> <dbl> <dbl> <dbl>
#> 1 ASN00090015 2016-01-01 0 16.6 20.9 17.5
#> 2 ASN00090015 2016-01-02 0 17.4 19.5 14.5
#> 3 ASN00090015 2016-01-03 0 17.5 19.3 16.2
#> 4 ASN00090015 2016-01-04 0 17.7 20.2 16.7
#> 5 ASN00090015 2016-01-05 0 17.8 20.6 16.1
#> 6 ASN00090015 2016-01-06 0 17.1 20.3 16.5
#> 7 ASN00090015 2016-01-07 0 15.8 19.7 14.8
#> 8 ASN00090015 2016-01-08 0 15.6 18.8 14.2
#> 9 ASN00090015 2016-01-09 0 15.7 19.0 11.3
#> 10 ASN00090015 2016-01-10 0 18.6 25.4 11.9
#> # ... with 356 more rows
14 / 45

Otway weather data ☀️

- what variables?

- tidying 🛠

otway_tidy <- otway_weather %>%
  gather(DAY, VALUE, VALUE1:VALUE31) %>%
  mutate(
    DAY = str_sub(DAY, start = 6),
    DATE = ymd(paste(YEAR, MONTH, DAY, sep = "-"))
  ) %>%
  arrange(DATE) %>%
  select(ID, DATE, ELEMENT, VALUE) %>%
  filter(!(is.na(DATE))) %>%
  mutate(
    VALUE = if_else(VALUE < -999, NA_integer_, VALUE),
    VALUE = VALUE / 10
  ) %>%
  spread(ELEMENT, VALUE) %>%
  mutate(NAVG = (TMAX + TMIN) / 2)
head(otway_tidy)
#> # A tibble: 6 x 7
#> ID DATE PRCP TAVG TMAX TMIN NAVG
#> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 ASN00090015 2016-01-01 0 16.6 20.9 17.5 19.20
#> 2 ASN00090015 2016-01-02 0 17.4 19.5 14.5 17.00
#> 3 ASN00090015 2016-01-03 0 17.5 19.3 16.2 17.75
#> 4 ASN00090015 2016-01-04 0 17.7 20.2 16.7 18.45
#> 5 ASN00090015 2016-01-05 0 17.8 20.6 16.1 18.35
#> 6 ASN00090015 2016-01-06 0 17.1 20.3 16.5 18.40
15 / 45

TAVG: UTC time zone rather than local time

Your turn

- billboard data

billboard.csv records the date a song first entered the Billboard Top 100 in 2000 and its rank over 76 weeks.

#> # A tibble: 6 x 81
#> year artist track
#> <int> <chr> <chr>
#> 1 2000 Backstreet Boys, The Shape Of My Heart
#> 2 2000 Backstreet Boys, The Show Me The Meaning ...
#> 3 2000 Backstreet Boys, The The One
#> 4 2000 N'Sync Bye Bye Bye
#> 5 2000 N'Sync It's Gonna Be Me
#> 6 2000 N'Sync This I Promise You
#> # ... with 78 more variables: time <time>,
#> # date.entered <date>, `1` <int>, `2` <int>, `3` <int>,
#> # `4` <int>, `5` <int>, `6` <int>, `7` <int>, `8` <int>,
#> # `9` <int>, `10` <int>, `11` <int>, `12` <int>,
#> # `13` <int>, `14` <int>, `15` <int>, `16` <int>,
#> # `17` <int>, `18` <int>, `19` <int>, `20` <int>,
#> # `21` <int>, `22` <int>, `23` <int>, `24` <int>,
#> # `25` <int>, `26` <int>, `27` <chr>, `28` <chr>,
#> # `29` <chr>, `30` <chr>, `31` <chr>, `32` <chr>,
#> # `33` <chr>, `34` <chr>, `35` <chr>, `36` <chr>,
#> # `37` <chr>, `38` <chr>, `39` <chr>, `40` <chr>,
#> # `41` <chr>, `42` <chr>, `43` <chr>, `44` <chr>,
#> # `45` <chr>, `46` <chr>, `47` <chr>, `48` <chr>,
#> # `49` <chr>, `50` <chr>, `51` <chr>, `52` <chr>,
#> # `53` <chr>, `54` <chr>, `55` <chr>, `56` <chr>,
#> # `57` <chr>, `58` <chr>, `59` <chr>, `60` <chr>,
#> # `61` <chr>, `62` <chr>, `63` <chr>, `64` <chr>,
#> # `65` <chr>, `66` <chr>, `67` <chr>, `68` <chr>,
#> # `69` <chr>, `70` <chr>, `71` <chr>, `72` <chr>,
#> # `73` <chr>, `74` <chr>, `75` <chr>, `76` <chr>
16 / 45
  • Old dataset
  • When a song first entered the Billboard Top 100
  • The position each week; NA means it has dropped out of the Top 100

Your turn

- billboard data

- end products

song
#> # A tibble: 6 x 4
#> id artist track
#> <int> <chr> <chr>
#> 1 1 Backstreet Boys, The Shape Of My Heart
#> 2 2 Backstreet Boys, The Show Me The Meaning ...
#> 3 3 Backstreet Boys, The The One
#> 4 4 N'Sync Bye Bye Bye
#> 5 5 N'Sync It's Gonna Be Me
#> 6 6 N'Sync This I Promise You
#> # ... with 1 more variables: time <time>
rank
#> # A tibble: 456 x 4
#> id date.entered week rank
#> <int> <date> <chr> <chr>
#> 1 1 2000-10-14 1 39
#> 2 2 2000-01-01 1 74
#> 3 3 2000-05-27 1 58
#> 4 4 2000-01-29 1 42
#> 5 5 2000-05-06 1 82
#> 6 6 2000-09-30 1 68
#> 7 1 2000-10-14 2 25
#> 8 2 2000-01-01 2 62
#> 9 3 2000-05-27 2 50
#> 10 4 2000-01-29 2 20
#> # ... with 446 more rows
17 / 45
  • convert wide to long
  • split into two data tables
  • follows the third tidy data principle: each type of observational unit forms a table
  • 40 mins in
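One possible solution can be sketched as follows — a sketch, not the only answer; it assumes the csv has been read into `billboard` with the column names shown above (weeks `1` to `76`, with the later week columns read in as character):

```r
library(tidyverse)

# 1. give each song an id; 2. gather the week columns into long form;
# 3. split song metadata and weekly ranks into separate tables,
#    one per type of observational unit
billboard_long <- billboard %>%
  mutate(id = row_number()) %>%
  gather(key = week, value = rank, `1`:`76`) %>%
  mutate(rank = as.integer(rank)) %>%  # later weeks were read as <chr>
  filter(!is.na(rank))                 # NA rank = dropped out of the Top 100

song <- billboard_long %>%
  distinct(id, artist, track, time)

rank <- billboard_long %>%
  select(id, date.entered, week, rank) %>%
  arrange(as.integer(week), id)
```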

Grammar of graphics

18 / 45
  • requires a basic/good understanding of the tidy data structure
  • because it builds a solid foundation for statistical visualisation

What is grammar of graphics?

  • In 1999, "The grammar of graphics" by Leland Wilkinson
  • In 2008, ggplot2 by Hadley Wickham
  • A conceptual framework for mapping variables to graphical elements
  • The analogue of a functional mapping of random variables. For example, for $n$ independent and identically distributed random variables $X_1, \ldots, X_n$, the mean and the standard deviation are defined as $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ and $S_{n-1} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2}$.
19 / 45

We're going to use a short but comprehensive vocabulary to describe different sorts of graphs.

More on pedestrian data

- revisit

ped_long
#> # A tibble: 31,992 x 5
#> Date Hour Sensor_Name Counts
#> <date> <int> <chr> <int>
#> 1 2017-03-01 0 State Library 140
#> 2 2017-03-01 1 State Library 64
#> 3 2017-03-01 2 State Library 29
#> 4 2017-03-01 3 State Library 13
#> 5 2017-03-01 4 State Library 13
#> 6 2017-03-01 5 State Library 31
#> 7 2017-03-01 6 State Library 92
#> 8 2017-03-01 7 State Library 327
#> 9 2017-03-01 8 State Library 908
#> 10 2017-03-01 9 State Library 775
#> # ... with 31,982 more rows, and 1 more variables:
#> # Date_Time <dttm>
20 / 45

More on pedestrian data

- revisit

- ggplot2

data: ped_long
layer:
mapping: x = Date_Time, y = Counts
geom: line, point
facet: Sensor_Name

ggplot(ped_long, aes(x = Date_Time, y = Counts)) +
  geom_line() +
  geom_point() +
  facet_grid(Sensor_Name ~ ., scales = "free_y")
21 / 45
  • autoplot: plots against the time index
  • This is how we use the graphics grammar to describe the plot
  • Start with a tidy dataset with clearly defined variables.
  • geoms and other aesthetics make the graph perceivable.
  • facet: trellis plots or lattice plots.
  • scales: local (free), for comparing shapes instead of magnitudes.

More on pedestrian data

- revisit

- ggplot2

data: ped_long
layer:
mapping: x = Date_Time, y = Counts, colour = Sensor_Name
geom: line, point
facet: Sensor_Name

ggplot(ped_long, aes(x = Date_Time, y = Counts)) +
  geom_line(aes(colour = Sensor_Name)) +
  geom_point(aes(colour = Sensor_Name)) +
  facet_grid(Sensor_Name ~ ., scales = "free_y")
22 / 45
  • colour is also part of the aes mapping.
  • it adds no additional info in this case.
  • the pedestrian data is interesting and rich
  • Southern Cross as an example to explore
  • it seems that Southern Cross has distinctive patterns between weekdays and weekends

More on pedestrian data

- revisit

- ggplot2

- southern x 🚉

wday <- c("Monday", "Tuesday", "Wednesday", "Thursday",
  "Friday")
sx <- ped_long %>%
  filter(Sensor_Name == "Southern Cross Station") %>%
  mutate(
    Wday = wday(Date, label = TRUE, abbr = FALSE),
    Wday = if_else(Wday %in% wday, "Weekday", "Weekend"),
    Wday = ordered(Wday)
  )
sx
sx
#> # A tibble: 744 x 6
#> Date Hour Sensor_Name Counts
#> <date> <int> <chr> <int>
#> 1 2017-03-01 0 Southern Cross Station 16
#> 2 2017-03-01 1 Southern Cross Station 8
#> 3 2017-03-01 2 Southern Cross Station 3
#> 4 2017-03-01 3 Southern Cross Station 4
#> 5 2017-03-01 4 Southern Cross Station 1
#> 6 2017-03-01 5 Southern Cross Station 96
#> 7 2017-03-01 6 Southern Cross Station 581
#> 8 2017-03-01 7 Southern Cross Station 1847
#> 9 2017-03-01 8 Southern Cross Station 3863
#> 10 2017-03-01 9 Southern Cross Station 2063
#> # ... with 734 more rows, and 2 more variables:
#> # Date_Time <dttm>, Wday <ord>
23 / 45

More on pedestrian data

- revisit

- ggplot2

- southern x 🚉

data: southern-cross
layer:
mapping: x = Hour, y = Counts, colour = Wday
geom: line

ggplot(sx, aes(Hour, Counts, group = Date)) +
  geom_line(aes(colour = Wday))
24 / 45
  • explains activities at certain time points.
  • more spread out in the afternoon than in the morning.
  • All the daily series are overlaid on each other, with colour differentiating weekdays and weekends.

More on pedestrian data

- revisit

- ggplot2

- southern x 🚉

data: southern-cross
layer:
mapping: x = Hour, y = Counts
geom: line
facet: Wday

ggplot(sx, aes(Hour, Counts, group = Date)) +
  geom_line() +
  facet_wrap(~ Wday, ncol = 2)
25 / 45
  • weekday and weekend are not yet a sufficient split.
  • What day would make a weekday behave like a weekend? ---> Labour Day

More on pedestrian data

- revisit

- ggplot2

- southern x 🚉

Expertise and Google in action

26 / 45

More on pedestrian data

- revisit

- ggplot2

- southern x 🚉

labour <- "Labour Day" # 2017-03-13
adele <- "Adele Day" # 2017-03-18 to 19
# Justin Bieber's gig 2017-03-10
sx_more <- sx %>%
  mutate(
    Wday = fct_expand(Wday, labour, adele),
    Wday = if_else(
      Date == ymd("2017-03-13"),
      ordered(labour, levels(Wday)), Wday
    ),
    Wday = if_else(
      Date %in% ymd(c("2017-03-18", "2017-03-19")),
      ordered(adele, levels(Wday)), Wday
    )
  )
head(sx_more)
head(sx_more)
#> # A tibble: 6 x 6
#> Date Hour Sensor_Name Counts
#> <date> <int> <chr> <int>
#> 1 2017-03-01 0 Southern Cross Station 16
#> 2 2017-03-01 1 Southern Cross Station 8
#> 3 2017-03-01 2 Southern Cross Station 3
#> 4 2017-03-01 3 Southern Cross Station 4
#> 5 2017-03-01 4 Southern Cross Station 1
#> 6 2017-03-01 5 Southern Cross Station 96
#> # ... with 2 more variables: Date_Time <dttm>, Wday <ord>
27 / 45

transform the data

More on pedestrian data

- revisit

- ggplot2

- southern x 🚉

data: southern-cross-more
layer:
mapping: x = Hour, y = Counts
geom: line
facet: Wday

ggplot(sx_more, aes(Hour, Counts, group = Date)) +
  geom_line() +
  facet_wrap(~ Wday, ncol = 2)
28 / 45

The grammar remains the same, but uses the transformed data.

I've just shown you how to use graphics to explore the data: notice something unexpected, then use your expertise or other resources to explain it and produce another graph.

Has anyone noticed there's another weird day that I haven't explained?

More on pedestrian data

- revisit

- ggplot2

- southern x 🚉

data: southern-cross-more
layer:
mapping: x = Hour, y = Counts
geom: line
facet: Wday
coord: polar

ggplot(sx_more, aes(Hour, Counts, group = Date)) +
  geom_line() +
  facet_wrap(~ Wday, ncol = 2) +
  coord_polar()
29 / 45

Known as a rose plot.

This slide also shows the advantage of using the grammar. Instead of referring to these as a "line" plot and a "rose" plot, the grammar tells you that the difference between the two is simply the coordinate system: one linear, one polar.

polar: periodic behaviour

When you need to decide which plot to pick for your presentation, the choice can be made under the statistical hypothesis testing framework known as visual inference.

Be aware that it provides scientific tools to help decide which display is more powerful.

Your turn

- weather data

otway_more <- otway_tidy %>%
  mutate(
    MONTH = month(DATE, label = TRUE),
    DAY = mday(DATE)
  )
otway_more
otway_more
#> # A tibble: 366 x 9
#> ID DATE PRCP TAVG TMAX TMIN NAVG
#> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 ASN00090015 2016-01-01 0 16.6 20.9 17.5 19.20
#> 2 ASN00090015 2016-01-02 0 17.4 19.5 14.5 17.00
#> 3 ASN00090015 2016-01-03 0 17.5 19.3 16.2 17.75
#> 4 ASN00090015 2016-01-04 0 17.7 20.2 16.7 18.45
#> 5 ASN00090015 2016-01-05 0 17.8 20.6 16.1 18.35
#> 6 ASN00090015 2016-01-06 0 17.1 20.3 16.5 18.40
#> 7 ASN00090015 2016-01-07 0 15.8 19.7 14.8 17.25
#> 8 ASN00090015 2016-01-08 0 15.6 18.8 14.2 16.50
#> 9 ASN00090015 2016-01-09 0 15.7 19.0 11.3 15.15
#> 10 ASN00090015 2016-01-10 0 18.6 25.4 11.9 18.65
#> # ... with 356 more rows, and 2 more variables:
#> # MONTH <ord>, DAY <int>
30 / 45

Your turn

- weather data

- grammar here?

31 / 45

Discuss with your neighbour: what is the graph about, and what grammar is used?

Your turn

- weather data

- grammar here?

- it's ...

data: otway_more
layer:
1. yintercept: year_average
geom: hline
2. mapping: xmin = DAY-, xmax = DAY+, ymin = TMIN, ymax = TMAX
geom: rect
3. mapping: x = DAY, y = NAVG
geom: line
facet: MONTH

32 / 45

Let them do the lab exercise first without explaining the grammar.

70 mins in
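The grammar on the previous slide can be sketched in ggplot2 roughly as follows. This is a sketch, not the exact solution: `year_average` is assumed to be the yearly mean of `NAVG`, and the elided `DAY-`/`DAY+` are taken as `DAY` minus/plus 0.4, which the slide does not specify.

```r
# Assumptions: year_average = yearly mean of NAVG;
# rectangle half-width around DAY = 0.4 (not given on the slide)
year_average <- mean(otway_more$NAVG, na.rm = TRUE)

ggplot(otway_more) +
  # layer 1: horizontal reference line at the yearly average
  geom_hline(yintercept = year_average, colour = "grey50") +
  # layer 2: daily temperature range as rectangles
  geom_rect(
    aes(xmin = DAY - 0.4, xmax = DAY + 0.4, ymin = TMIN, ymax = TMAX),
    fill = "lightblue"
  ) +
  # layer 3: line of the daily average
  geom_line(aes(x = DAY, y = NAVG)) +
  facet_wrap(~ MONTH)
```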

Workflow

* source: R for Data Science

33 / 45
  • This is the workflow we followed in the first two parts
  • The first two sections: transform and visualise
  • the missing component is "model"; I'll leave that to Rob
  • report the findings: communicate using interactive tools

Interactive data visualisation
on the web

34 / 45
  • bringing interactivity to graphs provides a different dimension for looking at the data
  • it serves as both an exploratory and a communication tool

plotly

- ggplotly

p <- sx %>%
  ggplot(aes(Hour, Counts, group = Date)) +
  geom_line() +
  facet_wrap(~ Wday, ncol = 2)
ggplotly(p)
[plotly output: Counts by Hour, Weekday and Weekend panels]
35 / 45
  • how would I know when these "weird" behaviours happened, just by looking at a static plot?
  • plotly
  • instead of having two different copies of code, ggplotly only needs one.
  • the hover info is quite comprehensive using ggplotly.

plotly

- ggplotly

- plot_ly

p1 <- sx %>%
  filter(Wday == "Weekday") %>%
  group_by(Date) %>%
  plot_ly(x = ~Hour, y = ~Counts) %>%
  add_lines()
p2 <- sx %>%
  filter(Wday == "Weekend") %>%
  group_by(Date) %>%
  plot_ly(x = ~Hour, y = ~Counts) %>%
  add_lines()
layout(subplot(p1, p2, shareY = TRUE), showlegend = FALSE)
[plotly output: Counts by Hour, two side-by-side panels]
36 / 45
  • ggplotly is the shortcut
  • the native plot_ly calls replicate the plot.
  • the hover info needs to be customised.

plotly

- ggplotly

- plot_ly

- animation

a10_df <- broom::tidy(zoo::as.zoo(fpp2::a10)) %>%
  mutate(
    year = year(index),
    month = month(index)
  )
p3 <- a10_df %>%
  ggplot(aes(month, value)) +
  geom_line(aes(group = year), alpha = 0.2) +
  geom_line(aes(frame = year, colour = as.factor(year)))
animation_opts(
  ggplotly(p3), frame = 1000, easing = "elastic"
)
[plotly animation: monthly value by year (1991-2008), with a Play button]
37 / 45
  • new release of plotly 4.6 comes with a new feature, animation.
  • explain the code.

bind all together

- shiny app

wanderer4melb (click me) is a shiny app for visualising Melbourne pedestrian and weather data in 2016.

demo

# install.packages("devtools")
devtools::install_github("earowang/wanderer4melb")
library(wanderer4melb)
launch_app()
38 / 45
  • let's bind what you've learned together into a shiny app.
  • tidy data, grammar of graphics, and plotly
  • I'm showing a demo
  • inspiration: we know foot traffic is largely determined by temporal components, for example ..., and special events. Do any other factors affect the number of pedestrians too, like temperature and rainfall? If it's extremely cold with heavy rain or a storm, would you hang out in the city?
  • explain the layout and the graphs.

bind all together

- shiny app

- sugrrants

sugrrants 🐜 is an R package (under development) that supports graphics for analysing time series data.

devtools::install_github("earowang/sugrrants")
library(sugrrants)
  • It fits into the tidyverse and grammar of graphics suite of methods for exploring temporal data.
  • Calendar-based graphics are useful for visualising people's daily schedules, and frame_calendar 🗓 is made available for this.

If you find a bug or want to suggest a new feature, please report/propose it on the GitHub page. Thanks.

39 / 45

bind all together

- shiny app

- sugrrants

- calendar plot

frame_calendar rearranges the data into a calendar format using linear algebra tools.

sx_cal <- sx %>%
  frame_calendar(
    x = Hour, y = Counts, date = Date, nrow = 1, ncol = 1
  )
sx_cal
sx_cal
#> # A tibble: 744 x 9
#> Date Hour Sensor_Name Counts
#> <date> <int> <chr> <int>
#> 1 2017-03-01 0 Southern Cross Station 16
#> 2 2017-03-01 1 Southern Cross Station 8
#> 3 2017-03-01 2 Southern Cross Station 3
#> 4 2017-03-01 3 Southern Cross Station 4
#> 5 2017-03-01 4 Southern Cross Station 1
#> 6 2017-03-01 5 Southern Cross Station 96
#> 7 2017-03-01 6 Southern Cross Station 581
#> 8 2017-03-01 7 Southern Cross Station 1847
#> 9 2017-03-01 8 Southern Cross Station 3863
#> 10 2017-03-01 9 Southern Cross Station 2063
#> # ... with 734 more rows, and 5 more variables:
#> # Date_Time <dttm>, Wday <ord>, .group_id <dbl>,
#> # .x <dbl>, .y <dbl>
40 / 45

bind all together

- shiny app

- sugrrants

- calendar plot

ggplot2 takes care of plotting a data.frame or tibble as usual.

p_sx <- sx_cal %>%
  ggplot(aes(.x, .y, group = .group_id, colour = Wday)) +
  geom_line()
p_sx

41 / 45

bind all together

- shiny app

- sugrrants

- calendar plot

prettify takes a ggplot object and then makes the calendar plot more readable.

prettify(p_sx)

42 / 45

Done! 🎉 🎊 hmm... not yet?

43 / 45

Acknowledgements

  • Slides powered by the R package xaringan for remark.js and R Markdown
    • The source files to reproduce the slides are available here.
    • Yihui is giving a tutorial on R Markdown Ecosystem (check it out).
  • The R packages used for the slides to make analysis easier and more fun: ggmap, plotly, stringr, forcats, forecast, tidyverse, lubridate, broom, zoo, shiny, emo
44 / 45

This work is licensed under
Creative Commons

45 / 45
