class: center, middle, inverse, title-slide

# Tidy data and statistical visualisation to support exploration of temporal data with R

### Earo Wang

### May 29, 2017

Slides on http://bit.ly/tsvis2017
---

## Agenda

* Why `tidyverse`? Tidy data
* What does `gg` in `ggplot2` stand for? Grammar of graphics
* Interactive data visualisation on the web
  + plotly: `ggplotly()`, `plot_ly`, animation
  + shiny app: `wanderer4melb`

???

Let me introduce myself a little bit. I'm currently doing my PhD at Monash, working with Di and Rob. My research is about visualising temporal data. This is what I'm going to cover for the next hour and a half.

First, I'll talk about why `tidyverse` is great. `tidyverse` is a collection of R packages that centre around the "tidy data" concept. We'll learn some verbs/functions that do the data wrangling needed to get tidy temporal data. As data analysts ...

Second, we'll learn about `ggplot2`, which handles the plotting. You may wonder why it's called `ggplot2` and what `gg` means. It stands for the grammar of graphics. I'll talk about how the `ggplot2` functions, powered by the grammar, help us visualise time series data.

In the final bit, I'm going to show you some demos of interactive graphics using `plotly` and `shiny`.

---

background-image: url(img/stickers.001.png)
background-size: 80%
background-position: 60% 90%

## R packages

* `tidyverse`: a collection of R packages surrounding "tidy data"
* `stringr`: handle string manipulation
* `forcats`: handle categorical variables
* `lubridate`: make handling date-times easier
* `plotly`: create web-based visualisations
* `shiny`: build interactive web applications
* `knitr`: provide tools for dynamic report generation
* `devtools`: help with R package development

???

Here's a list of R packages that we're going to use for this part. As I said before, `tidyverse` is a set of packages including `ggplot2`, `dplyr`, `readr`, etc. `stringr` is for strings and `forcats` for categorical variables. `lubridate` makes dealing with dates easier. `plotly` and `shiny` are for interactive graphics on the web.

---

class: inverse middle center

# Tidy data

???

Okay. Now, let's start with tidy data.
---

class: middle

## What is tidy data?

* Each variable forms a column.
* Each observation forms a row.
* Each type of observational unit forms a table.

???

What is tidy data? What makes a dataset tidy? The data structure is a rectangular cases-by-variables layout that underlies the `tidyverse`. I'll use three datasets to explain what tidy data actually means, and how important tidy data is for further data analysis and visualisation.

---

.left-column[
## Melbourne pedestrian data

### - sensor data
]

.right-column[

```r
ped_loc <- read_csv("data/sensor_locations.csv")
ped_loc %>%
  select(
    `Sensor ID`, `Sensor Description`,
    Longitude, Latitude
  )
```

```
#> # A tibble: 43 x 4
#>    `Sensor ID`            `Sensor Description` Longitude
#>          <int>                           <chr>     <dbl>
#> 1           22 Flinders St-Elizabeth St (East)  144.9651
#> 2           34            Flinders St-Spark La  144.9742
#> 3           11                 Waterfront City  144.9396
#> 4            8                     Webb Bridge  144.9472
#> 5            7                  Birrarung Marr  144.9714
#> 6           13               Flagstaff Station  144.9566
#> 7           15                   State Library  144.9645
#> 8           27               QV Market-Peel St  144.9566
#> 9           12                        New Quay  144.9429
#> 10          24   Spencer St-Collins St (North)  144.9545
#> # ... with 33 more rows, and 1 more variables:
#> #   Latitude <dbl>
```

.footnote[.red[*] source: [the city of Melbourne](http://www.pedestrian.melbourne.vic.gov.au)]
]

???

The first dataset we're going to look at is the pedestrian data, sourced from the Melbourne Open Data Portal. You can actually click here to check out the web page that hosts the dataset. Since 2009, the city of Melbourne has been installing sensors that capture the foot traffic every hour. This data can be used for urban planning or business management. For example, if you're a cafe owner, you could look at the hourly traffic to decide your trading hours. To date, 43 sensors have been installed across the city.

Here, I read the sensor locations data into R using `read_csv` from the `readr` package. Rob has talked about the pipe operator.
I pass the data to the `select` function and select four columns, that is ...

---

.left-column[
## Melbourne pedestrian data

### - sensor data
### - sensor map
]

.right-column[
<img src="figure/ped-map-1.png" style="display: block; margin: auto;" />
]

???

Since we know each sensor's longitude and latitude, we can plot the sensors on a map of Melbourne. It clearly shows their locations.

---

.left-column[
## Melbourne pedestrian data

### - sensor data
### - sensor map
### - hourly counts
]

.right-column[

```r
ped_2017 <- read_csv("data/pedestrian_03_2017.csv")
ped_2017
```

```
#> # A tibble: 744 x 45
#>          Date  Hour `State Library` `Collins Place (South)`
#>         <chr> <int>           <int>                   <int>
#> 1  01/03/2017     0             140                      36
#> 2  01/03/2017     1              64                      17
#> 3  01/03/2017     2              29                      11
#> 4  01/03/2017     3              13                       9
#> 5  01/03/2017     4              13                      10
#> 6  01/03/2017     5              31                      84
#> 7  01/03/2017     6              92                     252
#> 8  01/03/2017     7             327                     767
#> 9  01/03/2017     8             908                    1997
#> 10 01/03/2017     9             775                    1319
#> # ... with 734 more rows, and 41 more variables: `Collins
#> #   Place (North)` <int>, `Flagstaff Station` <int>,
#> #   `Melbourne Central` <int>, `Town Hall (West)` <int>,
#> #   `Bourke Street Mall (North)` <int>, `Bourke Street Mall
#> #   (South)` <int>, `Australia on Collins` <int>, `Southern
#> #   Cross Station` <int>, `Victoria Point` <int>, `New
#> #   Quay` <int>, `Waterfront City` <int>, `Webb
#> #   Bridge` <int>, `Princes Bridge` <int>, `Flinders St
#> #   Station Underpass` <int>, `Sandridge Bridge` <int>,
#> #   `Birrarung Marr` <int>, `QV Market-Elizabeth
#> #   (West)` <int>, `Flinders St-Elizabeth St (East)` <int>,
#> #   `Spencer St-Collins St (North)` <int>, `Spencer
#> #   St-Collins St (South)` <int>, `Bourke St-Russell St
#> #   (West)` <int>, `Convention/Exhibition Centre` <int>,
#> #   `Chinatown-Swanston St (North)` <int>, `Chinatown-Lt
#> #   Bourke St (South)` <int>, `QV Market-Peel St` <int>,
#> #   `Vic Arts Centre` <int>, `Lonsdale St (South)` <int>,
#> #   `Lygon St (West)` <int>, `Flinders St-Spring St
#> #   (West)` <int>, `Flinders St-Spark Lane` <int>, `Alfred
#> #   Place` <int>, `Queen Street (West)` <int>, `Lygon
#> #   Street (East)` <int>, `Flinders St-Swanston St
#> #   (West)` <int>, `Spring St-Lonsdale St (South)` <int>,
#> #   `City Square` <int>, `St. Kilda-Alexandra
#> #   Gardens` <int>, `Grattan St-Swanston St (West)` <int>,
#> #   `Monash Rd-Swanston St (West)` <int>, `Tin
#> #   Alley-Swanston St (West)` <int>, Southbank <int>
```
]

???

Besides the locations, we're more interested in the hourly pedestrian counts at every sensor. I read in the second csv file, which contains the pedestrian counts for March. This data has 744 observations and 45 columns. ... Date is read in as character ... Let me refer to this kind of format as the wide format, in contrast to the long format. I'll explain what I mean by wide and long later.

---

background-image: url(img/verb.001.png)
background-size: 85%

.left-column[
## Melbourne pedestrian data

### - sensor data
### - sensor map
### - hourly counts
### - tidy data
]

.right-column[
]

???

The remedy is converting the wide format to the long format. The top data table is what we have: we gather the column headers into one key variable and the counts into a value variable. Instead of more than 40 columns, we then have four variables to work with. Looking at the long form, what the variables are becomes clearer than in the wide format: each variable forms a column. The wide and long forms, supported by `tidyr`, organise the same data in two different ways; to be consistent, the tidy long form is used.
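---

## Wide to long: a toy sketch

Before gathering the real pedestrian data, here is a minimal, made-up example of the wide-to-long reshaping just described (the numbers and the `toy_*` object names are invented for illustration):

```r
library(dplyr)
library(tidyr)

# A tiny "wide" table: one column per sensor
toy_wide <- tibble(
  Hour = c(0, 1),
  `State Library` = c(140, 64),
  Southbank = c(25, 10)
)

# gather() stacks the sensor columns into a key column
# (Sensor_Name) and a value column (Counts)
toy_long <- toy_wide %>%
  gather(key = Sensor_Name, value = Counts, `State Library`:Southbank)
toy_long
#> # A tibble: 4 x 3
```

Two rows and two sensor columns become four rows; the sensor names, previously hidden in the column headers, are now a variable of their own.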
--- .left-column[ ## Melbourne pedestrian data πΆββοΈ ### - sensor data ### - sensor map ### - hourly counts ### - tidy data ] .right-column[ ```r ped_long <- ped_2017 %>% gather( key = Sensor_Name, value = Counts, `State Library`:Southbank ) %>% mutate( Date_Time = dmy_hms(paste(Date, Hour, "00:00")), Date = dmy(Date) ) ped_long ``` ``` #> # A tibble: 31,992 x 5 #> Date Hour Sensor_Name Counts #> <date> <int> <chr> <int> #> 1 2017-03-01 0 State Library 140 #> 2 2017-03-01 1 State Library 64 #> 3 2017-03-01 2 State Library 29 #> 4 2017-03-01 3 State Library 13 #> 5 2017-03-01 4 State Library 13 #> 6 2017-03-01 5 State Library 31 #> 7 2017-03-01 6 State Library 92 #> 8 2017-03-01 7 State Library 327 #> 9 2017-03-01 8 State Library 908 #> 10 2017-03-01 9 State Library 775 #> # ... with 31,982 more rows, and 1 more variables: #> # Date_Time <dttm> ``` ] ??? Any questions so far? --- .left-column[ ## Otway weather data βοΈ ### - what variables? ] .right-column[ ```r otway_weather <- read_csv("data/weather_2016.csv") head(otway_weather) ``` ``` #> # A tibble: 6 x 35 #> ID YEAR MONTH ELEMENT VALUE1 VALUE2 VALUE3 #> <chr> <int> <chr> <chr> <int> <int> <int> #> 1 ASN00090015 2016 01 TMAX 209 195 193 #> 2 ASN00090015 2016 01 TMIN 175 145 162 #> 3 ASN00090015 2016 01 PRCP 0 0 0 #> 4 ASN00090015 2016 01 TAVG 166 174 175 #> 5 ASN00090015 2016 02 TMAX 217 239 185 #> 6 ASN00090015 2016 02 TMIN 120 146 149 #> # ... 
with 28 more variables: VALUE4 <int>, VALUE5 <int>, #> # VALUE6 <int>, VALUE7 <int>, VALUE8 <int>, VALUE9 <int>, #> # VALUE10 <int>, VALUE11 <int>, VALUE12 <int>, #> # VALUE13 <int>, VALUE14 <int>, VALUE15 <int>, #> # VALUE16 <int>, VALUE17 <int>, VALUE18 <int>, #> # VALUE19 <int>, VALUE20 <int>, VALUE21 <int>, #> # VALUE22 <int>, VALUE23 <int>, VALUE24 <int>, #> # VALUE25 <int>, VALUE26 <int>, VALUE27 <int>, #> # VALUE28 <int>, VALUE29 <int>, VALUE30 <int>, #> # VALUE31 <int> ``` .footnote[.red[*] source: [global historical climatology network](https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/gsn/)] ] --- .left-column[ ## Otway weather data βοΈ ### - what variables? ### - tidying π ] .right-column[ ```r otway_weather %>% gather(DAY, VALUE, VALUE1:VALUE31) ``` ``` #> # A tibble: 1,488 x 6 #> ID YEAR MONTH ELEMENT DAY VALUE #> <chr> <int> <chr> <chr> <chr> <int> #> 1 ASN00090015 2016 01 TMAX VALUE1 209 #> 2 ASN00090015 2016 01 TMIN VALUE1 175 #> 3 ASN00090015 2016 01 PRCP VALUE1 0 #> 4 ASN00090015 2016 01 TAVG VALUE1 166 #> 5 ASN00090015 2016 02 TMAX VALUE1 217 #> 6 ASN00090015 2016 02 TMIN VALUE1 120 #> 7 ASN00090015 2016 02 PRCP VALUE1 2 #> 8 ASN00090015 2016 02 TAVG VALUE1 187 #> 9 ASN00090015 2016 03 TMAX VALUE1 243 #> 10 ASN00090015 2016 03 TMIN VALUE1 172 #> # ... with 1,478 more rows ``` ] --- .left-column[ ## Otway weather data βοΈ ### - what variables? 
### - tidying π ] .right-column[ ```r otway_weather %>% gather(DAY, VALUE, VALUE1:VALUE31) %>% mutate( DAY = str_sub(DAY, start = 6), DATE = ymd(paste(YEAR, MONTH, DAY, sep = "-")) ) %>% arrange(DATE) %>% select(ID, DATE, ELEMENT, VALUE) %>% filter(!(is.na(DATE))) ``` ``` #> # A tibble: 1,464 x 4 #> ID DATE ELEMENT VALUE #> <chr> <date> <chr> <int> #> 1 ASN00090015 2016-01-01 TMAX 209 #> 2 ASN00090015 2016-01-01 TMIN 175 #> 3 ASN00090015 2016-01-01 PRCP 0 #> 4 ASN00090015 2016-01-01 TAVG 166 #> 5 ASN00090015 2016-01-02 TMAX 195 #> 6 ASN00090015 2016-01-02 TMIN 145 #> 7 ASN00090015 2016-01-02 PRCP 0 #> 8 ASN00090015 2016-01-02 TAVG 174 #> 9 ASN00090015 2016-01-03 TMAX 193 #> 10 ASN00090015 2016-01-03 TMIN 162 #> # ... with 1,454 more rows ``` ] --- .left-column[ ## Otway weather data βοΈ ### - what variables? ### - tidying π ] .right-column[ ```r otway_weather %>% gather(DAY, VALUE, VALUE1:VALUE31) %>% mutate( DAY = str_sub(DAY, start = 6), DATE = ymd(paste(YEAR, MONTH, DAY, sep = "-")) ) %>% arrange(DATE) %>% select(ID, DATE, ELEMENT, VALUE) %>% filter(!(is.na(DATE))) %>% mutate( VALUE = if_else(VALUE < -999, NA_integer_, VALUE), VALUE = VALUE / 10 ) %>% spread(ELEMENT, VALUE) ``` ``` #> # A tibble: 366 x 6 #> ID DATE PRCP TAVG TMAX TMIN #> * <chr> <date> <dbl> <dbl> <dbl> <dbl> #> 1 ASN00090015 2016-01-01 0 16.6 20.9 17.5 #> 2 ASN00090015 2016-01-02 0 17.4 19.5 14.5 #> 3 ASN00090015 2016-01-03 0 17.5 19.3 16.2 #> 4 ASN00090015 2016-01-04 0 17.7 20.2 16.7 #> 5 ASN00090015 2016-01-05 0 17.8 20.6 16.1 #> 6 ASN00090015 2016-01-06 0 17.1 20.3 16.5 #> 7 ASN00090015 2016-01-07 0 15.8 19.7 14.8 #> 8 ASN00090015 2016-01-08 0 15.6 18.8 14.2 #> 9 ASN00090015 2016-01-09 0 15.7 19.0 11.3 #> 10 ASN00090015 2016-01-10 0 18.6 25.4 11.9 #> # ... with 356 more rows ``` ] --- .left-column[ ## Otway weather data βοΈ ### - what variables? 
### - tidying π ] .right-column[ ```r otway_tidy <- otway_weather %>% gather(DAY, VALUE, VALUE1:VALUE31) %>% mutate( DAY = str_sub(DAY, start = 6), DATE = ymd(paste(YEAR, MONTH, DAY, sep = "-")) ) %>% arrange(DATE) %>% select(ID, DATE, ELEMENT, VALUE) %>% filter(!(is.na(DATE))) %>% mutate( VALUE = if_else(VALUE < -999, NA_integer_, VALUE), VALUE = VALUE / 10 ) %>% spread(ELEMENT, VALUE) %>% mutate(NAVG = (TMAX + TMIN) / 2) head(otway_tidy) ``` ``` #> # A tibble: 6 x 7 #> ID DATE PRCP TAVG TMAX TMIN NAVG #> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> #> 1 ASN00090015 2016-01-01 0 16.6 20.9 17.5 19.20 #> 2 ASN00090015 2016-01-02 0 17.4 19.5 14.5 17.00 #> 3 ASN00090015 2016-01-03 0 17.5 19.3 16.2 17.75 #> 4 ASN00090015 2016-01-04 0 17.7 20.2 16.7 18.45 #> 5 ASN00090015 2016-01-05 0 17.8 20.6 16.1 18.35 #> 6 ASN00090015 2016-01-06 0 17.1 20.3 16.5 18.40 ``` ] ??? TAVG: UTC time zone rather than local time --- .left-column[ ## Your turn ### - billboard data ] .right-column[ `billboard.csv` records the date a song first entered the Billboard Top 100 in 2000 and its rank over 76 weeks. ``` #> # A tibble: 6 x 81 #> year artist track #> <int> <chr> <chr> #> 1 2000 Backstreet Boys, The Shape Of My Heart #> 2 2000 Backstreet Boys, The Show Me The Meaning ... #> 3 2000 Backstreet Boys, The The One #> 4 2000 N'Sync Bye Bye Bye #> 5 2000 N'Sync It's Gonna Be Me #> 6 2000 N'Sync This I Promise You #> # ... 
with 78 more variables: time <time>, #> # date.entered <date>, `1` <int>, `2` <int>, `3` <int>, #> # `4` <int>, `5` <int>, `6` <int>, `7` <int>, `8` <int>, #> # `9` <int>, `10` <int>, `11` <int>, `12` <int>, #> # `13` <int>, `14` <int>, `15` <int>, `16` <int>, #> # `17` <int>, `18` <int>, `19` <int>, `20` <int>, #> # `21` <int>, `22` <int>, `23` <int>, `24` <int>, #> # `25` <int>, `26` <int>, `27` <chr>, `28` <chr>, #> # `29` <chr>, `30` <chr>, `31` <chr>, `32` <chr>, #> # `33` <chr>, `34` <chr>, `35` <chr>, `36` <chr>, #> # `37` <chr>, `38` <chr>, `39` <chr>, `40` <chr>, #> # `41` <chr>, `42` <chr>, `43` <chr>, `44` <chr>, #> # `45` <chr>, `46` <chr>, `47` <chr>, `48` <chr>, #> # `49` <chr>, `50` <chr>, `51` <chr>, `52` <chr>, #> # `53` <chr>, `54` <chr>, `55` <chr>, `56` <chr>, #> # `57` <chr>, `58` <chr>, `59` <chr>, `60` <chr>, #> # `61` <chr>, `62` <chr>, `63` <chr>, `64` <chr>, #> # `65` <chr>, `66` <chr>, `67` <chr>, `68` <chr>, #> # `69` <chr>, `70` <chr>, `71` <chr>, `72` <chr>, #> # `73` <chr>, `74` <chr>, `75` <chr>, `76` <chr> ``` ] ??? * Old dataset * A song first time entered to the billboard Top 100, when it's entered * The position and if NA, dropped from the Top 100 --- .left-column[ ## Your turn ### - billboard data ### - end products ] .right-column[ ```r song ``` ``` #> # A tibble: 6 x 4 #> id artist track #> <int> <chr> <chr> #> 1 1 Backstreet Boys, The Shape Of My Heart #> 2 2 Backstreet Boys, The Show Me The Meaning ... #> 3 3 Backstreet Boys, The The One #> 4 4 N'Sync Bye Bye Bye #> 5 5 N'Sync It's Gonna Be Me #> 6 6 N'Sync This I Promise You #> # ... with 1 more variables: time <time> ``` ```r rank ``` ``` #> # A tibble: 456 x 4 #> id date.entered week rank #> <int> <date> <chr> <chr> #> 1 1 2000-10-14 1 39 #> 2 2 2000-01-01 1 74 #> 3 3 2000-05-27 1 58 #> 4 4 2000-01-29 1 42 #> 5 5 2000-05-06 1 82 #> 6 6 2000-09-30 1 68 #> 7 1 2000-10-14 2 25 #> 8 2 2000-01-01 2 62 #> 9 3 2000-05-27 2 50 #> 10 4 2000-01-29 2 20 #> # ... 
with 446 more rows
```
]

???

* convert wide to long
* split into two data tables
* Follows the third point of the tidy data principles: each type of observational unit forms a table.
* 40 mins get done

---

class: inverse middle center

# Grammar of graphics

???

* a basic understanding of the tidy data structure matters
* because it builds a solid foundation for statistical visualisation

---

class: middle

## What is grammar of graphics?

* In 1999, "The grammar of graphics" by Leland Wilkinson
* In 2008, `ggplot2` by Hadley Wickham
* A conceptual framework for mapping variables to graphical elements
* The analogue of doing a functional mapping of random variables. For example, for `\(n\)` independent and identically distributed RVs `\(X_1, \ldots, X_n\)`, the mean and the standard deviation are defined as `$$\bar{X} = \frac{1}{n} \sum_{i = 1}^{n} X_{i} \qquad S_{n-1} = \sqrt{\frac{1}{n-1} \sum_{i = 1}^{n} \left(X_i - \bar{X} \right)^2}.$$`

???

We're going to use some short but comprehensive vocabulary to describe different sorts of graphs.

---

.left-column[
## More on pedestrian data

### - revisit
]

.right-column[

```r
ped_long
```

```
#> # A tibble: 31,992 x 5
#>          Date  Hour   Sensor_Name Counts
#>        <date> <int>         <chr>  <int>
#> 1  2017-03-01     0 State Library    140
#> 2  2017-03-01     1 State Library     64
#> 3  2017-03-01     2 State Library     29
#> 4  2017-03-01     3 State Library     13
#> 5  2017-03-01     4 State Library     13
#> 6  2017-03-01     5 State Library     31
#> 7  2017-03-01     6 State Library     92
#> 8  2017-03-01     7 State Library    327
#> 9  2017-03-01     8 State Library    908
#> 10 2017-03-01     9 State Library    775
#> # ...
with 31,982 more rows, and 1 more variables:
#> #   Date_Time <dttm>
```
]

---

.left-column[
## More on pedestrian data

### - revisit
### - `ggplot2`
]

.right-column[

```
data: ped_long
layer:
  mapping: x = Date_Time, y = Counts
  geom: line, point
facet: Sensor_Name
```

<img src="figure/ped-ggplot-1-1.svg" style="display: block; margin: auto;" />

```r
ggplot(ped_long, aes(x = Date_Time, y = Counts)) +
  geom_line() +
  geom_point() +
  facet_grid(Sensor_Name ~ ., scales = "free_y")
```
]

???

* `autoplot`: plots against the time index
* This is how we use the graphical grammar to describe the plot
* Start with a tidy dataset with clearly defined variables.
* geoms and other aesthetics make the graph perceivable.
* facet: trellis plots or lattice plots.
* scales: local, for comparing shapes instead of magnitudes.

---

.left-column[
## More on pedestrian data

### - revisit
### - `ggplot2`
]

.right-column[

```
data: ped_long
layer:
  mapping: x = Date_Time, y = Counts, colour = Sensor_Name
  geom: line, point
facet: Sensor_Name
```

<img src="figure/ped-ggplot-2-1.svg" style="display: block; margin: auto;" />

```r
ggplot(ped_long, aes(x = Date_Time, y = Counts)) +
  geom_line(aes(colour = Sensor_Name)) +
  geom_point(aes(colour = Sensor_Name)) +
  facet_grid(Sensor_Name ~ ., scales = "free_y")
```
]

???

* colour is also part of the aes mapping.
* it does not add additional information in this case.
* the pedestrian data is interesting and rich
* take Southern Cross as an example to explore
* it seems that Southern Cross has distinctive patterns between weekdays and weekends

---

.left-column[
## More on pedestrian data

### - revisit
### - `ggplot2`
### - southern x
]

.right-column[

```r
wday <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
sx <- ped_long %>%
  filter(Sensor_Name == "Southern Cross Station") %>%
  mutate(
    Wday = wday(Date, label = TRUE, abbr = FALSE),
    Wday = if_else(Wday %in% wday, "Weekday", "Weekend"),
    Wday = ordered(Wday)
  )
sx
```

```
#> # A tibble: 744 x 6
#>          Date  Hour            Sensor_Name Counts
#>        <date> <int>                  <chr>  <int>
#> 1  2017-03-01     0 Southern Cross Station     16
#> 2  2017-03-01     1 Southern Cross Station      8
#> 3  2017-03-01     2 Southern Cross Station      3
#> 4  2017-03-01     3 Southern Cross Station      4
#> 5  2017-03-01     4 Southern Cross Station      1
#> 6  2017-03-01     5 Southern Cross Station     96
#> 7  2017-03-01     6 Southern Cross Station    581
#> 8  2017-03-01     7 Southern Cross Station   1847
#> 9  2017-03-01     8 Southern Cross Station   3863
#> 10 2017-03-01     9 Southern Cross Station   2063
#> # ... with 734 more rows, and 2 more variables:
#> #   Date_Time <dttm>, Wday <ord>
```
]

---

.left-column[
## More on pedestrian data

### - revisit
### - `ggplot2`
### - southern x
]

.right-column[

```
data: southern-cross
layer:
  mapping: x = Hour, y = Counts, colour = Wday
  geom: line
```

<img src="figure/x-colour-1.svg" style="display: block; margin: auto;" />

```r
ggplot(sx, aes(Hour, Counts, group = Date)) +
  geom_line(aes(colour = Wday))
```
]

???

* explain activities at some time points.
* more spread out in the afternoon than in the morning.
* All the daily series are overlaid on each other, with colour differentiating weekdays and weekends.
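---

.left-column[
## More on pedestrian data

### - revisit
### - `ggplot2`
### - southern x
]

.right-column[
A sketch of one way to quantify the weekday/weekend contrast: average the counts by hour within each group. This assumes the `sx` data frame defined on the previous slide; the object name `sx_profile` is mine.

```r
# Average hourly profile for each of Weekday and Weekend
sx_profile <- sx %>%
  group_by(Wday, Hour) %>%
  summarise(Avg_Counts = mean(Counts, na.rm = TRUE))

ggplot(sx_profile, aes(Hour, Avg_Counts, colour = Wday)) +
  geom_line()
```

The averaged profiles smooth out day-to-day noise, at the cost of hiding unusual days.
]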
---

.left-column[
## More on pedestrian data

### - revisit
### - `ggplot2`
### - southern x
]

.right-column[

```
data: southern-cross
layer:
  mapping: x = Hour, y = Counts
  geom: line
facet: Wday
```

<img src="figure/x-facet-1.svg" style="display: block; margin: auto;" />

```r
ggplot(sx, aes(Hour, Counts, group = Date)) +
  geom_line() +
  facet_wrap(~ Wday, ncol = 2)
```
]

???

* the weekday and weekend categories are not yet sufficient.
* what day would make a weekday behave like a weekend? ---> Labour Day

---

.left-column[
## More on pedestrian data

### - revisit
### - `ggplot2`
### - southern x
]

.right-column[
Expertise and Google in action

![](img/adele.png)
]

---

.left-column[
## More on pedestrian data

### - revisit
### - `ggplot2`
### - southern x
]

.right-column[

```r
labour <- "Labour Day" # 2017-03-13
adele <- "Adele Day" # 2017-03-18 to 19
# Justin Bieber's gig 2017-03-10
sx_more <- sx %>%
  mutate(
    Wday = fct_expand(Wday, labour, adele),
    Wday = if_else(
      Date == ymd("2017-03-13"),
      ordered(labour, levels(Wday)), Wday
    ),
    Wday = if_else(
      Date %in% ymd(c("2017-03-18", "2017-03-19")),
      ordered(adele, levels(Wday)), Wday
    )
  )
head(sx_more)
```

```
#> # A tibble: 6 x 6
#>         Date  Hour            Sensor_Name Counts
#>       <date> <int>                  <chr>  <int>
#> 1 2017-03-01     0 Southern Cross Station     16
#> 2 2017-03-01     1 Southern Cross Station      8
#> 3 2017-03-01     2 Southern Cross Station      3
#> 4 2017-03-01     3 Southern Cross Station      4
#> 5 2017-03-01     4 Southern Cross Station      1
#> 6 2017-03-01     5 Southern Cross Station     96
#> # ... with 2 more variables: Date_Time <dttm>, Wday <ord>
```
]

???

transform the data

---

.left-column[
## More on pedestrian data

### - revisit
### - `ggplot2`
### - southern x
]

.right-column[

```
data: southern-cross-more
layer:
  mapping: x = Hour, y = Counts
  geom: line
facet: Wday
```

<img src="figure/x-more-p-1.svg" style="display: block; margin: auto;" />

```r
ggplot(sx_more, aes(Hour, Counts, group = Date)) +
  geom_line() +
  facet_wrap(~ Wday, ncol = 2)
```
]

???
The grammar remains the same, but it uses the transformed data.

I've just shown you how to use the graphics to explore the data: notice something unexpected, then use your expertise or other resources to explain it and produce another graph.

Has anyone noticed there's another weird day that I haven't explained?

---

.left-column[
## More on pedestrian data

### - revisit
### - `ggplot2`
### - southern x
]

.right-column[

```
data: southern-cross-more
layer:
  mapping: x = Hour, y = Counts
  geom: line
facet: Wday
coord: polar
```

<img src="figure/x-polar-1.svg" style="display: block; margin: auto;" />

```r
ggplot(sx_more, aes(Hour, Counts, group = Date)) +
  geom_line() +
  facet_wrap(~ Wday, ncol = 2) +
  coord_polar()
```
]

???

This is also known as a rose plot.

This slide also shows the advantage of using the grammar. Instead of referring to these as a line plot and a rose plot, the grammar tells you that the difference between the two plots is simply the coordinate system: one linear, one polar. Polar coordinates emphasise periodic behaviour.

When you need to decide which plot to pick for your presentation, the choice can be made under a statistical hypothesis testing framework known as visual inference. Be aware that it provides scientific tools to help determine which display is more powerful.
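---

.left-column[
## More on pedestrian data

### - revisit
### - `ggplot2`
### - southern x
]

.right-column[
The unusual days above were found by eye plus Google. A rough numeric screen is also possible: compare each date's daily total against a robust centre for its group. This is a sketch assuming the `sx` data frame from earlier; the `2 * mad` cut-off and the name `sx_daily` are arbitrary choices of mine.

```r
sx_daily <- sx %>%
  group_by(Date, Wday) %>%
  summarise(Total = sum(Counts, na.rm = TRUE)) %>%
  group_by(Wday) %>%
  # Flag dates whose total is far from the group's median
  mutate(Unusual = abs(Total - median(Total)) > 2 * mad(Total))

sx_daily %>% filter(Unusual)
```

A screen like this only suggests candidates; explaining them still takes expertise (or Google).
]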
--- .left-column[ ## Your turn ### - weather data ] .right-column[ ```r otway_more <- otway_tidy %>% mutate( MONTH = month(DATE, label = TRUE), DAY = mday(DATE) ) otway_more ``` ``` #> # A tibble: 366 x 9 #> ID DATE PRCP TAVG TMAX TMIN NAVG #> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> #> 1 ASN00090015 2016-01-01 0 16.6 20.9 17.5 19.20 #> 2 ASN00090015 2016-01-02 0 17.4 19.5 14.5 17.00 #> 3 ASN00090015 2016-01-03 0 17.5 19.3 16.2 17.75 #> 4 ASN00090015 2016-01-04 0 17.7 20.2 16.7 18.45 #> 5 ASN00090015 2016-01-05 0 17.8 20.6 16.1 18.35 #> 6 ASN00090015 2016-01-06 0 17.1 20.3 16.5 18.40 #> 7 ASN00090015 2016-01-07 0 15.8 19.7 14.8 17.25 #> 8 ASN00090015 2016-01-08 0 15.6 18.8 14.2 16.50 #> 9 ASN00090015 2016-01-09 0 15.7 19.0 11.3 15.15 #> 10 ASN00090015 2016-01-10 0 18.6 25.4 11.9 18.65 #> # ... with 356 more rows, and 2 more variables: #> # MONTH <ord>, DAY <int> ``` ] --- .left-column[ ## Your turn ### - weather data ### - grammar here? ] .right-column[ <img src="figure/otway-month-a-1.svg" style="display: block; margin: auto;" /> ] ??? Discuss with your neighbour what's the graph about and what's the grammar used? --- .left-column[ ## Your turn ### - weather data ### - grammar here? ### - it's ... ] .right-column[ ``` data: otway_more layer: 1. yintercept: year_average geom: hline 2. mapping: xmin = DAY-, xmax = DAY+, ymin = TMIN, ymax = TMAX geom: rect 3. mapping: x = DAY, y = NAVG geom: line facet: MONTH ``` <img src="figure/otway-month-b-1.svg" style="display: block; margin: auto;" /> ] ??? Let them do the lab exercise first without explaining the grammar. 70 mins get done --- background-image: url(img/data-science.png) background-size: 70% background-position: center middle ## Workflow .footnote[.red[*] source: [R for Data Science](http://r4ds.had.co.nz/introduction.html)] ??? 
* This is the workflow that we followed in the first two parts
* The first two sections: transform and visualise
* The missing component is "model", which is left to Rob
* report the findings: communicate using interactive tools

---

class: inverse middle center

# Interactive data visualisation <br> on the web

???

* bringing interactivity to graphs provides a different dimension for looking at the data
* it serves as both an exploratory and a communication tool

---

.left-column[
## plotly

### - `ggplotly`
]

.right-column[

```r
p <- sx %>%
  ggplot(aes(Hour, Counts, group = Date)) +
  geom_line() +
  facet_wrap(~ Wday, ncol = 2)
ggplotly(p)
```
]

???

* How do I know when this "weird" behaviour happened? It's hard to tell from a static plot alone.
* `plotly`
* instead of maintaining two different copies of code, `ggplotly` only needs one.
* the hover info is quite comprehensive with `ggplotly`.

---

.left-column[
## plotly

### - `ggplotly`
### - `plot_ly`
]

.right-column[

```r
p1 <- sx %>%
  filter(Wday == "Weekday") %>%
  group_by(Date) %>%
  plot_ly(x = ~ Hour, y = ~ Counts) %>%
  add_lines()
p2 <- sx %>%
  filter(Wday == "Weekend") %>%
  group_by(Date) %>%
  plot_ly(x = ~ Hour, y = ~ Counts) %>%
  add_lines()
layout(subplot(p1, p2, shareY = TRUE), showlegend = FALSE)
```
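Hover text in `plot_ly` can be customised by hand via the `text` and `hoverinfo` trace attributes. A sketch along those lines, reusing the `sx` data from earlier (the label format here is an arbitrary choice):

```r
sx %>%
  filter(Wday == "Weekday") %>%
  group_by(Date) %>%
  plot_ly(x = ~ Hour, y = ~ Counts) %>%
  add_lines(
    # Show the full weekday and date in the tooltip, plus the y value
    text = ~ format(Date, "%A %d %B"),
    hoverinfo = "text+y"
  )
```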
]

???

* `ggplotly` is the shortcut
* the native `plot_ly` interface replicates the plot.
* the hover info needs to be customised.

---

.left-column[
## plotly

### - `ggplotly`
### - `plot_ly`
### - animation
]

.right-column[

```r
a10_df <- broom::tidy(zoo::as.zoo(fpp2::a10)) %>%
  mutate(
    year = year(index),
    month = month(index)
  )
p3 <- a10_df %>%
  ggplot(aes(month, value)) +
  geom_line(aes(group = year), alpha = 0.2) +
  geom_line(aes(frame = year, colour = as.factor(year)))
animation_opts(
  ggplotly(p3),
  frame = 1000, easing = "elastic"
)
```
]

???

* the new release of plotly 4.6 comes with a new feature: animation.
* explain the code.

---

.left-column[
## bind all together

### - shiny app
]

.right-column[
`wanderer4melb` ([click me](https://github.com/earowang/wanderer4melb)) is a shiny app for visualising Melbourne pedestrian and weather data in 2016.

![demo](img/wanderer4melb.png)

```r
# install.packages("devtools")
devtools::install_github("earowang/wanderer4melb")
library(wanderer4melb)
launch_app()
```
]

???

* let's bind what you've learned together into a shiny app.
* tidy data, grammar of graphics, and plotly
* I'm showing a demo
* inspiration: we know foot traffic is largely determined by temporal components, for example ... and special events. Do any other factors affect the number of pedestrians too, like temperature and rainfall? If it's extremely cold with heavy rain or a storm, would you hang out in the city?
* explain the layout and the graphs.

---

.left-column[
## bind all together

### - shiny app
### - sugrrants
]

.right-column[
`sugrrants` is an R package (under development) that supports graphics for analysing time series data.

```r
devtools::install_github("earowang/sugrrants")
library(sugrrants)
```

* It fits into the tidyverse and grammar of graphics suite of methods for exploring temporal data.
* Calendar-based graphics are useful for visualising people's daily schedules, and `frame_calendar` is made available for this.

If you find a bug or want to suggest a new feature, please report or propose it on the [Github page](https://github.com/earowang/sugrrants/issues). Thanks.
]

???

---

.left-column[
## bind all together

### - shiny app
### - sugrrants
### - calendar plot
]

.right-column[
`frame_calendar` rearranges the data into a calendar format using linear algebra tools.
```r
sx_cal <- sx %>%
  frame_calendar(
    x = Hour, y = Counts, date = Date,
    nrow = 1, ncol = 1
  )
sx_cal
```

```
#> # A tibble: 744 x 9
#>          Date  Hour            Sensor_Name Counts
#>        <date> <int>                  <chr>  <int>
#> 1  2017-03-01     0 Southern Cross Station     16
#> 2  2017-03-01     1 Southern Cross Station      8
#> 3  2017-03-01     2 Southern Cross Station      3
#> 4  2017-03-01     3 Southern Cross Station      4
#> 5  2017-03-01     4 Southern Cross Station      1
#> 6  2017-03-01     5 Southern Cross Station     96
#> 7  2017-03-01     6 Southern Cross Station    581
#> 8  2017-03-01     7 Southern Cross Station   1847
#> 9  2017-03-01     8 Southern Cross Station   3863
#> 10 2017-03-01     9 Southern Cross Station   2063
#> # ... with 734 more rows, and 5 more variables:
#> #   Date_Time <dttm>, Wday <ord>, .group_id <dbl>,
#> #   .x <dbl>, .y <dbl>
```
]

---

.left-column[
## bind all together

### - shiny app
### - sugrrants
### - calendar plot
]

.right-column[
`ggplot2` takes care of plotting a `data.frame` or `tibble` as usual.

```r
p_sx <- sx_cal %>%
  ggplot(aes(.x, .y, group = .group_id, colour = Wday)) +
  geom_line()
p_sx
```

<img src="figure/calendar-ggplot-1.svg" style="display: block; margin: auto;" />
]

---

.left-column[
## bind all together

### - shiny app
### - sugrrants
### - calendar plot
]

.right-column[
`prettify` takes a `ggplot` object and then makes the calendar plot more readable.

```r
prettify(p_sx)
```

<img src="figure/calendar-prettify-1.svg" style="display: block; margin: auto;" />
]

---

class: inverse middle center

# Done!

hmm... not yet?

---

class: middle

## Acknowledgements

* Slides powered by the R package [xaringan](https://github.com/yihui/xaringan) for [remark.js](https://remarkjs.com/) and [R Markdown](https://rmarkdown.rstudio.com)
  + The source files to reproduce the slides are available [here](https://github.com/earowang/medascin17-tsvis).
  + Yihui is giving a tutorial on the [R Markdown Ecosystem](http://www.datasciencemelbourne.com/medascin2017/session/r-markdown-ecosystem-reports-papers-dashboards-books-websites-and-presentations/) (check it out).
* The R packages used for the slides to make analysis easier and more fun: `ggmap`, `plotly`, `stringr`, `forcats`, `forecast`, `tidyverse`, `lubridate`, `broom`, `zoo`, `shiny`, `emo`

---

class: inverse middle center

# This work is licensed under <br> [<i class="fa fa-creative-commons" aria-hidden="true"></i> Creative Commons](https://creativecommons.org/licenses/by-nc-nd/3.0/us/)