class: center, middle, inverse, title-slide

# Tidy your time series analysis<br>with tsibble
### Earo Wang<br>@earowang
### 4 October 2018<br>slides at http://slides.earo.me/bigapple
--- background-image: url(figure/map-airlines-1.svg) background-size: cover class: bottom center # 2017 US airline traffic: 5,548,445 flights *data source: [US Bureau of Transportation Statistics](https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236)* ??? we're looking at time series analysis from EDA perspective, what data we're going to look at? --- .left-column[ <br> <br> <br> <br> <br> <br> <br> ## A glimpse ] .right-column[ ```r glimpse(flights) ``` ``` #> Observations: 5,548,445 #> Variables: 22 #> $ flight <chr> "AA494", "AA494", "AA494", ... #> $ sched_dep_datetime <dttm> 2017-01-17 16:19:00, 2017-... #> $ sched_arr_datetime <dttm> 2017-01-17 18:56:00, 2017-... #> $ dep_delay <dbl> -3, -5, -8, 37, 13, 17, -3,... #> $ arr_delay <dbl> -14, -35, -30, 33, 2, 25, 1... #> $ carrier <chr> "AA", "AA", "AA", "AA", "AA... #> $ tailnum <chr> "N583AA", "N544AA", "N553AA... #> $ origin <chr> "CLT", "CLT", "CLT", "CLT",... #> $ dest <chr> "PHX", "PHX", "PHX", "PHX",... #> $ air_time <dbl> 244, 228, 236, 252, 245, 25... #> $ distance <dbl> 1773, 1773, 1773, 1773, 177... #> $ origin_city_name <chr> "Charlotte", "Charlotte", "... #> $ origin_state <chr> "NC", "NC", "NC", "NC", "NC... #> $ dest_city_name <chr> "Phoenix", "Phoenix", "Phoe... #> $ dest_state <chr> "AZ", "AZ", "AZ", "AZ", "AZ... #> $ taxi_out <dbl> 17, 13, 17, 18, 17, 27, 20,... #> $ taxi_in <dbl> 5, 6, 2, 3, 4, 4, 7, 6, 5, ... #> $ carrier_delay <dbl> NA, NA, NA, 33, NA, 0, NA, ... #> $ weather_delay <dbl> NA, NA, NA, 0, NA, 0, NA, N... #> $ nas_delay <dbl> NA, NA, NA, 0, NA, 8, NA, N... #> $ security_delay <dbl> NA, NA, NA, 0, NA, 0, NA, N... #> $ late_aircraft_delay <dbl> NA, NA, NA, 0, NA, 17, NA, ... ``` ] ??? detailed, and richness --- background-image: url(figure/n938dn-1.svg) background-size: cover class: bottom center ### .bg-col[.white[When a plane goes somewhere, it comes back.]] # Plane N938DN <br> <br> ??? 
tailnum --- background-image: url(figure/dl771-1.svg) background-size: cover class: bottom center ### .bg-col[.white[The flight number isn't tied to a certain path.]] # Flight DL771 <br> <br> ??? * flight number is such a weird variable. --- .pull-left[ <br> <br> <br> <br> <br> <br> <br> ## Timestamped data: sliding into time series analysis ] .pull-right[ <br> <br> <br> <br> <br> ``` #> # A tibble: 5,548,445 x 3 #> flight origin sched_dep_datetime #> <chr> <chr> <dttm> #> 1 AA494 CLT 2017-01-17 16:19:00 #> 2 AA494 CLT 2017-01-18 16:19:00 #> 3 AA494 CLT 2017-01-19 16:19:00 #> 4 AA494 CLT 2017-01-20 16:19:00 #> 5 AA494 CLT 2017-01-21 16:19:00 #> 6 AA494 CLT 2017-01-22 16:19:00 #> # ... with 5.548e+06 more rows ``` ] ??? * event data --> slide into time series analysis --- .pull-left[ <br> <br> <br> <br> <br> <br> <br> ## Get ready for time series analysis? ] .pull-right[ <br> <br> <br> <br> <br> This data set features: * heterogeneous data types * irregular time interval * multiple measured variables * multiple grouping variables ] --- .pull-left[ <br> <br> <br> <br> <br> <br> <br> ## Get ready for time series analysis? ] .pull-right[ <br> <br> <br> <br> <br> But it doesn't fit into: .x[ * `ts()` * `zoo()` * `xts()` ] The underlying structure is a numeric matrix with implicit time indices. ] --- .pull-left[ <br> <br> <br> <br> <br> <br> <br> ## Get ready for time series analysis? ] .pull-right[ <br> <br> <br> <br> <br> .animated.hinge[ But it doesn't fit into: .x[ * `ts()` * `zoo()` * `xts()` ] The underlying structure is a numeric matrix with implicit time indices. ] ] --- .pull-left[ <br> <br> <br> <br> <br> <br> <br> ## Get ready for time series analysis? 
] .pull-right[ <br> <br> <br> <br> <br> .animated.bounceInDown[ .center[<img src="img/tsibble.png" height=230px>] ] ] --- class: inverse middle center .pull-left[ <img src="https://imgs.xkcd.com/comics/standards.png"> .footnote[ .red[reference:] [XKCD on "standards"](https://xkcd.com/927/) ] ] -- .pull-right[ .center[<img src="img/tsibble.png" height=230px>] ### The 15th time series standard ] ??? not sure if tsibble actually covers everyone's use case, but it is more accessible. --- .left-column[ <br> <br> <br> <br> <br> <br> ## What makes a tsibble? ] .right-column[ .pull-left[ <br> .center[ <img src="img/tibble.png" height=180px> <i class="fas fa-arrow-down"></i> <img src="img/tsibble.png" height=180px> ] ] .pull-right[ <br> <br> <br> <br> <br> <br> tsibble leverages tibble along with time domain semantics: **index** and **key**. ] ] --- .left-column[ ## What makes a tsibble? ### - index ] .right-column[ ```r us_flights <- flights %>% as_tsibble( index = sched_dep_datetime, key = id(flight, origin), regular = FALSE ) ``` * Supported index types: + **base**: `POSIXct` & `Date` + **tsibble**: `yearweek`, `yearmonth` & `yearquarter` + **zoo**: `yearmon` & `yearqtr` + **nanotime**: `nanotime` + `integer`/`double` * Extensible: define `index_valid()` for custom types ] --- .left-column[ ## What makes a tsibble? ### - index ### - key ] .right-column[ What's your observational unit over time? ```r us_flights <- flights %>% as_tsibble( index = sched_dep_datetime, key = id(flight, origin), regular = FALSE ) ``` <img src="figure/dl771b-1.svg" style="display: block; margin: auto;" /> ] --- .left-column[ ## What makes a tsibble? ### - index ### - key ] .right-column[ What's your observational unit over time? 
```r us_flights <- flights %>% as_tsibble( index = sched_dep_datetime, key = id(flight, origin), regular = FALSE ) ``` * Key created via `id()` (identifier) + *empty:* univariate time series (implicit key) `id()` + *single:* univariate or multivariate time series `id(flight)` + *multiple:* relational structures (nesting or crossing) <br> `id(flight, origin | origin_city_name | origin_state)` **tsibble is strict: distinct rows identified by index and key.** Use `validate = FALSE` to skip the checks, if you're sure it's a valid input. ] --- .left-column[ ## What makes a tsibble? ### - index ### - key ### - interval ] .right-column[ Free from specifying time frequency ```r us_flights <- flights %>% as_tsibble( index = sched_dep_datetime, key = id(flight, origin), regular = FALSE ) ``` By default `regular = TRUE`, it automatically recognizes the time interval based on the index representation from year to day, to nanosecond. But if you wanna specify the interval, check out the low-level constructor `build_tsibble()`. ] --- .left-column[ ## What makes a tsibble? ### - index ### - key ### - interval ### - tsibble ] .right-column[ **Arranging the key and index from past to future** ```r us_flights ``` ``` #> # A tsibble: 5,548,445 x 22 [!] #> # Key: flight, origin [161,976] #> flight sched_dep_datetime sched_arr_datetime dep_delay #> <chr> <dttm> <dttm> <dbl> #> 1 AA1 2017-01-01 08:00:00 2017-01-01 11:42:00 31 #> 2 AA1 2017-01-02 08:00:00 2017-01-02 11:42:00 -3 #> 3 AA1 2017-01-03 08:00:00 2017-01-03 11:42:00 -6 #> 4 AA1 2017-01-04 08:00:00 2017-01-04 11:42:00 -3 #> 5 AA1 2017-01-05 08:00:00 2017-01-05 11:42:00 -7 #> 6 AA1 2017-01-06 08:00:00 2017-01-06 11:42:00 -3 #> # ... 
with 5.548e+06 more rows, and 18 more variables: #> # arr_delay <dbl>, carrier <chr>, tailnum <chr>, #> # origin <chr>, dest <chr>, air_time <dbl>, #> # distance <dbl>, origin_city_name <chr>, #> # origin_state <chr>, dest_city_name <chr>, #> # dest_state <chr>, taxi_out <dbl>, taxi_in <dbl>, #> # carrier_delay <dbl>, weather_delay <dbl>, #> # nas_delay <dbl>, security_delay <dbl>, #> # late_aircraft_delay <dbl> ``` ] --- class: middle .large[A tsibble consists of a *time index*, *key* and other *measured variables* in a **data-centric** format, which is built on top of the tibble.] * arbitrary index support * nesting and crossing data structures * heterogeneous data types * list-column ??? We learned a new time series standard, but what can we do with it? --- class: inverse middle center <img src="img/tsibble.png" height=150px> ## standing on the shoulders of giants .animated.bounceInRight[ ``` ⬢ __ _ __ . ⬡ ⬢ . / /_(_)__/ /_ ___ _____ _______ ___ / __/ / _ / // / |/ / -_) __(_-</ -_) \__/_/\_,_/\_, /|___/\__/_/ /___/\__/ ⬢ . /___/ ⬡ . ⬢ ``` ] --- .left-column[ ## Verbs ### - filter ] .right-column[ ```r us_flights %>% filter(sched_dep_datetime < yearmonth("201703")) ``` ``` #> # A tsibble: 821,912 x 22 [!] #> # Key: flight, origin [46,960] #> flight sched_dep_datetime sched_arr_datetime dep_delay #> <chr> <dttm> <dttm> <dbl> #> 1 AA1 2017-01-01 08:00:00 2017-01-01 11:42:00 31 #> 2 AA1 2017-01-02 08:00:00 2017-01-02 11:42:00 -3 #> 3 AA1 2017-01-03 08:00:00 2017-01-03 11:42:00 -6 #> 4 AA1 2017-01-04 08:00:00 2017-01-04 11:42:00 -3 #> 5 AA1 2017-01-05 08:00:00 2017-01-05 11:42:00 -7 #> 6 AA1 2017-01-06 08:00:00 2017-01-06 11:42:00 -3 #> # ... 
with 8.219e+05 more rows, and 18 more variables: #> # arr_delay <dbl>, carrier <chr>, tailnum <chr>, #> # origin <chr>, dest <chr>, air_time <dbl>, #> # distance <dbl>, origin_city_name <chr>, #> # origin_state <chr>, dest_city_name <chr>, #> # dest_state <chr>, taxi_out <dbl>, taxi_in <dbl>, #> # carrier_delay <dbl>, weather_delay <dbl>, #> # nas_delay <dbl>, security_delay <dbl>, #> # late_aircraft_delay <dbl> ``` ] ??? If you're familiar with the tidyverse, there's not much of a learning curve. --- .left-column[ ## Verbs ### - filter ### - select ] .right-column[ Keep the index hanging around with the tsibble ```r us_flights %>% select(flight, origin, dep_delay) ``` ``` #> # A tsibble: 5,548,445 x 4 [!] #> # Key: flight, origin [161,976] #> flight origin dep_delay sched_dep_datetime #> <chr> <chr> <dbl> <dttm> #> 1 AA1 JFK 31 2017-01-01 08:00:00 #> 2 AA1 JFK -3 2017-01-02 08:00:00 #> 3 AA1 JFK -6 2017-01-03 08:00:00 #> 4 AA1 JFK -3 2017-01-04 08:00:00 #> 5 AA1 JFK -7 2017-01-05 08:00:00 #> 6 AA1 JFK -3 2017-01-06 08:00:00 #> # ... with 5.548e+06 more rows ``` OR *consciously* leave off the time context by using `.drop = TRUE` or `as_tibble()`. ] --- .left-column[ ## Verbs ### - filter ### - select ### - summarise ] .right-column[ Aggregate over time ```r us_flights %>% summarise(avg_delay = mean(dep_delay)) ``` ``` #> # A tsibble: 393,009 x 2 [!] #> sched_dep_datetime avg_delay #> <dttm> <dbl> #> 1 2017-01-01 00:14:00 -5 #> 2 2017-01-01 00:15:00 -6.67 #> 3 2017-01-01 00:20:00 -3 #> 4 2017-01-01 00:21:00 17 #> 5 2017-01-01 00:30:00 -1 #> 6 2017-01-01 00:35:00 -4.5 #> # ... with 3.93e+05 more rows ``` Utilize the context and reduce typing, compared to: ```r flights %>% group_by(sched_dep_datetime) %>% summarise(avg_delay = mean(dep_delay)) ``` ] --- .left-column[ ## Verbs ### - filter ### - select ### - summarise ### - index_by* ] .right-column[ ```r us_flights %>% index_by(dep_date = as_date(sched_dep_datetime)) ``` ``` #> # A tsibble: 5,548,445 x 23 [!] 
#> # Key: flight, origin [161,976] #> # Groups: @ dep_date [365] #> flight sched_dep_datetime sched_arr_datetime dep_delay #> <chr> <dttm> <dttm> <dbl> #> 1 AA1 2017-01-01 08:00:00 2017-01-01 11:42:00 31 #> 2 AA1 2017-01-02 08:00:00 2017-01-02 11:42:00 -3 #> 3 AA1 2017-01-03 08:00:00 2017-01-03 11:42:00 -6 #> 4 AA1 2017-01-04 08:00:00 2017-01-04 11:42:00 -3 #> 5 AA1 2017-01-05 08:00:00 2017-01-05 11:42:00 -7 #> 6 AA1 2017-01-06 08:00:00 2017-01-06 11:42:00 -3 #> # ... with 5.548e+06 more rows, and 19 more variables: #> # arr_delay <dbl>, carrier <chr>, tailnum <chr>, #> # origin <chr>, dest <chr>, air_time <dbl>, #> # distance <dbl>, origin_city_name <chr>, #> # origin_state <chr>, dest_city_name <chr>, #> # dest_state <chr>, taxi_out <dbl>, taxi_in <dbl>, #> # carrier_delay <dbl>, weather_delay <dbl>, #> # nas_delay <dbl>, security_delay <dbl>, #> # late_aircraft_delay <dbl>, dep_date <date> ``` ] --- .left-column[ ## Verbs ### - filter ### - select ### - summarise ### - index_by* ] .right-column[ ```r us_flights %>% index_by(dep_date = as_date(sched_dep_datetime)) %>% summarise(avg_delay = mean(dep_delay)) ``` ``` #> # A tsibble: 365 x 2 [1D] #> dep_date avg_delay #> <date> <dbl> #> 1 2017-01-01 11.7 #> 2 2017-01-02 25.2 #> 3 2017-01-03 18.3 #> 4 2017-01-04 13.1 #> 5 2017-01-05 17.6 #> 6 2017-01-06 16.6 #> # ... with 359 more rows ``` * `year()` for yearly aggregation, `yearquarter()` for quarterly, `yearmonth()` for monthly, `yearweek()` for weekly * other lubridate friends: `floor_date()`, `ceiling_date()` ] --- class: middle inverse background-image: url(img/lego.jpg) background-size: cover ### .bg-col[Each verb is self-explanatory and helps reason about data analysis.] ### .bg-col[Common tidyverse verbs help reduce cognitive load and ease the transition to time series analysis.] ### .bg-col[The pipe operator %>% chains a sequence of verbs to compose a sound pipeline.] 
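---

## Verbs: a toy recap

The verbs above can be tried on a handful of rows before touching the full data. This is only a sketch under stated assumptions: `toy_flights` and its values are made up for illustration, and the key is written as `key = c(flight, origin)`, which assumes a recent tsibble release rather than the `id()` form used on these slides.

```r
library(dplyr)
library(tsibble)
library(lubridate)

# Hypothetical toy data: two flights, each observed once a day at irregular times.
toy_flights <- tibble(
  flight = c("AA1", "AA1", "DL2", "DL2"),
  origin = c("JFK", "JFK", "ATL", "ATL"),
  sched_dep_datetime = ymd_hm(c(
    "2017-01-01 08:00", "2017-01-02 08:00",
    "2017-01-01 09:30", "2017-01-02 10:15"
  )),
  dep_delay = c(31, -3, 10, 4)
)

# Declare the index and key; regular = FALSE because departures are event data.
toy_ts <- toy_flights %>%
  as_tsibble(
    key = c(flight, origin),  # assumption: c() instead of id() in newer tsibble
    index = sched_dep_datetime,
    regular = FALSE
  )

# Re-index to calendar days, then average the delay within each day.
daily_delay <- toy_ts %>%
  index_by(dep_date = as_date(sched_dep_datetime)) %>%
  summarise(avg_delay = mean(dep_delay))

daily_delay
```

Because `summarise()` is not grouped by the key here, the result collapses across flights and keeps only the new daily index `dep_date`.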
--- background-image: url(figure/carrier-mosaic-bg-1.svg) background-size: 130% background-position: -200px -50px class: inverse middle center ## .bg-col[Annual carrier on-time performance] --- .left-column[ .center[<img src="img/tsibble.png" height=110px>] ### - carrier ] .right-column[ ```r delayed_carrier <- us_flights %>% mutate(delayed = dep_delay > 15) %>% group_by(carrier) %>% index_by(year = year(sched_dep_datetime)) %>% summarise( Ontime = sum(delayed == 0), Delayed = sum(delayed) ) %>% gather(delayed, n_flights, Ontime:Delayed) %>% print() ``` ``` #> # A tsibble: 24 x 4 [?] #> # Key: carrier, delayed [24] #> carrier year delayed n_flights #> <chr> <dbl> <chr> <int> #> 1 AA 2017 Ontime 738556 #> 2 AS 2017 Ontime 160386 #> 3 B6 2017 Ontime 206185 #> 4 DL 2017 Ontime 780785 #> 5 EV 2017 Ontime 267154 #> 6 F9 2017 Ontime 81163 #> # ... with 18 more rows ``` ] --- .block[ .flowchart[ * `mutate()` <br> create dummy variable ] ] ```r us_flights %>% * mutate(delayed = dep_delay > 15) ``` ``` #> # A tsibble: 5,548,445 x 23 [!] #> # Key: flight, origin [161,976] #> flight sched_dep_datetime sched_arr_datetime dep_delay #> <chr> <dttm> <dttm> <dbl> #> 1 AA1 2017-01-01 08:00:00 2017-01-01 11:42:00 31 #> 2 AA1 2017-01-02 08:00:00 2017-01-02 11:42:00 -3 #> 3 AA1 2017-01-03 08:00:00 2017-01-03 11:42:00 -6 #> 4 AA1 2017-01-04 08:00:00 2017-01-04 11:42:00 -3 #> 5 AA1 2017-01-05 08:00:00 2017-01-05 11:42:00 -7 #> 6 AA1 2017-01-06 08:00:00 2017-01-06 11:42:00 -3 #> # ... 
with 5.548e+06 more rows, and 19 more variables: #> # arr_delay <dbl>, carrier <chr>, tailnum <chr>, #> # origin <chr>, dest <chr>, air_time <dbl>, #> # distance <dbl>, origin_city_name <chr>, #> # origin_state <chr>, dest_city_name <chr>, #> # dest_state <chr>, taxi_out <dbl>, taxi_in <dbl>, #> # carrier_delay <dbl>, weather_delay <dbl>, #> # nas_delay <dbl>, security_delay <dbl>, #> # late_aircraft_delay <dbl>, delayed <lgl> ``` --- .block[ .flowchart[ * `mutate()` <br> create dummy variable * `index_by() + summarise()` <br> aggregate to annual data ] ] ```r us_flights %>% mutate(delayed = dep_delay > 15) %>% * group_by(carrier) %>% * index_by(year = year(sched_dep_datetime)) %>% * summarise( * Ontime = sum(delayed == 0), * Delayed = sum(delayed) * ) ``` ``` #> # A tsibble: 12 x 4 [?] #> # Key: carrier [12] #> carrier year Ontime Delayed #> <chr> <dbl> <int> <int> #> 1 AA 2017 738556 136344 #> 2 AS 2017 160386 22670 #> 3 B6 2017 206185 72034 #> 4 DL 2017 780785 128534 #> 5 EV 2017 267154 61420 #> 6 F9 2017 81163 20365 #> 7 HA 2017 73498 6245 #> 8 NK 2017 122141 28028 #> 9 OO 2017 582553 111461 #> 10 UA 2017 480623 93485 #> 11 VX 2017 52575 17107 #> 12 WN 2017 1035170 269963 ``` --- .block[ .flowchart[ * `mutate()` <br> create dummy variable * `index_by() + summarise()` <br> aggregate to annual data * `gather()` <br> melt to long form ] ] ```r us_flights %>% mutate(delayed = dep_delay > 15) %>% group_by(carrier) %>% index_by(year = year(sched_dep_datetime)) %>% summarise( Ontime = sum(delayed == 0), Delayed = sum(delayed) ) %>% * gather(delayed, n_flights, Ontime:Delayed) ``` ``` #> # A tsibble: 24 x 4 [?] #> # Key: carrier, delayed [24] #> carrier year delayed n_flights #> <chr> <dbl> <chr> <int> #> 1 AA 2017 Ontime 738556 #> 2 AS 2017 Ontime 160386 #> 3 B6 2017 Ontime 206185 #> 4 DL 2017 Ontime 780785 #> 5 EV 2017 Ontime 267154 #> 6 F9 2017 Ontime 81163 #> # ... 
with 18 more rows ``` --- .left-column[ .center[<img src="img/tsibble.png" height=110px>] ### - carrier ] .right-column[ <img src="figure/carrier-mosaic-1.svg" style="display: block; margin: auto;" /> > Delta operates most efficiently among those big carriers. ] --- background-image: url(figure/nyc-monthly-plot-bg-1.svg) background-size: 130% background-position: -200px -50px class: inverse middle center ## .bg-col[Daily delay percentages in New York] --- .left-column[ .center[<img src="img/tsibble.png" height=110px>] ### - carrier ### - nyc ] .right-column[ ```r nyc_flights <- us_flights %>% filter(origin %in% c("JFK", "LGA", "EWR")) ``` ```r nyc_delay <- nyc_flights %>% mutate(delayed = dep_delay > 15) %>% group_by(origin) %>% index_by(sched_dep_date = as_date(sched_dep_datetime)) %>% summarise( n_flights = n(), n_delayed = sum(delayed) ) %>% mutate(pct_delay = n_delayed / n_flights) %>% print() ``` ``` #> # A tsibble: 1,095 x 5 [1D] #> # Key: origin [3] #> origin sched_dep_date n_flights n_delayed pct_delay #> <chr> <date> <int> <int> <dbl> #> 1 EWR 2017-01-01 294 63 0.214 #> 2 EWR 2017-01-02 327 118 0.361 #> 3 EWR 2017-01-03 322 153 0.475 #> 4 EWR 2017-01-04 315 95 0.302 #> 5 EWR 2017-01-05 317 91 0.287 #> 6 EWR 2017-01-06 294 97 0.330 #> # ... 
with 1,089 more rows ``` ] --- .left-column[ .center[<img src="img/tsibble.png" height=110px>] ### - carrier ### - nyc ] .right-column[ <img src="figure/nyc-delay-plot-1.svg" style="display: block; margin: auto;" /> ] --- ## A suite of window functions .pull-left[ * `slide()`/`slide2()`/`pslide()`: sliding window with overlapping observations ] .pull-right[ ![](img/slide.gif) ] --- ## A suite of window functions .pull-left[ * `slide()`/`slide2()`/`pslide()`: sliding window with overlapping observations * `tile()`/`tile2()`/`ptile()`: tiling window without overlapping observations ] .pull-right[ ![](img/slide.gif) ![](img/tile.gif) ] --- ## A suite of window functions .pull-left[ * `slide()`/`slide2()`/`pslide()`: sliding window with overlapping observations * `tile()`/`tile2()`/`ptile()`: tiling window without overlapping observations * `stretch()`/`stretch2()`/`pstretch()`: fixing an initial window and expanding to include more observations ] .pull-right[ ![](img/slide.gif) ![](img/tile.gif) ![](img/stretch.gif) ] -- <hr> `purrr`-like interface: `~ mean(.x)`. Functional programming is always fun! Type-stable: `slide()`/`tile()`/`stretch()` (a list) <i class="fas fa-plus"></i> other variants: `*_dbl()`, `*_int()`, `*_lgl()`, `*_chr()` --- .left-column[ .center[<img src="img/tsibble.png" height=110px>] ### - carrier ### - nyc ] .right-column[ ## Fixed window size ```r nyc_weekly <- nyc_delay %>% group_by(origin) %>% * mutate(ma_delay = slide_dbl( * pct_delay, mean, .size = 7, .align = "center" * )) nyc_weekly %>% select(origin, ma_delay) ``` ``` #> # A tsibble: 1,095 x 3 [1D] #> # Key: origin [3] #> # Groups: origin [3] #> origin ma_delay sched_dep_date #> <chr> <dbl> <date> #> 1 EWR NA 2017-01-01 #> 2 EWR NA 2017-01-02 #> 3 EWR NA 2017-01-03 #> 4 EWR 0.346 2017-01-04 #> 5 EWR 0.403 2017-01-05 #> 6 EWR 0.412 2017-01-06 #> # ... 
with 1,089 more rows ``` ] --- .left-column[ .center[<img src="img/tsibble.png" height=110px>] ### - carrier ### - nyc ] .right-column[ <img src="figure/nyc-weekly-plot-1.svg" style="display: block; margin: auto;" /> > Increase the window size to smooth out short-term fluctuations. ] --- .left-column[ .center[<img src="img/tsibble.png" height=110px>] ### - carrier ### - nyc ] .right-column[ ## Flexible calendar period ```r nyc_lst <- nyc_delay %>% mutate(yrmth = yearmonth(sched_dep_date)) %>% group_by(origin, yrmth) %>% nest() %>% print() ``` ``` #> # A tibble: 36 x 3 #> origin yrmth data #> <chr> <mth> <list> #> 1 EWR 2017 Jan <tsibble [31 × 4]> #> 2 EWR 2017 Feb <tsibble [28 × 4]> #> 3 EWR 2017 Mar <tsibble [31 × 4]> #> 4 EWR 2017 Apr <tsibble [30 × 4]> #> 5 EWR 2017 May <tsibble [31 × 4]> #> 6 EWR 2017 Jun <tsibble [30 × 4]> #> # ... with 30 more rows ``` ] --- .left-column[ .center[<img src="img/tsibble.png" height=110px>] ### - carrier ### - nyc ] .right-column[ ## Flexible calendar period ```r nyc_monthly <- nyc_lst %>% group_by(origin) %>% * mutate(monthly_ma = slide_dbl(data, * ~ mean(.$pct_delay), .size = 2, .bind = TRUE * )) %>% unnest(key = id(origin)) %>% print() ``` ``` #> # A tsibble: 1,095 x 7 [1D] #> # Key: origin [3] #> origin yrmth monthly_ma sched_dep_date n_flights #> <chr> <mth> <dbl> <date> <int> #> 1 EWR 2017 Jan NA 2017-01-01 294 #> 2 EWR 2017 Jan NA 2017-01-02 327 #> 3 EWR 2017 Jan NA 2017-01-03 322 #> 4 EWR 2017 Jan NA 2017-01-04 315 #> 5 EWR 2017 Jan NA 2017-01-05 317 #> 6 EWR 2017 Jan NA 2017-01-06 294 #> # ... with 1,089 more rows, and 2 more variables: #> # n_delayed <int>, pct_delay <dbl> ``` ] --- .left-column[ .center[<img src="img/tsibble.png" height=110px>] ### - carrier ### - nyc ] .right-column[ <img src="figure/nyc-monthly-plot-1.svg" style="display: block; margin: auto;" /> > 2-month moving averages highlight a stable long-term trend. 
] --- background-image: url(figure/draw-qtl-bg-1.svg) background-size: 130% background-position: -200px -50px class: inverse middle center ## .bg-col[Hourly upper-tail delay behaviours] --- .left-column[ .center[<img src="img/tsibble.png" height=110px>] ### - carrier ### - nyc ### - upper tail ] .right-column[ ```r hr_qtl <- us_flights %>% index_by(dep_datehour = floor_date(sched_dep_datetime, "hour")) %>% summarise( qtl50 = quantile(dep_delay, 0.5), qtl80 = quantile(dep_delay, 0.8), qtl95 = quantile(dep_delay, 0.95) ) %>% mutate( hour = hour(dep_datehour), wday = wday(dep_datehour, label = TRUE, week_start = 1), date = as_date(dep_datehour) ) %>% gather(key = qtl, value = dep_delay, qtl50:qtl95) %>% print() ``` ``` #> # A tsibble: 24,012 x 6 [1h] #> # Key: qtl [3] #> dep_datehour hour wday date qtl dep_delay #> <dttm> <int> <ord> <date> <chr> <dbl> #> 1 2017-01-01 00:00:00 0 Sun 2017-01-01 qtl50 -3.5 #> 2 2017-01-01 01:00:00 1 Sun 2017-01-01 qtl50 -2 #> 3 2017-01-01 02:00:00 2 Sun 2017-01-01 qtl50 -7 #> 4 2017-01-01 05:00:00 5 Sun 2017-01-01 qtl50 -2 #> 5 2017-01-01 06:00:00 6 Sun 2017-01-01 qtl50 -3 #> 6 2017-01-01 07:00:00 7 Sun 2017-01-01 qtl50 -2 #> # ... with 2.401e+04 more rows ``` ] --- .left-column[ .center[<img src="img/tsibble.png" height=110px>] ### - carrier ### - nyc ### - upper tail ] .right-column[ <img src="figure/draw-qtl-1.svg" style="display: block; margin: auto;" /> > Travel smart: avoid flying around 6pm during the day. 
] --- class: middle inverse background-image: url(img/lego.jpg) background-size: cover ### .bg-col[Break up a big problem into manageable blocks] ### .bg-col[Generate a human-readable and less error-prone analysis workflow] --- class: inverse ## Look ahead <br> <br> .center[ <img src="img/tsibble.png" height=230px> <img src="img/fable.png" height=230px> ### [tidyverts.org](https://tidyverts.org) ] --- class: inverse middle center ### Joint work with <i class="fas fa-users"></i> [Di Cook](http://dicook.org) & [Rob J Hyndman](http://robjhyndman.com) ### Slides created via xaringan ⚔️ <http://slides.earo.me/bigapple> ### More about tsibble <http://pkg.earo.me/tsibble> ### Open source <https://github.com/earowang/bigapple> ### This work is licensed under [<i class="fab fa-creative-commons"></i> BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/).
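---

## Appendix: window types in base R

The three window shapes from "A suite of window functions" (sliding, tiling, stretching) can be compared side by side on a short vector. The sketch below uses base R only; it is not how tsibble implements them, and `x`, `size` and the result names are hypothetical. Unlike the `slide_dbl()` output shown earlier, it does not pad incomplete windows with `NA`.

```r
x <- c(2, 4, 6, 8, 10, 12)  # made-up series
size <- 3                   # window size, chosen for illustration

# Sliding: overlapping windows that advance one observation at a time.
slide_means <- vapply(
  seq_len(length(x) - size + 1),
  function(i) mean(x[i:(i + size - 1)]),
  numeric(1)
)

# Tiling: consecutive blocks that never share an observation.
blocks <- split(x, ceiling(seq_along(x) / size))
tile_means <- unname(vapply(blocks, mean, numeric(1)))

# Stretching: fix an initial window, then extend it by one observation each step.
stretch_means <- vapply(
  size:length(x),
  function(n) mean(x[seq_len(n)]),
  numeric(1)
)

slide_means
tile_means
stretch_means
```

`vapply(..., numeric(1))` always returns a double vector, which is the same guarantee the `*_dbl()` variants offer over the list-returning functions.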