class: center, middle, inverse, title-slide

# Melt the clock
## tidy time series analysis
### Earo Wang
@earowang
### slides at
slides.earo.me/rstudioconf19
---
class: middle

.center[
### 1 tidy unified time series workflow ‼️
]

<br>

.pull-left[
### 2 packages 📦
.center[
<img src="img/tsibble.png" height=180px>
<img src="img/fable.png" height=180px>
]
]
.pull-right[
### 3 big ideas 💡
1. tsibble
2. mable
3. fable
]

???
I want to talk about one streamlined workflow for time series under the tidyverse framework, using two packages. I'll introduce you to the three big ideas behind these two packages, namely tsibble, mable, and fable, and how they link together. I hope to explain them concretely with data examples.

---

## Tidy data workflow

.center[
![](img/r4ds.svg)
]

The underlying data structure: [tibble]()

???
I believe this diagram isn't foreign to you. This is the tidyverse model. Each module here is powered by one of the tidyverse packages. The tidyverse packages play nicely with each other, and one of the fundamental reasons is that they all share the same underlying data structure: the data frame, or tibble. So the data is placed at the center of the diagram. But why can't we bring this workflow easily into time series?

---

## Current time series workflow

.center[
![](img/r4ts.svg)
]

<hr>

.center[
![](img/adhoc.svg)
]

???
Because the current time series objects in R are model-focused. By model, I mean not only statistical models but also forecasting, decomposition, autocorrelation functions, and so on. All these methods expect matrices as inputs. But the data arrives at the very beginning of this process, and it rarely comes in matrix form. We have to write so much ad hoc code to get the data into a model-ready time series object. It is a pain, because of this mismatch. We hope to change that.

---

## Tidy time series workflow 😃

.center[
![](img/r4tidyverts.svg)
]

The underlying data structure: [tsibble]()

???
Some new tools are provided to streamline this workflow and make time series analysis a bit easier and more fun. The tsibble package focuses on the tidy and transform parts; the fable package does time series modelling and forecasting. The visualisation is done with ggplot2 and its extensions. To make this workflow work, they all share the same underlying data structure, which is a new data abstraction for time series.

---
class: inverse middle center

.animated.bounce[
<img src="img/tsibble.png" height=280px>
]

## A modern reimagining of time series

???
We call it "tsibble": a time series tibble. As you may know, `ts` is the native object for representing time series in R, and that is where the name tsibble comes from.

---

# An example: electricity consumption

```
#> # A tibble: 46,102,229 x 8
#>   customer_id reading_datetime    general_supply_…
#>         <int> <dttm>                         <dbl>
#> 1     8143667 2013-04-03 07:30:00            0.117
#> 2     8143667 2013-04-03 08:00:00            0.051
#> 3     8143667 2013-04-03 08:30:00            0.085
#> 4     8143667 2013-04-03 09:00:00            0.165
#> # … with 4.61e+07 more rows, and 5 more variables:
#> #   event_key <int>, controlled_load_kwh <int>,
#> #   gross_generation_kwh <int>,
#> #   net_generation_kwh <int>, other_kwh <int>
```

.center[
.card[.large[many series]]
.card[.large[many measurements]]
.card[.large[fine time resolutions]]
]

.footnote[[Data source: Department of the Environment and Energy, Australia](https://data.gov.au/dataset/4e21dea3-9b87-4610-94c7-15a8a77907ef)]

???
Now we're going to have some fun with an open dataset. This dataset is about residential electricity consumption in Australia. It's a tibble with 46 million observations and 8 variables. The column `customer_id` contains a unique identifier for each household; there are thousands of households in this data. And `reading_datetime` gives the timestamp at which each reading is recorded, every 30 minutes. `general_supply_kwh` is the variable we're interested in forecasting. There are some other measurements in the table as well. How are we going to turn this data into a tsibble?

---

## Time series has its own semantics

```r
elec_ts <- elec_tbl %>%
  as_tsibble(
    index = reading_datetime,
    key = id(customer_id)
  )
```

1. <i class='far fa-clock'></i> **index**: a variable that represents **time**
2. <i class="fas fa-users"></i> **key**: identifying variables that define **series**

.center[
*.red[1.] + .red[2.] determine distinct rows in a tsibble.*
]

???
What makes a time series special, or different from a tibble? It has its own semantics. The first component is obviously the index, a variable that represents time. In this case, it's `reading_datetime`. Tsibble supports a wide range of time classes, from numerics to nanotime, as well as ordered factors. The second component is what we call the "key": identifying variables that define series or entities over time. In this example, each household is a series and the key variable is `customer_id`. The key can include multiple variables too. The index defines the time and the key defines the series. Together, the key and the index uniquely identify each observation in a tsibble. Basically, it means each customer has to have unique timestamps.
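As a minimal sketch of a multi-variable key, suppose each household also carried a `state` column (hypothetical here, not in this dataset); both variables would then go into the key:

```r
# Hypothetical sketch: a key made of two identifying variables
elec_ts <- elec_tbl %>%
  as_tsibble(
    index = reading_datetime,
    key = id(customer_id, state)
  )
```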
---

## Contextual pretty printing

```
*#> # A tsibble: 46,102,229 x 8 [30m] <UTC>
*#> # Key:       customer_id [2,924]
#>   customer_id reading_datetime    general_supply_…
#>         <int> <dttm>                         <dbl>
#> 1     8143667 2013-04-03 07:30:00            0.117
#> 2     8143667 2013-04-03 08:00:00            0.051
#> 3     8143667 2013-04-03 08:30:00            0.085
#> 4     8143667 2013-04-03 09:00:00            0.165
#> # … with 4.61e+07 more rows, and 5 more variables:
#> #   event_key <int>, controlled_load_kwh <int>,
#> #   gross_generation_kwh <int>,
#> #   net_generation_kwh <int>, other_kwh <int>
```

.center[
.card[.large[time interval]]
.card[.large[time zone of date-times]]
.card[.large[the number of series]]
]

???
Now we have a tsibble. Tibble provides a very nice printing method, and tsibble enhances it with contextual information alongside the data dimensions. It recognises the 30-minute interval and the time zone associated with the index. Here we have UTC, but you might have time zones with DST; tsibble will respect any time zone. The key variable is reported along with the number of series: we have 2,924 customers in this big table.

---

## Look at data early: time gaps

<img src="figure/line-na-1.svg" style="display: block; margin: auto;" />

```r
elec_ts %>%
  fill_gaps()
```

<img src="figure/fill-gaps-1.svg" style="display: block; margin: auto;" />

???
We have some time gaps in this data; they are implicit missing values. Since ggplot2 isn't aware of these gaps, it draws a straight line between data segments if you use `geom_line()`. Tsibble comes with a handy function, `fill_gaps()`. It fills in these gaps with `NA` by default, or you can fill by values or functions. So these misleading lines can be removed from the plot easily with `fill_gaps()`.
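A minimal sketch of those alternatives, filling the gaps with a fixed value or with a function evaluated on each series:

```r
# Fill the implicit gaps with zeros instead of NA
elec_ts %>% fill_gaps(general_supply_kwh = 0)

# Or fill with a summary of each series, e.g. its mean
elec_ts %>% fill_gaps(general_supply_kwh = mean(general_supply_kwh))
```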
---

## Implicit missing value handlers

.pull-left[
.div-middle[
1. Check: `has_gaps()`
2. Reveal: `scan_gaps()`
3. Summarise: `count_gaps()`
4. Fill in time gaps: `fill_gaps()`
]
]
.pull-right[
.large[
```r
has_gaps(elec_ts)
```
```
#> # A tibble: 2,924 x 2
#>    customer_id .gaps
#>          <int> <lgl>
#>  1     8143667 FALSE
#>  2     8143741 FALSE
#>  3     8143763 FALSE
#>  4     8143811 FALSE
#>  5     8143817 FALSE
#>  6     8143819 FALSE
#>  7     8143839 FALSE
#>  8     8143877 FALSE
*#>  9     8144653 TRUE
*#> 10     8144679 TRUE
#> # … with 2,914 more rows
```
]
]

???
Besides `fill_gaps()`, tsibble provides three other verbs to handle implicit missing values; they are all suffixed with `_gaps()`. Nearly 20% of customers have time gaps. In the output shown here, you can see the last two customers have gaps. Almost all model functions require a complete series, so it's good practice to look at and fill in those time gaps in the first place.

---

## A new adverb `index_by()`: time-aware grouping

```r
elec_avg <- elec_ts %>%
*  index_by(datetime = floor_date(reading_datetime, "hour")) %>%
  summarise(avg_kwh = mean(general_supply_kwh)) %>%
  print()
```

```
*#> # A tsibble: 8,760 x 2 [1h] <UTC>
#>   datetime            avg_kwh
#>   <dttm>                <dbl>
#> 1 2013-01-01 00:00:00   0.251
#> 2 2013-01-01 01:00:00   0.221
#> 3 2013-01-01 02:00:00   0.196
#> 4 2013-01-01 03:00:00   0.182
#> # … with 8,756 more rows
```

???
Tsibble works nicely with dplyr and tidyr verbs. A new adverb you will use quite often is `index_by()`. It's similar to `group_by()` in that it prepares a grouping structure, but it only groups the index. Combined with `summarise()`, the data can be aggregated to any higher-level time resolution. For example, say I want to work with hourly data instead of 30-minute data. I use `floor_date()`/`ceiling_date()`/`round_date()` from the lubridate package on the index variable inside `index_by()`, followed by `summarise()`, and I get the hourly average electricity usage across all the households. The result is a single time series at a one-hour interval; the key is implicit.
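The same pattern works for any coarser resolution. A quick sketch of daily aggregation, using `lubridate::as_date()` on the index:

```r
# Daily average usage: collapse the 30-minute index to dates
elec_ts %>%
  index_by(date = as_date(reading_datetime)) %>%
  summarise(avg_kwh = mean(general_supply_kwh))
```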
---

## Functional programming: map and roll

.left-column[
<br>
<br>
![purrr-logo](https://raw.githubusercontent.com/rstudio/hex-stickers/master/PNG/purrr.png)
]

--

.right-column[
<div style="width:100%;height:0;padding-bottom:64%;position:relative;"><iframe src="https://giphy.com/embed/xoSaxIp8f9JPa" width="100%" height="100%" style="position:absolute" frameBorder="0" class="giphy-embed" allowFullScreen></iframe></div><p><a href="https://giphy.com/gifs/cat-rolls-xoSaxIp8f9JPa"></a></p>
]

???
Another big chunk of operations for time series is rolling windows. A rolling operation not only iterates over each element, like `purrr::map()` does, but also needs to roll. What I'm going to do is wake up this cat and let her roll.

---

.pull-left[
slide
![](img/slide.gif)
tile
![](img/tile.gif)
stretch
![](img/stretch.gif)
]
.pull-right[
<br>
<img src="img/window.png">

Type stability suffix: `*_dbl()`, `*_int()`, `*_lgl()`, `*_chr()`, `*_dfr()`, `*_dfc()`

Parallel processing prefix: `future_*`.
]

???
There are actually three different types of rolling operations: sliding, tiling, and stretching. It's just like purrr: `slide()` takes a single input, `slide2()` takes two inputs, and `pslide()` takes multiple inputs. If you have a data frame to roll over by observation, you need to use `pslide()`. For type stability, these functions return a list, and the other variants return integers, characters, and so on. In the recent version, I've added parallel support: the variants prefixed with `future_` use Davis Vaughan's furrr package as a backend. If you're doing some rolling regression, it's nice to save time by rolling in parallel. You can supply an arbitrary function to these rolling windows.

---

## Rolling averages

.pull-left[
```r
slide_dbl(x, ~ mean(.), .size = 24)
```
.div-middle[
```r
tile_dbl(x, ~ mean(.), .size = 24)
```
]
.div-middle[
```r
stretch_dbl(x, ~ mean(.), .init = 24)
```
]
]
.pull-right[
<img src="img/slide-mean.gif" height="160px">
<img src="img/tile-mean.gif" height="160px">
<img src="img/stretch-mean.gif" height="160px">
]

???
A simple example is a rolling average. It uses purrr-like syntax: you can pass a function name or an anonymous function using `~`. I specify the window size to be 24 (one day), and it rolls forward from left to right. If you want to move in the opposite direction, just make the window size negative. These three rolling windows, as you can see, result in different averages.
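For instance, a backward-rolling daily average might be written like this (a sketch relying on the negative window size just mentioned):

```r
# Roll from right to left with 24-hour windows
slide_dbl(x, ~ mean(.), .size = -24)
```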
---

## Rolling forecast

```r
plan(multiprocess)
expand_forecast <- function(...) {
  # 3 lines of tsibble + fable code
  ...
}
elec_avg %>%
  future_pstretch(expand_forecast, .size = 24, .init = 168)
```

.center[
<img src="img/rolling.gif">
]

???
A more advanced but extremely useful example is a rolling forecast. I define a custom function called `expand_forecast()` and enable parallel processing by using `future_pstretch()`. It's never been this easy to do an expanding forecast. A nice thing about functional programming is that we can focus on writing expressions instead of writing a long for-loop. So what sort of code goes into `expand_forecast()`? This question brings us to the next part of the presentation: tidy forecasting.
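For the curious, here is a hypothetical sketch of what those three lines might look like; this is not the code behind the slide, and it assumes `future_pstretch()` hands each expanding window's columns to the function as named vectors:

```r
# Hypothetical sketch of expand_forecast(), not the original code
expand_forecast <- function(...) {
  tsibble(..., index = datetime) %>%  # rebuild the window as a tsibble
    model(ets = ETS(avg_kwh)) %>%     # fit ETS on the data seen so far
    forecast(h = "1 day")             # forecast the next 24 hours
}
```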
---
class: inverse middle center

.animated.bounce[
<img src="img/fable.png" height=280px>
]

## Tidy forecasting

???
How many of you have used the forecast package before? Fable is the tidy replacement for the forecast package.

---

## Why fable?

> 1. It makes forecasting tables.
> 2. A fable is like a forecast: it's never true, but it tells you something important about reality.
>
> **Rob J Hyndman**

???
Why do we call it fable?

---

.left-column[
## The first 30 days on the calendar
]
.right-column[
<img src="figure/calendar-train-1.svg" style="display: block; margin: auto;" />
]

???
Let's look at the data in the first 30 days of January. Each facet gives a daily snapshot of hourly electricity demand. The peak in the late afternoon is driven by the use of air conditioning; January is summertime in Australia. You can see some days, coloured red, have much higher usage, because they are very hot days with maximum temperatures greater than 32 degrees. I'm going to use this subset to forecast the demand one day ahead, and I hold out the data of Jan 31 as the test set.

---

## .blue[mable]: `model()` the data

.small[
```r
elec_mbl <- elec_jan %>%
  model(
*    yesterday = NAIVE(avg_kwh ~ lag("1 day")),
*    ets = ETS(avg_kwh)
  )
```
]

.pull-left[
```
*#> # A mable: 1 x 2
#>   yesterday ets
#>   <model>   <model>
#> 1 <SNAIVE>  <ETS(M,N,M)>
```
]
.pull-right[
.alert[
1. **succinct model representation**
2. **`summarise()` semantics: reduces rows**
]
]

???
Let's model the data. I construct two models for the tsibble with the `model()` verb. The first model I build is a naive model as a benchmark: the naive method simply uses the observed values from yesterday as the forecasts. The second model is ETS, exponential smoothing, which can be thought of as a weighted average of the past values; ETS is also short for error, trend, and seasonality. The model functions use the formula interface. On the LHS, we specify the average supply as the response variable, and on the RHS we can put some specials related to the method. I've specified the naive model to use the 24 values from yesterday instead of the value from the previous hour. If we don't specify the RHS, as with ETS here, it will do automatic model selection, picking the best model for you. Now we get a mable back. A mable is a model table that contains model objects. Each cell shows a succinct model representation, saying that I have a seasonal naive model and an ETS model with its three selected components. Models are a reduced form of the data, so the `model()` function is an analogue of `summarise()`, using the same semantics: it also reduces the data down to a single summary, but that summary happens to be a model object.

---

## Inspect mable

1. `tidy()` model parameters
2. `glance()` a one-row-per-model summary
3. `augment()` with model fits (`.fitted` & `.resid`)

```
#> # A tibble: 26 x 4
#>   .model term  estimate std.error
#>   <chr>  <chr>    <dbl>     <dbl>
#> 1 ets    alpha    0.882        NA
#> 2 ets    gamma    0.118        NA
#> 3 ets    l        0.385        NA
#> 4 ets    s0       0.860        NA
#> # … with 22 more rows
```

???
To look at parameter estimates, information criteria, or residuals from a mable, we use the familiar broom functions: `tidy()`, `glance()`, and `augment()`. Applying the `tidy()` function to the mable gives the parameter estimates for the models we've built. We get a bunch of parameters for ETS, like the alpha and gamma smoothing parameters and the initial state estimates.
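The other two verbs follow the same broom idiom; a quick sketch of how they would be called on this mable:

```r
glance(elec_mbl)   # one row per model: fit statistics such as AIC
augment(elec_mbl)  # observation-level fits: .fitted and .resid columns
```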
---

## .blue[fable]: `forecast()` the future

.pull-left[
```r
elec_fbl <- elec_mbl %>%
  forecast(h = "1 day")
```
]
.pull-right[
.alert[
1. **human-friendly specification**
2. **point forecasts + distribution**
]
]

```
*#> # A fable: 48 x 4 [1h] <UTC>
*#> # Key:     .model [2]
#>   .model    datetime            avg_kwh .distribution
#>   <chr>     <dttm>                <dbl> <dist>
#> 1 yesterday 2013-01-31 00:00:00   0.179 N(0.18, 0.031)
#> 2 yesterday 2013-01-31 01:00:00   0.164 N(0.16, 0.031)
#> 3 yesterday 2013-01-31 02:00:00   0.158 N(0.16, 0.031)
#> 4 yesterday 2013-01-31 03:00:00   0.151 N(0.15, 0.031)
#> # … with 44 more rows
```

???
It's time to forecast. We pipe the mable into `forecast()`, and we're doing a one-day-ahead forecast, equivalent to 24 steps ahead. It supports human-friendly input, so we can read the forecast horizon naturally as "1 day". It's convenient, because we no longer need to mentally compute how many hours/minutes/seconds there are in a day; but you can still write `h = 24`. We're done with the forecast: we have a forecasting table, which is a fable. It's a special tsibble that includes the future predictions. It not only tells you the point forecasts, but also the underlying predictive distribution that captures the uncertainty, because we are forecasters, not fortune tellers. You can see the normal distribution with its mean and standard deviation in the last column, `.distribution`. This is one of my favourite features of fable, reporting distributional forecasts, as you're able to produce prediction intervals at any level you like.

---

.left-column[
## Visualise fable

.large[`geom_forecast()`]

### - Yesterday
]
.right-column[
<img src="figure/vis-naive-1.svg" style="display: block; margin: auto;" />
]

???
We'll see the forecasts more clearly in plots using `geom_forecast()`. The naive method repeats yesterday's pattern, but the 80% and 95% prediction intervals are quite wide, and even go below zero. How about ETS?

---

.left-column[
## Visualise fable

.large[`geom_forecast()`]

### - Yesterday

### - ETS
]
.right-column[
<img src="figure/vis-ets-1.svg" style="display: block; margin: auto;" />
]

???
The ETS model nicely captures the daily trend and produces much narrower prediction intervals.

---

## Assess model performance

```r
accuracy(elec_fbl, elec_jan31)
```

```
#> # A tibble: 2 x 8
#>   .model   Type      ME   RMSE    MAE   MPE  MAPE  ACF1
#>   <chr>    <chr>  <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl>
#> 1 ets      Test  0.0681 0.0866 0.0681  18.5  18.5 0.906
#> 2 yesterd… Test  0.0700 0.0972 0.0700  17.3  17.3 0.942
```

<img src="figure/pred-data-1.svg" style="display: block; margin: auto;" />

???
Which model performs better? Use `accuracy()` to compare the predictions with the test set that I held out before. Looking at the accuracy measures, ETS does slightly better than naive in terms of RMSE, but they both tend to underestimate. The black line in the plot is the actual data; it looks like it's another hot day in January. If we had weather information like temperatures, we could include it to improve the forecasts, but we'd need a different model that allows for exogenous regressors, e.g. ARIMA. So far, I have shown all the steps from model building to model assessment for just one time series. Would it be any different if we had multiple time series in a table?

---

## Models are fundamentally scalable

```
#> # A tsibble: 491,182 x 3 [1h] <UTC>
#> # Key:       customer_id [1,480]
#>   customer_id datetime            general_supply_kwh
#>         <int> <dttm>                            <dbl>
#> 1     8144977 2013-01-01 00:00:00               0.219
#> 2     8144977 2013-01-01 01:00:00               0.169
#> 3     8144977 2013-01-01 02:00:00               0.193
#> 4     8144977 2013-01-01 03:00:00               0.225
#> # … with 4.912e+05 more rows
```

```r
elec_fct <- elec_sub %>%
  model(ets = ETS(log(general_supply_kwh))) %>%
  forecast(h = "1 day")
```

???
No, because models are fundamentally scalable. Tsibble, as a modern reimagining of time series, is designed to host many time series together, and the series were already defined when we created the tsibble. We are obviously interested in forecasting the demand for each household here. This is the subset, including 1,480 households. No extra steps are needed for forecasting: just as before, we pipe the data directly into `model()`, which fits an ETS model to each time series at once, and then happily `forecast()`. I also take a log transformation of the response variable to ensure that I get positive forecasts back; `forecast()` takes care of the back-transformation for you.

---

## Key is the 🔑 to unlock series across tsibble, mable and fable

.pull-left[
.large[
```
#> # A mable: 1,480 x 2
#> # Key:     customer_id [1480]
#>    customer_id ets
#>          <int> <model>
#>  1     8144977 <ETS(A,N,A)>
#>  2     8147121 <ETS(A,N,A)>
#>  3     8147611 <ETS(A,N,N)>
#>  4     8147619 <ETS(A,N,A)>
#>  5     8147667 <ETS(A,N,N)>
#>  6     8147703 <ETS(A,N,N)>
#>  7     8147725 <ETS(A,N,N)>
#>  8     8147769 <ETS(A,N,A)>
#>  9     8147795 <ETS(A,N,A)>
#> 10     8147803 <ETS(A,N,N)>
#> # … with 1,470 more rows
```
]
]
.pull-right[
<img src="figure/batch-plot-1.svg" style="display: block; margin: auto;" />
]

???
You can see 1,480 models have been fitted in the mable. The key variable is always the way to refer to a series across tsibble, mable, and fable. We do not store the historical data in model objects or fables. Models are scalable, but visualisation is not, so I just plot four customers with their forecasts here. At the individual level there is lots of noise, and much wider prediction intervals as well.

---

## Much more in the tidyverts

* Decomposition `STL()`
* Simulation `simulate()`
* Interpolation `interpolate()`
* Refitting and streaming `refit()` & `stream()`

???
I've shown a portion of what tsibble and fable can do. We've also got decomposition, simulation based on model fits, interpolation of missing values, and model support for streaming data. So check them out.

---

## Joint work

.div-middle[
.center[
.portrait[
![](img/di.jpg)
Di Cook
<br>
<i class='fab fa-twitter' style='color:#6CADDE'></i> [@visnut](http://twitter.com/visnut)
]
.portrait[
![](img/rob.jpg)
Rob J Hyndman
<br>
<i class='fab fa-twitter' style='color:#6CADDE'></i> [@robjhyndman](http://twitter.com/robjhyndman)
]
.portrait[
![](https://mitchelloharawild.com/img/circle.jpg)
Mitchell O'Hara‑Wild
<br>
<i class='fab fa-twitter' style='color:#6CADDE'></i> [@mitchoharawild](http://twitter.com/mitchoharawild)
]
]
]

???
This is joint work with Di Cook, Rob J Hyndman, and Mitchell O'Hara-Wild.

---
class: middle center

.card[
[.tidyverts[tidyver.orange[ts].org]](http://tidyverts.org)
]
.card[
![](img/tsibble.png)
[**pkg.earo.me/tsibble**](https://pkg.earo.me/tsibble)
]
.card[
![](img/fable.png)
[**github.com/tidyverts/fable**](https://github.com/tidyverts/fable)
]

<br>

.card[
![](img/slides-title.png)
[**slides.earo.me/rstudioconf19**](https://slides.earo.me/rstudioconf19)
]
.card[
![](img/Octocat.png)
[**github.com/earowang/rstudioconf19**](https://github.com/earowang/rstudioconf19)
]

???
I should mention that tsibble is on CRAN, but fable is on GitHub at the moment. They both belong to tidyverts.org. These are the useful links to the packages, my slides, and the source code behind the slides.