International Symposium on Forecasting 2020

Fanny Riols
10 min read · Nov 30, 2020

The International Symposium on Forecasting (ISF) is the most popular forecasting conference, attracting the world’s leading forecasting researchers, practitioners, and students. It is a combination of keynote speaker presentations, academic sessions, workshops, and social programs. Because of the Covid-19 global pandemic, it was held online this year, with presentations at all times of the day to accommodate attendees worldwide.

In this blog post, I will present some of the most interesting presentations I attended. But first, I would like to give my opinion on online conferences using Zoom and Whova.

  1. In the past few years, I have traveled to international conferences such as ICML or NeurIPS, jumping from one room to another to attend the talks I wanted to hear. When the conference is online, I find it much easier to attend all the talks I am interested in, because you just need to jump from one Zoom link to another. Also, with online presentations, you can easily take screenshots of model architectures, for example, instead of photos of slides.
  2. I think there is still a lot that can be improved about online presentations: presenters’ audio/connection is not always good, recordings often have poor sound quality, and Q&A sessions are often messy because people ask questions on different channels (Zoom chat, Whova chat, live).
  3. This was the first time I attended the ISF. Compared to NeurIPS or ICML, which focus more on research and theory, there were more applied research presentations. And even though there are still many presentations on traditional time series methods, there are more and more on machine learning and deep learning methods.

Let’s dive into some of those presentations.

Forecasting at Scale

One of the biggest challenges in machine learning, and also in forecasting, is being able to scale, meaning forecasting millions of time series with low latency. Ginger Holt and Italo Lima talked about the best practices for forecasting at scale (recording). The most important one is to invest in tooling: tools for hyper-parameter tuning and model selection, and for easily building customizable dashboards.

On that note, Xiaodong Jiang presented a self-supervised learning framework that predicts the best forecasting model and hyper-parameters for any time series data, to reduce the time and resources spent on model selection and hyper-parameter tuning (recording, slides). Based on features of the time series (e.g. length, sparsity ratio, trend, seasonality), a classifier is trained to predict the best model to use for new time series data, and a multi-task neural network is trained to predict that model’s best hyper-parameters.

Application — Fast and accurate forecasts at large scale

Using this framework, they achieved good accuracy much faster than with exhaustive model selection and hyper-parameter tuning (a 75% reduction in runtime). The framework can also be used for other applications, such as ensemble model augmentation or anomaly detection.
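The idea behind this framework can be sketched in a few lines of Python. Everything below is illustrative: the feature set, the labelled examples, and the nearest-neighbour "classifier" are simplified stand-ins for the trained models described in the talk.

```python
import math

def extract_features(series):
    """Compute a small feature vector: length, sparsity ratio, and a crude trend slope."""
    n = len(series)
    sparsity = sum(1 for v in series if v == 0) / n
    # Least-squares slope of value against time index, as a trend proxy.
    mean_t = (n - 1) / 2
    mean_y = sum(series) / n
    cov = sum((t - mean_t) * (y - mean_y) for t, y in enumerate(series))
    var = sum((t - mean_t) ** 2 for t in range(n))
    return [n, sparsity, cov / var]

# Toy "training set": feature vectors labelled with the model that won on them.
labelled = [
    ([100, 0.0, 0.5], "arima"),
    ([100, 0.6, 0.0], "croston"),  # sparse, intermittent demand
    ([100, 0.0, 0.0], "ets"),
]

def predict_best_model(series):
    """1-nearest-neighbour stand-in for the trained classifier."""
    f = extract_features(series)
    return min(labelled, key=lambda fl: math.dist(f, fl[0]))[1]

trending = [0.5 * t for t in range(100)]
print(predict_best_model(trending))  # → arima
```

In the actual framework, a second model (a multi-task neural network) would then predict the hyper-parameters of the selected model; here the label alone illustrates the selection step.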

Chris Fry and Casey Lichtendahl presented a framework implemented at Google (recording). This framework automates many steps of the generation and monitoring of time series forecasts, uses nested cross-validation, runs jobs in parallel on Google’s computing infrastructure, and evaluates point and quantile forecasts with more than 30 metrics in five categories, at various horizons and levels of the hierarchy. Those evaluation metric categories are:

  • Accuracy: How closely a model’s forecasts adhere to the actual realizations of the time series.
  • Bias / Calibration: The degree to which a forecast is an under-prediction or over-prediction of the actual realization of the time series.
  • Error Variance: The spread of point forecasting errors around their mean.
  • Accuracy Risk: The variability in an accuracy metric over multiple forecasts.
  • Stability: The degree to which forecasts remain unchanged subject to minor variations in the underlying data.

Having multiple types of metrics enables trade-offs depending on business application and needs. Each metric is calculated through backtesting and monitored in real-time.

Example of a model selection evaluation report with metrics from the 5 categories
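As a toy illustration, here are three of the five metric categories computed on a tiny example. The exact definitions used in Google’s framework are not public, so these are the standard textbook versions (MAPE for accuracy, mean signed error for bias, and the variance of the errors).

```python
def evaluate(actuals, forecasts):
    """Toy versions of three of the five metric categories."""
    errors = [f - a for a, f in zip(actuals, forecasts)]
    n = len(errors)
    # Accuracy: mean absolute percentage error (MAPE).
    mape = sum(abs(e) / abs(a) for e, a in zip(errors, actuals)) / n
    # Bias: mean signed error; positive means systematic over-prediction.
    bias = sum(errors) / n
    # Error variance: spread of the point-forecast errors around their mean.
    variance = sum((e - bias) ** 2 for e in errors) / n
    return {"mape": mape, "bias": bias, "error_variance": variance}

print(evaluate([100, 110, 120], [98, 112, 121]))
```

The remaining two categories (accuracy risk and stability) compare these quantities across multiple forecasts and re-runs, so they need a backtesting loop rather than a single evaluation call.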

Oskar Triebe presented NeuralProphet: a scalable and extensible time series forecasting package (recording, slides). NeuralProphet is an upgrade of the Prophet model: it is written in PyTorch and can thus handle larger datasets as well as lagged input variables. The model is easy to use in Python, and you can also add events on the fly, such as holidays or sports calendars.

NeuralProphet usage

N-BEATS architecture improvements

The N-BEATS model is now among the state-of-the-art models for time series forecasting. It has been tested on different use cases, and improvements have been proposed.

Ruben Crevits and Jente Van Belle worked on improving stability in demand forecasting (recording). Demand planners often rely on forecasts made at different horizons in the future (e.g. 3- and 6-month forecasts), and good agreement between the forecasts made at these different horizons indicates high forecast stability, which can contribute to the efficiency of supply chain management. They proposed to change the loss function of N-BEATS to minimize the error as well as the forecast instability, by using lookback periods and lagged lookback periods as shown in the figure below. This improved the forecast instability measure while keeping good forecast accuracy on the M4 competition dataset.
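The idea of penalizing instability can be sketched with a toy loss function. The paper’s exact formulation over lookback and lagged lookback windows is more involved; this only conveys the gist of trading accuracy against agreement between consecutive forecast origins.

```python
def stability_loss(y_true, y_hat, y_hat_lagged, lam=0.5):
    """Combined loss: forecast error plus an instability penalty.
    `y_hat` is the forecast from the current window, `y_hat_lagged` the
    overlapping forecast of the same future points made one step earlier;
    `lam` trades accuracy against stability. Illustrative, not the paper's
    exact loss."""
    n = len(y_hat)
    # Accuracy term: mean squared forecast error.
    mse = sum((a - f) ** 2 for a, f in zip(y_true, y_hat)) / n
    # Instability term: disagreement between forecasts from two consecutive origins.
    instab = sum((f - g) ** 2 for f, g in zip(y_hat, y_hat_lagged)) / n
    return mse + lam * instab

print(stability_loss([1.0, 2.0], [1.1, 2.1], [0.9, 1.9]))
```

With `lam=0`, this reduces to the usual MSE training objective; increasing `lam` pushes the network toward forecasts that change less between origins.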

Haoyun Wu presented the work they did at Google to extend the interpretable N-BEATS model implementation (recording). They replaced the polynomial basis functions of the trend blocks with an Exponential Smoothing model’s fitted and forecasted values, and modified the Fourier series of the seasonality blocks to reflect the periods of the seasonal components present in the data. They also added a final set of blocks to extract any remaining pattern in the residuals, through basis functions informed by an auto-regressive process. Finally, they added covariates to the model by building embeddings for categorical covariates and using a deep neural network for temporal covariates. Their model is called STARRY-N (Seasonality Trend AutoRegressive Residual Yeo-Johnson power transformation Neural network). Overall, on the M4 yearly data, their preliminary results show a 0.5% improvement in the SMAPE metric compared to N-BEATS.

Hierarchical forecasting

Hierarchical forecasting was quite trendy at this conference, across many industries (retail, electricity, heat load, …). Daniel Wong presented SCENTS, a model using deep convolutional embeddings for hierarchical non-negative time series forecasting (recording).

Architecture of the SCENTS model

This model, inspired by the N-BEATS architecture, is an ensemble neural network framework that allows covariates to modulate the forecasted variable in a customizable, time-dependent fashion. In the M5 competition, it ranked in the top 1% of entries, by ensembling versions of the model trained on different hierarchical levels of the aggregated data (e.g. per department, per store, per SKU) and then disaggregating to the lowest level.

Behrouz Haji Soleimani explained how hierarchical forecasting ensures consistency and improves overall accuracy, and proposed a constrained variant that can accommodate arbitrary constraints during optimization without impacting computational time, making it applicable to a wider range of industrial applications, including demand forecasting (recording). The variant focuses on reducing the coherence error in the hierarchy, handling non-negativity constraints, shrinking the hierarchy by removing single branches, and testing different weighting schemes, such as volume-based and volume-error weighting, which worked very well for demand forecasting.
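The coherence idea is easy to illustrate: if bottom-level forecasts are pushed through the hierarchy’s summation matrix, every aggregate level is consistent by construction. The sketch below shows only this simplest (bottom-up) case; the presented method reconciles forecasts from all levels under additional constraints and weighting schemes.

```python
def bottom_up(S, bottom):
    """Aggregate bottom-level forecasts through the summation matrix S,
    so every level of the hierarchy is coherent by construction."""
    return [sum(s * b for s, b in zip(row, bottom)) for row in S]

# Tiny hierarchy: total = A + B, with bottom series A and B.
S = [
    [1, 1],  # total
    [1, 0],  # A
    [0, 1],  # B
]
bottom_forecasts = [10.0, 4.0]
print(bottom_up(S, bottom_forecasts))  # [14.0, 10.0, 4.0]
```

Coherence here means the forecast for "total" is exactly the sum of the forecasts for A and B, which independent per-series forecasts would not guarantee.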

Hjörleifur Bergsteinsson proposed to use temporal hierarchies to improve the accuracy of heat load forecasts in district heating (recording). The goal was to figure out whether there is an optimal number of temporal aggregation levels that maximizes the accuracy improvement of the heat load forecasts. Because forecasting at different aggregation levels can give different information (trend, seasonality, etc.), the predictions on different levels need to be connected by sharing information between and within levels. The information is shared by using the precision matrix (estimated from the empirical cross-correlation matrix of the training errors) and the coherency constraints in the summation matrix. Their results show that the best performance was achieved by using all levels of aggregation (1, 2, 3, 4, 6, 8, 12 and 24 hours). Mikkel Lindstrøm Sørensen used the same type of method for wind power forecasting with the same levels of aggregation, and reached similar conclusions (recording).

Finally, Varunraj Valsaraj and Kedar Prabhudesai improved their shipment forecasting model for consumer packaged goods (CPG) companies by training on weekly data instead of daily data, and then disaggregating (recording). In this industry, it is crucial to generate accurate short-term forecasts of order quantities that reflect the realistic demand for products: orders are usually placed 2 to 3 days in advance, and promotional products are planned up to 1 month in advance. Because daily data is very noisy, aggregating at the weekly level reduces fluctuations. They feed a neural network with the forecast of a traditional time series model, product hierarchies and lagged values, and obtained far better results than the naive forecast.

Shipment forecasting model for CPG companies using neural networks and a weekly model
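The aggregate-then-disaggregate step can be sketched as follows, splitting a weekly forecast back into daily values using the average historical share of each weekday. This is one common scheme, not necessarily the exact one used in the talk.

```python
def weekly_to_daily(weekly_forecast, history):
    """Disaggregate a weekly forecast into daily values using the average
    share of each weekday in the historical daily data.
    `history` is a list of full weeks (7 values each)."""
    totals = [sum(week) for week in history]
    # Average share of weekly volume falling on each weekday.
    shares = [
        sum(week[d] / tot for week, tot in zip(history, totals)) / len(history)
        for d in range(7)
    ]
    return [weekly_forecast * s for s in shares]

history = [
    [10, 10, 10, 10, 10, 25, 25],
    [12, 8, 10, 10, 10, 25, 25],
]
daily = weekly_to_daily(110.0, history)
print([round(v, 1) for v in daily])  # [12.1, 9.9, 11.0, 11.0, 11.0, 27.5, 27.5]
```

Since the shares sum to 1, the daily values always add back up to the weekly forecast, so the forecast stays coherent across the two temporal levels.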

Probabilistic forecast

According to Esther Ruiz, for forecasts to be useful, one needs to give the probabilities associated with future events: this is probabilistic forecasting (recording). Popular methods to obtain probabilistic forecasts are based on resampling: bootstrap and subsampling. Esther Ruiz presented growth vulnerability and risk in times of the Covid-19 global pandemic.
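A minimal version of the resampling idea: bootstrap historical residuals around a point forecast to turn it into a prediction interval. This is a deliberately simple sketch of the approach, not the specific methodology from the talk.

```python
import random

def bootstrap_interval(point_forecast, residuals, level=0.8, n_samples=2000, seed=0):
    """Turn a point forecast into an interval forecast by resampling
    historical one-step residuals (a simple bootstrap)."""
    rng = random.Random(seed)
    sims = sorted(point_forecast + rng.choice(residuals) for _ in range(n_samples))
    # Empirical quantiles of the simulated future values.
    lo_idx = int((1 - level) / 2 * n_samples)
    hi_idx = int((1 + level) / 2 * n_samples) - 1
    return sims[lo_idx], sims[hi_idx]

residuals = [-3.0, -1.0, -0.5, 0.0, 0.5, 1.0, 3.0]
low, high = bootstrap_interval(100.0, residuals)
print(low, high)
```

Multi-step versions resample a whole residual path per simulation; subsampling replaces the draw-with-replacement step with blocks of consecutive residuals.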

Fritz Obermeyer and Edwin Ng presented how to do probabilistic forecasting with Pyro (recording). You can find tutorials on https://pyro.ai/time.

Snapshots of how to implement, train and forecast a model with Pyro

Social Media impact on forecasting

Some researchers studied the impact of social media on their forecasting use cases: daily sales prediction for retail and political forecasting.

In the retail industry, Eduardo Soares used Instagram data from 10 big companies in the beauty industry to predict daily sales variation (recording). The Instagram data were the following: likes on company pages, likes on tagged posts, number of comments on the company pages, and post engagements. From his experiments using an LSTM model and an XGBoost model, it seems that likes and comments on other companies’ pages matter more than likes and comments on the company’s own pages. This suggests that users’ engagement on other companies’ Instagram pages has a negative effect on a company’s daily sales. Gabriel Pessanha did a study on the same data, predicting the daily sales variation of a company that uses digital influencers as part of its marketing campaigns (recording). Quantitative data were used, as well as features extracted from posted images. From his experiments, the presence of images on social media increases the number of likes and comments on a post and user engagement, which in turn increases the influence on the consumer purchase decision process.

Regarding the political forecasting use case, Harald Schmidbauer and Angi Roesch presented their work on analyzing social media dynamics during US presidential election campaigns (recording). Using multivariate wavelet methods, they monitored and analyzed hourly media uploads to Instagram and Twitter, and related properties obtained in terms of wavelet coherence to election forecasts. They showed that in 2016, Trump supporters and Clinton opponents were very fast at uploading media, while Trump opponents and Clinton supporters were less energetic. These findings contradicted most election forecasts published in 2016, but were in line with the election results. After doing the same analysis for the 2020 US elections, they forecasted that Biden would win, as he did.

The M5 Competition

Of course, I cannot talk about the ISF without talking about the M5 competition (recording). The M Competitions aim at identifying the most appropriate methods for different types of situations requiring predictions and uncertainty estimates. The M5 competition introduced the concept of hierarchy in the data, as well as explanatory variables such as price and promotions. And for the first time, it focused on series that display intermittency, i.e., sporadic demand including zeros. It was hosted on Kaggle, and the data, benchmarks and submissions made to the M5 forecasting competition are available here. Key results and learnings:

  • Only 35% of the participants did better than the sNaive baseline model.
  • In the top 5 winners, the models used were LightGBM, LSTM, traditional forecasting models and a combination of all those.
  • The best submissions were done using a combination of different ML models.
  • LSTM models are more and more used, especially to estimate the uncertainty distribution, either directly or through residual sampling.
  • Hierarchical forecasting models were obviously used a lot with this hierarchical sales data. All winners considered information originating from different aggregation levels.

Accuracy per aggregation level of the top 50 performing teams
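For reference, the sNaive baseline that only 35% of participants managed to beat is trivial to implement: it simply repeats the last observed seasonal cycle.

```python
def snaive(history, season_length, horizon):
    """Seasonal naive forecast: repeat the last observed seasonal cycle.
    For the daily M5 sales data, the season length would be 7."""
    last_cycle = history[-season_length:]
    return [last_cycle[h % season_length] for h in range(horizon)]

sales = [3, 4, 5, 6, 7, 9, 8] * 4  # four weeks of daily sales
print(snaive(sales, 7, 10))  # [3, 4, 5, 6, 7, 9, 8, 3, 4, 5]
```

That such a simple baseline beat most submissions is a recurring lesson of the M Competitions: sophisticated models must prove themselves against it.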

To conclude

Overall, I enjoyed this conference a lot. I learned many things in the domain, and now have a better understanding of what people are focusing on, depending on the industry. You can find all the keynotes and practitioner talks on the ISF YouTube channel and session recordings on this YouTube playlist.
