Predictive analytics manual

Predictive Analytics

1 Introduction

Starting with version 9, Board provides a new mathematical engine that helps users with their forecasts.

This tool calculates an automatic forecast from historical data by applying mathematical models that are automatically adjusted to fit that data.

 

In the picture above you can see an example of historical data (in green) forecasted to the future through the Predictive Analytics functionality (in red).

Despite the automatic nature of the Predictive Analytics tool, it remains very flexible and allows the user to refine the forecast by adding information to the forecast scenario. In fact, in addition to the historical time series, the user can feed BEAM with other quantities and parameters that we’ll see in detail later. The system will first determine which model is the best fit for the historical data (learning phase) and then push the model to the future (forecast phase).

 

1.1  Basic Concepts

In this section we’ll list a series of concepts needed to better understand the Analytics Interface and use it in the correct way.

1.1.1         The Flow

When you run a Predictive Analytics scenario, on some source cube, the engine will:

·        Detect the time series;

·        Label each time series as Discontinued, Intermittent or Smooth;

·        Trim the time series removing the zeroes at the beginning of each series;

·        Identify the best model for each series via competition;

·        Identify the outliers;

·        Detect useful covariates, apply exogenous covariates and discard covariates that do not improve the model;

·        Serialize the model for future reuse.

This is the learning part. Once it is completed, the system just applies the model to the future values (forecast horizon) and outputs various indicators: this is the forecasting part.

We will now explain the concepts listed above (time series, outliers, covariates…) one by one. Please note that this whole process is automatic: the user will not see it in action, only its outputs (see below).

1.1.2         Time series

In general, a time series is a list of values over a time entity (day, week, month).

Let’s assume you want to forecast data in a cube (the observed cube) structured by City, Product and Month, but you want to perform the forecast at Region, Product Group and Month level (the structure of the target cube). The time series are all the non-zero combinations of Region and Product Group in the source cube. In other words, if in a layout we put Product Group and Region by row and Month by column, each row will represent a time series.

The number of time series is named granularity.
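As a rough illustration of the idea, the sketch below aggregates detailed rows to the target level and counts the resulting time series. The mapping tables, the sample rows and the function name `build_time_series` are hypothetical and only serve the example; they are not part of the product.

```python
from collections import defaultdict

# Hypothetical mappings from detailed entities to the target-cube level.
CITY_TO_REGION = {"Milan": "North", "Turin": "North", "Naples": "South"}
PRODUCT_TO_GROUP = {"Cone": "Ice cream", "Cup": "Ice cream", "Cola": "Drinks"}


def build_time_series(rows):
    """Aggregate (city, product, month, value) rows to (region, group) series."""
    series = defaultdict(lambda: defaultdict(float))
    for city, product, month, value in rows:
        key = (CITY_TO_REGION[city], PRODUCT_TO_GROUP[product])
        series[key][month] += value
    # Keep only the combinations that hold at least one non-zero value.
    return {
        key: dict(months)
        for key, months in series.items()
        if any(v != 0 for v in months.values())
    }


rows = [
    ("Milan", "Cone", "2014-01", 10.0),
    ("Turin", "Cup", "2014-01", 5.0),
    ("Milan", "Cone", "2014-02", 12.0),
    ("Naples", "Cola", "2014-01", 0.0),  # all-zero combination, not a series
]
series = build_time_series(rows)
granularity = len(series)  # number of time series
```

In this sample only the North / Ice cream combination holds non-zero data, so it is the only time series.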

1.1.3         Time series labelling

There are three types of time series: discontinued, intermittent and smooth.

Discontinued

A time series is discontinued if it has permanently dropped to zero (all the data of the last year is zero).

Intermittent

A time series is intermittent if it’s often zero but has values in some periods. For example, if you are selling yachts, it’s unlikely that you will sell one every month; more probably you will sell yachts twice a year. A series like that is intermittent. Technically, we calculate the median time elapsed (in periods) between two non-zero values in the series; if this value is greater than 1.3, the series is intermittent.

 

Smooth

A series with values in almost every period is defined as smooth. Basically, all the series that are neither intermittent nor discontinued are smooth (the median time in periods elapsed between two non-zero values is not greater than 1.3).
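The labelling rules above can be sketched as follows. This is a minimal illustration, not the engine's code: the function names are hypothetical, and the assumption that "the last year" means the last 12 periods (monthly data) is ours.

```python
from statistics import median


def trim_leading_zeros(values):
    """Drop the zeroes at the beginning of a series."""
    for i, v in enumerate(values):
        if v != 0:
            return values[i:]
    return []


def label_series(values, periods_per_year=12, threshold=1.3):
    """Return 'Discontinued', 'Intermittent' or 'Smooth' for a list of values."""
    # Discontinued: every value of the last year is zero.
    if all(v == 0 for v in values[-periods_per_year:]):
        return "Discontinued"
    # Median number of periods elapsed between two non-zero values.
    positions = [i for i, v in enumerate(values) if v != 0]
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    if gaps and median(gaps) > threshold:
        return "Intermittent"
    return "Smooth"


yachts = [0, 0, 0, 0, 0, 1] * 4      # one sale every six months
ice_cream = [5] * 24                 # sales in every month
old_product = [3] * 12 + [0] * 12    # nothing sold in the last year
```

With these samples, `yachts` is labelled Intermittent (median gap of 6 periods), `ice_cream` Smooth and `old_product` Discontinued.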

1.1.4         Models

Discontinued time series are assumed to remain zero in the future, so the model for this kind of series is just a zero value in every period. Intermittent series are forecast with the Croston-SBA model. Due to the nature of this model, the forecast will be a constant in every future period.
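To show why the intermittent forecast is a constant, here is a textbook sketch of Croston's method with the Syntetos-Boylan Approximation (SBA) correction. The smoothing parameter `alpha` and the initialization are assumptions made for the example; the values actually used by the engine are not documented here.

```python
def croston_sba(values, alpha=0.1):
    """Return the constant per-period forecast of Croston's method with SBA.

    alpha and the initialization (first non-zero demand and its position)
    are illustrative assumptions, not the engine's settings.
    """
    demand = None       # smoothed size of the non-zero demands
    interval = None     # smoothed number of periods between non-zero demands
    periods_since = 1   # periods elapsed since the last non-zero demand
    for v in values:
        if v != 0:
            if demand is None:
                demand, interval = float(v), float(periods_since)
            else:
                demand += alpha * (v - demand)
                interval += alpha * (periods_since - interval)
            periods_since = 1
        else:
            periods_since += 1
    if demand is None:
        return 0.0  # no demand ever observed
    # SBA multiplies the demand rate by (1 - alpha / 2) to reduce bias.
    return (1 - alpha / 2) * demand / interval


rate = croston_sba([0, 0, 4, 0, 0, 4, 0, 0, 4])
```

The method estimates an average demand per period (demand size divided by demand interval), so the same value is used for every future period.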

When it comes to smooth series, things are a little more complicated. The model for smooth time series is named Idsi-ARX and is part of the ARIMA family. The ARIMA model is fitted to the time series through competition: the series is truncated at 0.75 of its length in periods, the first part is used to fit the ARIMA and the remaining part is used as a benchmark. The competition also includes the two naïve predictors: the persistent one (constantly the last value of the series) and the seasonal one (basically the value of the previous year). The model that best fits the benchmark wins the competition and is used to calculate the future values as well, this time using all the data as input. The picture below shows the concept of competition: the green series is modeled with the orange and blue series; the blue one is closer (in terms of mean squared error) to the original series, so it’s chosen over the orange one.

 

Once the model is chosen it is pushed to the future to obtain the forecast.
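The competition mechanism can be sketched with the two naïve predictors only. The Idsi-ARX candidate is left out because its internals are not described in this manual; the function names and the sample data are hypothetical.

```python
def persistent_naive(train, horizon, season=12):
    """Repeat the last observed value."""
    return [train[-1]] * horizon


def seasonal_naive(train, horizon, season=12):
    """Repeat the value observed one season earlier."""
    start = len(train) - season
    return [train[start + (h % season)] for h in range(horizon)]


def mean_squared_error(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def run_competition(series, candidates, split=0.75, season=12):
    """Fit on the first 75% of the series and score on the remaining part."""
    cut = int(len(series) * split)
    train, benchmark = series[:cut], series[cut:]
    scores = {
        name: mean_squared_error(benchmark, model(train, len(benchmark), season))
        for name, model in candidates.items()
    }
    return min(scores, key=scores.get), scores


candidates = {"persistent": persistent_naive, "seasonal": seasonal_naive}
series = list(range(1, 13)) * 2  # two identical years of monthly data
winner, scores = run_competition(series, candidates)
# The winner is then applied to the whole series to forecast the next year.
forecast = candidates[winner](series, 12, 12)
```

On this perfectly seasonal sample the seasonal predictor reproduces the benchmark exactly, so it wins and is pushed to the future.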

1.1.5         Outliers

Anomalous values in the historical data are detected by the system; we call them outliers. A value of a time series is an outlier if its error against the model is more than 3.5 times the standard deviation.
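A minimal sketch of this rule is shown below. It assumes that "the standard deviation" is the standard deviation of the model errors (residuals); the manual does not state this explicitly, and the function name is hypothetical.

```python
from statistics import pstdev


def find_outliers(observed, fitted, k=3.5):
    """Return the positions whose model error exceeds k standard deviations.

    Assumption: the standard deviation is computed on the residuals.
    """
    residuals = [o - f for o, f in zip(observed, fitted)]
    sigma = pstdev(residuals)
    if sigma == 0:
        return []  # the model matches the data exactly
    return [i for i, r in enumerate(residuals) if abs(r) > k * sigma]


observed = [10] * 7 + [50] + [10] * 16  # one anomalous month out of 24
fitted = [10] * 24                      # values produced by the model
outliers = find_outliers(observed, fitted)
```

Here only the eighth period (position 7) is flagged.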

1.1.6         Covariates

A covariate is a time series defined in both the future and the past that is somehow related to the observed time series. For example, if I’m an Easter egg seller, a series defined as 1 during the Easter month and 0 otherwise, in both the future and the past, is a covariate. The system will evaluate the effect that this covariate had on the series and will push it to the future if and only if the covariate is useful: if the error with the covariate is greater than the error without it, the covariate will be discarded. A covariate can be a boolean (like the one in the example above, which is either 1 or 0) or another time series (for example, the average temperature is a covariate when I am observing the ice cream sales time series). You are not forced to set up future values for a covariate: for example, if you know that your Easter egg store was closed during a certain period and this won’t happen in the future, you can just tell the system that something happened during that period and that it won’t happen anymore.

Example 1:

Forecast of ice cream sales with and without the temperature covariate (the forecast period is 2015).

Example 2:

Special price for ice creams during some particular periods (boolean covariate defined in past and future; the forecast period is 2015).

Example 3:

My ice cream shop closed twice in the past for a whole month, but I don’t expect this to happen anymore (boolean covariate defined only in the past). The green series considers the covariate, the blue series doesn’t.

 

You can apply as many covariates as you want; there is no limit.
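The "keep the covariate only if it reduces the error" rule can be illustrated with a deliberately simple stand-in model: a constant level, with or without a one-covariate linear regression, compared on a holdout part of the history. The regression, the holdout split and all names are assumptions made for the example; the engine's own model is Idsi-ARX.

```python
def fit_line(y, x):
    """Ordinary least squares of y on a constant and one covariate x."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx if sxx else 0.0
    return my - slope * mx, slope  # (intercept, slope)


def mean_abs_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def covariate_is_useful(y, x, split=0.75):
    """Keep the covariate only if it lowers the error on the holdout part."""
    cut = int(len(y) * split)
    y_train, y_test = y[:cut], y[cut:]
    level = sum(y_train) / len(y_train)
    intercept, slope = fit_line(y_train, x[:cut])
    err_without = mean_abs_error(y_test, [level] * len(y_test))
    err_with = mean_abs_error(y_test, [intercept + slope * xi for xi in x[cut:]])
    return err_with < err_without


easter = [1 if month % 12 == 3 else 0 for month in range(36)]  # boolean covariate
egg_sales = [100 if flag else 10 for flag in easter]
```

With these samples the Easter flag explains the sales peaks and is kept, while a covariate that is always zero brings no improvement and is discarded.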

1.1.7         Prediction intervals

Given a level of confidence X, a prediction interval is an interval of values in which the future values will fall with probability X.

In other words, if we set a level of confidence of 90% the system will provide a lower value and an upper value for the forecasted periods. Future observed values will be greater than the lower value and less than the upper value with probability 90%.

EXAMPLE:

At the end of 2015 we will observe that 90% of the values have fallen in the blue area of the graph below.
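One common way to build such an interval is sketched below. It assumes normally distributed forecast errors with a constant spread, estimated from the historical residuals; this is a simplification chosen for the example and not necessarily how the engine computes its intervals.

```python
from statistics import NormalDist, pstdev


def prediction_interval(forecast, residuals, confidence=0.90):
    """Return (lower, upper) pairs around each forecast value.

    Assumption: errors are normal with the spread of the past residuals.
    """
    sigma = pstdev(residuals)
    # Two-sided quantile: 90% confidence leaves 5% in each tail.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return [(f - z * sigma, f + z * sigma) for f in forecast]


intervals = prediction_interval([10.0, 12.0], [-1, 1, -1, 1], confidence=0.90)
lower, upper = intervals[0]
```

A higher confidence level gives a larger quantile and therefore a wider interval.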

1.1.8         Reconciliation

When you have multi-versioned cubes, Predictive Analytics handles them by considering each version a separate scenario.

In general, the sum of the forecasts of the most detailed version differs from the forecast of the less detailed one; this means that the target cube is not aligned.

You can choose to leave this cube unaligned or to perform reconciliation.

There are two types of reconciliation: top-down and bottom-up.

·        Top-down: it allocates data of the less detailed version to the most detailed version (similar to split and splat);

·        Bottom-up: it aligns the cube by aggregating the data of the most detailed version into the less detailed ones.
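The two approaches can be sketched for a single period as follows. The proportional allocation used for top-down is an assumption made for the example; the function names and figures are hypothetical.

```python
def bottom_up(detail_forecasts):
    """Replace the aggregated forecast with the sum of the detailed ones."""
    return sum(detail_forecasts.values())


def top_down(total_forecast, detail_forecasts):
    """Allocate the aggregated forecast to the details.

    Assumption: the allocation is proportional to the detailed forecasts,
    falling back to equal shares when they are all zero.
    """
    detail_sum = sum(detail_forecasts.values())
    if detail_sum == 0:
        share = total_forecast / len(detail_forecasts)
        return {key: share for key in detail_forecasts}
    return {
        key: total_forecast * value / detail_sum
        for key, value in detail_forecasts.items()
    }


detail = {"Product A": 30.0, "Product B": 10.0}  # most detailed version
total = 50.0                                     # less detailed version
aligned_detail = top_down(total, detail)
aligned_total = bottom_up(detail)
```

Before reconciliation the details sum to 40 while the total is 50. Top-down rescales the details so that they sum to 50; bottom-up sets the total to 40.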

1.1.9         Error Statistics

The System computes different types of errors.

MAE (Mean Absolute Error): the average of the absolute differences between the observations and their forecasts. This measure is scale dependent.

MAPE (Mean Absolute Percentage Error): the average absolute size of the error, expressed as a percentage of the observations. This measure is scale independent.

MASE (Mean Absolute Scaled Error): the ratio between the MAE of the model and the MAE of the naïve model. It is scale independent and measures how good the forecast has been compared to the naïve one. A MASE greater than 1 indicates that the selected model performed worse than the naïve; a MASE less than 1 indicates that it performed better.

Weighted MASE Overall: This measure is the weighted average of all the MASE indicators of the various time series.
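The definitions above translate into a few lines of code. In this sketch the handling of zero observations in MAPE (they are skipped) and the weights used for the overall MASE are assumptions, since the manual does not specify them.

```python
def mae(actual, forecast):
    """Mean Absolute Error."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def mape(actual, forecast):
    """Mean Absolute Percentage Error (zero observations are skipped)."""
    terms = [abs(a - f) / abs(a) for a, f in zip(actual, forecast) if a != 0]
    return 100 * sum(terms) / len(terms)


def mase(actual, forecast, naive_forecast):
    """MAE of the model divided by the MAE of the naive model."""
    return mae(actual, forecast) / mae(actual, naive_forecast)


def weighted_mase(mases, weights):
    """Weighted average of the MASE of several time series."""
    return sum(m * w for m, w in zip(mases, weights)) / sum(weights)


actual = [100, 200]
model = [110, 180]
naive = [90, 150]
```

For these values the MAE is 15, the MAPE is 10% and the MASE is 0.5, meaning the model's error is half that of the naïve predictor.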

2 Scenario creation

For users who have activated it, Predictive Analytics can be found in the Database tab:

 

The Scenario Configurator will open; check the picture below to understand each of its sections:

 

2.1 Scenario Setup

Here you can select which scenario to edit or run from the drop-down list on the left; you can also duplicate a scenario and rename it. The drop-down list on the right lets you decide whether the system should run the learning phase again on every run or only the forecasting phase.

Please note that on the first run you are forced to run both learning and forecast, so the drop-down list will be grayed out. The only reason to launch a scenario without learning is to improve performance.

2.2 Target cube setting

Here you can select the cube that will contain your forecast data; the cube must have a dense structure. You can create a new cube directly from here or change its structure. You can have multiple versions. If you perform a selection here, only data inside the selection will be taken into account. You can’t have a time selection: the time range will be decided in the next section. You can also decide whether to clear the entire cube before populating it with the forecast or to clear only the current range; use the drop-down list on the right to select the correct option.

The target cube is very important: it is not only the cube that will be filled with data, it also determines what constitutes a time series and at which level data will be read.

 

Example:

If we use a target cube like the one above, any populated combination of Customer and Product in the source data will represent a time series, and the time series detail will be Month.

2.3 Input data and model settings

This section is the most important of the Scenario setup: the accuracy of your forecast mainly depends on the historical data set.

 

In the two drop-down lists you can select which periods will be observed. The default values are:

·        FirstLoadedPeriod: the oldest period holding data in the cube given the selection. Please note that this value is unique and is not evaluated for each time series;

·        LastLoadedPeriod: the newest period holding data in the cube given the selection. Please note that this value is unique and is not evaluated for each time series.

If you think that the data of the last period is not complete, you can exclude it dynamically by flagging the “ignore last period” checkbox.

Forecast Horizon: here you can decide the number of periods you want to forecast; the default value 0 will push your forecast until the end of the time range.

Global Method: You can decide to get the model via competition or force one of the two Naïve predictors.

Confidence: here you decide the level of confidence used to calculate the upper and lower interval.

Reconciliation: select one of the three reconciliation options (no reconciliation, top-down or bottom-up); this is only needed with multi-versioned target cubes.

Positive values: flag this if you want to automatically discard all the models that give a negative result in any period.

Merge observed values: when this setting is on, the historical data will be copied to the cube along with the forecast.

Multi Threading: this flag won’t affect calculation results; it just lets you decide whether or not to use multiple threads for calculations. Flag it if you have performance issues.

 

Let’s move to the source layout.

Here you can decide the cubes and algorithms to use as source data.

Cubes and algorithms can be set up as:

·        Observed: there can be only one observed cube/algorithm; this will be the quantity you are going to forecast.

·        Covariate: there can be as many covariates as you want. Covariates are cubes containing time series with values in both past and future that have some sort of impact on the observed series (for example, you can use the average temperature by month if you are observing ice cream sales). You can also decide the maximum lag of the covariate, that is, the maximum number of periods back and forward that will be influenced by a covariate value in a given period. The system will discard covariates that do not give any benefit to the forecast; if you want to force the use of a covariate, you can do so from the two gears icon.

·        Ignore: if you used a cube only to create an algorithm and you don’t want to use it as a covariate or as observed, you can ignore it.
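One way to picture the maximum lag of a covariate is as a set of shifted copies of the covariate series, one per lag, so that a value in a given period can influence the neighbouring periods. The sketch below is only an interpretation of that setting; the function name and the zero-filling of missing periods are assumptions.

```python
def lagged_covariates(x, max_lag):
    """Return {lag: shifted series} for every lag in -max_lag..max_lag.

    A positive lag moves the covariate forward in time (its effect is
    felt later), a negative lag moves it backward. Gaps are filled with 0.
    """
    n = len(x)
    return {
        lag: [x[t - lag] if 0 <= t - lag < n else 0 for t in range(n)]
        for lag in range(-max_lag, max_lag + 1)
    }


promotion = [0, 0, 1, 0, 0]  # boolean covariate: promotion in the third period
shifted = lagged_covariates(promotion, max_lag=1)
```

With a maximum lag of 1, the promotion in the third period can also affect the second and the fourth periods.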

 

You can preview this layout with the button on the top right. It will put the periods by row and use the cubes/algorithms to show data.

 

2.4 Outputs

Other than the forecast in the target cube, Predictive Analytics outputs a lot of data. You can put this data into a series of cubes if you need it.

Unlike the target cube, the output cubes are not mandatory. Their structure will be the same as the target cube (if they have different structures they will be automatically converted).

IdsiARX.Smooth: This cube will be a slice of the target cube that will contain only the smooth time series.

IdsiARX.Intermittent: This cube will be a slice of the target cube that will contain only the intermittent time series.

IdsiARX.Discontinued: This cube will be a slice of the target cube that will contain only the discontinued time series.

Outliers: this cube will be populated only in the past, with the anomalous values of the various time series.

Interval lower: this cube will contain the lower limit of the forecast, actual values will be greater than the interval lower and less than the interval upper with a probability equal to the confidence level.

Interval upper: this cube will contain the upper limit of the forecast, actual values will be greater than the interval lower and less than the interval upper with a probability equal to the confidence level.

MASE: cube containing the MASE of each time series; the MASE of a period is the MASE of the model against the time series until that period.

MAE: cube containing the MAE of each time series; the MAE of a period is the MAE of the model against the time series until that period.

MAPE: cube containing the MAPE of each time series; the MAPE of a period is the MAPE of the model against the time series until that period.

Holt-Winters: Also known as triple exponential smoothing, it will output the triple exponential smoothing of the series. The alpha, beta and gamma parameters can be set up directly from the interface. Please note that Holt-Winters is the algorithm beneath the forecast time function in the block editor.
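For reference, a textbook additive version of triple exponential smoothing is sketched below, showing the role of the alpha (level), beta (trend) and gamma (seasonality) parameters. Whether the product uses the additive or the multiplicative form, and how it initializes the components, is not stated in this manual, so both choices here are assumptions.

```python
def holt_winters_additive(series, season, alpha, beta, gamma, horizon):
    """Additive triple exponential smoothing; returns `horizon` forecasts."""
    if len(series) < 2 * season:
        raise ValueError("need at least two full seasons of data")
    # Initial level: mean of the first season. Initial trend: average
    # per-period change between the first two seasons.
    level = sum(series[:season]) / season
    trend = (sum(series[season:2 * season]) / season - level) / season
    seasonal = [series[i] - level for i in range(season)]
    for t, value in enumerate(series):
        s = seasonal[t % season]
        previous_level = level
        level = alpha * (value - s) + (1 - alpha) * (level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
        seasonal[t % season] = gamma * (value - level) + (1 - gamma) * s
    n = len(series)
    return [
        level + (h + 1) * trend + seasonal[(n + h) % season]
        for h in range(horizon)
    ]


quarterly = [10, 20, 30, 40] * 3  # three identical years, no trend
next_year = holt_winters_additive(quarterly, season=4, alpha=0.5,
                                  beta=0.5, gamma=0.5, horizon=4)
```

On this trend-free, perfectly seasonal sample the method simply reproduces the seasonal pattern for the following year.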

2.5 Run Results

After every run you will get some statistics about the execution time, the number of time series and the Weighted MASE Overall. You’ll also get a graph that plots the MASE against the number of smooth series.

2.6 Run a scenario via procedure

You can also run a Predictive Analytics scenario from a procedure:

The user can decide whether to run the scenario with the procedure selection or with the scenario selection, and whether to run only the forecast part rather than both learning and forecast.