#MachineLearning #TimeSeries

By Billy Gustave

SeaPort

Business challenge/requirement

SeaPort is the largest operator of Sea Planes across sea shores in Europe. SeaPort doesn't have planes of their own, rather they lease themon a short termbasis based on passenger traffic.

Goal :

  • Forecast demand of passenger traffic.
  • Data: SeaPlaneTravel.csv

Data Exploration

In [1]:
import pandas as pd, numpy as np, matplotlib.pylab as plt, seaborn as sns
%matplotlib inline
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 15, 6

Getting the data and turn Month column into a datetime column

In [2]:
# load data
df = pd.read_csv('SeaPlaneTravel.csv')
df.Month = pd.to_datetime(df.Month)
ts = df.set_index('Month')
ts
Out[2]:
#Passengers
Month
2003-01-01 112
2003-02-01 118
2003-03-01 132
2003-04-01 129
2003-05-01 121
... ...
2015-08-01 606
2015-09-01 508
2015-10-01 461
2015-11-01 390
2015-12-01 432

144 rows × 1 columns

In [3]:
ts.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 144 entries, 2003-01-01 to 2015-12-01
Data columns (total 1 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   #Passengers  144 non-null    int64
dtypes: int64(1)
memory usage: 2.2 KB

No missing values

Data Visualization

from pandas.plotting import register_matplotlib_converters register_matplotlib_converters()
In [4]:
ts.plot()
plt.title('Monthly passenger count from 2003 to 2016.', fontsize=16)
Out[4]:
Text(0.5, 1.0, 'Monthly passenger count from 2003 to 2016.')
In [5]:
plt.fill_between(df.Month, y1=df['#Passengers'], y2=-df['#Passengers'], alpha=0.5, linewidth=2)
plt.hlines(y=0, xmin=np.min(df.Month), xmax=np.max(df.Month), linewidth=.5)
plt.title('Passengers (Two Side View) - Growth graph', fontsize=16)
Out[5]:
Text(0.5, 1.0, 'Passengers (Two Side View) - Growth graph')

From the above graphs we can see the data in trending, hence not stationary

Make Stationary

Bell curve / Normal Distribution test

In [6]:
sns.distplot(ts)
Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x1e741521e88>

Data is not Normally distributed

Mean and Standard deviation test

In [7]:
X = ts.values
split = int(len(X) / 2)
X1, X2 = X[0:split], X[split:]
mean1, mean2 = X1.mean(), X2.mean()
var1, var2 = X1.var(), X2.var()
print('mean1=%f, mean2=%f' % (mean1, mean2))
print('variance1=%f, variance2=%f' % (var1, var2))
mean1=182.902778, mean2=377.694444
variance1=2244.087770, variance2=7367.962191

Mean add std are different, not strick stationary

Dicky Fuller's Test

In [8]:
from statsmodels.tsa.stattools import adfuller, acf, pacf, kpss
print('Results of Dickey-Fuller Test:')
dftest = adfuller(df['#Passengers'], autolag='AIC')
dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
for key,value in dftest[4].items():
    dfoutput['Critical Value (%s)'%key] = value
print(dfoutput)
Results of Dickey-Fuller Test:
Test Statistic                   0.815369
p-value                          0.991880
#Lags Used                      13.000000
Number of Observations Used    130.000000
Critical Value (1%)             -3.481682
Critical Value (5%)             -2.884042
Critical Value (10%)            -2.578770
dtype: float64

P-value is greater than 5%, differencing not stationary
(0.991880 > .05)

KPSS test

In [9]:
print('Results of KPSS Test:')
kpsstest = kpss(df['#Passengers'], regression='c', nlags='auto')
kpss_output = pd.Series(kpsstest[0:3], index=['Test Statistic','p-value','Lags Used'])
for key,value in kpsstest[3].items():
    dfoutput['Critical Value (%s)'%key] = value
print(dfoutput)
Results of KPSS Test:
Test Statistic                   0.815369
p-value                          0.991880
#Lags Used                      13.000000
Number of Observations Used    130.000000
Critical Value (1%)              0.739000
Critical Value (5%)              0.463000
Critical Value (10%)             0.347000
Critical Value (2.5%)            0.574000
dtype: float64
C:\ProgramData\Anaconda3\lib\site-packages\statsmodels\tsa\stattools.py:1685: InterpolationWarning: p-value is smaller than the indicated p-value
  warn("p-value is smaller than the indicated p-value", InterpolationWarning)

Test Statistic is greater than critical value at all confidence intervals
Trend not stationary

Decomposing the data to get residual
Since the data is seasonal, we will set frequency to 12

In [10]:
# since data has a yearly seasonality and monthly report, we will use frequency 12
from statsmodels.tsa.seasonal import seasonal_decompose
result = seasonal_decompose(ts, model='multiplicative', period=12)
ts_12 = result.resid.dropna()
plt.subplot(411)
plt.plot(result.observed, label='Original')
plt.legend(loc='best')
plt.subplot(412)
plt.plot(result.trend, label='Trend')
plt.legend(loc='best')
plt.subplot(413)
plt.plot(result.seasonal,label='Seasonality')
plt.legend(loc='best')
plt.subplot(414)
plt.plot(result.resid, label='Residuals')
plt.legend(loc='best')
plt.tight_layout()
In [11]:
X = ts_12.values
split = int(len(X) / 2)
X1, X2 = X[0:split], X[split:]
mean1, mean2 = X1.mean(), X2.mean()
var1, var2 = X1.var(), X2.var()
print('mean1=%f, mean2=%f' % (mean1, mean2))
print('variance1=%f, variance2=%f' % (var1, var2))
mean1=0.999035, mean2=0.997436
variance1=0.001373, variance2=0.000838
In [12]:
print('Results of Dickey-Fuller Test:')
dftest = adfuller(ts_12, autolag='AIC')
dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
for key,value in dftest[4].items():
    dfoutput['Critical Value (%s)'%key] = value
print(dfoutput)
Results of Dickey-Fuller Test:
Test Statistic                -7.415925e+00
p-value                        6.936029e-11
#Lags Used                     7.000000e+00
Number of Observations Used    1.240000e+02
Critical Value (1%)           -3.484220e+00
Critical Value (5%)           -2.885145e+00
Critical Value (10%)          -2.579359e+00
dtype: float64
In [13]:
print('Results of KPSS Test:')
kpsstest = kpss(ts_12, regression='c', nlags='auto')
kpss_output = pd.Series(kpsstest[0:3], index=['Test Statistic','p-value','Lags Used'])
for key,value in kpsstest[3].items():
    dfoutput['Critical Value (%s)'%key] = value
print(dfoutput)
Results of KPSS Test:
Test Statistic                -7.415925e+00
p-value                        6.936029e-11
#Lags Used                     7.000000e+00
Number of Observations Used    1.240000e+02
Critical Value (1%)            7.390000e-01
Critical Value (5%)            4.630000e-01
Critical Value (10%)           3.470000e-01
Critical Value (2.5%)          5.740000e-01
dtype: float64
C:\ProgramData\Anaconda3\lib\site-packages\statsmodels\tsa\stattools.py:1687: InterpolationWarning: p-value is greater than the indicated p-value
  warn("p-value is greater than the indicated p-value", InterpolationWarning)
In [14]:
sns.distplot(ts_12)
Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x1e7427cdb08>

We can now use the stationary data to get p,d,q values

In [15]:
from statsmodels.tsa.stattools import acf, pacf
lag_acf = acf(ts_12, nlags=5, fft=False)
lag_pacf = pacf(ts_12, nlags=5, method='ols')
In [16]:
from scipy.stats import norm
significance_threshold = (norm.ppf((1+0.95)/2))/np.sqrt(len(ts_12)) # 95%
#Plot ACF: 
plt.subplot(121) 
plt.plot(lag_acf)
plt.axhline(y=0,color='black')
plt.axhline(y=-significance_threshold,linestyle='--',color='red')
plt.axhline(y=significance_threshold,linestyle='--',color='red')
plt.axvline(x=1.8)
plt.title('Autocorrelation Function')
#Plot PACF:
plt.subplot(122)
plt.plot(lag_pacf)
plt.axhline(y=0,color='black')
plt.axhline(y=-significance_threshold,linestyle='--',color='red')
plt.axhline(y=significance_threshold,linestyle='--',color='red')
plt.axvline(x=1.5)
plt.title('Partial Autocorrelation Function')
plt.tight_layout()
arounding those numbers: acf -> 1.8 ~> q = 2 pacf -> 1.5 ~> p = 2 ARIMA aparameters would be ARIMA(2,1,2)

Faster search and Forecasting

Note :

Base on the above result, we can clearly see our data is trending and has seasonality. Using SARIMA is better suited for seasonal data. 'auto_arima' to get best p,d,q values. (it works like a gridsearch for time series, using 'AIC' values to determine best parameters. Lower AIC is better.

In [17]:
from pmdarima.arima import auto_arima
model = auto_arima(ts,start_p=1,start_q=1,start_P=0,max_P=5,max_Q=5,max_D=12,max_order=None,m=12,
                   information_criterion='aic',suppress_warnings=True)
model.aic()
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\externals\six.py:31: FutureWarning: The module is deprecated in version 0.21 and will be removed in version 0.23 since we've dropped support for Python 2.7. Please rely on the official version of six (https://pypi.org/project/six/).
  "(https://pypi.org/project/six/).", FutureWarning)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\externals\joblib\__init__.py:15: FutureWarning: sklearn.externals.joblib is deprecated in 0.21 and will be removed in 0.23. Please import this functionality directly from joblib, which can be installed with: pip install joblib. If this warning is raised when loading pickled models, you may need to re-serialize those models with scikit-learn 0.21+.
  warnings.warn(msg, category=FutureWarning)
Out[17]:
1014.9966905633686

Train-Test-Split

In [18]:
train_size = int(len(ts) * 0.70)
train, test = ts[0:train_size], ts[train_size:len(ts)]

Training, fitting and predicting

In [19]:
model.fit(train)
pred = pd.DataFrame(model.predict(n_periods=len(test)),index = test.index,columns=['Prediction'])
In [20]:
pd.concat([test,pred],axis=1).plot()
Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x1e74dd6a0c8>
In [21]:
pd.concat([ts,pred],axis=1).plot()
Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x1e750c67c48>
In [22]:
from sklearn.metrics import mean_squared_error
# validating the model
error = mean_squared_error(test, pred)
print('\n')
print('Printing Root Mean Squared Error of Predictions...')
print('RMSE: %.6f' % np.sqrt(error))

Printing Root Mean Squared Error of Predictions...
RMSE: 34.419604

Imporvements :

  • Best Practice to do train test split before anything else
  • Compare results with other models and/or methods such as smoothing and differencing
  • Use Time Series Cross Validation for better RMSE
  • Use GridSearch to test for other more parameters such as 'information_criterion'

Contact Me

www.linkedin.com/in/billygustave

billygustave.com