#MachineLearning #TimeSeries
By Billy Gustave
Business challenge/requirement
SeaPort is the largest operator of Sea Planes across sea shores in Europe. SeaPort doesn't have planes of their own; rather, they lease them on a short-term basis based on passenger traffic.
Goal: forecast monthly passenger traffic to guide short-term plane leasing.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.rcParams['figure.figsize'] = 15, 6
Getting the data and turning the Month column into a datetime column
# load data
df = pd.read_csv('SeaPlaneTravel.csv')
df.Month = pd.to_datetime(df.Month)
ts = df.set_index('Month')
ts
ts.info()
No missing values
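A quick check can confirm this against the ts frame loaded above; a minimal sketch (assuming the Month index parses to month starts):
# count missing values per column
print(ts.isnull().sum())
# verify the monthly index has no gaps (an empty result means it is complete)
print(pd.date_range(ts.index.min(), ts.index.max(), freq='MS').difference(ts.index))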
Data Visualization
ts.plot()
plt.title('Monthly passenger count from 2003 to 2016.', fontsize=16)
plt.fill_between(df.Month, y1=df['#Passengers'], y2=-df['#Passengers'], alpha=0.5, linewidth=2)
plt.hlines(y=0, xmin=np.min(df.Month), xmax=np.max(df.Month), linewidth=.5)
plt.title('Passengers (Two Side View) - Growth graph', fontsize=16)
From the above graphs we can see the data is trending, hence not stationary.
Make Stationary
Bell curve / Normal Distribution test
sns.distplot(ts)
The data is not normally distributed.
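To back the visual check with a formal test (an extra step, not in the original flow), a Shapiro-Wilk test could look like this:
from scipy.stats import shapiro
stat, p = shapiro(ts['#Passengers'])
print('Shapiro-Wilk: stat=%.3f, p=%.6f' % (stat, p))  # p < 0.05 -> reject normality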
Mean and variance test
X = ts.values
split = int(len(X) / 2)
X1, X2 = X[0:split], X[split:]
mean1, mean2 = X1.mean(), X2.mean()
var1, var2 = X1.var(), X2.var()
print('mean1=%f, mean2=%f' % (mean1, mean2))
print('variance1=%f, variance2=%f' % (var1, var2))
The means and variances of the two halves differ, so the series is not strictly stationary.
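Since this two-halves check is repeated later on the decomposition residuals, it could be wrapped in a small helper (a sketch; the function name is my own):
def split_mean_var(X, name='series'):
    # compare mean/variance of the two halves as a rough stationarity check
    X = np.asarray(X).ravel()
    half = len(X) // 2
    first, second = X[:half], X[half:]
    print('%s: mean1=%f, mean2=%f' % (name, first.mean(), second.mean()))
    print('%s: var1=%f, var2=%f' % (name, first.var(), second.var()))

split_mean_var(ts.values, 'passengers')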
Dickey-Fuller Test
from statsmodels.tsa.stattools import adfuller, acf, pacf, kpss
print('Results of Dickey-Fuller Test:')
dftest = adfuller(df['#Passengers'], autolag='AIC')
dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
for key, value in dftest[4].items():
    dfoutput['Critical Value (%s)' % key] = value
print(dfoutput)
The p-value is greater than 0.05 (0.991880 > 0.05), so we fail to reject the ADF null hypothesis of a unit root: the series is not stationary.
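Differencing is the usual first remedy for a trending series; a quick sketch of how the test responds to it (this notebook instead works with decomposition residuals below):
# first difference removes the trend; re-run ADF on the differenced series
ts_diff = ts['#Passengers'].diff().dropna()
print('ADF p-value after differencing: %f' % adfuller(ts_diff, autolag='AIC')[1])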
KPSS test
print('Results of KPSS Test:')
kpsstest = kpss(df['#Passengers'], regression='c', nlags='auto')
kpss_output = pd.Series(kpsstest[0:3], index=['Test Statistic','p-value','Lags Used'])
for key, value in kpsstest[3].items():
    kpss_output['Critical Value (%s)' % key] = value
print(kpss_output)
The test statistic is greater than the critical value at all confidence levels, so we reject the KPSS null hypothesis of stationarity: the series is not stationary.
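ADF and KPSS test opposite null hypotheses (ADF: a unit root is present; KPSS: the series is stationary), so reading them together is more informative. A minimal sketch of the combined interpretation:
adf_p = adfuller(df['#Passengers'], autolag='AIC')[1]
kpss_p = kpss(df['#Passengers'], regression='c', nlags='auto')[1]
if adf_p < 0.05 and kpss_p >= 0.05:
    print('Both tests agree: stationary')
elif adf_p >= 0.05 and kpss_p < 0.05:
    print('Both tests agree: non-stationary')
else:
    print('Tests disagree: difference/detrend and re-test')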
Decomposing the data to get the residuals
Since the data has yearly seasonality and monthly observations, we set the period to 12
# yearly seasonality with monthly observations -> period=12
from statsmodels.tsa.seasonal import seasonal_decompose
result = seasonal_decompose(ts, model='multiplicative', period=12)
ts_12 = result.resid.dropna()
plt.subplot(411)
plt.plot(result.observed, label='Original')
plt.legend(loc='best')
plt.subplot(412)
plt.plot(result.trend, label='Trend')
plt.legend(loc='best')
plt.subplot(413)
plt.plot(result.seasonal,label='Seasonality')
plt.legend(loc='best')
plt.subplot(414)
plt.plot(result.resid, label='Residuals')
plt.legend(loc='best')
plt.tight_layout()
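For a multiplicative model the components recombine as observed = trend * seasonal * resid; a quick sanity check (a sketch) confirms the decomposition:
# multiplicative model: observed == trend * seasonal * resid where all are defined
recon = (result.trend * result.seasonal * result.resid).dropna()
print(np.allclose(recon, ts['#Passengers'].reindex(recon.index)))  # expect True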
X = ts_12.values
split = int(len(X) / 2)
X1, X2 = X[0:split], X[split:]
mean1, mean2 = X1.mean(), X2.mean()
var1, var2 = X1.var(), X2.var()
print('mean1=%f, mean2=%f' % (mean1, mean2))
print('variance1=%f, variance2=%f' % (var1, var2))
print('Results of Dickey-Fuller Test:')
dftest = adfuller(ts_12, autolag='AIC')
dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
for key, value in dftest[4].items():
    dfoutput['Critical Value (%s)' % key] = value
print(dfoutput)
print('Results of KPSS Test:')
kpsstest = kpss(ts_12, regression='c', nlags='auto')
kpss_output = pd.Series(kpsstest[0:3], index=['Test Statistic','p-value','Lags Used'])
for key, value in kpsstest[3].items():
    kpss_output['Critical Value (%s)' % key] = value
print(kpss_output)
sns.distplot(ts_12)
We can now use the stationary residuals to estimate the p, d, q values.
from statsmodels.tsa.stattools import acf, pacf
lag_acf = acf(ts_12, nlags=5, fft=False)
lag_pacf = pacf(ts_12, nlags=5, method='ols')
from scipy.stats import norm
significance_threshold = norm.ppf((1 + 0.95) / 2) / np.sqrt(len(ts_12))  # 95% band: z_0.975 / sqrt(N)
#Plot ACF:
plt.subplot(121)
plt.plot(lag_acf)
plt.axhline(y=0,color='black')
plt.axhline(y=-significance_threshold,linestyle='--',color='red')
plt.axhline(y=significance_threshold,linestyle='--',color='red')
plt.axvline(x=1.8)  # marks where the ACF first crosses the significance band (suggests q)
plt.title('Autocorrelation Function')
#Plot PACF:
plt.subplot(122)
plt.plot(lag_pacf)
plt.axhline(y=0,color='black')
plt.axhline(y=-significance_threshold,linestyle='--',color='red')
plt.axhline(y=significance_threshold,linestyle='--',color='red')
plt.axvline(x=1.5)  # marks where the PACF first crosses the significance band (suggests p)
plt.title('Partial Autocorrelation Function')
plt.tight_layout()
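statsmodels can draw the same plots with the confidence band computed automatically; an equivalent sketch:
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
fig, axes = plt.subplots(1, 2, figsize=(15, 4))
plot_acf(ts_12, lags=5, ax=axes[0])                  # band matches the 95% threshold above
plot_pacf(ts_12, lags=5, ax=axes[1], method='ols')
plt.tight_layout()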
Note: Based on the above results, the data is clearly trending and seasonal, so SARIMA is better suited than plain ARIMA. We use auto_arima to find the best p, d, q values; it works like a grid search for time series, using AIC to pick the best parameters (lower AIC is better).
from pmdarima.arima import auto_arima
model = auto_arima(ts, start_p=1, start_q=1, start_P=0, max_P=5, max_Q=5, max_D=12,
                   max_order=None, m=12, information_criterion='aic', suppress_warnings=True)
model.aic()
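To see which orders auto_arima settled on (an optional inspection step), pmdarima exposes the fitted specification:
print(model.order, model.seasonal_order)  # selected (p,d,q) and (P,D,Q,m)
print(model.summary())                    # full SARIMAX fit report, including AIC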
Train-Test-Split
train_size = int(len(ts) * 0.70)
train, test = ts[0:train_size], ts[train_size:len(ts)]
Training, fitting and predicting
model.fit(train)
pred = pd.DataFrame(model.predict(n_periods=len(test)),index = test.index,columns=['Prediction'])
pd.concat([test,pred],axis=1).plot()
pd.concat([ts,pred],axis=1).plot()
from sklearn.metrics import mean_squared_error
# validating the model
error = mean_squared_error(test, pred)
print('\n')
print('Printing Root Mean Squared Error of Predictions...')
print('RMSE: %.6f' % np.sqrt(error))
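RMSE is in raw passenger units; a scale-free companion metric such as MAPE can make the error easier to read (a sketch, computed by hand to avoid version-specific sklearn imports):
# mean absolute percentage error of the same predictions
mape = np.mean(np.abs((test.values - pred.values) / test.values)) * 100
print('MAPE: %.2f%%' % mape)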
Improvements: