####Common Time Series Analyses R Code####
###Compiled by Michael Varnum & Igor Grossmann###

###Overview: The code below is designed to allow you to assess whether two (or more) sets of time series data are related and to generate forecasts for future values. Below you will find several ways of assessing whether or not time series are related, including a variety of ways to correct for potentially spurious associations due to temporal autocorrelation. We recommend that you use at least one method to deal with autocorrelation, be it correcting the significance threshold, removing the linear trend, or modeling it. We also recommend that you generate concrete forecasts for future data points.###

###Download then load these libraries first.###
install.packages("forecast")
install.packages("TSA")
install.packages("lmtest")
library(forecast)
library(TSA)
library(lmtest)

###Enter data and create a dataframe. Make sure to separate data points with commas. Name the variables whatever you like. If you have data at a finer grain than annual, you will want to use a different structure than the one below. NB: you need the same number of data points in each time series, with the same start and end dates.###
X <- c(,)
Y <- c(,)
Year <- c(,)
Z.data <- data.frame(X, Y, Year)

###Correlation between X and Y. Kendall's tau is often preferred to Pearson, as time series data will often not be normally distributed.###
cor.test(X, Y, method="pearson")
cor.test(X, Y, method="kendall")

###Autocorrelation function. This tells you the extent to which autocorrelation is present in your time series. Typically first-order autocorrelation (i.e., at a lag of 1) is what we are interested in.###
X.acf <- acf(X)
X.acf
Y.acf <- acf(Y)
Y.acf

###Tiokhin-Hruschka correction for the p-value based on observed autocorrelation. The threshold is for a Pearson correlation; plug in the observed first-order autocorrelation and n for each time series based on the ACF results. The code below yields corrected critical values for the correlation at alpha = .05, but you can adjust the quantiles if you wish a different threshold.###
simnum <- 10000
simul <- matrix(nrow=simnum, ncol=1, 0)
for (i in 1:simnum){
  ar.sim <- arima.sim(model=list(ar=c(.)), n=)
  ar.sim2 <- arima.sim(model=list(ar=c(.)), n=)
  simul[i] <- cor(ar.sim, ar.sim2)
}
hist(simul)
quantile(simul, c(0.025, 0.975))
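###A worked version of the simulation above (a sketch, not part of the original script): the AR coefficients .6 and .4 and n = 50 are hypothetical stand-ins for the lag-1 autocorrelations and series lengths from your own ACF results.###
set.seed(1)
simnum <- 10000
simul <- numeric(simnum)
for (i in 1:simnum){
  ar.sim <- arima.sim(model=list(ar=c(0.6)), n=50)   # hypothetical lag-1 autocorrelation of X
  ar.sim2 <- arima.sim(model=list(ar=c(0.4)), n=50)  # hypothetical lag-1 autocorrelation of Y
  simul[i] <- cor(ar.sim, ar.sim2)
}
quantile(simul, c(0.025, 0.975))  # critical values for r at alpha = .05; observed correlations outside this range are significant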
###Cross-correlation function. This tests correlations between X and Y at various time lags. If a correlation exceeds the confidence intervals, the correlation is significant. Values to the left of "0" on the x-axis mean a change in variable X precedes a change in variable Y; values to the right mean the opposite. NB: this test assumes at least one of the time series is stationary, i.e., does not have a high degree of first-order autocorrelation. You can run CCFs after detrending the data if first-order autocorrelation is high for both time series.###
ccf(X, Y, lag.max=30, main="CCF X & Y", ylab="Correlation", xlab="Lag (Years)", col="black")

###Detrending options. Autocorrelation is a potentially serious problem when dealing with time series. There are several ways to address it aside from adjusting significance thresholds as in the Tiokhin-Hruschka method. Below are three common methods to attempt to detrend your data.###

###Residualize variables on time in order to detrend.###
Z.data$X_detrend <- lm(X~Year, data=Z.data)$residuals
Z.data$Y_detrend <- lm(Y~Year, data=Z.data)$residuals

###Correlation between detrended variables residualized by year.###
cor.test(Z.data$X_detrend, Z.data$Y_detrend, method="pearson")
cor.test(Z.data$X_detrend, Z.data$Y_detrend, method="kendall")

###Difference variables to detrend. You can change the degree of differencing by setting "differences =" to a number > 1.###
X_diff <- diff(X, differences=1)
Y_diff <- diff(Y, differences=1)

###Correlation between differenced variables.###
cor.test(X_diff, Y_diff, method="pearson")
cor.test(X_diff, Y_diff, method="kendall")

###Prewhitening. This technique turns the time series into data with the properties of white noise. By default, both time series are filtered with the autoregressive component of the X time series. You can adjust the filtering if you wish by specifying a different model. This code will also yield a CCF of the prewhitened time series.###
ccf.white <- prewhiten(X, Y, lag.max=30, main="Pre-Whitened CCF X & Y", ylab="Correlation", xlab="Lag (Years)", col="black")

###Extract correlation coefficients from the prewhitened CCF. NB: they are actual bivariate correlations between X and Y despite the misleading label that comes with the output.###
ccf.white$ccf

###Model-fitting approach. An alternative approach to establishing relationships in time series data is through model fitting, which also provides a different set of tools for dealing with temporal autocorrelation.###

###auto.arima (autoregressive integrated moving average) models for Y as the DV and X as an exogenous predictor, and vice versa. auto.arima is an algorithm that runs a number of different ARIMA models (including models that contain autoregressive components and/or difference the DV) and compares model fit, yielding a final model that provides the best fit to the data. If your xreg variable is included in the final model, it means it improved the model, and its relationship to the DV is not likely spurious. This approach is based primarily on model fitting rather than null hypothesis significance testing (NHST).###
auto.arima(Y, xreg=X)
auto.arima(X, xreg=Y)

###Test auto.arima coefficients for statistical significance. It is possible to get p-values for coefficients in the final auto.arima model if you wish.###
fit <- auto.arima(Y, xreg=X)
coeftest(fit)

###auto.arima with multiple exogenous predictors. Make sure you have three (or however many) sets of time series data, all with the same number of data points and the same start and end points. Let's call the second exogenous predictor "Z". Now you can see whether the algorithm includes either or both exogenous variables in the final model and compare the relative strength of the two coefficients if both are included. You can also get p-values for the coefficients if you wish.###
Z <- c(,)
auto.arima(Y, xreg=cbind(X, Z))
fit <- auto.arima(Y, xreg=cbind(X, Z))
coeftest(fit)
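###A minimal sketch (not part of the original script) of the model-fitting approach on simulated data, so you can see the auto.arima and coeftest output. The .sim names, AR coefficients, and effect size are hypothetical; substitute your own X and Y.###
library(forecast)
library(lmtest)
set.seed(123)
X.sim <- arima.sim(model=list(ar=0.5), n=60)                           # hypothetical autocorrelated predictor
Y.sim <- 0.4*as.numeric(X.sim) + arima.sim(model=list(ar=0.5), n=60)   # Y constructed to depend on X
fit.sim <- auto.arima(Y.sim, xreg=as.numeric(X.sim))
fit.sim            # the xreg coefficient appears in the final model if X improves fit
coeftest(fit.sim)  # p-value for the xreg coefficient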
###Forecasting. One of the strongest tests of a model with time series data is its ability to forecast the future. The code below enables you to forecast using auto.arima. Re-enter your data here if starting from this section.###
Y <- c(,)
X <- c(,)

###Below, specify the start date of the time series and its frequency. If you have annual data, replace XXXX with the first year in the data set and set frequency = 1. If you have monthly data, set XXXX and x to the first year and month in your data and set frequency = 12; if quarterly, set XXXX and x to the first year and quarter and set frequency = 4; etc.###
Y.ts <- ts(Y, start=c(XXXX, x), frequency=XX)
Y.ts
X.ts <- ts(X, start=c(XXXX, x), frequency=XX)
X.ts

###Train-test with ARIMA. 80/20 split the data, then train and test the ARIMA model. h2 is the number of data points held out for the test, so replace XX with the number of data points corresponding to 20% of the data.###
h2 <- XX
train <- head(Y.ts, round(length(Y.ts) - h2))
test <- tail(Y.ts, h2)
trainData <- train
testData <- test
arimaMod <- auto.arima(trainData, stepwise=FALSE, approximation=FALSE)
arimaMod.Fr <- forecast(arimaMod, h=h2)
plot(arimaMod.Fr)
lines(testData, col="red")
legend("topleft", lty=1, bty="n", col=c("red", "blue"), c("testData", "ARIMAPred"))
accuracy(arimaMod.Fr, testData)

###Predicted point values for the test set based on the forecast from the training set.###
arimaMod.Fr$mean

###Plot the training set, the test set, and a forecast based on auto.arima of the training set into the future. Set the first XX to the number of data points corresponding to 20% of the data. Set the second XX to the number of hold-out data points plus the number of future values you wish to forecast. This will give you a figure showing the training set, the predicted values for the test set and future values, as well as the values observed for the test set.###
h2 <- XX
train <- head(Y.ts, round(length(Y.ts) - h2))
test <- tail(Y.ts, h2)
trainData <- train
testData <- test
arimaMod <- auto.arima(trainData, stepwise=FALSE, approximation=FALSE)
arimaMod.Fr <- forecast(arimaMod, h=XX)
plot(arimaMod.Fr)
lines(testData, col="red")
legend("topleft", lty=1, bty="n", col=c("red", "blue"), c("testData", "ARIMAPred"))

###Predicted point values for Y for the test set and the future, based on the forecast from the training set.###
arimaMod.Fr$mean

###Forecasting future values for Y using auto.arima with an exogenous predictor.###
###First run auto.arima on your exogenous predictor. Next, forecast future values for that exogenous predictor, where XX is set to the number of future data points you wish to forecast.###
EX.arima <- auto.arima(X)
EX.arimacast <- forecast(EX.arima, h=XX)

###Next, generate a forecast for future values of Y using a model trained on the prior data, with the forecast future values of X supplied as the exogenous predictor. Note that "fit" is the model trained on the prior data; it must include X as an exogenous regressor for the forecast to use it (here an auto.arima for Y with xreg = X, but you can specify another model). The forecast horizon equals the number of forecast values of X supplied to xreg.###
fit <- auto.arima(Y, xreg=X)
Y.arimacast <- forecast(fit, xreg=EX.arimacast$mean)

###Plot past values and forecast future values for Y based on auto.arima with the exogenous predictor. The points() call overlays the observed test set from the earlier train-test split, if you created one.###
plot(Y.arimacast, xlab="PREDICTOR NAME - TYPICALLY YEAR", ylab="DV VALUE")
points(testData, col="red", pch=19)
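###A minimal end-to-end sketch (not part of the original script) of the exogenous-predictor forecast above, run on simulated data so the pieces can be seen working together. All .demo names, dates, and parameter values are hypothetical; with real data, start from the X and Y entered earlier.###
library(forecast)
set.seed(42)
X.demo <- ts(arima.sim(model=list(ar=0.7), n=50), start=1970, frequency=1)  # hypothetical annual predictor
Y.demo <- ts(0.5*as.numeric(X.demo) + arima.sim(model=list(ar=0.5), n=50), start=1970, frequency=1)
EX.demo <- auto.arima(X.demo)
EX.demo.cast <- forecast(EX.demo, h=10)                  # forecast the predictor 10 years ahead
fit.demo <- auto.arima(Y.demo, xreg=as.numeric(X.demo))  # model for Y with X as exogenous regressor
Y.demo.cast <- forecast(fit.demo, xreg=as.numeric(EX.demo.cast$mean))  # horizon = 10, set by xreg length
plot(Y.demo.cast, xlab="Year", ylab="Y")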