In a study recently published in Environmental Research: Ecology, researchers used vector autoregression and Granger causality testing to uncover how tide influences biogeochemistry in a salt marsh off the coast of Georgia, USA. Combining measurements from a low-cost dissolved CO2 (pCO2) sensor platform with tide gage observations throughout the course of one month allowed scientists to detect the influence of tides and calculate the amount of pCO2 leaving the ecosystem through marsh surface water. Vector autoregression and Granger causality testing helped reveal the control of tide height over other biogeochemical factors and identify causative (not just correlative) relationships among them, including those with lagged responses.
Results from the study
Small differences in pCO2 sensor accuracy or tide gage location led to large differences in the estimated amount of pCO2 leaving marsh surface water via the tide throughout the study. Although pCO2, turbidity, salinity, and water temperature all fluctuated with the tide, the influence of tide height on pCO2 was strongest. Turbidity and salinity demonstrated very weak tidal signals in modeled and observed data, respectively. Water temperature demonstrated no tidal signal in either modeled or observed data. The vector autoregression model was very poor at explaining changes in turbidity (i.e., water cloudiness).
How Granger causality works
Granger causality identifies variables whose past values help forecast one another. Put simply, if past values of variables A, B, and C predict future values of A better than past values of A and B alone, then C contains unique information about A. In that case, variable C is said to Granger-cause A.
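The comparison described above can be sketched as two least-squares regressions, one with and one without the candidate variable, scored with an F statistic. This is a minimal illustration on synthetic data, not the study's code; the function names (`lagged_design`, `granger_f_stat`) and the simulated series `a`, `b`, and `c` are assumptions made up for the example.

```python
import numpy as np

def lagged_design(series_list, p):
    """Design matrix with an intercept and lags 1..p of every series."""
    n = len(series_list[0])
    cols = [np.ones(n - p)]
    for s in series_list:
        for k in range(1, p + 1):
            cols.append(s[p - k : n - k])  # value k steps before each target row
    return np.column_stack(cols)

def residual_sum_of_squares(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def granger_f_stat(target, others, candidate, p):
    """F statistic for whether lags of `candidate` improve forecasts of `target`
    beyond lags of `target` and `others`."""
    y = target[p:]
    X_restricted = lagged_design([target, *others], p)
    X_full = lagged_design([target, *others, candidate], p)
    rss_r = residual_sum_of_squares(X_restricted, y)
    rss_f = residual_sum_of_squares(X_full, y)
    dof = len(y) - X_full.shape[1]
    return ((rss_r - rss_f) / p) / (rss_f / dof)

# Synthetic example (hypothetical data): C drives A with a one-step lag, B is noise.
rng = np.random.default_rng(0)
n = 500
c = rng.normal(size=n)
b = rng.normal(size=n)
a = np.zeros(n)
for t in range(1, n):
    a[t] = 0.5 * a[t - 1] + 0.8 * c[t - 1] + rng.normal(scale=0.5)

f_c = granger_f_stat(a, [b], c, p=2)  # does C Granger-cause A?
f_b = granger_f_stat(a, [c], b, p=2)  # does B Granger-cause A?
print(f"F for C -> A: {f_c:.1f}, F for B -> A: {f_b:.1f}")
```

A large F statistic for C and a small one for B is the pattern expected here. To get a p-value, the statistic would be compared against an F distribution with `p` and `dof` degrees of freedom.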
First, the number of lags to be used in the model is determined using the Akaike information criterion. Then, the original dataset is fit with a multivariate vector autoregressive model (meaning all variables are included), which predicts each variable from the lagged values of every variable. Lastly, the Granger causality test is performed on each possible combination of variables to determine which variables are significant Granger-causes of others (i.e., p-value is less than 0.05).
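The first two steps, choosing the lag order by the Akaike information criterion and fitting the vector autoregression, can be sketched with ordinary least squares. This is an illustration under stated assumptions rather than the study's workflow: the helper names (`fit_var`, `var_aic`, `select_lag`), the maximum lag of 6, and the two-variable synthetic dataset are all hypothetical.

```python
import numpy as np

def fit_var(data, p):
    """Least-squares fit of a VAR(p). `data` has shape (n_obs, n_vars)."""
    n = data.shape[0]
    Y = data[p:]
    lag_blocks = [data[p - lag : n - lag] for lag in range(1, p + 1)]
    X = np.column_stack([np.ones(n - p)] + lag_blocks)
    coefs, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return coefs, Y - X @ coefs

def var_aic(data, p, max_lag):
    """AIC = ln det(residual covariance) + 2 * (number of coefficients) / T."""
    trimmed = data[max_lag - p :]  # every order is scored on the same T rows
    coefs, resid = fit_var(trimmed, p)
    T = resid.shape[0]
    sigma = resid.T @ resid / T
    _, logdet = np.linalg.slogdet(sigma)
    return logdet + 2 * coefs.size / T

def select_lag(data, max_lag):
    scores = {p: var_aic(data, p, max_lag) for p in range(1, max_lag + 1)}
    return min(scores, key=scores.get), scores

# Synthetic two-variable system (hypothetical) whose true order is 2.
rng = np.random.default_rng(1)
n = 600
x = np.zeros((n, 2))
for t in range(2, n):
    x[t, 0] = 0.4 * x[t - 1, 0] - 0.5 * x[t - 2, 0] + rng.normal()
    x[t, 1] = 0.3 * x[t - 1, 1] + 0.6 * x[t - 2, 0] + rng.normal()

best_lag, aic_scores = select_lag(x, max_lag=6)
coefs, resid = fit_var(x, best_lag)
print("lag chosen by AIC:", best_lag)
```

The Akaike information criterion can favor a slightly larger order than the true one, so the chosen lag is best read as a reasonable setting rather than an exact answer. The pairwise Granger tests would then be run at that lag.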
Spectral Granger causality is a function of the covariance matrix of model residuals, the conjugate transpose of a matrix (i.e., the matrix is transposed and the imaginary part of each complex entry changes sign), and the power spectrum of each variable at differing frequencies. In this way, Granger causality testing uses a fast-Fourier-like transform of the autocovariance matrix to rapidly simplify large datasets.
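For a pair of variables, one widely used formulation of this quantity is Geweke's decomposition. It is shown here only as a sketch and may differ in detail from what the study computed. The symbols are notation introduced for illustration: H is the transfer matrix of the fitted vector autoregression, Σ is the residual covariance matrix, and S is the spectral matrix.

```latex
% Spectral matrix of the fitted VAR: transfer matrix H, residual covariance \Sigma,
% and H^{*}, the conjugate transpose of H.
S(\omega) = H(\omega)\,\Sigma\,H^{*}(\omega)

% Granger causality from y to x at frequency \omega (bivariate case).
f_{y \to x}(\omega) = \ln \frac{S_{xx}(\omega)}
  {S_{xx}(\omega) - \left(\Sigma_{yy} - \dfrac{\Sigma_{xy}^{2}}{\Sigma_{xx}}\right)
   \left|H_{xy}(\omega)\right|^{2}}
```

The value is zero at frequencies where y contributes nothing to the power of x, and it grows as more of that power is attributable to y.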
Limitations of Granger causality testing
Using too many predictor variables reduces the unique contribution of each variable and can falsely suggest that nothing is a true Granger-cause. Care must be taken to select only predictor variables that already have proven physical relationships with each other. Data used in Granger causality testing should be stationary (i.e., their mean and variance should not vary across time), and the dataset should not be singular (i.e., no variable can duplicate, or be fully determined by, another). For example, in the previously mentioned study, one of the original datasets contained both moon phase and tide height as predictor variables of pCO2. Because moon phase is strongly linked to tide height, it created a singular matrix that could not be tested for Granger causality. Moon phase therefore had to be removed as a predictor.
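Both requirements can be screened before any model is fit. The sketch below flags nearly duplicate predictors by their correlation and uses a crude split-half comparison as a first look at stationarity. The function names, the 0.98 threshold, and the synthetic series (including `tide_copy`, an exact linear function of tide height) are assumptions for illustration, and a formal test such as the augmented Dickey-Fuller test would normally replace the split-half screen.

```python
import numpy as np

def flag_collinear_pairs(data, names, threshold=0.98):
    """Return pairs of columns whose absolute correlation exceeds `threshold`."""
    corr = np.corrcoef(data, rowvar=False)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) > threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs

def split_half_drift(series):
    """Crude stationarity screen: shift in mean between the two halves of the
    record, in units of the overall standard deviation."""
    half = len(series) // 2
    return float(abs(series[half:].mean() - series[:half].mean()) / series.std())

# Synthetic hourly record (hypothetical values, not data from the study).
rng = np.random.default_rng(2)
n = 1000
hours = np.arange(n)
tide_height = np.sin(2 * np.pi * hours / 12.42)      # semidiurnal cycle
tide_copy = 2.0 * tide_height + 1.0                  # fully determined by tide height
pco2_like = 0.5 * tide_height + rng.normal(size=n)   # related, but not a duplicate

names = ["tide_height", "tide_copy", "pco2_like"]
data = np.column_stack([tide_height, tide_copy, pco2_like])
flagged = flag_collinear_pairs(data, names)
print("drop one variable from each pair:", flagged)

steady = rng.normal(size=n)
trending = 0.01 * hours + rng.normal(size=n)         # mean changes over time
drift_steady = split_half_drift(steady)
drift_trending = split_half_drift(trending)
print(f"drift, steady: {drift_steady:.2f}; trending: {drift_trending:.2f}")
```

A flagged pair plays the same role as moon phase and tide height in the example above: one of the two has to be dropped. A series with a large drift score would typically be detrended or differenced before testing.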
Causality-guided machine learning
Machine learning models allow scientists to study causal relationships in a way they were not able to previously. A study recently published in Agricultural and Forest Meteorology considered causal effects in a machine learning model to improve simulated methane emissions from wetlands around the globe. By controlling the data that the model learned or forgot, the machine learning model was able to predict the lagged influence of complex environmental and biological factors on wetland methane emissions.
In the study, directed by a scientist at Lawrence Berkeley National Laboratory, soil temperature was the most important contributor to methane emissions across various wetland types, even more so than air temperature. Soil temperature was also more relevant to methane emissions in wet tundra environments than in fens, bogs, and marshes. Methane emissions from bogs, fens, and marshes were more sensitive to ecosystem respiration and gross primary production. Both the machine learning model and the causal-based machine learning model predicted higher methane emissions in response to soil warming, although results from the causal-based machine learning model were four times higher.
Limitations of machine learning and modeling time-lagged responses
Some statisticians believe that lagged variables should never be used in mixed models (which account for both random and non-random effects) or even in standard fixed-effects models, as they can create severe bias by relying too much on the lagged variable and too little on the additional, non-lagged variables. Others argue that not including lagged dependent variables can produce unreliable results when the situation calls for them. Similarly, authors from the aforementioned machine learning study cautioned that sometimes the black-box nature of machine learning models can cause them to produce the right results for the wrong reasons.
- Correlation does not equal causation! Just because one variable appears to be related to another does not mean that it is directly causing it (e.g., tide height and turbidity).
- Studies that consider the unique analytical requirements of data with multi-driver dependency, nonlinearity, and time-lagged responses are essential for understanding ecosystem dynamics.
- Granger causality combined with vector autoregression, causality-guided machine learning, and other methods are becoming more popular statistical techniques in the world of biogeochemistry (for good reason – see point above!).
- Determining drivers of wetland gas concentrations and their limitations can guide decisions of environmental managers, policymakers, researchers, and others interested in wetland carbon storage.