Exploratory Spatial Data Analysis (ESDA): Spatial Autocorrelation

Alvaro Matsuda
10 min read · Mar 8, 2023


Introduction

In data science, EDA (Exploratory Data Analysis) is a fundamental step of any project. It gives you a better understanding of the data you are working with, helps you identify outliers, and lets you extract insights through hypothesis testing. However, the methods applied in a regular EDA do not consider the spatial component of the data.

ESDA (Exploratory Spatial Data Analysis) answers questions that a regular EDA cannot. For example, with ESDA we can tell whether a variable (a phenomenon) occurs at random in space or follows a spatial pattern. We can also identify whether a variable is spatially clustered and, if so, where the clusters are.

Spatial autocorrelation analysis examines whether a given variable (phenomenon) occurs at random in space or whether its location carries information about its value.

Spatial autocorrelation can be defined as the “absence of spatial randomness”. (Sergio J. Rey, Dani Arribas-Bel and Levi J. Wolf.)

There are two types of spatial autocorrelation:

  • Global spatial autocorrelation
  • Local spatial autocorrelation

I will go into more detail on both global and local spatial autocorrelation and show how to compute them with code examples. You can download the notebook with the code here to follow along.

But before we get into the how, let’s get familiar with the dataset we are going to work with.

The dataset

We are going to use the census tracts of São Paulo city, Brazil. The dataset contains the population count and the total monthly income of each tract.

Before we take a look at the dataset, let’s import the necessary libraries:

# Importing Libraries

import pandas as pd
import geopandas as gpd
import numpy as np
from pysal.lib import weights
from splot.libpysal import plot_spatial_weights
from esda.moran import Moran, Moran_Local
from splot.esda import moran_scatterplot, plot_local_autocorrelation, lisa_cluster
import matplotlib.pyplot as plt
import folium
import matplotlib.colors as colors


import warnings
warnings.filterwarnings("ignore")
# Loading data
census_tract_sp = gpd.GeoDataFrame.from_file(r'https://github.com/AlvaroMatsuda/medium_posts/blob/main/spatial_autocorrelation/data/census_tract_sp.geojson?raw=true')
census_tract_sp.head()
# Filling NaNs of pop_count and total_monthly_income with 0
census_tract_sp['total_monthly_income'] = census_tract_sp['total_monthly_income'].fillna(0)
census_tract_sp['pop_count'] = census_tract_sp['pop_count'].fillna(0)
# Plotting choropleth map of pop_count column
census_tract_sp.explore(
    tiles='cartodbpositron',
    column='pop_count',
    height='60%',
    width='60%',
    scheme='naturalbreaks',
    cmap='Oranges',
    style_kwds={
        'stroke': True,
        'edgecolor': 'k',
        'linewidth': 0.03,
        'fillOpacity': 1
    },
    legend_kwds={
        'colorbar': False,
        'fmt': '{:.0f}'
    }
)
Choropleth map of pop_count column.
# Plotting choropleth map of total_monthly_income column
census_tract_sp.explore(
    tiles='cartodbpositron',
    column='total_monthly_income',
    height='60%',
    width='60%',
    scheme='naturalbreaks',
    cmap='Oranges',
    style_kwds={
        'stroke': True,
        'edgecolor': 'k',
        'linewidth': 0.03,
        'fillOpacity': 1
    },
    legend_kwds={
        'colorbar': False,
        'fmt': '{:.0f}'
    }
)
Choropleth map of total_monthly_income column.

Can we tell, just by looking at these two choropleth maps, whether pop_count and total_monthly_income are spatially clustered? Looking at the pop_count map, it is hard to say for sure. The total_monthly_income map, on the other hand, seems more clustered. But can we be sure it is clustered just by looking at a choropleth map? It is hard to tell.

So, we want to be sure whether the values of pop_count and total_monthly_income are geographically clustered or not, and where the clusters occur, if any.

Spatial Weights and Spatial Lag

To do spatial autocorrelation analysis, first we have to define two things in our dataset:

  • Spatial weights
  • Spatial lag

In summary, spatial weights define the neighbors of each tract. There are many ways to define neighbors, such as distance-based, contiguity-based and block-based criteria. I am not going to get into the details here. In our case, we are going to use contiguity-based spatial weights, more precisely queen contiguity (tracts are neighbors if they share an edge or a corner).

# Building spatial weights object
w = weights.contiguity.Queen.from_dataframe(census_tract_sp)

# Plotting to visualize spatial weights
plot_spatial_weights(w, census_tract_sp);

The code above builds the weights object (w), which stores the neighbors of each tract and the weight assigned to each neighbor.

Below we can see which tracts are neighbors of which. Each key is the index of a tract, and the values are the indexes of the tracts that are its neighbors.

# Transforming weights into binary (if it's 1 = is neighbor, 0 = not neighbor)
w.transform = "B"

# Showing neighbors indexes
w.neighbors

The key “0” refers to index 0 of the GeoDataFrame, and the indexes [1, 450, 456, 429, 431] are all neighbors of index 0.

The next code snippet shows the weights assigned to each neighbor.

# Showing weights
w.weights

In our case we are going to use binary weights. But depending on the case, we can use different transformations to standardize the weights.
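As an illustration of what such a transformation does, here is a minimal sketch on a synthetic 4-tract weight matrix, using plain NumPy instead of the libpysal object:

```python
import numpy as np

# Binary contiguity matrix for 4 areas arranged in a line:
# entry (i, j) is 1 if areas i and j share a border.
W = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

# Row standardization (the 'R' transform in libpysal): divide each row
# by its sum, so the weights of each area's neighbors sum to 1.
W_row = W / W.sum(axis=1, keepdims=True)
print(W_row.sum(axis=1))  # [1. 1. 1. 1.]
```

With row-standardized weights, the spatial lag computed next becomes an average over the neighbors instead of a sum.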

After constructing the weight matrix, we have to calculate the spatial lag of the variables we are interested in. The spatial lag multiplies the weight assigned to each neighbor by the neighbor’s value of the variable and sums the results over all neighbors. With row-standardized weights this amounts to the “local average” of the variable around each observation; with binary weights it is the sum over the neighborhood.

# Calculating Spatial Lag
census_tract_sp['w_pop_count'] = weights.lag_spatial(w, census_tract_sp['pop_count'])
census_tract_sp['w_total_monthly_income'] = weights.lag_spatial(w, census_tract_sp['total_monthly_income'])

census_tract_sp[['id', 'pop_count', 'w_pop_count', 'total_monthly_income', 'w_total_monthly_income']].head()

In our case, since we used binary weights, the w_pop_count column is the sum of pop_count over all neighbors of each index. For example, index 0 of the GeoDataFrame has [1, 450, 456, 429, 431] as neighbors, so if we sum the pop_count column of those rows we get the value 1,876.
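This arithmetic can be checked on a toy example (synthetic data, plain NumPy, independent of the census dataset): with binary weights, the spatial lag is just the matrix product of the weight matrix and the variable.

```python
import numpy as np

# Binary contiguity for 4 areas on a line.
W = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])
y = np.array([10.0, 20.0, 30.0, 40.0])

# With binary weights, the lag of each area is the plain sum of y
# over its neighbors: W @ y.
lag = W @ y
print(lag)  # [20. 40. 60. 30.]
```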

Now that we have calculated the spatial lag for both variables of interest, we are set to do spatial autocorrelation analysis.

Global Spatial Autocorrelation

The global spatial autocorrelation helps us to answer questions like:

  • Do variable values follow a spatial pattern?
  • Are similar values closer to other similar values than we would expect from pure chance?

In other words, global spatial autocorrelation tells us whether similar values of a given variable are more clustered in space than we would expect by pure chance (randomness). It gives an overview of the entire dataset, and therefore does not tell us where the clusters are.

For a continuous variable, we can use Moran’s I statistic to measure global spatial autocorrelation. Moran’s I works much like a correlation coefficient between a variable and its spatial lag: it ranges from -1 to 1 and indicates whether the variable is spatially clustered, dispersed or random. Positive values indicate that similar values are spatially close (clustered), negative values indicate that neighbors tend to have very different values (dispersed), and values close to 0 indicate no correlation between location and value (random).
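Concretely, Moran’s I is I = (n / S0) * (z'Wz) / (z'z), where z holds the deviations from the mean and S0 is the sum of all weights. A minimal NumPy sketch on synthetic data (not the author’s notebook code):

```python
import numpy as np

def morans_i(y, W):
    """Global Moran's I: I = (n / S0) * (z' W z) / (z' z)."""
    z = y - y.mean()
    n, S0 = len(y), W.sum()
    return (n / S0) * (z @ W @ z) / (z @ z)

# 4 areas on a line; low values cluster at one end, high at the other.
W = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])
y = np.array([1.0, 1.0, 5.0, 5.0])
print(morans_i(y, W))  # ≈ 0.333: positive, similar values sit next to each other
```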

Result of Moran’s I statistic.

In our case the result of the Global Moran’s I statistic is 0.2425. What does this mean? On its own, it gives us the notion that pop_count is positively spatially autocorrelated. However, from this measure alone we cannot be certain that the variable is spatially clustered. To be certain, we can calculate a p-value to test whether the Moran’s I result is significant. The null hypothesis of Moran’s I is that the values are randomly distributed in space.

With a p-value of 0.001 we can reject the null hypothesis; therefore we can assume that the variable is not randomly distributed in space, and from the value of Moran’s I we can tell that the autocorrelation is positive.
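The p-value reported by esda is a pseudo p-value obtained by permutation: the values are shuffled across the map many times, Moran’s I is recomputed for each shuffle, and the observed statistic is compared against that reference distribution. Here is a hedged sketch of the idea on synthetic data (the function name is mine; esda’s actual implementation differs in details):

```python
import numpy as np

def morans_i(y, W):
    z = y - y.mean()
    return (len(y) / W.sum()) * (z @ W @ z) / (z @ z)

def permutation_p_value(y, W, n_perm=999, seed=0):
    """Pseudo p-value: share of shuffled maps with I at least as large."""
    rng = np.random.default_rng(seed)
    observed = morans_i(y, W)
    count = sum(
        morans_i(rng.permutation(y), W) >= observed for _ in range(n_perm)
    )
    return (count + 1) / (n_perm + 1)

W = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])
y = np.array([1.0, 1.0, 5.0, 5.0])
print(permutation_p_value(y, W))  # roughly 1/3 on this tiny 4-area map
```

A reported p-value of 0.001 with 999 permutations means none of the shuffled maps produced an I as large as the observed one.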

We can use the Moran’s I scatterplot to help visualize the correlation.

# Computing Global Moran's I of pop_count and plotting its scatter plot
moran = Moran(census_tract_sp['pop_count'], w)
moran_scatterplot(moran);
Moran’s I scatterplot.

Both axes are in standard deviations. The x axis (Attribute) corresponds to the pop_count variable (the pop count of each tract) and the y axis (Spatial Lag) corresponds to w_pop_count (the pop count of each tract’s neighbors). Looking at the plot, as the x values get higher, so in general do the y values. In other words, when pop_count is high in a tract, its neighbors also tend to have high values.

Now we are confident that the pop_count variable is spatially clustered, although the correlation is not that strong. Let’s do the same for total_monthly_income.

We can see that total_monthly_income has a higher Moran’s I value than pop_count, which means that total_monthly_income is more spatially clustered.

In summary, this is what global spatial autocorrelation can tell us. It gives a general idea of whether our data is spatially clustered, dispersed or random. Additionally, we can test whether the data is randomly distributed in space through the p-value. However, it does not indicate where the data is clustered or dispersed. That is where local spatial autocorrelation comes in: it can show us where the data is clustered.

Local Spatial Autocorrelation

As mentioned before, local spatial autocorrelation helps us identify where on the map the clusters are. It achieves this by focusing on the relationship between each observation and its neighbors.

For this, LISA (Local Indicators of Spatial Association) is used. LISA builds on the Moran scatterplot we saw before, dividing the plot into 4 quadrants according to the relationship between each observation and its neighbors:

  • high value with high value neighbors (HH);
  • low value with low value neighbors (LL);
  • high value with low value neighbors (HL);
  • low value with high value neighbors (LH).
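The quadrant assignment amounts to comparing the sign of each observation’s deviation from the mean with the sign of its neighborhood’s deviation. A sketch on synthetic data (NumPy only; with esda, these labels come ready-made in the q attribute used further below):

```python
import numpy as np

# Binary contiguity for 4 areas on a line.
W = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])
y = np.array([1.0, 2.0, 5.0, 6.0])

z = y - y.mean()   # deviation of each area from the mean
lag_z = W @ z      # deviation of the neighborhood

# Quadrant: sign of own deviation vs. sign of the neighborhood's.
labels = np.where((z >= 0) & (lag_z >= 0), 'HH',
         np.where((z < 0) & (lag_z < 0), 'LL',
         np.where((z >= 0) & (lag_z < 0), 'HL', 'LH')))
print(labels)  # ['LL' 'LL' 'HH' 'HH']
```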
# Local Moran's I of pop_count
y_pop_count = census_tract_sp['pop_count']
pop_count_local_moran = Moran_Local(y_pop_count, w)

# Plotting Local Moran's I scatterplot of pop_count
fig, ax = moran_scatterplot(pop_count_local_moran, p=0.05);

plt.text(1.95, 1, 'HH', fontsize=25)
plt.text(1.95, -1.0, 'HL', fontsize=25)
plt.text(-1.5, 1, 'LH', fontsize=25)
plt.text(-1.5, -1, 'LL', fontsize=25)
plt.show()
Moran local scatterplot.

Local Moran’s I also computes the statistical significance (p-value) of each local relationship. That is why there are grey dots in the scatterplot: they are observations where the relationship between the tract’s value and the neighborhood’s weighted value is not statistically significant. In other words, such values could have arisen at random; they are not unusual enough to be considered statistically significant.

We can plot the map with the Local Moran’s I classification to identify where clusters are.

# creating column with local_moran classification
census_tract_sp['pop_count_local_moran'] = pop_count_local_moran.q

# Dict to map local moran's classification codes
local_moran_classification = {1: 'HH', 2: 'LH', 3: 'LL', 4: 'HL'}

# Mapping local moran's classification codes
census_tract_sp['pop_count_local_moran'] = census_tract_sp['pop_count_local_moran'].map(local_moran_classification)

# p-value for each observation/neighbor pair
census_tract_sp['pop_count_local_moran_p_sim'] = pop_count_local_moran.p_sim

# If p-value > 0.05, the relationship is not statistically significant
census_tract_sp['pop_count_local_moran'] = np.where(census_tract_sp['pop_count_local_moran_p_sim'] > 0.05, 'ns', census_tract_sp['pop_count_local_moran'])

census_tract_sp.head()
# Plotting Local Moran's I classification map of pop_count column
census_tract_sp.explore(
    tiles='cartodbpositron',
    column='pop_count_local_moran',
    height='70%',
    width='70%',
    cmap=[
        '#D7191C', # Red
        '#FDAE61', # Orange
        '#ABD9E9', # Light Blue
        '#2C7BB6', # Blue
        '#D3D3D3'  # Grey
    ],
    style_kwds={
        'stroke': True,
        'edgecolor': 'k',
        'linewidth': 0.03,
        'fillOpacity': 1
    },
)
Map with Local Moran’s I classification for pop_count column.

We can see that the southern, northern and central regions have clusters of low-low (LL) values of the pop_count column. Beyond that, it is hard to identify any other cluster. Now, let’s do the same for the total_monthly_income column.

# Local Moran's I of total_monthly_income
y_total_monthly_income = census_tract_sp['total_monthly_income']
total_monthly_income_local_moran = Moran_Local(y_total_monthly_income, w)

# Plotting Local Moran's I scatterplot of total_monthly_income
fig, ax = moran_scatterplot(total_monthly_income_local_moran, p=0.05);

plt.text(1.95, 1, 'HH', fontsize=25)
plt.text(1.95, -1.0, 'HL', fontsize=25)
plt.text(-1.5, 1, 'LH', fontsize=25)
plt.text(-1.5, -1, 'LL', fontsize=25)
plt.show()
Moran local scatterplot.
# creating column with local_moran classification
census_tract_sp['total_monthly_income_local_moran'] = total_monthly_income_local_moran.q

# Dict to map local moran's classification codes
local_moran_classification = {1: 'HH', 2: 'LH', 3: 'LL', 4: 'HL'}

# Mapping local moran's classification codes
census_tract_sp['total_monthly_income_local_moran'] = census_tract_sp['total_monthly_income_local_moran'].map(local_moran_classification)

# p-value for each observation/neighbor pair
census_tract_sp['total_monthly_income_local_moran_p_sim'] = total_monthly_income_local_moran.p_sim

# If p-value > 0.05, the relationship is not statistically significant
census_tract_sp['total_monthly_income_local_moran'] = np.where(census_tract_sp['total_monthly_income_local_moran_p_sim'] > 0.05, 'ns', census_tract_sp['total_monthly_income_local_moran'])

# Plotting Local Moran's I classification map of total_monthly_income column
census_tract_sp.explore(
    tiles='cartodbpositron',
    column='total_monthly_income_local_moran',
    height='70%',
    width='70%',
    cmap=[
        '#D7191C', # Red
        '#FDAE61', # Orange
        '#ABD9E9', # Light Blue
        '#2C7BB6', # Blue
        '#D3D3D3'  # Grey
    ],
    style_kwds={
        'stroke': True,
        'edgecolor': 'k',
        'linewidth': 0.03,
        'fillOpacity': 1
    },
)
Map with Local Moran’s I classification for total_monthly_income column.

Compared to the classification of pop_count, total_monthly_income is more clustered, consistent with what the global autocorrelation analysis showed. We can see that the northern, southern and eastern regions have clusters of low-low (LL) values, whereas the central region has a cluster of high-high (HH) values.

Conclusion

In this post, I showed you how to incorporate the spatial component into an EDA, turning it into an ESDA (Exploratory Spatial Data Analysis). It gives us an idea of where variables are clustered in space and whether the relationship between each observation and its neighbors is statistically significant. In other words, it tests whether the values of a variable are correlated with the location (neighborhood) where they occur.

Doing this kind of analysis can bring insights into the problem we are dealing with.

About me

I am a geographer currently working as a data scientist. For that reason, I am an advocate of spatial data science.

Follow me for more content. I will write a post every month about data science concepts and techniques, spatial data science and GIS (Geographic Information System).

