Spatial Clustering
Introduction
Spatial clustering is a valuable technique used in fields such as urban planning, ecology, and data analysis. It helps us identify patterns, group similar data points together, and gain insights into spatial distributions. In this article, we will explore how to perform spatial clustering using scikit-learn, GeoPandas, and H3, a spatial indexing library developed by Uber, with code examples.
Why Spatial Clustering?
Spatial clustering involves dividing a dataset into groups based on spatial proximity. It helps us understand the underlying structures and relationships within spatial data. For instance, in urban planning, spatial clustering can identify areas with similar characteristics such as population density, land use patterns, or socioeconomic factors. This information is crucial for making informed decisions and developing targeted strategies.
Introduction to H3 library
The H3 library, developed by Uber Technologies, is a powerful geospatial indexing and analytics library. It utilizes a hexagonal hierarchical spatial grid system to efficiently index and analyze geospatial data. By converting geographic coordinates into H3 indexes, the library facilitates tasks such as spatial indexing, point-in-polygon operations, proximity searches, aggregation, spatial joins and spatial clustering.
There are other types of spatial indexing that use squares (e.g. S2 from Google) or triangles. However, hexagons have some interesting properties that the others don't (a quick sketch of these properties in code follows below):
- Triangles and squares have neighbors at different distances, whereas all of a hexagon's neighbors are at the same distance.
- Hexagons have the property that expanding rings of neighbors approximate circles.
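To make these properties concrete, here is a minimal sketch (assuming the h3-py v3.x API, which is also what the code later in this article uses, and an arbitrary coordinate in São Paulo) that indexes a point and retrieves its equidistant neighbor rings:
import h3
# Index a (latitude, longitude) coordinate at resolution 8
hex_id = h3.geo_to_h3(-23.5505, -46.6333, 8)
# k_ring returns the center cell plus all cells within grid distance k;
# every immediate neighbor sits at the same grid distance from the center
neighbors = h3.k_ring(hex_id, 1)
# hex_ring returns only the cells exactly k steps away; expanding rings approximate circles
second_ring = h3.hex_ring(hex_id, 2)
print(len(neighbors), len(second_ring))  # 7 cells (center + 6 neighbors), 12 cells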
Case Study
For a better understanding of what spatial clustering can bring when analysing data, let’s imagine a hypothetical case.
Imagine that we want to open a new store in a city and, to do that, we need to map the regions with the biggest sales potential.
As data scientists, we can use spatial clustering to identify the regions with the biggest sales potential. We can go further: after identifying the best cluster for our case, we can prioritize and rank the regions within that cluster using a weighted average.
To build the solution, we are going to use Brazilian census tract data for the city of São Paulo, produced by IBGE, which contains monthly income, population count, and housing unit count. We are going to use only these three variables to keep it simple.
To build the solution we are going to follow these steps:
- Preprocess the data: aggregate the census tract data into H3 hexagons
- Prepare the data for modeling: rescale the data
- Build the cluster model
- Analyse the results
Now that we know what we want to do, let’s get to the code.
You can access the notebook with the code here:
Preprocess Data
The first step of our solution is to aggregate the census tract data into H3 hexagons. We can think of this aggregation as a way to "normalize" spatial data. One problem with census tracts is that they come in different sizes and shapes, making it difficult to compare them fairly. H3 hexagons solve this problem.
So let’s see how this aggregation is done.
First, we import the necessary libraries and load the census tract data.
# Importing necessary libraries
import pandas as pd
import numpy as np
import geopandas as gpd
import h3
import folium
from shapely.geometry import Polygon, Point, mapping, LineString
from libpysal import weights
from sklearn.cluster import AgglomerativeClustering
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from feature_engine.wrappers import SklearnTransformerWrapper
# Display all columns from dataframe
pd.set_option('display.max_columns', None)
# Suppress scientific notation
pd.set_option('display.float_format', lambda x: '%.5f' % x)
# File path
SP_CENSUS_TRACT_FILE_PATH = r'https://github.com/AlvaroMatsuda/Spatial_Clustering/blob/main/databases/brazilian_census_tract.geojson?raw=true'
# Loading data
df_census_tract = gpd.GeoDataFrame.from_file(SP_CENSUS_TRACT_FILE_PATH)
df_census_tract.head()
Now we have to build the hexagon polygons covering the city of São Paulo. There are three steps:
- STEP 01: Get the hexagon indexes that are contained inside the census tract polygons.
- STEP 02: Convert from a bigger aperture to a smaller aperture (see the explanation below).
- STEP 03: Create hexagon Polygons from the hexagon indexes.
# ========================
# STEP 01
# ========================
# Initial Aperture size
INITIAL_APERTURE_SIZE = 10
# Final Aperture size
FINAL_APERTURE_SIZE = 8
# Empty list to be populated with hexagon's indexes
hex_list = []
# Iterate over census tract geometries to get hexagon indexes
for n, g in enumerate(df_census_tract['geometry'].explode(ignore_index=True)):
    # Get a GeoJSON-like dict from the geometry
    temp = mapping(g)
    # Swap coordinate order from (lng, lat) to (lat, lng), which h3.polyfill expects by default
    temp['coordinates'] = [[[j[1], j[0]] for j in i] for i in temp['coordinates']]
    # Fill the polygon with hexagons whose centers fall inside the GeoJSON-like geometry
    hex_list.extend(h3.polyfill(geojson=temp, res=INITIAL_APERTURE_SIZE))
# ========================
# STEP 02
# ========================
# Column name with the aperture size
initial_hex_col = 'hex{}'.format(INITIAL_APERTURE_SIZE)
final_hex_col = 'hex{}'.format(FINAL_APERTURE_SIZE)
# Creating DataFrame with hexagon's indexes
df_hex = gpd.GeoDataFrame(hex_list,columns=[initial_hex_col])
# Convert to the final aperture (8)
df_hex[final_hex_col] = df_hex[initial_hex_col].apply(lambda x: h3.h3_to_parent(x, FINAL_APERTURE_SIZE))
# Dropping columns with original aperture
df_hex.drop(columns=[initial_hex_col], inplace=True)
# ========================
# STEP 03
# ========================
# Creating Hexagon polygons based on Hexagons indexes
df_hex['hex_polygon'] = df_hex[final_hex_col].apply(lambda x: Polygon(h3.h3_to_geo_boundary(x, geo_json=True)))
# Setting GeoDataFrame Geometry
df_hex.set_geometry('hex_polygon', crs=df_census_tract.crs, inplace=True)
# Drop duplicated hexagons created when we converted the aperture
df_hex.drop_duplicates(inplace=True)
df_hex.head()
# Creating a Folium Map object
m = folium.Map(location=[-23.568865965761308, -46.629872859184125], zoom_start=12, tiles='OpenStreetMap', crs='EPSG3857', width='70%', height='70%')
# Adding census tract layer to the map
df_census_tract.explore(m=m)
# Adding hexagons layer to the map
df_hex.explore(m=m, style_kwds={'fillOpacity': 0.0, 'color': '#CE2020'})
There is a trick in the code, related to STEP 02, that works around a behavior of H3's polyfill function: polyfill only keeps hexagons whose centers fall inside the polygon. If we polyfill the census tracts directly at the aperture we want, we get the following result:
As you can see, there are areas where the hexagons don't fully cover all census tracts. To solve this, we get the indexes at a bigger aperture and then convert them to a smaller aperture. The aperture controls the resolution of the hexagons; we can think of it as the scale or, in other words, the size of the hexagons. A bigger aperture means smaller hexagons.
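As a small illustration of this relationship, here is a sketch (assuming the same h3-py v3.x API used above and an arbitrary São Paulo coordinate) comparing average cell sizes at the two apertures and converting a fine-resolution cell to its parent, as STEP 02 does:
import h3
# Average hexagon area per resolution: a bigger aperture number means smaller hexagons
print(h3.hex_area(8, unit='km^2'))   # roughly 0.74 km^2 per cell
print(h3.hex_area(10, unit='km^2'))  # roughly 0.015 km^2 per cell
# Every resolution-10 cell maps to exactly one resolution-8 parent,
# which is what the conversion in STEP 02 relies on
child = h3.geo_to_h3(-23.5505, -46.6333, 10)   # hypothetical coordinate for illustration
parent = h3.h3_to_parent(child, 8)
print(child, '->', parent)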
With the hexagons constructed, we have to aggregate the census tract data into the hexagon dataframe. To calculate the total population living in a hexagon, we sum the population of all census tracts that fall within it. However, as we can see in the images above, some census tracts intersect more than one hexagon. In those cases, I assigned a portion of the census tract's data according to the proportion of its area inside each hexagon.
# Calculating the area of each census tract
df_census_tract['area_census_tract'] = (
df_census_tract
.to_crs(crs=3857)
.area
)
# Overlaying the census tracts that intersect each hexagon
df_agg = (
df_hex
.to_crs(crs=3857)
.overlay(df_census_tract.to_crs(crs=3857))
)
# Calculating the area of each overlayed census tract
df_agg['area_overlayed'] = df_agg.area
# Calculating the proportion of the area for each overlayed tract
df_agg['area_prop'] = df_agg['area_overlayed'] / df_agg['area_census_tract']
# Calculating pop_count, total_monthly_income and housing_count proportional to the area inside the hexagons
df_agg = (
df_agg
.assign(prop_pop_count=lambda x: round(x['pop_count'] * x['area_prop']))
.assign(prop_total_monthly_income=lambda x: round(x['total_monthly_income'] * x['area_prop']))
.assign(prop_housing_count=lambda x: round(x['housing_count'] * x['area_prop']))
)
# Aggregating census tract and summing the proportion variables
df_agg = (
df_agg[[
'hex8', 'prop_pop_count', 'prop_total_monthly_income', 'prop_housing_count'
]]
.groupby(['hex8'])
.sum()
.reset_index()
)
# Creating Hexagon polygons based on Hexagons indexes
df_agg['hex_polygon'] = df_agg['hex8'].apply(lambda x: Polygon(h3.h3_to_geo_boundary(x, geo_json=True)))
# Setting GeoDataFrame Geometry
df_agg = gpd.GeoDataFrame(df_agg, crs=df_census_tract.crs, geometry='hex_polygon')
df_agg.head()
That is all for the aggregation. Now that we have "normalized" the spatial data, we are ready to build a cluster model.
Clustering
Before we feed our data into a cluster model, we first need to rescale the features, since clustering algorithms are based on distances between data points.
# Columns to be rescaled
cluster_variables = [
'prop_pop_count',
'prop_total_monthly_income',
'prop_housing_count'
]
# Pipeline with MinMaxScaler to rescale the variables
pipe = Pipeline(
steps=[
('rescaler', SklearnTransformerWrapper(MinMaxScaler(), variables=cluster_variables))
]
)
# rescaled DataFrame
df_rescaled = pipe.fit_transform(df_agg)
df_rescaled = df_rescaled[cluster_variables]
After rescaling, we can train a cluster model. I am going to use a hierarchical (agglomerative) clustering algorithm and build two models:
- The first with no proximity constraint.
- The second with a proximity constraint.
The intent is to show the possibilities and the different results of these two approaches.
# Instantiate the model without a proximity constraint
cluster_wo_constrain = AgglomerativeClustering(n_clusters=5)
# Fit Model
cluster_wo_constrain.fit(df_rescaled)
# Cluster labels
cluster_wo_constrain.labels_
# Adding cluster column with cluster labels
df_agg['clusters_wo_constrain'] = cluster_wo_constrain.labels_
df_agg['clusters_wo_constrain'] = df_agg['clusters_wo_constrain'].astype(str)
df_agg.head()
# Plotting the result of the clustering without the constraint
df_agg.explore(
column='clusters_wo_constrain',
cmap='tab10',
width='70%',
height='70%',
style_kwds={'fillOpacity': 1.0, 'opacity': 0.0}
)
To build the clusters with a proximity constraint, we first need to build a spatial weights matrix that indicates which hexagons are neighbors of each other. Then we can use the model's connectivity parameter and pass the spatial weights matrix as the constraint rule.
# Building the spatial weights matrix to be used as the proximity constraint
w = weights.contiguity.Queen.from_dataframe(df_agg)
# Instantiate the model with a proximity constraint
cluster_w_constrain = AgglomerativeClustering(
n_clusters=5,
connectivity=w.sparse
)
# Fit Model
cluster_w_constrain.fit(df_rescaled)
# Cluster labels
cluster_w_constrain.labels_
# Adding cluster column with cluster labels
df_agg['clusters_w_constrain'] = cluster_w_constrain.labels_
df_agg['clusters_w_constrain'] = df_agg['clusters_w_constrain'].astype(str)
df_agg.head()
# Plotting the result of the clustering with the constraint
df_agg.explore(
column='clusters_w_constrain',
cmap='tab10',
width='80%',
height='80%',
style_kwds={'fillOpacity': 1.0, 'opacity': 0.0}
)
As we can see, when we add the proximity constraint, all members of a cluster are connected, as opposed to the clustering without the constraint, where members of every cluster appear everywhere. Sometimes, when working with spatial data, it is desirable for clusters to obey this constraint, forming contiguous spatial clusters.
In our problem there is no need to add the constraint, as we just want to identify the regions with the best opportunities to open a new store.
Cluster Profiling
Now we can analyse the profile of each cluster and identify the one containing the hexagons with the biggest sales potential based on the three variables.
To do that, we calculate the average of all three variables for each cluster.
# Making clusters profile
clusters_profile = (
df_agg
.groupby('clusters_wo_constrain')
.agg(
mean_pop=('prop_pop_count', 'mean'),
mean_income=('prop_total_monthly_income', 'mean'),
mean_housing=('prop_housing_count', 'mean'),
)
.reset_index()
.sort_values(by=['mean_pop','mean_income', 'mean_housing'], ascending=False)
)
clusters_profile
Analysing the profile, we can see that cluster 3 has the highest average monthly income and population, which indicates that these regions have many potential customers with high income.
After identifying the best cluster, we can rank the hexagons within cluster 3 using a weighted average.
# Separating only the hexagons of cluster 3 (copying to avoid a SettingWithCopyWarning)
cluster3 = df_agg[df_agg['clusters_wo_constrain'] == '3'].copy()
# Weights for each variable
pop_weight = 0.5
income_weight = 0.4
housing_weight = 0.1
# Calculating weighted Average
cluster3['weighted_avg'] = (
cluster3
.apply(lambda x: (x['prop_pop_count'] * pop_weight) + (x['prop_total_monthly_income'] * income_weight) + (x['prop_housing_count'] * housing_weight), axis=1)
)
# Ranking hexagons according to the weighted average
cluster3['hex_rank'] = (
cluster3['weighted_avg']
.rank(method='first', ascending=False)
)
cluster3.sort_values('hex_rank', ascending=True)
# Plotting ranked hexagons
cluster3.explore(
column='hex_rank',
cmap='gray',
height='80%',
width='80%',
style_kwds={'fillOpacity': 0.7, 'opacity': 1.0, 'color': '#000000'}
)
The map above shows the hexagons ranked according to the calculated weighted average: the darker the color, the higher the rank. With this, we could prioritize a deeper analysis of the higher-ranked hexagons to decide where to open the new store.
Conclusion
In this article we explored the power of spatial clustering, using Uber's H3 library as a way to "normalize" spatial data. We learned how to index spatial data with H3, performed spatial clustering using a hierarchical clustering algorithm with and without a spatial constraint and compared their results, and finally identified the cluster with the biggest potential and ranked the hexagons within it to produce a prioritization list.
Experiment with different clustering algorithms, apertures, and visualizations to gain a deeper understanding of your data and make informed decisions based on spatial patterns.
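As one possible starting point for such experimentation, here is a hypothetical sketch (not part of the original notebook) that swaps the agglomerative model for KMeans on the same rescaled features, reusing the df_rescaled and df_agg dataframes built above:
from sklearn.cluster import KMeans
# Fit KMeans on the same rescaled features used for the agglomerative models
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
df_agg['clusters_kmeans'] = kmeans.fit_predict(df_rescaled).astype(str)
# Visualize the alternative clustering the same way as before
df_agg.explore(column='clusters_kmeans', cmap='tab10', style_kwds={'fillOpacity': 1.0, 'opacity': 0.0})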