# Spatial Clustering

# Introduction

Spatial clustering is a valuable technique used in fields such as urban planning, ecology, and data analysis. It helps us identify patterns, group similar data points together, and gain insights into spatial distributions. In this article, we will explore how to perform spatial clustering using scikit-learn, GeoPandas, and H3, a spatial indexing library developed by Uber, with code examples.

# Why Spatial Clustering?

Spatial clustering involves dividing a dataset into groups based on spatial proximity. It helps us understand the underlying structures and relationships within spatial data. For instance, in urban planning, spatial clustering can identify areas with similar characteristics such as population density, land use patterns, or socioeconomic factors. This information is crucial for making informed decisions and developing targeted strategies.

# Introduction to H3 library

The H3 library, developed by Uber Technologies, is a powerful geospatial indexing and analytics library. It uses a hexagonal hierarchical spatial grid system to efficiently index and analyze geospatial data. By converting geographic coordinates into H3 indexes, the library facilitates tasks such as spatial indexing, point-in-polygon operations, proximity searches, aggregation, spatial joins, and **spatial clustering**.

There are other types of spatial indexing that use squares (e.g. S2 from Google) or triangles. However, hexagons have some interesting properties that the others don't:

- Triangles and squares have neighbors at different distances, whereas all of a hexagon's neighbors lie at the same distance.
- Expanding rings of neighboring hexagons approximate circles.
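These geometric properties can be checked with a few lines of plain Python. This is a toy sketch using idealized grid coordinates, not the H3 API: the six neighbors of a hexagon all sit at the same center-to-center distance, while a square's eight neighbors split into edge and diagonal distances.

```python
import math

# Axial-coordinate offsets of the six neighbors of a hexagonal cell
HEX_NEIGHBORS = [(1, 0), (1, -1), (0, -1), (-1, 0), (-1, 1), (0, 1)]

def axial_to_cartesian(q, r, size=1.0):
    """Center of the hexagon at axial coordinates (q, r), pointy-top layout."""
    x = size * (math.sqrt(3) * q + math.sqrt(3) / 2 * r)
    y = size * (1.5 * r)
    return x, y

# Distinct center-to-center distances to the 6 hexagonal neighbors
hex_dists = {round(math.hypot(*axial_to_cartesian(q, r)), 6) for q, r in HEX_NEIGHBORS}

# Distinct distances to the 8 surrounding cells of a unit square grid
square_dists = {
    round(math.hypot(dx, dy), 6)
    for dx in (-1, 0, 1) for dy in (-1, 0, 1)
    if (dx, dy) != (0, 0)
}

print(hex_dists)     # a single distance
print(square_dists)  # two distances: edge vs. diagonal neighbors
```

All six hexagon neighbors are equidistant, while the square grid mixes edge neighbors (distance 1) and diagonal neighbors (distance ~1.41).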

# Case Study

For a better understanding of what spatial clustering can bring when analysing data, let’s imagine a hypothetical case.

Imagine that we want to open a new store in a city and, for that, we have to map the regions with the biggest sales potential.

As data scientists, we can use spatial clustering to identify those regions. We can go further: after identifying the best cluster for our case, we can prioritize and rank the regions within it using a weighted average.

To build the solution, we are going to use the Brazilian census tract data of São Paulo city produced by IBGE, which contains information on monthly income, population count, and count of housing units. We are going to use only these three variables to keep it simple.

To build the solution we are going to follow these steps:

- Preprocess the data: aggregate census tract data into H3 hexagons.
- Prepare the data for modeling: rescale the variables.
- Build the cluster model.
- Analyse the results.

Now that we know what we want to do, let’s get to the code.

You can access the notebook with the code here:

# Preprocess Data

The first step of our solution is to aggregate the census tract data into H3 hexagons. We can think of this aggregation as a way to "normalize" spatial data. One of the problems with census tracts is that they come in different sizes and shapes, making a fair comparison between them difficult. With H3 hexagons we can solve this problem.

So let’s see how this aggregation is done.

First, we import the necessary libraries and load the census tract data.

```python
# Importing necessary libraries
import pandas as pd
import numpy as np
import geopandas as gpd
import h3
import folium

from shapely.geometry import Polygon, Point, mapping, LineString
from libpysal import weights
from sklearn.cluster import AgglomerativeClustering
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from feature_engine.wrappers import SklearnTransformerWrapper

# Display all columns of the DataFrame
pd.set_option('display.max_columns', None)

# Suppress scientific notation
pd.set_option('display.float_format', lambda x: '%.5f' % x)

# File path
SP_CENSUS_TRACT_FILE_PATH = r'https://github.com/AlvaroMatsuda/Spatial_Clustering/blob/main/databases/brazilian_census_tract.geojson?raw=true'

# Loading data
df_census_tract = gpd.GeoDataFrame.from_file(SP_CENSUS_TRACT_FILE_PATH)
df_census_tract.head()
```

Now we have to build the hexagon polygons covering São Paulo city. There are three steps:

- **STEP 01:** Get the hexagon indexes contained inside the census tract polygons.
- **STEP 02:** Convert from a bigger aperture to a smaller aperture (see the comment below).
- **STEP 03:** Create hexagon Polygons from the hexagon indexes.

```python
# ========================
# STEP 01
# ========================

# Initial aperture size
INITIAL_APERTURE_SIZE = 10

# Final aperture size
FINAL_APERTURE_SIZE = 8

# Empty list to be populated with hexagon indexes
hex_list = []

# Iterate over census tract geometries to get hexagon indexes
for n, g in enumerate(df_census_tract['geometry'].explode(ignore_index=True)):

    # Get GeoJSON from geometry
    temp = mapping(g)

    # Swap coordinate order to (lat, lng), as expected by polyfill
    temp['coordinates'] = [[[j[1], j[0]] for j in i] for i in temp['coordinates']]

    # Fill the polygon with hexagons contained by the GeoJSON-like structure
    hex_list.extend(h3.polyfill(geojson=temp, res=INITIAL_APERTURE_SIZE))

# ========================
# STEP 02
# ========================

# Column names with the aperture size
initial_hex_col = 'hex{}'.format(INITIAL_APERTURE_SIZE)
final_hex_col = 'hex{}'.format(FINAL_APERTURE_SIZE)

# Creating a DataFrame with the hexagon indexes
df_hex = gpd.GeoDataFrame(hex_list, columns=[initial_hex_col])

# Convert to aperture 8
df_hex[final_hex_col] = df_hex[initial_hex_col].apply(lambda x: h3.h3_to_parent(x, FINAL_APERTURE_SIZE))

# Dropping the column with the original aperture
df_hex.drop(columns=[initial_hex_col], inplace=True)

# ========================
# STEP 03
# ========================

# Creating hexagon polygons from the hexagon indexes
df_hex['hex_polygon'] = df_hex[final_hex_col].apply(lambda x: Polygon(h3.h3_to_geo_boundary(x, geo_json=True)))

# Setting the GeoDataFrame geometry
df_hex.set_geometry('hex_polygon', crs=df_census_tract.crs, inplace=True)

# Dropping duplicated hexagons created by the aperture conversion
df_hex.drop_duplicates(inplace=True)

df_hex.head()
```

```python
# Creating a Folium Map object
m = folium.Map(location=[-23.568865965761308, -46.629872859184125], zoom_start=12, tiles='OpenStreetMap', crs='EPSG3857', width='70%', height='70%')

# Adding the census tract layer to the map
df_census_tract.explore(m=m)

# Adding the hexagon layer to the map
df_hex.explore(m=m, style_kwds={'fillOpacity': 0.0, 'color': '#CE2020'})
```

There is a trick in this code, related to STEP 02, that works around a problem with H3's polyfill function. If we polyfill the census tracts directly at the aperture we want, we get the following result:

As you can see, there are areas where the hexagons don't fully cover all the census tracts. To solve this, we get the indexes at a bigger aperture and then convert them to a smaller one. The aperture controls the resolution of the hexagons: we can think of it as the scale or, in other words, the size of the hexagon. A bigger aperture means smaller hexagons.
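A rough way to see the scale jump between apertures (a back-of-the-envelope sketch; H3 children do not nest perfectly in their parent, so 7 children per parent is an approximation):

```python
# H3 is hierarchical: each hexagon at resolution r is subdivided into
# roughly 7 children at resolution r + 1 (hence "aperture 7").
INITIAL_APERTURE_SIZE = 10
FINAL_APERTURE_SIZE = 8

# Approximate number of resolution-10 cells per resolution-8 cell
children_per_parent = 7 ** (INITIAL_APERTURE_SIZE - FINAL_APERTURE_SIZE)
print(children_per_parent)
```

Polyfilling at resolution 10 therefore samples each resolution-8 cell roughly 49 times, which is why converting up with `h3_to_parent` and deduplicating closes the coverage gaps.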

With the hexagons constructed, we have to aggregate the census tract data into the hexagon dataframe. To calculate the total population living in a hexagon, we sum the population of all census tracts within it. However, as we can see in the images above, some census tracts intersect more than one hexagon. In those cases, I allocated a portion of the tract's data according to the proportion of its area inside each hexagon.
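The proportional-split rule can be sketched in isolation with a small hypothetical helper (`allocate_by_area` is illustrative, not part of the notebook):

```python
def allocate_by_area(tract_value, overlap_areas):
    """Split a census-tract total across hexagons in proportion to the
    area of the tract that falls inside each hexagon."""
    total = sum(overlap_areas.values())
    return {hex_id: tract_value * area / total for hex_id, area in overlap_areas.items()}

# A tract with 1000 people: 30% of its area in hexagon A, 70% in hexagon B
shares = allocate_by_area(1000, {"A": 30.0, "B": 70.0})
print(shares)  # {'A': 300.0, 'B': 700.0}
```

This is exactly what the `area_prop` column does below, computed from the overlay geometries instead of hand-written areas.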

```python
# Calculating the area of each census tract
df_census_tract['area_census_tract'] = (
    df_census_tract
    .to_crs(crs=3857)
    .area
)

# Overlaying the census tracts that intersect each hexagon
df_agg = (
    df_hex
    .to_crs(crs=3857)
    .overlay(df_census_tract.to_crs(crs=3857))
)

# Calculating the area of each overlayed census tract
df_agg['area_overlayed'] = df_agg.area

# Calculating the proportion of the area for each overlayed tract
df_agg['area_prop'] = df_agg['area_overlayed'] / df_agg['area_census_tract']

# Calculating pop_count, total_monthly_income and housing_count
# proportional to the area inside the hexagons
df_agg = (
    df_agg
    .assign(prop_pop_count=lambda x: round(x['pop_count'] * x['area_prop']))
    .assign(prop_total_monthly_income=lambda x: round(x['total_monthly_income'] * x['area_prop']))
    .assign(prop_housing_count=lambda x: round(x['housing_count'] * x['area_prop']))
)

# Aggregating by hexagon and summing the proportional variables
df_agg = (
    df_agg[[
        'hex8', 'prop_pop_count', 'prop_total_monthly_income', 'prop_housing_count'
    ]]
    .groupby(['hex8'])
    .sum()
    .reset_index()
)

# Creating hexagon polygons from the hexagon indexes
df_agg['hex_polygon'] = df_agg['hex8'].apply(lambda x: Polygon(h3.h3_to_geo_boundary(x, geo_json=True)))

# Setting the GeoDataFrame geometry
df_agg = gpd.GeoDataFrame(df_agg, crs=df_census_tract.crs, geometry='hex_polygon')

df_agg.head()
```

That is all for the aggregation. Now that we have "normalized" the spatial data, we are set to build a cluster model.

# Clustering

Before feeding our data into a cluster model, we first need to rescale the features, since clustering algorithms are based on the distances between data points.
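Min-max rescaling maps each feature onto the [0, 1] range via (x - min) / (max - min). A minimal pure-Python sketch with made-up values shows why this matters when features live on very different scales:

```python
def min_max_scale(values):
    """Rescale a list of numbers to [0, 1]: (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical values: incomes span tens of thousands, counts only hundreds
incomes = [1000.0, 3000.0, 5000.0, 21000.0]
populations = [100.0, 300.0, 500.0, 200.0]

print(min_max_scale(incomes))      # [0.0, 0.1, 0.2, 1.0]
print(min_max_scale(populations))  # [0.0, 0.5, 1.0, 0.25]
```

After rescaling, both features contribute comparably to the distances the clustering algorithm computes; without it, income would dominate. `MinMaxScaler` below applies the same formula column by column.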

```python
# Columns to be rescaled
cluster_variables = [
    'prop_pop_count',
    'prop_total_monthly_income',
    'prop_housing_count'
]

# Pipeline with MinMaxScaler to rescale the variables
pipe = Pipeline(
    steps=[
        ('rescaler', SklearnTransformerWrapper(MinMaxScaler(), variables=cluster_variables))
    ]
)

# Rescaled DataFrame
df_rescaled = pipe.fit_transform(df_agg)
df_rescaled = df_rescaled[cluster_variables]
```

After rescaling we can train a cluster model. I am going to use the hierarchical (agglomerative) clustering algorithm and build two models:

- The first with no proximity constraint.
- The second with a proximity constraint.

The intent is to show the possibilities and the different results of these two approaches.

```python
# Instantiate the model without proximity constraint
cluster_wo_constrain = AgglomerativeClustering(n_clusters=5)

# Fit the model
cluster_wo_constrain.fit(df_rescaled)

# Cluster labels
cluster_wo_constrain.labels_

# Adding a column with the cluster labels
df_agg['clusters_wo_constrain'] = cluster_wo_constrain.labels_
df_agg['clusters_wo_constrain'] = df_agg['clusters_wo_constrain'].astype(str)

df_agg.head()
```

```python
# Plotting the result of the clustering without constraint
df_agg.explore(
    column='clusters_wo_constrain',
    cmap='tab10',
    width='70%',
    height='70%',
    style_kwds={'fillOpacity': 1.0, 'opacity': 0.0}
)
```

To build the cluster with a proximity constraint, we first need to build a spatial weights matrix indicating which hexagons are neighbors of each other. We can then pass this matrix to the model's **connectivity** parameter as the constraint rule.
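To see what Queen contiguity means, here is a toy pure-Python sketch on a 3x3 grid of square cells (illustrative only; in the notebook, libpysal derives the weights from the actual hexagon geometries): two cells are neighbors if they share an edge or a corner.

```python
def queen_neighbors(rows, cols):
    """Queen contiguity on a rows x cols grid of square cells:
    two cells are neighbors if they share an edge OR a corner."""
    w = {}
    for r in range(rows):
        for c in range(cols):
            cell = r * cols + c
            w[cell] = [
                nr * cols + nc
                for nr in (r - 1, r, r + 1)
                for nc in (c - 1, c, c + 1)
                if 0 <= nr < rows and 0 <= nc < cols and (nr, nc) != (r, c)
            ]
    return w

w = queen_neighbors(3, 3)
print(len(w[0]))  # corner cell: 3 neighbors
print(len(w[4]))  # center cell: 8 neighbors
```

Passed as a (sparse) connectivity matrix, such a structure forbids the algorithm from merging clusters that are not spatial neighbors, which is what produces contiguous clusters below.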

```python
# Building spatial weights to be used as the proximity constraint
w = weights.contiguity.Queen.from_dataframe(df_agg)

# Instantiate the model with proximity constraint
cluster_w_constrain = AgglomerativeClustering(
    n_clusters=5,
    connectivity=w.sparse
)

# Fit the model
cluster_w_constrain.fit(df_rescaled)

# Cluster labels
cluster_w_constrain.labels_

# Adding a column with the cluster labels
df_agg['clusters_w_constrain'] = cluster_w_constrain.labels_
df_agg['clusters_w_constrain'] = df_agg['clusters_w_constrain'].astype(str)

df_agg.head()
```

```python
# Plotting the result of the clustering with constraint
df_agg.explore(
    column='clusters_w_constrain',
    cmap='tab10',
    width='80%',
    height='80%',
    style_kwds={'fillOpacity': 1.0, 'opacity': 0.0}
)
```

As we can see, when we add the proximity constraint, all members of a cluster are connected; without it, members of every cluster appear everywhere. When working with spatial data, it is sometimes desirable that clusters obey such constraints, forming contiguous spatial clusters.

In our problem there is no need for the constraint, as we just want to identify the regions with the best opportunities to open a new store.

# Cluster Profiling

Now we can analyse the profile of each cluster and identify the one containing the hexagons with the biggest sales potential based on our three variables.

To do that, we calculate the average of all three variables for each cluster.

```python
# Making the cluster profiles
clusters_profile = (
    df_agg
    .groupby('clusters_wo_constrain')
    .agg(
        mean_pop=('prop_pop_count', 'mean'),
        mean_income=('prop_total_monthly_income', 'mean'),
        mean_housing=('prop_housing_count', 'mean'),
    )
    .reset_index()
    .sort_values(by=['mean_pop', 'mean_income', 'mean_housing'], ascending=False)
)

clusters_profile
```

Analysing the profiles, we can see that cluster 3 has the highest average monthly income and population, which indicates that these regions have many potential customers with high income.

Having identified the best cluster, we can rank the hexagons within cluster 3 using a weighted average.

```python
# Keeping only the hexagons of cluster 3
cluster3 = df_agg[df_agg['clusters_wo_constrain'] == '3'].copy()

# Weights for each variable
pop_weight = 0.5
income_weight = 0.4
housing_weight = 0.1

# Calculating the weighted average
cluster3['weighted_avg'] = (
    cluster3
    .apply(lambda x: (x['prop_pop_count'] * pop_weight) + (x['prop_total_monthly_income'] * income_weight) + (x['prop_housing_count'] * housing_weight), axis=1)
)

# Ranking the hexagons according to the weighted average
cluster3['hex_rank'] = (
    cluster3['weighted_avg']
    .rank(method='first', ascending=False)
)

cluster3.sort_values('hex_rank', ascending=True)
```

```python
# Plotting the ranked hexagons
cluster3.explore(
    column='hex_rank',
    cmap='gray',
    height='80%',
    width='80%',
    style_kwds={'fillOpacity': 0.7, 'opacity': 1.0, 'color': '#000000'}
)
```

The map above shows the hexagons ranked according to the calculated weighted average: the darker the color, the higher the rank. With this, we could prioritize a deeper analysis of the highest-ranked hexagons to decide where to open the new store.

# Conclusion

In this article we explored the power of spatial clustering, using Uber's H3 library as a way to "normalize" spatial data. We learned how to index spatial data with H3, performed spatial clustering with the hierarchical clustering algorithm, both with and without a spatial constraint, and compared their results. Lastly, we identified the cluster with the biggest potential and ranked the hexagons within it to obtain a prioritization list.

Experiment with different clustering algorithms, apertures, and visualizations to gain a deeper understanding of your data and make informed decisions based on spatial patterns.