Aggregating information of two overlaying GeoDataFrames using GeoPandas with code example

Alvaro Matsuda
10 min read · Feb 7, 2023


Introduction

In this post I will show how to aggregate information from two different GeoDataFrames (layers) using GeoPandas. I will show one of many ways to do this, with a code example. But first, let’s look at a situation where we might need this kind of operation.

Note 01: The notebook and data used in this post are available on my GitHub page. If you want, you can follow this post along with the notebook.

Note 02: If you want to display the notebook in your browser, use this link instead.

The Problem

Let’s say you work for Carrefour in the city of São Paulo and you want to know, for every store in the city, how many people live within 1 km of it and their total monthly income. With that information, you want to prioritize marketing campaigns for the stores with the highest monthly income per capita.

We have a dataframe with the locations of all stores in São Paulo city. We are going to extract the population count and total monthly income from the Brazilian census tract data produced by IBGE.

Now that we know what we want to achieve and where to get the data, we have to get our hands on the code and do some data preparation before we can extract the information we want.

Data Preparation

As I mentioned above, I have collected the locations of Carrefour’s stores in São Paulo city.

# Importing libraries
import pandas as pd
import numpy as np
import geopandas as gpd
import folium

# Loading Carrefour Store data
df_carrefour = pd.read_excel(r'https://github.com/AlvaroMatsuda/medium_posts/blob/main/spatial%20aggregation/data/carrefour_stores_latlong.xlsx?raw=true')
df_carrefour.head()

This is a plain Pandas DataFrame, which we will convert into a GeoPandas GeoDataFrame. For that, we build point geometries from the lat and long columns. After that, we will create a 1 km radius buffer around each store.

# Converting Pandas DataFrame into GeoPandas GeoDataFrame

# Construct geometry Column from lat and long coordinates using geopandas
geometry = gpd.points_from_xy(df_carrefour['long'], df_carrefour['lat'], crs="EPSG:4979")

# Adding new column with the geometry
df_carrefour['geometry'] = geometry

# Constructing GeoDataFrame
gdf_carrefour = gpd.GeoDataFrame(df_carrefour, geometry=df_carrefour['geometry'])
gdf_carrefour.head()
GeoDataFrame with the geometry column

With the GeoDataFrame object we can use the explore() method to plot the map with the location of Carrefour stores:

# Plotting the map to visualize
gdf_carrefour.explore()
The blue dots are Carrefour stores.

Now, we have to create the buffer of 1 km for each store.

# Creating radius buffer
# Converting the CRS to a meter-based CRS so the buffer distance is in meters
gdf_carrefour.to_crs(crs=3857, inplace=True)

# Creating the 1 km buffer column (still in EPSG:3857)
gdf_carrefour['1km_buffer_geom'] = gdf_carrefour.buffer(1000.0)

# Setting the geometry column to the buffer geometry
gdf_carrefour.set_geometry('1km_buffer_geom', inplace=True)

# Converting back to the original CRS. Note that to_crs() only transforms
# the active geometry column, so the point column is converted separately.
gdf_carrefour.to_crs(crs=4979, inplace=True)
gdf_carrefour['geometry'] = gdf_carrefour['geometry'].to_crs(crs=4979)

gdf_carrefour.head()
Carrefour Store GeoDataFrame with 1km_buffer_geom column.
# Plotting the map to visualize
gdf_carrefour.explore()
Map with 1 km buffer for each Carrefour Store

That is all the preparation for the Carrefour store dataframe. Now we have to prepare the census tract data.

Before we start preparing the census tract data, I should mention that the data we need for this problem comes in three different files:

  1. A shapefile with the census tract shapes (polygons)
  2. A CSV file with the population counts
  3. A CSV file with the income information

To get started, we load the census tract shapes/polygons:

# Loading the Brazilian census tract data
census_tract_shapes = gpd.read_file(r'/vsicurl/https://raw.githubusercontent.com/AlvaroMatsuda/medium_posts/main/spatial%20aggregation/data/setor_censitario_sp_2010.shp')

# Reprojecting to the CRS used by the store layer
census_tract_shapes = census_tract_shapes.to_crs('EPSG:4979')

# Filtering to get only shapes from São Paulo city
sp_census_tract_shapes = census_tract_shapes.loc[census_tract_shapes['NM_MUNICIP'] == 'SÃO PAULO']

sp_census_tract_shapes.head()
The shapefile already comes with a geometry column, so we can load it directly with GeoPandas.
# Plotting the map to visualize
sp_census_tract_shapes.explore()
Map with São Paulo census tract shapes

As I said before, the census tract polygons do not come with population and monthly income information. That data comes in separate files, so we have to load each of them:

# Loading total Monthly Income dataframe
# The column that holds the total monthly income is: V003
total_monthly_income_df = pd.read_csv(r'https://github.com/AlvaroMatsuda/medium_posts/blob/main/spatial%20aggregation/data/DomicilioRenda_SP1.csv?raw=true', sep=';', encoding='latin1')

# Converting Column "Cod_setor" from int to str
total_monthly_income_df['Cod_setor'] = total_monthly_income_df['Cod_setor'].astype('str')

total_monthly_income_df.head()
Dataframe of total monthly income (V003 column).
# Loading population dataframe
# The column that holds the population count is: V002
population_df = pd.read_csv(r'https://github.com/AlvaroMatsuda/medium_posts/blob/main/spatial%20aggregation/data/Pessoa11_SP1.csv?raw=true', sep=';', encoding='latin1')

# Converting Column "Cod_setor" from int to str
population_df['Cod_setor'] = population_df['Cod_setor'].astype('str')

population_df.head()
Dataframe of population (V002 column).

All dataframes share the census tract code: it is called “Cod_setor” in the CSV files and “CD_GEOCODI” in the shapefile. We are going to join the dataframes on this column to bring the information we need into the dataframe containing the shapes/polygons.

# Joining the shape dataframe with the population and total monthly income dataframes
sp_census_tract = (
    sp_census_tract_shapes
    .merge(population_df[['Cod_setor', 'V002']], left_on='CD_GEOCODI', right_on='Cod_setor', how='left')
    .merge(total_monthly_income_df[['Cod_setor', 'V003']], left_on='CD_GEOCODI', right_on='Cod_setor', how='left')
    .drop(columns=['Cod_setor_x', 'Cod_setor_y'])  # Dropping duplicated key columns
    .rename(columns={'V002': 'pop_count', 'V003': 'total_monthly_income'})  # Renaming columns
)

# Renaming columns to be all in lowercase
sp_census_tract.columns = map(str.lower, sp_census_tract.columns)

# Converting "X" values to 0
sp_census_tract['pop_count'] = np.where(sp_census_tract['pop_count'] == "X", 0, sp_census_tract['pop_count'])
sp_census_tract['total_monthly_income'] = np.where(sp_census_tract['total_monthly_income'] == "X", 0, sp_census_tract['total_monthly_income'])

# Converting STR to FLOAT of pop_count and total_monthly_income columns
sp_census_tract['pop_count'] = sp_census_tract['pop_count'].astype('float64')
sp_census_tract['total_monthly_income'] = sp_census_tract['total_monthly_income'].astype('float64')

sp_census_tract.head()
Shape dataframe with pop_count and total_monthly_income columns

We added the pop_count and total_monthly_income columns to the shape dataframe that contains the geometry column. That is all the manipulation needed.

Now we are set to aggregate information between the census tract dataframe and the Carrefour store dataframe, to find out how many people live within 1 km of each store and their total monthly income.

Aggregating information of two GeoDataFrames

Before we do the aggregation for all buffers, I will do it for a single buffer, to make the manipulations and the code easier to understand.

First of all, let’s get a single buffer:

# Selecting a single buffer
single_buffer = gdf_carrefour[gdf_carrefour['store_name'] == 'Market Paraíso'].copy()

# Plotting the map to visualize
single_buffer.explore()
Map with 1km radius buffer layer.

Now we will get all census tracts that intersect the buffer:

geometric objects intersect if they have any boundary or interior point in common. (definition from shapely documentation)
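To make that definition concrete, here is a minimal illustration with hypothetical shapes:

# A square and two points (hypothetical shapes)
from shapely.geometry import Point, Polygon

square = Polygon([(0, 0), (2, 0), (2, 2), (0, 2)])
print(square.intersects(Point(1, 1)))  # True: they share an interior point
print(square.intersects(Point(5, 5)))  # False: no point in common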

# Filtering census tracts that intersect the buffer
buffer_geom = single_buffer['1km_buffer_geom'].iloc[0]
intersected_census_tract = sp_census_tract[sp_census_tract['geometry'].intersects(buffer_geom)].copy()

# Plotting the map to visualize
intersected_census_tract.explore()
Map with tracts layer that intersects the buffer.

Let’s visualize both layers together in the same map:

# Creating a folium tile base map
m = folium.Map(location=[-23.568865965761308, -46.629872859184125], zoom_start=12, tiles='OpenStreetMap', crs='EPSG3857', width='60%', height='60%')

# Adding the intersected census tract layer to the map
intersected_census_tract.explore(m=m)

# Adding buffer layer to the map
single_buffer.explore(m=m,style_kwds={'fillOpacity': 0.0, 'color': '#CE2020'})

# Showing the map
m
Visualizing both layers

The image above shows that we have selected all census tracts (blue polygons) that intersect the buffer (red polygon). Now we have to sum the pop_count and total_monthly_income columns to extract the information we need. However, before we sum these columns, we should decide what to do with tracts that are partly outside the buffer. Should we consider only tracts that are 100% inside? Should we count all intersecting tracts in full? Or should we count just a portion of the information stored in the tracts that are partly outside the buffer?

There are many ways to deal with polygons that are partly outside the buffer (the first two are sketched in code right after this list):

  • Use the centroid of the polygon: if the centroid is inside the buffer, count the totality of the information stored in that polygon.
  • Consider only polygons that are 100% inside the buffer.
  • Count a portion of the data stored in each polygon, proportional to its area inside the buffer.
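For reference, here is a minimal sketch of the first two approaches, using the single buffer geometry (buffer_geom) selected earlier:

# 1. Centroid-based: keep tracts whose centroid falls inside the buffer
centroid_based = sp_census_tract[sp_census_tract.centroid.within(buffer_geom)]

# 2. Fully contained: keep only tracts that are 100% inside the buffer
fully_inside = sp_census_tract[sp_census_tract.within(buffer_geom)]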

Depending on the use case, you might want a different approach. For example, if we are working with polygons that represent boroughs/districts of a city, we have to be careful not to count the same tract twice in different boroughs/districts (duplicating information) and not to leave a tract out entirely (losing information), depending on the chosen approach.

In our case, I chose the third approach. To get started, we first calculate, for each tract polygon, the proportion of its area that is inside the buffer. One caveat: our layers are in a geographic CRS (EPSG:4979), so areas are computed in squared degrees and GeoPandas will warn about it; since we only use the ratio of two areas over a small region, the distortion largely cancels, but for precise results you would reproject to a projected CRS first.

# Calculating the proportion of each tract's area that falls inside the buffer
intersected_census_tract['prop_area'] = (
    intersected_census_tract['geometry'].intersection(buffer_geom).area
    / intersected_census_tract['geometry'].area
)

# Rounding values close to 1 up to exactly 1 (100%)
intersected_census_tract['prop_area'] = np.where(intersected_census_tract['prop_area'] >= 0.998, 1, intersected_census_tract['prop_area'])

intersected_census_tract.head()
Dataframe with prop_area column

To explain the code above: it calculates, for each tract, the area that is inside the buffer and divides it by the tract’s total area, storing the result in a new prop_area column. The code also rounds the value up to 1 when prop_area is greater than or equal to 0.998. We do that because the computed intersection area is not exact: even when a tract is 100% inside the buffer, the ratio can come out as 0.9999 due to minor numerical differences.
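As a quick sanity check, here is the same weighting with toy numbers (hypothetical values, not taken from the census data):

# Hypothetical tract to illustrate the area weighting
tract_area = 10_000.0          # total tract area
area_inside = 6_000.0          # part of the tract covered by the buffer

prop_area = area_inside / tract_area             # 0.6
pop_count = 200
prop_pop_count = round(pop_count * prop_area)    # 120 people attributed to the buffer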

After that, we can use the prop_area column to calculate the proportion of the information to be considered when summing pop_count and total_monthly_income columns.

# Calculating the proportion of the pop_count according to the prop_area
intersected_census_tract['prop_pop_count'] = round(intersected_census_tract['pop_count'] * intersected_census_tract['prop_area'])

# Calculating the proportion of the total_monthly_income according to the prop_area
intersected_census_tract['prop_total_monthly_income'] = round(intersected_census_tract['total_monthly_income'] * intersected_census_tract['prop_area'])

intersected_census_tract.head()

Then, we can sum the prop_pop_count and prop_total_monthly_income columns to get the number of people living within 1 km of the store and their total monthly income.

# Calculating the sum of pop_count and total_monthly_income inside the buffer
single_buffer['1km_buffer_pop_count'] = intersected_census_tract['prop_pop_count'].sum()
single_buffer['1km_buffer_total_monthly_income'] = intersected_census_tract['prop_total_monthly_income'].sum()

single_buffer

Now that we have done all the processing for one buffer, I will show the code for all buffers.

# Iterating over each buffer
for index, value in gdf_carrefour.iterrows():

    # Selecting tracts that intersect the buffer
    df_intersected = sp_census_tract[sp_census_tract['geometry'].intersects(value['1km_buffer_geom'])].copy()

    # Calculating the proportion of the area inside the buffer
    df_intersected['prop_area'] = df_intersected['geometry'].intersection(value['1km_buffer_geom']).area / df_intersected['geometry'].area

    # Rounding values close to 1 up to exactly 1 (100%)
    df_intersected['prop_area'] = np.where(df_intersected['prop_area'] >= 0.998, 1, df_intersected['prop_area'])

    # Weighting pop_count and total_monthly_income by prop_area
    df_intersected[['prop_pop_count', 'prop_total_monthly_income']] = df_intersected[['pop_count', 'total_monthly_income']].multiply(df_intersected['prop_area'], axis=0)

    # Storing the totals inside the buffer on the corresponding store row
    gdf_carrefour.loc[index, ['1km_buffer_pop_count', '1km_buffer_total_monthly_income']] = list(df_intersected[['prop_pop_count', 'prop_total_monthly_income']].sum())

gdf_carrefour.head()
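As a side note, the same aggregation can be done without an explicit Python loop. Below is a sketch using geopandas.overlay, assuming both layers share the same CRS and that store_name uniquely identifies each store:

# Vectorized alternative (a sketch): intersect all tracts with all buffers
# at once, then aggregate per store.
tracts = sp_census_tract.copy()
tracts['tract_area'] = tracts.geometry.area

# Use the buffers as the active geometry of the store layer
buffers = gdf_carrefour.set_geometry('1km_buffer_geom')[['store_name', '1km_buffer_geom']]

# One row per (tract, buffer) pair that intersects
inter = gpd.overlay(tracts, buffers, how='intersection')

# Proportion of each tract inside each buffer, then weighted sums per store
inter['prop_area'] = inter.geometry.area / inter['tract_area']
agg = (
    inter
    .assign(
        prop_pop_count=inter['pop_count'] * inter['prop_area'],
        prop_total_monthly_income=inter['total_monthly_income'] * inter['prop_area'],
    )
    .groupby('store_name')[['prop_pop_count', 'prop_total_monthly_income']]
    .sum()
)

The loop version is easier to follow step by step, but overlay scales better as the number of stores grows.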

At last, we can calculate the monthly income per capita to find the stores whose surrounding population has the highest buying potential, and prioritize marketing campaigns for those stores.

# Calculating monthly income per capita
gdf_carrefour['monthly_income_per_capita'] = gdf_carrefour['1km_buffer_total_monthly_income'] / gdf_carrefour['1km_buffer_pop_count']

# Sorting DataFrame to show 5 stores with highest monthly income per capita
gdf_carrefour.sort_values(by='monthly_income_per_capita', ascending=False).head()

With that, we could prioritize marketing campaigns for these 5 stores to attract new customers and encourage existing customers to spend more there.

# Creating a copy to use as store point layer
point_gdf_carrefour = gdf_carrefour.copy()

# Setting the geometry to the "geometry" column (points with the store locations)
point_gdf_carrefour = point_gdf_carrefour.set_geometry('geometry')

# Creating a folium tile base map
m = folium.Map(location=[-23.568865965761308, -46.629872859184125], zoom_start=12, tiles='OpenStreetMap', crs='EPSG3857', width='70%', height='70%')

# Adding buffer layer to the map
gdf_carrefour.sort_values(by='monthly_income_per_capita', ascending=False).head().explore(style_kwds={'fillOpacity': 0.0}, m=m)

# Adding Store location layer to the map
point_gdf_carrefour.loc[gdf_carrefour.sort_values(by='monthly_income_per_capita', ascending=False).head().index, :].explore(style_kwds={'color': '#000000', 'weight': 8}, m=m)

# Showing the map
m

Conclusion

In the GIS (Geographic Information System) market, this kind of manipulation is used very often, and there are plenty of other use cases for it.

I wanted to show how it is done in Python because, as a data scientist who has worked with GIS before, I believe we can enhance our analyses and models with this kind of manipulation. For instance, we could use the aggregated information for each buffer as new features for a machine learning model.

I hope this post helps you in some way, and don’t forget that the notebook with the code and data used in this post is on my GitHub.

About me

I am a geographer currently working as a data scientist. For that reason, I am an advocate of spatial data science.

Follow me for more content. I will write a post every month about data science concepts and techniques, spatial data science and GIS (Geographic Information System).
