Spatial Data Science: 3 main data structures for GeoSpatial data

Alvaro Matsuda
9 min read · Jan 6, 2023


Photo by Kyle Glenn on Unsplash

Introduction

It is said that wherever there is data, there is work for data scientists. And in the geospatial field there is a lot of data being generated that can be analysed to extract insights from. As Sergio J. Rey, Dani Arribas-Bel and Levi J. Wolf wrote in their free online book (Geographic Data Science with Python):

“everything has a location in space-time, and this location can be used directly to make better predictions or inferences.”

So, in theory, every piece of data that we store could potentially have a location attached to it, because everything that happens in our world takes place in a certain location. And adding this geospatial element to our data can enhance our analysis.

As we can see, geographic data can be everywhere. And to start working with geospatial data, we first need to know how it is structured. So in this post I will summarize Part I of the aforementioned book, specifically the section Geographic thinking for data scientists, which discusses the three main data structures used for geospatial data analysis:

  1. Geographic tables;
  2. Surface;
  3. Spatial graphs.

Geographic Tables

Basically, a geographic table is a typical table of rows and columns where one of the columns stores geographic information. This column stores the geometry that describes how each sample (row) is represented on a map, along with its coordinates (latitude and longitude).

In cartography, it is established that everything can be mapped using 3 geometries: point, line and polygon. The image below shows how these 3 geometries are used to map the world.

Image from Google Maps of the Savassi borough in the city of Belo Horizonte, Brazil.
  • Points: McDonald’s, Royal Golden Hotel Savassi, Supermernosso Funcionários, etc.
  • Lines: Mainly the streets.
  • Polygon: The surface/area inside the red line, delimiting the Savassi borough.

Each of these 3 geometries has a Multi variant (MultiPoint, MultiLineString, MultiPolygon) that indicates a collection of multiple geometries of the same type.

So, when we use a geographic table, this additional column describes whether each sample is represented as a point, line or polygon, along with its respective coordinates.
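
These geometry objects come from the shapely library, which GeoPandas uses under the hood. A minimal sketch (the coordinates below are arbitrary values around Belo Horizonte, used only for illustration):

from shapely.geometry import Point, LineString, Polygon, MultiPoint

# A single location (longitude, latitude), e.g. a point of interest
pt = Point(-43.938, -19.937)

# A sequence of coordinates, e.g. a street segment
street = LineString([(-43.940, -19.936), (-43.936, -19.939)])

# A closed area, e.g. a borough boundary
area = Polygon([(-43.945, -19.930), (-43.930, -19.930), (-43.930, -19.945), (-43.945, -19.945)])

# The Multi variant: a collection of geometries of the same type
points = MultiPoint([(-43.938, -19.937), (-43.935, -19.940)])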

In the example below, we load the New York City boroughs dataset (shipped with the GeoPandas library), which contains the borough code, name, length, area and the geometry column.

import geopandas as gpd

gdf = gpd.read_file(gpd.datasets.get_path('nybb'))
gdf

As we can see, the geometry column stores the geometry that represents the area of each borough (a multipolygon), with the coordinates that delimit its surface/area. If we check the data type of each column, we will see that the geometry column has a special geometry dtype. This is a feature of GeoPandas, which knows how to interpret this type of information.

gdf.dtypes
Data types of each column.

And if we use the plot() method we will see the shape of each borough:
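
The call is simply the plot() method of the GeoDataFrame we loaded above (matplotlib must be installed for plotting):

# Plot the polygons stored in the geometry column, one shape per borough
gdf.plot()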

Because a geographic table has a tabular structure, we can leverage all the advantages that this brings to data manipulation. As you might have guessed, the geometry column is treated as just one more feature, so we can apply every data manipulation method that works on a tabular structure.

In fact, with a geographic table we can go further, because we can also do manipulations that consider the spatial element. Some of the most common manipulations of spatial data are aggregating information contained in a polygon, counting points that fall inside a polygon, calculating the area of a geometry or the distance between two geometries, and computing the density of points. Those are just some examples of what we can do with a geographic table, and a couple of them are sketched below.
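
A minimal sketch of two of these operations using the boroughs dataset loaded above (the two points are made up for illustration; the nybb dataset ships in a projected CRS, so areas come out in that CRS's squared units):

import geopandas as gpd
from shapely.geometry import Point

gdf = gpd.read_file(gpd.datasets.get_path('nybb'))

# Calculate the area of each borough polygon
gdf['computed_area'] = gdf.geometry.area

# Count points that fall inside each borough using a spatial join
# (the two points below are hypothetical locations in the dataset's projected coordinates)
points = gpd.GeoDataFrame(
    {'name': ['a', 'b']},
    geometry=[Point(980000, 200000), Point(1000000, 240000)],
    crs=gdf.crs,
)
joined = gpd.sjoin(points, gdf, predicate='within')
print(joined.groupby('BoroName').size())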

Surface

A surface data structure can be compared to an image data structure. If you have already worked with images in data science, you know the structure of bands/layers and pixels. A surface is just like that, except that each row corresponds to a specific latitude, each column corresponds to a specific longitude, and the value of each cell stores the measurement taken at that specific location.

For example, a surface for air pollution will be represented as an array where each row is linked to the pollutant level measured along a specific latitude, and each column to a specific longitude. If we want to represent more than one phenomenon (e.g. air pollution and elevation), or the same phenomenon at different points in time, we will need different arrays that are possibly connected.

To manipulate this type of data, we can use the xarray library. This library deals with N-dimensional arrays, also known as data cubes or tensors. Xarray allows us to easily manipulate this type of data, performing math operations and aggregations across arrays/layers and much more.

Let’s look at an example with the air temperature dataset that is available through xarray’s tutorial module:

import xarray

temperature = xarray.tutorial.open_dataset('air_temperature')
temperature
The output of the above code showing the data structure of this dataset.

Looking at the output, we can identify a few things: the dimensions (lat: 25, time: 2920, lon: 53), the latitude, longitude and time coordinates, and the data variable, which stores air temperature. We can also add attributes to the dataset to describe it and attach extra information.

In summary, we can interpret this dataset as a stack of arrays in which “lat” indexes the rows, “lon” indexes the columns, “time” identifies each array/layer, and the values are air temperatures.
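
As a small illustration of the kind of operation xarray makes easy (this example is not from the original post), we can aggregate over the time dimension and then query a location:

import xarray as xr

temperature = xr.tutorial.open_dataset('air_temperature')

# Average the air temperature over the time dimension, leaving a (lat, lon) surface
mean_temp = temperature['air'].mean(dim='time')

# Look up the value closest to a given coordinate (temperatures are in Kelvin)
print(mean_temp.sel(lat=40.0, lon=285.0, method='nearest').values)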

Spatial Graphs

A spatial graph is a data structure that captures relationships between objects in space. Spatial graphs derive from graph theory, which studies relationships between objects. In fact, a spatial problem (Euler’s Seven Bridges of Königsberg) is what originated graph theory, and today it is applied in a variety of fields. In other words, we can think of spatial graphs as geographic networks.

In graph theory there are basically two main elements:

  1. Vertices, or nodes;
  2. Edges, which connect nodes and may or may not carry direction information.

Therefore, a spatial graph data structure stores information about how a given node is spatially connected/related to the other nodes in a dataset. The NetworkX library does exactly that. Its internal data structure is, as defined in its documentation:

The graph internal data structures are based on an adjacency list representation and implemented using Python dictionary datastructures. The graph adjacency structure is implemented as a Python dictionary of dictionaries; the outer dictionary is keyed by nodes to values that are themselves dictionaries keyed by neighboring node to the edge attributes associated with that edge.
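
A tiny sketch of that dict-of-dicts structure (the node names and edge attribute are made up):

import networkx as nx

G = nx.Graph()
G.add_edge('Savassi', 'Funcionarios', distance_km=1.2)

# The adjacency structure: outer dict keyed by node, inner dict keyed by neighbor
print(nx.to_dict_of_dicts(G))
# {'Savassi': {'Funcionarios': {'distance_km': 1.2}}, 'Funcionarios': {'Savassi': {'distance_km': 1.2}}}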

An example of the use of this type of data is the study from IBGE (Instituto Brasileiro de Geografia e Estatística) about the network of cities, which defines hierarchies among cities that are connected/related according to parameters such as the quantity and quality of services (hospitals, universities), population flows and more. Many cities and regions base their planning on this study.

Map of Brazil’s city network.

In the map above, each point is a node and each line (a relationship between cities) is an edge connecting nodes.

An example using the OSMnx library, which builds NetworkX graphs from OpenStreetMap data, is shown below:

import osmnx

# Download the Parque Ibirapuera street network from OpenStreetMap and plot it
graph_sp = osmnx.graph_from_place('Parque Ibirapuera, Sao Paulo, Brazil')
osmnx.plot_graph(graph_sp)
Graph from Parque Ibirapuera, São Paulo, Brazil.

The graph_from_place call downloads the street network of Parque Ibirapuera, in the city of São Paulo, from OpenStreetMap through the OSMnx library. The graph_sp variable is a NetworkX graph object (a MultiDiGraph). Then, we use the plot_graph function to visualize the graph.

As we can see, the graph of Parque Ibirapuera has 251 nodes and 729 edges; in this case, nodes are street intersections and edges are the streets connecting them. On top of this structure, the NetworkX library offers graph manipulation, analysis, algorithms and much more, as sketched below.
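
For instance, a minimal sketch of a routing query on this graph (not from the original post; the two endpoints are picked arbitrarily, and the graph is converted to undirected so a path always exists):

import networkx as nx
import osmnx

graph_sp = osmnx.graph_from_place('Parque Ibirapuera, Sao Paulo, Brazil')

# Shortest path between two arbitrary intersections, weighted by street length
G = graph_sp.to_undirected()
nodes = list(G.nodes)
route = nx.shortest_path(G, nodes[0], nodes[-1], weight='length')
print(len(route), 'intersections along the route')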

Spatial Weights Matrix

Another structure used for spatial graphs is the spatial weights matrix. As the name suggests, it is a matrix where each cell represents the relationship between a pair of spatially referenced objects in a dataset. These relationships are often described in terms of neighbors: each cell indicates whether the object represented by the row is a neighbor of the object represented by the column.

There are two elements in spatial weights matrix that we need to define and adjust according to what we want to accomplish:

  1. How to decide whether two objects are neighbors;
  2. How to encode the weight when two objects are neighbors.

There are many ways or rules we can use to determine whether a pair of objects are neighbors. For example, we can determine neighbors by contiguity, in other words, whether the pair of objects shares a boundary. Or we can use distance, setting a fixed radius so that every object within that radius is considered a neighbor.

After we have decided how to identify neighbors, we have to decide how to encode the relationship between observations. One way is a binary encoding [0, 1]: 0 for non-neighbors and 1 for neighbors.

Another way is setting 0 for non-neighbors and a value between 0 and 1 for neighbors, and there are many rules we can apply to define the weight of each neighbor. For example, one of the most common methods is to assign weights through a decay function of distance, where close neighbors get large weights and farther neighbors get smaller weights. Both encodings are sketched below.
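
A minimal sketch of these two encodings using libpysal’s DistanceBand weights (the coordinates and the threshold are arbitrary values in projected units):

import numpy as np
from libpysal import weights

# Hypothetical coordinates of five observations (e.g. meters in a projected CRS)
coords = np.array([[0, 0], [0, 100], [100, 0], [100, 100], [200, 200]])

# Binary encoding: 1 if two observations lie within 150 units of each other, 0 otherwise
w_binary = weights.DistanceBand(coords, threshold=150, binary=True)
print(w_binary.full()[0])

# Distance decay: weight = distance ** alpha, so closer neighbors get larger weights
w_decay = weights.DistanceBand(coords, threshold=150, binary=False, alpha=-1.0)
print(w_decay.full()[0])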

Let’s take a quick look at the pysal library. To do that, I first create a GeoDataFrame with some hexagons:

Then, we can construct our spatial weights matrix from the GeoDataFrame using pysal, as shown below:
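
The original code and its output were shown as images; here is a minimal sketch of the same idea, using a small grid of squares as a stand-in for the hexagons:

import geopandas as gpd
from libpysal import weights
from shapely.geometry import box

# A 3x3 grid of unit squares, standing in for the hexagon tessellation
cells = [box(x, y, x + 1, y + 1) for y in range(3) for x in range(3)]
grid = gpd.GeoDataFrame(geometry=cells)

# Queen contiguity: polygons sharing at least one boundary point (edge or corner) are neighbors
w_queen = weights.Queen.from_dataframe(grid)

print(type(w_queen))           # libpysal.weights.contiguity.Queen
print(w_queen.neighbors)       # e.g. cell 0 (a corner) has neighbors 1, 3 and 4
print(w_queen.mean_neighbors)  # average number of neighbors per cell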

Constructing spatial weights from a GeoDataFrame is as simple as that. Moreover, we can see how each hexagon is related to the others.

We could use other types of weights constructors, like Rook, Voronoi, distance-based, and many others that pysal offers. To see the other types of weights constructors, check the pysal documentation.

We can check that the weights object is of type weights.contiguity.Queen from pysal. Knowing that, we can inspect other things, like the neighbors dictionary and mean_neighbors.

There are many intricacies in spatial weights, and this post is just an introduction to these topics. If you want to know more about the libraries shown, check their documentation.

Conclusion

Geospatial data structures are very similar to other data structures, but they have something additional: a spatial component. We briefly saw what those data structures look like (geographic table, surface and spatial graphs) and which Python libraries handle them.

Dealing with geospatial data can bring a bit of complexity, but it greatly expands and enhances our analyses.

About me

I am a geographer currently working as a data scientist. For that reason, I am an advocate of spatial data science.

Follow me for more content. I will write a post every month about data science concepts and techniques, spatial data science and GIS (Geographic Information System).

References

Part I — Building Blocks of the free online book Geographic Data Science with Python by Sergio J. Rey, Dani Arribas-Bel and Levi J. Wolf.
