A practical guide to improving GeoPandas' spatial analysis performance, with code and benchmarks
Introduction
Geospatial processing is notorious for its resource-intensive demands, particularly when dealing with raster data. However, vector-based processing can also strain system resources, especially with large datasets and multiple operations.
GeoPandas, the primary Python library for handling vector datasets, builds upon several other key geospatial libraries in the Python ecosystem, including Shapely, Fiona, and pyproj. While GeoPandas generally delivers good performance and efficient geospatial data handling, vector data, though typically less demanding than raster data, can still become a bottleneck in projects, particularly when sizable datasets must pass through many processing iterations.
In a recent project for the Brazilian National Water Agency, I dealt with the South American hydrography network. This vector database comprises nearly a million river reaches, necessitating thousands of spatial queries and correlations with other layers. It serves as a prime example of when advanced techniques become crucial for optimizing GeoPandas' performance.
In this article, we’ll delve into six practical tips (complete with code examples and benchmarks) that prove invaluable when working with large vector datasets using GeoPandas.
For our illustrative examples, we’ll utilize Brazil’s 1:250,000-scale hydrography network, as depicted in Figure 1, which is freely accessible via the National Water and Sanitation Agency’s metadata portal.
Tip 1: Choose the Appropriate Engine for Data Reading
In GeoPandas, the underlying engine you select can greatly impact data reading efficiency. Starting from version 0.11 (released in July 2022), GeoPandas introduced a new engine called pyogrio for reading vector datasets. However, it’s essential to note that the default engine remains Fiona.
Our test GeoPackage has almost 1 GB of data and almost 1 million features. When using the default configuration, reading this dataset into memory can take up to 5 minutes—a significant bottleneck.
To improve this performance, we can switch to the pyogrio engine by passing engine='pyogrio' to the read_file command. If we select pyogrio, we can further enhance data transfer from GDAL to Python by setting use_arrow=True, which leverages Apache Arrow for more efficient data handling.
Let’s look at the code and the benchmarking results for these scenarios.
The chart clearly demonstrates a remarkable 10x performance improvement (reducing the read time from 290 seconds to 26 seconds) when using pyogrio and enabling Arrow. This optimization can significantly impact your workflow depending on your specific use case.
Tip 2: Tabular Filtering
When working with large datasets, reading the entire dataset can be time-consuming. GeoPandas offers an efficient solution: tabular filtering. Instead of loading the entire dataset, you can filter a subset of the data based on specific conditions or criteria before reading it.
Here’s how it works:
Row-Based Filtering: You can define filters to select specific rows based on a slice. For that, we need to know the exact positions of the features we want to load.
SQL Query Filtering: Starting from version 0.12, GeoPandas also supports SQL query-based filtering. With this approach, only the data that meets the query conditions is loaded. It’s a powerful way to optimize performance when dealing with large datasets.
For instance, you might want to load only a specific river from your dataset. In our example, let’s focus on the São Francisco River. Since we don't know the positions of the São Francisco river reaches in the database, but we do know the river code shared by all of them (676), we can achieve this using SQL query filtering.
Let's take a look at the code and times. In this example, tabular filtering yielded a further reduction of roughly 25% in read time, from 26 to 19 seconds, which can be important when only a subset of the data will be used.
Tip 3: Spatial Filtering
When it comes to optimizing GeoPandas performance, spatial filtering outshines tabular filtering. Here, we explore two spatial filtering options:
Polygon-Based Filtering:
With polygon-based filtering, you define a mask—a polygon or a set of polygons expressed as Shapely geometry, GeoDataFrame, or GeoSeries objects.
This approach is handy for precise spatial queries, but there’s a caveat: It works exclusively with the Fiona engine.
Bounding Box Filtering:
To leverage the pyogrio engine, consider bounding box filtering.
Here’s how it works: Instead of precisely defining a polygon, we load a larger area (a bounding box) that encompasses the desired region.
While this approach reads more data than is strictly necessary, the resulting performance improvement is well worth it.
Let’s measure the impact of spatial filtering in practice. Suppose we want to load river reaches intersecting the Brazilian state of Sergipe. We’ll compare the two aforementioned scenarios.
Note that adding spatial filtering, even with Fiona as the engine, significantly improves the time to open the subset of the dataset that intersects the desired state. However, even though it reads more data than strictly necessary, pyogrio with a spatial bounding box is the go-to option for maximum performance.
Tip 4: Coordinate Indexing (cx)
While reading the dataset is essential, efficient spatial operations are equally crucial. Imagine you already have your geospatial datasets in memory and need to perform tasks like spatial joins. Even though these operations are inherently costly, small performance gains can significantly impact overall efficiency, especially when dealing with a large number of operations.
Let’s consider an example where we want to select the river reaches intersecting the Brazilian state of Sergipe. The difference from the previous example is that here the entire dataset is already loaded in memory.
We will compare two approaches. In the first scenario, we perform a simple spatial join between the river reaches dataset and the desired state. In the second scenario, we first filter the dataset by the state's bounding box using the coordinate indexer (cx) and only then perform the spatial operation.
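The second scenario can be sketched as a small helper. The direct join in the first scenario would simply be `gpd.sjoin(reaches, state, predicate="intersects")` without the pre-filtering step:

```python
import geopandas as gpd


def reaches_in_state(reaches: gpd.GeoDataFrame, state: gpd.GeoDataFrame) -> gpd.GeoDataFrame:
    """Pre-filter with the coordinate indexer, then run the spatial join.

    The cx indexer is a cheap bounding-box selection that discards most
    non-candidate features before the expensive intersection test.
    """
    xmin, ymin, xmax, ymax = state.total_bounds
    candidates = reaches.cx[xmin:xmax, ymin:ymax]
    return gpd.sjoin(candidates, state, predicate="intersects")
```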
Notice that simple filtering just before the main spatial operation can cut down times by almost 60%. If many operations like these are to be performed, that can make a huge difference.
Tip 5: Parallel Processing
In scenarios where multiple operations need to be performed, the previous optimization techniques might not suffice. Consider a situation where you want to select river reaches within each state for all 27 Brazilian states. Even if each operation takes less than 1 second (perhaps more for larger states), the cumulative time can become significant. In such cases, parallelizing the processing becomes essential.
To demonstrate parallel processing, let’s create a function that selects the river reaches intersecting each state. We’ll compare two scenarios:
Synchronous Processing:
In the first scenario, we use a straightforward for-loop for synchronous processing.
While this approach works, it might not fully exploit the available computing resources.
Parallelized Processing (Using concurrent.futures):
In the second scenario, we create a pool executor (e.g., concurrent.futures.ProcessPoolExecutor) to manage the parallel workers.
By distributing the workload across multiple cores, we achieve better performance.
Let's take a look at the code and processing times.
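A sketch of both scenarios, assuming the `reaches` and `states` GeoDataFrames are already in memory (the file paths are placeholders). Note that a process pool pickles its arguments, so shipping a large GeoDataFrame to each worker adds serialization overhead:

```python
from concurrent.futures import ProcessPoolExecutor
from functools import partial

import geopandas as gpd


def reaches_in_state(state_geom, reaches):
    """Select the river reaches intersecting one state's geometry."""
    return reaches[reaches.intersects(state_geom)]


def select_all_states(reaches: gpd.GeoDataFrame, states: gpd.GeoDataFrame):
    worker = partial(reaches_in_state, reaches=reaches)

    # Scenario 1 -- synchronous for-loop:
    # return [worker(geom) for geom in states.geometry]

    # Scenario 2 -- one task per state, spread across CPU cores:
    with ProcessPoolExecutor() as pool:
        return list(pool.map(worker, states.geometry))


if __name__ == "__main__":
    reaches = gpd.read_file("hydrography.gpkg")  # hypothetical paths
    states = gpd.read_file("states.gpkg")
    selections = select_all_states(reaches, states)
```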
Note that, while not dramatic, we gain an additional roughly 35% in performance. Remember that, in this scenario, the degree of improvement depends on hardware specifications, the number of computing cores, and the specific workload. Parallel processing can significantly boost efficiency, especially for large-scale geospatial tasks!
Tip 6: Geometry Simplification
When other optimization techniques fall short, consider creating a simpler representation of your geospatial dataset. The simplify() method in GeoPandas is a powerful tool for achieving this. It primarily simplifies geometric shapes—such as polygons and line strings—by reducing the number of vertices while preserving the overall shape as closely as possible.
Simplification can speed up spatial operations such as intersection, buffering, and spatial joins, making them more efficient for large datasets as well as minimizing storage and bandwidth requirements.
It's important to highlight that the method is topology-preserving by default (preserve_topology=True), meaning each simplified geometry keeps its basic shape and validity while the number of vertices is reduced; note, however, that shared boundaries between neighboring features are not guaranteed to stay perfectly aligned.
In this next example, let's compare a spatial clipping operation that cuts the river reaches to the bounds of the Santa Catarina state using the original dataset and a simplified version. In addition, we will compare the disk storage space of the simplified version with that of the original version. Let's get into the code:
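A sketch of the comparison follows. The file paths, the `"NAME"` attribute, and the 0.001-degree tolerance are all assumptions to adapt to your data and CRS:

```python
import geopandas as gpd


def simplify_reaches(reaches: gpd.GeoDataFrame, tolerance: float) -> gpd.GeoDataFrame:
    """Return a copy of the dataset with simplified geometries.

    `tolerance` is expressed in CRS units (degrees for a geographic CRS);
    tune it to your accuracy requirements. preserve_topology=True is the
    GeoPandas default.
    """
    out = reaches.copy()
    out["geometry"] = out.geometry.simplify(tolerance)
    return out


# Usage sketch (paths, attribute name, and tolerance are assumptions):
# reaches = gpd.read_file("hydrography.gpkg", engine="pyogrio")
# states = gpd.read_file("states.gpkg")
# santa_catarina = states[states["NAME"] == "Santa Catarina"]
# simplified = simplify_reaches(reaches, tolerance=0.001)
# clipped = gpd.clip(simplified, santa_catarina)        # faster than clipping the original
# simplified.to_file("simplified.gpkg", driver="GPKG")  # smaller on disk, too
```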
In addition to the 30% increase in clipping performance, the simplified version occupies significantly less disk space—more than four times smaller—while still preserving essential information.
When simplifying geometries, it's important to fine-tune the simplification tolerance to your specific use case. Whether you’re looking for faster spatial operations or efficient storage, geometry simplification is a valuable technique!
Conclusion
In the dynamic world of geospatial analysis, efficiency matters. Whether you’re working with hydrography networks, land cover data, or administrative boundaries, optimizing performance ensures smoother workflows and faster insights.
Remember that each tip (reading engine, tabular and spatial filtering, coordinate indexing, parallel processing and simplification) plays a vital role in enhancing GeoPandas performance and serves distinct purposes. Whether you’re a seasoned geospatial analyst or just starting your journey, these strategies empower you to work smarter, faster, and more effectively.