GeoPandas is a Python package that extends the very popular Pandas package with the ability to read, analyze, and visualize geospatial data. Like Pandas, GeoPandas is generally used within a Jupyter notebook, which provides a powerful framework for documenting your analysis workflow. Over the past few years I have moved increasingly towards using GeoPandas for any analysis project, as I have found it to have many advantages over traditional desktop GIS approaches. (See my blog post Geospatial Data Science vs GIS for more information.)
Some basic knowledge of Python is required to use GeoPandas; however, you do not need to be an expert programmer to take advantage of it. Python provides the syntax necessary to call GeoPandas methods, but most GeoPandas code is simple and easy to read. As such, it is a great way to learn Python.
What is Pandas?
In order to better understand what exactly GeoPandas is, let's first talk about Pandas. Pandas allows you to read a variety of data structures into a Pandas dataframe. You can think of a dataframe as an in-memory version of a table. Each row contains a record and each column contains a field, or attribute. Pandas can read data into a dataframe from a file (CSV, spreadsheet, etc.), from a database table, or even from a web page.
Once the data is read into a Pandas dataframe, it is accessible from Python and can be manipulated in a variety of ways using a combination of Pandas methods and other Python packages. You can easily select a subset of rows and/or columns, create new columns and populate them with values based on existing data, join dataframes based on a common attribute, visualize the data using Matplotlib, Seaborn, Plotly, Bokeh, or other Python packages, and analyze the data using statsmodels, scikit-learn, etc. The entire Python data science stack is available to you once you get your data into a Pandas dataframe.
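As a rough illustration of the operations just described, here is a minimal sketch using plain Pandas. The table contents and the column names (nest_id, species, county_id, eggs, hatched) are made up for this example; in practice the data would more likely come from a call such as pd.read_csv.

```python
import pandas as pd

# Hypothetical nest records, typed in by hand so the example is self-contained
nests = pd.DataFrame({
    "nest_id": [1, 2, 3, 4],
    "species": ["Golden Eagle", "Red-tailed Hawk", "Golden Eagle", "Osprey"],
    "county_id": [10, 10, 20, 20],
    "eggs": [2, 3, 1, 3],
    "hatched": [1, 3, 1, 2],
})

# Hypothetical lookup table sharing the county_id attribute
counties = pd.DataFrame({"county_id": [10, 20], "county": ["Adams", "Baker"]})

# Select a subset of rows (one species) and columns
eagles = nests.loc[nests["species"] == "Golden Eagle", ["nest_id", "eggs"]]

# Create a new column from values in existing columns
nests["hatch_rate"] = nests["hatched"] / nests["eggs"]

# Join the two dataframes on their common attribute
joined = nests.merge(counties, on="county_id", how="left")
```

Each step is a single line, and none of them modifies a file on disk.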
What is GeoPandas?
GeoPandas extends Pandas to work with geospatial data in much the same way as PostGIS extends PostgreSQL to work with geospatial data. The first step is to read your geospatial data into a GeoDataFrame. A GeoDataFrame can be thought of as an in-memory version of a feature class, and each row in a GeoDataFrame corresponds to a spatial feature (point, line, or polygon). The columns, referred to as Series in Pandas terminology, correspond to attributes. In addition to text, numeric, and date columns, each GeoDataFrame has one or more GeoSeries containing the spatial information for each feature. GeoPandas can read data from shapefiles, GeoPackages, GeoJSON, and other common file-based geospatial storage formats, as well as from spatial database tables such as PostGIS.
Once your data is read into a GeoDataFrame, it can be manipulated in a variety of ways. It is easy to create a subset of a GeoDataFrame based on attributes or spatial coordinates. It is also easy to create a subset that contains just the columns that are needed. You can create new columns based on data in other columns. You can perform most common geospatial operations such as buffering, intersections, clipping, etc. You can also visualize both the tabular data as charts and the spatial data as maps using a variety of visualization tools. GeoPandas also serves as a core technology for geospatial data science, and most Python data science packages, such as statsmodels, PySAL, and scikit-learn, will read directly from a GeoDataFrame.
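To make the idea of a GeoDataFrame a little more concrete, the sketch below builds a tiny one by hand and takes a few subsets. The attribute names and coordinates are invented for illustration; real data would normally be read from a file or database.

```python
import geopandas as gpd
from shapely.geometry import Point, box

# A hand-built GeoDataFrame: ordinary attribute columns plus a geometry column
gdf = gpd.GeoDataFrame(
    {"name": ["A", "B", "C"], "status": ["active", "inactive", "active"]},
    geometry=[Point(0, 0), Point(1, 1), Point(5, 5)],
    crs="EPSG:4326",
)

# Subset by attribute, exactly as in Pandas
active = gdf[gdf["status"] == "active"]

# Subset by location: keep only the points that fall inside a rectangle
near_origin = gdf[gdf.within(box(-1, -1, 2, 2))]

# Keep just the columns that are needed (geometry stays, so it is still spatial)
names_only = gdf[["name", "geometry"]]
```

The geometry column here is a GeoSeries, while name and status are ordinary Pandas Series.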
You will be happy to know that GeoPandas uses the well known and well understood open source libraries GDAL/OGR, GEOS, and PROJ under the hood for data input/output, spatial analysis, and projection needs. These libraries have stood the test of time over many years and are used by virtually all open source geospatial packages. This means that your results should be consistent with the results you would get with PostGIS or QGIS.
What are Jupyter notebooks?
Jupyter notebooks are a key component of the Python data science workflow. They allow you to integrate rich text documentation with Python code blocks and the output from that Python code. The text can include complex formulas (using LaTeX equation formatting), pictures, hyperlinks, lists, and other formatting. This allows you to document your analysis from start to finish, as you are conducting it. The Python code blocks can be executed individually or all at once. Notebooks can be shared with other people so they can see your thought process and replicate your analysis with their data. Notebooks can also be converted to HTML and shared that way (as static web pages).
This workflow is quite different from the typical desktop GIS workflow, where the user typically uses a mouse to select options from a menu. The data is read in, cleaned up, and analyzed, and the resulting data sets are produced, all with the click of a mouse. After the GIS work is done, the user is often tasked with going back and explaining what they did, which can be an arduous task, especially when it takes place days or weeks after the actual work was done. It also often involves opening a GIS summary table in a spreadsheet to format it as a table and/or produce a chart that can be inserted into a document, and exporting GIS output (maps) to image files that can be inserted into a document. This workflow is OK when you have to do something once. But what if the client makes some changes and sends you new input data? Maybe the project footprint changes. With the traditional approach, you have to go through the entire process all over again, often making many intermediate files along the way. This is where Jupyter notebooks shine. With a Jupyter notebook, you can simply change the initial input and run your entire notebook again.
Now you might be wondering how this is different from building a model using desktop GIS. There are several important differences. First, the Jupyter notebook allows you to include rich-text documentation to explain your thought process in great detail. Second, Jupyter notebooks allow you to run each code block individually, which gives you the opportunity to modify the inputs if needed and/or view the outputs at each step and make decisions about how to proceed next. This provides more flexibility than a model, as it allows you to add or remove steps as needed. Of course, you can also run the entire notebook at once if you are sure that you won’t need any changes. Third, the ability to include custom Python code provides a lot more flexibility than simply chaining together geoprocessing tools in a rigid model framework. For instance, if you want to use the inputs to generate an email and send it to your project manager, that is relatively easy to do. Finally, at times the intermediate steps are important as well, and with a Jupyter notebook you can produce a map, a chart, or a table as visual output with each step of the process.
Why use GeoPandas?
First, let's talk a little about what GeoPandas is not the best option for. GeoPandas is not a replacement for desktop GIS software such as ArcGIS or QGIS. When it comes to creating publication quality maps, or creating and editing data, desktop GIS software is hard to beat, especially for beginner and casual users. But when it comes time for analysis, I find myself increasingly switching to GeoPandas in a Jupyter notebook for a number of reasons.
- Documentation – When doing scientific work that will end up in a peer-reviewed publication or any work with legal ramifications that might need to be defended in court I will ALWAYS use GeoPandas. Work that is published in scientific journals should be documented well enough that other scientists can replicate exactly what you did. Jupyter notebooks allow you to show exactly what you did because the code is included. Further explanation can be included in a rich-text cell that can include equations, pictures, tables, hyperlinks, and more. Even if you are somewhat lax about documenting your work as you go, simply having all the code blocks with every parameter that you set right there in the code for everyone to see will be incredibly useful if and when you have to go back and explain what you did.
- Repetitive jobs – Any analytical process that is expected or required to be repeated can be implemented in a Jupyter notebook. As an example, when I worked for an environmental consulting company we did literally thousands of biological assessments for oil and gas development projects. Every well that was dug needed to be evaluated for distance to streams, wetlands, raptor nests, sensitive plant species, and many other things. Some of this could be automated in a model in desktop GIS, but some of it required some human guidance along the way. For instance, topographical position (ridge, valley, slope, flat, etc.) can be difficult to automate, especially for polygon data. But in a notebook, a map could be produced showing the project against a topographical map and the user could manually enter a topographic position description before proceeding. Another advantage for repetitive jobs is that using a Jupyter notebook makes it more difficult to accidentally skip a step.
- Flexibility – Although you do not need to be a great Python programmer to use GeoPandas, the ability to use Python to write your own custom functions and methods lets you handle more complex analysis needs than desktop GIS will easily allow.
- Power – Once your data is in a GeoDataFrame, the entire Python data science stack is available to you. You can use statsmodels and scikit-learn for statistical analysis and machine learning applications. You can use Matplotlib, Seaborn, Plotly, Bokeh, Folium, and other visualization packages to create static and interactive charts and maps.
- Cost – It's open source, so there are no software licensing fees. You really have nothing to lose in trying it out and seeing if it works for you, other than your time (which I understand is valuable).
The following examples are intended to show some of the basic GIS functions that GeoPandas can do and what the code looks like. I don’t expect you to necessarily understand the code examples, rather just see that they are relatively simple commands that replicate some of the options that you would expect to see in a menu or dialog box in a desktop GIS.
Almost every GeoPandas project, like any GIS project, will start with loading some geospatial data. In GeoPandas, this is done with a simple line of Python code rather than mouse clicks from a menu. In the following code block, the first line imports the GeoPandas package. The second line imports a shapefile to a GeoDataFrame named raptor. The third line renames some of the columns, and the last line instructs GeoPandas to display the GeoDataFrame.
The following shows a formatted text cell followed by a code block that loads two shapefiles into GeoDataFrames called raptor and county.
Once we have read the data we can perform a spatial join to add the name of the county that the raptor nest is in with a single line of code. Note that although this creates a new GeoDataFrame in memory, it does not create a new file on disk, although if desired it is a simple matter to output the new GeoDataFrame as a shapefile, geopackage layer, or some other permanent disk storage.
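A spatial join of this kind could look roughly like the sketch below. The two small GeoDataFrames are built by hand with made-up names and coordinates so the example runs on its own, and the predicate argument assumes GeoPandas 0.10 or later (older releases used op instead).

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Hypothetical nest points
raptor = gpd.GeoDataFrame(
    {"species": ["Golden Eagle", "Osprey", "Golden Eagle"]},
    geometry=[Point(1, 1), Point(3, 1), Point(3.5, 0.5)],
    crs="EPSG:4326",
)

# Hypothetical county polygons (two adjacent rectangles)
county = gpd.GeoDataFrame(
    {"county": ["Adams", "Baker"]},
    geometry=[box(0, 0, 2, 2), box(2, 0, 4, 2)],
    crs="EPSG:4326",
)

# Attach the name of the containing county to each nest
raptor_county = gpd.sjoin(
    raptor, county[["county", "geometry"]], how="left", predicate="within"
)
```

The result lives only in memory. Something like raptor_county.to_file("raptor_county.gpkg", driver="GPKG") would write it out as a GeoPackage if a permanent copy is wanted (the file name is just an example).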
We can also easily create a summary table showing the number of nests of each species of raptor in each county with a single line of code. Note that pivot_table is actually a Pandas function, not a GeoPandas function, so we first have to import Pandas. Pandas will work with a GeoDataFrame; it just ignores the geometry column.
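As a sketch of that summary step, the example below counts nests per county and species. The data and column names are invented for illustration, and counting the nest_id column is just one of several ways to get a tally.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point

# Hypothetical result of a spatial join: each nest already carries a county name
raptor_county = gpd.GeoDataFrame(
    {
        "nest_id": [1, 2, 3, 4],
        "species": ["Golden Eagle", "Osprey", "Golden Eagle", "Golden Eagle"],
        "county": ["Adams", "Baker", "Baker", "Baker"],
    },
    geometry=[Point(1, 1), Point(3, 1), Point(3.5, 0.5), Point(2.5, 1.5)],
    crs="EPSG:4326",
)

# Count nests by county (rows) and species (columns); missing combinations become 0
summary = pd.pivot_table(
    raptor_county,
    index="county",
    columns="species",
    values="nest_id",
    aggfunc="count",
    fill_value=0,
)
```

The output is an ordinary Pandas dataframe with one row per county and one column per species.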
The following line of code shows how easy it is to convert from one coordinate system to another. All you need is the EPSG code which is a standard value used in most open source applications.
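A coordinate system conversion might look like the following sketch. The point location is made up, and EPSG:32613 (WGS 84 / UTM zone 13N) is chosen only as an example target.

```python
import geopandas as gpd
from shapely.geometry import Point

# A hypothetical nest location in longitude/latitude (EPSG:4326)
nests = gpd.GeoDataFrame(
    {"nest_id": [1]},
    geometry=[Point(-105.0, 40.0)],
    crs="EPSG:4326",
)

# Reproject to UTM zone 13N; coordinates are now in meters
nests_utm = nests.to_crs(epsg=32613)
```

The original GeoDataFrame is left unchanged; to_crs returns a reprojected copy.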
The following line of code creates a new geometry column called buffer which contains a buffer around the original geometry. The buffer distance in this case is taken from the row_width column but it could also be a single value for all the geometries.
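The buffering step could be sketched as follows. The line features, the type values, and the widths are invented for the example; only the row_width and buffer column names come from the description above. A projected coordinate system is used so that the distances are in meters.

```python
import geopandas as gpd
from shapely.geometry import LineString

# Hypothetical linear features with a per-row buffer distance
lines = gpd.GeoDataFrame(
    {"type": ["pipeline", "road"], "row_width": [10.0, 25.0]},
    geometry=[
        LineString([(0, 0), (100, 0)]),
        LineString([(0, 50), (100, 50)]),
    ],
    crs="EPSG:32613",
)

# New geometry column: each line buffered by the value in its own row_width
lines["buffer"] = lines.geometry.buffer(lines["row_width"])
```

Passing a single number, for example lines.geometry.buffer(15), would apply the same distance to every feature instead.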
Note that the GeoDataFrame now has two geometry columns. One contains the original line data and the other contains the polygon buffers. This is not something that is possible with most desktop GIS data, but it is no problem for GeoPandas and is actually quite powerful. It does create a slight complication, however, in that you have to select which geometry column is the active one. This is done using the set_geometry method, as seen in the first line of code below. The second line of code selects a spatial subset (using the cx indexer) and plots the buffers with a color scheme based on the type column.
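Here is a minimal sketch of switching the active geometry and taking a spatial subset. The data is made up, and the bounding coordinates passed to cx are arbitrary values chosen so that only one of the two buffers is selected.

```python
import geopandas as gpd
from shapely.geometry import LineString

# Hypothetical lines with a second geometry column holding their buffers
lines = gpd.GeoDataFrame(
    {"type": ["pipeline", "road"], "row_width": [10.0, 25.0]},
    geometry=[
        LineString([(0, 0), (100, 0)]),
        LineString([(0, 50), (100, 50)]),
    ],
    crs="EPSG:32613",
)
lines["buffer"] = lines.geometry.buffer(lines["row_width"])

# Make the buffer polygons the active geometry
buffers = lines.set_geometry("buffer")

# Keep features whose active geometry intersects x 0..50, y 40..80
subset = buffers.cx[0:50, 40:80]
```

In a notebook, subset.plot(column="type", legend=True) would then draw the selected buffers colored by the type column (this requires Matplotlib to be installed).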
The following is not specific to GeoPandas but shows some possibilities for including equations in your Jupyter notebook text cells.
The following is also not specific to GeoPandas but shows how easy it is to create complex visualizations from tabular data in a GeoDataFrame with a single line of code that calls a Seaborn function. This is not something that would be easy to do with desktop GIS software, and the ability to create these kinds of visualizations from within a Jupyter notebook in combination with spatial analysis makes for a very powerful analytical framework.
GeoPandas, Jupyter notebooks, and the rest of the Python data science stack are not replacements for the cartographic capabilities of desktop GIS software. Nevertheless, it is possible to create very nice professional visualizations using GeoPandas tools in combination with Matplotlib and other Python visualization libraries. The following image shows one example.
Finally, an example showing raster data displayed in a Jupyter notebook. Again, this is not GeoPandas specific, as GeoPandas only deals with vector data, but it can easily be combined with raster data. In this case I used a raster-specific Python package called rasterio to create custom precipitation contours from raster data and clip them to the boundary of Mexico, which did come from a GeoDataFrame. This demonstrates again how GeoPandas is the core of the Python geospatial data science stack, and almost all Python geospatial analysis libraries will use it.
Where to go from here?
If this post has piqued your interest in GeoPandas (and the rest of the Python data science stack) and you would like to learn more, what is the next step? GeoPandas has extensive documentation on the web, and there are other third-party tutorials on the web, YouTube, etc.
I personally struggled to sort through all of the available resources as I was learning and found myself wishing for a single source to learn the basics. This is often the case with open source software. There are many short tutorials available to get started and there is plenty of detailed documentation but not much in between to really help get started. As a result I created a series of three courses that are available on the Udemy platform that have been rated very highly by those who have taken them. Udemy has frequent sales during which you can purchase these courses for as little as $9.99 each. Each of these courses will also be available through October 24 for $9.99 each using the coupon code OCT2021.
The first course, “Geospatial Data Science with Python: GeoPandas”, has almost 10 hours of video content over 51 lectures and covers the basics, from installing Python to loading, manipulating, and analyzing your data and creating basic visualizations. It is a prerequisite for the other two courses.
The second course, Geospatial Data Science: Statistics and Machine Learning, is focused on statistical analysis and machine learning using the statsmodels and scikit-learn libraries, which both work directly off of GeoDataFrames. GeoPandas is also used extensively for reading and preparing data for analysis. This course contains more than 12 hours of video content in 56 lectures.
The third course, Geospatial Data Science with Python: Data Visualization, is focused on visualizing your data, both as traditional charts and figures and as maps. It has almost 8 hours of video content in 42 lectures covering Matplotlib, Seaborn, Plotly, Rasterio, Folium, and other Python visualization libraries (including GeoPandas' own visualization tools), all of which also work directly off of GeoDataFrames. GeoPandas is also used extensively for reading and manipulating data prior to visualization.
Although you do not need to be a great Python programmer to use GeoPandas, if you are a complete newcomer to Python or find yourself struggling with the Python in the above courses, you might also be interested in my course Survey of python for GIS applications, which provides a thorough introduction to Python, specifically focused on geospatial applications, and introduces the main Python libraries for working with geospatial data. This course contains over 13 hours of video content over 85 lectures and is also available through October 24 for $9.99 using the coupon code OCT2021.
I also have courses available on many open source geospatial topics including Web GIS, Mobile GIS applications, QGIS, PostGIS, GeoServer, etc. If you are interested you can find out more about these topics on the courses page of this blog.