In a recent post, I attempted to answer the question What can a spatial database do for you? That post was a broad overview of the advantages of storing your geospatial data in a spatial database. This post follows up with more specific information on how to get started and step by step instructions on using one type, SpatiaLite, with the popular open source GIS software QGIS.
I chose to use SpatiaLite for this guide because it is a file-based database, rather than a full-blown client server database and thus installation is much easier. In fact with QGIS, there is no installation required at all as QGIS can create new SpatiaLite databases natively. If you really want to take advantage of the power of spatial databases for storing and managing your GIS data you will probably want to install PostGIS at some point. Or more likely you will want to ask your IT department to install PostGIS on your server and provide you with the connection information. The installation is not complicated but most IT departments will prefer to keep end users away from the server, for good reason.
Fortunately, both SpatiaLite and PostGIS implement the Open Geospatial Consortium’s Simple Features for Spatial specification and so their spatial functions are almost identical and most SQL queries will run on both with very little modification. QGIS also works with both in a similar manner so you can get started with SpatiaLite today and move to PostGIS later with almost no learning curve.
I will assume that you have a basic working knowledge of QGIS. The purpose of this post is to teach spatial databases and not QGIS. I cant’t do it all in one post, but there are plenty of free resources available for you to learn QGIS if you need to and I offer a 10 hour course called QGIS 3.0 for Professionals which is available on Udemy for $20
If you do not already have QGIS installed you can go to QGIS.org and download the latest version. At the time of this writing the most recent version is 2.18 and that is the version I will be using in this article, but QGIS 3.0 is about to be released in about 6 weeks. Fortunately from the pre-release version I’ve seen there will not be substantial changes to the DB Manager and so I expect that you will be able to follow along with either version.
You can download a zip file containing the data that I will be using for this course here. After downloading the file, unzip it anywhere that is convenient for you on your computer. I will refer to that directory as the SpatiaLite home directory.
Creating a SpatiaLite database
- Navigate to the spatialite_data directory in your SpatiaLite home directory and double click on the overview.qgs file to open the Overview QGIS project. You should see something similar to this:
- Right click on the first layer in the layers panel, Linear Projects and then choose Save As… You should see the Save vector layer as… dialog box. Make sure that you set the format to SpatiaLite, then navigate to the spatialite_data directory in your SpatiaLite home directory using the Browse button after the File name text box and create a file named quikstart.sqlite.
- Enter linear_projects in the Layer name text box. The upper part of the dialog box should something like this:
- We’ll leave the CRS in WGS84 and assume everything is fine with the lower part of the Save vector layer as… and click the OK button at the bottom of the dialog box. You should see a new layer called quikstart linear_projects MultiLineString appear in the layer panel. The default layer name is a combination of the database name, the layer name that we provided, and the geometry type.
- Repeat steps 2-4 for the remaining 4 layers in the layer pane with the exception that you just have to point the browser to the quikstart.sqlite file in step 2. If there is no file with the name that you provide, QGIS will create one automatically. Otherwise QGIS will simply create a new layer in the existing file. Use the layer names buowl_habitat, baea_nests, raptor_nests, and gbh_rookeries.
Note: Both SpatiaLite and QGIS follow the OGC Simple Features for SQL standard. The geometry types are Point and MultiPoint, LineString and MultiLineString, and Polygon and MultiPolygon. The multi prefix indicates that the geometry for a single feature is an array of that geometry type. So there are multiple line geometries with a single set of attribute values, etc. In general, a point shapefile will be converted to a point geometry type, while line and polygon shapefiles will be converted to MultiLineString and MultiPolygon geometry types.
That’s all there is to it! You’ve just created a SpatiaLite database with 5 layers in it. We’ll be using this database for the remainder of this course. It was motivated by work I did several years ago as the GIS specialist for an environmental consulting company in Colorado. The data itself is fake, but the motivation for this data is real. It represents a real-world project with real-world data that we can use to answer real-world questions. And what is GIS after all but a tool for answering questions about the world?
We’ll explore this data in depth but the basic premise is that we have a client in the oil and gas industry who is building and maintaining pipelines (and supporting roads, electric lines, etc) in a natural gas field. They need to make sure that they are not violating any environmental regulations so they hire your company to monitor wildlife species that might present regulatory issues for them to help them plan and implement their jobs with a minimal impact on the environment.
Loading data from SpatiaLite
Before showing you how to load data from SpatiaLite in QGIS we will need to remove the layers that are already there. You can do this by right-clicking on each layer in the Layers Panel individually and choosing Remove from the context menu. You can also select all they layers at once and remove them in one operation.
There are four ways to load data from a SpatiaLite database into QGIS.
- In the Browser Panel navigate to the location where you saved the quikstart.sqlite database and double click on it. You should see the Select vector layers to add… dialog box. You can select individual layers or select all of the layers. Press the OK button to load the selected layers into the map.
- In the Browser Panel find the SpatiaLite icon and right click on it, then choose New Connection… and navigate to the quikstart.sqlite file we created in the first section and open it. You should see a small > appear to the left of the SpatiaLite entry in the Browser Panel. which indicates that that entry can be expanded. Click on the > to expand it and then also expand the quikstart.sqlite entry to see the individual layers in the database. You can double click on those layers one-by-one to add them to the map, or you can highlight the layers you want and drag them into the Layer Panel.
- Open the Add SpatiaLite Layer(s) dialog box either by clicking the SpatiaLite icon in the Manage Layers toolbar (usually found along the left edge of the program, or navigating to the Add SpatiaLite Layer option in the Layer menu (Layer -> Add Layer -> Add SpatiaLite Layer….).Click the New button and navigate to the quikstart.sqlite file that we created in the previous section and open it. Click Connect to create a connection to the database and then select the layers that you want to add and click the Add button. You can also set a filter from this dialog box to limit the features to include in the layer.
- In the Database menu select the DB Manager item. Right click on SpatiaLite in the connection tree and select New Connection. Navigate to the quikstart.sqlite file and open it. You can double click the layers you want to add to the project or right click on them and select Add to canvas. Notice that the panel to the right of the Tree panel contains three tabs. The first shows info about the layer including geometry types, fieldnames, and any database triggers for the layer. Triggers are SQL functions that are executed in response to database events such as adding or updating data. The Table tab shows the attribute table and the Preview tab shows the geometry of the features.
Introduction to SQL
The DB Manager panel also has a SQL window ()that allows you to write, execute, and view the results of a SQL query on the SpatiaLite database. It is this ability that gives SpatiaLite its flexibility and power and separates it from a mere data storage format. We’ll spend the remaining part of this getting started guide writing SQL queries against this data.
The simplest SQL command follows the pattern SELECT * FROM tablename. The * character is a placeholder for all the fields in the table.
Type SELECT * FROM baea_nests into the query entry box and the click the Execute (F5) button and you should see a results table in the bottom window like this with all of the fields displayed.
Everything else we do with SELECT queries will build from this.
Now type SELECT nest_id, lastsurvey, recentstat, recentspec FROM raptor_nests into the query box and execute it. Notice that this restricts the fields in the result table to just those that are shown in the field list.
We can also further restrict the results table by using a WHERE clause followed by some conditional statement that determines which features are returned. Lets add a WHERE clause to restrict the list to only active nests by adding the following to the previous SELECT statement: WHERE recentstat=’ACTIVE NEST’ . When executed the results table should look like this:
We can further restrict the result table by joining conditional statements in the WHERE clause with the logical operators AND, OR, and XOR. For instance lets add the text AND recentspec=’Red-tail Hawk’ to our current statement so that the full SQL query is:
SELECT nest_id, lastsurvey, recentstat, recentspec FROM raptor_nests WHERE recentstat=’ACTIVE NEST’ AND recentspec=’Red-tail Hawk’
We can also direct the database to sort the results table with an ORDER BY clause. For instance to get the table sorted by the lastsurvey date add ORDER BY lastsurvey to the current statement. If you wanted to sort them by most recent survey date (descending order) add the keyword DESC to the end of the order by statement. You can also add multiple fields to sort by simply by adding them to the ORDER BY statement separated by commas. For instance:
SELECT nest_id, lastsurvey, recentstat, recentspec FROM raptor_nests WHERE recentstat=’ACTIVE NEST’ AND recentspec=’Red-tail Hawk’ ORDER BY lastsurvey DESC, nest_id
We can get information summarized by a field using the GROUP BY clause. For instance the SQL statement
SELECT recentstat, count(recentstat) FROM raptor_nests GROUP BY recentstat
will return a result table showing a count of the number of nests with each recentstat value. And the SQL statement
SELECT recentspec, recentstat, count(recentstat) FROM raptor_nests GROUP BY recentspec, recentstat
will return the same information but subdivided out by species as well. If the column you are summarizing is numerical you can use other aggregate functions such as sum(), min(), max(), etc.
Just for fun, see if you can recreate the following results table.
Note also that the QGIS DBManager interface allows you to save a query with a name that you can retrieve at a later date. As you might guess, SQL queries can get quite long and complex and this can save you a lot of typing time with frequently used queries.
Adding Spatial to SQL
Hopefully, you are beginning to see the power of SQL for extracting the information you need from your tabular data, and this is very definitely just the tip of the iceberg. There are entire college level courses on SQL and it can get quite involved especially when multiple joins and grouping clauses are included, but if you are good with SQL you can almost always get exactly the information that you need.
Extending a normal database for geospatial applications involves 2 steps. First of all, you need a way to store the geometry part of a geospatial feature, and second, you need some functions that can process geometries. Most spatial databases store the geometry in binary form in a geometry column in the database and SpatiaLite is no exception. The Open Geospatial Consortiums Simple Features for SQL specification provides a set of standard functions for dealing with geospatial data and SpatiaLite, as well as PostGIS adhere pretty closely to this standard.
As an example, lets use the function AsGeoJSON() to return the geometry in GeoJSON format which is a text based format that is human readable so you can see the actual coordinates that form the polygon.
SELECT species, activity, AsGeoJSON(GEOMETRY) FROM gbh_rookeries
The AsGeoJSON function can take a second optional parameter that control the precision of the coordinate values. For example you could use AsGeoJSON(GEOMETRY, 5) to limit the coordinates to 5 decimal places.
If you want to see the coordinates in some other spatial reference system you can use the ST_Transform() function to convert it to a different spatial reference system. For example in the SQL statement
SELECT species, activity, AsGeoJSON(ST_Transform(GEOMETRY, 26913), 1) FROM gbh_rookeries
The Geometry is transformed to UTM Zone 13N NAD 83, which has an EPSG code of 26913, prior to being converted to GeoJSON so we can see the coordinates in the CRS. You can see the full list of EPSG codes at spatialreference.org/ref/epsg/. The ST_Transform function (in bold) returns a geometry object which can then be passed as a parameter to any other function that requires a geometry object. This ability to nest functions is incredibly powerful. For instance, many functions that take distance values as inputs will expect those values to be in the distance units of the spatial reference system of the geometry. But if you can transform the geometry into a different spatial reference system you can use a different distance unit.
As an example, the ST_Buffer function takes a geometry and a distance and returns a buffered geometry of the specified distance. If we wanted to create 1/2 mile buffers for the eagle nest data we would run into a problem because the eagle nest data is stored as points in geographic coordinates (latitude/longitude) and while you could convert degrees of latitude to miles easily, the length of a degree of longitude is only equal to a degree of latitude at the equator. The further north you go the smaller a degree of longitude is until it approaches 0 at the north pole.
Fortunately we can use the following SQL statement
SELECT nest_id, status, ST_Buffer(ST_Transform(GEOMETRY, 26913), 804) FROM baea_nests
to create buffers in UTM Zone 13N NAD 83 that are 804m (1/2 mile) in diameter by transforming the geometry first before creating the buffer. This query will result in a table that looks like this.
“Big deal”, you may be thinking. “The geometry is binary and doesn’t even appear in the table. What good does this do for me?”. Well, I’m going to tell you. QGIS can display the results from any query that contains a geometry field as a layer in the map. Simply check the Load as new layer box and tell QGIS which field contains the geometry by selecting the field in the Geometry column box, give the layer a name, and click the Load Now! button.
Once the layer is loaded into QGIS it can be used like any other layer including as inputs to geoprocessing functions. It can even be saved as a real file in virtually any file format or spatial reference you want by right clicking the layer in the Layer Panel and choosing Save as… from the context menu. This “virtual layer” has several important advantages. First it takes up no additional space on disk, it is solely based on a SQL query of the point data. Second, because it is based on a SQL query the buffer layer will be updated automatically when the underlying point layer is changed. If features are added or deleted or data is changed, it only has to be done in the point layer and those changes will be reflected in the buffer layer imply by clicking the refresh button.
Take some time and add virtual layers for raptor nest buffers (1/4 mile), burrowing owl habitat buffers (300 meters) and for linear project buffers (dependent on the ROW width. HINT: you can replace the numeric buffer distance with the field name of the field that contains the ROW width).
SpatiaLite also includes useful functions for measuring lengths and areas and these functions can be used in a SQL query to create a virtual field. For instance the following query includes the length of each linear project (in km) without taking up any space on disk and the query will be correct even if the underlying geometry is edited.
SELECT project, type, row_width, ST_Length(ST_Transform(GEOMETRY, 26913)) / 1000 FROM linear_projects
You can also use these virtual fields in summary functions in statements with a GROUP BY clause. For example the following query sums up the total length of linear projects in km for each type.
SELECT type, sum(ST_Length(ST_Transform(GEOMETRY, 26913)) / 1000) FROM linear_projects GROUP BY type
You can see the whole group of spatial functions that you can use in SpatiaLite here. I would suggest taking a look through them and then seeing if you can come up with a query to summarize the total number of hectares of burrowing owl habitat summarized by the recentstat field.
There is a lot you can do with SQL in single table queries, we’ve seen virtual feature classes, calculating measurements, and summarizing values. But the real power of SQL is in its ability to conduct queries with two or more tables. Multiple table joins requires that each field be prefixed by the table name (or an alias for simplicity’s sake). There is also a join clause that specifies the conditions for which features from two tables are joined.
In non-spatial tables, the join clause is typically based on fields with common values. For example invoices can be joined to a customer because the invoice records each contain the customers ID. With spatial tables, joins are often based on spatial relationships such as “The nest is within 1/2 mile of the construction project” or “The pipeline intersects the wetland”.
Because we have two geometries we can also do things like calculate distance between features, the amount of overlap between features, etc which provides a huge amount of flexibility in performing spatial analysis just with a SQL query.
The basic structure of a spatial join follows the query below
SELECT r.nest_id, r.recentspec, r.recentstat, r.lastsurvey, p.project, p.type, ST_Distance(r.GEOMETRY, p.GEOMETRY) as dist FROM raptor_nests r JOIN linear_projects p ON PtDistWithin(r.geometry, p.GEOMETRY, 0.005)
The FROM clause is followed by the keyword JOIN followed by the table being joined. This is followed by the keyword ON and the join clause. The join clause can be any expression that evaluates to true or false and only combinations of feature that evaluate to true will be included in the results set. In this case we are using the function PtDistWithin(r.geometry, p.geometry, 0.005) this function returns true if the distance between the raptor nest geometry and the linear projects geometry is less than 0.005 degrees.
Note that the raptor_nests table in the FROM clause and the linear_projects table in the JOIN clause are followed by r and p respectively. These are alias’s and are used to prefix the fieldnames to explicitly state which table the field is from. You could use the entire tablename as the prefix but this would quickly create very long SQL statements.
There are several types of JOIN’s. The default is an INNER JOIN in which the results set will only include cases where the join clause is true. LEFT JOIN will return at least one result for every record in the left side of the Join, in this case the raptor_nest table, but if there is no match in the linear_projects table those fields in the result will have NULL values. RIGHT JOIN is similar but it will include one record for every feature in linear_projects and if there is no corresponding raptor_nests feature those fields will be NULL in the result. FULL JOIN will include at least one result from every feature in both tables but if there is no match the corresponding fields will be NULL.
Real life examples
As you might guess, there are an almost infinite number of queries that could be created in order to ask specific questions about this data. It’s not possible to go over every possibility so what I am going to do, to stimulate your thinking about the kinds of questions that you can answer with spatial SQL queries, is include a few realistic types of questions that your client or project manager might ask the GIS department to answer and show the SQL query that can be used to provide that information. Try to think about how you might formulate a SQL query to answer the question before looking at my solution. With most GIS processes there are often multiple ways to get to the same answer and SQL is no different. There may be a several ways to write a query to get the same answer, so don’t be too worried if you come up with a different query than I did.
- Your project manager wants a list of active Swainsons Hawk nests that impact the clients projects so they can send field staff out to verify their current status. He wants them sorted by lastsurvey date so that they can prioritize the nests which have gone the longest without a survey.SELECT r.nest_id, r.recentspec, r.recentstat, r.lastsurvey, p.project, p.type, ST_Distance(r.GEOMETRY, p.GEOMETRY) FROM raptor_nests r JOIN linear_projects p ON PtDistWithin(r.GEOMETRY, p.GEOMETRY, 0.005) WHERE r.recentstat = ‘ACTIVE NEST’ AND r.recentspec = ‘Swainsons Hawk’ ORDER BY lastsurvey, nest_id, project
- At the beginning of the nesting season your client wants a list of all projects with a count of how many nests might potentially affect it. Furthermore he wants the count parsed out by the previous years status under the assumption that nest that were active last year are more likely to be active this year.SELECT p.project, p.type, r.recentstat, count(r.recentstat) FROM raptor_nests r JOIN linear_projects p ON PtDistWithin(r.GEOMETRY, p.GEOMETRY, 0.005) GROUP BY p.project, r.recentstat
- Field staff report that a new active eagle nest is located at -104.83504, 40.19849 and the project manager wants a list of all the projects that will be affected, ordered by those closest to the nest first.SELECT project, type, ST_Distance(GEOMETRY, ST_PointFromText(‘POINT(-104.83584 40.19849)’)) as dist FROM linear_projects WHERE dist<0.01 ORDER BY dist
Note that once we give the value returned by the distance function an alias (dist) we can use that value in the WHERE and ORDER BY clauses.
- Your client wants to know what percentage of each project is affected by burrowing owl habitat and he wants it sorted by project type first. This one is a bit tricky, so you get a gold star if you figure it out.SELECT p.type, p.project, Round(sum(ST_Length(ST_Intersection(b.GEOMETRY, p.GEOMETRY)))/ST_Length(p.GEOMETRY)*100, 1) as pcnt FROM buowl_habitat b JOIN linear_projects p ON ST_Intersects(b.GEOMETRY, p.GEOMETRY) GROUP BY project ORDER BY type, project
If you have suggestions for other queries on this dataset, please feel free to add them to the comments of this blog and I will try to come up with a solution.
Where to go next
Hopefully by now, you are seeing some of the advantages of being able to use SQL queries with your GIS data. Take a few minutes to go back to the examples above and think about how many steps would be required to get those answers in a traditional menu-driven desktop GIS system and how many intermediate files would be created and need to be managed. Also consider how much easier it would be to support remote users if you could just email them a SQL query that they could copy and paste into the SQL editor on their database, rather than attempting to talk them step-by-step through the entire process over the phone.
If you want to learn more about SQL in general there are many resources available. Many of them are available free on the internet. There are also books you can purchase and courses on Udemy.com and other platforms. Most will not mention any of the spatial aspects of SQL but will provide a good grounding on using SQL for tabular queries and you have to learn to walk before you can fly.
I have a course called Spatial databases with PostGIS and QGIS that is available NOW on the Udemy platform. This course will be approximately 10 hours long and cover the basics of installing PostGIS, SQL for non-spatial and spatial applications, designing databases and entering data through SQL, integration with desktop GIS software, accessing your data from multiple clients, and basic database maintenance and administration tasks. More information available here.