The term “Data Science” has become one of the hottest job descriptions of the past few years, but what exactly does it mean? Can GIS professionals call themselves “Data Scientists”? There has been some reporting that jobs calling for “Data Scientists,” and especially “Geospatial Data Scientists,” pay as much as 25% more than jobs calling for “GIS analysts.” Is there really a difference, or is it just the latest buzzword?
I would argue that while there IS a lot of overlap between GIS and Geospatial Data Science, they are not the same thing, and there are important differences. Some, including Nate Silver, have argued that “Data Science” is really just another name for statistics, and again, while there is much overlap between statistics and data science, I would argue that they are not the same thing. So what exactly is Data Science then? I would suggest that, in the simplest terms possible, the role of a data scientist is to extract usable information from raw data and communicate that information to stakeholders. To do this, they need to be well versed in a variety of tools, including:
- machine learning and AI
- database technology
- IT infrastructure
- reporting and visualization methods.
All of these topics are quite dense in and of themselves, and it is probably impossible for any single person to be an expert in all of them. Most data scientists will specialize in one or two of these areas, but I would argue that to call yourself a data scientist you need to be at least well grounded in the basic concepts of each. For example, you may have a PhD in statistics, understand all the math behind the methods, and be on the cutting edge of developing new statistical techniques, but if you don’t have any background in databases or IT infrastructure to deal with the large amounts of data common in the modern world, you are not a data scientist. Or you may be well versed in setting up clusters for parallel processing and able to do Hadoop MapReduce operations in your sleep, but if you don’t know enough about statistics to engage in a discussion about logistic regression with a statistician, then you are not a data scientist.
Tools of the Data Scientist
Statistics
Yes, you do need a solid grounding in statistics to be a data scientist. The goal, after all, is to reduce raw data to usable knowledge. You do not need to be a statistician, but you do need to be knowledgeable enough about statistical methods to have an idea of which analytical techniques are appropriate and how to fit a statistical model to your data, test its assumptions, and interpret the results. If the questions you are asking are important enough, you can always consult a statistician to make sure you are getting the technical details correct, but you have to be able to discuss those details competently, and you have to know how to implement their recommendations on whatever platform you are using. A statistician may be able to guide you in the right direction in terms of statistical analysis, but they probably will not be able to advise you on how to implement it on a dataset with millions of records.
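As a sketch of what that baseline looks like in practice, the snippet below fits an ordinary least-squares regression with SciPy and pulls out the two numbers a statistician would ask about first: the slope estimate and its p-value. The data values are made up for illustration.

```python
# A minimal sketch of fitting a simple statistical model: an ordinary
# least-squares regression with scipy.stats.linregress on made-up data.
from scipy import stats

x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]

result = stats.linregress(x, y)

# The slope estimate comes with a p-value for the null hypothesis of no
# relationship, exactly the kind of detail you need to be able to
# discuss with a statistician.
slope, p_value = result.slope, result.pvalue
```

The same fit could be done with Statsmodels (discussed later), which reports a fuller summary table; the point is being able to read and explain that output, not memorize the math behind it.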
Machine Learning / AI
Many of the concepts behind statistics and machine learning are similar, but their purposes differ. Statistics attempts to make inferences about a population based on a sample from that population. Machine learning attempts to make predictions about unknown values based on a training set of known values. Some statistical methods, such as linear regression and logistic regression, can also be used as machine learning algorithms; in that setting, however, it is the prediction that is of primary interest rather than the structure of the underlying data and what it implies about the population of interest.
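To make the distinction concrete, here is a minimal sketch, with synthetic data and made-up variable names, of logistic regression used the machine-learning way: the model is fit on a training set and judged purely by the accuracy of its predictions on held-out data, not by what its coefficients say about a population.

```python
# Logistic regression used for prediction: fit on known values, then
# predict unknown ones. The data is synthetic, generated from a known rule.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))            # two made-up predictor variables
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # the "true" labels

# Split the data; the test set plays the role of the unknown values
# we want to predict.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
accuracy = model.score(X_test, y_test)   # fraction of correct predictions
```

A statistician fitting the same model would instead focus on the coefficient estimates, their standard errors, and whether the model's assumptions hold for the sample.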
In scientific research, careful thought can be put into which statistical model is appropriate BEFORE data is collected, and the data can then be collected in such a way as to ensure that the assumptions of the statistical method are met. Confounding factors can be controlled for, and alternative explanatory models (each representing a hypothesis) can be formally tested with a high degree of precision. Often there is very limited data available, as the data needs to be collected specifically for the intended purpose.
In today’s world, however, massive amounts of data are being collected on a daily basis and are often publicly available. This includes GPS locations from cell phones, browsing histories, credit card spending, social media posts, satellite imagery, etc. This data is often useful for purposes other than those for which it was originally collected, and finding patterns in it is the realm of machine learning. What this data lacks in specificity it often makes up for in quantity, allowing the detection of relatively small but real effects.
Artificial intelligence, or AI, can be thought of conceptually as an extension of machine learning that allows the detection of patterns that don’t fit a specific model. While machine learning algorithms are typically similar to statistical models, AI generally involves neural networks that can fit a wide variety of patterns. The downsides are that neural networks typically require the estimation of many more parameters and are very computationally intensive. This also makes them more of a “black box” approach, requiring, and yielding, little understanding of the underlying patterns.
Geospatial data scientists can use machine learning and AI to answer questions like “How many cars will pass this location between 9 AM and 5 PM?”, “Which cities are likely to buy the most of my products?”, and “Where are all the stock ponds in this county?”.
Database Technology
Due to the large amounts of information being collected today, traditional file-based storage is often insufficient. Enterprise-level databases such as Oracle, SQL Server, and PostgreSQL provide many advantages over file-based storage, including multi-user editing, performance, virtually no limits on size, the ability to customize to fit your needs, security, robustness, and accessibility from other platforms.
While it is possible to work with data in an enterprise-level database using the standard point-and-click tools in desktop GIS software, without much more knowledge than flat-file storage requires, really taking advantage of all the benefits requires some knowledge of SQL (the programming language used to interact with most databases), network infrastructure, remote servers, and so on.
A data scientist will need to understand, at a minimum, how to connect to a remote database, access data stored in tables using SQL, and output that data in a form usable by the software that will analyze it. More than likely, they will also need some understanding of how to set up user accounts to control individual access to the data, set up an instance of a database on a server, and communicate with their IT department about things like backups, replication, transaction control, etc.
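The query-then-analyze pattern looks something like the sketch below. In a real workflow you would connect to a remote PostgreSQL server (for example with the psycopg2 or SQLAlchemy packages); here the standard-library sqlite3 module stands in so the example is self-contained, and the table and column names are hypothetical.

```python
import sqlite3

# Connect to a database (an in-memory SQLite database stands in for a
# remote server here) and create a small hypothetical table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wells (id INTEGER, depth_m REAL)")
conn.executemany("INSERT INTO wells VALUES (?, ?)",
                 [(1, 30.5), (2, 42.0), (3, 18.2)])

# Use SQL to pull only the rows you need, in the order you need them,
# then hand the result to whatever software will analyze it.
rows = conn.execute(
    "SELECT id, depth_m FROM wells WHERE depth_m > ? ORDER BY depth_m",
    (20,),
).fetchall()
conn.close()
```

The important part is the division of labor: the database does the filtering and sorting on the server, and only the rows you actually need travel to your analysis environment.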
A geospatial data scientist will also need to understand the geospatial extensions available on most database platforms. These allow the storage and analysis of vector and raster data right in the database without requiring specialized GIS software. Most GIS specialists are amazed at what can be done quickly and efficiently with Spatial SQL once they commit to learning it.
Although there is a learning curve associated with moving to enterprise-level databases, the advantages are substantial, and the technology is robust and stable. SQL has changed very little in 40 years and is not likely to change much for the foreseeable future, so what you learn now, and the procedures you implement, are not subject to the arbitrary, marketing-driven decisions of commercial GIS vendors to suddenly change the underlying platform.
IT Infrastructure
Given the massive amounts of data available for public use, it may be necessary to spread that data across multiple computers linked together as a cluster. Clusters are also used to increase performance for computationally intensive procedures. By linking multiple computers together, you have more disk space available for data storage and more processing power available for computations.
Taking advantage of multiple computers for processing power does not happen automatically, however. Most desktop GIS software takes only limited advantage of multiple processors. Separate geoprocessing tasks may be sent to individual processors so they don’t interfere with each other or with the main program’s user interface, and this is an advantage. But modern CPUs usually have multiple cores, and taking advantage of that processing power for a single operation is not straightforward. Software has to be written in such a way that it can break a process up into self-contained chunks that can be sent to separate processors and then aggregated back into a single result. This is known as parallelization, and it can provide extreme performance advantages in processor-intensive operations even on a single computer with a multi-core CPU. When the work is spread over multiple computers, the potential increase in performance is almost unlimited. Operations that take hours on a single core may take seconds when the code is optimized for multiple cores.
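The split-process-aggregate pattern can be sketched with the standard-library multiprocessing module on a single multi-core machine; the work function below is a made-up stand-in for a real processor-intensive geoprocessing step.

```python
from multiprocessing import Pool

def expensive_step(value):
    # Stand-in for a heavy per-feature computation.
    return value * value

def run_parallel(values, workers=4):
    # The input is split into chunks, each chunk is processed on a
    # separate core, and the results are aggregated back into a single
    # list, in the original order.
    with Pool(processes=workers) as pool:
        return pool.map(expensive_step, values)

if __name__ == "__main__":
    results = run_parallel(range(10))
    print(results)
```

The same pattern, implemented by frameworks rather than by hand, is what lets cluster tools distribute one operation across many machines.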
Writing parallel processing code is an advanced skill and definitely falls within the realm of computer science rather than data science. It is unlikely that data scientists will ever be faced with that task, but it is very likely that data scientists, especially geospatial data scientists, will at some point be tasked with a problem involving processor-intensive operations on large datasets and will need to consider using software optimized for parallel operations to improve performance. Nothing is more frustrating than spending large amounts of money on expensive new computers with multi-core processors, only to find that the massive intersection operation doesn’t run any faster because the software you are using still runs the entire process on a single core while the other cores do nothing. It is often far more cost effective to use software that takes advantage of multiple cores, and even GPUs (graphics processing units), to use all the cores available on a single CPU or in a cluster of networked computers. A data scientist therefore needs to know, at a minimum, which software takes advantage of parallel processing and can be implemented across a cluster. They will also need to be familiar with the cloud-based tools available to both store data and perform compute operations, and with when to implement a cloud-based solution rather than investing in local hardware.
R and Python
R is an open-source language that began as a language for statistical analysis. Data scientists with a strong background in statistics may be very familiar with R, especially if their formal education was within the last 15 years or so, as most college statistics courses now use R. R runs on almost all platforms and has a rich ecosystem of third-party packages for almost any purpose, including working with databases, working with vector and raster geospatial data, machine learning, visualization, and parallel processing.
Python is also an open-source language, and although it began as a more general-purpose language, it too has a wide variety of third-party packages that provide capabilities for data analysis, visualization, and just about anything else you might want to automate on a computer, including web pages, email, and reading and writing just about any file type you can think of. Most major desktop GIS software has a Python API available for automating and customizing GIS operations, so geospatial data scientists may already be familiar with Python and prefer it to R.
In the end, the choice between Python and R comes down to personal preference and what you are familiar with. Both will provide the data scientist with the tools they need for working with large amounts of data. If you are a complete programming novice interested in expanding your skill set into the geospatial data science realm, I would recommend Python as your first language. It is easy to learn, very flexible, and already incorporated into most desktop GIS software.
Reporting and Visualization
Almost all data analysis workflows end with a requirement to communicate the results to the end user. This generally entails a report detailing the steps you took to perform the analysis, along with the results, often in the form of tables and charts. Traditionally, word processing software has been used to generate the report as a stand-alone product incorporating static tables and charts, often produced in spreadsheet software, and static maps produced in desktop GIS software. This workflow is not the most efficient, however. In an ideal world, the report should be written in such a way that someone with access to the report and the original data could replicate your work and get exactly the same result. For this to happen, care must be taken to explain, step by step, exactly what you did in your analysis, and that can be difficult and time consuming.
A tool commonly used in the data science world for reporting and visualization is the Jupyter Notebook. Jupyter notebooks allow you to integrate text documentation with code blocks and the output of those code blocks. They can be used with both Python and R, as well as a few other programming languages. A key advantage is that they are dynamic and can be modified as needed. Each code block can be modified and executed by whoever has access to the notebook, and notebooks are easy to share with others who may want to replicate your analysis or apply your methods to their own data. The text documentation explains each step of the process, and the reader can see the exact code that was used to implement it, along with the results. This combination of documentation, code, and results in a dynamic environment is incredibly efficient and powerful, especially if there is any possibility that you will need to defend your analysis in a scientific publication or a courtroom, or if there is a chance that the inputs will change and you will have to redo the analysis.
As an example, consider a project I worked on many years ago. The company I worked for was tasked with permitting a solar project that would cover almost 8 square miles. A different consulting firm had originally been given the project, but the project managers were unhappy with its performance, and they asked my boss if he could get the answers they needed faster. My boss asked me what it would take, and I was able to provide those answers in a day or two rather than the months they had been waiting. This made everyone happy, we got more and more work, and we eventually took over the entire project. The project was on public land, and there were many changes as environmental issues and other things came up. Every time, I had to redo some complicated analysis with desktop GIS software, performing virtually the same steps with a different set of data, and it was very time consuming. Eventually there were legal challenges to the project, and suddenly, a year later, my boss had to testify in court and asked me to provide a detailed description of everything I had done. This was many months of work that had begun with “just get it done as fast as possible,” so going back a year later and documenting everything I had done was a daunting, time-consuming, and frustrating task.
Now consider if I had used a more typical data science workflow with a Jupyter notebook. First, I wouldn’t have had to explain all the buttons I pushed to get my original results, because the code would have been right there in the notebook. Second, I wouldn’t have had to push all of those buttons again every time there was a change to the original data (this happened dozens of times); I could have just changed the original input and rerun the entire notebook in one easy step. Finally, when asked to document everything I did, I would have had it all there in the notebook. It would have saved me hundreds of hours of monotonous work and an untold amount of frustration. In addition, every time I redid the analysis with desktop GIS, I produced dozens of intermediate files, and it became a file management nightmare.
At this point you may be thinking, “Well, you could have built a model instead of running the same analysis over and over again.” This is true, and if I had known at the start that I would have to redo it that many times, I could have done that at the beginning and it would have saved me some time. But there are several reasons why this is not an ideal solution. First, the report is still completely separate from the model; you would still need to export the model outputs to other software to create tables, figures, maps, etc., and incorporate them into the report. Second, models tend to be fairly rigid and are not always easy to modify when you need to make a slight change. Third, models tend to be proprietary, and if you reach the point where your data exceeds the limitations of the software you are using, you are stuck. Jupyter notebooks have none of these limitations.
Another way of communicating the results of your analysis is through a web-based interface. During the recent COVID-19 pandemic, the idea of a web-based dashboard became part of the popular lexicon. These were simply web pages that incorporated data from a variety of sources to display current data about the pandemic in tabular, graphical, and map formats. As a data scientist you probably won’t be required to create one of these dashboards yourself, but you might have to work with a web developer to explain what information you want to display and how to access that information, and to contribute content explaining the data and how to use the site to end users.
Advantages of the Geospatial Data Science approach
Hopefully, if you have read this far, you have seen some advantages of this approach over traditional GIS software. The primary advantages, in my opinion, are the lack of limitations, the use of open-source software, and the ability to incorporate documentation, code, and results in a Jupyter notebook.
Nothing is more frustrating for management than to have their GIS leaders tell them that the GIS software they spent thousands of dollars on is limited, and that they will need to spend tens of thousands more to upgrade to new software and train personnel in its use in order to handle the current project’s requirements.
Components of the Python Geospatial Data Science stack
Something that can be confusing when moving from the commercial GIS world to open-source software is that there are often many ways to do the same thing, and understanding the differences and choosing the best option for your purposes can be daunting. Below I discuss one set of components that perform most basic geospatial data science tasks well. They have large user communities, are well documented, integrate well with each other, and will serve most beginners well. This is a starting point, but it will provide the basic knowledge necessary to move on to more advanced tools if and when they become necessary.
- Database – PostgreSQL and PostGIS. PostGIS is a geospatial extension to PostgreSQL that allows the storage and analysis of geospatial data. They are easy to install on all platforms (Windows, MacOS, and Linux), so you can start learning on your own local computer and, if necessary, easily migrate your data to a server on your corporate network or a hosted server accessible over the internet. Storing your data in an enterprise-level database will allow multi-user editing and let you access your data remotely from desktop GIS, web mapping/dashboard applications, and mobile data collection applications. You can learn more about databases in general and PostgreSQL and PostGIS here.
- Python – There are multiple distributions of Python available. It’s likely that you already have one installed on your computer, especially if you have desktop GIS software installed. Although it is possible to use an existing distribution, I recommend installing the Anaconda distribution for data science applications. Anaconda comes with the packages required for data science, such as Jupyter Notebooks, Pandas, Matplotlib, Scikit-learn, and more. Anaconda’s package management system tends to work better with geospatial-specific libraries such as GDAL, GEOS, and PROJ, which are the backbone of geospatial analysis, and it will make life easier for most users. You can learn more about Python for geospatial applications here.
- Pandas and Geopandas – Pandas is a core library for working with data in Python. It allows you to read data into an in-memory dataframe from a variety of sources (including a remote database). Once the data is in Pandas, there is a wide variety of tools available to manipulate and visualize it, and most other packages in the Python data science stack will work with data in a Pandas dataframe. Geopandas is a geospatial extension to Pandas that allows you to read vector data into a geodataframe and includes geospatial-specific tools to manipulate and visualize geospatial data. You can learn more about Pandas and Geopandas here.
- Statsmodels and Scikit-learn – Statsmodels allows one to perform traditional statistical analysis in Python using data stored in a Pandas dataframe (or Geopandas geodataframe). Scikit-learn includes a large number of machine learning algorithms and tools to create training datasets, explore parameter space, and evaluate different models. Like Statsmodels, Scikit-learn integrates with Pandas and Geopandas. You can learn more about statistical analysis and machine learning with geospatial data here.
- Matplotlib, Seaborn, Rasterio, Plotly – These are all Python packages that allow you to create static and dynamic visualizations from both tabular data (charts) and geospatial data (maps). Although they are not a replacement for the cartographic tools available in desktop GIS or the charting tools available in spreadsheet and statistical software, it is possible to create very nice output that is integrated into the dynamic environment of a Jupyter notebook, along with your documentation, code, and data. This provides a very powerful environment for data analysis and reporting. You can learn more about these tools here.
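Putting a few of these pieces together, the sketch below loads some made-up tabular data into a Pandas dataframe, derives a new column, and renders a static chart with Matplotlib; in a Jupyter notebook the chart would simply appear inline below the cell. All names and values are hypothetical, and in practice the dataframe might instead come from PostGIS via Geopandas.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical tabular data standing in for a query result.
df = pd.DataFrame({
    "city": ["Alpha", "Beta", "Gamma"],
    "population": [264000, 20500, 3900],
    "wells": [120, 45, 12],
})

# Derive a new column with a vectorized Pandas expression.
df["wells_per_1000"] = 1000 * df["wells"] / df["population"]

# A simple static chart of the derived column.
fig, ax = plt.subplots()
ax.bar(df["city"], df["wells_per_1000"])
ax.set_ylabel("Wells per 1,000 residents")
ax.set_title("Hypothetical well density by city")

buf = io.BytesIO()
fig.savefig(buf, format="png")  # a notebook would display this inline
png_bytes = buf.getbuffer().nbytes
```

Because the data, the derivation, and the chart all live in one notebook cell, rerunning the analysis with updated inputs regenerates the figure automatically.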
I hope that this stimulates your interest in Geospatial Data Science. This approach has become my first choice for analysis in recent years and quite frankly I can’t ever imagine going back to traditional desktop GIS point and click analysis. If you are interested you can learn more about this approach from my Udemy courses accessible from the links above or the courses page of this blog. All of my courses will be available for $9.99 until October 24 using the coupon code OCT2021.