This is a loaded question, as there is probably no single answer that fits everyone. There are important considerations, however, that apply to all. I have earned a living for the past 14 years as a GIS/database specialist for a number of small but growing environmental consulting firms. This post is specific to my experience, but I suspect that at least some aspects of it are common to small consulting firms in other fields, and possibly to small non-profits as well. This post focuses specifically on optimizing the storage of large GIS datasets in a single-user environment; I will follow shortly with a second post on moving from single-user desktop GIS to multi-user enterprise-level systems.
All of these firms were started by someone with an environmental science background. They hired environmental scientists to manage projects. Usually one of the employees became the default “GIS Person” by virtue of being more experienced and/or more interested in GIS than the rest of the firm. Maybe they had recently taken a class, or even a number of classes, but that alone is rarely sufficient. Entry-level GIS courses tend to focus on making pretty maps and spatial operations, and rarely give consideration to data storage issues, especially with the large datasets that are common in production GIS environments.
At some point, all of these companies began to realize that they needed more GIS expertise. Usually they wanted someone with an environmental background who could help out when there wasn’t enough straight GIS work to keep them busy. So they tend to hire GIS staff who come from the scientific world rather than those with IT backgrounds.
Being the “GIS guy” in a company of specialists
GIS technology has changed rapidly over that time, but it is difficult for consultants to keep up with changes in technology for a number of reasons:
- Upper management rarely provides incentive for non-billable hours, such as data organization and training.
- Upper management at times takes the short-sighted (in my opinion) view that increasing efficiency means fewer billable hours and thus less profit.
- It’s difficult to even bid on projects outside your comfort zone when you lack sufficient knowledge to estimate your time properly.
- It’s hard to bid competitively, and borderline unethical, when you have to include training hours in the bid.
- Upper management often (rightfully) considers environmental compliance to be the company’s primary focus. Any money spent on GIS infrastructure, therefore, is (wrongfully in my opinion) considered to be “overhead” and thus something to be minimized as much as possible, rather than a core part of the company business.
- It’s difficult to sell management on the advantages of adopting new technology when you have a less-than-complete understanding of it yourself.
- There is often push-back from co-workers who don’t want to change the way they are used to doing things.
This made my job frustrating, as I felt I was falling behind the times. I often knew enough to recognize that there were better ways to approach problems, but with my workload, deadlines, and the disincentives for non-billable hours, I rarely had time to learn the new approaches.
It was also incredibly frustrating when I would tell management that I could automate a task that my co-workers would spend hours on (such as manual review of potential environmental issues), only to be told that “then we can’t bill as many hours”, even as those same co-workers were burning out and quitting on a regular basis due to their heavy workloads and boring, repetitive tasks. In my opinion it’s ALWAYS better to be efficient in the long run, even when you can’t see the immediate advantages. I was far more successful when I focused my arguments on reducing embarrassing errors than when I focused on increasing efficiency.
As a result of these factors, I found myself coming into new firms three times in the past decade, tasked with “making our GIS work better”. It wasn’t that the previous personnel were not intelligent or didn’t care; far from it. They just lacked experience and were often more interested in focusing on their area of environmental expertise than on GIS.
I, too, came into GIS from an environmental background, with a BS in wildlife biology and an MS in ecology, but I also learned to program in the early 80s. I sold my first software as a senior in high school in 1984, and I spent six years developing database applications between high school and college. This additional computer experience gave me an advantage over many of my peers that I was able to use effectively as a GIS specialist.
GIS Data Storage Case Study
In one recent job, the staff of a rapidly growing consulting firm was so frustrated with the performance of the company’s GIS system, and complained so loudly, that the boss had recently spent $20,000 on five top-of-the-line workstation-class computers, only to find that they had minimal effect on performance. Sadly, this was only a month before I was hired; I could have told them it was wasted money.
Their problems were:
- They were getting all of their imagery over the internet as an ArcGIS Online service
- ALL of their vector data was coming from the server
- ALL of it was stored as shapefiles.
- They didn’t take advantage of scale dependencies in display of layers and labels.
I consolidated all of their static vector data into a single file geodatabase that I installed on each computer, along with NAIP imagery for our entire area of interest. This meant that most of their data, especially the largest files such as aerial imagery, roads, soils, etc., was stored locally and not loaded from the server, let alone over the internet. Not only that, but file geodatabases are more efficient than shapefiles (less than half the size) and have better spatial indexing, both of which increase performance substantially.
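How the shapefiles get into that file geodatabase can be scripted. Below is a minimal sketch, assuming ArcMap-era Python with arcpy and hypothetical folder paths (your server and local paths will differ); it illustrates the approach rather than the exact script I used.

```python
# Minimal sketch: consolidate a folder of static shapefiles into a single
# local file geodatabase. All paths and names here are hypothetical.
import os
import arcpy

shapefile_folder = r"S:\GIS\static_shapefiles"  # master copies on the server
local_gdb_folder = r"C:\GISData"                # local drive on each workstation
local_gdb_name = "StaticData.gdb"
local_gdb = os.path.join(local_gdb_folder, local_gdb_name)

# Create the local file geodatabase if it does not already exist.
if not arcpy.Exists(local_gdb):
    arcpy.CreateFileGDB_management(local_gdb_folder, local_gdb_name)

# List every shapefile in the source folder and import them in one call.
arcpy.env.workspace = shapefile_folder
shapefiles = arcpy.ListFeatureClasses()
arcpy.FeatureClassToGeodatabase_conversion(shapefiles, local_gdb)
print("Imported {0} feature classes into {1}".format(len(shapefiles), local_gdb))
```

Once the import is done, the map documents point at the local geodatabase rather than at the server shapefiles.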
This worked because this data rarely, if ever, changes, so it doesn’t matter if there are multiple copies stored locally on each person’s computer. Once a year, during slow periods, I would check whether new data was available and update the local data on everyone’s computer. Many important data layers change very infrequently, such as the USGS NHD hydrology data, TIGER road data, NRCS soils data, NWI wetlands data, DEMs, etc. NAIP imagery is generally on a 3-year cycle. In Colorado, our state wildlife data is updated on a 4-year cycle, so there was very little need to change this core data.
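Pushing those occasional updates out to each workstation can also be scripted. Here is a rough sketch that mirrors a refreshed master geodatabase from a server share to the local drive using Windows’ robocopy; the share and folder names are hypothetical, and it should only be run while no one has the geodatabase open.

```python
# Rough sketch: mirror the refreshed static file geodatabase from the server
# master copy down to the local drive. Paths are hypothetical placeholders.
import subprocess

master_gdb = r"\\fileserver\GIS\StaticData.gdb"  # refreshed once a year
local_gdb = r"C:\GISData\StaticData.gdb"

# /MIR makes the local copy an exact mirror of the master. Note that robocopy
# exit codes 0-7 all indicate success; only 8 and above signal real errors.
result = subprocess.call(["robocopy", master_gdb, local_gdb, "/MIR"])
if result >= 8:
    raise RuntimeError("robocopy reported errors (exit code {0})".format(result))
```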
Dynamic data (that which changes frequently), such as oil well data and raptor nest locations, was still stored on the server but moved to a file geodatabase for efficiency. Project-specific data, such as a study area boundary for a smaller project, was still stored as shapefiles. Shapefiles were what people were familiar with and often received from clients, and they were small enough that the performance hit was negligible. If a project grew large enough to warrant it, a separate project-specific file geodatabase was created. These changes greatly reduced the internet bandwidth required and the load on the server, and thus reduced map opening times from over 2 minutes to less than 20 seconds, with a similar improvement in response time when panning and zooming.
Those five workstations with eight Xeon cores and 32 GB of RAM were awesome but didn’t make a bit of difference, because ArcGIS only uses one core and the issue was data transfer speed, not processor speed or memory. Even though this company had several experienced GIS personnel who were quite capable analysts, none of them had the basic understanding of computer architecture necessary to address the problem.
Generalized Approach to GIS Data Storage
The fundamental problem is that GIS datasets tend to be large. A MrSID file of NAIP imagery for a single county can be several gigabytes in size. DEMs and other raster data can also be quite large. Even vector data such as NRCS SSURGO soils data can quickly reach gigabyte levels when downloaded and integrated for multiple counties. Trying to move such massive amounts of data over the internet, or even an internal network, will quickly bog down any computer, no matter how fast its processor is or how much RAM it has. The issue is data transfer speed, and it is critical that you optimize how your data is stored and do everything you can to retrieve the biggest data files from the locations with the fastest transfer speeds.
In general, data transfer rates are, from slowest to fastest:
- Internet – 50-100 Mb/s (shared)
- Network – 1,000 Mb/s (shared)
- External hard drive – 1,000 Mb/s (dedicated)
- Internal hard drive – 1,500-3,000 Mb/s (dedicated)
- Solid State Drive – 8,000-12,000 Mb/s (dedicated)
Processor speed and memory are definitely important, especially if you do a lot of number-crunching geoprocessing operations with large datasets. For people who are just opening a map document and panning around it, however, as many GIS users do, data transfer rates are, by far, the limiting factor in performance.
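If you want to see where your own bottleneck is, a quick and dirty test is simply to time how fast each storage location can feed the same large file to your machine. The sketch below is not a proper benchmark (operating system caching, network load, and file format all muddy the numbers, and 1 MB/s is 8 Mb/s, so the results will not line up exactly with the list above), and the file paths are hypothetical:

```python
# Rough sketch: time sequential reads of copies of the same large file stored
# in different locations and report throughput in MB/s.
import time

def read_throughput(path, chunk_size=16 * 1024 * 1024):
    """Read a file in chunks and return throughput in megabytes per second."""
    total_bytes = 0
    start = time.time()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total_bytes += len(chunk)
    elapsed = time.time() - start
    return (total_bytes / (1024.0 * 1024.0)) / elapsed

# Hypothetical copies of the same county NAIP mosaic in three locations.
for label, path in [("Network share", r"S:\Imagery\naip_county.sid"),
                    ("Internal HDD", r"D:\Imagery\naip_county.sid"),
                    ("Local SSD", r"C:\Imagery\naip_county.sid")]:
    print("{0}: {1:.0f} MB/s".format(label, read_throughput(path)))
```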
Fortunately, most of the biggest datasets that we used were the large background datasets such as imagery, DEMs, hydrology, soils, and roads. These don’t change often and could be loaded directly from a local hard drive. I found that a $500 laptop with a 128GB SSD outperformed my $4,000 desktop with an old-fashioned hard drive for simple panning and zooming operations.
Pure transfer rates are not the only consideration. Data storage format is also important. File geodatabases generally take up less than half the space of the equivalent data in a shapefile, which means the equivalent data loads more than twice as fast. Equally if not more important than file size, geodatabases use a more efficient spatial indexing scheme, so only the data within the current map view is loaded into your document rather than an entire shapefile at once.
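You can check the size claim against your own data by comparing the on-disk footprint of a folder of shapefiles with that of the file geodatabase they were imported into; a file geodatabase is just a folder, so both can be summed the same way. A small sketch with hypothetical paths:

```python
# Sketch: compare on-disk size of a shapefile folder vs. a file geodatabase.
import os

def folder_size_mb(folder):
    """Sum the size of every file under a folder, in megabytes."""
    total = 0
    for root, dirs, files in os.walk(folder):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total / (1024.0 * 1024.0)

print("Shapefiles:       {0:.1f} MB".format(folder_size_mb(r"S:\GIS\static_shapefiles")))
print("File geodatabase: {0:.1f} MB".format(folder_size_mb(r"C:\GISData\StaticData.gdb")))
```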
Pyramids can be created for raster files, which greatly decreases the amount of data that must be read when zooming in and out. Most internet map services load imagery from tiles, meaning that, like spatially indexed vector data, only small pieces (tiles) are loaded at one time, generally just enough to cover the current map view, and caching is used to minimize how often tiles are requested from the server. Even so, loading tiled data from the internet is slower than loading even large image files from a fast local SSD, especially if you are sharing the internet connection with others. It is also possible to use software such as the QGIS QTile plug-in or TileMill to generate tiles from your existing data and create a locally stored tile layer, which I would expect to load very quickly, although I have not directly compared performance.
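Building pyramids is a one-time batch job that can be scripted. Below is a minimal arcpy sketch over a hypothetical local imagery folder; formats like MrSID and cached tile services already carry reduced-resolution levels, so this mainly matters for plain rasters such as TIFFs and DEMs (gdaladdo from GDAL does the same job outside of Esri software).

```python
# Sketch: build pyramids for every raster in a local imagery folder so that
# zoomed-out views read the reduced-resolution levels instead of the full data.
import arcpy

arcpy.env.workspace = r"C:\GISData\Imagery"  # hypothetical local imagery folder

for raster in arcpy.ListRasters():
    print("Building pyramids for " + raster)
    arcpy.BuildPyramids_management(raster)
```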
Another performance enhancer to consider is setting scale limits on your data. I heard a lot of cussing from people who had a 1:24,000 hydrology layer loaded with streams labeled, inadvertently zoomed out to 1:500,000 scale, and had to sit for minutes waiting for ArcGIS to fill the entire screen with blue lines and then attempt to label them all. This should never happen.
Many data types, such as roads and hydrology, are available at a range of scales, and you can set the scale range for each dataset to an appropriate level. For example, maybe only highways and major rivers are shown at 1:1,000,000 scale; then, as you zoom in to 1:100,000 scale, county roads and larger streams appear; and at 1:24,000 scale all streams and back roads appear. You can set scale limits for labels as well. For example, maybe you see oil wells as points at some scales but don’t see labels until you zoom in close enough for the labels to be legible.
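In ArcMap, these scale ranges can be set by hand in the layer properties, or applied across existing map documents with arcpy.mapping. The sketch below sets layer scale ranges only (label scale ranges live in the label classes and are easiest to set in the layer properties dialog); the document path, layer names, and scale values are all hypothetical examples.

```python
# Sketch: enforce scale limits on selected layers in an existing map document
# (ArcMap-era arcpy.mapping). Paths, layer names, and scales are examples only.
import arcpy

mxd = arcpy.mapping.MapDocument(r"S:\Templates\StandardMap.mxd")

for layer in arcpy.mapping.ListLayers(mxd):
    if layer.name == "NHD Streams":
        layer.minScale = 100000  # stop drawing when zoomed out beyond 1:100,000
        layer.maxScale = 0       # no limit when zoomed in
    elif layer.name == "Oil Wells":
        layer.minScale = 250000  # points appear at 1:250,000 and closer

mxd.save()
del mxd
```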
The key to making this system work is a properly designed map template. I always used absolute referencing of data in my templates. This allows map documents based on the template to be moved around at will without breaking links to the data. Of course, the data has to stay in the same location, but that is usually easier to control than the location of the map document. This also allows you to store map documents on the server that reference the large static data stored locally. When a document is opened from the server, dynamic data stored on the server is referenced by the server’s drive letter (say S:), while the locally stored static data is referenced by the local hard drive (generally C:). So whoever opens the map document will see the same dynamic data from the server, but a different (though identical) copy of the static data from their own local drive.
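Because this scheme depends on everyone having the static data at the same local path and the dynamic data at the same server path, it is worth auditing map documents from time to time. Here is a small arcpy.mapping sketch that reports where each layer points and flags broken links; the document path and drive-letter conventions are the hypothetical ones used above.

```python
# Sketch: report where each layer in a map document gets its data from,
# and flag any broken links (ArcMap-era arcpy.mapping; paths hypothetical).
import arcpy

mxd = arcpy.mapping.MapDocument(r"S:\Projects\SomeProject\SomeProject.mxd")

for layer in arcpy.mapping.ListLayers(mxd):
    if layer.supports("DATASOURCE"):
        source = layer.dataSource
        if source.upper().startswith("C:"):
            location = "local static data"
        elif source.upper().startswith("S:"):
            location = "server dynamic data"
        else:
            location = "unexpected location"
        print("{0}: {1} ({2})".format(layer.name, source, location))

# Layers whose data cannot be found at the recorded absolute path.
for broken in arcpy.mapping.ListBrokenDataSources(mxd):
    print("BROKEN LINK: " + broken.name)

del mxd
```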
I used this approach to optimize GIS operations for several companies and my clients were pleased with the results.
- Does anyone else use this approach?
- Are there any other ways to streamline the use of large datasets without going to an enterprise level geodatabase?
- Has anyone else experienced the same issues with their company or clients?
Part II of this post will discuss issues related to multi-user access to data, which often becomes an issue as small companies become larger companies. In my experience, many GIS analysts who lack a background in IT infrastructure struggle with making this transition.
Mild sales pitch ahead…
If you are interested in learning how to develop your own multi-user, web-based interfaces to your data for your co-workers or clients, without any user or license fees, please check out my blog post on web-based GIS, or consider enrolling in my Udemy courses on programming for web-based GIS applications and on the Leaflet.js and Turf.js JavaScript mapping APIs. Now 80% off ($20 total) for readers of this blog!
Great article. Thanks for sharing your experience. Your case study is valid since the data was changing infrequently. I would suggest another powerful approach: using a PostgreSQL database architecture for managing GIS operations as well as GIS datasets efficiently. PostgreSQL is a freely available database system with GIS support.
Thanks Muhammed. I’m a huge fan of PostgreSQL and PostGIS. It’s especially impressive when you need multi-user editing capability. I just published a new post about using QGIS with PostGIS for multi-user enterprise GIS applications. http://millermountain.com/geospatialblog/2017/04/20/qgis-arcgis-users/
This is definitely a good summary; we use a mix of the bad systems you identify at the outset and the various techniques you lay out as the solution. We also have server-level geodatabases, which we synchronise locally with the server replication options. This is sort of the next step up…
Also, writing some Python scripts to automate data update cycles is fairly doable these days as well.
This article is particularly useful.
I appreciate the approach to GIS data management. To improve performance, teams could also set up a sync between the GIS repositories on their computers and the GIS repository on the server.
This type of file structure also allows for non-GIS file types that are part of the GIS workflow (e.g., Adobe Illustrator files used in the production of map products).
Looking forward to part two of this post!