Geography 337: GIS II: Post 4: Geocoding

Goals and Objectives

The goal of this exercise is to geocode 18 Wisconsin sand mine locations so that they can be used in the next exercise, which is routing. Spatial and descriptive data for the mines were presented to the class in the form of an un-normalized Microsoft Excel file, as would be expected if the Wisconsin DNR were to provide such data (fig. 1). Each student was assigned 18 mines from the Excel file that they would need to normalize before geocoding.

Data from the Excel file were geocoded using an ESRI geocoding service. Mines located with the geocoder were compared with real-world mine locations to ensure that addresses produced by the geocoder were as accurate as possible; if the geocoder produced an erroneous match, other data, e.g. Public Land Survey System (PLSS) data and Google Earth imagery, would need to be used to locate the mine (fig. 2).

Once all of the data were geocoded, the students' results were merged, producing class results for 148 mines. Each mine had a Unique Identification field that was used to compare the locations of the mines geocoded by the class to the actual locations of the mines in order to determine our accuracy.

Methods

Normalizing Wisconsin DNR Data

In order to use the ESRI geocoder, sand mine data provided by the Wisconsin DNR needed to be normalized. Columns in the Excel spreadsheet needed to be sorted so that each column contained one set of information. For example, in Figure 1 the 'Address' column for many of the mines contained several pieces of data such as street address, zip code, town/city, and PLSS data. In addition to giving each column its own set of information, so that the ESRI geocoding tool worked as efficiently as possible, a PLSS field was created in the normalized table. The PLSS field is useful when the geocoder does not locate a mine and the mine has to be located manually, as will be discussed in the 'Geocoding Mines' section of the report.
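The normalization itself was done by hand in Excel, but the idea of splitting one combined 'Address' value into separate fields can be sketched in Python. This is only an illustration under stated assumptions: the sample address is made up, and the layout it follows (street, city, state and zip, then a PLSS description) and the regular expressions are hypothetical. Real WDNR records are far less regular and would still need manual review.

```python
import re

# Hypothetical patterns for illustration only; they are NOT derived from the WDNR file.
PLSS_RE = re.compile(
    r"S(?:EC)?\.?\s*(\d{1,2})\s+T\.?\s*(\d{1,2})N\s+R\.?\s*(\d{1,2})([EW])",
    re.IGNORECASE,
)
STATE_ZIP_RE = re.compile(r"\b([A-Z]{2})\s+(\d{5})\b")


def normalize_address(raw):
    """Split a combined address string into one value per field."""
    record = {"street": "", "city": "", "state": "", "zip": "", "plss": ""}

    # Pull out the PLSS description (section, township, range) first.
    plss = PLSS_RE.search(raw)
    if plss:
        sec, twp, rng, direction = plss.groups()
        record["plss"] = f"S{int(sec)} T{int(twp)}N R{int(rng)}{direction.upper()}"
        raw = raw[:plss.start()] + raw[plss.end():]

    # Then the two-letter state code followed by a five-digit zip.
    state_zip = STATE_ZIP_RE.search(raw)
    if state_zip:
        record["state"], record["zip"] = state_zip.groups()
        raw = raw[:state_zip.start()] + raw[state_zip.end():]

    # Whatever remains is assumed to be "street, city".
    parts = [p.strip() for p in re.split(r"[,;]", raw) if p.strip()]
    if parts:
        record["street"] = parts[0]
    if len(parts) > 1:
        record["city"] = parts[1]
    return record


# Made-up example record, not a real mine.
example = normalize_address("123 Example Rd, Sampleville, WI 54000; Sec 12 T30N R9W")
print(example)
```

Keeping the PLSS value in its own field mirrors the normalized table: the geocoder only sees the address fields, while the PLSS field stays available for manual location.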

Geocoding Mines

The normalized table was added to ArcMap and loaded into the ESRI geocoding program. Sand mines were geocoded based on their addresses and the geocoder matched 15 out of the 18 mines, with 2 unmatched mines and one tie (fig. 1). A geocoding shapefile was also produced that contained the geocoded locations (fig. 2).

In order to examine the 3 mines that were not matched by the geocoder, and to check the accuracy of the 15 matched locations, 'Rematch' (fig. 1) was selected. The ESRI 'World Imagery' basemap was used (fig. 1) so that the points in the 'geocoding results' could be compared to locations of mines on the basemap (where they existed) using PLSS data. For example, PLSS information was overlain onto the basemap. When the property was located down to the quarter-section (fig. 3), the address was manually picked.

Due to outdated imagery on the ESRI basemap, Google Earth was also used to manually check the locations of geocoded mines. If the location of a mine that was automatically geocoded was found to be inaccurate, the 'Interactive Rematch' interface was used to manually select a location based on basemap and Google Earth verification (fig. 4).

Figure 1. Initial results of running the ESRI geocoder.

Figure 2. Point shapefile of sand-mine locations automatically located with the ESRI geocoder with ESRI basemap imagery to aid in manual location.

Figure 3. Locations of unmatched (or poorly matched) mines were verified down to the quarter-section (red squares) using PLSS information from the WDNR data and ArcGIS PLSS data.

Figure 4. Interactive Rematch interface used to manually locate mines based on PLSS and imagery (ESRI basemap and Google Earth).

Geocoded Mine Accuracy

Merge. Geocoded mines from each student were merged with the ArcMap Data Management 'Merge' tool. In order to run the tool, tables from each student had to be consistent with one another. For example, one student renamed their 'Mine_Unique_ID' field to 'Mine_ID'. In order to run the Merge tool with all the appropriate fields, all 18 of that student's unique mine IDs were re-entered into a new field named 'Mine_Unique_ID'.
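The field clean-up before the merge can be illustrated with a small Python sketch. It does not use ArcMap or the Merge tool; it only shows the idea of renaming a mismatched field so that every student's table shares the same key before the rows are combined. The rows and coordinate values are made up, and raising an error on a missing key is my own strict check for the illustration, not a description of how the Merge tool behaves.

```python
def harmonize_fields(rows, renames):
    """Return copies of the rows with fields renamed to the shared schema."""
    return [{renames.get(name, name): value for name, value in row.items()}
            for row in rows]


def merge_tables(tables, key_field):
    """Append all rows into one list, refusing rows that lack the key field."""
    merged = []
    for rows in tables:
        for row in rows:
            if key_field not in row:
                raise KeyError(f"row is missing required field {key_field!r}")
            merged.append(row)
    return merged


# Made-up rows: student B used 'Mine_ID' instead of 'Mine_Unique_ID'.
student_a = [{"Mine_Unique_ID": 101, "X": 500000.0, "Y": 400000.0}]
student_b = [{"Mine_ID": 102, "X": 510000.0, "Y": 410000.0}]

student_b_fixed = harmonize_fields(student_b, {"Mine_ID": "Mine_Unique_ID"})
class_mines = merge_tables([student_a, student_b_fixed], "Mine_Unique_ID")
print([row["Mine_Unique_ID"] for row in class_mines])
```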

Distance. Once the class' mine data were merged into one shapefile, the 'Point Distance' tool was used to create a table (fig. 5). Input features were the merged class mine data and near features were the "all_mines" shapefile, which contained the actual, accurate locations of the mines; the accurate mine data were provided after each student geocoded their respective mines. Before the distance tool was used, both the all_mines and merged class data shapefiles were projected into a state coordinate system, NAD_1983_2011_Wisconsin_TM (meters). A search radius of 1000 meters was used in the Point Distance tool to produce a table of distances (the distance table); the radius excluded 56 mines from the table.
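What a point-distance calculation with a search radius does can be sketched in plain Python. This is not the ArcMap tool itself, only a minimal illustration that assumes both point sets are already in the same projected coordinate system with units of meters, so straight-line (planar) distance is meaningful. The coordinates and feature IDs are made up, and the output field names are chosen to resemble those described in the text.

```python
import math


def point_distance(input_pts, near_pts, search_radius):
    """Return one row per input/near pair whose planar distance is within the radius.

    input_pts and near_pts map a feature ID to an (x, y) tuple in meters.
    """
    rows = []
    for in_fid, (x1, y1) in input_pts.items():
        for near_fid, (x2, y2) in near_pts.items():
            distance = math.hypot(x2 - x1, y2 - y1)
            if distance <= search_radius:
                rows.append({"INPUT_FID": in_fid,
                             "NEAR_FID": near_fid,
                             "DISTANCE": distance})
    return rows


# Made-up coordinates in meters.
class_pts = {0: (0.0, 0.0), 1: (5000.0, 5000.0)}
actual_pts = {10: (30.0, 40.0), 11: (9000.0, 9000.0)}

table = point_distance(class_pts, actual_pts, search_radius=1000.0)
print(table)
```

In this toy data, input point 1 has no actual mine within 1000 meters, so it does not appear in the table at all, which is how mines end up excluded from the distance table.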

Join. The merged class data were joined with the distance table. In order to do the join, a new field was created in the class mine data shapefile. The new field was Input_FID, matching a field that was produced when the distance table was generated. The Input_FID corresponded to the Object_ID of the input shapefile (i.e. the class mine data).

Once the new field (Input_FID) was created in the merged sand mine shapefile, its attribute table was joined with the distance table generated by the Point Distance tool. The joined table displayed the unique mine IDs as well as the distance between each geocoded mine and the actual mine, for those pairs that were less than 1000 meters from one another.
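The join on Input_FID can be sketched as a simple key lookup in Python. This is an illustration only, with made-up rows; it assumes the distance table holds at most a few rows per input feature and keeps the smallest distance for each. Features with no row in the distance table receive None, which plays the role of the "<null>" values in the joined table.

```python
def join_distance(features, distance_rows):
    """Attach the nearest distance to each feature, keyed on Input_FID."""
    nearest = {}
    for row in distance_rows:
        fid = row["INPUT_FID"]
        if fid not in nearest or row["DISTANCE"] < nearest[fid]:
            nearest[fid] = row["DISTANCE"]
    # Features absent from the distance table get None (shown as <null> in ArcMap).
    return [dict(feature, Distance=nearest.get(feature["Input_FID"]))
            for feature in features]


# Made-up merged class data and distance table.
class_mines = [
    {"Input_FID": 0, "Mine_Unique_ID": 101},
    {"Input_FID": 1, "Mine_Unique_ID": 102},
]
distance_table = [
    {"INPUT_FID": 0, "NEAR_FID": 10, "DISTANCE": 50.0},
    {"INPUT_FID": 0, "NEAR_FID": 12, "DISTANCE": 640.0},
]

joined = join_distance(class_mines, distance_table)
for row in joined:
    print(row["Mine_Unique_ID"], row["Distance"])
```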

Results

Normalization of Wisconsin DNR Data

Sand mine data from the WiDNR were inappropriate for the purpose of geocoding (fig. 5). For example, the address column contained not only the street number, but in many cases the zip codes, city, state, and PLSS information (fig. 5; highlighted portion).

Such information was separated into individual columns so that the ESRI geocoding program could be run in ArcMap; the result was a normalized table (fig. 6).

Figure 5. Un-normalized table of sand mine data that was provided by the WiDNR.

Figure 6. Normalized table of sand mine data with a separate column for each piece of address information: street address, city/town/village, zip, county, state, PLSS.

Geocoded Mines

Eighteen sand mines were geocoded using the ESRI geocoding program. All 18 of the mines were located manually because none of the mines were geocoded correctly by the program. Locations were picked using ESRI basemap imagery, PLSS shapefiles (in conjunction with PLSS data provided by the WiDNR), and Google Earth. The geocoded shapefile was projected into a Wisconsin state coordinate system: NAD 1983 (2011) Wisconsin TM (METERS); the data are displayed in figure 7.

Once each individual in the class geocoded their 18 mines, the instructor provided a new shapefile (All Mines) that showed the sand mines' actual locations. The 18 mines that were geocoded by the author of the report (Luczak Mines) are compared spatially to the accurately placed All Mines in figure 8.

Once the current GIS class' geocoded mines were merged and projected into an appropriate coordinate system, the distance between the class mines (Class Mines shapefile) and the actual locations of the mines (All Mines) was calculated (fig. 9). To calculate distance, a radius of 1000 meters was used. One reason for selecting such a radius was economy of data: if no radius were selected, every location in the Class Mines shapefile would be compared with every point in the All Mines shapefile, resulting in over 500 comparisons. Also, any distance greater than 1 km between the geocoded mines and the actual mines was likely the result of gross error and ultimately worthless. A total of 56 of the Class Mines were located at a distance greater than 1 km from the actual mines. A total of 10 of my geocoded mines were in excess of 1 km from the locations of the actual mines and are given a value of "<null>" (fig. 9).

Figure 7. Map of the 18 mines that were initially geocoded for the project.

Figure 8. Map showing the 18 mines that were geocoded for this project (Luczak Mines) versus the actual mine locations (All Mines).

Figure 9. The distance between the 18 mines that were geocoded for this project ('Distance' field) and the actual mines.

Discussion

Error is common in geographic data due to numerous factors such as data quality, operational commands, and data collection methods. Errors found in geographic information are grouped into two categories that reflect the characteristics of the errors (Lo and Yeung, 2007). The two categories of error are operational and inherent, and they are summarized in the table in figure 10. Both operational and inherent errors can be gross (mistakes or blunders), systematic (mechanical defects in collection tools or changing environmental conditions), or random (errors that are left after gross and systematic errors are accounted for) (Lo and Yeung, 2007).

Inherent errors result from the fact all maps are merely scale representations of the Earth, which is far too complex to be modeled on a 1:1 scale. In contrast, operational errors result from the collection and management of geospatial data. Both operational and inherent errors were encountered in the previous exercise and will be discussed.

The geographic data provided by the WiDNR are likely more accurate than the data geocoded by the class. The reason is that the WiDNR verified sand mine locations by going out into the field and collecting GPS points at those locations. Using GPS to verify sand mine locations likely involves inherent and operational errors, such as field survey measurement error due to instrument limitations, and minor operational errors, such as those arising from sampling procedures (fig. 10). However, assuming the WiDNR has set up standard operating procedures (SOPs) for collecting GPS data in the field, gross operational errors in its data collection are probably lower than those of the class, whose mine locations were geocoded based on address data.

Class Mine error was probably due to gross operational errors, especially when locating data for manually geocoding mines. For example, even the geocoded mine that was closest to its All Mines location was placed 36 meters away (Unique ID#106). Furthermore, roughly 38% of the mines geocoded by the class (56 out of 148) were calculated to be greater than 1 km from the mine locations determined by GPS information. While 1 km may not seem like too far to be off in terms of accuracy, one could imagine if the mine locations were actually houses. If emergency services were dispatched to locations 1 km away from where they needed to be, then people's lives could be at risk.

Other types of error, both inherent and operational, are likely present in the geocoded data as well; however, they are likely overshadowed by the inconsistencies caused by gross operational error that resulted when mines were geocoded. Other types of error that were likely overshadowed by the gross operational error in Class Mine data include, but are not limited to, inherent and operational error from projecting the shapefiles into the NAD_1983 coordinate system and numerical rounding during computation (fig. 10).

Of course, all error was determined based on the belief that the GPS data provided by the WiDNR are the most accurate dataset. If the WiDNR data were inaccurate due to gross operational errors, such as errors in data collection, then all error calculations between their data and the class' data could be false. However, it is unlikely that the DNR's field data are less accurate than the data geocoded by the class.

Figure 10. Screen-shot of a table summarizing error source and type. From Lo and Yeung (2007).

Conclusions

The purpose of the geocoding exercise was to stress the importance of data integrity and standardization. Standardization of data in all stages of the exercise would have likely resulted in more accurately geocoded data at the end of the exercise. For example, each student probably normalized their table differently, so it is hard to tell whose data are accurate and whose are not.

The exercise also established the fact that all geospatial data that one receives should be questioned for integrity. For example, it was only after realizing the hodge-podge nature of assessing the class's geocoded data (my own included) that I began to question the accuracy of the WiDNR's data: How accurate is their geospatial data regarding sand mine locations? Did the DNR establish standard operating procedures (SOPs) for the collection of GPS points? If such SOPs were established by the DNR, were they followed by all technicians collecting the data on the mines? Even if all GPS data collected by the DNR were collected consistently and according to established SOPs, did they assess the data for error? If so, how did they make such assessments?

Work Cited

Lo, C.P., and Yeung, A.K.W., 2007, Concepts and Techniques of Geographic Information Systems: Upper Saddle River, New Jersey, Prentice Hall, 544 p.


Friday, April 10, 2015
