Geocoding is the practice of taking in an address and assigning a latitude-longitude coordinate to it. Doing so for millions of rows can be an expensive and slow process, as it typically relies on paid API services. Learn how we saved our client time and money by leveraging open source tools and datasets for their geocoding needs.
Our solution reduced their geocoding process from hours to minutes and their reverse geocoding process from unfeasibly expensive and slow to fast and cheap.
Geocoding answers questions such as:
Given the address: “17600 seneca spgs college station tx 77845”, what’s its latitude-longitude coordinate?
Reverse-geocoding answers the reverse:
Given the coordinate (-30.543534, 129.14236), what address does it correspond to?
Both are useful in several applications:
Our client needed to geocode and reverse-geocode millions of rows at a time. Geocoding was slow, and reverse-geocoding was unfeasibly expensive. To process ~7,000,000 rows:
The solution we delivered to them, on the other hand, was lightweight, cheap, and fast. To process the same number of rows:
We’re here to share our findings and give an overview of how we did it.
Suppose we’re starting with a batch of addresses and need to geocode them. The gist of the solution we delivered is as follows:
Whilst the approach is conceptually simple, we encountered several hurdles when implementing it. Here is how we overcame them.
Road names vary between providers. For example, “Seneca Springs” might also appear as “Seneca Spgs”. We used libpostal’s `expand_address` function, as well as some hand-crafted logic, to generate multiple variants of each address (in both the input and the lookup dataset), thus increasing the chances of finding matches.
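To make the idea of address variants concrete, here is a minimal pure-Python sketch. It is not libpostal and not the logic we shipped: the abbreviation table and the `normalize` and `address_variants` helpers are hypothetical, and a real table would be far larger.

```python
import itertools
import re

# Hypothetical, deliberately tiny abbreviation table (an assumption for illustration).
ABBREVIATIONS = {"spgs": "springs", "st": "street", "rd": "road"}

# Make the mapping bidirectional so either spelling can produce the other.
EQUIVALENTS: dict[str, list[str]] = {}
for short, long in ABBREVIATIONS.items():
    EQUIVALENTS.setdefault(short, []).append(long)
    EQUIVALENTS.setdefault(long, []).append(short)


def normalize(address: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", " ", address.lower())
    return " ".join(cleaned.split())


def address_variants(address: str) -> set[str]:
    """Return every spelling obtained by swapping tokens for their equivalents."""
    tokens = normalize(address).split()
    options = [[tok, *EQUIVALENTS.get(tok, [])] for tok in tokens]
    return {" ".join(combo) for combo in itertools.product(*options)}


left = address_variants("17600 Seneca Spgs., College Station, TX 77845")
right = address_variants("17600 seneca springs college station tx 77845")
print(left & right)  # the spellings both inputs share
```

Two addresses are treated as a candidate match when their variant sets overlap. Note that the number of variants grows multiplicatively with the number of ambiguous tokens, so a real implementation would need to cap it.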
The OpenAddresses data contained all the information we needed, except that the zip code was missing for some rows. For such rows, we would do the following:
The last option used a Polars plugin we developed specially for the client (who kindly allowed us to open source it). Using that plugin, it’s possible to do approximate reverse geocoding of millions of rows in just seconds.
The data we collected amounted to several gigabytes, much more than our single-node machine with 16GB of RAM could handle. This is why our client had previously been using a cluster to process it. However, we found the cluster to be unnecessary: Polars’ lazy execution meant we never had to load all the data at once.
By leveraging Polars’ lazy execution and query optimization, we were able to carry out the entire process on a single-node machine! The overall impact was enormous: the geocoding process went from taking hours to less than 10 minutes. This was fast and reliable enough that the client was able to discontinue a paid API service that was costing them ~$30,000 per year!
Thus far, we’ve talked about geocoding. What about the reverse process, reverse-geocoding? This is where the success story becomes even bigger: not only did our solution run on a single node, but it could also run on AWS Lambda, where memory, time, and package size are very constrained!
In order to describe our solution, we need to introduce the concept of geohashing. Geohashing involves taking a coordinate and assigning an alphanumeric string to it. A geohash identifies a region in space: the more characters you keep in the geohash, the smaller the region. For example, the three-character geohash `9xe` identifies a cell on the order of a hundred miles across in central Wyoming, whereas `9xejgxn` covers a very small amount of land and allows you to identify 3rd Street in Shoshoni, Wyoming. Given a latitude and longitude coordinate, the geohash is very cheap to compute, and so it gives us an easy way to filter out irrelevant data from our lookup dataset.
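For readers who have not met geohashes before, here is a minimal pure-Python sketch of the standard encoding. It is for illustration only (in practice we relied on a plugin), and the coordinate used for Shoshoni is an approximate value chosen for the example.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat: float, lon: float, precision: int = 7) -> str:
    """Encode a coordinate by repeatedly halving the longitude and latitude ranges."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits alternate, starting with longitude

    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0

    return "".join(chars)


# Approximate coordinate for Shoshoni, Wyoming (assumed for the example).
print(geohash_encode(43.24, -108.11, precision=3))  # 9xe
```

Each extra character adds five bits, so nearby points share a common prefix. That shared prefix is what makes a geohash useful as a cheap filter or join key.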
Here’s a simplified sketch of the solution we delivered:
This solution is easy to describe – the only issue is that no common dataframe library has in-built functionality for computing geohashes, nor for computing distances between pairs of coordinates.
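To illustrate the distance step, here is a small sketch using the haversine formula. This is an assumption on our part for the example, since the text above does not depend on any particular distance formula. The `nearest` helper and the candidate labels are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two coordinates."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # Clamp to guard against tiny floating-point overshoot above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def nearest(lat: float, lon: float, candidates: list[tuple[str, float, float]]):
    """Return the (label, lat, lon) candidate closest to the query point."""
    return min(candidates, key=lambda c: haversine_km(lat, lon, c[1], c[2]))


# Hypothetical candidates that already share a geohash prefix with the query.
candidates = [("address A", 43.20, -108.10), ("address B", 43.24, -108.12)]
print(nearest(43.24, -108.11, candidates))
```

Filtering by geohash first keeps the candidate list short, so the comparatively expensive distance computation only runs on a handful of rows per query.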
This is where one of Polars’ killer features, extensibility, came into play: if Polars doesn’t implement a function you need, you can always make a plugin that can do it for you. In this case, we used several plugins to adapt Polars to our needs:
Our complete environment was composed of the following:
Not only did it all fit comfortably within the AWS Lambda 250MB package size limit, but execution was also fast enough that we could reverse-geocode millions of coordinates from across the United States in less than 10 minutes, staying within the 10GB memory limit!
That’s the power of lazy execution and Rust. If you, too, would like custom Rust and/or Python solutions that can be easily and cheaply deployed for your use case, please contact Quansight Consulting.
By leveraging both open source datasets and open source tools, as well as our in-house expertise, we were able to save our client time and money on their geocoding and reverse-geocoding needs. We made the infeasible feasible. If you’d like customized solutions tailored to your business needs, delivered by open source experts, please get in contact with Quansight today.
Contact us today: connect@quansight.com