A dalliance with distance

During a recent engagement, I was tasked with making a query faster. Since I signed an NDA (Non-Disclosure Agreement), I can’t go into detail, but it made extensive use of the STDistance() function of the GEOGRAPHY data type.

The query is being called many times a second, and owing to the way the application was designed, is effectively running in a cursor. For each row in a table that contains GPS coordinates, it has to calculate how far the principle GPS coordinates are from that row’s latitude and longitude.

As you can imagine, even with only a few hundred rows, this does not scale well. The database had hundreds of thousands of rows that needed to be calculated each time the query ran because these values are constantly changing.

The data was then being ordered on the STDistance result so that the nearest neighbour showed up first.

After some investigation, I discovered that the spatial index on the coordinates column was not set up correctly. However, we were unable to schedule a maintenance window to rebuild that index. Therefore, I was tasked to make the query faster without modifying any indexes.

To begin with, removing the ORDER BY clause made the query run much more quickly, even though it was running in a loop. After removing the looping structure and refactoring the query to be set based, it ran even more quickly. No surprises there, but the result set was now much larger.

However, it was still causing blocking on the table where coordinates were being stored, and the returning data was unsorted.

Enter the Great Circle Rule. In principle, it states that any distance on the surface of a sphere can be calculated using a simple mathematical formula (known as the Haversine Formula). If there’s one thing I know about computers, it’s that they perform mathematical formulas really well.

The haversine formula is not simple to express in a T-SQL query. Instead, we can use the spherical law of cosines, which is not as accurate (there is a maximum drift of around 2 metres, or a 0.3% margin of error).

(cos c = cos a cos b + sin a sin b cos C)

The advantage of this cosine law is that it’s one line of code, which means it can be used as a drop-in replacement for STDistance.

The customer agreed that for the purpose of this query, 0.3% was well within an acceptable error range. I helped the lead developer rewrite the query to replace STDistance value with the cosine formula, which increased performance by a factor of 1000 and did not need to use the spatial index at all.

In purely T-SQL terms, this looks as follows (the [Latitude] and [Longitude] are columns in the table I’m querying):

-- Latitude of source
DECLARE @Latitude FLOAT(53) = 34.09833
-- Longitude of source
DECLARE @Longitude FLOAT(53) = -118.32583
-- Diameter of the earth, in miles
DECLARE @Diameter FLOAT(53) = 3959

SELECT [ColPK], [Latitude], [Longitude],
ACOS(COS(RADIANS(90 - @Latitude)) * COS(RADIANS(90 - [Latitude]))
+ SIN(RADIANS(90 - @Latitude)) * SIN(RADIANS(90 - [Latitude]))
* COS(RADIANS(@Longitude - [Longitude]))) * @Diameter AS [Distance]
FROM [InterestingTable]

(Source: http://gis.stackexchange.com/a/40929)

Now of course this can be ordered and sorted as required. Because the distance is expressed in miles, we can filter the results to only display values that fall within a specific radius from the source, making the query run even faster.

If you have any neat geography data things to share, find me on Twitter at @bornsql .

Author: randolph

Randolph West is a Microsoft Data Platform MVP, and has worked with SQL Server since the late 1990s. When not consulting, he can be seen acting on the stage and screen, or doing voices for independent video games. Connect with Randolph on Google+ or Twitter.