I have a friend who likes to ride his bike a different route every time. How can he know if he has already ridden a given street or not? We will help him by making a heat map.
First of all, you need data. Today we will work on data collected in the Strava application, so go out into the field and put in some kilometers!
Personally, I ride a bit, but not enough to invest in a full Strava subscription. And it is the full (paid) account that offers the option of drawing a heatmap…
By the way, every month Strava prepares maps and publishes them on its website, and this, for example, is how people ride bikes around Poland:
and this is how they ride in Warsaw and its surroundings:
Everyone would like to have their own map, but not everyone wants to pay for it. We will help!
After logging in via the browser (this doesn't work in the phone app), go to your profile.
There in the “My Account” section at the bottom we find “Download or Delete Your Account”
We go to the data export mechanism by clicking "Get Started"
In the second point – “Download Request” – we press the button. After some time we receive an email with a link to the archive.
We download the archive and unpack it. We are only interested in two elements: the activities directory (with the GPX files of the individual routes) and the activities.csv file.
We unpack both of them to the directory with our script.
Important: in the “activities” directory some files are compressed with gzip (they have the .gz extension) – it is worth unpacking them. On Linux, gunzip *.gz in the console will be enough. On Windows you will probably need an additional program (7-Zip is good for everything).
Of course, you can unpack these files in Python, but we will skip this part.
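If you would rather stay in Python after all, a minimal sketch using only the standard library's gzip and shutil modules (assuming the files sit in the activities directory, as above) could look like this:
Python
import glob
import gzip
import shutil

# unpack every .gz file in the "activities" directory,
# writing the unpacked copy next to the original
for packed in glob.glob("activities/*.gz"):
    with gzip.open(packed, "rb") as f_in, open(packed[:-3], "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)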
Once we have collected our data in the appropriate structures, we can start processing it.
The task will be to extract the measurement time and coordinates from each file.
Strava exports all routes to XML files in GPX format. Roughly speaking, such a file looks like this:
XML
<?xml version="1.0" encoding="UTF-8"?>
<gpx creator="StravaGPX Android"
     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
     xsi:schemaLocation="http://www.topografix.com/GPX/1/1 http://www.topografix.com/GPX/1/1/gpx.xsd"
     version="1.1"
     xmlns="http://www.topografix.com/GPX/1/1">
 <metadata>
  <time>2017-06-12T15:39:43Z</time>
 </metadata>
 <trk>
  <name>Afternoon Ride</name>
  <type>1</type>
  <trkseg>
   <trkpt lat="52.2858350" lon="20.9405770">
    <ele>95.0</ele>
    <time>2017-06-12T15:39:43Z</time>
   </trkpt>
   <trkpt lat="52.2858750" lon="20.9405770">
    <ele>95.0</ele>
    <time>2017-06-12T15:40:35Z</time>
   </trkpt>
  </trkseg>
 </trk>
</gpx>
We have a trkseg section in which we find a collection (list) of individual measurement points, trkpt. Each of these points has its GPS coordinates (and that alone is enough for the heatmap) as well as other parameters – here the height above sea level (ele) and the measurement time (time).
We will process this data into a regular pandas data frame, so we need to walk the XML tree and pick out the appropriate elements from each branch.
The second element is the activities.csv file, which has a ton of metadata about each route traveled. We will only be interested in selected columns: Activity ID, Activity Name, Activity Type and Filename.
To tell the truth, we won't need "Activity ID" and "Activity Name" for anything, but it is always worth keeping them in case we need to verify something. We will also save the ready data (processed from all the XMLs) to one collective CSV file – so maybe the ID and activity name will come in handy for further analysis?
The code will be short because the algorithm is simple:
First, imports:
Python
import glob                           # file list
import xml.etree.ElementTree as ET    # traversing XML
import pandas as pd                   # data frame, saving data to CSV
import numpy as np                    # 2D histogram
from tqdm import tqdm                 # progress bar - to let you know something is happening 😉
import matplotlib.pyplot as plt       # charts
import matplotlib.cm as cm            # color palettes
Now a helper function that reads one GPX (XML) file and translates it to a pandas data frame:
Python
def parse_gpx_file(filename: str) -> pd.DataFrame:
    """The function loads the given GPX file and returns a data frame with its contents"""

    def find_tag(xml_node: ET.Element, tag: str) -> ET.Element:
        """The function searches for a child node with the specified tag.
        Returns the found node or None"""
        for node in xml_node:
            if node.tag.endswith(tag):
                return node
        return None

    # we load the XML file
    tree = ET.parse(filename)
    root = tree.getroot()

    # starting from the root we look for the "trk" section
    trk_node = find_tag(root, "trk")
    if trk_node is None:
        return pd.DataFrame()

    # having the "trk" section we search for "trkseg" in it
    trkseg_node = find_tag(trk_node, "trkseg")
    if trkseg_node is None:
        return pd.DataFrame()

    # in the "trkseg" section we have a collection of route points
    points = []
    for node in trkseg_node:
        # from each point we extract the information that interests us and add it to the list
        elem = dict(node.attrib)  # lat and lon attributes
        for nn in node:
            elem[nn.tag.split("}")[-1]] = nn.text
        points.append(elem)

    # from the list we make a data frame
    points_df = pd.DataFrame(points)

    # correction of the types of the collected data
    points_df['lat'] = points_df['lat'].astype(float)
    points_df['lon'] = points_df['lon'].astype(float)
    points_df['ele'] = points_df['ele'].astype(float)
    points_df['time'] = points_df['time'].apply(lambda s: pd.to_datetime(s))
    points_df['filename'] = filename

    return points_df[['time', 'filename', 'lon', 'lat', 'ele']]
Let's assign the paths of the individual input and output data sets to variables:
Python
activities_data_gpx = "activities/*.gpx"
activities_data = "activities.csv"
activities_data_output = "activities_data.csv"
And now we can process all route files in one loop:
Python
# empty data frame to which we will append data from subsequent files
full_df = pd.DataFrame()

# we list the files and parse each of them
for file in tqdm(glob.glob(activities_data_gpx)):
    tmp_df = parse_gpx_file(file)
    # we append the parsing result to the full data frame
    full_df = pd.concat([full_df, tmp_df], axis=0)
Let's load the route metadata and combine it with the points already obtained from individual trips:
Python
activities = pd.read_csv(activities_data,
                         usecols=['Activity ID', 'Activity Name', 'Activity Type', 'Filename'])

# join the two tables
full_df = pd.merge(full_df, activities, how='left', left_on='filename', right_on='Filename')

# we sort by date and save to CSV "for now"
full_df.sort_values('time').to_csv(activities_data_output, index=False)
In principle, each of the points on the route has its own coordinates, and these are coordinates in two-dimensional space. And it so happens that the NumPy library has a function called histogram2d(). Thanks to it, we can easily prepare data for a heatmap and later – using plain matplotlib – display it:
Python
heatmap, xedges, yedges = np.histogram2d(
    # we only take activities of type "Ride" (cycling)
    x=full_df[full_df["Activity Type"] == "Ride"]["lon"],
    y=full_df[full_df["Activity Type"] == "Ride"]["lat"],
    bins=250  # the higher the value, the greater the accuracy
)

extent = [xedges[0], xedges[-1], yedges[0], yedges[-1]]

plt.figure(figsize=(9, 9))
plt.imshow(heatmap.T, extent=extent, origin='lower', cmap=cm.Greys)
plt.show()
My trips around Warsaw look something like this:
Such an image has several disadvantages. First of all, it cannot be directly superimposed on a map – the projection is not the one we are used to, so something would have to be stretched. Secondly, without overlaying it on a map it is difficult to recognize specific places. But an image prepared this way will work great as a starting point for an infographic.
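If we only want the proportions to look less squashed (still without a real map underneath), one rough workaround – an approximation added here, not part of the original recipe – is to stretch the vertical axis by 1/cos(latitude), which compensates for the fact that a degree of longitude covers less ground than a degree of latitude:
Python
# rough aspect-ratio correction for a lon/lat plot:
# a degree of longitude spans cos(latitude) times less distance than a degree of latitude,
# so we stretch the y axis by 1/cos(mean latitude)
rides = full_df[full_df["Activity Type"] == "Ride"]
mean_lat = rides["lat"].mean()

plt.figure(figsize=(9, 9))
plt.imshow(heatmap.T, extent=extent, origin='lower', cmap=cm.Greys,
           aspect=1.0 / np.cos(np.radians(mean_lat)))
plt.show()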
But why not use ready-made solutions and prepare a heatmap applied directly to the map? We will use Folium:
Python
import folium
from folium.plugins import HeatMap
First we aggregate the data, because that is what a heatmap is all about. We aggregate in pandas:
Python
# only bike rides
agg_points = full_df[full_df["Activity Type"] == "Ride"][['lon', 'lat']].copy()

# rounding the coordinate values reduces the number of points (and unfortunately the accuracy)
agg_points['lon'] = agg_points['lon'].apply(lambda x: round(x, 4))
agg_points['lat'] = agg_points['lat'].apply(lambda x: round(x, 4))

agg_points_plot = agg_points.groupby(['lat', 'lon']).size().reset_index()
One note about the last line of code above: we are counting the number of position readings at a given point, not the number of routes that passed through that point. This means that if we ride around in circles, we will get many readings from roughly the same point. A better approach would be to count the unique activities (here: unique file names) at a given point – that would give the number of distinct routes. If we want to do it this way, we do it slightly differently:
Python
agg_points = full_df[full_df["Activity Type"] == "Ride"][['lon', 'lat', 'filename']].copy()

agg_points['lon'] = agg_points['lon'].apply(lambda x: round(x, 4))
agg_points['lat'] = agg_points['lat'].apply(lambda x: round(x, 4))

agg_points_plot = agg_points.groupby(['lat', 'lon']).nunique().reset_index()
Having prepared the aggregates, we determine the center of the map (so that it does not have to be moved every time), build the map itself (the base map), and then simply apply the calculated heatmap:
Python
center = (np.mean(agg_points_plot['lat']), np.mean(agg_points_plot['lon']))

# we build a Map object, which initially is just the map background
m = folium.Map(location=center, tiles="OpenStreetMap", zoom_start=10, zoom_control=True)

# we calculate the heatmap and add it to the Map object
HeatMap(agg_points_plot.values.tolist(), radius=10).add_to(m)
Finally, we save the generated map to an HTML file:
Python
m.save("heatmap.html")
Now we can open the generated file, for example, in a browser and enjoy the result of our work.
It's no secret where I live, so the heatmap above is actual data.
What happened here today? We learned four things: how to export our own data from Strava, how to parse GPX (XML) files into a pandas data frame, how to draw a heatmap with NumPy and matplotlib, and how to overlay one on a real map with Folium. And what else could be done with this data?
You can calculate the instantaneous speed between subsequent route points and, for example, draw a heatmap from these values (after averaging).
Having the measurement time at each point, we can also check whether we ride faster in the afternoon than in the morning. By adding data on the wind direction (this will be difficult – where to get data that detailed in both location and time?) and calculating the azimuth of movement from two adjacent measurement points, we could determine whether the riding speed, after accounting for the wind, is constant or depends on the time of day.
In the data we also have the value ele, i.e. the elevation. So we can calculate the elevation gradient at each point and again compare it with the speed.
In most cases these ideas are a bit absurd, but look at what you can get from a series of measurements with only four features: measurement time, longitude, latitude, and altitude above sea level. From these four parameters we can derive the following: instantaneous speed, altitude change, azimuth of movement, and time-related features (hour of the day, day of the week, month, season, year, time elapsed since the start of the route), which we can freely combine.
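As a taste of how such features could be derived, here is a rough sketch (not part of the original script – the haversine formula and the column names of the full_df built above are the only assumptions) that computes instantaneous speed, elevation change and an hour-of-day feature per route:
Python
def haversine_km(lat1, lon1, lat2, lon2):
    # distance in kilometres between two points given in degrees (haversine formula)
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = np.sin((lat2 - lat1) / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

df = full_df.sort_values(['filename', 'time']).copy()
grouped = df.groupby('filename')

# differences between consecutive points, computed within a single route
dist_km = haversine_km(df['lat'], df['lon'], grouped['lat'].shift(), grouped['lon'].shift())
dt_h = grouped['time'].diff().dt.total_seconds() / 3600.0

df['speed_kmh'] = dist_km / dt_h        # instantaneous speed
df['ele_diff'] = grouped['ele'].diff()  # elevation change between points
df['hour'] = df['time'].dt.hour         # time-of-day feature

# for example: average speed per hour of day
print(df.groupby('hour')['speed_kmh'].mean())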
I am not even mentioning geographical features or, for example, the use of machine learning algorithms such as DBSCAN, which would allow finding groups of closely adjacent points. These could be your favorite places, or something like that.
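For illustration, a minimal sketch with scikit-learn's DBSCAN (scikit-learn is an extra dependency here, and the eps of roughly 100 metres and min_samples=50 are guesses to be tuned, not values from the original analysis):
Python
import numpy as np
from sklearn.cluster import DBSCAN

# coordinates of all bike-ride points, converted to radians (required by the haversine metric)
coords = full_df[full_df["Activity Type"] == "Ride"][['lat', 'lon']].to_numpy()

earth_radius_km = 6371.0
eps_km = 0.1  # ~100 m neighbourhood

db = DBSCAN(eps=eps_km / earth_radius_km, min_samples=50,
            metric='haversine', algorithm='ball_tree').fit(np.radians(coords))

labels = db.labels_  # -1 marks noise, other values are cluster ids
print("clusters found:", len(set(labels)) - (1 if -1 in labels else 0))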