Crawling Harness Data — The Calendar of Events

Data for this project is available from the early 1980s at (this link), and there is even an organization, https://www.rise-digital.com.au/, that has an API available ‘freely’ to use. The problem with RISE is that I have attempted to obtain access from them several times and have not received a response. Their terms of service clearly state: ‘RISE values creativity and encourages broad use of the API to encourage wagering’, but their lack of response clearly demonstrates their actual values.

Because I do not have access to the API, my only option is to build a series of web crawlers to gather the data that I require.

Below is the data model at an abstract level. This model reflects each major area of the data, as well as the logical and functional units of work. Working from top to bottom, there is a flow that eventually leads me to the data that I store in a flat structured table that reflects the outcome of a race event.

Harness Racing Model

Calendar of Events

The calendar of events is the first data that I crawl. I want to know about every harness race meeting and then I want to know about every race in each meeting, and subsequently, every result for each horse in that race.

The goal of the calendar crawler is to populate a table called meetings, which holds the URL of each meet. The next step is to crawl each meeting and download a copy of the HTML to storage for future processing.

To achieve this, I have written a crawler which runs 4 times a day and ‘rolls’ through a period of 17 days around the current day: I go back 10 days and forward 7 days from the current date. In this way each date is covered again on every one of the 17 days it spends inside the ‘rolling’ window, so that I can detect any changes to the calendar data (such as event schedule changes, day/night changes etc.). The only reason I crawl 4 times a day is that the crawler itself can crash if the harness website is not operational. I have never actually seen this happen, but I have left the schedule as it is because my resources on Google App Engine are free, and the crawler completes its job in just a few seconds.
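The rolling window described above can be sketched in a few lines of Python. This is only an illustration: the function name rolling_window is hypothetical, and whether the current day itself counts towards the 17 days is an assumption. Here the current day is included, which yields 18 dates per run.

```python
from datetime import date, timedelta

DAYS_BACK = 10     # taken from the text: go back 10 days
DAYS_FORWARD = 7   # taken from the text: go forward 7 days


def rolling_window(today, days_back=DAYS_BACK, days_forward=DAYS_FORWARD):
    """Return every date from today - days_back to today + days_forward.

    Assumption: both ends and the current day are included.
    """
    start = today - timedelta(days=days_back)
    total_days = days_back + days_forward + 1  # +1 for the current day
    return [start + timedelta(days=offset) for offset in range(total_days)]


if __name__ == "__main__":
    for day in rolling_window(date.today()):
        print(day.isoformat())
```

Because every run recomputes the window from the current date, a run that crashes needs no special recovery: the next scheduled run simply covers the same dates again.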

Each time I crawl a calendar day, I create an entry in the meetings table for each meet I find. This does lead to redundancy, which I take care of when I actually crawl the event. Because I store every event I find, I am also storing events that I do not need; examples include daily programming summaries, empty events, and non-existent events. However, at this level the redundancy is OK, as I detect and correct these issues at a later stage.
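As a rough sketch of the "store everything now, clean up later" approach, the snippet below records every sighting of a meeting and only collapses the duplicates when the data is read back. It uses an in-memory SQLite database purely as a stand-in; the actual storage behind the meetings table, the column names, and the function names are all assumptions made for illustration.

```python
import sqlite3


def create_meetings_table(conn):
    # Hypothetical schema: one row per sighting, duplicates allowed.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS meetings ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " url TEXT NOT NULL,"
        " label TEXT,"
        " crawled_at TEXT NOT NULL)"
    )


def record_sighting(conn, url, label, crawled_at):
    # Always insert; no attempt is made to de-duplicate at crawl time.
    conn.execute(
        "INSERT INTO meetings (url, label, crawled_at) VALUES (?, ?, ?)",
        (url, label, crawled_at),
    )


def latest_per_url(conn):
    """Return one (url, label) pair per URL, using the most recent sighting."""
    return conn.execute(
        "SELECT m.url, m.label FROM meetings m"
        " JOIN (SELECT url, MAX(id) AS max_id FROM meetings GROUP BY url) latest"
        " ON m.id = latest.max_id"
        " ORDER BY m.url"
    ).fetchall()
```

Keeping every sighting means a change between two crawls (for example a day meet becoming a night meet) is still visible in the raw rows, while later stages only need the latest row per URL.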

The issues I have encountered while crawling the race calendar are that meetings change:

  • The race day/night schedule can change. A race can change from a day meet to a night meet and vice versa.
  • The stated location of a race may differ from its actual location. An example, visible in the blog title image for Sun 7th, shows ‘Orange at Bathurst (D)’.
  • Races get cancelled. See the Tue 2 entry for ‘Wagga at Riverina Paceway’.
  • Race dates change.
  • Races can be trials.
  • Links to the races can be dead links.
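Several of these changes show up in the calendar label itself, so it can help to split the label into parts and compare them between crawls. The sketch below is an assumption-heavy illustration: it guesses that a label has the form ‘club at venue (D)’, that the suffix ‘(D)’ marks a day meet and ‘(N)’ a night meet, and that a label without ‘at’ names a club racing at its own venue. The names MeetingLabel and parse_label are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class MeetingLabel(NamedTuple):
    club: str               # the stated location, e.g. "Orange"
    venue: str              # where the meeting is actually run
    session: Optional[str]  # "D", "N", or None if no marker is present


# Assumed label shape: "<club>[ at <venue>][ (D|N)]"
_LABEL = re.compile(
    r"^(?P<club>.+?)"
    r"(?:\s+at\s+(?P<venue>.+?))?"
    r"(?:\s*\((?P<session>[DN])\))?$"
)


def parse_label(text):
    """Split a calendar label into club, venue and day/night marker."""
    match = _LABEL.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised label: {text!r}")
    club = match.group("club")
    venue = match.group("venue") or club  # no 'at' part: assume home venue
    return MeetingLabel(club, venue, match.group("session"))
```

With the parts separated, a change from day to night or a change of venue can be detected by comparing the parsed fields of two sightings of the same meeting.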

The image below shows the table that is created with the URL of the calendar date range. I use this in the actual crawler to return a JSON file of meetings, which I then parse into the meetings table.

Calendar Table

And the subsequent meetings table.

Meetings Table
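The step from the JSON file of meetings to rows for the meetings table could look roughly like the sketch below. The shape of the JSON is not shown in this post, so the keys used here ("meetings", "url", "label", "date") and the function name meetings_from_json are purely hypothetical placeholders.

```python
import json


def meetings_from_json(payload):
    """Turn a JSON string of meetings into a list of row dictionaries.

    Assumed payload shape (hypothetical):
    {"meetings": [{"url": "...", "label": "...", "date": "..."}, ...]}
    """
    data = json.loads(payload)
    rows = []
    for entry in data.get("meetings", []):
        url = entry.get("url")
        if not url:
            # Entries without a link cannot be crawled later, so skip them.
            continue
        rows.append(
            {
                "url": url,
                "label": entry.get("label", ""),
                "date": entry.get("date"),
            }
        )
    return rows
```

Entries are filtered only on the presence of a link; anything else that looks suspicious is kept, in line with the approach of storing every event and correcting it at a later stage.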

Next week I will dig into the actual mechanics and code of the calendar crawler and cover how it works and the problems I encountered using the Python requests library, including rate limiting and the use of JavaScript bundles, and how I got around some of these using a proxy server.

See you then!
