The calendar of race events (the Meeting Calendar) is constantly in flux as race meets are added, removed, and changed. The crawler I have written runs every 6 hours using the Cron scheduler on Google App Engine, which initiates a Python script that requests and processes a URL.
The Python script runs in a container instance on the App Engine Standard Environment and is free to run. As long as the script finishes within 10 minutes, the free Standard Environment is a perfect way to do this. In my experience a calendar scrape takes less than 15 seconds, well within the 10 minute limit set for the Standard Environment.
The goal of the scraper is to identify the URLs of race meetings and save each URL in a table that I call ‘meetings’ for later processing (by the meetingcrawler, which will be discussed in future stories). I do not check for duplicate URLs at this point, as I want to make sure that I do not miss any data when I finally begin to crawl and parse the actual race results. Duplicates are resolved at that later stage, when the race data is parsed; the architecture and mechanics of this will be discussed in future stories.
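To illustrate the "save everything, resolve duplicates later" idea, here is a minimal sketch. It assumes SQLite purely for illustration (the actual database behind the meetings table is not covered in this story), and the helper names `create_meetings_table` and `save_meeting` are hypothetical. The column names are the ones listed in the code flow further down.

```python
import sqlite3


def create_meetings_table(conn):
    """Create a 'meetings' table (SQLite is an assumption for this sketch)."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS meetings (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            raceDate TEXT,
            Url TEXT,        -- deliberately no UNIQUE constraint
            raceclass TEXT,
            title TEXT,
            nightday TEXT
        )
        """
    )


def save_meeting(conn, race_date, url, raceclass, title, nightday):
    """Insert one row without checking whether the URL is already stored."""
    conn.execute(
        "INSERT INTO meetings (raceDate, Url, raceclass, title, nightday) "
        "VALUES (?, ?, ?, ?, ?)",
        (race_date, url, raceclass, title, nightday),
    )
    conn.commit()
```

Because the Url column carries no uniqueness constraint, saving the same meeting on two consecutive crawls simply produces two rows, which is the behaviour described above.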
The basic architecture of the calendarscraper is as follows:
The proxy server is used in cases where the receiving server rate limits our requests or denies access to the requested URL. With this web data source the behaviour varies depending on the URL, and I have never encountered a rate limit at the calendar level. However, when I process an actual race meet or attempt to get specific horse data, I get rate limited on almost every attempt, even with a proxy. This will be discussed further when I write about how I process a URL, and again in each of the additional crawlers (for horses and events).
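As a rough illustration of that fallback, the sketch below tries a direct request first and retries once through the proxy only if the server refuses or throttles it. This is not the actual proxy setup, which comes in a later story. The function name `fetch_with_proxy_fallback`, the choice of status codes 403 and 429 as the "blocked" signal, and the single retry are all assumptions. The `session` argument is expected to behave like a `requests.Session`, and `proxies` uses the dictionary format that the requests library accepts.

```python
# Status codes treated as "denied or rate limited" (an assumption for this sketch).
BLOCKED_STATUSES = {403, 429}


def fetch_with_proxy_fallback(session, url, proxies, timeout=30):
    """Try a direct request; retry once through the proxy if it is blocked.

    session: an object that behaves like requests.Session.
    proxies: a requests-style mapping, e.g. {"https": "http://proxy.example:8080"}.
    Returns (response, used_proxy).
    """
    response = session.get(url, timeout=timeout)
    if response.status_code not in BLOCKED_STATUSES:
        return response, False  # direct request worked, proxy not needed

    # The direct request was refused or throttled, so go through the proxy.
    response = session.get(url, proxies=proxies, timeout=timeout)
    return response, True
```

With the requests library installed, this could be called as `fetch_with_proxy_fallback(requests.Session(), url, proxies)`. The second element of the result tells the caller whether the proxy was needed, which is handy for logging how often each URL type gets blocked.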
How the proxy is set up, as well as the code specifics for the calendar scraper, will be discussed in forthcoming stories.
The Code Flow.
The basic pseudocode looks like this:
- Set the crawl base url to
- Add a start date equal to today's date less 10 days
- Add an end date equal to today's date plus 7 days
- The result is
- Process this URL using the Python requests library, which returns a JSON response containing all race meets between the start and end dates.
- For each entry in the JSON response, process a race meet
- Apply basic reformatting to determine if the race is night or day
- Save the race information in table meetings with raceDate, Url, raceclass, title and nightday
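The steps above can be sketched in Python roughly as follows. This is only an outline under stated assumptions, not the real scraper. `BASE_URL` is a placeholder rather than the real data source, the query parameter names `start` and `end` are guesses, the JSON field names (`date`, `url`, `class`, `title`, `startTime`) are hypothetical, and the night/day rule (18:00 or later counts as night) is an assumed stand-in for the actual reformatting.

```python
from datetime import date, timedelta
from urllib.parse import urlencode

BASE_URL = "https://example.com/calendar"  # placeholder, not the real source


def build_calendar_url(today=None, days_back=10, days_forward=7):
    """Build the crawl URL for the window [today - 10 days, today + 7 days]."""
    today = today or date.today()
    start = today - timedelta(days=days_back)
    end = today + timedelta(days=days_forward)
    # Parameter names "start" and "end" are assumptions for this sketch.
    query = urlencode({"start": start.isoformat(), "end": end.isoformat()})
    return f"{BASE_URL}?{query}"


def classify_night_day(start_time):
    """Return 'night' for an 'HH:MM' time of 18:00 or later (assumed rule)."""
    hour = int(start_time.split(":")[0])
    return "night" if hour >= 18 else "day"


def parse_meetings(payload):
    """Turn decoded JSON entries into rows shaped like the meetings table."""
    rows = []
    for entry in payload:  # field names below are hypothetical
        rows.append({
            "raceDate": entry["date"],
            "Url": entry["url"],
            "raceclass": entry["class"],
            "title": entry["title"],
            "nightday": classify_night_day(entry["startTime"]),
        })
    return rows
```

In a real run, `payload` would come from something like `requests.get(build_calendar_url()).json()`, and each row returned by `parse_meetings` would then be written to the meetings table.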
Starting next week I will go into more detail, including the actual code, on how I process the URL and how I save the race meeting data in the meetings table.
If you are interested in what the base URL returns, below is a snapshot of what the JSON looks like. You can copy the above URL directly into your browser window to see the actual result.
We are at week 4 and have hardly touched the server. It will take 3 or 4 more stories to completely cover the calendarscraper before we move on to the next crawler, which grabs and stores each race event.