I began writing these stories as a way of journaling and reflecting on my progress as a Python coder, and as a way of documenting my journey into machine learning. These stories are not intended to teach or suggest anything beyond what my own journey as a self-taught coder has been. I believe most self-taught coders go through a similar journey, especially if they have never had professional, corporate experience. I have received criticism that this is simply a journal. Well, yes, that's exactly what it is. It is my attempt to reflect, learn, and redo a couple of years of work, and is in no way meant to solve anyone's coding problems or issues.
I have found that every time I got stuck or had to learn something new, there was a course or resource (Udacity, Udemy, Coursera, freeCodeCamp, fast.ai, Khan Academy, Kaggle, DataCamp, CodersRank, free courses from MIT, Harvard, etc., Stack Overflow… and Google) that provided me with a solution. So who am I to provide solutions to problems that have been solved a hundred times over? If I ever feel I have stumbled upon something so unique that I can add to the body of knowledge as a solution, then I will do so. This set of blogs is definitely not intended for that.
In the last story we began dissecting the Calendar Scraper I built, which crawls the Race Calendar entries daily to find new race meetings and, subsequently, new data for my analysis. In this story I will discuss the Meetings class and how it fits into the crawler architecture, and then finish by outlining the actual Python code that brings all of this together into a crawler/scraper.
In the Meetings class we have defined the following:
- id — unique identifier
- racedate — the date of the race event
- url — the URL of the race event page
- classname — the type of race: ‘trial’, ‘tab_night’, ‘tab_day’, ‘tab_twilight’, ‘non_tab’ or ‘cancelled_meet’
- title — the name of the ‘field’ (venue) hosting the race, e.g. ‘PENRITH’, ‘HAROLD PARK’, … ‘TABCORP PK MENANGLE’
- processed — a Boolean flag indicating that the URL has been downloaded for further processing
- nightday — whether the meeting was held during the day or at night
- donotprocess — an administration field (Boolean)
- processederrorcount — no longer used
- lastupdated — redundant
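To make the shape of the record concrete, here is a plain-Python sketch of the Meetings class. The real class is an ORM model tied to a database, so this dataclass is only an illustration; the types and defaults are inferred from the field descriptions above and may differ from the actual implementation.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import Optional

@dataclass
class Meetings:
    # Sketch only: the real class is a database model, and these
    # types/defaults are inferred from the field descriptions.
    id: int                                  # unique identifier
    racedate: date                           # date of the race event
    url: str                                 # URL of the race event page
    classname: str                           # 'trial', 'tab_night', 'tab_day', ...
    title: str                               # venue name, e.g. 'PENRITH'
    processed: bool = False                  # page downloaded for processing?
    nightday: str = ''                       # 'day' or 'night'
    donotprocess: bool = False               # administration override flag
    processederrorcount: int = 0             # no longer used
    lastupdated: Optional[datetime] = None   # redundant
```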
So, now back to the Python code that uses the CalendarList and Meetings classes. The code is triggered by the Google cron job scheduler via a URL that hits an exposed Flask route.
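The route looks roughly like this. This is a reconstruction from the description, not the original snippet, so the function bodies and the exact response payload are assumptions.

```python
from flask import Flask, jsonify

app = Flask(__name__)

def scrapeCalendarsFromWeb():
    # Placeholder: the real function is walked through below.
    pass

# Sketch of the cron entry point: @app.route binds the URL to main(),
# which calls scrapeCalendarsFromWeb() and returns JSON with a 201.
@app.route('/crons/scrapeCalendars')
def main():
    scrapeCalendarsFromWeb()
    return jsonify({'success': True}), 201  # placeholder JSON body
```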
The snippet above is the entry point for the scrapeCalendars route. The @app.route decorator binds the URL (‘/crons/scrapeCalendars’) to the function main(), which in turn runs scrapeCalendarsFromWeb().
The main function finishes by returning a response that contains some JSON and a success status code of 201.
I am returning rubbish JSON here. This could be improved simply by removing the JSON and returning None. For the time being, keeping it as it is has no impact on the code.
Next we have the scrapeCalendarsFromWeb function.
When I first wrote this code I was all about capturing statistics for each run, which is what the statsDict dictionary is for. It records parameters for each run, including the type of process, how many meetings were processed, how many URLs were fetched directly from my GAE server versus via a proxy (when rate-limited, blocked, etc.), how many events were saved, how many events were not processed, and a count of how many rate limits I encountered.
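A sketch of what statsDict might look like; the key names here are assumptions based on the statistics listed above, not the original ones.

```python
from datetime import datetime

# Run-statistics dictionary (key names are illustrative assumptions).
statsDict = {
    'runDate': datetime.now().isoformat(),
    'processType': 'calendarScrape',   # type of process
    'meetingsProcessed': 0,            # meetings handled this run
    'urlsDirect': 0,                   # URLs fetched directly from GAE
    'urlsViaProxy': 0,                 # URLs that fell back to the proxy
    'eventsSaved': 0,
    'eventsNotProcessed': 0,
    'rateLimitCount': 0,               # rate-limit responses encountered
}
```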
To begin with, I grab the current date and time, set the baseurl, and then calculate a start and end date for the crawl. The start date is set to 10 days prior to the run, as I want to ensure I capture all completed races and any changes made after a race has finished. I also look for new events in the upcoming 7 days. In this way I can construct ‘rolling’ URLs which expose the events calendar over a 17-day period.
This information is then used to construct the URL that is subsequently crawled.
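The rolling window and URL construction can be sketched as follows. The baseurl and query-parameter format are placeholders, since the real calendar host's URL scheme is not shown here.

```python
from datetime import datetime, timedelta

# Rolling 17-day window: 10 days back to re-check completed races,
# 7 days forward to pick up newly listed meetings.
now = datetime.now()
baseurl = 'https://example.com/calendar'   # placeholder host
startdate = now - timedelta(days=10)
enddate = now + timedelta(days=7)

# Placeholder query format; the real site's parameters will differ.
url = '{}?from={}&to={}'.format(
    baseurl,
    startdate.strftime('%Y-%m-%d'),
    enddate.strftime('%Y-%m-%d'),
)
```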
I then assign a new dictionary, q, to hold the URL. In retrospect this is confusing, as I see no reason why I did it. There is an opportunity to refactor here by removing the dictionary and passing the URL directly.
useProxy is a Boolean flag that I can manually adjust to force the use of ProxyMesh. The proxy is a way of getting around blocking, rate limiting, and the other measures the URL's host uses to prevent scraping. This is a problematic area for me: I have tried many crawling approaches that work for various lengths of time, and then I am blocked or prevented from accessing sections of content. Over time, we will see how I get around these issues as we work through the remaining architecture of this application.
I should probably also use an environment variable for the useProxy flag. This is another refactor opportunity.
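That refactor could look something like this; the variable name USE_PROXY is an assumption.

```python
import os

# Suggested refactor sketch: source the proxy flag from an environment
# variable instead of a hard-coded value. 'USE_PROXY' is a hypothetical
# name, not one from the original code.
useProxy = os.environ.get('USE_PROXY', 'False').lower() in ('true', '1')
```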
I then set a few more variables. The sleep variable is set to 1 second and passed to the processUrl function, which does all of the heavy lifting. I also set a filename (completely redundant! Refactor) and a fieldDict, a dictionary that holds all of the event data (again, redundant, refactor!).
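The hand-off to processUrl might look like the stub below. The signature is purely an assumption pieced together from the variables mentioned (the q dict, the sleep value, the proxy flag, statsDict); the real function is covered in the next story.

```python
import time

def processUrl(q, sleeptime, useProxy, statsDict):
    # Hypothetical stub: the real function requests the page, does some
    # initial processing, and saves the page to Google Storage.
    time.sleep(sleeptime)  # polite delay between requests
    statsDict['meetingsProcessed'] = statsDict.get('meetingsProcessed', 0) + 1
    return statsDict

q = {'url': 'https://example.com/calendar'}   # the dict wrapper flagged for refactor
stats = processUrl(q, sleeptime=1, useProxy=False, statsDict={})
```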
Then we run the processUrl function, which is where all the magic happens. This function requests the page, does some initial processing, and saves the page to Google Storage. I will begin working through it in the next story.