I began writing this story as a way of documenting the journey of creating a massive, clean, tidy and Machine Learning ready database for the analysis and prediction of harness racing results. What I did not understand at the time was that this process becomes very reflective and as I worked through my model, and my code, I realize that so much has changed in my evolution as a Python programmer that if I were to write this code today, I would do it completely differently.
So, I think that from here on what I describe in these stories will first be about how I approached solving the problem based on what I knew and understood when I wrote the original code, but also about how I re-factor and point out areas that can be improved. I may as well re-build as I go now that I am so much further ahead as a Python programmer.
The most recent improvement will be the re-structure of the code to something more object oriented where I get the chance. Using a class structure within the code will sometimes make sense, rather than a top down / waterfall approach to the code which dates back to my early college days in the 80’s and 90’s.
In the last story (“Crawling the Race Calendar”) I described the following flow of the code:
- Set the crawl base url to
- Add a start date equal to todays date less 10 days
- Add an end date equal to todays date plus 7 days
- The result is
- Process this url using the Python requests library which will return a JSON file containing all race meets between the start and end dates.
- For each tuple in the JSON file process a race meet
- Apply basic reformatting to determine if the race is night or day
- Save the race information in table meetings with raceDate, Url, raceclass, title and nightday
This week I will begin walking through the actual code. Instead of just pasting the full code here I will instead paste snippets and describe how each of these snippets work and at the end of the series for this group of stories I will provide the entire code base for the calendar crawler.
The set of code following is the initialization of the Python code. I can already see some potential areas of improvement here, discussed below.
The code above was created by using GitHub Gist. Very cool and this is the first time I have used it. To get the color effect you simply create a Gist, then paste the link into the story. Medium does the rest, automagically (thanks to the story posted by Squirrel — ‘How to add code snippets to Medium?’)
The first 12 lines are simply library imports for utility functions that I use further in the code. As I use each of these I will explain them rather than going through each of these lines right now. If you are a programmer then they should set a twinkle in your eye, if you are a non programmer then I will reveal how these things work as we go.
It all begins to get interesting at line 14 where I import from something called Models. The model is a python file made up of classes that reflect the data structure within the PostgreSQL database. Further on, I sometimes use a Library called SqlAlchemy which makes it easy for me to manage the database and gives me a choice to use either raw SQL (query) code, or the Object Relational Model (ORM) provided by SqlAlchemy.
On a side note, my good friend Victor Morgante is a subject matter expert in database modeling, especially when it comes to ORM and Natural Language Processing (NLP). He is a genius….and farms Mangos in the far North of Australia. If you are reading this, and have gotten this far, you should check out and follow his blog on Medium.
So, this Model thing has a couple of classes that I use somewhere in this file: Base, CalendarList and Meetings.
Base is a SqlAlchemy thing (don’t you love my writing style? It’s because I don’t have a great vocabulary), that acts as a Base Class (programmers object oriented talk), that all the other classes inherit from, in this direct example, these will be: CalendarList & Meetings.
What Base does, in SqlAlchemy speak, is to use something called declarative_base to create a base class that is initialized with two common attributes called Table and mapper().
These are used by SqlAlchemy to automate the creation and management of Tables in the actual database and provide a way to map from a table structure to the magic that happens inside SqlAlchemy.
So, in the code above we import a whole stack of things from SqlAlchemy, some Geo Spatial stuff that is not used in this snippet, and then we define a constant called Base which is an instance of declarative_base().
We then have our first table definition, CalendarList which is a class, that inherits from Base which in turn inherits from declarative_base(). Whew.
The CalendarList replicates what we see in the backend PostgreSQL database.
- id: a sequential serial number that uniquely identifies a table record
- url: the url that is built based on the rolling dates described in the last story
- processedcount: the number times this url was processed. I no longer use this and it should be removed.
- datetoscrape: the date that the url should be processed. I no longer use this and should remove it.
Interestingly, if you are a standards type of person, I have not named my columns very well, nor according to best practice. This is an opportunity for re-factor, but changing the names here also means finding them in ALL of my code and making the changes there also. Unfortunately for me, this needs to be done and is a good practice.
To clean up I will remove the redundant columns. No renaming required as the id and url fields which will remain are perfectly ok as field names in the database.
The __repr__ is a special class method to represent an objects (the field definition of CalendarList) as a string. A programmer can use __repr__ by calling the repr() function ( repr(CalendarList) ) and this will return the object as a string which can be printed:
This simply provides a mechanism useful for debugging and it can also be used to store the object definition as text, and then replicated by rebuilding the object somewhere else. I can think of lots of useful ways to use repr and here it is used by SqlAlchemy when an error is thrown and some debug info is needed.
I think that I have written enough for this week. We introduced the concept of using classes to define our database tables, and we have already performed some cleanup. We have defined one table so far, CalendarList, but we do not yet know how it is used or where it fits in the world of the crawler just yet. We know a little about __repr__ and we have heard about this SqlAlchemy thing. We do not really understand classes, ORM, or Base, but we will take that in our stride and soldier on with the hope that we will learn more about them as we go.
Next week I will finish with the description of the Meetings table object (class) and I will discuss one additional reflective re-factor that I can see myself making in the near future.
Same time, same channel, next week.