Everything has a beginning, sometimes with no conceptual end.
This is the story of my journey into Data Science: the progression from an idea to something that takes me into the dystopian world of statistics and probability, and the arcane world of Machine Learning.
Beginnings. "import React from 'react'".
I had already learned some Python, thanks to Mark Lutz's 1,500-page tome on the language (which, five years later, I am halfway through) and a Udacity Nanodegree in Front End Development. Several short courses later, including more from Udacity and Udemy, I found myself intrigued by React.
I had found the Rabbit Hole, but had not yet dipped my toes into it. I already knew that building something from scratch required more than a knowledge of HTML and CSS (I really, really dislike CSS), but I also knew that, given the challenge, I could flatten a mountain, albeit one shovel of dirt at a time.
My first 'idea' was to build a site that could help a vehicle buyer find the best possible deal in a particular market. This taught me web crawling, mostly in Python and sometimes Node, and introduced me to the real world of data gathering that is web scraping. The idea was good, and I actually built a site, but gathering the data cleanly turned out to be the most time-consuming part.
The second was tracking the success of CrossFit. I am a CrossFit 'observer'. I love the idea of being super fit, and decided at one time (maybe twice) that I should own a CrossFit gym. That did not work out so well (not once, but twice). On a positive note, during my time failing at business (and CrossFit) I developed a web crawler (ha!) that scrapes the active affiliates from the CrossFit site on a weekly basis. I have been doing that successfully for five years now, but no website has yet appeared. The data, however, tells a wonderful story, though that is a story for another time.
The third idea, and the most ambitious, was to scrape race result data from the harness racing industry in Australia. I am Australian born, but also a Canadian citizen, and I have been interested in horse racing statistics since the late '80s, when I was an engineer with Alcatel.
So, down the Rabbit Hole I go.
Not only do I need to gather data, and a lot of it (just under 5 million race results as I write this), I also need to learn how to wrangle, clean, and manage a massive (for me) data set, and then apply Machine Learning to it.
As an example of the data quality issues I have encountered: when race results are gathered, horses are ranked in their finishing order (duh), and the most interesting positions are 1, 2, and 3. You would think that describing the distance from horse 1 to horse 2 to horse 3 would be simple. It should be. It should be something like 0.15m, 1m, 10m… pick any numerical measurement system (metric, imperial, even logarithmic) and it would work. But no: whoever designed the system allows numerical distances to be mixed with values like 'NOSE', 'NS', 'HALF NOSE', 'HF NOSE', '1/2 NS', and the list goes on and on. In fact, there are more than 340 distinct representations of the distances separating the 1st-place horse from the 3rd.
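To make that concrete, here is a minimal sketch of what normalising those margin strings might look like. This is not my production code: the alias table, its metre values, and the function name are illustrative assumptions, not official harness-racing conversions.

```python
# Sketch: normalise raw margin strings to metres.
# The alias values are illustrative guesses, NOT official conversions.
from typing import Optional

MARGIN_ALIASES = {
    "NOSE": 0.05,
    "NS": 0.05,
    "HALF NOSE": 0.025,
    "HF NOSE": 0.025,
    "1/2 NS": 0.025,
}

def margin_to_metres(raw: str) -> Optional[float]:
    """Return an approximate margin in metres, or None if unrecognised."""
    text = raw.strip().upper()
    if text in MARGIN_ALIASES:
        return MARGIN_ALIASES[text]
    try:
        # Plain numbers, optionally suffixed with "M", e.g. "0.15M", "10"
        return float(text.rstrip("M"))
    except ValueError:
        return None
```

In practice the alias table has to grow to cover all 340-plus spellings, which is exactly why this cleanup is such a slog.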
The rest of this story will be written at the rate of a chapter per week. I will describe my vision for the data, how I learned to gather and manipulate it, the development of a website, and ultimately the application of Machine Learning to produce an analysis, and perhaps some predictions.
I am currently in the data gathering stage (the 5 million records are stored in a PostgreSQL database) and am experimenting with some basic analysis to identify inconsistencies, areas where more data is needed, and existing data that needs cleaning. It is February 2021, and I am guessing I will not have a clean data set until around June to August 2021.
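To give a flavour of that basic analysis, here is a hedged sketch of the kind of consistency check that surfaces duplicate spellings in a column. The sample values below are made up; in my case the real ones would come from the PostgreSQL results table.

```python
# Sketch: count each raw spelling of the margin column after trimming
# and upper-casing, to see which values are really the same thing.
# Sample data is invented for illustration.
from collections import Counter

sample_margins = ["NOSE", "NS", "nose", "1/2 NS", "0.15M", "NS", "HF NOSE"]

def margin_frequencies(values):
    """Count each raw spelling after normalising case and whitespace."""
    return Counter(v.strip().upper() for v in values)

freqs = margin_frequencies(sample_margins)
# "NOSE" and "nose" collapse to one key, but "NS" still counts separately,
# which is exactly the duplication that needs a manual mapping table.
```

A `GROUP BY` over the raw column in PostgreSQL does the same job at full scale; the point is simply to see how many variants you are really dealing with before writing any cleaning code.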
This is my journey down the analytical rabbit hole.