Back in the late 1980s I worked as a Test Engineer with Alcatel Australia, designing test software and hardware for AXE exchange equipment. We used Hewlett-Packard 3065 test stations whose bed-of-nails fixtures detected everything from shorts and opens to complex CPU pattern failures, helping ensure that our manufacturing process was error-free.
I happened to work with a test engineer, Polish I think, who liked to place bets on harness racing. I had watched my father place bets all my life, and I recall one Saturday afternoon when my Dad turned a $1 parlay into about $4k in cash by picking the winner of 5 consecutive races, with the winnings from each race staked on the next. In other words, $1 turned into $10, which turned into $80, and so on, across 5 consecutive bets.
I was never a gambler, but this Polish colleague intrigued me to the point where I would purchase the ‘Trotters Guide’, go home, and key it into what was the equivalent of MS Excel at the time. I then managed to get an online (telephone) betting account with the Australian TAB and, no pun intended, I was off to the races.
Even then I was able to guess whether a horse would come 1st, 2nd or 3rd. The problem was that the returns on place bets did not cover the losses. You would win, say, $2.50 on a $1 bet, putting you $1.50 ahead, but if you missed the next two bets, or the returns were small, you eventually eroded your bank.
I knew that to really make this work you needed to hit the daily double or a trifecta. A daily double means picking the winners of two designated races; a trifecta means picking the 1st, 2nd and 3rd placegetters of a single race in exact order. In both cases the payout grows roughly multiplicatively with the odds of each leg, so you could turn a $1 bet into thousands, but you also had to absorb the many losses along the way. I thought I was on to something, and I still do.
Back then, any analysis had to be performed on whatever sample you had available; you could not really take a single horse and estimate its probability of winning with any degree of accuracy. Machine learning, with its ability to harness the law of large numbers, means you can now zero in on a single event probability: you can analyse a single horse and weigh it against its peers.
Without Machine Learning, and the ability to bootstrap statistics, the problem I had back in the 1980s would still be a major issue today. Back then I could deal with maybe 100 horses, possibly up to 1,000. As of this morning, as I write this on February 14, 2021, I have 144,000 horses and 4,893,357 individual horse results over 114,000 races.
This is by far the biggest racing result dataset that I have found outside of the racing authorities themselves.
Another interesting thing I learned in the early 1990s was that insurance companies would invest in race outcomes. They had access to economists, statisticians, mathematicians and plenty of computing power. I wondered at the time what they knew that the general punter did not.
There are lots of papers out there where academics have researched and written theses on aspects of results prediction. Googling “horse racing predictions machine learning” returns over 14 million results. There is even the folklore about Bill Benter, who developed a ‘can’t lose’ algorithm that helped him win the Triple Trio (a trifecta-style bet) worth $13 million.
Most of the academic articles I have read have worked on small datasets, and nearly all of them attempt to predict the finishing position, generally 1 through 10, using various clustering or relative-distance algorithms such as k-means (‘distance’ in the data-point sense, not race distance). Not many attempt to predict race finishing times, because in most cases the data simply is not there for anything other than first place.
I believe that the real power lies in predicting the time it takes for a horse to complete a race at a specific distance, on a specific track, out of a specific starting barrier, along with a few more important variables, or features, that will come out as you follow along with my story.
Most race results include the distance by which an individual horse finishes behind the winner, and I have created an algorithm that turns this distance into a pseudo race finish time. By using this finish time I think I can run my dataset through a neural network and begin the task of predicting race finish times. Once you have a set of predicted finish times for the pack of horses in a race, you can determine the probability that a specific horse will come in 1st, 2nd, 3rd, or even 4th. You can then use the published odds to calculate your return on a $1 bet for all the possible combinations of these outcomes.
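My actual conversion algorithm will come later in the series, but a minimal sketch of the idea looks something like this. It assumes the margin behind the winner is expressed in metres and that a beaten horse was still travelling at roughly the winner's average speed when the winner crossed the line; both assumptions, and the numbers, are illustrative only.

```python
# Sketch: turn "metres behind the winner" into a pseudo finish time.
# Assumes the beaten horse ran the remaining gap at the winner's
# average speed over the whole race (a simplification).

def pseudo_finish_time(winner_time_s: float, race_distance_m: float,
                       metres_behind: float) -> float:
    """Estimate a beaten horse's finish time from the winner's time."""
    avg_speed = race_distance_m / winner_time_s  # metres per second
    return winner_time_s + metres_behind / avg_speed

# A hypothetical 2,100 m race won in 155.0 s, with a horse 12 m behind:
t = pseudo_finish_time(155.0, 2100.0, 12.0)
```

Once every horse in a race has a pseudo time like this, the field can be compared on a common scale rather than just by finishing position.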
Using this method, a punter can then place bets covering the double or trifecta (or both) for these combinations. For example, if you have three horses that you think will come in 1st, 2nd and 3rd, but you are not sure in which order, you can place 6 bets spread across all the possible finishing orders (3! = 6). It is possible that a $6 spread of $1 bets could return hundreds of dollars (or millions, if your name ends in Benter). The same concept applies if you think that 4 horses have a chance of finishing in the top 3, or 5 horses, or even 6…
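As a quick check on how fast the cost of covering a shortlist grows, the exact-order trifecta combinations can be enumerated with `itertools.permutations` (the horse names here are placeholders): three distinct horses give 3! = 6 ordered bets, four give 24, five give 60.

```python
# Enumerate every exact 1st/2nd/3rd ordering of a shortlist of horses.
from itertools import permutations

def trifecta_combinations(horses):
    """All ordered (1st, 2nd, 3rd) selections from a shortlist."""
    return list(permutations(horses, 3))

combos = trifecta_combinations(["A", "B", "C"])
n3 = len(combos)                             # 3 horses -> 6 bets
n4 = len(trifecta_combinations("ABCD"))      # 4 horses -> 24 bets
n5 = len(trifecta_combinations("ABCDE"))     # 5 horses -> 60 bets
```

This is why the shortlist matters: each extra horse multiplies the number of $1 bets needed to box the trifecta.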
Using machine learning to predict the time it takes a horse to finish a race, combining that with the odds of the horse winning, applying it to the group of horses that may finish 1st, 2nd, or 3rd, and calculating the possible winnings in a return matrix turns the odds in the punter's favor.
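For any single combination, the return-matrix idea reduces to a simple expected-value test. The probability and decimal-odds pairs below are made-up placeholders, not real model output or market prices:

```python
# Expected profit of a bet: P(outcome) * decimal_odds * stake - stake.
# A bet has positive edge only when probability x odds exceeds 1.

def expected_return(prob: float, decimal_odds: float, stake: float = 1.0) -> float:
    """Expected profit of backing an outcome at the given decimal odds."""
    return prob * decimal_odds * stake - stake

# Illustrative candidate combinations: (model probability, published odds).
candidates = [(0.05, 30.0), (0.10, 8.0), (0.02, 60.0)]
value_bets = [(p, o) for p, o in candidates if expected_return(p, o) > 0]
```

The punter's edge, if any, lives entirely in the gap between the model's probability and the probability implied by the published odds.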
The final observation regarding the use of Machine Learning for race result prediction is that an algorithm continues to improve, all other things being equal, as you feed it more data. You can tweak hyperparameters, change algorithms, use regression, clustering, or neural networks, but the fact remains: the more data you have to train your algorithm, the better the results.
Let the journey begin.
Next week I will begin to describe the architecture of how I gather data. This is not a trivial exercise: the initial data capture has taken me 248 consecutive days so far, and I am down to the last ‘problematic’ corpus of around 200 events (out of 114k, where an event is a race day at a track with one or more races) whose errors can only be corrected manually (for example, typos that cause my code to crash).