Writing a Sports Prediction Program - Computer Technical Help

Two Plus Two Forums Other Topics Computer and Technical Help

Writing a Sports Prediction Program

Post Reply Subscribe

...

07-05-2013 , 03:43 PM

derada4

old hand

Join Date: Feb 2011 Posts: 1,544

A friend of mine and I are interested in trying to create some software that will make NBA predictions. No, we don't expect it to work in the sense that we can make all this money from sports betting, we're more interested in doing it for fun and the experience. He is about to graduate with a B.A. in Mathematics and I in CS, and we're both big NBA fans.

In a nutshell, we want to take about 20 years worth of stats, get them all in the same place, and then run a bunch of queries on the data set looking for various patterns. We also would like to do a lot of statistical analysis stuff.

So what is the best way of going about this?

SQL is the only DB I'm familiar with, but am open to learning something new of course. I found a site where I can export tons of data/stats into csv files, so it would be fairly straightforward for me to create a relational DB, but:

1. is this the optimal route to go?

and

2. Then what? Can/Should I use R or Python to start doing the "data mining" stuff?

We already know what kind of stuff we want to look for, and majority of the mathematical tools we will be using, but are a bit in the dark about how to actually implement it.

Quote

07-05-2013 , 06:57 PM

clowntable

Carpal \'Tunnel

Join Date: Jun 2006 Posts: 45,557

Set a very basic milestone like
1) Load all data into the DB
2) Write a Python script that lets you query the average minutes played by a given player (or whatever simple stat you want)

Then go from that. Some ideas for next steps:
- Automate the CSV->DB part+possibly automate further and automatically scrape+load as soon as new data is available
- Do whatever complex calculation you deem interesting
- Turn it into a web app with something like Django if you use Python (for your own internal use i.e. running on 127.0.0.1)
- Fancy graphing
- Think about how you'd add lines etc. eventually like my simplest vision would be finding a way to scrape the lines from one site, displaying all (maybe all point totals or some subset) and have your program highlight the line it thinks is juciest (or rank them whathaveu)

I'd say that would be a sweet milestone...use old data, run fancy-spancy math, get and display lines for you, fancy-spancy math shows you the line it likes most (just boldface vs normal or something simple) then you could place the bet manually.
Once you have that you can implement your own playmoney system to keep track of how you're doing if you don't want the hassle of betting

[Also you might consider dong NCAA-BB instead if you can get the data because I think there's some actual money to be made there alas it's capped due to market volume (kind of like higher stakes in poker not running enough etc.)
But if it's a fun project just stick with NBA if that's what you watch]

Quote

07-05-2013 , 07:21 PM

LA_Price

adept

Join Date: Feb 2004 Posts: 1,119

I think python and R would be a good combo for this. Think of python as what you use to get the messy data into a SQL /csv format or load into a SQL databaes.

In terms of doing analysis and modeling, I think this is where R's libraries come into play. Basically its the difference between very good statisticians and very good programmers writing the underlying libraries you're using. If production was a consideration, I'd use python, but this isn't likely to be stressed web app so I'd go with R.

I personally wouldn't delve into using Django, as that would add lot of web dev, that is unless you want to show it to other people. I would use MySQL or Postres for storage. Joseph Adler wrote a good book about doing this called "Baseball Hacks". It's a few years old (2006 I believe) but many of ideas will still apply, though he uses perl as his scripting language, the MySQL and R portions would still be relevant to you.

After Haralabous Voulgaris NBA model (most of which is combo of his data mining and viewing a gazillion games) the second best model I know of on the NBA was done at canis hoopus by an anthropology student called PA100.

Quote

07-09-2013 , 09:17 AM

derada4

old hand

Join Date: Feb 2011 Posts: 1,544

Thanks for the detailed responses. I plan to start making some headway tonight. Maybe I will keep this thread alive as sort of a blog tracking my progress.

As for picking up Python...I used it for a semester way back when in high school. Haven't touched it since. I know Java, C/C++, and a little bit of Scheme. I'm a bit tempted to just skip going through any kind of boring tutorial and dive right in to coding. Thoughts?

Quote

07-09-2013 , 11:07 AM

derada4

old hand

Join Date: Feb 2011 Posts: 1,544

I'm also interested in what you guys think of this as a project to put down on my resume and use as a talking point in interviews?

I'm graduating this December and thanks to my sub-par CS program coupled with a bit of slacking, have realized I don't have much experience under my belt...

Quote

07-09-2013 , 11:56 AM

well named

poorly undertitled

Join Date: Jun 2007 Posts: 78,077

I don't know if a similar facility exists in postgres or other SQL db that you might end up using, but if you're going to start from CSV files then I would note that MySQL has a fairly nice mechanism for importing CSV

http://dev.mysql.com/doc/refman/5.1/en/load-data.html

it's likely to be faster than anything you can script yourself, or at least that was my experience the last time I was looking at importing CSV files

Quote

07-09-2013 , 12:09 PM

derada4

old hand

Join Date: Feb 2011 Posts: 1,544

Yeah I've actually had to use that exact feature before for a project in school.

My understanding though was that the purpose of the Python script was to pull the CSV file from the web and then inject it into the DB via that exact command?

Or at the least, say I create a folder called "2012-2013" with subfolders for each NBA team, and inside those folders are CSV files of the stats for every game that season, then I would write a Python script to iterate through the main folder, loading each CSV file into MySQL?

Quote

07-09-2013 , 12:22 PM

well named

poorly undertitled

Join Date: Jun 2007 Posts: 78,077

yep.

perhaps you are just smarter than I, because my first inclination in all such circumstances is to write the script to parse the csv line by line and create an insert query that way. load data infile is much faster than that

Quote

07-09-2013 , 12:51 PM

derada4

old hand

Join Date: Feb 2011 Posts: 1,544

lol really just got lucky there as I took a Database class in MySQL last semester

Quote

07-09-2013 , 01:00 PM

#10

kbfc

Pooh-Bah

Join Date: Mar 2005 Posts: 4,225

Hi derada,

I know a thing or two about this. It can definitely be a fun project. Let me suggest, though, that if you want to do anything interesting, you're going to want to use play-by-play data at the bare minimum. You won't find 20 years of it, but what you do get will be orders of magnitude richer than box scores, and you'll also get a lot of very valuable real-world experience dealing with imperfect datasets.

Quote

07-09-2013 , 01:25 PM

#11

Dudd

Carpal \'Tunnel

Join Date: Sep 2004 Posts: 54,241

I played around with some college basketball box score stuff a year or so ago, I found R and SQLite fairly simple to deal with once you got your data in a .csv format. Pretty sure the reference sites let you directly download their stats in a csv format, don't think they have pbp available though.

Quote

07-09-2013 , 01:34 PM

#12

derada4

old hand

Join Date: Feb 2011 Posts: 1,544

Yeah I've read a lot about Voulgaris and pbp models, and that definitely is the ideal goal. Going to start out simple though to just get my feet wet first.

Quote

07-12-2013 , 10:34 AM

#13

derada4

old hand

Join Date: Feb 2011 Posts: 1,544

So the first few steps of this project are most likely going to move at a snails pace, a result of the first stages not being very interesting, and me having a ton of other things going on in my life. Nonetheless:

Last night I got SQL/R/Python all installed and up and running on my computer.

Played around a bit familiarizing myself with Python's sqlite3 package.

Started to watch videos/read up on web scraping with Python so I can begin to build my database.

I'd love to hear some recommendations for a good texteditor/IDE to use with Python. Last night I was just using the stock Python Shell and IDLE.

Quote

07-14-2013 , 11:35 AM

#14

derada4

old hand

Join Date: Feb 2011 Posts: 1,544

Yesterday I spent a good chunk of time actual coding and getting things to work.

Basically, I wrote a script that scrapes all the season data for each team, as well as the Per Game data for every player. Right now it is just printing the info but the next step is to use sqlite3 to inject into a DB.

For anyone who cares, here's the script that grabs the info for Player Per Game. I still have to figure out how I'm going to do it across different seasons (i.e., handle players who switch teams)

Code:

from bs4 import BeautifulSoup
from urllib.request import urlopen

def grabTeam(url):
    soup = BeautifulSoup(urlopen(url))
    pergame = soup.find("table", id = "per_game")
    mydata = pergame.find("tbody")
    return mydata       

def grabPlayers(per_game):
    num = len(per_game.find_all('tr'))
    count = 0
    while (count<num):
        i = 0
        while (i<24):
            print (per_game.td.string)
            per_game.td.extract()
            i+=1       
        count+=1
        per_game.tr.extract()
        print()

teams = ['ATL', 'BOS', 'BRK', 'CHA', 'CHI', 'CLE', 'DAL', 'DEN', 'DET', 'GSW', 'HOU', 'IND', 'LAC', 'LAL', 'MEM', 'MIA', 'MIL', 'MIN', 'NOH', 'NYK', 'OKC', 'ORL', 'PHI', 'PHO', 'POR', 'SAC', 'SAS', 'TOR', 'UTA', 'WAS']
team = ""

x = 0
while (x<30):
    team_url = "http://www.basketball-reference.com/teams/"+teams[x]+"/2013.html"
    grabPlayers(grabTeam(team_url))
    x+=1

And seasonal info:

Code:

from bs4 import BeautifulSoup
from urllib.request import urlopen

def grabTeam(url):
    soup = BeautifulSoup(urlopen(url))
    pergame = soup.find("table", id = "team_stats")
    mydata = pergame.find("tbody")
    return mydata      

def grabStats(team_stats):
    count = 0
    while (count<2):
        team_stats.td.extract()
        team_stats.td.extract()
        team_stats.td.extract()
        i = 0
        while (i<19):
            print (team_stats.td.string)
            team_stats.td.extract()
            i+=1       
        count+=1
        team_stats.tr.extract()
        team_stats.tr.extract()
        print()

teams = ['ATL', 'BOS', 'NJN', 'CHA', 'CHI', 'CLE', 'DAL', 'DEN', 'DET', 'GSW', 'HOU', 'IND', 'LAC', 'LAL', 'MEM', 'MIA', 'MIL', 'MIN', 'NOH', 'NYK', 'OKC', 'ORL', 'PHI', 'PHO', 'POR', 'SAC', 'SAS', 'TOR', 'UTA', 'WAS']
team = ""
year = 2000


while (year<2013): 
    x = 0
    while (x<30):
        try:
            team_url = "http://www.basketball-reference.com/teams/"+teams[x]+"/"+str(year)+".html"
            print(teams[x])
            grabStats(grabTeam(team_url))
            x+=1
        except(AttributeError):
            x+=1
            pass
    year+=1

The exception above is there to handle situations where, the Charlotte Bobcats did not exist pre-2005. I have a lot of tweaking to do, just coded it this way at the moment so that the entire program would run through 2000->2013.

Thoughts and suggestions are more than welcome

Quote

07-14-2013 , 02:16 PM

#15

mburke05

Carpal \'Tunnel

Join Date: Mar 2008 Posts: 11,928

cool stuff derada! i'm following along, as i've never actually follower through with something in python (still a fish) but i'm a huge nba'er / tz'er and beginning to get into more advanced libraries like scrapy (which btw i think you should check out for web scraping/crawling. especially if this is eventually going to be live.)

one thing i used when i aggregated data on multiple seasons is a unique identifier for a player season, i dont know how exactly because i havent fully studied your code, but basically i would use boolean logic to say something like:

player id = concatenate(playername+season+team)
if 'TOT' + playername + season in previous scrape
delete any other player entries
continue with scrape using 'TOT' id

'TOT' is the combined season ofc via bref.

Quote

07-14-2013 , 02:20 PM

#16

mburke05

Carpal \'Tunnel

Join Date: Mar 2008 Posts: 11,928

an alternative is to just scrape everything and then parse it while going through sqllite, and in fact now that i think about it its probably much easier to do it that way

Quote

07-14-2013 , 03:08 PM

#17

derada4

old hand

Join Date: Feb 2011 Posts: 1,544

Quote:

Originally Posted by mburke05

cool stuff derada! i'm following along, as i've never actually follower through with something in python (still a fish) but i'm a huge nba'er / tz'er and beginning to get into more advanced libraries like scrapy (which btw i think you should check out for web scraping/crawling. especially if this is eventually going to be live.)

What I posted above is actually the first Python code I have ever wrote (lol).

Quote:

Originally Posted by mburke05

an alternative is to just scrape everything and then parse it while going through sqllite, and in fact now that i think about it its probably much easier to do it that way

Yeah I was assuming that the problem/solution lies in how I structure my DB as far as tables and primary keys/foreign keys go. My main issue isn't necessarily how to scrape and inject into the DB across multiple seasons for players who have changed teams. I'm trying to think down the line about how/why am I going to access this data, and what am I going to do with it? As the answers to those questions should pretty much determine how I should structure the relational database.

For example, if I'm going to be analyzing a player across the span of his career, then having a Carmelo Anthony table with the rows being his Averages Per Game for each season would be ideal (I think?). And so on and so on.

Quote:

Originally Posted by mburke05

one thing i used when i aggregated data on multiple seasons is a unique identifier for a player season, i dont know how exactly because i havent fully studied your code, but basically i would use boolean logic to say something like:

player id = concatenate(playername+season+team)
if 'TOT' + playername + season in previous scrape
delete any other player entries
continue with scrape using 'TOT' id

'TOT' is the combined season ofc via bref.

Would you mind explaining what is going on here a little more? Just trying to soak up as much as I can. I understand the first line but that's about it.

Quote

07-14-2013 , 06:51 PM

#18

RoundTower

ɹǝʍoʇpunoɹ

Join Date: Feb 2005 Posts: 14,563

i think this

Quote:

Originally Posted by derada4

having a Carmelo Anthony table with the rows being his Averages Per Game

would be a nightmare to work with. You will have hundreds of fundamentally identical tables with different data in them. if you write a query for one player's table, you will have to change table names to make it work for another. If you realise you need an extra column in one player's table, you probably want to add it to all your hundreds of player tables. You almost certainly want to make that a single table.

The standard way to do something like this (if you did not have playbyplay info) is to have a Player table, a Game table and a PlayerGame table. The PlayerGame table would have columns game_id, player_id, points_scored,...

Since you have play-by-play information, probably you want a Play table and a PlayerPlay table as well, and put all of pbp information into that.

This setup allows you to record all of the information you need, without duplicating that information in multiple places. If you want to find a player's average points per game over each season, you can write a query to do this. If you find you are doing this regularly, put it in a view instead.

The main reasons not to set up your database in a highly normalized manner like this are to do with performance. I don't believe you will have enough data where performance becomes an issue. Even if it did, you are probably best to use the normalized setup and run some batch jobs to create rollup tables with the extracted information.

Quote

07-14-2013 , 07:05 PM

#19

Dudd

Carpal \'Tunnel

Join Date: Sep 2004 Posts: 54,241

You almost definitely shouldn't do a table for each player. If you're working with boxscore data, I'd store it with something like 'gameid playerid teamid stat1 .... statn'. Then, I'd have separate tables which store player and games, something like 'playerid lastname firstname height position ...' for the player table and 'gameid date hometeamid awayteamid ...' for the games. Then just use joins when you want to aggregate the data, so for instance, if you want Carmelo Anthony's stats, do something like:

SELECT SUM(points), SUM(fg), SUM(fga), etc FROM BoxScores JOIN Players ON BoxScores.playerid = Players.playerid WHERE Players.lastname = 'Anthony' and Players.firstname = 'Carmelo' GROUP BY BoxScores.playerid

That's how I'd initially set it up anyways, then maybe create some tables derived from the basic game stats if you wanted to create a table which stores per season averages or career averages or per team averages or whatever, but having the data in its most basic form allows you have the most flexibility when running queries.

Last edited by Dudd; 07-14-2013 at 07:07 PM. Reason: pretty much what RoundTower said

Quote

07-15-2013 , 07:32 AM

#20

derada4

old hand

Join Date: Feb 2011 Posts: 1,544

Ah, yes, I see exactly what you guys mean. Thanks a ton, lol.

Quote

07-18-2013 , 10:01 AM

#21

Gullanian

Carpal \'Tunnel

Join Date: Dec 2006 Posts: 14,014

Difficult to implement (and to understand), but take a look at Neural Networks for finding patterns in this sort of data
http://en.wikipedia.org/wiki/Artificial_neural_network

Quote

07-18-2013 , 10:12 AM

#22

derada4

old hand

Join Date: Feb 2011 Posts: 1,544

Thanks, will definitely look into that as I love that kind of stuff.

UPDATE: I finally have the code running smooth as far as scraping the data and injecting it into a database. So far I have one table that holds all the Season Averages for every team from 2000 - 2012. I used the Team Name and Season Year as the Primary Key.

One issue: How to handle situations where the NJ Nets became the Brooklyn Nets? The tag that I scrape from the site goes from NJN to BKN. I guess I can just decide to do an

if x == BKN
x = NJN

But I would really like to implement something more general that can handle other similar situations (For instance, the Bobcats not existing prior to 2005, etc).

I think I'm going to make my next milestone loading into the DB the box scores for all teams for a number of seasons, and doing something simple like:

"For teams on a 4 game losing streak, what is the average win% for the 5th game. How does this differ between home vs. away?"

Quote

07-18-2013 , 10:35 AM

#23

Gullanian

Carpal \'Tunnel

Join Date: Dec 2006 Posts: 14,014

You shouldn't use a string as a PK. For example, if the string needs changing (the team name) you have to update all FKs which becomes a headache. You will have a lot of repeating data which means your data is not properly normalised.

Even for tables where I sometimes feel I don't need a PK or a PK can be derived from another value, I pretty much always have an auto incrementing ID integer as the PK. It simplifies everything.

PKS derived from mutliple columns can also create a lot of follow on complexities. Keep it simple imo, an ID primary-key auto incrementing int field for every table.

Team name changes can be handled in lots of different ways. Without seeing your data structure I wouldn't be able to say which I would think would probably be best. Using a simple integer PK for every table will help you solve this problem though.

Last edited by Gullanian; 07-18-2013 at 10:40 AM.

Quote

07-18-2013 , 10:38 AM

#24

derada4

old hand

Join Date: Feb 2011 Posts: 1,544

Thanks a ton, that'll solve my NJN/BKN dilemma as well.

Quote

07-19-2013 , 02:17 PM

#25

Benholio

Carpal \'Tunnel

Join Date: Jun 2004 Posts: 8,292

In the vein of neural networks, there is a great introductory class to Machine Learning that you can take online: Stanford Machine Learning

It is ~2 months long and has you actually programming the algorithms to run regressions on exactly this sort of data. The professor is one of my favorites. I really enjoyed that class.

Quote

Post Reply Subscribe

...