Scraping and analysing data

01-26-2016 , 08:12 AM
Hi guys,

Context: I'm now a second year math/cs student - my classes have been in Haskell and Java, I've done no databases. I've also done an intro Udacity course which taught Python so I know a bit of that.

I want to learn how to write customisable HTML scrapers to get sports data and run statistical analysis on it (for fantasy sports). I'll try to ask actual questions below, but recommendations of languages and/or courses are also very much appreciated.

1 - Is writing good scrapers feasible for someone of my experience, or should I just get some browser extension? Any good resources to recommend?
2 - Either way, what should I learn in order to automate scraping and other scripts so they run at certain times of the day? (I'm running Windows.)
3 - Best place to store the data? Should I import everything into excel for analysis, or is it worth learning about databases for this?

I've got about 2 months of after-class time in which I can learn/experiment with this, at which point I'd like to have a good system set up. Sorry if this is vague, feel free to ask for any clarification.

Thanks in advance
01-26-2016 , 09:59 AM
What Java book(s) did you use? I feel like the standard Java books unis use tend to have some rudimentary scraping program.
01-26-2016 , 01:45 PM
For fantasy football, the NFL has a JSON feed that is near real time, I believe:

http://www.nfl.com/liveupdate/game-c...10101_gtd.json

A guy I work with has a friend who took over the code for a rudimentary fantasy league phone app and just maintains it for their group. He uses that feed for everything in season, and I think he gets great update times/results in the app.
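
A minimal sketch of reading a JSON feed like this in Python. The feed URL above is truncated, so this parses an inline sample instead. The field names (`home`, `away`, `abbr`, `score`) and the game id are made up for illustration and are not the real feed's schema.

```python
import json

# Hypothetical sample payload; the real feed's structure will differ.
SAMPLE = """
{
  "game-001": {
    "home": {"abbr": "AAA", "score": 21},
    "away": {"abbr": "BBB", "score": 17}
  }
}
"""


def summarise_games(raw_json):
    """Return one (game_id, home, home_score, away, away_score) tuple per game."""
    data = json.loads(raw_json)
    rows = []
    for game_id, game in data.items():
        home, away = game["home"], game["away"]
        rows.append((game_id, home["abbr"], home["score"], away["abbr"], away["score"]))
    return rows


if __name__ == "__main__":
    for row in summarise_games(SAMPLE):
        print(row)
```

For a live feed, the raw text could come from `urllib.request.urlopen(url).read().decode("utf-8")` instead of the inline sample.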
01-26-2016 , 06:07 PM
Quote:
Originally Posted by Noodle Wazlib
What Java book(s) did you use? I feel like the standard Java books unis use tend to have some rudimentary scraping program.
Thinking in Java by Bruce Eckel, however that was more recommended for keen students and not required to get good grades, so I didn't get it. I probably will as that'll be a good resource to have for software engineering stuff, but it's gonna be quite a while till I use java at uni again. I'm doing a very barebones major (focusing on math/stats more) and only doing one CS subject most semesters. So really this will be completely separate from uni.

Anyway I got a really basic scraper working in python but will need to customise it for each site, and still need to figure out whether I should learn excel or python data analysis + sql. It'd be cool to get this working for non-sports data too, maybe it's naive but I feel like if you could program this stuff and keep an eye out, you could find the occasional arbitrage opportunity, Nate Silver some less common events that have betting attached, etc etc. Are there any programmers here that have tried that sort of stuff?

Last edited by boganomics; 01-26-2016 at 06:15 PM.
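
One way to keep a scraper customisable per site is to keep the parsing code generic and pass in the site-specific part as a parameter. Below is a rough sketch using only the standard library's `html.parser`; the `TableScraper` class, the `scrape` helper, the table id `stats`, and the sample HTML are all invented for illustration, and real pages will usually need more handling (nested tables, pagination, and so on).

```python
from html.parser import HTMLParser

# Hypothetical page with two tables; only the one with id="stats" is wanted.
SAMPLE_HTML = """
<table id="nav"><tr><td>Home</td></tr></table>
<table id="stats">
  <tr><th>Player</th><th>Points</th></tr>
  <tr><td>Player A</td><td>12</td></tr>
</table>
"""


class TableScraper(HTMLParser):
    """Collect the cell text of the table whose id matches table_id."""

    def __init__(self, table_id):
        super().__init__()
        self.table_id = table_id
        self.in_table = False
        self.in_cell = False
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == "table" and dict(attrs).get("id") == self.table_id:
            self.in_table = True
        elif self.in_table and tag == "tr":
            self.rows.append([])
        elif self.in_table and tag in ("td", "th"):
            if not self.rows:  # cell appeared without an enclosing <tr>
                self.rows.append([])
            self.in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "table":
            self.in_table = False
        elif tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        # Only keep text that sits inside a cell of the target table.
        if self.in_table and self.in_cell:
            self.rows[-1][-1] += data.strip()


def scrape(html, table_id):
    """Return the target table as a list of rows, each a list of strings."""
    parser = TableScraper(table_id)
    parser.feed(html)
    return parser.rows


if __name__ == "__main__":
    for row in scrape(SAMPLE_HTML, "stats"):
        print(row)
```

Customising for a new site then mostly means finding the right table id (or swapping in a different matching rule) rather than rewriting the parser.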
01-27-2016 , 12:14 AM
I think Python would be a good choice. It's popular enough in the sports analysis world. I wrote something using RoR that scrapes FanGraphs, FanDuel, DraftKings, and a site that lists starting lineups. Did pretty well with it last year.
01-27-2016 , 12:27 AM
Awesome! Did you store the data in RoR structures and do analysis there, or did you put it in a database or excel?

(sorry if that's a stupid question, I'm only now realising how little I know about where data is usually stored)
01-27-2016 , 01:01 AM
If you intend to use Java, then I highly recommend jsoup: http://jsoup.org/ and Apache POI for working with Excel files from Java.

I used them for a long time, then I learnt R. It took time to learn, but it paid off since I no longer need to write a bunch of boilerplate code, i.e. model definitions for database schemas or HTML files, etc.

I really like Apache OpenOffice and I tend to use it whenever appropriate. But I find that plotting data is slow with larger data sets, so you need a decent machine. I also think you can't do 3D plotting, and the regression algorithms aren't really accurate. In all of these respects I find R to be superior.

I use gdata for xls files in R and lattice for 3D plotting. So if you decide to go for it, I recommend learning from http://adv-r.had.co.nz/ and http://www-bcf.usc.edu/~gareth/ISL/, under the assumption that you are an experienced programmer.

Last edited by bex989; 01-27-2016 at 01:06 AM.
01-27-2016 , 04:15 PM
I was the cofounder of a company that scraped oil and gas data from 30 different state websites and about 100 different pipeline websites. We used perl for everything due to the ****load of libraries available for scraping.

www::mechanize
html::parse
html::table::parse

I think we wrote about 450 scrapers total.

Python is fine for this. For sites heavy in JavaScript, Java is probably the best answer.

Last edited by _dave_; 02-09-2016 at 01:55 AM. Reason: [noparse] tags
01-29-2016 , 02:18 AM
Thanks for the advice. Using java would be more ideal since I feel like it's probably a more useful language to stay practised in, however for data analysis it seems like Python is better.

I'm currently doing a course in excel data analysis, will learn how to combine that with the scrapers and write automatic scripts (get these stats at this time etc) which I'm guessing is VBA (?), will look to using Python to supplement and maybe replace excel for my analysis, and eventually, if it's working out well, learn R.

Any critique on that plan is appreciated!
01-29-2016 , 12:15 PM
I actually just started doing this. I've been using Ruby with Nokogiri.
01-29-2016 , 01:15 PM
Quote:
Originally Posted by boganomics
Thanks for the advice. Using java would be more ideal since I feel like it's probably a more useful language to stay practised in, however for data analysis it seems like Python is better.

I'm currently doing a course in excel data analysis, will learn how to combine that with the scrapers and write automatic scripts (get these stats at this time etc) which I'm guessing is VBA (?), will look to using Python to supplement and maybe replace excel for my analysis, and eventually, if it's working out well, learn R.

Any critique on that plan is appreciated!
Python is an excellent choice as well.
I'd hate to use VBA; you can find a Python library to generate xls/xlsx files dynamically. With Java you do that with Apache POI.
Look around, as I am not that familiar with Python.
01-29-2016 , 01:49 PM
I am partial to R; Python is probably also fine. The limiting factor for you will be analysis, not scraping. I mean you can use just about anything to scrape, but for quality data analysis there are few options. Excel is not a serious tool for data analysis. For example, you won't be able to run a K-fold cross-validation with multiple models in Excel, but you can do it in R quite easily. I've read that other people have had success with scikit-learn for Python, and that may be a valid option, but I have not used it.
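
To make the K-fold idea concrete, here is a small standard-library-only sketch in Python that compares two models by cross-validated mean squared error. The data (`XS`, `YS`), the two toy models, and all function names are made up for illustration; in practice R or scikit-learn would do this with far less code.

```python
import random
from statistics import mean

# Hypothetical data: YS is roughly linear in XS plus small deterministic noise.
XS = list(range(20))
YS = [2 * x + 1 + ((x * 7) % 5 - 2) * 0.1 for x in XS]


def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and split the indices into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_mean(xs, ys):
    """Baseline model: always predict the training mean."""
    m = mean(ys)
    return lambda x: m


def fit_line(xs, ys):
    """Ordinary least-squares straight line through the training points."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)


def cross_val_mse(fit, xs, ys, k=5, seed=0):
    """Average held-out mean squared error over k folds."""
    fold_errors = []
    for test_idx in k_fold_indices(len(xs), k, seed):
        held_out = set(test_idx)
        train = [i for i in range(len(xs)) if i not in held_out]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        fold_errors.append(mean([(model(xs[i]) - ys[i]) ** 2 for i in test_idx]))
    return mean(fold_errors)


if __name__ == "__main__":
    for name, fit in [("mean", fit_mean), ("line", fit_line)]:
        print(name, cross_val_mse(fit, XS, YS))
```

Each model is fitted only on the training folds and scored on the fold it never saw, which is the part that is awkward to reproduce in a spreadsheet.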

Automating scripts should be easy enough in Windows. Seems like you could just use Task Scheduler for that. I've used databases for storing fantasy data, but I actually prefer using flat files (CSV) and being smart about memory within R. If you decide to use data that is voluminous (e.g., play-by-play, PitchFX) then you might be better off using databases. Either way, you're going to have to learn how to do joins since no single source of fantasy data will have all of the information you need.
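
As a rough illustration of joining flat files, the sketch below does an inner join of two CSV sources on a shared player column using only Python's standard library. The column names, player names, and numbers are all invented, and the CSV text is inline so the example runs without any files.

```python
import csv
import io

# Hypothetical flat files from two different sources.
SALARIES_CSV = """player,salary
Player A,9000
Player B,7500
Player C,6000
"""

PROJECTIONS_CSV = """player,projected_points
Player A,42.5
Player B,30.0
Player D,25.0
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def inner_join(left, right, key):
    """Keep only rows whose key appears in both sources, merging their columns."""
    right_by_key = {row[key]: row for row in right}
    joined = []
    for row in left:
        match = right_by_key.get(row[key])
        if match is not None:
            joined.append({**row, **match})
    return joined


if __name__ == "__main__":
    salaries = read_rows(SALARIES_CSV)
    projections = read_rows(PROJECTIONS_CSV)
    for row in inner_join(salaries, projections, "player"):
        print(row)
```

Player C and Player D drop out because each appears in only one source. With real data, names often differ slightly between sites, so some normalisation of the join key is usually needed first.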
01-29-2016 , 03:36 PM
Quote:
Originally Posted by Alobar
I actually just started doing this. I've been using Ruby with Nokogiri.
This is a good option. There's also Watir for scraping sites with lots of AJAX in the page that don't play nice with Nokogiri (I'm looking at you, espn.com).
01-30-2016 , 01:17 AM
Nice, I'll have to try that Dudd.
01-30-2016 , 01:19 AM
Quote:
Originally Posted by boganomics
Awesome! Did you store the data in RoR structures and do analysis there, or did you put it in a database or excel?

(sorry if that's a stupid question, I'm only now realising how little I know about where data is usually stored)
RoR generally hooks up to a Postgres or MySQL database, and you can then write functions in Ruby to interact with it. I actually don't know how to write SQL... bad for me. I want to learn one of these days.
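
For anyone curious what the SQL side looks like without setting up Postgres or MySQL, here is a small sketch using Python's built-in `sqlite3` module with an in-memory database. The table names, columns, and rows are hypothetical; this is not the schema of the app described above.

```python
import sqlite3


def build_db():
    """Create an in-memory database with two small, made-up tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE players (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE games (player_id INTEGER, week INTEGER, points REAL)")
    conn.executemany(
        "INSERT INTO players VALUES (?, ?)",
        [(1, "Player A"), (2, "Player B")],
    )
    conn.executemany(
        "INSERT INTO games VALUES (?, ?, ?)",
        [(1, 1, 10.0), (1, 2, 14.0), (2, 1, 8.0)],
    )
    conn.commit()
    return conn


def average_points(conn):
    """Join the two tables and return (name, average points) per player."""
    query = """
        SELECT p.name, AVG(g.points)
        FROM players AS p
        JOIN games AS g ON g.player_id = p.id
        GROUP BY p.name
        ORDER BY p.name
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    conn = build_db()
    for name, avg in average_points(conn):
        print(name, avg)
    conn.close()
```

Passing a file path instead of `":memory:"` to `sqlite3.connect` would keep the data between runs.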
01-31-2016 , 07:30 AM
Good stuff, thanks a lot for all the replies