Quote:
Originally Posted by goofball
1. Awesome
2. How did you do the scraping?
3. I'd love to see a scatter plot of post # and words/post
4. Are you open to constructive criticism on graphs
So apparently you got yourself banned, but I'll answer briefly anyway.
The scraper, written in Java, is really just a series of tedious brute-force extractions from the HTML pages, based on deterministic triggers for each piece of data. Although at first glance the HTML source code for this web page looks like a monstrous pile of random ****, the forum software actually produces content that's highly structured and predictable.
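To give a rough idea of what a "deterministic trigger" extraction can look like, here's a minimal sketch (not my actual code). It assumes each piece of data sits between a fixed start marker and a fixed end marker in the page source. The class name, method names, and the markers in the usage note are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative helper: pulls text out of raw HTML using fixed string markers. */
final class TriggerExtractor {

    /**
     * Returns the text between the first startMarker found at or after `from`
     * and the next endMarker, or null if either marker is missing.
     */
    static String between(String html, String startMarker, String endMarker, int from) {
        int start = html.indexOf(startMarker, from);
        if (start < 0) {
            return null;
        }
        start += startMarker.length();
        int end = html.indexOf(endMarker, start);
        if (end < 0) {
            return null;
        }
        return html.substring(start, end);
    }

    /** Collects every span delimited by the two markers, in document order. */
    static List<String> allBetween(String html, String startMarker, String endMarker) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (true) {
            int start = html.indexOf(startMarker, pos);
            if (start < 0) {
                break;
            }
            start += startMarker.length();
            int end = html.indexOf(endMarker, start);
            if (end < 0) {
                break;
            }
            out.add(html.substring(start, end));
            pos = end + endMarker.length(); // continue scanning after this match
        }
        return out;
    }
}
```

Calling something like allBetween(html, "<div class=\"post\">", "</div>") would return the contents of every matching block. It only works because the forum software emits the same markup every time; the real markers depend on the page source.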
Ultimately, it's just a matter of automatically walking through menu pages, extracting thread URLs, and then walking through the pages of each thread extracting content. The post data is inserted into a very simple database table via a very simple DAO.
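The walk described above can be sketched roughly like this. Everything here is an assumption for illustration: the Post fields, the Site interface that hides the fetching and parsing, and an in-memory DAO standing in for the real database table.

```java
import java.util.ArrayList;
import java.util.List;

/** One scraped post; the fields are illustrative, not the real table layout. */
final class Post {
    final String pageUrl;
    final String author;
    final String body;

    Post(String pageUrl, String author, String body) {
        this.pageUrl = pageUrl;
        this.author = author;
        this.body = body;
    }
}

/** The "very simple DAO": a single insert operation. */
interface PostDao {
    void insert(Post post);
}

/** Stand-in for a database-backed DAO so the sketch runs without a database. */
final class InMemoryPostDao implements PostDao {
    final List<Post> rows = new ArrayList<>();

    @Override
    public void insert(Post post) {
        rows.add(post);
    }
}

final class Crawler {

    /** Hypothetical abstraction over fetching a page and extracting data from it. */
    interface Site {
        List<String> threadUrls(String menuUrl);
        List<String> pageUrls(String threadUrl);
        List<Post> posts(String pageUrl);
    }

    /** Walks menu pages, then threads, then thread pages, storing every post found. */
    static int crawl(Site site, List<String> menuUrls, PostDao dao) {
        int stored = 0;
        for (String menuUrl : menuUrls) {
            for (String threadUrl : site.threadUrls(menuUrl)) {
                for (String pageUrl : site.pageUrls(threadUrl)) {
                    for (Post post : site.posts(pageUrl)) {
                        dao.insert(post);
                        stored++;
                    }
                }
            }
        }
        return stored;
    }
}
```

A database-backed DAO would implement the same insert method with a single SQL INSERT; the crawl loop would not need to change.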
Although I doubt my operations would have any impact on a web-based ecosystem as gigantic as twoplustwo, I still throttled my requests to one every 2 seconds.
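For anyone curious, the throttling can be as simple as the sketch below, which is one way to do it rather than exactly what I ran. The Throttle class and its method name are made up here; the 2000 ms value in the usage note corresponds to the one-request-every-2-seconds rate.

```java
/** Enforces a minimum gap between successive calls to await(). */
final class Throttle {
    private final long minIntervalMillis;
    private long lastNanos;
    private boolean first = true;

    Throttle(long minIntervalMillis) {
        this.minIntervalMillis = minIntervalMillis;
    }

    /** Blocks until at least minIntervalMillis has passed since the previous call returned. */
    synchronized void await() {
        if (!first) {
            long elapsedMillis = (System.nanoTime() - lastNanos) / 1_000_000L;
            long waitMillis = minIntervalMillis - elapsedMillis;
            if (waitMillis > 0) {
                try {
                    Thread.sleep(waitMillis);
                } catch (InterruptedException e) {
                    // Restore the interrupt flag and return early without waiting further.
                    Thread.currentThread().interrupt();
                }
            }
        }
        first = false;
        lastNanos = System.nanoTime();
    }
}
```

Usage is just a new Throttle(2000) created once, with await() called immediately before each page fetch. The first call returns right away; every later call waits out whatever remains of the 2 seconds.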
As for the presentation, I know the charts are not stellar, but for personal project work, I'm basically stuck with an old version of Apple Numbers (part of iWork '08). It gets the job done, but apparently more recent versions have a richer set of chart features (including bubble charts, which I would have liked to apply here also).
I played around with the scatter plot, but I couldn't figure out how to label the individual points, so what we're left with is more of a holistic view of the distribution. Here's one where I added a few labels for some notable outliers. These are the same top 50 posters from earlier charts.
The Politics plot looks compressed compared to Unchained, but that seems to be due to ikestoys putting a leftward squeeze on everybody else (visually at least).