Quote:
Originally Posted by Shame Trolly !!!1!
If you don't mind, a couple of questions... (a) did you write your spider from scratch? (b) what is your spider coded in?
I had a post in the previous thread with some technical details, but the code used this quarter is all Java and all original. The only extra work this quarter was dealing with the quotes instead of just stripping them out. Most of it is inelegant and brittle, just brute-force scraping and light parsing of HTML content into a MySQL database.
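For anyone curious what "brute-force scraping and light parsing" can look like, here is a minimal sketch. It is not the actual spider code. The markup it matches (`<div class="post" data-author="...">`) and the names `PostScraper`, `Post`, and `parse` are all made up for illustration, and the database insert is left out entirely.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PostScraper {
    // Hypothetical markup: <div class="post" data-author="NAME">BODY</div>
    // A regex like this is brittle on purpose: nested divs would break it.
    private static final Pattern POST = Pattern.compile(
        "<div class=\"post\" data-author=\"([^\"]*)\">(.*?)</div>", Pattern.DOTALL);
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    static final class Post {
        final String author;
        final String body;
        Post(String author, String body) {
            this.author = author;
            this.body = body;
        }
    }

    // Pull every post out of a page, replacing inline tags with spaces
    // and collapsing runs of whitespace in the body text.
    static List<Post> parse(String html) {
        List<Post> posts = new ArrayList<>();
        Matcher m = POST.matcher(html);
        while (m.find()) {
            String body = TAG.matcher(m.group(2)).replaceAll(" ")
                             .replaceAll("\\s+", " ")
                             .trim();
            posts.add(new Post(m.group(1), body));
        }
        return posts;
    }
}
```

Each `Post` would then be written to a table, one row per post. A real HTML parser would be the sturdier choice, but the regex version is closer to the brute-force spirit described above.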
Quote:
It's not like I can't do this myself, but I'm way too lazy. Two thingies I always thought would be cool would be (a) some kinda data analysis about regs fading away, or suddenly disappearing.
That's interesting. I was thinking of something similar, but for threads instead of people: charting the "burn rate" of the hottest threads of the quarter.
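I haven't pinned down what "burn rate" should mean yet. One simple reading, assumed here, is the number of posts per day counted from the first post in a thread. A rough sketch under that assumption (the `BurnRate` class and `postsPerDay` method are hypothetical names, not existing code):

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;
import java.util.List;

final class BurnRate {
    // Returns counts[i] = number of posts made i days after the earliest post.
    // Days with no posts stay at zero, so a fading thread shows up as a tail of zeros.
    static int[] postsPerDay(List<LocalDate> postDates) {
        if (postDates.isEmpty()) {
            return new int[0];
        }
        LocalDate first = postDates.get(0);
        LocalDate last = first;
        for (LocalDate d : postDates) {
            if (d.isBefore(first)) first = d;
            if (d.isAfter(last)) last = d;
        }
        int[] counts = new int[(int) ChronoUnit.DAYS.between(first, last) + 1];
        for (LocalDate d : postDates) {
            counts[(int) ChronoUnit.DAYS.between(first, d)]++;
        }
        return counts;
    }
}
```

Plotting that array for each of the hottest threads would show how quickly each one flares up and dies down.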
Quote:
The second, (b) assumes that there is some kinda API into a plagiarism checking engine, or something equivalent, and of course computationally increases exponentially. But... I think it would be interesting comparing selected regs, and perhaps some chosen 'gimmicks', to see who might be the same... or strangely similar.
Also interesting. I previously did some analysis of "lexical similarity" within a specific thread. I didn't think of using this to match posters with gimmicks, but it's probably a step in the right direction. It seems like this would be especially difficult (and possibly futile) since a gimmick is deliberately adopting an alternate personality. But I'll bet the FBI has some kind of sophisticated software designed to tease out involuntary and subconscious tendencies that suggest shared authorship.
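To give a flavor of what a crude "lexical similarity" score could look like, here is a minimal sketch using Jaccard similarity over the sets of words two posters use. This is a stand-in chosen for simplicity, not necessarily the measure from the earlier analysis, and the `LexicalSimilarity` class and its methods are hypothetical names.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class LexicalSimilarity {
    // Lowercase the text and split on anything that is not a letter or apostrophe.
    static Set<String> vocabulary(String text) {
        Set<String> words = new HashSet<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^a-z']+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        return words;
    }

    // Jaccard similarity: shared words divided by all distinct words.
    // Returns 0.0 when both texts contain no words.
    static double jaccard(String a, String b) {
        Set<String> va = vocabulary(a);
        Set<String> vb = vocabulary(b);
        if (va.isEmpty() && vb.isEmpty()) {
            return 0.0;
        }
        Set<String> shared = new HashSet<>(va);
        shared.retainAll(vb);
        Set<String> all = new HashSet<>(va);
        all.addAll(vb);
        return (double) shared.size() / all.size();
    }
}
```

Comparing every poster against every other means n(n-1)/2 pairs, so the cost grows quadratically with the number of posters. A score like this only captures vocabulary overlap, which a gimmick could change on purpose, so it would be a starting point at best.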