Losing some data isn't a big issue (even entire days wouldn't be that bad). It's basically a first exploration. We want to calculate some pretty trivial stuff like "what percentage of tweets were in German (roughly)" or how many tweets/week contain keyword X. We also have a list of keywords/keyword combinations for which we'll manually analyze all matching tweets. I did a quick check with the data already collected and we're talking about keywords that only match ~25 tweets/day.
Mongo queries are already pretty slow (a couple of minutes for a count() filtered on one keyword over a 40GB dataset).
We're not trying to predict anything or do any fancy statistical analysis (I have already checked and can play with the data in R if I want to). It's also fine if queries take a while. It's not critical if, say, getting all tweets that match X from the DB takes a couple of days.
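To make the "trivial stuff" concrete, here's a rough sketch of the two calculations in plain Python. It assumes each tweet has already been pulled out of the DB as a dict with "lang", "text" and "created_at" keys, and that "created_at" has been parsed into a datetime. Those field names and the helper names are just my assumptions for illustration, not our actual schema.

```python
from collections import Counter
from datetime import datetime


def language_share(tweets, lang):
    """Return the percentage of tweets whose 'lang' field equals `lang`."""
    if not tweets:
        return 0.0
    hits = sum(1 for t in tweets if t.get("lang") == lang)
    return 100.0 * hits / len(tweets)


def weekly_keyword_counts(tweets, keyword):
    """Count tweets containing `keyword` (case-insensitive), per ISO week.

    Keys of the returned Counter are (iso_year, iso_week) tuples.
    Assumes 'created_at' is already a datetime (hypothetical schema).
    """
    needle = keyword.lower()
    counts = Counter()
    for t in tweets:
        if needle in t.get("text", "").lower():
            iso_year, iso_week, _ = t["created_at"].isocalendar()
            counts[(iso_year, iso_week)] += 1
    return counts
```

This is a plain substring match, so "Merkel" would also hit "Merkels"; good enough for a first look, since the matches get read by a human anyway.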
An example would be having a human wade through all German tweets for a time period that mention Merkel and manually rate them on a scale (max negative to max positive). Then we'll compare that with some sentiment algorithm.
Small stuff like that.
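We haven't decided how to do the comparison yet. One simple option (my assumption, nothing settled) would be a Pearson correlation between the human ratings and the algorithm's scores, which doesn't care that the two use different scales. A minimal sketch, with a made-up function name:

```python
from math import sqrt


def pearson(human_ratings, algo_scores):
    """Pearson correlation between two equally long lists of numbers.

    Returns a value in [-1, 1]; values near 1 mean the algorithm ranks
    tweets roughly the way the human rater did.
    """
    if len(human_ratings) != len(algo_scores) or len(human_ratings) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(human_ratings)
    mean_h = sum(human_ratings) / n
    mean_a = sum(algo_scores) / n
    cov = sum((h - mean_h) * (a - mean_a)
              for h, a in zip(human_ratings, algo_scores))
    var_h = sum((h - mean_h) ** 2 for h in human_ratings)
    var_a = sum((a - mean_a) ** 2 for a in algo_scores)
    if var_h == 0 or var_a == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / sqrt(var_h * var_a)
```

For example, the human ratings could be integers from -2 (max negative) to +2 (max positive) and the algorithm scores floats between -1 and 1; both ranges are hypothetical here.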
The main thing that we eventually want to do with the data is get a rough estimate of how worried people are about certain things. We have a standard questionnaire that is currently being developed for this and want to see if we can somehow mimic a mass poll with some sort of Twitter data analysis (basically a passive, constant questioning of the Twitter hivemind of sorts). We'll know in advance what we want to query about (basically experts construct the query and manually wade through the tweets as a first step).
Quote:
I'm kind of surprised someone isn't archiving the Twitter stream already.
It's being done; Twitter bought one of the companies that did it. They can also get more than the public 1%. You can buy access to the data; prices are unknown/"ask us" but pretty high. If we can build anything useful with the data we collect ourselves we might eventually get a quote from one of these companies. But it's very possible that our ideas are dumb in the first place :P
Last edited by clowntable; 10-17-2015 at 02:56 PM.