Quote:
Originally Posted by JaredL
An acceptable but meh solution would be for me to work out a model for each product, run it on the data up to the previous day, plug in the inputs including the overnight numbers when they become available and generate a prediction which goes to a dashboard people can pull up when they start work in the morning.
I'm not sure you realize just how much work is involved here, especially once you move to the scale of terabytes of data. You'll need to:
* Extract your data efficiently
* Use that data to generate your models
* Build a dashboard application
* Build something that ties all of these pieces together automatically, with basic scheduling and handling of job failures
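To make those four pieces concrete, here's a toy sketch using only the Python standard library. Everything in it is made up for illustration: the `history` and `predictions` tables, the single input column `x`, the one-variable least-squares "model", and the function names. A real version at terabyte scale would swap each piece for one of the proper tools, but the shape of the job is the same.

```python
import sqlite3
import time


def setup_schema(conn):
    # Hypothetical tables: past observations, plus the predictions a dashboard would read.
    conn.execute("CREATE TABLE history (product TEXT, day INTEGER, x REAL, y REAL)")
    conn.execute("CREATE TABLE predictions (product TEXT PRIMARY KEY, prediction REAL)")


def extract(conn, product):
    # Step 1: pull the (input, outcome) history for one product.
    return conn.execute(
        "SELECT x, y FROM history WHERE product = ? ORDER BY day", (product,)
    ).fetchall()


def fit(rows):
    # Step 2: fit y = intercept + slope * x by ordinary least squares.
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def run_once(conn, product, overnight_x):
    # Steps 1-3: extract, fit, predict, then store the result for the dashboard.
    intercept, slope = fit(extract(conn, product))
    prediction = intercept + slope * overnight_x
    conn.execute(
        "INSERT OR REPLACE INTO predictions VALUES (?, ?)", (product, prediction)
    )
    conn.commit()
    return prediction


def with_retries(step, attempts=3, delay_seconds=0.0):
    # Step 4 (failure handling): rerun a failed step, re-raising after the last attempt.
    for attempt in range(1, attempts + 1):
        try:
            return step()
        except Exception:
            if attempt == attempts:
                raise
            time.sleep(delay_seconds)


def nightly_job(conn, product, overnight_x):
    # What a scheduler (cron, Luigi, etc.) would call each morning.
    return with_retries(lambda: run_once(conn, product, overnight_x))
```

Even this toy version skips scheduling, logging, alerting, multiple products, and the dashboard itself, which is where most of the real work hides.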
If you were to build this using only basic languages (like Python/SQL), I'd guess we're talking at least a man-year of work. But luckily, this is a pretty common problem.
There are tools like Hadoop (with Pig, Hive, Cascading, ... on top of it), Spark, Vowpal Wabbit, etc. that will let you process large amounts of data without building the plumbing yourself.
If your data is already in a data warehouse like Redshift (or an internal one), there are visualization tools that will integrate with it directly. Even if you need to build your own dashboard-type UI, there are options with prebuilt components, like DataHero, Tableau, and dozens of others.
And in terms of tying everything together, there are tools like Luigi or Amazon's Data Pipeline that make that pretty easy.
So my point is just that you're trying to tackle a pretty massive problem. And you're definitely not going to solve it well by reinventing the wheel at any point.
In terms of things you could do, I think you have a few good options:
1. Learn Python. It's a useful language and is a core part of a bunch of the tools that I mentioned above. This isn't really a direct step toward building the solution you described, but it will be helpful.
2. Figure out what tools your company is using now for managing/working with data. Learn how to use those.
3. If there's nothing promising in 2, try to figure out if there's one useful tool you can hook up to the data. Pick one thing that's hard to do and research a way to make it easier using existing tools. Try some out, and if you can show that something has value, try to get it in place at your company.
It's hard to give really concrete advice about what to do because so much depends on what data and tools you already have available to you and how much freedom you have to try new things.