Programming for Data Analysis

12-27-2014 , 05:34 PM
Currently I'm a research analyst for a fairly large corporation in the transportation industry. I am helping to build an analytics product, which would give customers information on supply and demand in different markets, provide price forecasting etc. Despite the obvious promise of this project, which frankly should have come out several years ago, we are being held up by corporate bureaucracy. The most important consequence of this is that we won't have developers assigned to our team to build the software/app for a few months.

In the meantime, I've been digging through the data, creating metrics etc. I'd like to use this relative downtime to improve my skills. More specifically, I'd like to learn to program using languages, libraries and software that would allow me to improve my ability to analyze data, including and especially large datasets. Right now we have 15+ billion transaction records, which is cumbersome enough, but things could explode soon as we'll be adding GPS tracking data and social-network type stuff.

I do almost all of my work using some combination of SQL, Stata and VBA programming in Excel. I also begrudgingly use Tableau for visualization stuff. Not only would I like the ability to do bigger things, use machine learning and so forth, I'd also like to improve my ability to work independently. I have a couple of app or consulting ideas for a startup and would love to be able to handle all of the data stuff myself. Even if I don't go that route, it'd be better at my current job or similar corporate jobs not to need as much help from DBAs.

Where would you recommend I go from here? My first plan is to take the first two DS classes at Coursera and start learning R. I'm not really sure where to go from there, but was thinking of starting to learn either Python or JavaScript. The good news about my current situation is that I can do a lot of this stuff at work if I can show some benefit to the future product, which will be quite easy if I'm able to do stuff with data I couldn't do before. On a related note, it will be easy for me to find problems to solve.

I've read through the DS bootcamp threads, but they seem far more advanced given my lack of programming ability. Any suggestions appreciated.
12-28-2014 , 10:57 AM
For your needs I think both Python and R would be good choices. JavaScript not so much.
12-29-2014 , 11:32 AM
How advanced is your Excel VBA and SQL? Have you been able to connect to your sql databases through Excel VBA via ADO and from there build automated reports/dashboards?

If you don't know Python, take an intro class online (there must be dozens by now) and then I would recommend getting Wes McKinney's Python for Data Analysis book. It's a bit dated but will give you an excellent overview of the pandas package, Python's data munging hammer. You've read my thread where I've outlined a curriculum of sorts so I won't repeat myself further.

You sound like you are in a similar position to where I was about 6 months ago, with fewer programming skills. One thing I wish I had done better was manage my projects. I think this is a very underrated skill for making truly awesome projects come to life.
12-29-2014 , 02:26 PM
It sounds like your data volume is going to be such that I wouldn't worry too much about something like R.

Your time is probably better spent on big data tools rather than learning basic languages. Something like Spark, Pig, Hive or similar will probably be more useful in this case.

Python isn't a bad choice because it's useful by itself and with a number of useful big data tools / libraries.
12-29-2014 , 03:07 PM
Quote:
Originally Posted by Greeksquared
How advanced is your Excel VBA and SQL? Have you been able to connect to your sql databases through Excel VBA via ADO and from there build automated reports/dashboards?
Thanks for the feedback.

Based on the second question the answer to the first is not very. I've never tried to do that sort of thing.

An example of a pretty typical project I've done in VBA and Stata is calculating the customer lifetime value of a subscription service. There are various price levels with different cancellation rates, which vary based on time as a customer. Customers also move from one level to another and this migration rate also varies based on how long they've been around and if they've changed products in the past.

To get at this fun little problem I pulled subscription data from the SQL database using Oracle SQL Developer, cleaned and organized it in Excel using macros and some VBA (mainly to keep Excel from choking), then put it in Stata to run logit models to get transition probabilities. I made the simplifying assumption that at some point cancellation/migration rates stabilize (a 10 year customer is no less likely to cancel than a 9 year customer), which makes it easy to calculate a final value for some point well into the future. From there I used backward induction to solve for the present value of a customer who came in at price x, is now at price y, changed price level m months ago and has been here for n months total. This was done in VBA with transition probabilities, prices and the discount rate as inputs.
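
A rough Python sketch of the backward-induction idea described above, for illustration only. The prices, discount rate, stable tenure and the `transition_probs` function are invented placeholders (real probabilities would come from the logit models), and the state is simplified to just price level and months of tenure.

```python
# All numbers here are made up for illustration.
PRICES = {"basic": 10.0, "premium": 25.0}
MONTHLY_DISCOUNT = 1 / 1.01   # assumed discount rate of 1% per month
STABLE_TENURE = 24            # months after which rates are assumed constant


def transition_probs(level, tenure):
    """Hypothetical stand-in for fitted logit output.

    Returns {next_level: probability}; the missing mass is cancellation.
    """
    t = min(tenure, STABLE_TENURE)
    cancel = 0.02 + 0.08 * (1 - t / STABLE_TENURE)  # churn falls with tenure
    if level == "premium":
        cancel *= 0.5
    switch = 0.03
    other = "premium" if level == "basic" else "basic"
    return {level: 1 - cancel - switch, other: switch}


def expected_value(level, tenure, next_values):
    """This month's price plus the discounted expected value next month."""
    probs = transition_probs(level, tenure)
    return PRICES[level] + MONTHLY_DISCOUNT * sum(
        p * next_values[nxt] for nxt, p in probs.items()
    )


def terminal_values(tol=1e-10):
    """Value at the stable tenure, found by fixed-point iteration."""
    values = {level: 0.0 for level in PRICES}
    while True:
        new = {lvl: expected_value(lvl, STABLE_TENURE, values) for lvl in PRICES}
        if max(abs(new[lvl] - values[lvl]) for lvl in PRICES) < tol:
            return new
        values = new


def customer_values():
    """Backward induction from the stable tenure down to month 0."""
    value = {STABLE_TENURE: terminal_values()}
    for tenure in range(STABLE_TENURE - 1, -1, -1):
        value[tenure] = {
            lvl: expected_value(lvl, tenure, value[tenure + 1]) for lvl in PRICES
        }
    return value


clv = customer_values()
print(round(clv[0]["basic"], 2), round(clv[STABLE_TENURE]["basic"], 2))
```

Tracking "months since last price change" as well would just add another dimension to the state; the recursion itself stays the same.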

That explanation got out of control, but the short of it is that I'm proficient using VBA to clean data, manipulate arrays, make calculations I need etc. but I haven't used it to pull in data from databases. I have written code that scrapes data from web pages, but that's pretty simple.
12-29-2014 , 07:47 PM
Sounds like the above problem is ripe for a complete stand-alone application built inside Excel. You can build some decent looking UIs with Excel and automate the back-end data pulling through VBA. Personally, I think these applications provide the most value, especially if they can be distributed to management or eventually turned into a web-app. If you do have start-up ideas you will have to be able to deliver a product and tie all your technologies together, and this project seems like a good place to start.
12-29-2014 , 07:56 PM
How do you process 15 billion records in Excel?
12-29-2014 , 08:17 PM
Quote:
Originally Posted by Greeksquared
Sounds like the above problem is ripe for a complete stand-alone application built inside Excel. You can build some decent looking UIs with Excel and automate the back-end data pulling through VBA. Personally, I think these applications provide the most value, especially if they can be distributed to management or eventually turned into a web-app. If you do have start-up ideas you will have to be able to deliver a product and tie all your technologies together, and this project seems like a good place to start.
This is a great suggestion, thanks!
Quote:
Originally Posted by jjshabado
How do you process 15 billion records in Excel?
I don't, obviously. Excel limitations are a major reason I want to learn how to do other stuff. If I do something in Excel, or Stata for that matter, I will pull summary data first using SQL.

I only gave the number of records to give a feel for the size of the datasets I could be working with.
12-29-2014 , 11:26 PM
Ok, sure. I guess it depends on what you want to do. If you want to focus on actually analyzing the data, I think you should pick a tool/language that is going to handle the volume you have. Otherwise, any time you want to do something new (or even just refresh data for existing things you've built) you have a fairly tedious extraction process.

If it's easy to get your summary data and you're more worried about how to display that data, then maybe something else makes sense. There are a lot of visualization/data analysis tools out there though.

But any solution that involves pulling data out with SQL into a custom Excel application is going to be both inefficient and much crappier than existing tools. Just as one example take a look at what you can do with something like http://www.looker.com.
12-30-2014 , 01:20 AM
Wait, why are you pulling it into Excel instead of just doing everything in SQL Developer? Before reading the OP I was going to come in here and recommend just downloading Postgres or some other DB (if your IT would let you). But it seems like you have access to a database.
Why are you using Excel to clean your data? You should definitely be using SQL for that.
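
As a rough illustration of cleaning in SQL rather than in Excel, here is a small sketch driven from Python. It uses an in-memory SQLite database as a stand-in for the real Oracle one, and the `raw_subs` table, its columns and the sample rows are all made up.

```python
import sqlite3

# In-memory SQLite stands in for the real database (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_subs (customer_id TEXT, plan TEXT, price TEXT)")
conn.executemany(
    "INSERT INTO raw_subs VALUES (?, ?, ?)",
    [
        (" c001 ", "Basic", "10.00"),
        ("C001", "basic ", "10.00"),  # same record once whitespace/case are fixed
        ("c002", "PREMIUM", ""),      # missing price
        ("c003", "premium", "25.5"),
    ],
)

# Trim whitespace, normalise case, turn empty prices into NULL,
# convert prices to numbers and drop exact duplicates, all in one statement.
CLEAN_SQL = """
CREATE TABLE clean_subs AS
SELECT DISTINCT
    UPPER(TRIM(customer_id)) AS customer_id,
    LOWER(TRIM(plan)) AS plan,
    CAST(NULLIF(TRIM(price), '') AS REAL) AS price
FROM raw_subs
"""
conn.execute(CLEAN_SQL)

clean_rows = conn.execute(
    "SELECT customer_id, plan, price FROM clean_subs ORDER BY customer_id"
).fetchall()
for row in clean_rows:
    print(row)
```

The same `TRIM`/`NULLIF`/`CAST` pattern exists in most SQL dialects, though function names and type names can differ slightly between databases.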

It's hard for me to tell what you want to use a programming language for. What would you want to do with it? Different programming languages are good for different things.

If I were to blindly recommend one I would go with Python: it's the most noob-friendly and has a lot of libraries.
12-31-2014 , 03:48 AM
Thanks for the feedback guys. Let me try to clarify the kind of things I'd like to be able to do.

First off, my background is as an Economist. So I'm quite familiar with regression models including time series, logit etc. My focus in school was in micro/game theory and I have an undergrad degree in math so I would say my strength relative to most economists is in model building and overall analytical skills and I'm relatively weaker at knowing the ins and outs of various econometric models. I've done some machine-learning stuff, but just using Weka and during a brief SAS trial. Other than clustering, ML models are black boxes to me.

At my current job, my biggest limitation by far is automation. I go through a rather clumsy process extracting and cleaning the data and importing it into Stata or even Excel. I feel pretty good about my ability to write .do files in Stata, which allows me to run a small number of models on various subsets of data, but even then there'd need to be something on the back end that would push the numbers out and do something with them. I'd like to be able to pull data, process it, analyze it and write the results to a database in one fell swoop so that when something works, it's there ready to be used.
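
A minimal Python sketch of that "pull, process, analyze, write back" loop in a single script. It uses an in-memory SQLite database so it runs anywhere; the `sales` and `forecasts` tables, their columns and the three-day-average "model" are placeholders, not a real schema or method.

```python
import sqlite3
from statistics import mean


def run_pipeline(conn):
    # 1. Pull: let the database do the heavy aggregation.
    rows = conn.execute(
        "SELECT product, sale_date, SUM(amount) FROM sales "
        "GROUP BY product, sale_date ORDER BY product, sale_date"
    ).fetchall()

    # 2. Process: collect daily totals per product, in date order.
    daily = {}
    for product, _sale_date, total in rows:
        daily.setdefault(product, []).append(total)

    # 3. Analyze: placeholder model, the mean of the last three days.
    results = [(product, mean(totals[-3:])) for product, totals in daily.items()]

    # 4. Write back: store results where a dashboard could read them.
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "CREATE TABLE IF NOT EXISTS forecasts "
            "(product TEXT PRIMARY KEY, forecast REAL)"
        )
        conn.executemany("INSERT OR REPLACE INTO forecasts VALUES (?, ?)", results)
    return results


# Made-up sample data standing in for the real transaction table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, sale_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("A", "2014-12-01", 10.0),
        ("A", "2014-12-01", 5.0),
        ("A", "2014-12-02", 20.0),
        ("B", "2014-12-01", 7.0),
    ],
)
results = run_pipeline(conn)
print(conn.execute("SELECT * FROM forecasts ORDER BY product").fetchall())
```

Swapping SQLite for a production database would mostly mean changing the connection line and any dialect-specific SQL; the overall shape of the script stays the same.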

Here's an example of a current project. In addition to tracking sales throughout the day, we get a significant number of automated, overnight transactions. I'd like to predict each day's sales numbers for each product based on history and a variety of inputs, including the overnight numbers from the previous night. An acceptable but meh solution would be for me to work out a model for each product, run it on the data up to the previous day, plug in the inputs including the overnight numbers when they become available and generate a prediction which goes to a dashboard people can pull up when they start work in the morning. A better solution would be all of that but with intelligence built in so that it chooses the best from several models, at least periodically. Either way, it would be better for work, and far better for me down the road, to be able to do all of these data-handling steps myself instead of relying on a developer to create a solution that links whatever models I come up with to the relevant databases.
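
One simple way to "choose the best from several models" is to backtest each candidate on recent days and keep the one with the lowest error. The sketch below assumes three toy forecasters and a made-up sales series; it ignores the overnight numbers and other inputs, which a real model would include.

```python
from statistics import mean


def last_value(history):
    return history[-1]


def moving_average(history, window=7):
    return mean(history[-window:])


def same_day_last_week(history):
    return history[-7]


# Hypothetical candidate models; real ones would be regressions, ARIMA, etc.
CANDIDATES = {
    "last_value": last_value,
    "moving_average": moving_average,
    "same_day_last_week": same_day_last_week,
}


def backtest_mae(model, series, holdout=14):
    """Mean absolute error of one-step-ahead forecasts over the last `holdout` days."""
    errors = []
    for t in range(len(series) - holdout, len(series)):
        forecast = model(series[:t])  # only data before day t is visible
        errors.append(abs(forecast - series[t]))
    return mean(errors)


def pick_best(series, holdout=14):
    scores = {name: backtest_mae(m, series, holdout) for name, m in CANDIDATES.items()}
    best = min(scores, key=scores.get)
    return best, scores


# Made-up daily sales with a strong weekly pattern (5 weeks).
series = [100, 120, 130, 125, 140, 60, 50] * 5
best, scores = pick_best(series)
tomorrow = CANDIDATES[best](series)
print(best, scores, tomorrow)
```

Re-running `pick_best` on a schedule (say weekly) would give the "at least periodically" re-selection described above.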

Down the road, I'd like to be able to put together something like Amazon and Netflix do suggesting other things you might like, based on your history. Right now I feel like if you gave me a customer's history, and a fair bit of time, I could find similar customers to them and come up with something halfway decent but that's several galaxies away from having a system in place which would work automatically, generating suggestions for each customer quickly enough to be useful.
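
The "find similar customers and suggest what they bought" idea can be sketched in a few lines. This is a toy nearest-neighbour approach using Jaccard similarity on purchase sets; the customer names and items are invented, and it is not how Amazon or Netflix actually do it.

```python
from collections import Counter


def jaccard(a, b):
    """Share of items two customers have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(customer, histories, k=2, n=3):
    """Suggest up to n items bought by the k most similar customers."""
    mine = histories[customer]
    neighbours = sorted(
        (c for c in histories if c != customer),
        key=lambda c: jaccard(mine, histories[c]),
        reverse=True,
    )[:k]

    # Each neighbour votes for items the customer lacks, weighted by similarity.
    votes = Counter()
    for c in neighbours:
        weight = jaccard(mine, histories[c])
        for item in histories[c] - mine:
            votes[item] += weight
    return [item for item, _ in votes.most_common(n)]


# Made-up purchase histories.
histories = {
    "ann": {"a", "b", "c"},
    "bob": {"a", "b", "c", "d"},
    "cat": {"a", "b", "e"},
    "dan": {"x", "y"},
}
suggestions = recommend("ann", histories)
print(suggestions)
```

Comparing one customer against every other customer is fine for a demo but will not scale to millions of customers; that gap is exactly the "several galaxies away" part, and it is where precomputed similarities or dedicated libraries come in.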

I realize I have a lot to learn to be able to tackle these things; I'm looking for guidance on how to get started.
12-31-2014 , 10:02 AM
Quote:
Originally Posted by JaredL
An acceptable but meh solution would be for me to work out a model for each product, run it on the data up to the previous day, plug in the inputs including the overnight numbers when they become available and generate a prediction which goes to a dashboard people can pull up when they start work in the morning.
I'm not sure if you realize just how much work is in here. Especially once you move to the scale of terabytes of data. You'll need to:

* Extract your data efficiently
* Use that data to generate your models.
* Build a dashboard application
* Build something that ties all of these pieces together in an automated way that will handle some basic scheduling and handling of job failures.

If you were to build this just using basic languages (like Python/SQL) I'd guess we're talking at least a man-year's worth of time. But luckily, this is a pretty common problem.

There are languages/tools like Hadoop (Pig, Hive, Cascading, ...), Spark, Vowpal Wabbit, etc. that will let you process large amounts of data easily.

If your data is already in a Data Warehouse like Redshift (or an internal one) there are tools that you can use for visualization that will integrate with your data. Even if you need to build your own dashboard type UI there are options out there with prebuilt stuff like DataHero, Tableau, and dozens of other options.

And in terms of tying everything together there are tools like Luigi or Amazon's Data Pipeline that make that pretty easy.
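
To give a feel for what those orchestration tools automate, here is a bare-bones, standard-library-only sketch of running steps in order with retries on failure. It does not use Luigi's or Data Pipeline's actual APIs; the step functions and the simulated failure are made up.

```python
import time


def run_with_retries(name, func, retries=3, delay_seconds=0.0):
    """Call func, retrying on any exception up to `retries` attempts."""
    for attempt in range(1, retries + 1):
        try:
            return func()
        except Exception as exc:
            print(f"{name}: attempt {attempt} failed: {exc}")
            if attempt == retries:
                raise  # give up and surface the failure
            time.sleep(delay_seconds)


def run_pipeline(steps, retries=3):
    """Run steps in order, passing each step's result to the next one."""
    result = None
    for name, step in steps:
        result = run_with_retries(name, lambda: step(result), retries)
    return result


# Hypothetical steps; the first one fails once to simulate a flaky database.
attempts = {"extract": 0}


def extract(_):
    attempts["extract"] += 1
    if attempts["extract"] == 1:
        raise ConnectionError("database busy")
    return [3, 1, 2]


def model(rows):
    return sum(rows) / len(rows)


def publish(forecast):
    return f"forecast={forecast:.1f}"


output = run_pipeline([("extract", extract), ("model", model), ("publish", publish)])
print(output)
```

Real tools add the parts that are tedious to get right by hand: dependency graphs, scheduling, skipping work that is already done, and alerting when a job ultimately fails.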

So my point is just that you're trying to tackle a pretty massive problem. And you're definitely not going to solve it well by reinventing the wheel at any point.

In terms of things you could do I think you have a few good options:

1. Learn Python. It's a useful language and is a core part of a bunch of the tools that I mentioned above. This isn't really a direct step to trying to build the solution you described above, but it will be helpful.

2. Figure out what tools your company is using now for managing/working with data. Learn how to use those.

3. If there's nothing promising in 2, try to figure out if there's one useful tool you can hook up to the data. Figure out one thing that's hard to do and research a way to make that easier using existing tools. Try some out, and if you can show something has value, try to get it in place at your company.

It's hard to give really concrete advice about what to do because so much depends on what data and tools you already have available to you and how much freedom you have to try new things.
01-01-2015 , 04:30 AM
If you go the Python route, getting it set up on Windows may be cumbersome.

Take a look at https://store.continuum.io/cshop/anaconda/
01-10-2015 , 05:01 PM
I also use Python/Anaconda/PyCharm and cannot recommend the pandas/NumPy/SciPy suite of tools enough for these types of tasks. I've written some additional wrappers for the standard suite to connect to our standard data stores to make interacting with them a breeze, and my life has gotten much better since moving from MATLAB to Python.

That said, pretty much everything I do with these tools is confined to 15 million records or fewer and with no real dashboard type needs. I know pandas is designed to deal with large datasets, but I can't speak to how well it would work when dealing with the challenges specific to those types of domains.

Also, is there someone at your company that knows how to handle large amounts of data well, and can teach you about available tools? It doesn't feel like you have the knowledge base of someone who should be making these decisions at a large company (not that I do either).
01-15-2015 , 03:48 AM
Quote:
Originally Posted by CallMeIshmael
Also, is there someone at your company that knows how to handle large amounts of data well, and can teach you about available tools? It doesn't feel like you have the knowledge base of someone who should be making these decisions at a large company (not that I do either).
Thanks for the feedback. Python is definitely first on the list of things to learn.

I don't at all disagree with the quoted paragraph, but the answer is not really. The company is slow to make changes and especially hesitant to invest heavily in R&D, technology etc. People often liken working there to working for the government. Unlike the government, I'm pretty much the only data guy there at all. There are some plans to hire a developer with experience working with large data sets and doing statistical stuff, but my guess is that won't happen for at least several months.

It's often frustrating, but I'm seeing this as an opportunity to gain a lot of skills on the job. It would be great a year from now to be in a position where I can do a lot of the processing myself, including automating bringing in new data, re-running models and spitting out results.
05-07-2020 , 06:47 AM
Quote:
Originally Posted by all_by_myself
I'm interested in data science and stats. I started with Coursera and Lynda. Then I planned to get a certificate, but I'm not sure, because certifications cost a lot and are they really worth the money? And EVEN if I pass the tests and prepare for my next job interview, will employers actually care whether I have those certificates or not? I doubt it.

Alex
Hi Alex,
You may check the MicroMasters Program in Statistics and Data Science at MIT or the NYU Center for Data Science. I think these might be good options for you.
05-17-2020 , 12:48 AM
I would go for Python first instead of R.
There is more learning material for Python (including big data etc.), it's good for working with SQL and it's very useful in general.
05-27-2020 , 03:16 AM
IDK why someone had to resurrect this thread instead of making a new one.

Anyways, there is now software which does what OP wants, e.g., Tableau.
06-16-2020 , 08:19 AM
Quote:
Originally Posted by altenburgerPhD
Hello, I'm working in the lab and planning to become a Certified Data Scientist. Do you know any online programs for programming in data analysis? What are the cheapest?
Best,
Adrian
Hello Adrian,

Some Australian universities (UTS and The University of Sydney) offer Programming for Data Analysis courses, but they are on campus. But you may also check Harvard courses (online) here:
https://online-learning.harvard.edu/.../data-analysis