This story tackles a very common question in the data science space. I'm not advertising any commercial tools or packages here, but the views are my own anyway, based on personal experience, so feel free to judge me.
The first and most common question when getting into data science in practice is which tools and programming language to use.
Get Free!
Setting aside commercial tools like SAS & MATLAB: open source projects have strong, very active communities working on them and are extremely popular. They offer more functionality, give you complete freedom, and evolve much faster than tools like SAS, thanks to frequent releases and state-of-the-art packages.
They are arguably the best choice for data science. Plus they're free, which is also nice: since your choice costs you nothing, you don't have to agonize over it. You can switch tools whenever you want, or use a combination of them.
Please, this is very important, keep this comic in mind:
So what is the “best” toolset? R? Python? Julia? Assuming you have complete freedom of choice, the answer is pretty straightforward: IT DOES NOT MATTER AT ALL. Tools are nothing more than tools, and they should be chosen with the purpose in mind. R and Python are the most common choices, and they have pretty much the same functionality and features. Having used both in various contexts, my opinion, in short points, is:
If your IT department is familiar with a tool, deployment to production will be easier if you develop your model in that particular tool. Most enterprise infrastructures already have Python jobs running, so interfacing with your Python model, or integrating it into an existing process, will probably be easier.
If you have a “traditional” Computer Science background, built around Java, C & similar languages, Python is a great choice because of its mature object-oriented programming features.
Coming from Statistics? Chances are you already have some R skills. Go for it! Despite the recent hype around Python, R is still an excellent framework for data science and has the same set of capabilities. Yes, you can use Spark, Keras and TensorFlow with R; you can deploy your model as a REST API, use version control, work with databases and “big data”, scrape the web, etc. (Well, you guessed it: huge R fan over here, and this is starting to look like an awful sales pitch …)
More of a GUI person who prefers point-and-click over code? R has many easy-to-use graphical add-ins for all kinds of tasks, from data management to ML & modeling. The excellent Flow UI from H2O.ai is also worth mentioning for bundling an AutoML engine into a web GUI.
Looking for faster training? scikit-learn, and the data-management formalism it enforces (numerical variables only, which implies one-hot encoding or embedding categorical features), tends to be slightly faster. Parallelism (distributing computation across many CPU cores) is also a tiny bit trickier in R than in Python.
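To make that formalism concrete, here is a minimal scikit-learn sketch: a pipeline that one-hot encodes a categorical column before training, with `n_jobs=-1` spreading the work across CPU cores. The dataset and column names are invented purely for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data; column names are made up for this example.
df = pd.DataFrame({
    "color": ["red", "blue", "red", "green", "blue", "green"],
    "size": [1.0, 2.5, 3.0, 1.5, 2.0, 3.5],
    "label": [0, 1, 0, 1, 0, 1],
})

pipe = Pipeline([
    # scikit-learn models need numbers only: one-hot encode the
    # categorical column, pass the numeric column through untouched.
    ("encode", ColumnTransformer(
        [("onehot", OneHotEncoder(handle_unknown="ignore"), ["color"])],
        remainder="passthrough",
    )),
    # n_jobs=-1 builds the trees on all available CPU cores.
    ("model", RandomForestClassifier(n_estimators=50, n_jobs=-1,
                                     random_state=0)),
])

pipe.fit(df[["color", "size"]], df["label"])
preds = pipe.predict(df[["color", "size"]])
```

In R, by contrast, most modeling functions accept factors directly and handle the encoding for you, which is part of why the scikit-learn approach feels stricter but can run faster.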
So just make your choice based on whatever lets you get stuff done quickly and easily. Be lazy, and keep in mind that you can always go back and learn another language. Also, with all the resources and awesome communities on the interwebs, learning and using multiple languages at the same time has never been so easy.
NB: thanks to the folks at RStudio, R now has a real, fully functional interface to Python (the reticulate package). That means you can load your dataset with pandas, do all your data management with dplyr and the tidyverse, visualize the data with ggplot, and then fit a scikit-learn model or call TensorFlow, all in the same R script. It's not going to end the eternal debate, but check it out!
Please share/comment :)