On Data Science Tools

Data science is a modern field at the intersection of computer science, statistics, and application domains. One of its most important parts is extracting insights from data. Therefore, a data scientist needs to be able to choose the right tools from a vast number of possibilities.

Computer scientists tend to claim that the tools they use are more sensible than others (I am a computer scientist and did that for a long time, too). This is, of course, nonsense. However, it also shows in everyday life: each computer scientist solves tasks with virtuosity in their everyday toolkit, even if another toolkit would fit the problem better or be more efficient. This is surely a lock-in effect, and we can only counter it by staying open-minded and taking a deep look at new tools from time to time, which increases our average virtuosity across various toolkits.

If we look at what I am using today (after several years in computer science), we see a quite diverse tool set with quite diverse objectives. In this text, I will concentrate on when I would choose each tool for a task. This is very subjective, but it might still help you decide which tools you want to take a closer look at.

Java

Java is a widely used programming language and can be used to build quite stable software systems. Some of the most important data science software stacks are built on Java or the Java Virtual Machine, including Apache Cassandra, Spark, Storm, HBase, and many others. What is very positive about Java is its clean, object-oriented approach, its platform-independent bytecode compilation, and (most importantly) its wide user base. There are some things I don't like about Java for data science. Nevertheless, it is important enough that every data scientist should have decent knowledge of and skill in programming Java.

Personal Usage: I use Java if and only if a large, distributed software system is to be deployed and a lot of work has already been done (e.g., extending cloud databases or writing software for Apache Spark or Apache Storm).

Python

Python is a scripting language with a quite clean design and a lot of functionality available through integrated package managers. In terms of capabilities, it is one of the most mature systems for data science. However, it is designed as a general-purpose scripting system, and data science is not at the heart of it. This can lead to situations in which several packages are not actually compatible with each other and a lot of data and format transformations have to be done.

Personal Usage: I choose Python over other systems whenever I need to bring algorithms onto the web. These algorithms, however, are never implemented in Python itself, as Python would be too slow for that. They come in as Python packages or as my own C++ implementations.

R

The R language is the most classical data science language. It started as an open-source statistics tool and follows many ideals of the statistics community. For example, more or less every R package providing an algorithm (at least on CRAN) contains a paper (including references, algorithm descriptions, limitations, etc.) on what the algorithm does exactly and how it has been implemented. Additionally, almost all R packages contain the source code and are extremely portable (you can use them on Windows without any hassle). So if anything is unclear, we have both scientific documentation (the paper, etc.) and source code (for details omitted in the paper). This is something that is often missing in Python or similar communities: there is a vague description of what a package or piece of software does, but it is not provided in a science-friendly style.

Just like Python, R is a Swiss Army knife: extremely powerful, but in itself quite slow. Therefore, serious extensions are written in C, C++, or Fortran. In particular, the package Rcpp provides a very flexible and easy way to include high-performance functions (e.g., the most important part of a data analysis) directly in an R script or in a package.

Octave

Octave is an open-source software package modeled after the famous MATLAB software. As such, it provides numerical computations (e.g., all sorts of matrix computations) in a high-level language. It is quite efficient at such tasks and provides plotting capabilities and a large number of numerical algorithms. It is worth looking at for data science in data spaces that have a natural representation as a matrix, such as computer vision or time series analysis. It is also very easy to extend.

Modern C++

The C language has always been a language of extremes. On the one hand, it is at the heart of a great many software systems. It is fast, easy to use, expressive, targets more or less any hardware and software system, and is portable (including Linux, Windows, Mac, Android, iOS, and embedded platforms). On the other hand, it is not easy to write secure software in C in the age of Internet connectivity, and the available libraries are often low level.

This situation has motivated a lot of other approaches to programming, including Java. However, it has also motivated people to create a better C experience. Unfortunately, this better C experience is not known by enough people. So students should take a close look at Modern C++, starting with C++11. There are hardly any raw pointers (though you can use them if you need them), memory leaks are largely avoided (as you won't allocate and deallocate memory yourself unless you explicitly need to), and there is support for most modern language constructs and ideas, along with a lot of flexibility. And best of all, C++ can be used inside any of the previous software systems. So if you design an algorithm in C++ using some generic programming, you can directly integrate it into R (using Rcpp and some sugar, for example), into Python (by writing a set of C functions exporting the functionality to Python or by using Boost.Python), and into Java via the Java Native Interface (JNI). And all modern technologies are directly accessible, including GPU computing, in-memory parallelism, message-passing parallelism for cluster applications, and more.
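To make this a bit more concrete, here is a minimal sketch of what such code can look like. It is only an illustration under my own assumptions: the names mean, make_ramp, and series_mean are hypothetical and do not come from any library. It shows a generic algorithm written as a template, memory that is owned by std::vector instead of being allocated and freed by hand, and a thin C-compatible wrapper of the kind that could be exported to other environments.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Generic algorithm: arithmetic mean of any container of numeric values.
// Works with std::vector, std::array, std::list, ... (generic programming).
template <typename Container>
double mean(const Container& values) {
    assert(!values.empty() && "mean of an empty container is undefined");
    const double sum = std::accumulate(values.begin(), values.end(), 0.0);
    return sum / static_cast<double>(values.size());
}

// Hypothetical helper: builds the series 1, 2, ..., n.
// The vector owns its memory and releases it automatically,
// so there is no new/delete and nothing to leak.
std::vector<double> make_ramp(std::size_t n) {
    std::vector<double> v(n);
    std::iota(v.begin(), v.end(), 1.0);
    return v;  // moved (or elided), not copied
}

// Thin C-compatible wrapper (hypothetical name). Plain C functions like
// this are the common denominator that other environments can call.
// Returns 0.0 for empty or null input (a choice made for this sketch).
extern "C" double series_mean(const double* data, std::size_t n) {
    if (data == nullptr || n == 0) {
        return 0.0;
    }
    const std::vector<double> copy(data, data + n);
    return mean(copy);
}
```

Calling mean(make_ramp(4)) averages the values 1, 2, 3, 4. The raw pointer appears only at the C boundary, where it is needed for interoperability; everything behind it uses standard containers.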

Summary

As you can see from the tools described here (there are many more tools and environments out there, so keep your eyes open), there is a large body of tools, each of which has a special area where it is best applied. Therefore, you should actively use all of them, so that after a while you can choose the best environment for a given task from experience. And what you should really learn and accept is that the C++ language, though it is possible to create bad software with it, has been modernized into a language that is full of features and that lets you provide your algorithms on all major platforms. In this list, there is no platform incompatible with C++, but you can't extend R with Java in a reasonable way. Hence, if you are a student or a data scientist willing to learn something useful, learn C++.