
Why Python for Data Analysis?
For many people (myself among them), the Python language is easy to fall in love with.
Since its first appearance in 1991, Python has become one of the most popular dynamic,
programming languages, along with Perl, Ruby, and others. Python and Ruby have
become especially popular in recent years for building websites using their numerous
web frameworks, like Rails (Ruby) and Django (Python). Such languages are often
called scripting languages as they can be used to write quick-and-dirty small programs,
or scripts. I don’t like the term “scripting language” as it carries a connotation that they
cannot be used for building mission-critical software. Among interpreted languages
Python is distinguished by its large and active scientific computing community. Adop-
tion of Python for scientific computing in both industry applications and academic
research has increased significantly since the early 2000s.
For data analysis and interactive, exploratory computing and data visualization, Python
will inevitably draw comparisons with the many other domain-specific open source
and commercial programming languages and tools in wide use, such as R, MATLAB,
SAS, Stata, and others. In recent years, Python’s improved library support (primarily
pandas) has made it a strong alternative for data manipulation tasks. Combined with
Python’s strength in general purpose programming, it is an excellent choice as a single
language for building data-centric applications.
Python as Glue
Part of Python’s success as a scientific computing platform is the ease of integrating C,
C++, and FORTRAN code. Most modern computing environments share a similar set
of legacy FORTRAN and C libraries for doing linear algebra, optimization, integration,
fast fourier transforms, and other such algorithms. The same story has held true for
many companies and national labs that have used Python to glue together 30 years’
worth of legacy software.
Most programs consist of small portions of code where most of the time is spent, with
large amounts of “glue code” that doesn’t run often. In many cases, the execution time
of the glue code is insignificant; effort is most fruitfully invested in optimizing the
computational bottlenecks, sometimes by moving the code to a lower-level language
like C.
In the last few years, the Cython project (http://cython.org) has become one of the
preferred ways of both creating fast compiled extensions for Python and also interfacing
with C and C++ code.
Solving the “Two-Language” Problem
In many organizations, it is common to research, prototype, and test new ideas using
a more domain-specific computing language like MATLAB or R then later port those
2 | Chapter 1: Preliminaries