
Shifting from Pandas to Spark DataFrames

Like most data scientists, I tend to reach for the same set of tools: Python, Pandas, scikit-learn, R, relational databases, Hadoop and so on. As part of Alpine Data’s Labs team I am frequently exposed to the tools other companies use – and every company has a different stack. This won’t come as a surprise, but one popular technology that many of these companies are using is Apache Spark.

Spark represents a different computing paradigm than the one I’m used to. My normal workflow is to load a CSV file into my language or program of choice on my laptop and start working. For larger files (plug incoming!) I use Alpine, with its suite of map/reduce and Spark-based algorithms.

But lately I’ve been jumping into the world of Spark DataFrames. Besides learning the API, I’ve had to make a few choices and change the way I think about data analysis. Please note that I’m not a Spark expert; I still consider myself a novice. But the learning curve is steepest at the beginning, and these are the tips I wish I’d had when I wrote my first program.

Why use Spark?

Spark lets you run distributed computing jobs in memory. Spark code is simpler and faster to write than map/reduce code, and the computation itself is much faster than an equivalent Hadoop job.
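
To give a flavor of how concise Spark code can be, here is a minimal word count, the kind of job that takes dozens of lines of boilerplate in classic map/reduce. This is just a sketch: it assumes the spark-shell (where sc is the SparkContext the shell creates), and the file path is a placeholder for any text file you have handy.

val lines = sc.textFile("data/sample.txt")                    // one record per line of text
val words = lines.flatMap(line => line.split("\\s+"))         // split each line into words
val counts = words.map(word => (word, 1)).reduceByKey(_ + _)  // sum a 1 for every occurrence
counts.take(10).foreach(println)                              // print a few (word, count) pairs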

Check out the Apache Spark homepage for details of the Spark architecture.

The fundamental data structure of Spark is the Resilient Distributed Dataset (RDD). Among other properties, RDDs are immutable and distributed. A DataFrame is essentially an RDD of rows with a schema: a distributed table. We’ll have to keep these properties in mind when working with data. As a consequence, don’t expect a Spark DataFrame to have the same feature set as an R or Pandas dataframe.
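
To make these properties concrete, here is a small sketch, assuming the Spark 2.0 spark-shell, where spark is the SparkSession it creates (the names and numbers are made up). Notice that filter does not modify the original DataFrame; it returns a new one.

import spark.implicits._

val people = Seq(("Alice", 34), ("Bob", 45)).toDF("name", "age")
people.printSchema()                     // a DataFrame always carries a schema

val adults = people.filter($"age" > 40)  // people is untouched; we get a brand new DataFrame
adults.show()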

What language should I choose?

Take a look at the official Spark code samples page. The best language to use is subjective, but there are clear differences in functionality and ease of use that you should think about. Here are my thoughts:

Scala – Spark is written in Scala, so by using Scala you’ll always have access to the latest and greatest features. For the same reason, Spark will usually be as fast or faster in Scala than in any other language. On the other hand, Scala isn’t as widely used by data scientists as Python and R, so if you don’t already know it you’ll be learning a new language at the same time as learning Spark. I recommend going ahead and learning Scala anyway: there are good online resources, and becoming comfortable with functional programming will help you write parallelizable Spark code.

Python – Another good choice. If you already know Python there will be less of a learning curve, although I found Scala’s strong typing quite useful when working with Spark (see the short sketch after this list). Also note that in Spark 2.0 the DataFrame API becomes the main way to do machine learning, and the Python API is missing some features.

Java – It’s too verbose and I don’t really enjoy reading or coding in Java.

R – Although improved in the recent 2.0 release, SparkR has significantly less functionality than the other three choices.
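
To illustrate the point about typing made above: with a typed Dataset in Scala, the compiler catches mistakes that a dynamic language would only surface once the job is running. This is only a sketch, again assuming the Spark 2.0 spark-shell and made-up data.

case class Person(name: String, age: Int)
import spark.implicits._

val ds = Seq(Person("Alice", 34), Person("Bob", 45)).toDS()
val names = ds.map(p => p.name.toUpperCase)  // the compiler knows p.name is a String
names.show()
// ds.map(p => p.age.toUpperCase)            // would not compile: Int has no toUpperCase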

Getting started

Let’s install Spark! I’m assuming you are on Mac OS X. First you’ll need Java. Open a terminal window and type “java -version”. You should get a version number that starts with 1.8 (aka Java 8). If not, go install the latest Java JDK.

Next, we need to install Spark itself and set up some convenient environment variables. From the Spark downloads page, select the latest Spark release, download the package “Pre-built for Hadoop 2.7”, and uncompress it wherever you want your Spark installation to live.

We want to add Spark to your path. Put this in your .bash_profile file:

export SPARK_HOME=<path to your Spark folder>
export PATH=$PATH:$SPARK_HOME/bin

From the command line type “source ~/.bash_profile”.

You should now be able to type “spark-shell” from the command line and see Spark start up. You’ll notice a lot of INFO and WARN messages. Let’s fix those. Type “:q” to exit the shell.

In your Spark folder, make a copy of conf/log4j.properties.template as conf/log4j.properties and edit the file so that the line "log4j.rootCategory=INFO, console" reads "log4j.rootCategory=ERROR, console".

Let’s run a small program to test your installation. Open the spark-shell and enter the following:

// Monte Carlo estimate of Pi: sample random points in the unit square and
// count the fraction that land inside the unit circle.
def approximatePi(num: Int): Double = {
  sc.parallelize(1 to num)                // distribute the samples across the cluster
    .map { i =>
      val x = Math.random()
      val y = Math.random()
      if (x * x + y * y < 1) 1 else 0     // 1 if the point falls inside the circle
    }
    .reduce(_ + _) * 4.0 / num            // (hits / samples) * 4 approximates Pi
}

println("Pi is approximately " + approximatePi(100000))

If you get a reasonably accurate value of Pi in return, then congrats, you’ve installed Spark!

Now that you’ve installed Spark, the next step is learning how to read data. I will be posting another blog next week covering numerous topics, from transformations and actions to data movement and aggregations. Stay tuned for more information!
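
In the meantime, here is a small preview of roughly what reading a CSV file into a DataFrame looks like. This assumes the Spark 2.0 spark-shell, where spark is the SparkSession it creates; the file path and the options are placeholders, so point them at whatever CSV you have on hand.

val reader = spark.read.option("header", "true").option("inferSchema", "true")
val df = reader.csv("data/flights.csv")  // placeholder path
df.printSchema()                         // column names come from the header; types are inferred
df.show(5)                               // peek at the first few rows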