Analytics Blog

Category: Automation

Shifting from Pandas to Spark DataFrames Pt 2

Welcome to the second part of this introduction to Spark DataFrames! Using Spark DataFrames: If you successfully installed Spark, you should be able to launch the Spark/Scala shell with the “spark-shell” command from any terminal window. A few things happen automatically when you launch this shell. This command defaults to local mode, which means that… Read more »


Shifting from Pandas to Spark DataFrames

Like most data scientists, I frequently use a lot of the same tools: Python, Pandas, scikit-learn, R, relational databases, Hadoop and so on. As part of Alpine Data’s Labs team I am frequently exposed to tools that other companies use – and every company has a different stack. This won’t come as a surprise, but… Read more »


The Need for Open Standards in Predictive Analytics

This week I had the opportunity to participate in a panel discussion at the ACM SIGKDD Conference on Knowledge Discovery and Data Mining. The panel discussion was part of the “Special Session on Standards in Predictive Analytics In the Era of Big and Fast Data” organized by the DMG (Data Mining Group). The panel session… Read more »


Getting Better Performance with PySpark

Holden Karau is the Principal Software Engineer for IBM’s Spark Technology Center. She is an expert in using Spark’s open source cluster computing system to make data fast to run and fast to write. Holden has co-authored two books on the subject, Learning Spark and Fast Data Processing with Spark, and is currently working on… Read more »