Analytics Blog

Category: Spark

Shifting from Pandas to Spark DataFrames Pt 2

Welcome to the second part of this introduction to Spark DataFrames! Using Spark DataFrames: If you successfully installed Spark, you should be able to launch the Spark/Scala shell with the “spark-shell” command from any terminal window. A few things happen automatically when you launch this shell. This command defaults to local mode, which means that… Read more »

Shifting from Pandas to Spark DataFrames

Like most data scientists, I frequently use a lot of the same tools: Python, Pandas, scikit-learn, R, relational databases, Hadoop and so on. As part of Alpine Data’s Labs team I am frequently exposed to tools that other companies use – and every company has a different stack. This won’t come as a surprise, but… Read more »

Getting Better Performance with PySpark

Holden Karau is the Principal Software Engineer for IBM’s Spark Technology Center. She is an expert in using Spark’s open source cluster computing system to make data fast to run and fast to write. Holden has co-authored two books on the subject, Learning Spark and Fast Data Processing with Spark, and is currently working on… Read more »

Enterprise Scale Topological Data Analysis Using Spark

Last week I had the opportunity to speak at Spark Summit in San Francisco. It was great to learn about how different businesses are utilizing Spark to meet their needs and the speed at which Spark has been evolving. Spark has become one of the most powerful open source processing engines and I was able… Read more »

Harnessing Big Data with Spark

Last week I was in NYC, speaking at Data Summit 2016, and was quite impressed with the leading technologies and strategies that were presented. My presentation on enterprise Spark performance not only gave me the opportunity to reflect on the recent advances in big data, but also to talk about the evolution from MapReduce to Spark… Read more »