
Getting Better Performance with PySpark

Holden Karau is the Principal Software Engineer for IBM’s Spark Technology Center. She is an expert in using Spark’s open source cluster computing system to make data fast to run and fast to write. Holden has co-authored two books on the subject, Learning Spark and Fast Data Processing with Spark, and is currently working on a new guide to improving Spark performance that is due out early next year.

Holden stopped by Alpine Data headquarters earlier this week to deliver a technical presentation on getting better performance from PySpark. She discussed general tools for making Spark run faster, as well as tips for improving PySpark performance under specific design constraints. Holden then looked ahead and shared plans to make PySpark much, much faster, an effort she has given the working title “PySpark Vroom, Vroom”.

Check out the video here.