Analytics Blog

Spark Summit East 2017: Spark Autotuning with Alpine Data

Last week at Spark Summit East 2017, Alpine Data presented details of the technology we have developed for autotuning Spark jobs. Spark can deliver excellent performance, allowing data scientists to apply complex machine learning algorithms to large data sets and quickly deliver actionable insights. However, Spark is extremely sensitive to how a job is configured and resourced, so data scientists need a deep understanding of both Spark and the configuration and utilization of the Hadoop cluster. Incorrectly resourced Spark jobs frequently fail with out-of-memory errors, which pushes data scientists into inefficient, time-consuming, trial-and-error resourcing experiments.

This requirement significantly limits the utility of Spark and restricts its use to deeply skilled data scientists. Alpine Data’s Spark Autotuning technology eliminates this inefficiency by automatically resourcing and configuring the Spark jobs that data scientists launch. This is not a static configuration. Rather, at run time the software determines the correct resourcing and configuration for each Spark job based on i) the size and dimensionality of the input data, ii) the algorithmic complexity of the Spark job, and iii) the availability of resources on the Hadoop cluster.
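To make the idea concrete, here is a minimal sketch of how those three inputs could be turned into Spark settings at run time. It is not Alpine Data's algorithm. The function name `suggest_spark_conf`, the `ClusterResources` type, and every constant (the in-memory expansion factor, the width factor, the 10% overhead allowance) are assumptions chosen purely for illustration. Only the three Spark property names are real settings.

```python
import math
from dataclasses import dataclass


@dataclass
class ClusterResources:
    """Snapshot of what the cluster can offer right now (hypothetical type)."""
    available_memory_mb: int      # total free memory across the cluster
    available_vcores: int         # total free virtual cores
    max_container_memory_mb: int  # largest single container the cluster grants


def suggest_spark_conf(input_size_mb, num_columns, complexity_factor, cluster,
                       cores_per_executor=4, overhead_fraction=0.10):
    """Suggest executor settings from data size, job complexity, and free resources.

    All constants are illustrative assumptions, not tuned or vendor values.
    """
    # i) + ii): rough working-set estimate. Data usually grows when loaded into
    # memory, and wider or more complex jobs need more headroom.
    expansion = 3.0                              # assumed disk-to-memory growth
    width_factor = 1.0 + num_columns / 1000.0    # assumed penalty for wide data
    needed_mb = input_size_mb * expansion * width_factor * complexity_factor

    # iii): never plan for more memory than the cluster can currently supply.
    budget_mb = min(needed_mb, cluster.available_memory_mb)

    # Keep heap plus off-heap overhead inside the largest allowed container.
    max_heap_mb = int(cluster.max_container_memory_mb / (1.0 + overhead_fraction))

    # Executor count is capped by free cores, then sized to cover the budget.
    max_by_cores = max(1, cluster.available_vcores // cores_per_executor)
    executors = max(1, min(max_by_cores, math.ceil(budget_mb / max_heap_mb)))
    heap_mb = max(512, min(max_heap_mb, math.ceil(budget_mb / executors)))

    return {
        "spark.executor.instances": str(executors),
        "spark.executor.cores": str(cores_per_executor),
        "spark.executor.memory": f"{heap_mb}m",
    }


if __name__ == "__main__":
    busy_cluster = ClusterResources(available_memory_mb=8000,
                                    available_vcores=8,
                                    max_container_memory_mb=4096)
    # 10 GB input, 100 columns, a moderately expensive algorithm.
    print(suggest_spark_conf(10_000, 100, 1.5, busy_cluster))
```

The same job gets a different configuration depending on how busy the cluster is at launch time, which is the difference between run-time autotuning and a static configuration.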

This technology started shipping as part of Alpine’s data science platform in Fall 2016. Going forward, Alpine plans to use this deep understanding of Spark resource requirements not only to launch Spark jobs, but also to dynamically manage the sizing of elastic Hadoop instances, including AWS EMR.

Alpine Data did not present the details of the algorithm it uses (due to time limitations and IP considerations), but rather used the presentation to start a conversation about the feasibility of Spark autotuning using a simple example algorithm. The slides from the presentation can be found here. In the next few weeks, we plan to provide more details about the algorithm we presented and to open-source related technology (which will be available here).