Analytics Blog

Harnessing Big Data with Spark

Last week I was in NYC, speaking at Data Summit 2016 and was quite impressed with the leading technologies and strategies that were presented. My presentation on enterprise Spark performance not only gave me the opportunity to reflect on the recent advances in big data, but also talk about the evolution from MapReduce to Spark and the new opportunities that Spark provides. I thought I’d provide today a brief summary of some of the areas covered in the presentation.


When we talk about Spark, and the performance benefits compared to MapReduce (MR), it’s important to realize that, in many companies, there is a significant amount of legacy software written using MR. Accordingly, it can represent a significant effort to port all of this existing code to Spark. It does make me wonder whether we should be coding to these evolving frameworks directly, or leveraging a level of abstraction to future-proof our code. Otherwise, we may find in a couple of years we face the prospect of porting our Spark code to a next-generation technology. Technologies such as Apache SystemML, or Alpine’s visual workflows provide this separation. This allows the data-scientist or the developer to not only focus on building analytics rather than getting caught up in the details and optimization of the execution framework, but also ensures that their work can be easily retargeted to subsequent developments in the big data space.


The introduction of Hadoop MapReduce (MR) allowed the application of algorithms to data of unprecedented scale using systems built from cheap commodity hardware. However, MR is slow, significantly curtailing its applicability to advanced iterative machine learning (ML) algorithms. These algorithms frequently need to be run multiple times in order to effectively train and optimally parameterize. Spark changed this and, by providing speedups of 100X or more, fundamentally introduced the possibility of applying ML to big data and extracting meaningful insights in actionable timeframes.

There are many ways that companies can benefit from these performance improvements: reduced time to insights, more frequent model updates, elimination of subsampling, and reduced cluster size. However, one often overlooked improvement is the ability to start to leverage AutoML on enterprise scale data.

As data continues to increase in both size and complexity, it’s impossible that data-scientists will intuitively know the optimal algorithm and the parameterization of that algorithm, to extract meaningful and accurate insights from their data. Did they choose the right algorithm? Did they investigate the possibilities of ensembles? Did they leverage constructive feature engineering? How about correctly parameterizing the model and effectively training and testing it? Today this process is manual, repetitive, time consuming and expensive. And the outcomes are highly correlated with the skill and experience of the data-scientist.

Alpine has long been leveraging the performance of Spark to accelerate our ML algorithms. However, we can go beyond the acceleration of a single ML algorithm, and automate the detailed exploration of both the feature, algorithm and parameterization spaces. Accordingly, rather than a data-scientist being forced to “guess” the best algorithm, the data-scientist would merely instruct Alpine about the nature of their problem, and the software would automatically explore potential feature engineering strategies, its portfolio of ML algorithms, the optimal parameterization of these algorithms and rapidly return a number of suggested solutions.

Hybrid Operators

Enterprise data lakes are often measured in petabytes. However, while the aggregate data under management is large, these data-lakes are often composed from a multitude of smallish data sets. In fact, many datasets are often sufficiently small to fit within the memory of a single cluster node. Over the last ten years since Hadoop was introduced, the per node memory has increased dramatically, with many nodes now populated with up to 256GB. And this per-node memory is likely to further increase by 10X over the next couple of years with the introduction and adoption of NVDRAM (Non-volatile DRAM) [Intel demoed a 512GB DDR4 DIMM a couple of months back]. As a result, a significant fraction of analyses performed on the data lake do not need to be distributed across multiple nodes. Is Spark the best choice for analyzing these datasets?

Spark is an essential tool for analyzing large data sets that are difficult to analyze via traditional means. However, the constraints imposed via the generic map-reduce paradigm, coupled with the current lack of support for GPGPUs and vector instructions, ensures that significant performance is left untapped. Consequently, if a data-set fits within the memory of a single-node there exists a large number of highly performant ML libraries that can very significantly outperform Spark.

However, the choice of when to deploy Spark versus these other libraries is complex. What is the size of the dataset? How much space is needed for intermediate results? How much memory does each node contain? What is the current load on the cluster? Accordingly, it doesn’t make sense for a data-scientist or a developer to try and maintain multiple ML libraries & manually determine the optimal choice. However, software can do this very efficiently. Alpine Data has been experimenting with what we like to call “hybrid implementations” that automatically and completely transparently to the user, choose the optimal approach by dynamically monitoring data set size, cluster size, and cluster utilization. Accordingly, the SW will continue to leverage Spark for truly large data, but will opportunistically attempt to deliver accelerated processing using highly-optimized single-node algorithms when appropriate, delivering optimal performance across the entire spectrum of data-set sizes on any cluster configuration. Finally, the requirements for hybrid operators also dovetails with the previously discussed benefits associated with abstraction — when coding to higher level APIs or leveraging code-free visual design tools, not only do we obtain future-proofing, but the SW is now freed to make intelligent discussions about the best possible methodologies for high-performance analytics.


Much of the focus in the analytics space is centered around building out toolkits that deliver high-performance machine learning algorithms. While this is important, there is a tendency to neglect operationalization. In reality, the performance and accuracy of the models developed is of little importance if there is no easy way for a business to benefit from these insights. In many toolkits, delivering a Powerpoint filled with BI screenshots is the only operationalization option, while in some there is the ability to at least export a PMML model. While PMML is good at describing ML models, it lacks the ability to describe the complex pre-processing (feature engineering) that is required in many scoring flows, frequently requiring that the PMML be accompanied with hand-rolled preprocessing code (frequently hacked in Python). While this does enable model operationalization, it has the potential to be messy, error-prone and lacks enterprise standards of governance.

Alpine Data recently joined the working group driving the successor to PMML — PFA, or the Portable Format for Analytics. PFA supports rich support, for not only describing complex models, but also complicated pre- and post-processing required in modern data science. As a result, we believe that it provides an elegant, turn-key solution to create, QA, manage and deploy end-2-end scoring flows. Stay tuned for announcements from Alpine about PFA support in the coming months!

I have attached the slides from my presentation here. Take a look and let us know what you think!