As VP of Engineering at Alpine, my charter is to build a product that helps enterprise customers leverage the latest data science and machine learning technology to create tangible business outcomes. In some cases, this takes the form of integrating various open source algorithms into the product for emerging areas such as deep learning, but in many cases the work that has to be done is far deeper in the technology stack. One of these areas of focus is Alpine AdaptiveML.
Alpine’s AdaptiveML operators leverage the deep understanding Chorus already has of a user’s workflow, together with its detailed view of a cluster’s real-time resources, to move beyond vanilla Spark implementations of ML algorithms when doing so improves performance.
The Chorus Visual Workflow Editor acts as a layer of abstraction when leveraging machine learning algorithms in secure database or Hadoop environments. When a user drags a logistic regression ML operator onto the canvas, they are requesting that a logistic regression be performed, not tying themselves to a specific execution framework. This gives Chorus the freedom to interrogate the Hadoop cluster when the user’s flow is executed and determine the optimal execution framework to use. This is Alpine AdaptiveML, and it is being adopted to improve ML performance by up to an order of magnitude compared to typical Spark implementations.
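The idea of deferring the choice of execution framework until run time can be illustrated with a minimal sketch. This is not Alpine's implementation: the names `Backend`, `ClusterSnapshot`, `OperatorRequest`, and `choose_backend`, as well as the selection rules and the `memory_headroom` factor, are hypothetical and exist only to show the shape of the decision.

```python
from dataclasses import dataclass
from enum import Enum


class Backend(Enum):
    """Hypothetical execution frameworks an operator could be routed to."""
    SPARK_MLLIB = "spark_mllib"
    IN_MEMORY_VECTORIZED = "in_memory_vectorized"
    GPU = "gpu"


@dataclass(frozen=True)
class ClusterSnapshot:
    """Assumed view of cluster resources at the moment the flow executes."""
    free_memory_per_node_gb: float
    gpus_available: int


@dataclass(frozen=True)
class OperatorRequest:
    """What the user asked for: an algorithm, not an execution framework."""
    algorithm: str
    dataset_size_gb: float
    gpu_capable: bool  # whether an accelerated implementation exists


def choose_backend(request: OperatorRequest,
                   cluster: ClusterSnapshot,
                   memory_headroom: float = 2.0) -> Backend:
    """Pick a backend from the request and the current cluster state.

    The rules below are illustrative assumptions, not Alpine's logic.
    """
    # Prefer an accelerator when the algorithm supports one and one is free.
    if request.gpu_capable and cluster.gpus_available > 0:
        return Backend.GPU
    # Use an in-memory implementation when the data fits on one node
    # with some working-space headroom.
    if request.dataset_size_gb * memory_headroom <= cluster.free_memory_per_node_gb:
        return Backend.IN_MEMORY_VECTORIZED
    # Otherwise fall back to the distributed Spark implementation.
    return Backend.SPARK_MLLIB


if __name__ == "__main__":
    request = OperatorRequest("logistic_regression", dataset_size_gb=50.0,
                              gpu_capable=False)
    cluster = ClusterSnapshot(free_memory_per_node_gb=512.0, gpus_available=0)
    print(choose_backend(request, cluster))
```

Because the workflow only records the requested algorithm, the same saved flow could resolve to a different backend on a different cluster, or on the same cluster at a later time, without any change by the user.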
When Chorus was initially released, most of the ML algorithms leveraged MapReduce. Over time, we have upgraded these operators to Spark without the user having to change their workflows. Alpine AdaptiveML provides this flexibility going forward, ensuring that the user’s workflows are future-proofed and are dynamically targeted at the best available execution framework as they continue to be developed.
The hardware used to construct Hadoop clusters has improved dramatically in recent years, with many compute nodes now providing close to 1TB of memory and numerous cores. At Alpine Data, we’re starting to see customers introduce compute accelerators (such as GPGPUs, Xeon Phi, or FPGAs) into their Hadoop clusters. Additionally, in the near future, we may start to see the introduction of non-volatile memory, where nodes could have 15TB or more of system memory and many data sets would fit in the memory of a single node or a handful of nodes. These high-performance compute clusters (with a wealth of memory and compute resources per node) provide a rich environment for AdaptiveML and many opportunities for significant performance optimization beyond vanilla Spark implementations.
Alpine AdaptiveML technology will take a variety of forms, from leveraging GPGPUs when available to leveraging highly optimized, multi-threaded, vectorized, in-memory implementations of compute-intensive algorithms. Alpine recently demoed our first AdaptiveML algorithm to our financial customers, showcasing the integration of Intel’s Data Analytics Acceleration Library (DAAL) with Spark. This implementation delivered a 10X improvement in performance (on a per-node basis) compared with an MLlib Spark implementation. Alpine’s implementation was also capable of leveraging Intel Xeon Phi compute accelerators, had they been available.
Going forward, we are continuing to build out additional operators with AdaptiveML support, leveraging both off-the-shelf high-performance analytics libraries and proprietary implementations. Compute accelerator support will be leveraged when appropriate for the algorithm, as will OpenCL-based approaches to broaden our support beyond CUDA.