The “All About Deep Tech” blog series is about just what the title suggests: in-depth looks at the cool technology that makes Chorus run smoothly and how we leverage the latest and greatest in the big data industry to keep our customers ahead of the curve. If you missed our last post on AdaptiveML, be sure to check it out here.
In this post, we discuss the execution framework flexibility available in Chorus. We’ll use the Visual Workflow below to call out some key pieces of technology that we leverage to combine high performance with ease of use for our users.
1) Chorus extensibility: One of the key benefits of Chorus is the Extensions SDK, which allows users to create their own operators and have these operators interact seamlessly with the out-of-the-box operators that ship with Chorus. Many users use the Extensions SDK to integrate their own proprietary algorithms or open source algorithms of interest. Additionally, Alpine provides hooks into the Chorus Spark Autotuning engine from the SDK, allowing a user’s proprietary algorithms and open source libraries to benefit from this technology. In this example, the operator performs dimensionality reduction using the H2O PCA algorithm (accessed via Sparkling Water). This particular integration was accomplished with just a handful of lines of code, and access to the Chorus autotuning support significantly improved the scalability and performance of this algorithm compared to its standalone use.
2) Out-of-the box machine learning:
2.1: Alpine Forest – This is Alpine’s proprietary implementation of a random forest, built on Spark. One of the key advantages of Alpine’s Spark operators is our Spark Autotuning technology. While Spark can deliver amazing performance improvements for complex ML algorithms, it is extremely sensitive to how the Spark job is configured and resourced, requiring data scientists to have a deep understanding of both Spark and the configuration and utilization of the Hadoop cluster being used. Incorrectly resourced Spark jobs frequently fail with out-of-memory errors, leading to inefficient, time-consuming, trial-and-error resourcing experiments. This requirement significantly limits the utility of Spark and restricts its use to deeply skilled data scientists.
Our Autotuning technology removes this inefficiency by automatically resourcing and configuring Spark jobs (as discussed here). This technology started shipping as part of the Alpine Data Science platform in Fall 2016. Alpine plans to use it not only to help launch Spark jobs, but also to apply this deep understanding of Spark resource requirements to dynamically manage the sizing of elastic Hadoop instances, including AWS EMR, OpenShift and Kubernetes clusters.
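To make the resourcing problem concrete, here is a minimal sketch of the kind of arithmetic a data scientist would otherwise do by hand. It is not Alpine's Autotuning algorithm: the function names, the reserved core and gigabyte per node, the 3x in-memory expansion factor, and the 10% overhead fraction are all illustrative assumptions. Only the three `spark.executor.*` property names are standard Spark settings.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SparkResources:
    num_executors: int
    executor_cores: int
    executor_memory_gb: int


def suggest_resources(dataset_gb, node_count, cores_per_node, memory_per_node_gb,
                      expansion_factor=3.0, cores_per_executor=4,
                      overhead_fraction=0.10):
    """Hypothetical sizing heuristic; every default here is an assumption."""
    # Leave one core and 1 GB per node for the OS and cluster daemons.
    usable_cores = max(cores_per_node - 1, 1)
    usable_memory_gb = max(memory_per_node_gb - 1, 1)

    executor_cores = min(cores_per_executor, usable_cores)
    executors_per_node = max(usable_cores // executor_cores, 1)

    # Split node memory across executors, then hold back off-heap overhead.
    container_gb = usable_memory_gb / executors_per_node
    executor_memory_gb = max(int(container_gb * (1 - overhead_fraction)), 1)

    # Ask for enough executors to hold the expanded dataset in memory,
    # but never more than the cluster can actually host.
    wanted = math.ceil(dataset_gb * expansion_factor / executor_memory_gb)
    num_executors = max(1, min(wanted, executors_per_node * node_count))
    return SparkResources(num_executors, executor_cores, executor_memory_gb)


def to_spark_conf(resources):
    """Render the suggestion as standard Spark configuration properties."""
    return {
        "spark.executor.instances": str(resources.num_executors),
        "spark.executor.cores": str(resources.executor_cores),
        "spark.executor.memory": f"{resources.executor_memory_gb}g",
    }


if __name__ == "__main__":
    plan = suggest_resources(dataset_gb=50, node_count=10,
                             cores_per_node=16, memory_per_node_gb=64)
    print(to_spark_conf(plan))
```

Even this toy version depends on both the dataset and the cluster shape, which is why hand-tuned settings rarely carry over from one job to the next.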
2.2: TensorFlow DNN – Like H2O, TensorFlow is a popular open source framework that can present usability and stability challenges to the user. For instance, when there is a big change, such as an alteration to the API, users are stuck fiddling with their code to compensate for the changes made by the community. Chorus eliminates this problem by wrapping TensorFlow functionality within Chorus operators. In this example, we used TensorFlow on Spark to provide a GPGPU-accelerated Deep Neural Network trainer.
2.3: Alpine AdaptiveML KMeans – With the rapid increase in memory and compute resources available in enterprise Hadoop clusters, coupled with the recent explosion in execution frameworks for algorithms on big data, the correct choice of execution framework that fully leverages the available resources (including GPGPUs) is not always clear. However, there are significant performance advantages associated with choosing the correct framework, with the potential to deliver an order-of-magnitude improvement over current Spark performance.
Manually considering these various execution choices when building a model is time-consuming and tedious. Alpine AdaptiveML looks at the size of the dataset and the size and configuration of the cluster being used and automatically employs the optimal strategy. This hybrid approach lets the data scientist focus on the data science, not the execution framework being utilized. AdaptiveML has the potential to greatly improve machine learning performance, along with a host of other benefits including reduced time-to-insight, reduced cluster size, and the elimination of unwanted sub-sampling. In this example, we leverage Intel’s Data Analytics Acceleration Library to perform the KMeans computation, delivering over 10X performance improvement over Spark MLlib on a per-node basis.
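The decision described above, picking an execution strategy from the dataset size and the cluster configuration, can be sketched as a small rule-based selector. This is purely illustrative and is not how AdaptiveML is implemented: the strategy labels, the `Cluster` type, the 3x working-set expansion factor, and the 50% memory headroom threshold are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    nodes: int
    memory_per_node_gb: float
    has_gpu: bool = False


def choose_framework(dataset_gb, cluster, algorithm_supports_gpu=False,
                     expansion_factor=3.0, headroom=0.5):
    """Pick a hypothetical execution strategy label for one training job."""
    # Rough in-memory footprint of the data once loaded and featurized.
    working_set_gb = dataset_gb * expansion_factor

    # If the job fits comfortably on one node, avoid distribution overhead.
    if working_set_gb <= cluster.memory_per_node_gb * headroom:
        if cluster.has_gpu and algorithm_supports_gpu:
            return "single-node-gpu"
        return "single-node-native"

    # If it fits in aggregate cluster memory, keep everything in memory.
    if working_set_gb <= cluster.nodes * cluster.memory_per_node_gb * headroom:
        return "distributed-in-memory"

    # Otherwise fall back to a strategy that tolerates spilling to disk.
    return "distributed-spill-to-disk"


if __name__ == "__main__":
    cluster = Cluster(nodes=8, memory_per_node_gb=64)
    for size_gb in (2, 50, 500):
        print(size_gb, "GB ->", choose_framework(size_gb, cluster))
```

A production selector would presumably weigh measured performance rather than fixed thresholds, but the shape of the inputs and the output is the same.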
2.4: PySpark script leveraging MLlib Random Forest – Python is quickly becoming the most popular language among data scientists. However, versioning Python notebooks and pushing code into production still present significant problems when attempted at scale. Chorus offers an enterprise Python experience that allows you to develop your data pipelines in Python and leverage that code within the Visual Workflow Editor.
The Python Execute operator allows you to select the notebook you wish to run and automatically substitute the input and output datasets (Chorus automatically detects and enforces schemas). For PySpark Notebooks, you can also configure the Spark job appropriately to efficiently process different data set sizes.
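As a rough illustration of what substituting datasets while enforcing schemas can mean, the sketch below wraps a notebook-style transform so that its input and output are checked against declared schemas. It uses plain Python rows rather than Spark DataFrames, and `enforce_schema`, `run_notebook_step`, and the column names are hypothetical; none of them are part of the Chorus API.

```python
def enforce_schema(rows, expected):
    """Check that every row has exactly the expected columns and types.

    `expected` maps column name -> Python type. Raises on the first mismatch.
    """
    for index, row in enumerate(rows):
        if set(row) != set(expected):
            raise ValueError(
                f"row {index}: columns {sorted(row)} do not match {sorted(expected)}"
            )
        for column, column_type in expected.items():
            if not isinstance(row[column], column_type):
                raise TypeError(
                    f"row {index}: column {column!r} is not {column_type.__name__}"
                )
    return rows


def run_notebook_step(transform, input_rows, input_schema, output_schema):
    """Run a transform with whichever dataset the workflow supplies."""
    enforce_schema(input_rows, input_schema)
    output_rows = transform(input_rows)
    return enforce_schema(output_rows, output_schema)


def flag_high_income(rows, threshold=60000.0):
    """Stand-in for notebook logic: add a boolean feature column."""
    return [{**row, "high_income": row["income"] > threshold} for row in rows]


if __name__ == "__main__":
    customers = [
        {"age": 31, "income": 52000.0},
        {"age": 45, "income": 81000.0},
    ]
    result = run_notebook_step(
        flag_high_income,
        customers,
        input_schema={"age": int, "income": float},
        output_schema={"age": int, "income": float, "high_income": bool},
    )
    print(result)
```

Because the transform only sees rows that already passed the input check, the same notebook logic can be pointed at a different dataset without edits, and a mismatched dataset fails fast instead of partway through a job.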
3) AutoML: Alpine is debuting our AutoML support over the coming months, starting with support for hyperparameter optimization in the ML operators, followed by automated model selection, and then finally assisted feature selection.
Our method of AutoML includes an intelligent approach to hyperparameter optimization, which efficiently identifies the optimal set of parameters and adjusts them accordingly. Because Chorus is a collaborative application, teams of data scientists frequently work in concert on projects. Chorus retains information about all of the decisions data scientists make when interacting with their data sets, including their choice of columns to leverage, ETL operations, and modeling decisions. When other data scientists start interacting with these data sets, Chorus optionally provides insights into what other data scientists in their team have undertaken, preventing unnecessary duplication of effort. By blending classic AutoML techniques with these collaborative insights, Chorus provides a unique perspective on AutoML.
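For readers new to hyperparameter optimization, the simplest baseline is random search: sample parameter settings from a range, score each one, and keep the best. The sketch below shows that baseline only. It is not the method Chorus uses, and the `random_search` function, the toy objective, and the parameter names are assumptions chosen for illustration.

```python
import random


def random_search(objective, space, n_trials=50, seed=0):
    """Minimize `objective` by sampling uniformly from `space`.

    `space` maps parameter name -> (low, high). Returns (best_params, best_score).
    """
    rng = random.Random(seed)  # seeded so runs are reproducible
    best_params, best_score = None, float("inf")
    for _ in range(n_trials):
        params = {name: rng.uniform(low, high) for name, (low, high) in space.items()}
        score = objective(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_validation_error(params):
    """Stand-in for a model's validation error, minimized at (0.1, 6.0)."""
    return (params["learning_rate"] - 0.1) ** 2 + (params["max_depth"] - 6.0) ** 2


if __name__ == "__main__":
    search_space = {"learning_rate": (0.001, 1.0), "max_depth": (2.0, 12.0)}
    params, score = random_search(toy_validation_error, search_space, n_trials=200)
    print(params, score)
```

More sophisticated optimizers replace the uniform sampling step with a model of which regions look promising, and recorded choices from teammates could be used to narrow the search space before any trial runs.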