We’ve had a lot of interesting developments brewing in engineering at Alpine, the most recent of which is the release of Alpine Touchpoints. Touchpoints is a new layer above the core analytics product that lets users build predictive applications from the models they have developed within Alpine, which can then be used by any business user. For this layer to offer the most sophisticated machine learning models beneath it, however, it requires a reliable and extensible platform for developing different types of models. In this blog, I’ll start to discuss some concrete examples that showcase the extensibility of the Alpine stack, including our R integration, our custom operator framework, and JDBC support.
Leveraging existing functionality in R: R has long been an important tool for data scientists. As a result, many data science teams have invested heavily in R and have amassed significant bodies of code developed over a span of years. Similarly, many important algorithms have been developed in R and open-sourced, so it’s not necessary to constantly reinvent the wheel when leveraging common algorithms, which significantly accelerates development. Alpine protects these investments by allowing a customer’s existing R implementations of key operations to be seamlessly incorporated into workflows.
This is achieved using Alpine’s R Execute operator. The operator enables arbitrary R code to be invoked (including the use of R packages), with precise control over the data flowing into and out of the operator, allowing it to be incorporated into Alpine workflows and intermixed with native Alpine operators.
For example, I’m a big fan of Topological Data Analysis, and wanted to be able to leverage insights from TDA as inputs into a broader workflow I was developing. Happily, there is an open-source R implementation of a number of key TDA algorithms, and, in a matter of minutes, I was able to incorporate TDAmapper algorithms into my Alpine flow:
Custom Operator Framework: Alpine leverages Apache Spark extensively to provide high-performance implementations of data transforms and machine-learning algorithms. While Alpine provides a wide portfolio of optimized algorithms, customers may need to use algorithms that exist in various open-source libraries or incorporate custom algorithms their data-science teams have developed internally.
It’s easy to incorporate these algorithms into Alpine workflows using Alpine’s recently released custom operator framework. These operators can be written easily, whether by Alpine engineers or by the customer’s own data scientists, either in Spark against HDFS or in SQL against an RDBMS. The resulting jar can then be dropped into Alpine and used in any workflow.
As an example, there was a recent request to add support for collaborative filtering algorithms. A Spark implementation of these algorithms already exists in the open-source Spark MLlib. Using the Alpine Custom Operator Framework, it was simple to create an operator that fully integrates into the broader Alpine ecosystem but, internal to the operator, simply calls out to MLlib, as illustrated below:
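To give a feel for what such an operator delegates to, here is a minimal, pure-Python sketch of the collaborative-filtering idea — a simple user-based neighborhood variant with cosine similarity, not MLlib’s actual ALS implementation, and with toy ratings data invented purely for illustration:

```python
# Illustrative sketch only: user-based collaborative filtering with
# cosine similarity. The ratings data below is invented toy data.
from math import sqrt

# user -> {item: rating}
ratings = {
    "alice": {"a": 5.0, "b": 3.0, "c": 4.0},
    "bob":   {"a": 4.0, "b": 3.0, "d": 5.0},
    "carol": {"b": 2.0, "c": 5.0, "d": 4.0},
}

def cosine(u, v):
    """Cosine similarity between two {item: rating} vectors (dot over shared items)."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[k] * v[k] for k in shared)
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv)

def recommend(user, k=2):
    """Score items the user hasn't rated by similarity-weighted ratings from others."""
    scores, weights = {}, {}
    for other, their in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], their)
        for item, r in their.items():
            if item in ratings[user]:
                continue  # only recommend unseen items
            scores[item] = scores.get(item, 0.0) + sim * r
            weights[item] = weights.get(item, 0.0) + sim
    ranked = {i: scores[i] / weights[i] for i in scores if weights[i] > 0}
    return sorted(ranked, key=ranked.get, reverse=True)[:k]

print(recommend("alice"))  # item "d" is the only item alice hasn't rated
```

In production this neighborhood approach is replaced by matrix factorization (what MLlib’s ALS does), but the input/output contract — ratings in, ranked recommendations out — is the same shape an operator wraps.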
Similarly, in many instances, customers may have existing implementations of algorithms in Spark or MapReduce. These, too, can easily be incorporated into Alpine using the Custom Operator Framework. For example, in our most recent release we introduced support for gradient boosted trees (GBT). While an implementation of this algorithm exists in MLlib, it was insufficiently performant. One of the members of our ML team at Alpine had previously written a high-performance implementation of GBT as a fun project during one of the hackathons we hold periodically in engineering. He was able to readily integrate his code into Alpine’s custom operator framework and deliver a high-performance implementation of GBT in Alpine.
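Conceptually, gradient boosting fits each new tree to the residuals of the current ensemble. The following is a minimal sketch of that loop using depth-1 stumps on one-dimensional data with squared-error loss — the shape of the algorithm only, with invented toy data, and nothing like the high-performance implementation described above:

```python
# Illustrative sketch of gradient boosting: repeatedly fit a regression
# stump to the residuals of the running ensemble, then add a shrunken
# copy of it. Toy 1-D data; squared-error loss.

def fit_stump(xs, residuals):
    """Find the threshold t minimizing squared error of left/right leaf means."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right)
        err = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, lmean, rmean)
    return best[1:]  # (threshold, left leaf value, right leaf value)

def fit_gbt(xs, ys, rounds=50, lr=0.1):
    """Boost `rounds` stumps; each fits the current residuals, scaled by lr."""
    base = sum(ys) / len(ys)       # start from the global mean
    preds = [base] * len(xs)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lmean, rmean = fit_stump(xs, residuals)
        stumps.append((t, lmean, rmean))
        preds = [p + lr * (lmean if x <= t else rmean)
                 for x, p in zip(xs, preds)]
    return base, lr, stumps

def predict(model, x):
    base, lr, stumps = model
    return base + sum(lr * (l if x <= t else r) for t, l, r in stumps)

# Two clusters of targets: low for x <= 3, high for x > 3.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.2, 1.1, 3.9, 4.2, 4.0]
model = fit_gbt(xs, ys)
```

Real implementations differ in every performance-relevant way (deeper trees, histogram-based splits, distributed execution), but the residual-fitting loop is the essence of the algorithm.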
JDBC support: Alpine provides native support for a wide variety of RDBMS (e.g., Oracle, Greenplum, and Postgres) and Hadoop data sources. However, it’s also possible to leverage within Alpine any data source that provides a JDBC driver. For example, should a user want to source or sink data from a Redshift instance, it’s just necessary to download the Redshift JDBC driver, add the driver to Alpine Chorus, and point Alpine at the instance. It’s then possible to use the Redshift instance in Alpine flows. Similarly, the figure below illustrates an Alpine flow using a JDBC connector to Teradata Aster.
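As a rough sketch of what that setup involves, a Redshift connection uses the standard `jdbc:redshift://host:port/database` URL form — the cluster endpoint, database name, and field labels below are hypothetical placeholders, and the exact fields Alpine Chorus expects may differ:

```
# Hypothetical values for illustration -- your actual endpoint, port,
# and database name come from the Redshift console.
driver jar: the Redshift JDBC driver downloaded from AWS
JDBC URL:   jdbc:redshift://my-cluster.example.us-east-1.redshift.amazonaws.com:5439/analytics
username:   alpine_user
```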
The above are a few simple examples of how Alpine has been engineered to protect a customer’s existing investments in analytics and to ensure that Alpine is readily extensible and configurable for a customer’s specific use cases. In the following blogs in this series, we will go into greater depth on the extensibility of Alpine’s analytics engine and how it can be leveraged in the other layers of the product.