
Announcing Chorus 6.1

Last week we announced the availability of Chorus 6.1. With this latest release, we've continued to deliver new enterprise analytics features, including several marquee items: enterprise data governance, Spark autotuning, and support for an emerging model interchange specification, PFA.

Enterprise data governance: Chorus 6.1 introduces support for administrators to exert fine-grained control over the visibility of data sources to Chorus users. Admins can now manage data source availability and credentials at the global, workspace, or individual level.
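
To make the scoping model concrete, here is a purely illustrative sketch of how layered grants might resolve; the actual Chorus data model is not public, and every name below (the grant table, the helper function, the account names) is hypothetical. The idea is simply that the most specific grant wins: individual over workspace over global.

```python
# Hypothetical sketch of scope-based credential resolution; the Chorus
# data model is not public, so every name here is illustrative only.
def resolve_credentials(grants, user, workspace, source):
    """Return the most specific credential granted for a data source."""
    for scope in [("user", user), ("workspace", workspace), ("global", None)]:
        cred = grants.get((source,) + scope)
        if cred is not None:
            return cred
    return None  # the data source is not visible to this user

grants = {
    ("warehouse", "global", None): "readonly_svc_account",
    ("warehouse", "workspace", "fraud-team"): "fraud_team_account",
    ("warehouse", "user", "alice"): "alice_personal_creds",
}

print(resolve_credentials(grants, "alice", "fraud-team", "warehouse"))
# -> alice_personal_creds (the individual grant overrides the others)
```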

Spark autotuning: Visual editors dramatically simplify the development of even complex analytical flows. Too often, however, the developer must still manually configure the underlying Spark jobs (including the number and size of the executors), which requires a low-level understanding of Spark and detailed cluster sizing information. With Chorus 6.1, Alpine Data removes this requirement and delivers sophisticated autotuning for Spark jobs. Using information about the size of the data being analyzed, the analytical operations in the flow, and the cluster's size, configuration, and real-time utilization, Chorus automatically determines the optimal Spark driver and executor configurations for peak performance.
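
To give a feel for the kind of sizing arithmetic involved, here is a hedged sketch. The heuristics below (reserving a core per node for daemons, roughly five cores per executor, roughly 128 MB partitions) are common Spark rules of thumb, not Alpine's actual autotuning logic, and the function name and parameters are invented for illustration. Only the configuration keys in the returned dictionary are standard Spark properties.

```python
# Hypothetical autotuning sketch: derive executor count/size from the
# cluster shape and input data size. NOT Alpine's actual algorithm.
def suggest_spark_config(input_bytes, node_count, cores_per_node,
                         memory_per_node_gb, cores_per_executor=5,
                         memory_overhead_fraction=0.10):
    # Leave one core and ~1 GB per node for the OS and Hadoop daemons.
    usable_cores = max(cores_per_node - 1, 1)
    executors_per_node = max(usable_cores // cores_per_executor, 1)
    executor_memory_gb = int(
        (memory_per_node_gb - 1) / executors_per_node
        * (1 - memory_overhead_fraction))

    # Scale the partition count to the data: aim for ~128 MB partitions.
    partitions = max(input_bytes // (128 * 1024 * 1024), 1)

    return {
        "spark.executor.instances": node_count * executors_per_node,
        "spark.executor.cores": cores_per_executor,
        "spark.executor.memory": f"{executor_memory_gb}g",
        "spark.default.parallelism": partitions,
    }

# Example: 10 nodes with 16 cores and 64 GB each, ~50 GB of input data.
print(suggest_spark_config(50 * 1024**3, 10, 16, 64))
# -> 30 executors, 5 cores and 18g each, 400 partitions
```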

PFA support: Alpine Data has long supported the export of analytical models using PMML. With Chorus 6.1 we have also introduced support for PFA. PFA, the Portable Format for Analytics, is a next-generation model interchange format that builds on the lessons learned over almost two decades of PMML use. One notable enhancement is its rich support for the often complex data preprocessing that must occur before the actual analytical model is applied. With PFA, an entire scoring flow can be captured in a single document, rather than relying on code fragments or scripts to perform the necessary data cleanup and transformation. PFA has the potential to significantly simplify the process of operationalizing models and will be an active development focus for Alpine over the next few releases.
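
To show what "an entire scoring flow in a single document" looks like, here is a minimal PFA document that combines a preprocessing step (standardization) with a toy linear model, executed with Titus, the open-source Python implementation of PFA (the original Titus targets Python 2; a community titus2 fork supports Python 3). The document and its constants are illustrative, not an actual Alpine export.

```python
from titus.genpy import PFAEngine

# A minimal PFA document: preprocessing (standardization) and a toy
# linear model captured together in one scoring flow. The constants
# are illustrative, not a real exported model.
pfa = """
input: double
output: double
action:
  - let: {scaled: {"/": [{"-": [input, 50.0]}, 10.0]}}   # preprocess
  - {"+": [{"*": [scaled, 2.0]}, 1.0]}                   # linear score
"""

engine, = PFAEngine.fromYaml(pfa)
print(engine.action(60.0))  # scaled = 1.0, so the score is 3.0
```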

Algorithmic performance: When discussing the benefits of visual workflow editors for advanced analytics, the key advantages are normally associated with the ease with which data scientists can develop complex workflows without needing to manually write the associated code in Java, Scala, Python, or R. This reduces the possibility of subtle coding errors and significantly improves readability, comprehension, and maintenance.

However, it is also worth considering the implications of the rapidly changing big data technology landscape. While it is exciting to witness the rapid evolution of this space, the rate of technology obsolescence can be problematic. When coding directly against the execution frameworks, moving from MapReduce to Spark requires a significant porting effort; even moving from RDDs to the Dataset API introduced in Spark 2.0 can require significant developer effort and revalidation. In contrast, a visual editor introduces a level of abstraction that decouples the analytical flow from the underlying execution framework. Accordingly, Alpine is able to dynamically retarget the execution framework as technology advances, without requiring any changes to the analytical flow.

As with every release, we take full advantage of this ability and continue to refine and improve the performance and scalability of our algorithms. In this release, the performance of SVM improved by around 3x. No changes are required to existing workflows; they just run faster!
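
To make the decoupling argument concrete, here is a minimal, purely hypothetical sketch; it is not Alpine's actual architecture, and every name in it (Operator, Backend, the toy operators) is invented for illustration. The point is that the flow definition stays fixed while the backend that executes it can be swapped.

```python
# Hypothetical sketch of decoupling a visual flow from its execution
# framework: each logical operator is a framework-agnostic description,
# and a pluggable backend decides how to execute it.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Operator:
    name: str    # logical step, e.g. "filter" or "train_svm"
    params: dict # framework-independent configuration

class Backend:
    """One concrete execution framework (e.g. MapReduce, Spark RDDs,
    or the Spark Dataset API) supplying an implementation per operator."""
    def __init__(self, impls: Dict[str, Callable]):
        self.impls = impls

    def run(self, flow: List[Operator], data):
        for op in flow:
            data = self.impls[op.name](data, **op.params)
        return data

# The flow definition never changes; retargeting the execution
# framework just means swapping in a different Backend.
flow = [Operator("filter", {"min_value": 10}),
        Operator("scale", {"factor": 0.5})]

local_backend = Backend({
    "filter": lambda rows, min_value: [r for r in rows if r >= min_value],
    "scale":  lambda rows, factor: [r * factor for r in rows],
})

print(local_backend.run(flow, [5, 10, 20]))  # -> [5.0, 10.0]
```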