Analytics Blog

All About Deep Tech: Model Operationalization

Model operationalization is a core component of effective data science, and is a key focus at Alpine Data. In previous blogs, I’ve written frequently about model ops, especially the support Chorus provides for exporting models using the PFA and PMML formats. However, what about scoring on data platforms that don’t yet provide PFA or PMML support?

Sadly, the options for scoring in these scenarios are quite grim.

One option is to copy the data from the data platform to a PMML or PFA scoring engine, score it there, and then push the predictions back to the original platform. Clearly, this is suboptimal: every scoring run involves moving data off the platform and shipping results back onto it, which adds latency and operational complexity.

Another option is to use the proprietary model format supported by each data platform, enabling easy training and scoring on the same platform. However, many enterprise big data ecosystems are composed of multiple data platforms. For instance, consider the following scenario: data is primarily retained in Teradata, and selectively copied into Hadoop for analysis and analytics. Here, the models are trained on historical data in Hadoop, but need to be applied to new data resident in Teradata. Clearly, this is not feasible if the trained models exist only in the Spark ML format, which Teradata cannot evaluate natively.

In Chorus, we provide the capability to train models on Hadoop using Spark, and to perform in-database scoring of these models on any of the JDBC data platforms Chorus supports. Accordingly, models can be trained on Hadoop, and then scored on data resident in Teradata, SAP HANA or AWS RedShift, without requiring any data movement!

This is achieved in Chorus by retaining model details using a rich intermediate representation. From this IR, we can generate PMML or PFA for export. Additionally, to support scoring on a wide variety of RDBMS platforms, Chorus can use this same information to generate a series of vanilla SQL statements that perform in-database model scoring. This can be performed directly using Chorus, as shown in the screenshot below, or a SQL script can be generated and exported, such that in-database scoring can be performed outside of Chorus.
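
To make the idea concrete, here is a minimal sketch of what such generated scoring SQL might look like for a simple logistic regression model. The table, columns and coefficient values are purely illustrative assumptions, not actual Chorus output:

    -- Hypothetical sketch: scoring a two-feature logistic regression
    -- model with vanilla SQL. The table (customers), columns (age,
    -- balance) and coefficients are illustrative, not Chorus output.
    SELECT
        customer_id,
        1.0 / (1.0 + EXP(-( -2.7138              -- intercept
                            + 0.0412 * age       -- coefficient for age
                            + 0.0009 * balance   -- coefficient for balance
                          ))) AS churn_probability
    FROM customers;

Because the statement uses only standard SQL expressions, it can run directly inside any of the supported JDBC databases.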

Additionally, as hinted in the above screenshot, Chorus provides the ability to include upstream transformation operations in the saved models. In the above example, the model object not only contains information about the logistic regression model, but also the upstream normalization operation that needs to be applied to the data prior to scoring.
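
Continuing the illustrative sketch above, folding a z-score normalization step into the generated SQL could look something like the following; again, all names and values are hypothetical:

    -- Hypothetical sketch: the same scoring statement with the upstream
    -- z-score normalization baked in. The means and standard deviations
    -- would be captured at training time; the values here are illustrative.
    SELECT
        customer_id,
        1.0 / (1.0 + EXP(-( -0.8315
                            + 1.2046 * ((age     - 41.20)   / 10.60)    -- normalized age
                            + 0.5531 * ((balance - 1528.34) / 3009.10)  -- normalized balance
                          ))) AS churn_probability
    FROM customers;

Capturing the transformation alongside the model means the scoring SQL always applies the same preprocessing that was used during training.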

Leveraging Chorus, it’s possible to seamlessly perform cross-platform model scoring, without having to move data between platforms or work across multiple UIs. By providing users with multiple options for operationalizing models, Chorus gives enterprises the most robust tool for moving models into a variety of production environments.