Analytics Blog

All About Deep Tech: Creating Scoring Engines with PFA

In recent blogs we have talked extensively about model operationalization and the support Chorus provides for the PFA (Portable Format for Analytics) standard. PFA provides a standardized way of representing analytical models, delivering much-needed model portability, i.e. the ability to train a model on one data platform, serialize it as PFA, and then score it on any other data platform that supports the PFA standard. Alpine has developed PFA scoring engines for deployment across a wide variety of environments, including two standalone engines:

  • RESTful engine
  • Kafka engine

And two integrated engines:

  • Storm engine
  • Spark Streaming engine

To conclude my blogs on model operationalization, I thought it would be interesting to highlight how easy it is with PFA to write your own basic scoring engines, thanks to the availability of open source PFA implementations developed by the Open Data Group and made available under the Apache license.

Getting started with PFA

Probably the simplest way to create a PFA scoring engine is to use Python and Titus, the open source Python PFA engine. Using Titus, a PFA engine can be created in just a couple of lines, as shown below.

import json
import sys

from titus.genpy import PFAEngine

# Leverage the PFA doc specified on the command line
pfa_model = sys.argv[1]
engine, = PFAEngine.fromJson(json.load(open(pfa_model)))

where the location of the PFA document (in this case assumed to be a JSON-formatted PFA file) is supplied as a command line argument. Once the engine is created, it’s then just a few more lines of code to use the engine to start scoring.

# Invoke any initialization functions
engine.begin()

# Score an example input record
sample = {"Sepal_length": "1.0", "Sepal_width": "1.0", "Petal_length": "1.0", "Petal_width": "1.0"}
results = engine.action(sample)
print results

In this example, the resulting output is

[ec2-user@ip-10-0-0-169 blogs]$ python demo_pfa.py example.pfa
{u'INFO': {u'Iris-virginica': 2.2905557894937507e-15, u'Iris-setosa': 0.9999999999122311, u'Iris-versicolor': 8.77666387876342e-11}, u'PRED': u'Iris-setosa', u'CONF': 0.9999999999122311}

example.pfa is a logistic regression model trained on the well-known Iris data set, and we are making a prediction from a single sample of data. As illustrated above, the PFA engine expects the input data to be presented as a dictionary, with the input features labelled as in the PFA document’s input specification. The output consists of a prediction, a confidence value, and the likelihood that the sample belongs to each species of iris.
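For reference, the input specification in a PFA document is an Avro schema describing the expected record. A hypothetical sketch of what the input section of example.pfa might look like (the actual field names and types depend on how the model was exported) is:

"input": {"type": "record",
          "name": "Input",
          "fields": [{"name": "Sepal_length", "type": "string"},
                     {"name": "Sepal_width", "type": "string"},
                     {"name": "Petal_length", "type": "string"},
                     {"name": "Petal_width", "type": "string"}]}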

Once this basic engine is created, it’s then just a few more lines of Python to create a RESTful scoring engine or to attach the engine to a Kafka stream, as discussed below.

RESTful PFA Scoring Engine

This basic code can then be easily used to create a RESTful scoring interface as follows:

import json

import tornado.escape
import tornado.web

from titus.genpy import PFAEngine

# Route registered with the Tornado application:
# (r"/demo/score/([a-zA-Z0-9_]+)", scoreModel),

class scoreModel(tornado.web.RequestHandler):
    # Score the incoming payload against the model named in the URL
    def post(self, id):
        engine, = PFAEngine.fromJson(json.load(open('models/%s.pfa' % (id))))
        dd = tornado.escape.json_decode(self.request.body)
        self.write(str(engine.action(dd)))

In the above excerpt, I’m using the Tornado web server, and the score endpoint uses Titus to score the data in the incoming payload. The model to be used is specified in the URL (it is assumed that the model has already been uploaded) and is loaded into an engine; the JSON payload of the incoming message is then decoded, passed to the engine for scoring, and the result generated by the PFA engine is returned as the response. [N.B. In this example a new PFA engine is created for each incoming message, which, while inefficient, is OK for illustrative purposes.]
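For completeness, here is a minimal sketch of how this handler might be wired into a running Tornado application; the port and application setup are illustrative assumptions rather than part of the original service:

import tornado.ioloop
import tornado.web

def make_app():
    # Register the scoring endpoint; the model id is captured from the URL
    return tornado.web.Application([
        (r"/demo/score/([a-zA-Z0-9_]+)", scoreModel),
    ])

if __name__ == "__main__":
    app = make_app()
    app.listen(8888)  # assumed port for this sketch
    tornado.ioloop.IOLoop.current().start()

With the application running, a prediction can be requested by POSTing the JSON sample shown earlier to /demo/score/example, assuming the model has been uploaded as models/example.pfa.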

Streaming scoring engine

Similar to the RESTful scoring engine, the PFA engine for scoring Kafka streams can be created in a few lines of Python:

import json
import sys

from kafka import KafkaConsumer
from kafka import KafkaProducer
from titus.genpy import PFAEngine

# Create the PFA engine and invoke any initialization functions
pfa_model = sys.argv[1]
engine, = PFAEngine.fromJson(json.load(open(pfa_model)))
engine.begin()

# Topics to read input records from and to emit predictions to
kafka_topic_to_score = sys.argv[2]
kafka_topic_to_emit = sys.argv[3]

# Address of the Kafka broker (here taken as a fourth command-line
# argument), e.g. "localhost:9092"
server_address = sys.argv[4]

consumer = KafkaConsumer(kafka_topic_to_score, bootstrap_servers=server_address)
producer = KafkaProducer(bootstrap_servers=server_address)

# Consume messages
for msg in consumer:
    # Score next message
    score = engine.action(json.loads(msg.value))
    # Emit prediction
    producer.send(kafka_topic_to_emit, str(score))

In the above code, the PFA engine is created in the same fashion as the previous examples, and the kafka-python library is used to create a Kafka client that connects to the specified broker and provides a message consumer and a message producer. Incoming messages are read by the consumer, scored using the PFA engine, and then written by the producer back to Kafka.
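To exercise the engine, a few more lines of kafka-python can publish a sample record to the input topic. The broker address and topic name below are illustrative assumptions:

import json

from kafka import KafkaProducer

# Publish a single iris sample to the topic the scoring engine consumes
producer = KafkaProducer(bootstrap_servers="localhost:9092")  # assumed broker address
sample = {"Sepal_length": "1.0", "Sepal_width": "1.0",
          "Petal_length": "1.0", "Petal_width": "1.0"}
producer.send("iris_input", json.dumps(sample))  # "iris_input" is an assumed topic name
producer.flush()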

PySpark Scoring Engine

Finally, it’s simple to leverage the Titus PFA engine from PySpark, and a simple Spark mapPartitions integration is illustrated below.

In this example, the scoreModel function iterates row by row over the data in the partition, scores each row, and returns a list of predictions.
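A minimal sketch of such a scoreModel function follows; the SparkContext setup and the sample records here are illustrative assumptions:

import json

from pyspark import SparkContext
from titus.genpy import PFAEngine

sc = SparkContext(appName="pfa-scoring")

# Broadcast the parsed PFA document so every executor can build its own engine
pfa_doc = sc.broadcast(json.load(open("example.pfa")))

def scoreModel(rows):
    # Build one engine per partition; engines are not shared across processes
    engine, = PFAEngine.fromJson(pfa_doc.value)
    engine.begin()
    # Iterate row by row over the data in the partition, scoring each row
    # and returning a list of predictions
    return [engine.action(row) for row in rows]

# An illustrative RDD of input records, keyed as in the PFA input specification
samples = sc.parallelize([{"Sepal_length": "1.0", "Sepal_width": "1.0",
                           "Petal_length": "1.0", "Petal_width": "1.0"}])

predictions = samples.mapPartitions(scoreModel)
print predictions.collect()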

While it’s obvious that the above examples need significant hardening and error handling before they could be used in production, hopefully they illustrate how easy it is to get started with PFA for model scoring!