Last year we announced the addition of Python Notebooks to the Chorus Platform. This long-requested feature gives teams of data scientists and analysts even more flexibility within Chorus. Notebooks let you perform interactive Python analysis on data from any Hadoop or database source connected to Chorus. As with Visual Workflows, you can edit Notebooks collaboratively, and Chorus will track versions, tags, and notes. In subsequent releases, we’ve continued to refine Notebooks with support for important new technologies such as PySpark and with tighter integration between the Notebook environment and our Visual Workflow Editor.
In Chorus 6.2 we added support for PySpark and for scheduling Python Notebooks as jobs. With the PySpark integration, you simply select a Hadoop cluster from the list of those connected to Chorus. We take care of the details of setting up a Spark Context on that cluster, so you can start submitting Spark jobs immediately.
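To give a feel for what submitting a Spark job from a Notebook can look like, here is a minimal sketch. It assumes the Spark Context is exposed to the Notebook under the conventional name `sc`; that name, and the helper `count_even_numbers`, are illustrative assumptions rather than part of the Chorus API. The RDD calls themselves (`parallelize`, `filter`, `count`) are standard PySpark.

```python
def count_even_numbers(sc, upper):
    """Count the even integers in [0, upper) using the given Spark Context.

    `sc` is assumed to be an already-initialized pyspark.SparkContext,
    such as the one the Notebook environment sets up for you.
    """
    numbers = sc.parallelize(range(upper))        # distribute the data as an RDD
    evens = numbers.filter(lambda n: n % 2 == 0)  # lazy transformation, nothing runs yet
    return evens.count()                          # action: this triggers the Spark job


# Hypothetical usage inside a Notebook cell where `sc` already exists:
# count_even_numbers(sc, 1000)
```

Because the function takes the context as an argument instead of creating one, the same code works with whichever cluster you selected.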
We will continue to enhance our Python integration in Chorus 6.3 with a new Python Execute operator for use in Visual Workflows. Python Execute lets you insert Python Notebooks directly into the Visual Workflow compute graph, much like our R, SQL, HQL, and Pig Execute operators.
Within the Python Notebook, you simply need to define the input and output schema the Notebook expects (most typically by inferring the schema from a data frame). We’ve also improved our Chorus bridge APIs to allow for easy import and export of assets in the Chorus Workspace, so reading or writing a PMML or PFA model is a one-line affair.
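For readers unfamiliar with schema inference, the idea can be sketched in plain Python. This is not the Chorus bridge API: the function `infer_schema`, the type names it returns, and the list-of-dicts table (a stand-in for a data frame) are all assumptions made for illustration.

```python
def infer_schema(rows):
    """Infer an ordered list of (column, type_name) pairs from tabular rows.

    `rows` is a list of dicts sharing the same keys, used here as a
    simple stand-in for a data frame. Type names are illustrative only.
    """
    if not rows:
        raise ValueError("cannot infer a schema from an empty table")

    schema = []
    for column in rows[0]:
        # Collect the Python types seen in this column, ignoring missing values.
        seen = {type(row[column]) for row in rows if row.get(column) is not None}
        if seen == {bool}:
            type_name = "boolean"
        elif seen and seen <= {int}:
            type_name = "integer"
        elif seen and seen <= {int, float}:
            type_name = "double"      # mixed ints and floats widen to double
        else:
            type_name = "string"      # fallback for text, mixed, or all-missing columns
        schema.append((column, type_name))
    return schema


rows = [
    {"id": 1, "score": 0.5, "name": "a"},
    {"id": 2, "score": 1, "name": None},
]
print(infer_schema(rows))
```

A real data frame library already carries this column and type information, which is why inferring the schema from a data frame is usually the most convenient route.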
Python Notebooks run in Docker containers that are pre-loaded with a useful set of default packages (see the full package list here). If you’d like to install extra packages from inside the Notebook environment, type !pip install yourpackagename in the first cell. If your environment is configured such that the Notebook container can’t access the internet, we can work with you to build a custom Docker image that has the packages you’d like already in place.
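If a Notebook is re-run often (for example as a scheduled job), you may prefer to skip the install when the package is already present. The sketch below is one way to do that with the standard library only; the helper name `ensure_package` is our own invention, and it assumes the container can reach a package index when an install is actually needed.

```python
import importlib.util
import subprocess
import sys


def ensure_package(import_name, pip_name=None):
    """Install a package with pip only if it cannot already be imported.

    `import_name` is the top-level module name; `pip_name` is the name on
    the package index when it differs. Returns True if an install was
    attempted, False if the package was already available.
    """
    if importlib.util.find_spec(import_name) is not None:
        return False
    # Use the Notebook's own interpreter so the package lands in the right environment.
    subprocess.check_call(
        [sys.executable, "-m", "pip", "install", pip_name or import_name]
    )
    return True


# Hypothetical usage in the first cell:
# ensure_package("yourpackagename")
```

This does the same job as the !pip install line, just guarded by an import check.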
Bringing Python Notebooks to the Visual Workflow Editor will add a new layer of flexibility and power to Chorus. Stay tuned for more feature spotlight blogs on topics like model management, custom operators, and Spark Autotuning!