Hadoop data warehouses have continued to gain popularity, with solutions such as Hive, Impala and HAWQ now frequently deployed at customer sites. Access to these warehouses is typically tightly controlled using Ranger or Sentry, which helps ensure comprehensive data security. Because data is so easy to govern in Hive, an increasing number of IT departments are locating all of their Hadoop data there and requiring that data scientists interact with their data only via Hive.
Data residing in these warehouses is typically accessed via SQL. This is great for ETL and reporting, but can be limiting when data scientists want to apply advanced analytics, for example to train a random forest. Fortunately, Chorus provides functionality that lets users apply advanced analytics directly to Hive data sets. To deliver this functionality, Chorus transparently uses MapReduce and Spark to perform these operations, as shown below:
In the above flow:
1. The initial data resides in a Hive table
2. The data can be optionally manipulated using HQL queries
3. The data is split into test and train components using a basic MapReduce operation
4. A logistic regression model is trained on the training split, using Spark
5. The model accuracy is evaluated using the test split of the data
6. All interim results can be optionally retained as Hive tables
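For readers who want to see roughly what these six steps involve when written by hand, here is a minimal PySpark sketch. It is not how Chorus implements the flow: it does everything in Spark (including the split, which Chorus runs as a MapReduce operation), and the function names, table names and column names are all hypothetical. It also assumes a Hive-enabled SparkSession, numeric feature columns and a 0/1 label column.

```python
def build_select(table, feature_cols, label_col):
    """Step 2 (optional): HQL that keeps the needed columns and drops unlabeled rows."""
    cols = ", ".join(list(feature_cols) + [label_col])
    return f"SELECT {cols} FROM {table} WHERE {label_col} IS NOT NULL"


def run_flow(spark, table, feature_cols, label_col, out_prefix, seed=42):
    """Steps 1-6 on a Hive-enabled SparkSession; returns accuracy on the test split."""
    # Imported here so the module can be loaded without PySpark installed.
    from pyspark.sql.functions import col
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator

    # Steps 1-2: read the Hive table through an HQL query (hypothetical table).
    data = spark.sql(build_select(table, feature_cols, label_col))
    data = data.withColumn(label_col, col(label_col).cast("double"))

    # Step 3: random 80/20 split, done in Spark here rather than MapReduce.
    # Cached to avoid recomputing the splits each time they are used.
    train, test = [part.cache() for part in data.randomSplit([0.8, 0.2], seed=seed)]

    # Step 4: assemble the feature columns into a vector and fit the model.
    assembler = VectorAssembler(inputCols=list(feature_cols), outputCol="features")
    lr = LogisticRegression(featuresCol="features", labelCol=label_col)
    model = lr.fit(assembler.transform(train))

    # Step 5: score the test split and compute accuracy.
    predictions = model.transform(assembler.transform(test))
    evaluator = MulticlassClassificationEvaluator(
        labelCol=label_col, predictionCol="prediction", metricName="accuracy")
    accuracy = evaluator.evaluate(predictions)

    # Step 6: retain interim results as Hive tables (names are made up).
    train.write.mode("overwrite").saveAsTable(f"{out_prefix}_train")
    test.write.mode("overwrite").saveAsTable(f"{out_prefix}_test")
    (predictions.select(label_col, "prediction")
        .write.mode("overwrite").saveAsTable(f"{out_prefix}_predictions"))
    return accuracy
```

With a session created via SparkSession.builder.enableHiveSupport().getOrCreate(), a call might look like run_flow(spark, "sales.churn", ["tenure", "monthly_spend"], "churned", "sales.churn_lr"), where every table and column name is invented for illustration.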
Hive functionality in Chorus provides data scientists with an intuitive visual interface for performing advanced analytics without having to write complex SQL queries or even a single line of code. It is also fully compatible with secured Hadoop clusters, including those secured with Kerberos (with support for impersonation), Ranger and Sentry, so that IT policies and data governance are enforced.
All this is great news for our customers, and we’re seeing significant interest in this functionality. Stay tuned for subsequent blog posts where I’ll discuss how we achieve this support.