Last week I had the opportunity to speak at Spark Summit in San Francisco. It was great to learn how different businesses are using Spark to meet their needs, and to see how quickly Spark has been evolving. Spark has become one of the most powerful open source processing engines, and I was able to share how the data science team at Alpine deploys Spark at scale.
One issue we consistently see with our customers at Alpine is that they lack the ability to analyze complex data at scale. This is not a limitation of analytics platforms in general, but of conventional machine learning techniques. Complex data carries the curse of dimensionality: the more dimensions a dataset has, the worse standard machine learning techniques tend to perform. As a consequence, instance-based learners and anomaly detection, which often rely on distances between points, tend to perform poorly on datasets with a large number of dimensions.
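To make the curse of dimensionality a little more concrete, here is a minimal Python sketch of one well-known symptom: as dimensions are added, the nearest and farthest neighbors of a point end up almost equally far away. The function name `relative_contrast`, the uniform random data, and the sample sizes are all assumptions chosen for illustration; none of this comes from our customers' data.

```python
import math
import random


def relative_contrast(num_points, dim, rng):
    """Return (d_max - d_min) / d_min for the distances from one random
    query point to num_points random points in the unit cube [0, 1]^dim."""
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(num_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(math.dist(query, point))
    return (max(distances) - min(distances)) / min(distances)


rng = random.Random(0)  # fixed seed so repeated runs give the same numbers
contrast_by_dim = {
    dim: relative_contrast(200, dim, rng) for dim in (2, 10, 100, 1000)
}

for dim, contrast in contrast_by_dim.items():
    print(f"dim={dim:5d}  relative contrast={contrast:.3f}")
```

A small relative contrast means "nearest" and "farthest" are nearly indistinguishable, which is one reason distance-based methods struggle on high-dimensional data.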
One technique I have learned for tackling this problem is topological data analysis (TDA). TDA helps by capturing both global behavior and localized structure while limiting information loss in complex data. Specifically, the Mapper algorithm generates a topological summary of the whole dataset while preserving local structural information.
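For readers who have not seen Mapper before, the general recipe can be sketched in a few lines of single-machine Python. This is a toy illustration of the textbook algorithm, not the implementation I presented: the helper names (`cover_intervals`, `single_linkage`, `mapper`), the x-coordinate filter, and the parameter values are all assumptions picked to keep the example small. The idea is to project the data with a filter function, cover the filter range with overlapping intervals, cluster the points that fall in each interval, and connect clusters that share points.

```python
import math
from itertools import combinations


def cover_intervals(lo, hi, n_intervals, overlap):
    """Split [lo, hi] into equal intervals, each widened on both sides
    by a fraction `overlap` of its length so neighbors overlap."""
    length = (hi - lo) / n_intervals
    pad = overlap * length
    return [(lo + i * length - pad, lo + (i + 1) * length + pad)
            for i in range(n_intervals)]


def single_linkage(indices, points, eps):
    """Cluster the given point indices: points within eps of each other
    are linked, and clusters are the connected components (union-find)."""
    parent = {i: i for i in indices}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for a, b in combinations(indices, 2):
        if math.dist(points[a], points[b]) <= eps:
            parent[find(a)] = find(b)

    clusters = {}
    for i in indices:
        clusters.setdefault(find(i), set()).add(i)
    return list(clusters.values())


def mapper(points, filter_fn, n_intervals, overlap, eps):
    """Return (nodes, edges): nodes are sets of point indices, and an
    edge joins two nodes that have at least one point in common."""
    values = [filter_fn(p) for p in points]
    nodes = []
    for lo, hi in cover_intervals(min(values), max(values),
                                  n_intervals, overlap):
        members = [i for i, v in enumerate(values) if lo <= v <= hi]
        nodes.extend(single_linkage(members, points, eps))
    edges = [(a, b) for a, b in combinations(range(len(nodes)), 2)
             if nodes[a] & nodes[b]]
    return nodes, edges


# Toy data: 60 evenly spaced points on a unit circle.
circle = [(math.cos(2 * math.pi * k / 60), math.sin(2 * math.pi * k / 60))
          for k in range(60)]

# Filter on the x-coordinate (an arbitrary choice for this example).
nodes, edges = mapper(circle, filter_fn=lambda p: p[0],
                      n_intervals=4, overlap=0.25, eps=0.3)
print(len(nodes), "nodes,", len(edges), "edges")
```

On this toy circle the summary comes out as six nodes joined in a single loop, so the graph keeps the one feature that matters about a circle (it has a hole) while throwing away the individual points. The pairwise clustering step here is quadratic in the number of points per interval, which hints at why single-machine implementations run into trouble on large datasets.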
There are open source implementations of Mapper available in Python and R. However, these implementations are significantly limited in how much data they can process. We thought it would make sense to build a scalable version of Mapper on top of Apache Spark, in order to take advantage of Spark’s ability to distribute computations across multiple nodes. Distributing the computation this way would enable the analysis of enterprise-scale data, potentially extending to terabytes.
I believe this is the first scalable implementation of Mapper on Apache Spark. We are working hard to clean up the code and contribute it to the open source community as well!
If you would like to learn a little more about this, I have attached slides from my presentation here. Take a look and let us know what you think!