Integrating interactive data visualization with data analysis systems is highly desirable, but difficult to achieve. Visualizing data gives data scientists the opportunity to leverage visual perception to identify patterns and correlations in data sets during the initial data exploration phase. Visualizations can also be a powerful communication tool to explain insights from data science to a larger audience.
At Alpine, we are building an Open Source framework for interactive data visualization called Chiasm. This project provides a browser-based runtime environment and component architecture for interactive data visualizations. It allows plugins for data access, data transformation, and interactive visualization to be loaded and configured dynamically.
We recently held an event at Alpine Data as part of the SF Big Analytics Meetup: Open Source Project: The Chiasm Data Visualization Platform. The event was lively with an engaged audience and lots of great discussions. Here’s the recording of the event on YouTube: The Story of the Chiasm Project.
The core concept of the Chiasm project is that data visualizations can be instantiated, configured with data, arranged on the screen, and coupled together to produce interactive linked views by simply manipulating the application configuration (a JSON data structure). This organization allows a dynamic configuration structure to drive the state of the application, and also allows changes resulting from user interactions with runtime components to be propagated back to the configuration. This makes it possible to build a system that stores and retrieves user-generated Chiasm configurations.
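To make this concrete, here is a sketch of what such a configuration might look like. The component names, plugin identifiers, and property names below are illustrative assumptions, not the exact Chiasm plugin API:

```javascript
// Hypothetical Chiasm-style configuration: each key names a component
// instance, "plugin" selects the module to load dynamically, and "state"
// holds its configurable properties. Linked views are expressed by
// referencing other components by name (here, both charts read "dataset").
const config = {
  layout: {
    plugin: "layout",
    state: {
      containerSelector: "#chiasm-container",
      // Nested box layout: scatter plot above, bar chart below.
      layout: { orientation: "vertical", children: ["scatter", "bars"] }
    }
  },
  dataset: {
    plugin: "csvLoader",
    state: { url: "data/iris.csv" }
  },
  scatter: {
    plugin: "scatterPlot",
    state: { data: "dataset", xColumn: "sepalLength", yColumn: "petalLength" }
  },
  bars: {
    plugin: "barChart",
    state: { data: "dataset", xColumn: "species" }
  }
};
```

Because the entire application state is a plain JSON data structure like this, it can be serialized, stored, and later reloaded to reconstruct the same arrangement of linked views.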
So far, examples have been created that demonstrate nested box layout, loading data, bar charts, scatter plots, line charts, choropleth maps, pie charts, donut charts, various patterns of linked views, and interactive multidimensional filtering. This is only the beginning. The Chiasm project aims to provide a path by which any data visualization example found on the Web can be turned into a reusable Chiasm component, including interaction techniques such as hovering, picking, brushing, and tooltips.
At Alpine, we are currently integrating the Chiasm system into our analytics platform. Some of the interesting issues we have been dealing with involve how to visualize "Big Data". Since there are only so many pixels on the screen, it is not feasible to visualize every element of a large data set directly. In order to visualize "Big Data", it must first be reduced to small data, then visualized.
There is a great paper about the "imMens" system that contains an overview of various methods of data reduction that can be used in conjunction with visualization (see the section "3. Data Reduction Methods"). For example, rather than show a scatter plot with billions of points, you could compute a two-dimensional histogram, where each square in, for example, a 100 × 100 grid, shows the count of points in that bucket. This kind of data reduction is called "binned aggregation", and can be used to preserve salient features of the data set (like its distribution) while reducing the number of visual marks that need to be drawn (and reducing the amount of data that needs to be sent to the browser). In addition to binned aggregation, filtering and sampling are also useful and practical data reduction techniques.
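The binning step itself is simple; a minimal sketch of 2D binned aggregation (assuming numeric points and a fixed grid, independent of any particular library) might look like:

```javascript
// Minimal sketch of 2D binned aggregation: count points per grid cell.
// Assumes points are {x, y} objects within [xMin, xMax) × [yMin, yMax).
function bin2D(points, xMin, xMax, yMin, yMax, gridSize) {
  // counts[row][col] holds the number of points in that cell.
  const counts = Array.from({ length: gridSize }, () =>
    new Array(gridSize).fill(0)
  );
  for (const { x, y } of points) {
    // Map each coordinate to a cell index, clamping the upper edge
    // so points exactly at the maximum fall into the last cell.
    const col = Math.min(
      gridSize - 1,
      Math.floor(((x - xMin) / (xMax - xMin)) * gridSize)
    );
    const row = Math.min(
      gridSize - 1,
      Math.floor(((y - yMin) / (yMax - yMin)) * gridSize)
    );
    counts[row][col] += 1;
  }
  return counts;
}
```

The visualization then draws at most gridSize² marks (for example, a heatmap), regardless of how many input points there were.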
Within the Alpine analytics platform, we are investigating two alternative ways of visualizing large datasets, summarized in the diagram above.
The first option (top) is to leverage cluster compute resources to perform filtering and aggregation on the entire “Big Data” data set. In this scenario, every time the user interacts with the visualization to modify the filtering or aggregation parameters, a full data reduction job would be run over the entire data set (using Apache Spark). This has the benefit of being accurate, which is good for long-lived documents like reports, but has the disadvantage of making visual analysis iterations slow.
The second option (bottom) is to leverage cluster compute resources to perform a random sample of the data set once at the beginning of a visualization session. Once the random sample is in browser memory, it can be interactively filtered and aggregated on the client side. In this scenario, the user pays a single up-front cost of waiting for the random sample to be computed, then can freely adjust the filtering and aggregation parameters. This has the benefit that visual analysis iterations are extremely fast (milliseconds, because the data is local in the browser), which is great for exploratory data analysis, but has the disadvantage that the results are not completely accurate, because they are based on a random sample.
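In the browser, those interactive iterations reduce to plain array operations over the in-memory sample. A minimal sketch (the row shape and column names here are illustrative, not tied to any particular Chiasm plugin):

```javascript
// Minimal sketch of client-side filter + aggregate over an in-memory sample.
// Each user interaction re-runs this over the sampled rows, which takes
// milliseconds because the data is already local to the browser.
function filterAndAggregate(rows, predicate, groupColumn) {
  const counts = {};
  for (const row of rows) {
    if (!predicate(row)) continue;  // filter parameters from the UI
    const key = row[groupColumn];   // aggregation parameter from the UI
    counts[key] = (counts[key] || 0) + 1;
  }
  return counts;
}
```

A new predicate or grouping column from a brush or dropdown simply triggers another pass over the sample; no round trip to the cluster is needed, though the resulting counts are estimates rather than exact values.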
Overall, the Chiasm project is just getting started (it was started in February 2015), and there are many directions for new features that we would like to explore. There are many places in a data science workflow where interactive visualization would be useful. We would love to hear any feedback you may have on the project, and Open Source contributions are welcome. For more information, check out the Chiasm GitHub repository, and feel free to join and post to the Chiasm mailing list.