Researchers and business analysts have witnessed two parallel paths of development in analytics technology over the last few decades: databases and other datastores as repositories of information to be analyzed (together with an engine for querying that information); and analytics applications that provide users with mechanisms for easily generating business insights from their data. These two layers of the analytics technology stack have been trying to keep up with each other for over thirty years, leading to ever-increasing sophistication and scalability in analytics capabilities.
The history of databases properly begins in the 1960s, but the sort of relational databases that we would recognize really got going in the 1970s, substantially driven by work at IBM and at the University of California, Berkeley. By the early 1980s there was an abundance of commercial databases including DB2 and Oracle, and the market matured during the 1990s as large numbers of software vendors produced increasingly complex products built on top of these data platforms, such as customer relationship management and enterprise resource planning applications.
Offline, the data generated by all these new applications created a rich source of new analytics. In the 1980s Teradata pioneered the commercial development of large-scale data warehouses through massively parallel processing (MPP), in which data queries are executed in parallel on a number of independent or semi-independent servers; and companies like banks and telcos started to create rich repositories of financial audit trails and customer histories.
In recent years, there has been a well-documented explosion in the volume of data to be managed, as hosted applications, websites and finally mobile applications served ever larger audiences. Internet companies like Google and Facebook created a need for rapid and concurrent access to more dynamic data structures, and in the rush to fill the gap left by the traditional databases a number of open-source NoSQL systems, such as Cassandra and HBase, suddenly appeared.
Meanwhile, for handling data with an almost extreme level of flexibility, the Hadoop framework, which grew out of ideas published by Google and engineering work at Yahoo!, represented an effort to reliably and efficiently handle massive data-processing loads across many low-end computers. At its heart is the MapReduce paradigm, an analog of basic SQL aggregation queries but able to work with almost any data. Hadoop has since multiplied into a variety of commercial and open-source distributions that are increasingly becoming the default platform for large-scale ETL and analytics.
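To make the comparison with SQL aggregation concrete, here is a minimal single-process sketch of the map, shuffle and reduce steps in Python. It is an illustration of the paradigm only, not Hadoop's API: the function names and the sample records are hypothetical, and a real framework would spread these phases across many machines. The result corresponds roughly to `SELECT region, SUM(amount) ... GROUP BY region`.

```python
from collections import defaultdict

# Hypothetical sample records, invented for illustration.
records = [
    {"region": "east", "amount": 10},
    {"region": "west", "amount": 5},
    {"region": "east", "amount": 7},
    {"region": "north", "amount": 3},
    {"region": "west", "amount": 1},
]


def map_phase(record):
    """Emit (key, value) pairs for one input record."""
    yield record["region"], record["amount"]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single result."""
    return key, sum(values)


def map_reduce(data):
    """Run map, shuffle and reduce in sequence on an in-memory dataset."""
    pairs = (pair for record in data for pair in map_phase(record))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


totals = map_reduce(records)
print(totals)
```

Because the map and reduce functions are ordinary code rather than SQL expressions, the same skeleton can be pointed at log lines, documents or other loosely structured input, which is the flexibility the paragraph above describes.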
Newer MPP databases such as Vertica and Greenplum then tried to leapfrog Teradata by emulating the scalability, cost and flexibility of Hadoop and NoSQL, and even creating their own hybrid platforms. Meanwhile the less traditional technologies have been in a race to provide simple SQL interfaces and application layers to mask the complexity of MapReduce and NoSQL.
Sitting atop these data platforms (to a large extent) is the world of analytics tools and applications, a spectrum of technologies which is itself split broadly into two areas. The first is descriptive analytics, or business intelligence, which generally offers the ability to summarize datasets through basic aggregations and grouping, with facilities for drilling down into areas of interest. The second is advanced (or predictive, or inferential) analytics, which applies statistical and mathematical modeling techniques to data to draw conclusions that aren’t already explicitly present in the data, such as correlations or time series predictions.
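The difference between the two areas can be sketched in a few lines of Python. The example below is an illustrative assumption, not taken from any particular product: the sales figures are invented, and `total_by` and `fit_line` are hypothetical helpers. The first half summarizes and drills down (descriptive); the second half fits a least-squares trend line and extrapolates one month ahead (predictive).

```python
from collections import defaultdict

# Hypothetical sales records: (month index, region, revenue). Invented data.
sales = [
    (1, "east", 100.0), (1, "west", 50.0),
    (2, "east", 110.0), (2, "west", 60.0),
    (3, "east", 120.0), (3, "west", 70.0),
]


def total_by(rows, key_index):
    """Sum revenue (column 2) grouped by the column at key_index."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key_index]] += row[2]
    return dict(totals)


# Descriptive analytics: aggregate, group, then drill down into one region.
by_region = total_by(sales, 1)
by_month = total_by(sales, 0)
east_by_month = total_by([row for row in sales if row[1] == "east"], 0)


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Predictive analytics: extrapolate total monthly revenue to month 4,
# a value that is not present anywhere in the data itself.
months = sorted(by_month)
slope, intercept = fit_line(months, [by_month[m] for m in months])
forecast_month_4 = slope * 4 + intercept

print(by_region, east_by_month, forecast_month_4)
```

The descriptive results are all recoverable by reading the rows directly, whereas the forecast depends on a modeling assumption (here, a linear trend) that the analyst has to choose and justify.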
It was advanced analytics that got off to an earlier start through academic projects. The grandfather of them all, SAS, started as an academic project centered at North Carolina State to analyze agricultural data for the US government, and was incorporated as a commercial entity in 1976. Similarly, SPSS started at the University of Chicago as a statistical package for the social sciences (hence the name) and became a successful commercial venture in the 1980s.
Meanwhile, at Bell Labs in 1976 a group in statistical computing invented the S language for data analysis and graphics. It never really took off, but it did inspire two professors at the University of Auckland in New Zealand to create R, an open-source implementation of the S language that has become enormously popular in the last two decades.
On the other end of the analytics spectrum, business intelligence (or BI) got going in the 1980s with QUIZ from Cognos, along with basic reporting and OLAP tools aimed at particular functions or verticals (e.g. Express from IRI for marketing, Comshare’s System W for financial applications), and the first ROLAP application, Metaphor, designed for the consumer packaged goods (CPG) industry. BI and reporting reached a broader audience with general applications like Crystal in 1991, Actuate in 1993, and so on.
The powerful combination of BI applications and data warehouses started in the late 1980s and exploded in the 1990s, with Essbase providing the template, followed by a move to hybrid MOLAP/ROLAP in the mid ’90s (MicroStrategy, BusinessObjects). The database vendors responded with OLAP extensions to SQL that make it even easier to aggregate and analyze data within the database.
Since then, BI has followed two broad trends: making it easy and making it scale. On the first, the approach has become increasingly visual and accessible: applications are now largely web-based and rely on simple drag-and-drop metaphors, and tools like Tableau have broad appeal. On the second, there is widespread support for MPP databases, and new vendors like Platfora and Datameer are trying to see if BI can be made to fit on top of Hadoop. After twenty years, BI has arguably reached a level of maturity.
Now appears to be the time for advanced analytics to catch up with BI, and there’s a broad acceptance that the methods of mathematical modeling and statistical analysis should be a standard part of the analytics arsenal for every organization, and for a much broader set of users. The parallel with BI is clear: make the solutions scale by integrating with the most powerful data platforms, taking advantage of even more advanced SQL extensions and open-source analytics libraries for Hadoop; and reduce the complexity and use visual metaphors to make it more accessible.
In general, as databases and other data platforms have increased in complexity and power, the analytics applications have raced to take advantage of them – and then generated yet more advanced requirements. The query layers and the parallelism (in particular) of modern platforms can now support sophisticated analytics applications on even the largest datasets. This is really the definition of ‘big data’: a powerful array of analytics capabilities from reporting to mathematical modeling, made accessible to the broadest possible audience, fueled by unlimited data.