According to Amazon, “A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data, and run different types of analytics—from dashboards and visualizations to big data processing, real-time analytics, and machine learning to guide better decisions.”
At the start of this decade, the release of commercially supported distributions of Apache Hadoop software enabled service providers to deploy vast data lakes containing a wide variety of data types (structured and semi-structured) aggregated from a broad range of sources. In practice, however, collecting and storing the raw data proved easier than extracting actionable insights in real time to inform network, service and field operations.
Front-Line Operations Teams Need Real-Time, Actionable Intelligence
Complex queries on massive Hadoop clusters might involve running multiple MapReduce jobs that could take hours to complete. Resource-intensive queries therefore had to be scheduled infrequently, and it could take a day or longer before useful intelligence reached operators. This is fine for teams using business intelligence to track trends in network and service utilization or to inform long-term planning decisions (like capacity planning), but front-line operations teams want actionable intelligence in seconds or at least minutes, not hours or days.
This shortcoming spurred the open source community to develop Apache Spark, whose streaming engine processes Big Data using an in-memory “micro-batching” technique that facilitates continuous analysis of data as it is collected. Spark-based streaming analytics can generate insights within seconds or minutes, depending on query complexity and the total volume of data processed.
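To make the micro-batching idea concrete, here is a minimal plain-Python sketch (not Spark itself, and not production code): an incoming event stream is grouped into small fixed-size batches, and a running aggregate is refreshed once per batch rather than once per giant overnight job. The event names and batch size are illustrative assumptions.

```python
from collections import Counter
from typing import Iterable, Iterator, List

def micro_batches(stream: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Group an incoming stream into fixed-size micro-batches."""
    batch: List[str] = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly short, batch
        yield batch

def streaming_counts(stream: Iterable[str], batch_size: int = 3):
    """Maintain a running count per event type, updated after each
    micro-batch -- the pattern a streaming job uses to keep results
    fresh within seconds instead of hours."""
    totals: Counter = Counter()
    for batch in micro_batches(stream, batch_size):
        totals.update(batch)      # process the whole batch in one shot
        yield dict(totals)        # emit the refreshed aggregate

# Illustrative event stream (e.g., protocol tags from network probes)
events = ["dns", "http", "dns", "tls", "http", "dns", "dns"]
snapshots = list(streaming_counts(events, batch_size=3))
```

Each snapshot is available as soon as its batch completes, so downstream dashboards see results that are at most one batch interval stale.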
Real-time analytics serves a broad range of operational use cases
A Decade of Big Data Innovations
Hadoop and Spark firmly established the value of Big Data analytics, with active open source communities developing a rich set of tools and analytics engines to support a wide range of needs, from complex batch processing to real-time analytics. Over the past decade, innovation in Big Data software has proceeded at an amazing pace, including the development of NoSQL key-value stores, column-oriented databases for multi-dimensional analysis of time series data, the “ELK” stack for analyzing text-based data such as log files, and machine learning libraries with algorithms optimized for Big Data.
Why is there such diversity in the realm of Big Data analytics software? Because it is impractical to serve the needs of a wide range of stakeholders from a single central repository containing only raw data. This is certainly the case in service provider operations where many different types of data are collected from a wide array of sources. In these environments, a “data lake” becomes a collection of repositories and analytics engines that support multiple applications and use cases.
The Art of Data Curation
More importantly, these Big Data repositories need to be carefully curated and data properly formatted during the ingestion process so that raw data is transformed into records structured for rapid data storage, retrieval and analysis. In this context, the art of data curation requires Big Data architects and data scientists to have a full understanding of the operational needs of all stakeholders. They need to know in advance what types of data have to be collected and how the data should be processed during ingestion so that various analytics engines can efficiently execute the set of anticipated queries.
Extracting actionable intelligence from Big Data in real time requires significant up-front investment to define operational requirements for each stakeholder and to carefully curate data so that repositories are properly structured for real-time analytics. This is quite different from simply collecting raw data, storing it in a central data lake and analyzing it later.
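As a sketch of what "structured for anticipated queries" can mean in practice, the snippet below buckets raw measurements at ingest time into (cell, hour) keys, so a hypothetical query such as "traffic per cell per hour" becomes a direct lookup rather than a scan over raw records. The field names and the traffic query are illustrative assumptions, not a specific vendor design.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Aggregate keyed by (cell_id, hour bucket) -- the shape of the
# anticipated query, decided up front by architects and stakeholders.
hourly_traffic = defaultdict(int)

def ingest(measurement: dict) -> None:
    """Transform a raw measurement during ingestion: derive the hour
    bucket from the timestamp and fold the bytes into the aggregate."""
    ts = datetime.fromtimestamp(measurement["ts"], tz=timezone.utc)
    key = (measurement["cell_id"], ts.strftime("%Y-%m-%dT%H:00"))
    hourly_traffic[key] += measurement["bytes"]

# Two raw measurements falling in the same hour (illustrative values)
ingest({"cell_id": "cell-7", "ts": 1_700_000_000, "bytes": 500})
ingest({"cell_id": "cell-7", "ts": 1_700_000_100, "bytes": 300})
```

Because the aggregation happens on the write path, the read path stays cheap enough for operational dashboards; the trade-off is that queries nobody anticipated may require reprocessing.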
Data transformation might involve translating and combining data from different sources into a single record, or enriching raw data with additional data from other sources. While a handful of raw data types might be stored directly in a repository without modification, it is usually necessary to transform or enrich collected raw data before storing it.
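The enrichment step above can be sketched in a few lines. In this hypothetical example, a raw flow record is joined at ingest with subscriber attributes from a second source, and a normalized unit is derived, so downstream queries need no join at read time. All field names, the lookup table, and the unit conversion are illustrative assumptions.

```python
# Hypothetical lookup table fed from a second source (e.g., a CRM or
# inventory system), keyed by subscriber identifier.
subscribers = {
    "sub-42": {"plan": "gold", "region": "west"},
}

def enrich(flow: dict, lookup: dict) -> dict:
    """Combine a raw flow record with subscriber attributes and derive
    a normalized unit, producing one self-contained stored record."""
    meta = lookup.get(flow["subscriber_id"], {})
    return {
        **flow,
        **meta,
        "bytes_mb": round(flow["bytes"] / 1e6, 2),  # derived field
    }

raw = {"subscriber_id": "sub-42", "bytes": 1_500_000}
record = enrich(raw, subscribers)
# record now carries plan, region and bytes_mb alongside the raw fields
```

Unknown subscribers simply pass through without the extra attributes here; a production pipeline would also have to decide how to flag or quarantine such records.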
Complement In-house Talent with Proven Vendor Expertise
Developing, deploying and maintaining real-time analytics solutions for operational intelligence is challenging for service providers and large enterprises alike. Software engineers, solution architects and data scientists need the relevant operational experience coupled with proven Big Data expertise. Commercially supported Big Data open source software is readily available from multiple vendors, but Big Data talent is in short supply – across all industries. For this reason, operations teams on the path of moving beyond data lakes would be wise to seek out vendors who have proven expertise delivering productized Big Data analytics solutions for real-time operational intelligence in their domain.
Image attributions: Bigstockphoto.com