How data lakes are revolutionizing information management

Big Data Meets Data Lake: Big Data, Their Challenges and Data Lake Architectures as a Solution


The role that data plays today is undeniable: it has a profound impact on our lives and permeates almost every aspect of them. The importance of the exponential increase in data availability was clearly recognized in the early 2000s. People started talking about big data and, more importantly, realized that its specific characteristics pose challenges that require technologies radically different from those the software industry had offered up to that point. Data lakes provide an answer to the challenges of big data.

In this context, we introduce big data and its challenges, and then focus on data lakes as a possible solution. Finally, we review the major architectural blueprints for data lakes that have emerged over the past 10 years.

Big Data

Over the past three decades, advances in information technology have made it possible to produce, cost-effectively and at an ever-accelerating rate, a growing volume of structured, semi-structured, and unstructured data from a wide variety of sources. This data is commonly referred to as big data. Big data has opened up new possibilities in the field of business intelligence, allowing analytical models for predictive and prescriptive purposes to be combined with those for descriptive and diagnostic needs, thanks in part to recent advances in machine learning.

Research has attributed this development mainly to five socio-technological factors [1]:

  1. The development of increasingly inexpensive data storage media, sensors, and smart devices,
  2. The emergence of social networks, multiplayer video games, and the Internet of Things,
  3. The emergence of cloud computing and major advances in multi-core CPUs,
  4. The availability of open-source software,
  5. The democratization of data, i.e., making it available to a wide range of users regardless of their technical skills and computing resources.

In the late 1950s, Hans Peter Luhn recognized this phenomenon and coined the term business intelligence. However, it wasn't until the early 2000s that industry and academia began to converge on the key aspects of big data and the terminology to describe it [38]. In 2001, Doug Laney, then a Data & Analytics Strategy Innovation Fellow at Gartner [23], wrote: “Big data is high-volume, high-velocity, and/or high-variety information assets that require cost-effective, innovative forms of information processing that enable improved insight, decision making, and process automation” [22, 35]. This quote has become a catchphrase. It is important to note that the three characteristics Laney ascribes to big data, namely high volume, high velocity, and high variety (also known as the 3 Vs), are not by themselves enough to make data big data: the data must also support business process automation, the generation of business insights, and decision-making processes [35].

Let us briefly introduce the 3 Vs. Volume refers to the amount of data that needs to be stored and processed. Although it often ranges from a few terabytes to several petabytes, it is preferable not to set a minimum threshold for what counts as big data: this size shifts with the evolution of available storage capacity and also depends on the type of data. For example, two datasets of the same size may require different data management technologies and significantly different processing complexity and computation time. Velocity refers to the rate at which information is generated and the time available to evaluate and process it, and can vary over time depending on the source. Variety represents the heterogeneous nature of the data. Because big data often comes from very different sources, such as social networks, e-commerce hubs, or sensors, it varies widely in structure and format and can be structured (e.g., CSV), semi-structured (e.g., XML or JSON), or unstructured (e.g., text and video) [2, 20, 32].

In addition to Laney's three Vs, several other attributes have been proposed over the past decade to characterize big data. Among these, veracity and value have been deemed relevant by research and industry and are now commonly associated with big data. Veracity was introduced by IBM to address the uncertainty inherent in data from certain types of sources. Value, on the other hand, emphasizes that data must add value to the business in order to be considered big data. It is interesting to note that in most cases, big data is characterized by a rather low "value density": in its raw state, it offers little value relative to its volume; only through analysis can its true value be revealed [20, 32]. In summary, when we talk about the 5 Vs of big data, we are referring to the dimensions of volume, velocity, variety, veracity, and value by which big data can be characterized [2, 10, 30, 45].
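To make the notion of variety concrete, the following minimal Python sketch shows the same piece of information expressed as structured, semi-structured, and unstructured data; the field names and values are invented purely for illustration.

```python
import csv
import io
import json

# The same (invented) customer feedback in three forms.
csv_record = "customer_id,product,rating\n42,sensor-kit,5\n"                                # structured: fixed schema
json_record = '{"customer_id": 42, "product": "sensor-kit", "rating": 5, "tags": ["iot"]}'  # semi-structured
text_record = "Customer 42 loved the sensor kit and gave it five stars."                    # unstructured

structured = next(csv.DictReader(io.StringIO(csv_record)))  # parsed against a fixed header row
semi_structured = json.loads(json_record)                   # self-describing, possibly nested
unstructured = text_record                                  # no schema; needs text analytics to extract value

print(structured["rating"], semi_structured["tags"], len(unstructured.split()))
```

Each form calls for different storage and processing techniques, which is precisely why variety strains systems built around a single, rigid schema.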

Big Data Challenges for Traditional Storage and Processing Systems

The 5 Vs present a new challenge to traditional data management technologies. While the speed of communications hardware continues to increase, the same cannot be said for the speed of data processing. This creates a dichotomy between communication and computation, and as the volume of data increases, it raises the issue of time-to-information, or the time it takes to process the data from the moment it is received. Time-to-information, in turn, is relevant for defining the architecture of the processing system and the choice of computational engines and algorithms [32].

Another point to keep in mind is that data arrives at different speeds, frequencies, volumes, and levels of complexity, creating a stream whose formats, and sometimes the very nature of the data, can change dynamically. In either case, data transformation and analysis processes must adapt [32].

In addition, with respect to the value of data, the relationship between the quantity and quality of data is anchored in the present and shifts over time. As new technologies and algorithms make it possible to extract more and more information from data, it becomes a priority to be able to reprocess data as needed to meet new analytical needs. Assuming that the analytical importance of data is likely to emerge only over time, the tendency with big data is therefore to keep everything. But how can we effectively identify the data that is useful for a given ad hoc analysis, knowing that it is only a very small fraction of everything available? [32]

Ensuring the quality of data from external sources and its impact on the reliability of analytical results is also a challenge. For example, when verified data is combined with unverified data from external sources, as may be the case with social media data, the quality of the dataset as a whole cannot be guaranteed, and the conclusions drawn from its processing are of variable reliability. Is an approximation of accuracy sufficient? How much data is needed for accurate and reliable analysis? What is the "value" of data for decision making? Does more data necessarily lead to better results? [32]

Finally, how can you ensure data security and compliance with applicable laws and regulations? What privacy rules apply when combining information about an individual from multiple sources? [32]

Distributed Systems and Parallel Computing: The Emergence and Consolidation of Big Data Technologies

Over the past few decades, the challenges posed by the five Vs of big data have been addressed by both industry and research, and several solutions have been proposed. One early approach relied on a mix of existing technologies, such as relational databases and data warehouses. However, mainly due to the diversity of data and its sources, this approach often led to the consolidation of information silos: data management systems characterized by heterogeneous and insufficiently integrated schemas, query languages, and APIs [2, 28, 41, 43].

To overcome these limitations and address the challenges of big data, a new generation of software technologies based on the use of distributed systems and parallel computing emerged in the 2000s.

In 2003, Google's Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung presented the Google File System in a paper that would make big data history. The Google File System was a scalable distributed file system for large, data-intensive distributed applications, designed to run on inexpensive commodity hardware and provide high fault tolerance and aggregate performance to a large number of clients. At the time, the Google File System was widely used within Google to store data for the search engine and research and development activities, providing hundreds of terabytes of storage across thousands of disks on over a thousand machines accessed by hundreds of clients [46].
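As a rough illustration of the underlying idea (not Google's actual design or API), the Python sketch below splits a file into fixed-size chunks and places several replicas of each chunk on different nodes, so that the loss of a single machine does not lose data; the chunk size, replication factor, and placement rule are assumptions made for the example.

```python
import hashlib

# Toy sketch of chunk-and-replicate placement in a GFS-style distributed file
# system. CHUNK_SIZE and REPLICAS are illustrative values, not GFS's real ones.
CHUNK_SIZE = 64                          # bytes here; real systems use tens of megabytes
REPLICAS = 3                             # copies of each chunk on distinct nodes
NODES = [f"node-{i}" for i in range(8)]  # the (hypothetical) cluster

def place_chunks(filename: str, data: bytes) -> dict[str, list[str]]:
    """Return a mapping from chunk id to the nodes holding a replica of that chunk."""
    placement: dict[str, list[str]] = {}
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        chunk_id = f"{filename}:{offset}:{hashlib.sha1(chunk).hexdigest()[:8]}"
        # Deterministic but spread-out placement: hash the chunk id onto the node list.
        start = int(hashlib.sha1(chunk_id.encode()).hexdigest(), 16) % len(NODES)
        placement[chunk_id] = [NODES[(start + r) % len(NODES)] for r in range(REPLICAS)]
    return placement

if __name__ == "__main__":
    for chunk_id, nodes in place_chunks("weblog.txt", b"x" * 200).items():
        print(chunk_id, "->", nodes)
```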

This was followed in 2004 by a paper entitled “MapReduce: Simplified Data Processing on Large Clusters”, in which Jeffrey Dean and Sanjay Ghemawat, again at Google, introduced MapReduce, the technology Google developed to optimize the performance and cost of processing the massive amounts of data required by its search engine. MapReduce was intended to be both a functional programming model and a concrete implementation for processing and generating large datasets. As a functional programming model, MapReduce is straightforward: a “map” function processes a key/value pair and produces an intermediate output of new key/value pairs...
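To give a feel for the programming model just described, here is a minimal, single-process word-count sketch in Python. The function names and the in-memory shuffle step are illustrative assumptions; they do not reproduce Google's actual API or its distributed execution.

```python
from collections import defaultdict
from collections.abc import Iterator

def map_fn(doc_id: str, text: str) -> Iterator[tuple[str, int]]:
    """Map: emit an intermediate (word, 1) pair for every word in a document."""
    for word in text.lower().split():
        yield word, 1

def reduce_fn(word: str, counts: list[int]) -> tuple[str, int]:
    """Reduce: sum all intermediate counts emitted for the same word."""
    return word, sum(counts)

def run_mapreduce(documents: dict[str, str]) -> dict[str, int]:
    # Shuffle phase: group intermediate values by key before reducing.
    grouped: dict[str, list[int]] = defaultdict(list)
    for doc_id, text in documents.items():
        for word, count in map_fn(doc_id, text):
            grouped[word].append(count)
    return dict(reduce_fn(word, counts) for word, counts in grouped.items())

if __name__ == "__main__":
    docs = {"d1": "big data meets data lake", "d2": "data lakes and big data"}
    print(run_mapreduce(docs))  # {'big': 2, 'data': 4, 'meets': 1, ...}
```

In an actual MapReduce deployment, the map and reduce calls run in parallel across many machines, with the framework handling input partitioning, scheduling, and fault tolerance.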

