Batch, Streaming, and Relational Data | Voice of the DBA

 2 years ago
source link: https://voiceofthedba.com/2022/02/03/batch-streaming-and-relational-data/

Batch, Streaming, and Relational Data

This is part of a series on my preparation for the DP-900 exam. This is the Microsoft Azure Data Fundamentals exam, which is part of a number of certification paths. You can read various posts I’ve created as part of this learning experience.

The first part of the DP-900 skills document has these items:

  • describe batch data
  • describe streaming data
  • describe the difference between batch and streaming data
  • describe the characteristics of relational data

These are concepts that are important to this exam. I lightly blew these off when I started studying, but everyone else with guides and practice tests puts a lot of focus here. I’m glad I spent time here.

This post covers these concepts a bit. Note, these are more ETL/analytic concepts, not really

Batch Data

Most of my career deals with batch data, meaning a bunch of data that arrives at once and is imported into a system. This is different than a connection and query submitted to an OLTP system. The general idea is:

  • Lots of data
  • Processed periodically
  • Latency doesn’t matter.

Think these key words:

  • Not real-time
  • periodic
  • large/big/lots

There is an MS Docs article on this. The general idea is that you want to think about a scheduled (or some periodic) processing of lots of data for a purpose.

Examples of where batch is used:

  • Total up all hours worked last week for employees
  • Load and transform log files from all web servers each day
  • Import files from regional offices into a main database server
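As a rough illustration of the first example, here is a minimal Python sketch of a batch job. The timesheet rows, the employee names, and the `total_hours` function are all made up for this example. The point is only that the data piles up first and is then processed in one pass, whenever the job happens to run.

```python
from collections import defaultdict

# Hypothetical timesheet rows that accumulated during the week:
# (employee, day, hours worked).
TIMESHEET_ROWS = [
    ("alice", "2022-01-24", 8.0),
    ("bob", "2022-01-24", 7.5),
    ("alice", "2022-01-25", 8.5),
    ("bob", "2022-01-25", 8.0),
    ("alice", "2022-01-26", 4.0),
]


def total_hours(rows):
    """Process the whole accumulated data set in one pass (the batch)."""
    totals = defaultdict(float)
    for employee, _day, hours in rows:
        totals[employee] += hours
    return dict(totals)


if __name__ == "__main__":
    # In practice a scheduler would start this run periodically;
    # nobody is waiting on the result, so latency doesn't matter.
    for employee, hours in sorted(total_hours(TIMESHEET_ROWS).items()):
        print(f"{employee}: {hours} hours")
```

Nothing here is Azure specific. A service like ADF would be doing the scheduling and data movement around a step like this.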

In the analytics space, you’d be using Azure Data Factory (ADF), HDInsight (U-SQL, Hive, Pig, Spark), or Azure Data Lake Storage (ADLS).

Streaming Data

There is a course on this topic. When you think of streaming, think of these key words:

  • real-time
  • stream
  • data processed as soon as created
  • few transactions
  • monitoring or instant decision making

Streaming is really about time series, about tumbling windows, about data like a stock ticker that you need to constantly and/or quickly process.
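To make the tumbling window idea concrete, here is a small Python sketch. It assumes a made-up stream of `(timestamp_seconds, price)` stock ticks arriving in time order, and the `tumbling_window_averages` function is my own illustration, not an Azure API. A tumbling window is a fixed-size, non-overlapping slice of time, and each event lands in exactly one window.

```python
def tumbling_window_averages(events, window_seconds):
    """Yield (window_start, average_price) as each window closes.

    `events` is an iterable of (timestamp_seconds, price) pairs,
    assumed to arrive in time order.
    """
    current_start = None
    prices = []
    for timestamp, price in events:
        # Each event belongs to exactly one fixed-size window.
        window_start = timestamp - (timestamp % window_seconds)
        if current_start is not None and window_start != current_start:
            # A new window began, so emit the result for the old one now.
            yield current_start, sum(prices) / len(prices)
            prices = []
        current_start = window_start
        prices.append(price)
    if prices:
        # Flush the last window once the (finite) sample stream ends.
        yield current_start, sum(prices) / len(prices)


# Hypothetical ticker data: (seconds since start, price).
TICKS = [(0, 100.0), (2, 102.0), (4, 104.0), (5, 110.0), (9, 112.0), (11, 120.0)]

if __name__ == "__main__":
    for start, average in tumbling_window_averages(TICKS, window_seconds=5):
        print(f"window starting at {start}s: average price {average}")
```

Because the function is a generator, each result comes out as soon as its window closes rather than after all the data has arrived. That is the key contrast with the batch style.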

Differences

These items helped me:

  • Lots of data – Batch
  • Low latency – Stream
  • Long latency, latency doesn’t matter, periodic work – Batch
  • Small, constant sets of data – Stream

Relational Data

The workload here is that you are handling regular changes to data, lots of inserts/updates/deletes, for a business process. Really this means you are thinking of some sort of CRUD application that gets and sends data to users in real time, but without low-latency demands. We are thinking of a web server or a data entry business app, something that operates on human time scales: seconds. Not real-time, millisecond IoT work.
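A quick sketch of that kind of CRUD workload, using Python's built-in `sqlite3` module as a stand-in for a real relational server. The `orders` table, its columns, and the `run_crud_demo` function are invented for the example. Each change is a small transaction against structured rows, the sort of thing a data entry app does all day.

```python
import sqlite3


def run_crud_demo():
    """Run a few small transactions and return the remaining rows."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    conn.execute(
        "CREATE TABLE orders ("
        "id INTEGER PRIMARY KEY, "
        "customer TEXT NOT NULL, "
        "quantity INTEGER NOT NULL)"
    )
    # Create: each `with conn` block commits as one transaction.
    with conn:
        conn.execute(
            "INSERT INTO orders (customer, quantity) VALUES (?, ?)", ("alice", 2)
        )
        conn.execute(
            "INSERT INTO orders (customer, quantity) VALUES (?, ?)", ("bob", 1)
        )
    # Update: a user edits one order.
    with conn:
        conn.execute(
            "UPDATE orders SET quantity = ? WHERE customer = ?", (5, "alice")
        )
    # Delete: a user cancels another order.
    with conn:
        conn.execute("DELETE FROM orders WHERE customer = ?", ("bob",))
    # Read: query the current state of the table.
    rows = conn.execute(
        "SELECT customer, quantity FROM orders ORDER BY id"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(run_crud_demo())
```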

