
The path to learning SQL and mastering it to become a Data Engineer

source link: https://medium.com/analytics-and-data/the-path-to-learning-sql-and-mastering-it-to-become-a-data-engineer-256ea0fef4e7

ON DATA ENGINEERING


Photo by Campaign Creators on Unsplash

SQL is one of the key tools used by data engineers to model business logic, extract key performance metrics, and create reusable data structures. There are, however, different types of SQL to consider for data engineers: Basic, Advanced Modelling, Efficient, Big Data, and Programmatic. The path to learning SQL involves progressively learning these different types.

Basic SQL

What is “Basic SQL”?

Learning “Basic SQL” is all about learning the key operations used to manipulate data, such as aggregations, joins, and working with the grain of a table.

Where to learn it

Basic SQL can be learned from websites such as W3Schools or, for a more practical approach, from platforms such as Datacamp or DataQuest. These websites give a decent grasp of SQL's core concepts, such as the different operations, functions, subqueries, and joins. Some concepts that are core to data engineering, such as working with the grain of a table/dataset, are however often not discussed as extensively as they deserve.
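As a minimal illustration of these concepts (assuming hypothetical orders and customers tables), the query below joins the two tables and aggregates order amounts, producing one row per customer, which is what defines the grain of the result:

-- One row per customer: the GROUP BY defines the grain of the result.
SELECT
    c.customer_id,
    COUNT(*) AS order_count,
    SUM(o.amount) AS total_amount
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id
GROUP BY c.customer_id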

Practice challenges

One of the main challenges of learning SQL is setting up a database and getting access to datasets. These days, installing a local database has become quite easy, but it still takes some time to set up. After that, tables need to be created and datasets loaded into them before they become usable for practical learning.

For interviews

This type of knowledge is generally tested with screening interview questions, such as the classic histogram question, to check how well candidates have grasped concepts such as granularity or joins. These questions also represent the typical SQL knowledge level expected of fresh graduates embarking on a data engineering career.

Advanced Modelling

Data engineers need to be able to model complex transformations, and learning some advanced analytical SQL helps model these types of behavior. Two main things support this kind of use case: 1) advanced queries and 2) data models.

Advanced queries

Window functions are a special class of functions with added properties. They make it easier to define transformations that would otherwise require a mix of subqueries and joins to achieve the same result. They can help handle many heavier data transformations, such as sessionization, preference picks, or optimization based on actual data.
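For instance, here is a minimal sessionization sketch (PostgreSQL syntax, assuming a hypothetical events table with user_id and event_ts columns): a new session starts after 30 minutes of inactivity, and a running sum of the new-session flags yields a session id.

WITH flagged AS (
    SELECT
        user_id,
        event_ts,
        -- Flag the first event of each user, and any event more than
        -- 30 minutes after the previous one, as the start of a session.
        CASE
            WHEN LAG(event_ts) OVER w IS NULL
                 OR event_ts - LAG(event_ts) OVER w > INTERVAL '30 minutes'
            THEN 1 ELSE 0
        END AS is_new_session
    FROM events
    WINDOW w AS (PARTITION BY user_id ORDER BY event_ts)
)
SELECT
    user_id,
    event_ts,
    -- Running sum of the flags: all events of one session share an id.
    SUM(is_new_session) OVER (PARTITION BY user_id ORDER BY event_ts) AS session_id
FROM flagged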

REGEXP operators and their associated functions provide a way to do more extensive text pattern matching than the regular LIKE operator.
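As a small sketch (PostgreSQL syntax, with a hypothetical support_tickets table), extracting an order number from free text is straightforward with a regular expression but not with LIKE:

-- ~ tests the pattern; regexp_match captures the digits after "order #".
SELECT (regexp_match(note, 'order #(\d+)'))[1] AS order_number
FROM support_tickets
WHERE note ~ 'order #\d+'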

Collection aggregate functions allow summarizing a set of rows into a single collection value; typical functions are LISTAGG and array_agg. They can be useful for exposing data that naturally lives at a more granular level than the summarized dataset. For example, take the use case of knowing whether certain product types were included in a given e-commerce order. This could be computed from the raw order items data, but it could also very easily be computed from an orders table with an aggregated field built with jsonb_object_agg.
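A minimal sketch of that pattern (assuming a hypothetical order_items table), rolling the item-level grain up to one row per order while keeping the product types in an array:

-- One row per order; the item-level detail survives as an array column.
SELECT
    order_id,
    array_agg(DISTINCT product_type) AS product_types
FROM order_items
GROUP BY order_id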

Recursive CTEs / hierarchical queries are generally less used than the other types of transformation, such as window functions, regular expressions, or collection aggregate functions. They are, however, essential for solving certain types of problems.
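A classic example of such a problem is walking a hierarchy of unknown depth. The sketch below (assuming a hypothetical employees table with id and manager_id columns) computes each employee's depth in the org chart:

WITH RECURSIVE org AS (
    -- Anchor: the root(s) of the hierarchy.
    SELECT id, name, 1 AS depth
    FROM employees
    WHERE manager_id IS NULL
    UNION ALL
    -- Recursive step: attach each employee to their manager's row.
    SELECT e.id, e.name, org.depth + 1
    FROM employees e
    JOIN org ON e.manager_id = org.id
)
SELECT id, name, depth FROM org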

Data Models

The type of data model used can be particularly impactful. Three modeling concepts are worth knowing well:

Normal Form: It is important for data engineers to understand the benefits of database normalization, what the different normal forms mean, and where they should be used.

Slowly Changing Dimensions: Setting up and leveraging type 2 slowly changing dimension (SCD) tables can make it particularly easy to handle certain use cases, such as tracking the evolution of an insurance premium across time for a given insurance contract (see the sketch after this list). Type 1 SCD tables would be more appropriate for providing visibility into the current user demographic split by age group.

Denormalized Design: Denormalization is the typical way data tends to be exposed in a data warehouse, and understanding the prerequisites and the different approaches used in denormalized design, such as star and snowflake schemas, is important to keep a sane data warehouse.
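As promised above, a sketch of querying a hypothetical type 2 SCD table, dim_contract, with valid_from / valid_to columns, for the premium in effect on a given date:

-- Each contract has one row per version; pick the version valid on the date.
SELECT contract_id, premium
FROM dim_contract
WHERE contract_id = 42
  AND valid_from <= DATE '2020-06-01'
  AND (valid_to > DATE '2020-06-01' OR valid_to IS NULL)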

Where to learn Advanced Analytical SQL

One of the key learning materials for advanced analytical SQL is the database's official documentation. It is essential to understand what is available and what could be of interest when modeling business problems. Some books also provide a good overview of how best to apply SQL to these modeling use cases, such as Practical Oracle SQL, which offers practical applications of window functions, examples of recursive queries, and approaches to row pattern matching.

For data models, there are several database classes [1] [2] [3] that teach some of the critical modeling principles. These need to be applied in practice. Books such as Modelling Business Information or Database Design for Mere Mortals provide useful references for a deeper dive into the subject.

In interviews

Questions on advanced modeling tend to come up in interviews either as part of a SQL case study, for instance designing a table structure and transformation for a specific use case, or in a more theoretical form, such as explaining the difference between the normal forms.

Efficient

Part of the job of a data engineer is to write efficient code, and a few things are important to know in that regard: 1) having a good grasp of complexity, 2) a good understanding of data structures, efficient data models, and query patterns, and 3) being able to analyze an explain plan and understand what the optimizer is doing.

Complexity

Besides having the general knowledge that some approaches, such as correlated subqueries, are particularly expensive, it is essential for data engineers to understand the time and space complexity of the different transformations they perform, both in their naive form and as optimized by the database.
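As an illustration (using the same hypothetical orders and customers tables as above): a correlated subquery is conceptually re-evaluated once per outer row, whereas the equivalent join can be executed as a single hash join; many optimizers will decorrelate the first form into the second, but not always.

-- Correlated subquery: naively O(rows(orders) * rows(customers)).
SELECT
    o.order_id,
    (SELECT c.name
     FROM customers c
     WHERE c.customer_id = o.customer_id) AS customer_name
FROM orders o

-- Equivalent join: the engine can build one hash table over customers.
SELECT o.order_id, c.name AS customer_name
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id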

Data structure and data model design

It is important to understand concepts such as indices and partitions, how they can contribute to higher efficiency, and how modeling around them can bring benefits. Take the example of a daily snapshotted table: it can be a great way to support an entity update process, leveraging the previous day's partition and applying the day's modifications on top, as in the query below.

WITH
-- Deduplicate the day's update log: keep only the latest record per customer.
daily_updates AS (
    SELECT customer_id, name, created_on
    FROM (
        SELECT
            customer_id,
            name,
            created_on,
            ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY created_on DESC) AS rn
        FROM customer_updates cu
        WHERE cu.ds = '2020-01-02'
    ) ranked
    WHERE rn = 1
),
-- The already consolidated snapshot from the previous day's partition.
previous_snapshot AS (
    SELECT customer_id, name, last_login_on
    FROM dim_user
    WHERE ds = '2020-01-01'
)
-- Merge both sides at the customer_id grain to build the new daily partition:
-- updated customers take the fresh values, unchanged ones carry over, and
-- brand-new customers are introduced by the full outer join.
INSERT INTO dim_user
SELECT
    '2020-01-02' AS ds,
    COALESCE(daily.customer_id, du.customer_id) AS customer_id,
    COALESCE(daily.name, du.name) AS name,
    COALESCE(daily.created_on, du.last_login_on) AS last_login_on
FROM previous_snapshot du
FULL OUTER JOIN daily_updates daily
    ON du.customer_id = daily.customer_id

Let’s look at the above SQL query, where ds represents the daily partition. By joining the previous day's consolidated snapshot with a deduplicated (i.e., rn = 1) CTE at the same grain (i.e., customer_id) through a full outer join, the new partition can be computed while reading only a subset of the data: the day's updates plus an already consolidated view of the customers. This is much more efficient than recomputing the state of every customer from the full update logs each time.

Explain plan

Understanding how to read an explain plan is necessary when dealing with databases. It tells you which decisions the optimizer has taken and the different steps being performed. This gives data engineers the ability to spot operations that are expensive and could be optimized, for example a nested loops operation happening where a hash join should be used.
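As a minimal sketch (PostgreSQL syntax, reusing the hypothetical orders and customers tables from above):

-- Ask the planner how it will execute the query, and run it to compare
-- estimated versus actual row counts.
EXPLAIN ANALYZE
SELECT o.order_id, c.name
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id

The output names the chosen join strategy (e.g., Hash Join vs. Nested Loop) along with the cost of each step, which is the starting point for the optimization techniques listed below.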

There are different methods to optimize queries based on what the explain plan returns, and the analysis of the explain plan should be tied to these optimization techniques, be it 1) query rewrites, 2) the introduction of indices, presorting steps, or partitions, 3) the use of optimizer/planner hints, or 4) an update of the table statistics.

Learning efficient SQL

The most typical way to learn how to write efficient SQL is to be confronted with inefficient or slow operations and to optimize them out of necessity. Books and courses on databases provide some of this information, and articles such as this one from online learning platforms give insights into what can be done to write more efficient SQL. Nevertheless, the most important sources of information remain the official database documentation and experience.

Efficient SQL in interviews

Questions about efficiency usually pop up in interviews for candidates with experience, typically as follow-up questions to a modeling exercise, asking how the transformation could be made more efficient. Certain companies, however, place a particular focus on writing efficient code and on query optimization. For these types of interviews, it is essential not just to know some of the methods mentioned above, but to truly understand what is happening under the hood and how the optimizer rewrites queries to render them more efficient, through techniques like flattening, inlining, predicate pushdown, or partition pruning.

Big Data

There are specificities involved in dealing with SQL on Big Data technologies. Certain data structure options, such as buckets or cluster by clauses, are very much native to big data technologies.
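A small sketch in Hive DDL (hypothetical orders table): bucketing the data on the join key lets the engine join bucket by bucket instead of shuffling the full table.

-- Rows are hashed on customer_id into 64 buckets at write time.
CREATE TABLE orders_bucketed (
    order_id BIGINT,
    customer_id BIGINT,
    amount DOUBLE
)
CLUSTERED BY (customer_id) INTO 64 BUCKETS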

The use of probabilistic data structures through certain functions, such as approx_distinct in Presto, is also typically found in SQL code leveraging big data technologies, despite some of these functions also being available in traditional databases.
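For example (Presto syntax, with a hypothetical page_views table), counting unique users with a HyperLogLog-based approximation rather than an exact, memory-hungry COUNT(DISTINCT ...):

-- Trades a small error margin for much lower memory usage at scale.
SELECT approx_distinct(user_id) AS approx_unique_users
FROM page_views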

A lot of dealing with Big Data SQL is about understanding what is happening under the hood of the specific distributed system, such as how it plans and executes queries.

Where to learn Big data SQL

A lot of learning big data SQL comes from working with big data systems and attempting to debug what is happening. The specific documentation of the platform used, such as Spark or Hive, will provide a high-level view of what is going on. Courses such as Harvard’s Algorithms for Big Data give a much more in-depth explanation of what happens under the hood of some of the probabilistic algorithms and data structures used in Big Data.

Programmatic

Programmatic SQL offers a way to orchestrate a flow of transactions through procedures. It provides the core of a programming language, introducing control flow (loops, ifs, …), variables, exception handling, and even the ability to generate and execute SQL queries dynamically.
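A minimal PL/pgSQL sketch of those constructs (PostgreSQL 11+; the procedure name is hypothetical, and it reuses the dim_user table from the earlier example), showing variables, a loop, dynamic SQL, and exception handling:

CREATE OR REPLACE PROCEDURE refresh_daily_partitions(start_ds DATE, end_ds DATE)
LANGUAGE plpgsql
AS $$
DECLARE
    ds DATE := start_ds;  -- loop variable over the partition dates
BEGIN
    WHILE ds <= end_ds LOOP
        -- Build and execute a statement dynamically for each partition.
        EXECUTE format('DELETE FROM dim_user WHERE ds = %L', ds);
        RAISE NOTICE 'Cleared partition %', ds;
        ds := ds + 1;
    END LOOP;
EXCEPTION
    WHEN OTHERS THEN
        RAISE NOTICE 'Refresh failed for %: %', ds, SQLERRM;
        RAISE;
END;
$$;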

Although programmatic SQL has fallen somewhat out of favor with modern data engineering movements favoring data immutability and full reloads over incremental updates, as well as with the rise of workflow engines such as Airflow or Prefect, it is still useful and in use in a large number of companies.

Learning Programmatic SQL

Most RDBMSs support a flavor of programmatic SQL, such as T-SQL for Microsoft SQL Server, PL/SQL for Oracle, or PL/pgSQL for Postgres, and the language is therefore very much vendor-specific.

The documentation provided by the specific vendor offers a good first pass for giving the procedural language a try or for looking up specific functions. There are also some decent books for getting introduced to these languages, such as Oracle PL/SQL Programming or PostgreSQL Server Programming.

Programmatic SQL in Interviews

In my experience, programmatic SQL questions do not tend to be asked directly in interviews, in part because of the vendor specificity of the language. However, some companies looking for people with experience in a specific database vendor may ask candidates to complete automated tests that include some questions around programmatic SQL.

