
Teradata and its Architecture for Data Engineers

source link: https://www.analyticsvidhya.com/blog/2022/10/teradata-its-architecture-for-data-engineers/

This article was published as a part of the Data Science Blogathon.

Introduction

The concept of Teradata began nearly 50 years ago, in the 1970s, when researchers at the California Institute of Technology and Citibank’s Advanced Technology Group began brainstorming ideas. In 1979, several of them incorporated Teradata in Brentwood, California. Like other great tech companies, they started out in a garage.

Teradata
Source: stambia

What is Teradata?

The company is named Teradata because the name reflects its ambitions. The founders were aiming high, with a revolutionary plan to manage massive amounts of data: trillions of bytes, or terabytes. In 1980, a seed and venture capital round allowed the founders to build a research and development team and start turning concepts into reality. Just before the end of 1983, they shipped their first beta system to Wells Fargo.

Since then, the list of awards and “firsts” has continued to grow. A few early highlights are below:

  • 1986 – Fortune magazine named Teradata “Product of the Year”.
  • 1992 – The first system over one terabyte goes into production for Wal-Mart.
  • 1996 – The Data Warehousing Institute presents Teradata with the Data Warehouse Best Practices Award. The Teradata database sets the storage record at 11 terabytes.
  • 1997 – Teradata wins The Data Warehousing Institute’s Best Practices Award and the DBMS Readers’ Choice Award. In the same year, a Teradata customer’s 24-terabyte database sets the record for the largest production database in the world.
  • 1999 – A Teradata customer set the latest world record for the largest production database at 130 terabytes.
  • 2000 to 2006 – Busy years of acquisitions, mergers and new product launches.
  • 2007 – Intelligent Enterprise magazine named Teradata the best global business intelligence data warehouse.

You get the idea. Teradata quickly established itself as a unique innovator, especially regarding scalability. In 1992, the first system over one terabyte went into production; seven years later, a customer was running a 130-terabyte database. That’s scalability.

Parallel Processing

Early computer systems worked on a straightforward model. Data was stored on disk platters, and a single CPU processed it. When a user requested information, the data was fed into the CPU and processed. Once processing was complete, the updated data was written back to disk.

The system worked well until the data stores became too large to handle. A large amount of data meant a long wait for the system to load it, and could overwhelm a single processor. Queries could run for days or even longer. Sometimes they failed to return answers at all.

Teradata Solutions

Teradata solved these problems by designing systems that use multiple processors working in parallel. These parallel processors are called Access Module Processors (AMPs). Teradata distributes the rows of each table among the AMPs. I/O times improve significantly because each processor loads less data. If we have three processors instead of one and the rows are spread evenly, each reads a third of the data, so the I/O time drops to roughly a third of the original. Parallel processing is the most important concept in Teradata.

Linear scalability

As you add more AMPs, you get predictable, linear system improvements. We call this linear scalability. Its limits, if any, are unknown. There are systems with 4,000 AMPs in production today.

Every time you invest in hardware and add it to a Teradata system, you get a predictable return on investment. With more hardware, you can process the same amount of data faster, or you can process more data without degrading performance.
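The arithmetic behind linear scalability can be sketched in a few lines of Python. This is an idealized model, not a Teradata benchmark: it assumes rows are spread perfectly evenly, that every AMP reads at the same rate, and that there is no coordination overhead. The function name and all the numbers are made up for illustration.

```python
def scan_seconds(total_rows, amp_count, rows_per_second_per_amp):
    """Idealized elapsed time for a scan when rows are spread evenly.

    All AMPs read their own share at the same time, so the elapsed time
    is the time one AMP needs for its share.
    """
    rows_per_amp = total_rows / amp_count
    return rows_per_amp / rows_per_second_per_amp


# Hypothetical figures: one billion rows, 50,000 rows per second per AMP.
baseline = scan_seconds(1_000_000_000, 100, 50_000)
more_amps = scan_seconds(1_000_000_000, 200, 50_000)   # same data, twice the AMPs
more_data = scan_seconds(2_000_000_000, 200, 50_000)   # twice the data, twice the AMPs

print(baseline)    # 200.0
print(more_amps)   # 100.0
print(more_data)   # 200.0
```

Doubling the AMPs halves the time for the same data, and doubling both data and AMPs leaves the time unchanged. Real systems only approach this when the rows really are evenly distributed.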

Parallel Processing: Then vs. Now

I remember walking into a huge data center 20 years ago. It was almost like a football pitch, and there was hardware everywhere.

All around us, hundreds of cabinets held disk platters. The room was silent until the lights on all the disks suddenly turned green. The disks clattered together, shaking the room, then stopped.

In those days, we could see and hear parallel processing in action. The mainframe sent a request for information, and all the AMPs worked together, each processing its share of the data.

But those days are long gone. True to its innovative origins, Teradata has aggressively evolved and embraced the VM revolution. Now that room full of disks fits in something the size of a laptop. I’ll talk more about this in the “Teradata Architecture: A Deep Dive” section below.

The Basic Architecture of Teradata

I just mentioned AMP, one of the three main elements of Teradata’s core architecture.

Here are the elements at a high conceptual level:

Parsing Engine (PE): The Parsing Engine is the brain behind successful query processing. Parsing Engines:

  • Accept a SQL user query.
  • Make sure the user has permission to run the query.
  • Check the SQL query syntax.
  • Use the primary index to work out which AMP holds each table row. More on this important feature in the next section.

BYNET: The BYNET is the messaging layer. It is part software and part hardware: the software controls communication, and the hardware carries it.

A system always has two BYNETs, BYNET 0 and BYNET 1. Having two makes the system faster and provides cover if one fails.

Access Module Processors (AMPs): AMPs represent the computing power in the system. They receive data from the BYNET, process it and return results via the BYNET. Each AMP has its own disk space for processing.

AMPs do not share data or memory. This is the “Shared Nothing Architecture”.

The following figure shows how the basic architecture works. The lines between the parsing engines and the AMPs represent the BYNETs. Notice that there are two of them.
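The division of labor between the three elements can be sketched as a toy model in Python. This is an illustration under stated assumptions, not Teradata code: the class names, the `allowed_users` set and the sample rows are all hypothetical, there is only one BYNET, and the loop over AMPs stands in for work that a real system does in parallel.

```python
class Amp:
    """One AMP: owns its rows outright; nothing is shared with other AMPs."""

    def __init__(self, amp_id):
        self.amp_id = amp_id
        self._rows = []  # private storage, standing in for the AMP's own disk

    def store(self, row):
        self._rows.append(row)

    def scan(self, predicate):
        return [row for row in self._rows if predicate(row)]


class Bynet:
    """Messaging layer: carries a request to the AMPs and the results back."""

    def __init__(self, amps):
        self.amps = amps

    def broadcast(self, predicate):
        results = []
        for amp in self.amps:  # sequential here; parallel in a real system
            results.extend(amp.scan(predicate))
        return results


class ParsingEngine:
    """Accepts a request, checks permission, then dispatches over the BYNET."""

    def __init__(self, bynet, allowed_users):
        self.bynet = bynet
        self.allowed_users = allowed_users

    def run(self, user, predicate):
        if user not in self.allowed_users:
            raise PermissionError(f"{user} may not run this query")
        return self.bynet.broadcast(predicate)


amps = [Amp(i) for i in range(3)]
for n in range(9):
    amps[n % 3].store({"emp_no": n, "dept_no": n % 2})

pe = ParsingEngine(Bynet(amps), allowed_users={"alice"})
rows = pe.run("alice", lambda row: row["dept_no"] == 1)
print(sorted(row["emp_no"] for row in rows))  # [1, 3, 5, 7]
```

Each `Amp` only ever touches its own `_rows`, which is the shared-nothing idea in miniature. Syntax checking is left out to keep the sketch short.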

All about primary indexes

Each SQL table has a primary index that you define when setting up the table. The primary index is critical because the parsing engine hashes it, then uses the hash result to find your data.

This is a critical feature of Teradata. The parsing engine can quickly hash the primary index and target the desired data. It finds the correct AMP and the correct row in a single step. Even on one of those 4,000-AMP warehouses, this takes less than a second. This feature allows Teradata to perform analyses that would be impractical on other systems.

If you forget to set a primary index, Teradata will likely use your first column as a non-unique primary index (more on that later). So don’t forget. Choose it so you don’t have to live with the defaults.
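The hash-and-route idea can be sketched as follows. Note the assumptions: MD5 and the modulo step are stand-ins chosen only because they are stable and easy to read, not Teradata's actual hashing algorithm or hash map, and `AMP_COUNT`, `amp_for` and the sample employees are hypothetical.

```python
import hashlib

AMP_COUNT = 4  # hypothetical system size


def amp_for(pi_value, amp_count=AMP_COUNT):
    """Map a primary index value to an AMP number via a stable hash."""
    digest = hashlib.md5(str(pi_value).encode("utf-8")).digest()
    row_hash = int.from_bytes(digest[:4], "big")  # 32-bit stand-in for a row hash
    return row_hash % amp_count


# Insert: each row is stored on the AMP chosen by hashing its primary index.
amps = {n: {} for n in range(AMP_COUNT)}
for emp_no, name in [(101, "Ann"), (102, "Raj"), (103, "Li"), (104, "Eva")]:
    amps[amp_for(emp_no)][emp_no] = name


def lookup(emp_no):
    """Hash the value again and read from exactly one AMP."""
    return amps[amp_for(emp_no)].get(emp_no)


print(lookup(103))  # Li
```

Because the same value always hashes to the same AMP, the lookup never has to ask the other AMPs, however many there are.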

Unique Primary Index (UPI)

The Unique Primary Index (UPI) is exactly what it sounds like. It is the primary index that is unique in your table. Specify UNIQUE when setting up the SQL table. Then you cannot enter another record with the same index. The system will reject your attempt to enter a duplicate UPI and send an error message.

The image below shows a common example. Emp_No (employee number) is a unique primary index. Note that all rows of the table are equally distributed across all AMPs. This is another characteristic of UPI.

When the user submits a SQL query, the parsing engine hashes the Emp_No value in the query. The process works as described above and returns the correct row within seconds.
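Both UPI properties, duplicate rejection and near-even distribution, show up in a small simulation. As before, this is a sketch under assumptions: MD5 modulo the AMP count stands in for Teradata's real hashing, and `UpiTable`, the employee numbers and the error text are invented for illustration.

```python
import hashlib


def amp_for(value, amp_count):
    """Stable stand-in hash: map a primary index value to an AMP number."""
    digest = hashlib.md5(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % amp_count


class UpiTable:
    """Toy table with a unique primary index on emp_no."""

    def __init__(self, amp_count):
        self.amps = [dict() for _ in range(amp_count)]

    def insert(self, emp_no, name):
        amp = self.amps[amp_for(emp_no, len(self.amps))]
        # A duplicate value always hashes to the same AMP, so checking
        # that one AMP is enough to enforce uniqueness.
        if emp_no in amp:
            raise ValueError(f"duplicate unique primary index value: {emp_no}")
        amp[emp_no] = name


table = UpiTable(amp_count=4)
for emp_no in range(1, 10_001):
    table.insert(emp_no, f"employee-{emp_no}")

# Row counts per AMP; with unique values these should all be near 2,500.
print([len(amp) for amp in table.amps])

try:
    table.insert(42, "someone else")
except ValueError as err:
    print(err)  # duplicate unique primary index value: 42
```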

Non-Unique Primary Index (NUPI)

What if you’re not searching for employees based on their unique employee numbers? You may want to ask and report by department number. The department number is not unique: many employees work in the same department.

In this example, the user submits a SQL query that contains Dept_No. The parsing engine hashes the department number of each row and routes all rows with the same department to the same AMP. The rows are stored together, so you can find them with a single-AMP read, just as with a UPI. In this case, the query returns multiple rows.

When creating a NUPI SQL table, just omit the UNIQUE modifier. Note another difference between the UPI above and the NUPI below. With a NUPI, the rows are not evenly distributed across the AMPs. They can’t be, because the departments are different sizes.
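A short simulation makes the NUPI trade-off visible. It rests on the same assumptions as any toy model here: MD5 modulo the AMP count is a stand-in for the real hash, and the department numbers and sizes are hypothetical.

```python
import hashlib

AMP_COUNT = 4  # hypothetical system size


def amp_for(value, amp_count=AMP_COUNT):
    """Stable stand-in hash: map a primary index value to an AMP number."""
    digest = hashlib.md5(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % amp_count


# Hypothetical employees as (emp_no, dept_no); department sizes differ on purpose.
employees = (
    [(n, 10) for n in range(1, 61)]      # 60 people in department 10
    + [(n, 20) for n in range(61, 91)]   # 30 people in department 20
    + [(n, 30) for n in range(91, 101)]  # 10 people in department 30
)

# The primary index is dept_no, so every row of a department lands on one AMP.
amps = [[] for _ in range(AMP_COUNT)]
for emp_no, dept_no in employees:
    amps[amp_for(dept_no)].append((emp_no, dept_no))


def rows_for_dept(dept_no):
    """Read only the AMP that the department number hashes to."""
    target = amps[amp_for(dept_no)]
    return [row for row in target if row[1] == dept_no]


print(len(rows_for_dept(10)))        # 60
print([len(amp) for amp in amps])    # uneven: at least one AMP holds 60 or more rows
```

The lookup still touches a single AMP, but the per-AMP row counts follow the department sizes rather than evening out.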

But what happens when NUPI results in a wildly uneven distribution?

Primary index with multiple columns

You can combine two or more columns to create a multi-column primary index.

Here is an example of a distribution that is so uneven that it could begin to offset the benefits of Teradata’s design. If the primary index is the last name and the parsing engine hashes on Smith, it sends all the Smiths to one AMP.

Change the index to a combination of last name and first name. The Smiths now hash into smaller groups. The John Smiths go to one AMP and the Mary Smiths to another. The distribution becomes more even.

Multi-column primary indexes are also useful if you tend to query more than one field. Imagine frequently querying by department and shift (day-night, for example).

There is a small penalty associated with multi-column indexes. If you want that fast single-AMP read, your SQL query must supply values for every column in the index, both of them in this example.
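Both effects, the smoother distribution and the penalty, can be seen in a small simulation. The usual caveats apply: MD5 modulo the AMP count is only a stand-in for Teradata's hashing, and the names, row counts and helper functions are hypothetical.

```python
import hashlib

AMP_COUNT = 8  # hypothetical system size


def amp_for(*pi_columns):
    """Hash all primary index columns together and map the result to an AMP."""
    key = "|".join(str(column) for column in pi_columns)
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % AMP_COUNT


first_names = ["John", "Mary"] + [f"Name{n}" for n in range(18)]
rows = []  # (last_name, first_name); 800 of the 1,000 rows are Smiths
for first in first_names:
    rows += [("Smith", first)] * 40 + [("Jones", first)] * 5 + [("Lee", first)] * 5


def distribute(pi):
    """Place every row on an AMP; pi picks the index columns out of a row."""
    amps = [[] for _ in range(AMP_COUNT)]
    for row in rows:
        amps[amp_for(*pi(row))].append(row)
    return amps


by_last = distribute(lambda row: (row[0],))
by_last_first = distribute(lambda row: (row[0], row[1]))

print(max(len(amp) for amp in by_last))        # at least 800: every Smith on one AMP
print(max(len(amp) for amp in by_last_first))  # expected to be far lower


def find(amps, last, first=None):
    """Return (matching rows, AMPs read) against the two-column index."""
    if first is None:
        targets = amps                          # half the index: ask every AMP
    else:
        targets = [amps[amp_for(last, first)]]  # full index: one AMP
    hits = [row for amp in targets for row in amp
            if row[0] == last and first in (None, row[1])]
    return hits, len(targets)


print(find(by_last_first, "Smith", "John")[1])  # 1
print(find(by_last_first, "Smith")[1])          # 8
```

Supplying only the last name still returns the right rows, but every AMP has to be read to find them.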

No primary index (no PI)

Last, and certainly least, is a table with no primary index (No PI).

If you set the table to No PI, the rows are distributed evenly across the AMPs. It’s a perfect distribution every time. However, without a primary index there is nothing for the parsing engine to hash, so it cannot tell which AMP holds a given row. To run a query, you need a full table scan.

This makes No PI impractical for production queries. DBAs sometimes use it for staging tables and columnar designs.
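The No PI trade-off can be sketched like this. Round-robin placement is an assumption used here to get an exactly even spread; the article only says the rows end up evenly distributed. The function names and sample rows are hypothetical.

```python
from itertools import cycle

AMP_COUNT = 4  # hypothetical system size
amps = [[] for _ in range(AMP_COUNT)]
next_amp = cycle(range(AMP_COUNT))  # round-robin placement, no hashing involved


def insert(row):
    """Place the row on whichever AMP is next in line."""
    amps[next(next_amp)].append(row)


for emp_no in range(1, 101):
    insert({"emp_no": emp_no})


def find(emp_no):
    """With nothing to hash, every AMP has to be read to find a row."""
    amps_read = 0
    matches = []
    for amp in amps:
        amps_read += 1
        matches.extend(row for row in amp if row["emp_no"] == emp_no)
    return matches, amps_read


print([len(amp) for amp in amps])  # [25, 25, 25, 25]
print(find(42))                    # ([{'emp_no': 42}], 4)
```

Loading is cheap and perfectly balanced, which suits staging, but every lookup costs a read on all AMPs.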

Teradata Architecture: A Deep Dive

As I mentioned, the days of data centers the size of football fields are over.

Teradata node architecture

A node combines four parsing engines and 40 AMPs! Each AMP gets the storage it needs because it owns a virtual disk in the disk farm. Everything is laid out nice and evenly, which is why we call it a symmetric multiprocessing (SMP) node.

To summarize, even though AMPs are co-located on a node, each AMP has its own memory, processing capabilities, and disk space in the disk farm. Teradata’s modern architecture remains a shared-nothing system.


Teradata stores nodes in what they call “racks.” They’re about the size of a kitchen cabinet, yet have a hundred times the computing power we had in football field-sized installations 20 years ago.

How do we scale this architecture? When you need to upgrade, use the BYNETs to connect multiple SMP nodes. Now you have massive computing power: a massively parallel processing (MPP) system.

Inside the Teradata nodes

A node is a server. That server contains:

  • The Linux operating system.
  • PDE: the Parallel Database Extensions, which control the BYNETs.
  • Memory, which holds the parsing engines and AMPs. Each AMP runs as a vproc (virtual processor).

You can see that the node is connected to the mainframe. Unlike earlier architectures, the Teradata node is also connected to a LAN. Remember that each node contains four parsing engines? Well, each parsing engine can handle 120 users.
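Taking the figures in this section at face value (four parsing engines and 40 AMPs per node, 120 users per parsing engine), the capacity of a multi-node system is simple multiplication. The function below is a hypothetical helper for that arithmetic, not a sizing tool; real configurations vary.

```python
PES_PER_NODE = 4     # parsing engines per node, as described above
AMPS_PER_NODE = 40   # AMPs per node, as described above
USERS_PER_PE = 120   # users each parsing engine can handle, as described above


def capacity(node_count):
    """Totals for a system built from identical nodes."""
    parsing_engines = node_count * PES_PER_NODE
    return {
        "parsing_engines": parsing_engines,
        "amps": node_count * AMPS_PER_NODE,
        "users": parsing_engines * USERS_PER_PE,
    }


print(capacity(1))    # {'parsing_engines': 4, 'amps': 40, 'users': 480}
print(capacity(100))  # {'parsing_engines': 400, 'amps': 4000, 'users': 48000}
```

Under these assumptions, a 4,000-AMP system like the ones mentioned earlier would correspond to 100 nodes.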

Conclusion

The concept of the parsing engine is important here. The parsing engine needs access to every AMP that contains rows of the table. Think about doing a full table scan: the parsing engine responsible for the scan must coordinate every AMP. The graphic below is a reminder of the underlying architecture that enables this massively parallel processing.

  • In the early days, we could see and hear parallel processing in action. The mainframe sent a request for information, and all the AMPs worked together, each processing its share of the data.
  • Teradata quickly established itself as a unique innovator, especially regarding scalability. In 1992, the first system over one terabyte went into production; seven years later, a customer was running a 130-terabyte database.
  • AMPs represent the computing power in the system. They receive data from the BYNET, process it, and return results via the BYNET. Each AMP has its own disk space for processing.
  • Multi-column primary indexes are also useful if you tend to query more than one field. Imagine frequently querying by department and shift (day-night, for example).

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
