
AlloyDB for PostgreSQL Columnar Engine

Recently, at Google I/O, we announced AlloyDB for PostgreSQL, a fully managed, PostgreSQL-compatible database for demanding, enterprise-grade transactional and analytical workloads. Imagine PostgreSQL plus the best of the cloud: elastic storage and compute, intelligent caching, and AI/ML-powered management. Further, AlloyDB delivers unmatched price-performance: in our performance tests, it’s more than 4x faster on transactional workloads and up to 100x faster on analytical queries than standard PostgreSQL, all with simple, predictable pricing. Designed for mission-critical applications, AlloyDB offers extensive data protection and an industry-leading 99.99% availability SLA, inclusive of maintenance.

Multiple innovations underpin the performance and availability gains of AlloyDB for PostgreSQL. In the first part of our “AlloyDB for PostgreSQL under the hood” series, we discussed AlloyDB’s intelligent storage layer; today, we are covering AlloyDB’s vectorized columnar execution engine, which enables analytical acceleration.

PostgreSQL and hybrid workload patterns

General-purpose databases like PostgreSQL often support a wide variety of workloads. Some of those workloads are purely transactional in nature, and in the previous post, we discussed how our intelligent storage layer helps make AlloyDB more than 4x faster than standard PostgreSQL for such workloads in our performance tests. However, PostgreSQL also has rich querying functionality that many users leverage both inside and outside of their applications. Analytical queries, i.e., queries involving scans, joins, and aggregations over large amounts of data, are a core part of many relational database workloads. These can include:

  • Queries within an application that serves end users; for example, an ecommerce application that shows the most popular products segmented by region, based on recent sales across multiple properties
  • Real-time business insights that require access to the most up-to-date data; for example, a recommendation engine in a retail application that shows suggested add-on purchases based on what the user has in their cart, the time of day, and historical purchase behavior
  • Ad-hoc querying, where a developer or user needs to query the database directly to answer a question; for example, running an audit on recent transactions for a regulator

Tuning operational databases to perform well in these varying use cases can be difficult. Historically, users with these types of workloads have had to create indexes and optimize schemas to ensure sufficient query performance. This not only increases management complexity, but can also impact transactional performance. Slow query performance can also constrain what developers are able to deliver to their end users, and deter development of real-time business insights.

AlloyDB provides a better way. Powered by a columnar engine, AlloyDB performed up to 100x faster than standard PostgreSQL for analytical queries based on our performance tests, with no schema changes, application changes, or ETL required. This technology keeps frequently queried data in an in-memory, columnar format for faster scans, joins, and aggregations. 

The embedded machine learning in AlloyDB makes accessing this technology easier than ever. AlloyDB automatically organizes your data between row-based and columnar formats, learning your workload to choose the right columns and tables and converting them to columnar format automatically. The query planner smartly chooses between columnar and row-based execution plans, so transactional performance is maintained. This allows AlloyDB to deliver excellent performance for a wide range of queries with minimal management overhead.
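For a concrete sense of how this surfaces to users, the sketch below checks which plan the planner chose for a query. This is a minimal, hypothetical example assuming a psycopg2 connection to an AlloyDB instance with the columnar engine enabled; the exact plan-node text for columnar execution is an assumption, so treat the string check as illustrative rather than as the documented interface.

```python
# Hypothetical sketch: inspect whether the planner chose a columnar plan.
# Assumes psycopg2 and an AlloyDB instance with the columnar engine enabled;
# the "Custom Scan" plan-node text is an assumption, not a documented contract.
import psycopg2

conn = psycopg2.connect("host=... dbname=... user=... password=...")  # fill in
with conn.cursor() as cur:
    cur.execute("EXPLAIN (COSTS OFF) SELECT count(*) FROM sales WHERE region = 'EMEA'")
    plan = "\n".join(row[0] for row in cur.fetchall())

# A columnar plan would surface as a custom scan node; a plain row-store
# plan shows a Seq Scan instead.
print("columnar path" if "Custom Scan" in plan else "row-based path")
```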

A refresher on column-oriented data representation

Traditional databases are row-oriented and store data in fixed-size blocks. This organization is optimal for access patterns that request information across a whole row, for example, when an application needs to look up the information pertaining to a particular user.

Analytical queries require different access patterns. In order to answer analytical queries from a row-oriented data store, whole tables need to be scanned, reading through every column of every row, even though much of the data stored in the table is not relevant to answering the questions. 

Column-oriented data representation enables answering analytical questions faster by keeping the values of a single column together. By reading only the relevant columns, column-oriented databases can deliver faster responses to analytical queries.
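To make the contrast concrete, here is a toy Python sketch of the same table held both ways; the table contents are invented for illustration:

```python
# Toy illustration: the same table in row-oriented and column-oriented form.
rows = [
    {"user_id": 1, "name": "Ada",   "country": "UK", "purchases": 12},
    {"user_id": 2, "name": "Grace", "country": "US", "purchases": 7},
    {"user_id": 3, "name": "Alan",  "country": "UK", "purchases": 3},
]

# Column-oriented: one contiguous array per column.
columns = {
    "user_id":   [1, 2, 3],
    "name":      ["Ada", "Grace", "Alan"],
    "country":   ["UK", "US", "UK"],
    "purchases": [12, 7, 3],
}

# Transactional lookup (whole row): the row layout answers this directly.
user = next(r for r in rows if r["user_id"] == 2)

# Analytical aggregate (one column): the columnar layout reads only the
# "purchases" array and never touches names or countries.
total_purchases = sum(columns["purchases"])
```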

AlloyDB columnar engine 

Google has a long history of innovation in large-scale data analytics, especially with services like BigQuery. These services offer fast, scalable query processing through the use of optimized, columnar data layouts, state-of-the-art query processing techniques, and hardware acceleration. 

AlloyDB embeds some of the same technological advancements directly into a PostgreSQL-compatible operational database. It combines a row-based format for transactional processing and storage with a columnar format and execution engine to provide the best of both worlds.

The columnar engine is a modern, vectorized query processing engine that efficiently processes chunks of columnar data by making optimal use of the system caches and vector processing instructions provided by today's CPUs.
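The following NumPy sketch illustrates the idea; NumPy's array operations stand in for the engine's SIMD-accelerated loops, and the data is invented:

```python
# Vectorized vs. row-at-a-time filtering over a columnar chunk.
import numpy as np

quantity = np.random.randint(1, 50, size=1_000_000)  # one columnar chunk

# Row-at-a-time style: interpret each value individually.
matches_scalar = [q < 25 for q in quantity]

# Column-at-a-time style: one operation over the whole contiguous chunk,
# which the CPU can execute with SIMD instructions, producing a bitmap
# (boolean array) of passing rows.
matches_vector = quantity < 25
```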

Beyond leveraging the capabilities of modern hardware, the columnar engine includes several algorithmic optimizations to further speed up query processing. It makes use of column-specific metadata, such as minimum and maximum values, to speed up scans, and can perform other operations like aggregation directly on the relevant columns without materializing the results of a scan. Finally, hybrid execution combines both columnar and row-oriented query processing techniques where beneficial.
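The sketch below illustrates two of these ideas in Python on a column split into blocks: per-block min/max metadata prunes blocks, and the aggregate is computed directly from the surviving values without materializing a filtered copy of the table. The block layout and thresholds are invented:

```python
# Min/max pruning plus direct aggregation for: SELECT sum(v) WHERE v < 25.
import numpy as np

blocks = [np.random.randint(lo, lo + 10, size=4096) for lo in (0, 20, 40)]
metadata = [(int(b.min()), int(b.max())) for b in blocks]  # per-block min/max

total = 0
for block, (mn, mx) in zip(blocks, metadata):
    if mx < 25:       # every value passes: aggregate the block outright
        total += int(block.sum())
    elif mn >= 25:    # no value can pass: skip the block entirely
        continue
    else:             # mixed block: filter and aggregate in one pass
        total += int(block[block < 25].sum())
```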

Columnar data format

The columnar engine intelligently determines the data format and metadata for each column; it learns from both the content of the column and the type of query operations that are executed. Learned metadata is used both to encode the data values efficiently and to accelerate query processing. For example, when a string column that has a small number of distinct values is used in filters, the columnar engine may decide to generate a list of the distinct values as metadata. This can then be used to accelerate both equality and range-based filters. As another example, the columnar engine may keep minimum and maximum values of a date column (for a given range of rows) as metadata; this may then be used to skip that range of rows when processing certain filters. In addition, the columnar engine may also use compression techniques to make efficient use of memory and speed up query processing. 
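To illustrate, here is a Python sketch of the two kinds of learned metadata described above: a distinct-value list for a low-cardinality string column, which also yields a compact encoding, and min/max values over a range of rows for a date column. The representation is invented and is not AlloyDB's actual format:

```python
# Learned column metadata: distinct-value dictionary and min/max per row range.
import datetime as dt

ship_modes = ["MAIL", "AIR", "TRUCK", "MAIL", "SHIP", "AIR"]
dictionary = sorted(set(ship_modes))                  # small distinct-value list
encoded = [dictionary.index(v) for v in ship_modes]   # compact integer codes

# Equality filter: if the constant is absent from the dictionary, no row in
# this column unit can match and the scan is skipped outright.
def can_match(value):
    return value in dictionary

# Range pruning on a date column: keep min/max for a given range of rows,
# and skip the range when a filter cannot overlap it.
dates = [dt.date(2022, m, 1) for m in range(1, 7)]
range_min, range_max = min(dates), max(dates)

def range_overlaps(lo, hi):
    return not (hi < range_min or lo > range_max)
```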

Query acceleration

The columnar engine transforms queries into a series of operations on columns and their metadata. Typically this involves first looking up the metadata to determine the most efficient type of operation to perform on an array of column values. These operations on column values, called vectorized operations, are designed to be executed using hardware-accelerated vectorized (SIMD) instructions that are available on modern CPUs.

The columnar engine can also perform table scan operations efficiently without fully materializing the result of the table scan; for example, if an aggregation operation needs to be performed after a table scan, it may use the result of evaluating the filters to directly perform vectorized aggregation operations on the relevant columns. Join operations are transparently accelerated using bloom filters, depending on selectivity; this optimization uses the power of vectorized filtering to reduce the number of rows that need to be processed by the join operation.
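A minimal Python sketch of the bloom-filter join idea follows: build a compact filter from the smaller side's join keys, then use cheap membership tests to discard most non-matching rows from the larger side before the join proper. The hashing and sizing here are deliberately simplified stand-ins; a real engine derives them from statistics:

```python
# Simplified bloom filter over join keys (a set of bit positions stands in
# for a real bit array; two seeded hashes stand in for tuned hash functions).
M = 1 << 16
SEEDS = (0x9E3779B9, 0x85EBCA6B)

def bloom_add(bits, key):
    for seed in SEEDS:
        bits.add(hash((seed, key)) % M)

def bloom_maybe_contains(bits, key):
    return all(hash((seed, key)) % M in bits for seed in SEEDS)

small_side_keys = [3, 7, 42]          # build side of the join
bits = set()
for k in small_side_keys:
    bloom_add(bits, k)

# Probe side: most non-matching rows are rejected before the join itself,
# at the cost of occasional false positives that the join then filters out.
candidates = [k for k in range(1_000) if bloom_maybe_contains(bits, k)]
```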

Let's take a look at how some of these optimizations apply to a simple query involving a table scan, based on a schema similar to the Star Schema Benchmark. This query scans the list of sales, filtering by shipping mode and quantity. The following figure illustrates how the table scan with two filters is executed. For the filter lo_shipmode = ‘MAIL’, the columnar engine first checks the column’s metadata to see if the value ‘MAIL’ is present in this dataset. If ‘MAIL’ does occur, the columnar engine proceeds by searching using SIMD instructions. The resulting bitmap of passing rows is further filtered using the result set from the next filter.

Alternatively, if the value ‘MAIL’ had not occurred in the column metadata, the columnar engine could have skipped searching a large number of values. Similarly, the columnar engine may also use min/max metadata on the lo_quantity column to skip some rows based on the range filter on that column.
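Putting these pieces together, here is an illustrative Python sketch of that two-filter scan: a metadata check for ‘MAIL’ first, a vectorized search only if the value can occur, and a bitwise AND of the two filters' bitmaps. The data, the quantity threshold, and the representation are all invented:

```python
# Two-filter columnar scan: metadata check, vectorized search, bitmap AND.
import numpy as np

lo_shipmode = np.array(["MAIL", "AIR", "MAIL", "SHIP", "TRUCK", "MAIL"])
lo_quantity = np.array([10, 30, 22, 5, 48, 24])
shipmode_dictionary = set(lo_shipmode)            # learned column metadata

if "MAIL" in shipmode_dictionary:                 # check metadata first
    shipmode_bitmap = lo_shipmode == "MAIL"       # SIMD-style vectorized search
else:                                             # value cannot occur:
    shipmode_bitmap = np.zeros(len(lo_shipmode), dtype=bool)  # skip the search

quantity_bitmap = lo_quantity < 25                # range filter (invented bound)
passing = shipmode_bitmap & quantity_bitmap       # combine the two bitmaps
print(np.flatnonzero(passing))                    # rows satisfying both filters
```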

