
The Architecture of Google BigQuery and Its Features

source link: https://www.analyticsvidhya.com/blog/2022/10/the-architecture-of-google-bigquery-and-its-features/

This article was published as a part of the Data Science Blogathon.

Introduction

The dotcom boom spawned several web-based companies such as Google, Amazon, Facebook, Twitter, YouTube, and many others. These companies generate significantly more data than traditional Fortune 500 businesses: every user click, every search performed, every social media post, and every press of a Like button produces billions of rows of data every day.

Google BigQuery
Source: cloud.google.com
Traditional relational database technologies were not designed to handle the volume and variety of data generated by web-scale technology companies, so new classes of data storage and retrieval technologies were created to respond to the increasing performance demands of users. Imagine a Google search query taking several seconds to return results: Google’s entire search-based revenue model would be at risk, as users are generally unwilling to wait long to see the results of their actions.

Google BigQuery

Google BigQuery is a fully managed cloud data warehouse for analytics from Google Cloud Platform (GCP), which is one of the most popular cloud analytics solutions. Due to its unique architecture and seamless integration with other GCP services, certain elements should be considered Google BigQuery best practices when migrating data to Google Cloud.

These best practices ensure cost and performance optimization of BigQuery within your existing Google Cloud environment. To solve the problems of storing petabytes of data, moving it across the network, and answering queries in sub-second time, Google engineers built new technologies, initially for internal use, codenamed Dremel, Jupiter, and Colossus. Google BigQuery is the external expression of these technologies.

Dremel: Dremel is the query engine that powers BigQuery. It is a scalable system designed to query petabyte-scale datasets. It combines a columnar data layout with a tree architecture to handle incoming query requests, which allows Dremel to process trillions of rows in seconds. Unlike many database architectures, Dremel can independently scale compute nodes to meet the demands of even the most demanding queries.

Google BigQuery

Source: cloud.google.com

It is also the core technology that powers many Google services, such as Gmail and YouTube, and is widely used by thousands of users inside Google. It relies on a cluster of computing resources to perform parallel tasks at massive scale. Based on the incoming request, Dremel dynamically identifies the number of computing resources needed to fulfill it, pulls those resources from the pool of available compute, and processes the request. This extensive pooling of compute takes place under the hood, and the operation is fully transparent to the user entering the query. From the user’s point of view, they enter a query and get results in a predictable time, every time.
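
To make this concrete, here is a minimal sketch of submitting a query to BigQuery (and therefore to Dremel) from Python. It assumes the google-cloud-bigquery client library is installed and that application default credentials point at a GCP project; the public Shakespeare sample table is used purely for illustration.

from google.cloud import bigquery

# Assumes `pip install google-cloud-bigquery` and configured
# application default credentials for a GCP project.
client = bigquery.Client()

# Only the columns referenced here are scanned, thanks to the columnar layout.
sql = """
    SELECT word, SUM(word_count) AS total
    FROM `bigquery-public-data.samples.shakespeare`
    GROUP BY word
    ORDER BY total DESC
    LIMIT 10
"""

# Dremel plans and executes the query; the client just iterates the result rows.
for row in client.query(sql).result():
    print(row.word, row.total)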

Colossus: Colossus is Google’s distributed file system, used by many of its products. In each data center, Google runs a cluster of storage disks that provide capacity for its various services. By selecting appropriate replication and disaster recovery strategies, Colossus ensures that data stored on those disks is not lost.

Jupiter Network: The Jupiter Network is the bridge between the Dremel execution machine and the Colossus storage. The network in Google’s data centers offers unprecedented levels of two-way traffic, allowing large volumes of data to move between Dremel and Colossus.

Google BigQuery

Source: cloud.google.com

Google combined these technologies to create an external service called BigQuery under GCP. It is a cloud-native, fully managed data warehouse. With its decoupled compute and storage architecture, BigQuery offers exciting possibilities for companies large and small. Let’s look at some of the aspects of BigQuery that make it a compelling candidate for data warehousing.
Manageability: Google BigQuery is fully managed. Other services claim to offer this, but with BigQuery the management of the service is handled entirely by Google. Patching, upgrades, storage management, and compute allocation are all handled by the service, leaving nothing on the plate for users of the system. It is one service that does not require an administrator. By offering serverless execution, BigQuery removes traditionally complex activities such as server/VM management, server/VM sizing, and memory management.

Scalability: BigQuery relies on massively parallel compute and a highly secure, scalable storage engine to provide users with true scalability and consistent performance. A comprehensive software stack manages the entire infrastructure, which runs on thousands of machines in each region.

Storage: BigQuery allows users to load data in various formats, such as Avro, JSON, CSV, and more. On load, the data is converted into BigQuery’s internal columnar storage representation. Columnar storage has many advantages, including optimal storage utilization and the ability to scan data faster than traditional row-based storage formats. BigQuery transparently optimizes the files in its storage tier to ensure optimal query response time. From the user’s perspective, traditional backup, restore, and clone operations have no place in Google BigQuery.
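
As a minimal sketch of a batch load, the snippet below pulls a CSV file from Cloud Storage into a table using the Python client; the bucket path, dataset, and table names are hypothetical placeholders.

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical destination table and source file; replace with your own.
table_id = "my-project.my_dataset.sales"
gcs_uri = "gs://my-bucket/exports/sales.csv"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # skip the CSV header row
    autodetect=True,       # let BigQuery infer the schema
)

# BigQuery converts the CSV into its internal columnar format on load.
load_job = client.load_table_from_uri(gcs_uri, table_id, job_config=job_config)
load_job.result()  # wait for the batch load job to finish

print(client.get_table(table_id).num_rows, "rows loaded")
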
The design of Google BigQuery, with its separation of compute and storage, allows Google to offer BigQuery under attractive pricing models. When it is not in use, Google BigQuery on-demand charges only for storage, with no compute charges until a query is issued. This is a significant departure from traditional models that bill customers for compute resources whether they are in use or idle.

Data processing: Google BigQuery supports both streaming and batch data processing. Batch loading is free, while streaming ingestion is billed separately. The streaming capabilities of Google BigQuery allow users to ingest millions of rows per minute while ignoring the complexity of infrastructure management.
Data processing in the cloud

Source: cloud.google.com
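
As a minimal sketch of the streaming path, the snippet below pushes a couple of rows through the Python client’s streaming insert call; the table name and row payloads are hypothetical, and the table must already exist with a matching schema.

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical destination table; it must already exist with a matching schema.
table_id = "my-project.my_dataset.clickstream"

rows = [
    {"user_id": "u-123", "event": "click", "ts": "2022-10-01T12:00:00Z"},
    {"user_id": "u-456", "event": "search", "ts": "2022-10-01T12:00:01Z"},
]

# Streamed rows become queryable within seconds, but unlike batch loads
# streaming ingestion is billed separately.
errors = client.insert_rows_json(table_id, rows)
if errors:
    print("Failed rows:", errors)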

Pricing

BigQuery offers both a flat-rate and an on-demand pricing model, and the choice between them can be made based on query volume. Because storage and compute are separated, customers with infrequent query requirements, such as a mid-sized company or a single department, can benefit significantly: on-demand billing applies only to the resources used to process queries. Larger customers can instead pay for dedicated, flat-rate resources. On-demand pricing does not offer the same predictability as the flat-rate model, but it still makes sense for many use cases.
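
Under on-demand pricing, the cost of a query is driven by the bytes it scans, which can be estimated before running it. Below is a minimal sketch using the Python client’s dry-run mode; the roughly $5-per-TiB rate used here is an assumption and should be checked against current GCP pricing.

from google.cloud import bigquery

client = bigquery.Client()

# Dry run: BigQuery plans the query and reports bytes scanned without executing it.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

sql = "SELECT word, word_count FROM `bigquery-public-data.samples.shakespeare`"
job = client.query(sql, job_config=job_config)

tib = job.total_bytes_processed / 2**40
# Assumed on-demand rate of roughly $5 per TiB scanned; verify against current pricing.
print(f"Would scan {job.total_bytes_processed} bytes (~${tib * 5:.4f} on demand)")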

Security

Google BigQuery supports several different authentication models. Models based on OAuth and service accounts enable granting access to Google BigQuery resources. Users, groups, or service accounts can be granted access to BigQuery resources at different levels. The granularity of access control is limited to the dataset level, and any tables or views below a dataset automatically inherit its permissions. New data loss prevention features extend BigQuery’s security capabilities by allowing Google BigQuery users to discover, classify, and redact sensitive data.
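
As a minimal sketch of dataset-level access control, the snippet below grants read access to a user with the Python client; the dataset name and email address are hypothetical.

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical dataset and user; replace with your own.
dataset = client.get_dataset("my-project.my_dataset")

entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="READER",
        entity_type="userByEmail",
        entity_id="analyst@example.com",
    )
)
dataset.access_entries = entries

# All tables and views in the dataset inherit this permission.
client.update_dataset(dataset, ["access_entries"])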

Usability

Google BigQuery offers the access patterns expected of a data warehouse. It supports a CLI, SDKs, ODBC and JDBC drivers, a REST API, and the Google BigQuery console, where users can log in and run queries. All of these access patterns invoke the REST API under the covers and return the requested data to the user. Commonly used GUI tools such as DataGrip can also connect to Google BigQuery and explore the data it holds.
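
Since every access path ultimately goes through the REST API, here is a minimal sketch of calling it directly from Python. It assumes application default credentials are configured and uses the jobs.query method of the BigQuery v2 REST API; treat the endpoint details as an assumption to verify against the API reference.

import google.auth
from google.auth.transport.requests import AuthorizedSession

# Obtain application default credentials with a BigQuery scope.
credentials, project_id = google.auth.default(
    scopes=["https://www.googleapis.com/auth/bigquery"]
)
session = AuthorizedSession(credentials)

# jobs.query endpoint of the BigQuery v2 REST API.
url = f"https://bigquery.googleapis.com/bigquery/v2/projects/{project_id}/queries"
resp = session.post(url, json={"query": "SELECT 1 AS ok", "useLegacySql": False})
resp.raise_for_status()

print(resp.json().get("rows"))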

Conclusion

Google BigQuery has native capabilities for retrieving data from some Google services, such as Google Analytics and Google AdWords. However, for larger consolidation efforts, a data replication product like Daton can speed up data consolidation into Google BigQuery. BigQuery is a fully managed cloud data warehouse for analytics from Google Cloud Platform (GCP) and one of the most popular cloud analytics solutions. Due to its unique architecture and seamless integration with other GCP services, certain elements should be considered Google BigQuery best practices when migrating data to Google Cloud.
  • If Google search queries took even a few seconds to return results, Google’s entire search-based revenue model would be at risk, as users are generally unwilling to wait long to see the results of their actions.
  • BigQuery allows users to load data in various formats, such as Avro, JSON, CSV, and more; on load, the data is converted into an internal columnar storage representation.
  • Google BigQuery supports several different authentication models. Models based on OAuth and service accounts enable granting access to Google BigQuery resources.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
