A Data Warehouse Alone Won’t Cut It for Modern Analytics

The data warehouse market has evolved since the creation of OLAP; now we need to evolve as well in order to meet the needs of modern analytics.

Jan. 06, 23 · Opinion

The world of data warehouses as we know it can be traced back to 1993, when Dr. E.F. Codd conceived the idea of OLAP. A marvel for its time, the then-new processing paradigm for decision support systems paved the way for modern-day data warehouses. Codd's influence gave rise to the idea of a purpose-built data system for processing analytics queries and aggregations on large data sets, and it became clear that keeping a separate relational database for transactions and another for analytics made sense.

Since OLAP came about, the data warehouse market has seen a bit of an evolution. Over the last 20 years, the market adopted columnar storage with Vertica, moved to the cloud with Snowflake, and is now beginning to morph into lakehouses. Despite all this change in technology, these systems all address the same use case: the classic business intelligence and reporting workflow.

The workflow always starts the same way: a business user wants a better grasp of their metrics, so they contact a data analyst, who works with them to hammer out the necessary SQL queries. Once those are ironed out, the analyst uses a BI tool to run the queries against the data warehouse and notifies the business user once the report is done.

It is important to recognize that this is the standard workflow for every data warehouse, but there are some things to take notice of:

  1. The latency introduced when the data set being analyzed is large (greater than 1 TB or a few million rows) is meaningful. But this is typically acceptable for a report, as query time is less of a concern when the report is usually sitting in an inbox or cached in a dashboard.

  2. There is a users-performance-cost relationship similar to the ideal gas law, i.e., pressure increases as temperature increases. In the same way, an increase in the number of queries increases the response time (unless a high expense is paid to scale the warehouse). But this is not really an issue for the reporting workflow. The number of users initiating data requests is small, so the BI tool typically doesn't put a strain on warehouse performance.

  3. Data warehouses are made for batch-ingested historical data that is typically refreshed overnight so as not to impact query performance during the working day. This is acceptable when the insights are based on long historical windows, but not when insights on operational, real-time data are also needed.

It's understandable to think of these as limitations of data warehouses. They can be, but they are really design trade-offs: data warehouses are purpose-built for the standard reporting workflow. With that in mind, these three points become extremely important for an emerging analytics use case: analytics applications.

You can see an example of this at Salesforce, which built an observability application that allows engineers, customer service representatives, product owners, and others to interactively query real-time application performance metrics ingested from billions to trillions of Kafka log lines a day.

You may be asking: “Do they need performance at that scale?” “Does their team have unrestricted ability to query raw data without latency?” “Are they able to analyze streaming events as they are created, at enormous scale?” The answer to all of those questions is yes!

The reality is that thousands of companies have come to the conclusion that there is a whole new generation of analytics beyond traditional BI. Their developers are creating analytics applications that provide operational visibility at scale, enable slice-and-dice at high volume, support real-time decisioning, and extend insights to customers.

What Are the Database Requirements for Analytics Applications?

When you’re building an analytics application but aren’t sure what database to use, you typically start with what you already know and go from there.

So you think about what you need. You need analytics capabilities to process terabytes of data for large aggregations and GROUP BY queries on high-dimensional, high-cardinality data. That would lead you to the idea of a data warehouse.

Hold on, you also need application traits: sub-second response with potentially thousands of queries per second, as well as high throughput of up to millions of events ingested per second in real time. That would lead you to the idea of a transactional database. But wait, that won't work for the analytics requirements you have.
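To make that tension concrete, here is a minimal sketch, in Python, of the kind of query both sides are asking about. The `events` table, its columns, and the numbers in the comments are purely illustrative assumptions, not details from the article.

```python
# Purely illustrative: a hypothetical "events" table and column names.
# The analytics side of the requirement: a large aggregation with GROUP BY
# over high-cardinality dimensions across terabytes of history.
REPORT_QUERY = """
SELECT
    country,
    device_type,
    COUNT(DISTINCT user_id) AS unique_users,
    SUM(revenue)            AS total_revenue
FROM events
WHERE event_time >= CURRENT_TIMESTAMP - INTERVAL '90' DAY
GROUP BY country, device_type
ORDER BY total_revenue DESC
"""

# The application side of the requirement: the same class of query, but it has
# to come back in well under a second while thousands of users issue it
# concurrently and millions of new events stream in every second -- which is
# exactly where a warehouse tuned for overnight batch reporting struggles.
if __name__ == "__main__":
    print(REPORT_QUERY.strip())
```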

Enter Apache Druid

This problem is why Druid was born. Druid was made for the intersection of analytics and applications. It's not a play at HTAP, but rather a real-time analytics database that provides the speed and scale needed for both analytics and applications.

Druid exists to handle complex queries on large data sets with the responsiveness expected of user-facing applications and automated pipelines. That is why Reddit chose Druid. With Druid, Reddit can process user data to serve real-time ads in less than 30 milliseconds.
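For a sense of what this looks like from the application side, here is a minimal sketch of issuing an aggregation through Druid's SQL API over HTTP. The router/broker URL, the `events` datasource, and its columns are assumptions for illustration; check the endpoint and port against your own deployment.

```python
# Minimal sketch: querying Apache Druid's SQL endpoint from an application.
# The URL, datasource, and columns below are illustrative assumptions.
import requests

DRUID_SQL_URL = "http://localhost:8888/druid/v2/sql"  # router/broker address is deployment-specific

QUERY = """
SELECT country, device_type, COUNT(DISTINCT user_id) AS unique_users
FROM events
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
GROUP BY country, device_type
ORDER BY unique_users DESC
LIMIT 20
"""

response = requests.post(DRUID_SQL_URL, json={"query": QUERY}, timeout=5)
response.raise_for_status()
for row in response.json():  # Druid returns a JSON array of result rows
    print(row)
```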

Uniquely, Druid also achieves high queries per second without taking a heavy toll on your wallet, thanks to its CPU-efficient architecture. Twitch is a case in point: its engineers gladly run 70,000 queries per day with the help of Druid.

With Druid, Apache Kafka and Amazon Kinesis pipelines also get a database built for data in motion, because Druid supports true stream ingestion. That is why the engineers at Confluent, the original creators of Apache Kafka®, use Druid to power their observability application, which ingests more than 3 million events per second.
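As a rough sketch of what that wiring can look like, the snippet below submits a Kafka streaming ingestion supervisor spec to Druid over HTTP. The topic, datasource, columns, broker addresses, and URL are illustrative assumptions, and the exact spec fields can differ between Druid versions, so treat this as a starting point rather than a copy-paste recipe.

```python
# Minimal sketch: starting continuous Kafka ingestion into Druid by submitting
# a supervisor spec. All names and addresses here are illustrative assumptions.
import requests

SUPERVISOR_URL = "http://localhost:8888/druid/indexer/v1/supervisor"

supervisor_spec = {
    "type": "kafka",
    "spec": {
        "dataSchema": {
            "dataSource": "app_metrics",
            "timestampSpec": {"column": "timestamp", "format": "iso"},
            "dimensionsSpec": {"dimensions": ["service", "endpoint", "status_code"]},
            "granularitySpec": {"segmentGranularity": "HOUR", "queryGranularity": "NONE"},
        },
        "ioConfig": {
            "topic": "app-metrics",
            "inputFormat": {"type": "json"},
            "consumerProperties": {"bootstrap.servers": "kafka-1:9092,kafka-2:9092"},
            "useEarliestOffset": True,
        },
        "tuningConfig": {"type": "kafka"},
    },
}

# Submitting the spec starts (or updates) the supervisor; Druid then consumes
# the topic continuously, making events queryable as they arrive rather than
# waiting for an overnight batch load.
resp = requests.post(SUPERVISOR_URL, json=supervisor_spec, timeout=10)
resp.raise_for_status()
print(resp.json())
```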

New Analytics Use Case, Purpose-Built Open Source Database

The classic BI and reporting workflow isn’t going anywhere. Business users will still get insights from data analysts who will still write SQL queries for BI tools that query a data warehouse — and the whole data warehouse and lakehouse gang will still continue to fight over that workload. 

It is just important to remember that if you are looking to build analytics applications, there is a more interesting database for you to check out.

Apache Druid isn't a traditional OLAP database, and the developers at thousands of companies aren't using it for traditional OLAP queries. Follow the lead of Netflix, Confluent, and Target: they're building and investing in the emerging use case that sits at the intersection of analytics and applications, analytics applications.

