
Five Steps to an Awesome Data Model in Apache Cassandra

 4 years ago
source link: https://scotch.io/tutorials/five-steps-to-an-awesome-data-model-in-apache-cassandra

Congratulations, you’re starting out on Apache Cassandra, a favorite choice among architects and developers for its performance, scalability, continuous availability, geographic distribution, and ease of management. Whether you plan to use it for customer apps, e-commerce, IoT, fraud detection, or anything else, your first stop is not only understanding the technology but also learning how to create the right data model to fit your application’s performance and scalability goals.

Fixing a poorly designed data model after an application is in production is an experience nobody wants to go through, so it’s better to take some time upfront and use a proven methodology to design it right. That’s exactly what you’ll learn here. We’ve broken it down into five steps:

  • Step 1: Understand your application workflow
  • Step 2: Model the queries required by the application
  • Step 3: Design the tables
  • Step 4: Determine primary keys
  • Step 5: Use the right data types effectively

And if you want to start playing with Cassandra right now, the fastest way to practice and prototype is to deploy the free DataStax Apollo beta: it’s Cassandra as a cloud-native service, so you can deploy in just a few clicks.

First Things First

You’re likely already familiar with relational databases such as Oracle, MySQL, and PostgreSQL, so it’s helpful to understand the big architectural differences with Cassandra when it comes to data modeling:

  • Denormalization is expected. With relational databases, designers are usually encouraged to store data in a normalized form. In Cassandra, storing the same data redundantly in multiple tables is not only expected, but it also tends to be a feature of a good data model.
  • Writes are (almost) free. Due to Cassandra’s architecture, writes are shockingly fast compared to relational databases.
  • No joins. Relational databases usually reference fields from multiple tables in a single query by joining tables. With Cassandra, this functionality doesn’t exist, so developers must structure their data model accordingly.
  • Consistency is tunable. Relational databases are ACID compliant (Atomicity, Consistency, Isolation, Durability). In contrast, Cassandra supports tunable consistency, allowing trade-offs between consistency, availability, and performance.
  • Indexing is different. With relational databases, queries are usually optimized by simply creating an index on a field. In Cassandra, tables are usually designed to support specific queries, and secondary indexes are useful only in specific circumstances, rather than being a “silver bullet.”
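The consistency trade-off in the list above can be made concrete with a little arithmetic. The following is a minimal sketch, assuming the commonly cited rule that a read is certain to overlap the latest write when the number of replicas read plus the number of replicas written exceeds the replication factor. The function names are illustrative and not part of any Cassandra driver.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that form a majority for a given replication factor."""
    return replication_factor // 2 + 1


def overlaps(read_replicas: int, write_replicas: int, replication_factor: int) -> bool:
    """True when every read set must intersect every write set (R + W > RF)."""
    return read_replicas + write_replicas > replication_factor


rf = 3
q = quorum(rf)
print(q)                    # 2
print(overlaps(q, q, rf))   # True: quorum writes plus quorum reads
print(overlaps(1, 1, rf))   # False: one replica each way can return stale data
```

Reading and writing a single replica is the fastest option but gives up the overlap guarantee, which is the trade-off between consistency, availability, and performance in miniature.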

How Cassandra Stores Data

Cassandra clusters have multiple nodes running in local data centers or public clouds, and data is typically stored redundantly across those nodes. Tables in Cassandra are similar to RDBMS tables. Rows in a table are spread across the cluster: a partitioning function hashes each row’s partition key to a token, and that token determines which nodes store the data and its replicas. A Cassandra cluster can be conceptually represented as a ring, where each node is responsible for a range of tokens.


Queries that look up records based on the partition key are extremely fast because Cassandra can immediately determine the host holding required data using the partitioning function. Since clusters can potentially have hundreds or even thousands of nodes, Cassandra can handle many simultaneous queries because queries and data are distributed across cluster nodes.
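The ring lookup described above can be sketched in a few lines. This is a toy model under stated assumptions: MD5 stands in for the partitioning function (Cassandra's default partitioner uses Murmur3), each node owns a single token, and replicas are simply the next nodes clockwise. The `Ring` class, node names, and token values are all hypothetical.

```python
import bisect
import hashlib


def token_for(partition_key: str) -> int:
    """Hash a partition key to a 64-bit token (MD5 used as a stand-in hash)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class Ring:
    """A toy ring: each node owns the token range that ends at its own token."""

    def __init__(self, node_tokens: dict) -> None:
        pairs = sorted((token, node) for node, token in node_tokens.items())
        self.tokens = [token for token, _ in pairs]
        self.nodes = [node for _, node in pairs]

    def replicas(self, partition_key: str, replication_factor: int = 1) -> list:
        """Find the owning node, then walk clockwise to collect replicas."""
        start = bisect.bisect_left(self.tokens, token_for(partition_key))
        count = min(replication_factor, len(self.nodes))
        return [self.nodes[(start + i) % len(self.nodes)] for i in range(count)]


# Four hypothetical nodes spaced evenly around a 64-bit token space.
ring = Ring({"node-a": 0, "node-b": 2**62, "node-c": 2**63, "node-d": 3 * 2**62})
print(ring.replicas("videoid-1234", replication_factor=3))
```

Because the owner is computed from the key alone, any node can route a request without consulting a central directory, which is why partition-key lookups are fast.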

Three Data Modeling Best Practices

  1. Spread data evenly around the cluster. For Cassandra to work optimally, data should be spread as evenly as possible across cluster nodes.
  2. Minimize the number of partitions to read. When Cassandra reads data, it’s best to read from as few partitions as possible since each partition potentially resides on a different cluster node.
  3. Anticipate how data and requirements will grow. For example, would you design the data model differently if certain transactions grew in volume 100X or 1000X?
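One way to check the first and third practices before committing to a schema is to count how many rows each candidate partition key would produce per partition. The sketch below uses made-up sample data (the country codes and user ids are hypothetical) and only illustrates the comparison; real sizing would use production-like volumes.

```python
from collections import Counter

# Hypothetical sample: (country, userid) pairs for 1,000 video views.
views = [("US" if i % 10 < 7 else "DE", f"user-{i % 250}") for i in range(1000)]


def partition_sizes(rows, key_index):
    """Count rows per partition for the column chosen as the partition key."""
    return Counter(row[key_index] for row in rows)


by_country = partition_sizes(views, 0)
by_user = partition_sizes(views, 1)

# Number of partitions and size of the largest one for each candidate key.
print(len(by_country), max(by_country.values()))  # 2 700
print(len(by_user), max(by_user.values()))        # 250 4
```

A low-cardinality key such as country concentrates data in a few large partitions that keep growing with volume, while a high-cardinality key spreads the same rows over many small ones.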

To learn more about Cassandra’s distributed architecture and how data is stored, check out the free DataStax Academy class, DataStax Enterprise Foundations of Apache Cassandra. You will master Cassandra's internal architecture by studying the read path, write path, and compaction. Topics such as consistency, replication, anti-entropy operations, and gossip ensure you have a strong handle on the technology and the data modeling implications.


Five Steps to Building an Awesome Data Model

It’s always helpful to focus on a concrete example. In the sections that follow, data modeling is discussed in the context of DataStax’s reference application, KillrVideo, a fictitious company operating an online video service.

Step 1: Understand your application workflow

When building applications using relational databases, developers often start with the data model, creating the entities and relationships. With Cassandra, the best practice is to start with the application workflow, an approach referred to as “query-first design”: first understand what types of queries the database will need to support. In the KillrVideo workflow, for example, the sequence of steps matters because it helps us determine that a userid or videoid is required to support subsequent queries, which in turn affects table design.


Step 2: Model the queries required by the application

Taking a query-first approach means not only thinking through the sequence of tasks required but also mocking up what each screen will look like and deciding what data will be required, and when, for each of the key entities. For KillrVideo, for example, the application UX flow is annotated with the database transactions required to support each step.


Step 3: Design the tables

In Cassandra, tables can be grouped into two distinct categories:

  • Tables with single-row partitions. These tables have a primary key that consists only of the partition key. They are used to store entities and are usually normalized. In our example, Users, Comments, and Videos are entities. The Videos table includes columns such as the user who uploaded the video, a description, and details about where the video was taken.


  • Tables with multi-row partitions. These tables have primary keys that are composed of partition and clustering keys. They are used to store relationships and related entities. Remember that Cassandra doesn’t support joins, so structure tables to support queries that relate to multiple data items.

The latest_videos table illustrates what is meant by “query-first design.” The application will need to query the most recently uploaded videos every time a user visits the KillrVideo homepage, so this query needs to be very efficient.
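The two table categories, and the denormalization that connects them, can be mimicked with plain dictionaries. This is only an in-memory sketch of the idea, not driver code: the `name` column and the `add_video` helper are assumptions made for illustration, and each dictionary stands in for one Cassandra table.

```python
import uuid
from datetime import datetime, timezone

videos = {}          # single-row partitions: videoid -> one row
latest_videos = {}   # multi-row partitions: yyyymmdd -> many rows


def add_video(userid, name, added_date):
    """Write the same video to both tables, one write per query it serves."""
    videoid = uuid.uuid4()
    videos[videoid] = {"userid": userid, "name": name, "added_date": added_date}
    day = added_date.strftime("%Y%m%d")
    latest_videos.setdefault(day, []).append(
        {"added_date": added_date, "videoid": videoid, "name": name}
    )
    return videoid


when = datetime(2019, 11, 5, 10, 15, tzinfo=timezone.utc)
vid = add_video("user-1", "Intro to Cassandra", when)

print(videos[vid]["name"])                # lookup by videoid
print(len(latest_videos["20191105"]))     # homepage query reads one partition
```

The duplicated `name` value is deliberate: the homepage query can be answered from latest_videos alone, without a second lookup that would stand in for a join.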

Use a Chebotko Diagram to Represent Your Schema

A Chebotko diagram is a good tool for mapping out the logical and physical data models required to support an application.


The Chebotko diagram captures the database schema, showing table names, partition key columns (K), clustering key columns (C) and their ordering, static columns (S), and regular columns with data types. The tables are organized based on the application workflow to support specific workflow steps and application queries.
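The K, C, and S markers map directly onto a table's primary key definition. As a rough sketch, the notation can be held in a small data structure and turned into a CQL-style PRIMARY KEY clause. The `Column` class, the `primary_key` helper, the `name` column, and the column types shown for latest_videos are assumptions for illustration, not taken from the KillrVideo schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    cql_type: str
    role: str = ""  # "K" partition key, "C" clustering, "S" static, "" regular


def primary_key(columns):
    """Build a PRIMARY KEY clause from the K and C markers, in listed order."""
    partition = [c.name for c in columns if c.role == "K"]
    clustering = [c.name for c in columns if c.role == "C"]
    partition_part = ", ".join(partition)
    if len(partition) > 1:
        # A multi-column partition key gets its own inner parentheses.
        partition_part = f"({partition_part})"
    return "PRIMARY KEY (" + ", ".join([partition_part] + clustering) + ")"


latest_videos = [
    Column("yyyymmdd", "text", "K"),
    Column("added_date", "timestamp", "C"),
    Column("videoid", "uuid", "C"),
    Column("name", "text"),
]

print(primary_key(latest_videos))  # PRIMARY KEY (yyyymmdd, added_date, videoid)
```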

Get complementary training with our DataStax Enterprise Practical Application Data Modeling with Apache Cassandra online class at the DataStax Academy. Access all course materials, exercises, data files, and scripts that show you the fundamentals of creating a good data model, a data modeling methodology, and the common issues you should avoid. This course will up your data modeling game!

Step 4: Determine primary keys

In Cassandra, the primary key is made up of a partition key, optionally followed by one or more clustering columns that control how rows are laid out within a Cassandra partition.

In the latest_videos table, yyyymmdd is the partition key, and it is followed by two clustering columns, added_date and videoid, ordered in a fashion that supports retrieving the latest videos.
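The effect of the clustering columns can be imitated with a sorted list per partition. This is a minimal sketch, assuming the clustering order is added_date descending and then videoid ascending (the text only says the order supports retrieving the latest videos). The `insert` and `latest` helpers and the sample values are hypothetical.

```python
from collections import defaultdict

# partition key (yyyymmdd) -> rows of (added_date, videoid)
latest_videos = defaultdict(list)


def insert(yyyymmdd, added_date, videoid):
    """Add a row and keep the partition sorted by its clustering columns."""
    rows = latest_videos[yyyymmdd]
    rows.append((added_date, videoid))
    # Assumed clustering order: added_date descending, then videoid ascending.
    # Two stable sorts: secondary key first, primary key second.
    rows.sort(key=lambda row: row[1])
    rows.sort(key=lambda row: row[0], reverse=True)


def latest(yyyymmdd, limit):
    """Read the newest rows from one partition; no other partition is touched."""
    return latest_videos[yyyymmdd][:limit]


insert("20191105", "2019-11-05T09:00:00", "v1")
insert("20191105", "2019-11-05T12:30:00", "v2")
insert("20191105", "2019-11-05T12:30:00", "v0")
insert("20191104", "2019-11-04T23:59:00", "v3")

print(latest("20191105", 2))
# [('2019-11-05T12:30:00', 'v0'), ('2019-11-05T12:30:00', 'v2')]
```

Because rows are already stored in the order the homepage needs, the "latest N" read is a simple prefix of one partition rather than a sort at query time.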


Good examples of unique keys are customer IDs, order IDs, and transaction IDs. Relational databases often use simple auto-incrementing integers to assign unique keys to records, but this approach isn’t practical in a distributed system like Cassandra. To address this problem, Cassandra supports universally unique identifiers (UUIDs) as a native data type. UUIDs are 128-bit numbers that are, for all practical purposes, unique within the scope of an application.

Some developers might prefer to devise their own naming schemes to make keys easier to understand, but it’s important to think about the maintenance impact if the business changes and renders the scheme obsolete. For that reason, UUIDs can sometimes be more maintainable in the long run.
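On the client side, keys of this kind can be generated without asking the database for a counter. The sketch below uses Python's standard uuid module: a random (version 4) UUID, and a time-based (version 1) UUID, which is the format Cassandra's TIMEUUID type uses. The `timeuuid_to_datetime` helper is an illustrative name, not a driver function.

```python
import uuid
from datetime import datetime, timezone

# 100-nanosecond intervals between the UUID epoch (1582-10-15) and the Unix epoch.
UUID_EPOCH_OFFSET = 0x01B21DD213814000


def timeuuid_to_datetime(value: uuid.UUID) -> datetime:
    """Recover the embedded timestamp from a version 1 (time-based) UUID."""
    if value.version != 1:
        raise ValueError("only version 1 UUIDs carry a timestamp")
    unix_seconds = (value.time - UUID_EPOCH_OFFSET) / 1e7
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc)


random_id = uuid.uuid4()  # random; no ordering information
time_id = uuid.uuid1()    # time-based; sortable by creation time

print(random_id.version, time_id.version)  # 4 1
print(timeuuid_to_datetime(time_id))
```

A time-based UUID is a reasonable choice when the key also needs to act as a clustering column that orders rows by creation time.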

Step 5: Use the right data types effectively

Cassandra supports a wide variety of data types that will be familiar to most developers: BigInt, Blob, Boolean, Decimal, Double, Float, Inet (IP addresses), Int, Text, VarChar, UUID, TIMEUUID, etc.

Collections, another Cassandra data type, can simplify database design and reduce the number of tables required. Collection data types include sets, lists, maps, tuples, and nested collections.

Another data type in Cassandra that provides flexibility is a user-defined type (UDT). UDTs can attach multiple data fields, each named and typed, to a single column. In KillrVideo, for example, rather than adding multiple address-related fields, an address type can be created…

And used in multiple Cassandra tables…
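The same idea can be expressed on the application side with a small value class that is reused wherever an address appears. This is a sketch only: the field names of `Address`, and the `User` and `Video` shapes, are hypothetical, since the original schema images are not reproduced here. The comments note the CQL collection types each attribute loosely corresponds to.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass(frozen=True)
class Address:
    """Stand-in for a user-defined type; field names are hypothetical."""
    street: str
    city: str
    postal_code: str
    country: str


@dataclass
class User:
    userid: str
    # Roughly a map<text, frozen<address>> column: label -> address.
    addresses: Dict[str, Address] = field(default_factory=dict)


@dataclass
class Video:
    videoid: str
    # Roughly a set<text> column.
    tags: Set[str] = field(default_factory=set)
    # The same address type reused in a second table.
    location: Optional[Address] = None


home = Address("1 Main St", "Austin", "78701", "US")
user = User("user-1", addresses={"home": home})
video = Video("video-1", tags={"cassandra", "tutorial"}, location=home)

print(user.addresses["home"].city, video.location.city)  # Austin Austin
```

Defining the type once keeps the address fields consistent across tables and avoids a separate column for each field in every table that needs an address.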


Learning more

Getting the data model right is a critical first step in building a successful, scalable Cassandra database that is easy to manage and maintain. A five-step approach can help, including using a “query-first” approach, employing Chebotko diagrams, carefully thinking through keying approaches, and using all the data types at your disposal. Download the DataStax whitepaper “Data Modeling in Apache Cassandra”, which covers the same ground as this article in much more detail.

Getting Started

Get started at warp speed by deploying the free DataStax Apollo beta: it’s Cassandra made easy in the cloud. Or download the DataStax Distribution of Apache Cassandra and run it on-premises or in any cloud. It’s never been easier to get started with data modeling and deploying with Cassandra!

This content is sponsored via Syndicate Ads .

