source link: https://devblogs.microsoft.com/cosmosdb/auto-scaling-azure-cosmos-db-for-postgresql-with-citus-grafana-azure-functions/
Auto scaling Azure Cosmos DB for PostgreSQL with Citus, Grafana, & Azure Functions


Lucas Borges Fernandes

May 19th, 2023

One of the top Citus features is the ability to run PostgreSQL at any scale, on a single node as well as a distributed database cluster.

As your application needs to scale, you can add more nodes to the Citus cluster, rebalance existing data to the new Postgres nodes, and seamlessly scale out. However, these operations require manual intervention: a) first you must create alerts on metrics, b) then, based on those alerts, you need to add more nodes, c) then you must kick off and monitor the shard rebalancer. Automating these steps will give you a complete auto scale experience—and make your life so much easier.

In this blog post, you will learn how to build a full-fledged auto scaling setup for the Citus database extension running as a managed service on Azure—called Azure Cosmos DB for PostgreSQL. You’ll also learn how you can easily add nodes to the Azure Cosmos DB for PostgreSQL cluster and use any metrics available to trigger actions in your cluster. Let’s dive in.

You can also see a demo of the auto scaling setup in action in a talk I gave recently at Citus Con:

Figure 1: Video of the conference talk I gave at Citus Con: An Event for Postgres 2023, titled “Auto scaling Azure Cosmos DB for PostgreSQL with Citus, Grafana, & Azure Serverless”

Overview of the components for Auto Scaling

The auto scaling architecture this blog proposes combines multiple components that can run either as managed cloud services or on-premises. These components are:

  • Azure Functions
  • Azure Resource Manager (ARM)
  • Grafana
  • Azure Cosmos DB for PostgreSQL (aka Citus on Azure)

For the examples presented in this blog post, you will use the managed version of Citus on Azure, by using the Azure Cosmos DB for PostgreSQL service. The same architecture and steps outlined in this blog post will work for a Citus open source cluster, with the only change being the API used to trigger cluster modifications (e.g., add new nodes, scale up compute power, etc.). In the Azure example in this blog, cluster modifications are done through the Azure Resource Manager (ARM) APIs.

First let’s walk through each of these components. Feel free to skip ahead to Step 1: Creating & populating the cluster if you are already familiar with all of the components.

What are Azure Functions?

Azure Functions is a serverless solution offered by Microsoft that allows you to deploy your applications without having to worry about servers. (I sometimes refer to it conversationally as Azure Serverless.) The cloud infrastructure is all managed by Azure and you only have to focus on the code that matters to you. It is also a great way to save costs and run workloads that don’t require servers up and running 24/7.

These functions can be created in many languages, and Azure also provides a CLI to create boilerplates and deploy these functions in the cloud. See the following link for more details: Azure Functions Core Tools

After deploying the functions, you can use REST APIs to trigger them. It’s important to mention that a function may have a maximum execution time, so it’s advisable to not block them on I/O or other external integrations. You should consider an asynchronous model when possible.
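The asynchronous model mentioned above can be sketched in TypeScript: instead of blocking the HTTP response until a long-running operation (such as a shard rebalance) completes, the handler records a job, kicks off the work in the background, and returns immediately. The names here (`startRebalanceJob`, the `Job` shape) are illustrative, not part of any Azure SDK:

```typescript
// A minimal sketch of the fire-and-forget pattern for time-limited functions.
// The HTTP-triggered function would call startRebalanceJob() and respond with
// 202 Accepted plus the job id; a second function can be polled for status.

type Job = { id: string; status: "running" | "done" };
const jobs = new Map<string, Job>();

function startRebalanceJob(work: () => Promise<void>): Job {
  const job: Job = { id: `job-${jobs.size + 1}`, status: "running" };
  jobs.set(job.id, job);
  // Fire and forget: the response does not wait for the work to finish,
  // so the function stays well under its maximum execution time.
  work().then(() => { job.status = "done"; });
  return job;
}
```

A status-check function can then look up `jobs.get(id)` (or, in a real deployment, durable storage shared between function invocations) to report progress.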

What is Azure Resource Manager (ARM)?

Azure Resource Manager (also known as ARM) is a deployment and management service for Azure. ARM can be used to interact with the Azure cloud to create, update, and delete resources.

ARM uses templates that provide a declarative way to define your resources, and you can apply these through Azure APIs such as REST clients, SDKs, CLI and even in the Azure portal itself.
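To make the declarative idea concrete, here is a hedged sketch of an ARM template represented as a plain TypeScript object. The overall shape (`$schema`, `contentVersion`, `parameters`, `resources`) follows the ARM template schema; the `nodeCount` parameter and the empty `resources` array are illustrative placeholders, not a complete, deployable template:

```typescript
// Illustrative ARM template fragment as a plain object. In the auto scaling
// setup, a parameter like nodeCount would be raised and the template re-applied
// to grow the cluster; the actual cluster resource definition would go in
// `resources` (omitted here).
const template = {
  $schema:
    "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
  contentVersion: "1.0.0.0",
  parameters: {
    nodeCount: { type: "int", defaultValue: 2 },
  },
  resources: [
    // ...the Azure Cosmos DB for PostgreSQL cluster resource would be declared here
  ],
};
```

Because the template is declarative, re-applying it with a larger `nodeCount` describes the desired end state and ARM computes the changes to get there.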

What is Grafana?

Grafana is an amazing open source tool that helps improve software observability with operational dashboards. It provides analytics and interactive visualizations for data from many sources, including PostgreSQL and Azure Monitor. These visualizations can exist in the form of tables, graphs and even alerts to external applications. One common usage of Grafana is to monitor application health and create alert triggers based on certain events, such as a high CPU usage spike, for example.

Grafana can be installed on-premises but can also be found in the form of a managed service in the cloud, such as the Azure Managed Grafana.
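In this architecture, a Grafana alert calls the Azure Function via a webhook. As a sketch, the function can inspect the alert payload and decide whether to scale out. The payload shape below (`status` plus an `alerts` array) is an assumption based on Grafana's unified alerting webhook format; check the format for your Grafana version before relying on it:

```typescript
// Hedged sketch: deciding whether to scale out from a Grafana alert webhook.
// The GrafanaWebhook type is an assumed, simplified shape of the payload.
type GrafanaWebhook = {
  status: "firing" | "resolved";
  alerts: { labels: Record<string, string> }[];
};

function shouldScaleOut(payload: GrafanaWebhook): boolean {
  // Only act while the alert (e.g. sustained high CPU) is actually firing.
  return payload.status === "firing" && payload.alerts.length > 0;
}
```

Keeping this decision logic pure makes it easy to unit test independently of the Azure Functions runtime.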

Step 1: Creating & populating the cluster

First, create the schema for a simple multi-tenant ad-analytics application, with companies, campaigns, ads, and clicks:

CREATE TABLE public.companies (
  id bigserial PRIMARY KEY,
  name text NOT NULL,
  image_url text,
  created_at timestamp with time zone NOT NULL,
  updated_at timestamp with time zone NOT NULL
);

CREATE TABLE public.campaigns (
  id bigserial,
  company_id bigint REFERENCES public.companies (id),
  name text NOT NULL,
  cost_model text NOT NULL,
  state text NOT NULL,
  monthly_budget bigint,
  blacklisted_site_urls text[],
  created_at timestamp with time zone NOT NULL,
  updated_at timestamp with time zone NOT NULL,
  PRIMARY KEY (company_id, id)
);

CREATE TABLE public.ads (
  id bigserial,
  company_id bigint,
  campaign_id bigint,
  name text NOT NULL,
  image_url text,
  target_url text,
  impressions_count bigint DEFAULT 0,
  clicks_count bigint DEFAULT 0,
  created_at timestamp with time zone NOT NULL,
  updated_at timestamp with time zone NOT NULL,
  PRIMARY KEY (company_id, id),
  FOREIGN KEY (company_id, campaign_id)
    REFERENCES public.campaigns (company_id, id)
);

CREATE TABLE public.clicks (
  id bigserial,
  company_id bigint,
  ad_id bigint,
  clicked_at timestamp with time zone NOT NULL,
  site_url text NOT NULL,
  cost_per_click_usd numeric(20,10),
  user_ip inet NOT NULL,
  user_data jsonb NOT NULL,
  PRIMARY KEY (company_id, id)
);

In addition, create a worker_count table that will store periodic snapshots of the cluster size (the backtick quoting in the original snippet is MySQL syntax and is invalid in PostgreSQL):

CREATE TABLE public.worker_count (
  id bigserial PRIMARY KEY,
  created_at timestamp with time zone NOT NULL,
  count bigint
);

The following pg_cron job samples the current number of worker nodes from pg_dist_node every 5 seconds, so the cluster size can be graphed and alerted on:

SELECT cron.schedule(
    'worker-count',
    '5 seconds',
    $$
        INSERT INTO worker_count (created_at, count)
        SELECT
            now(),
            (SELECT
                count(*)
            FROM
                pg_dist_node
            WHERE
                nodename ILIKE 'private-w%');
    $$
);
Next, distribute the tables across the worker nodes—companies by its id, and the remaining tables by company_id so that each tenant's data is co-located:

SELECT create_distributed_table('public.companies', 'id');
SELECT create_distributed_table('public.campaigns', 'company_id');
SELECT create_distributed_table('public.ads', 'company_id');
SELECT create_distributed_table('public.clicks', 'company_id');
The Azure Function that scales the cluster uses the ARM SDK for JavaScript/TypeScript to apply a deployment template:

import { DefaultAzureCredential } from "@azure/identity";
import { ResourceManagementClient, Deployment } from "@azure/arm-resources";

const azureSubscriptionId = process.env.AZURE_SUBSCRIPTION_ID;
const credential = new DefaultAzureCredential();
const resourceClient = new ResourceManagementClient(credential, azureSubscriptionId);

// `template` is the ARM template describing the cluster with the new node count;
// `resourceGroup` and `deploymentName` identify where the deployment is applied.
const deployment: Deployment = { properties: { mode: "Incremental", template: template } };
await resourceClient.deployments.beginCreateOrUpdate(resourceGroup, deploymentName, deployment);
After the new nodes are added, another function connects to the coordinator with the postgres.js client and starts the shard rebalancer:

import postgres from "postgres";

const pgClient = postgres(process.env.PG_CONNECTION_URL);
await pgClient`
    SELECT citus_rebalance_start(rebalance_strategy := 'by_shard_count', shard_transfer_mode := 'block_writes');
`;
In Grafana, a time-series panel can chart the worker count over the last 5 minutes with a query like this:

SELECT
    to_timestamp(trunc(extract(epoch from created_at))) as time,
    count as worker_count
FROM
    public.worker_count
WHERE
    created_at > now() - '5m'::interval
ORDER BY
    1;
To generate load, two pgbench scripts run multi-tenant queries. The first (query01.sql) aggregates ad rows for three random companies (the table alias has been made consistent—the original mixed `ad` and `ads`):

\set company_id1 random(1, 2000)
\set company_id2 random(1, 2000)
\set company_id3 random(1, 2000)
BEGIN;
    SELECT
        ads.company_id,
        ads.id,
        count(*)
    FROM
        public.ads ads
    INNER JOIN
        public.campaigns ca ON ads.campaign_id = ca.id AND ads.company_id = ca.company_id
    WHERE
        ads.company_id IN (
            :company_id1,
            :company_id2,
            :company_id3
        )
    GROUP BY
        1, 2;
END;
The second script (query02.sql) sums the monthly budgets of a single random company:

\set company_id random(1, 2000)
BEGIN;
    SELECT
        co.id as company_id,
        co.name as company_name,
        SUM(ca.monthly_budget) as monthly_budget
    FROM
        public.companies co
    INNER JOIN
        public.campaigns ca ON co.id = ca.company_id
    WHERE
        co.id = :company_id
    GROUP BY 1;
END;
Both scripts then run against the cluster with pgbench, using 4 threads and 16 client connections:

pgbench 'postgres://citus:<password>@<cluster-url>:5432/citus?sslmode=require' -j 4 -c 16 -f query01.sql -f query02.sql -T 2000 --progress=1
