
Evolution of Redash at Blinkit

source link: https://lambda.blinkit.com/evolution-of-redash-at-blinkit-fb50a64770bf


Juicing up Redash into the ultimate analytics beast


Blinkit was one of the earliest adopters of Redash, dating back to 2016 when it was still in version one. Written entirely in Python, Redash fit the requirements of a fast-growing startup like Blinkit perfectly, and boy, are we happy with our decision!

A glimpse into the action

Redash is the most common and widely used data platform at Blinkit, with over 50,000 active queries across more than 50 data sources on thousands of dashboards.

Apart from all the awesome features Redash comes with, we at Blinkit have been juicing it up with custom features of our own to make it a one-stop solution for all our analytics needs. Let’s briefly discuss some of the issues we faced and how we tackled them.

Waiting Queue

Issue

Users had zero visibility into the queue size on a data source and often confused the waiting stage with the execution stage, leading to unnecessary anxiety and confusion about their queries.

Solution

Display a waiting-queue count in the execution status bar for every query, showing the query’s current position in Redash’s execution queues.
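As an illustration, the position shown in the status bar can be derived from the order of pending jobs per data source. The sketch below is a minimal, hypothetical version of that bookkeeping; the function name and the job-tuple shape are our own, not Redash internals:

```python
from collections import defaultdict

def queue_positions(pending_jobs):
    """Given pending (not yet executing) jobs as (data_source_id, query_id)
    tuples in enqueue order, return {query_id: position}, where position is
    the 1-based place of the query in its data source's waiting queue."""
    per_source_count = defaultdict(int)
    positions = {}
    for data_source_id, query_id in pending_jobs:
        per_source_count[data_source_id] += 1
        positions[query_id] = per_source_count[data_source_id]
    return positions

# Two data sources with interleaved pending queries.
pending = [(1, "q1"), (2, "q2"), (1, "q3"), (1, "q4")]
print(queue_positions(pending))  # {'q1': 1, 'q2': 1, 'q3': 2, 'q4': 3}
```

Each data source gets its own counter, so a query's position only reflects contention on its own queue.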

Impact

This really helped reduce user anxiety, as users now knew what to expect given the queue size; it also helped us with alerting and significantly reduced the load on our data sources.

One-click Deployment

Issue

Deployment used to be a pain point while developing any new feature for Redash: rolling out a feature meant nearly an hour of downtime for the organization. Separating Celery workers for different data sources only added to the complexity. It also did not help that we ran two separate Redash instances (one for ad-hoc queries and the other for all reporting), and we required downtime on both for a successful deployment.

Solution

Although Redash provided Docker files for deployment, we went a step further and wrote Helm charts. This let us control everything through a single YAML file, and spinning up separate instances became a piece of cake. Thanks to the simplicity of deployment, we later launched a third instance (yeah, we are quite obsessed with Redash 😜) for the internal data engineering team, to separate it from regular users.

Impact

  • Deployment downtime reduced from 1 hour to less than 1 second (the time it takes for a Kubernetes pod to rotate)
  • Ability to manage more instances with no additional effort

Nightly-skip, Reduced-weekend-runs tags

Issue

Scheduled queries kept executing even when not required, such as during nights and weekends, unnecessarily choking the databases.

Solution

Use the existing Redash tags feature: skip query execution during the night for queries tagged nightly-skip, and reduce the refresh frequency on weekends for queries tagged reduced-weekend-runs.
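A scheduler honouring these tags only needs a small gating check before enqueueing a refresh. The sketch below is illustrative: the `should_run` helper, the definition of "night" (midnight to 6 a.m.), and the weekend behaviour (run only at the top of even hours) are all assumptions, not Redash's actual scheduler code:

```python
from datetime import datetime

NIGHT_HOURS = range(0, 6)  # assumption: "night" means midnight to 6 a.m.

def should_run(tags, now):
    """Decide whether a scheduled query should refresh at `now`,
    honouring the nightly-skip and reduced-weekend-runs tags."""
    if "nightly-skip" in tags and now.hour in NIGHT_HOURS:
        return False
    if "reduced-weekend-runs" in tags and now.weekday() >= 5:  # Sat/Sun
        # Assumption: on weekends, only refresh at the top of even hours.
        return now.hour % 2 == 0 and now.minute == 0
    return True

# 3 a.m. on a Monday: a nightly-skip query is held back.
print(should_run({"nightly-skip"}, datetime(2023, 1, 2, 3, 0)))  # False
```

The gate runs before the query is enqueued, so skipped refreshes never touch the data source at all.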

Example of Redash query with tags

Impact

More than 100 queries that weren’t viewed at night stopped executing, resulting in significant resource savings.

Full scan query abort

Issue

People tend to write unoptimized queries against OLTP databases like Postgres, Aurora, and MySQL, confusing them with OLAP databases like Redshift. Unoptimized queries led to unnecessary cost increases and databases going down under increased load. Since most queries were being run from Redash, we needed to restrict such queries there.

Solution

Even after putting query time limits on the databases, we observed that the database bloating our infrastructure cost the most was AWS Aurora, which is optimized for query performance and charges per IOPS consumed.

We needed to make sure people used proper filters rather than querying whole tables. To enforce this, we modified the query executor in Redash to parse each query before execution and throw an error if any large table in the query was being fully scanned.
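One way to implement such a check, sketched below, is to run the query through Postgres's `EXPLAIN (FORMAT JSON)` and walk the resulting plan tree looking for sequential scans on known-large tables. The `full_scans` helper and the `LARGE_TABLES` list are hypothetical names for illustration, not the code we actually ship:

```python
# Assumption: a maintained list of tables too big to scan in full.
LARGE_TABLES = {"order_items", "events"}

def full_scans(plan_node, large_tables=LARGE_TABLES):
    """Walk a Postgres EXPLAIN (FORMAT JSON) plan tree and collect
    large tables that would be hit by a sequential (full) scan."""
    hits = set()
    if plan_node.get("Node Type") == "Seq Scan":
        table = plan_node.get("Relation Name")
        if table in large_tables:
            hits.add(table)
    for child in plan_node.get("Plans", []):
        hits |= full_scans(child, large_tables)
    return hits

# A plan as Postgres would return it for an unfiltered aggregate.
plan = {"Node Type": "Aggregate",
        "Plans": [{"Node Type": "Seq Scan", "Relation Name": "order_items"}]}
print(full_scans(plan))  # {'order_items'} -> abort with an error to the user
```

Because `EXPLAIN` only plans the query without executing it, the check is cheap relative to the scan it prevents.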

An error pops up when executing an unoptimized query

Impact

Aurora cost decreased by US$100/day, which was a significant portion of the data infrastructure cost at Blinkit.

Catalog integration

Issue

With the rise of data catalogs in data platforms, Blinkit also felt the need for one, keeping in mind the number of people consuming data. After careful consideration, we decided to go with DataHub by LinkedIn (read more about it here).

To drive its adoption, and with Redash being the primary query tool at Blinkit, we needed users to clearly see which tables had a catalog entry available and to be able to find those tables inside the catalog with ease.

Solution

Using the data catalog’s own APIs, we integrated it with Redash so that any new table added to the catalog automatically gets a link alongside its name in the schema browser in Redash.
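Conceptually, the integration boils down to merging the catalog's table list into the schema response Redash serves to the browser. The sketch below is a simplified, hypothetical version; the function name and the catalog URL shape are assumptions, not DataHub's or Redash's real interfaces:

```python
def annotate_schema(schema, catalog_tables, catalog_base_url):
    """Attach a catalog link to every schema entry whose table name
    exists in the data catalog (URL shape is an assumption)."""
    annotated = []
    for entry in schema:
        entry = dict(entry)  # don't mutate the original schema response
        if entry["name"] in catalog_tables:
            entry["catalog_url"] = f"{catalog_base_url}/dataset/{entry['name']}"
        annotated.append(entry)
    return annotated

schema = [{"name": "fact_orders"}, {"name": "tmp_scratch"}]
out = annotate_schema(schema, {"fact_orders"}, "https://datahub.example.com")
print(out[0]["catalog_url"])  # https://datahub.example.com/dataset/fact_orders
```

The frontend then only has to render `catalog_url` as a link icon next to the table name when the field is present.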

A link to a catalog for certain facts and dimensions tables

Impact

  • More confidence while writing queries
  • Fewer questions for the platform team

Prometheus monitoring + Legend

Issue

We required better monitoring and alerting on top of Redash so we could react promptly to large queues and to anything going down.

Solution

Prometheus is the primary monitoring and alerting solution at Blinkit, so integrating Redash with it was a no-brainer. Luckily, our infrastructure team was already working on a tool to make it easier to integrate different types of servers with Prometheus (now open-sourced as the aptly named github.com/grofers/legend; make sure to check it out and give it a star 😃). It fit our use case perfectly, and we were able to set up alerting and monitoring over our Redash deployment successfully.
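For a sense of what such monitoring exposes, per-data-source queue sizes can be served to Prometheus in its plain-text exposition format via a /metrics endpoint. The snippet below is a minimal illustration using an assumed `redash_queue_size` metric name; it is not Legend's or Redash's actual output:

```python
def render_metrics(queue_sizes):
    """Render per-data-source queue sizes in the Prometheus text
    exposition format (HELP/TYPE header followed by labelled samples)."""
    lines = [
        "# HELP redash_queue_size Number of queries waiting per data source",
        "# TYPE redash_queue_size gauge",
    ]
    for source, size in sorted(queue_sizes.items()):
        lines.append(f'redash_queue_size{{data_source="{source}"}} {size}')
    return "\n".join(lines) + "\n"

print(render_metrics({"redshift": 12, "aurora": 3}))
```

Once scraped, an alert rule on this gauge (e.g. queue size above a threshold for several minutes) covers the "large queues" case described above.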

Impact

Better alerting helped in reducing downtimes and increasing the overall resiliency of the system.

Destinations

Issue

There was no safe and scalable solution to export Redash query results to commonly used destinations like Google Sheets, Slack, etc. More details here.

Solution

Extend the currently available alert destinations to allow syncing query results to destinations like Google Sheets, Slack, etc.
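Under the hood, a destination sync mostly needs to reshape a Redash query result payload (a list of column descriptors plus a list of row dicts) into the tabular form spreadsheet APIs expect. A minimal sketch, with the helper name being our own:

```python
def result_to_rows(query_result):
    """Flatten a Redash-style query result ({"columns": [...], "rows": [...]})
    into a header row plus data rows -- the 2-D values shape that
    spreadsheet-style APIs (e.g. Google Sheets values.update) accept."""
    columns = [col["name"] for col in query_result["columns"]]
    rows = [[row.get(name) for name in columns] for row in query_result["rows"]]
    return [columns] + rows

result = {"columns": [{"name": "city"}, {"name": "orders"}],
          "rows": [{"city": "Delhi", "orders": 42}]}
print(result_to_rows(result))  # [['city', 'orders'], ['Delhi', 42]]
```

The same flattening feeds a Slack destination equally well, with the rows rendered as a text table instead of a sheet update.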

A list of destinations for Redash

Impact

  • We now have over 10,000 daily destination syncs running on Redash, powering reports that have become reliable.
  • The earlier approach to exporting to Google Sheets required IMPORTDATA() and API keys that users could still access even after leaving the organization. With destinations, we now have control over this and better security overall.
A dashboard made with Redash

Upcoming/Planned features

  • Limiting destinations and scheduled queries at the user and group level.
  • Better integration with the catalog, i.e., adding more information regarding lineage, queries, and stats from the catalog.
  • Adding more destinations like Slack, Mail, etc. from Redash.
