Slack Leverages Bespoke Tracing Architecture for Message Notifications

source link: https://www.infoq.com/news/2023/06/slack-notification-tracing/?itm_source=infoq&itm_medium=popular_widget&itm_campaign=popular_content_list&itm_content=

Jun 28, 2023

Slack used its bespoke tracing architecture to help investigate notification-delivery issues. Tracing helped resolve notification issues 30% faster and reduced escalations to the development team. It also simplified the analytics pipeline and unlocked new use cases for the data science team.

Message notifications are a key element of Slack’s user experience. However, because the notification flow spans many components of Slack’s platform, both server-side and client-side, issues reported to the customer experience team can be tricky to investigate. Development teams often had to spend many days looking through multiple systems with different logging backends and formats.

[Image: notification tracing. Source: https://slack.engineering/tracing-notifications/]

Slack had previously built a bespoke tracing architecture, SlackTrace, and uses it to trace regular message delivery, where one percent of client requests are traced. The company built its own tracing solution after concluding that none of the available third-party solutions fully met its needs.

For tracing message notifications, the team mapped the flow to a trace by identifying notable events and determining attribute mappings. They decided to separate notification traces from the message request traces. This way, they could support 100 percent sampling for notification flows, which Slack’s customer experience team requested.
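The split between the two sampling rates can be pictured with a small sketch. The names `TraceKind`, `ShouldSample` and the choice of an FNV hash over the trace ID are assumptions made for illustration; the article does not describe how Slack implements its sampling decision.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// TraceKind is a hypothetical label for the two trace categories
// mentioned in the article.
type TraceKind int

const (
	MessageRequest TraceKind = iota
	Notification
)

// sampleRatePercent returns the rates quoted in the article:
// 1 percent for message requests, 100 percent for notifications.
func sampleRatePercent(kind TraceKind) uint32 {
	if kind == Notification {
		return 100
	}
	return 1
}

// ShouldSample hashes the trace ID into one of 100 buckets, so every
// component that sees the same ID reaches the same decision.
// (Assumed approach, not taken from the source.)
func ShouldSample(kind TraceKind, traceID string) bool {
	h := fnv.New32a()
	h.Write([]byte(traceID))
	return h.Sum32()%100 < sampleRatePercent(kind)
}

func main() {
	sampled := 0
	for i := 0; i < 10000; i++ {
		if ShouldSample(MessageRequest, fmt.Sprintf("trace-%d", i)) {
			sampled++
		}
	}
	fmt.Printf("message requests sampled: %d of 10000\n", sampled)
	fmt.Println("notification sampled:", ShouldSample(Notification, "trace-42"))
}
```

Keeping notification traces separate means the higher rate applies only to that flow, while message request tracing stays at its existing one percent.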

Notification tracing has improved issue triage and debugging. Customer experience team members can use trace data themselves to understand what went wrong and answer a customer’s query without involving the development team. The new functionality also helped iOS and Android engineers to start using Grafana to monitor notification delivery in mobile applications. Lastly, the data science team has derived insights from the tracing data. They computed funnel analytics to understand notification open rates better and identified bugs in the application and the instrumentation code using historical notification traces.

Suman Karumuri, a senior staff software engineer at Slack, summarizes the benefits of tracing:

"Modeling product analytics data as traces provides high-quality data in a consistent data format across all of our complex stack. Further, the built-in sessionization of trace data simplified our analytics pipeline by eliminating additional jobs to de-dupe and sessionize the trace data."

The SlackTrace architecture consists of a Go webserver application that publishes trace span events to Apache Kafka and a Go consumer service that persists those events into the real-time store (Elasticsearch) and the data warehouse. Backend services use Zipkin and Jaeger instrumentation libraries to report spans, which are then converted into the internal span representation, while desktop and mobile apps use the span API directly.
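As a rough illustration of the conversion step, the sketch below decodes a span in the public Zipkin v2 JSON format and maps it onto an internal struct. `InternalSpan` and `FromZipkin` are hypothetical names, and the internal field layout is an assumption; the article does not show Slack's actual span format.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ZipkinSpan holds a subset of the fields of a Zipkin v2 JSON span.
type ZipkinSpan struct {
	TraceID   string            `json:"traceId"`
	ID        string            `json:"id"`
	ParentID  string            `json:"parentId"`
	Name      string            `json:"name"`
	Timestamp int64             `json:"timestamp"` // epoch microseconds
	Duration  int64             `json:"duration"`  // microseconds
	Tags      map[string]string `json:"tags"`
}

// InternalSpan is a hypothetical internal representation, assumed here
// to be a flat record with a free-form tag map.
type InternalSpan struct {
	TraceID        string
	ID             string
	ParentID       string
	Name           string
	StartMicros    int64
	DurationMicros int64
	Tags           map[string]string
}

// FromZipkin decodes one Zipkin v2 span and converts it field by field.
func FromZipkin(payload []byte) (InternalSpan, error) {
	var z ZipkinSpan
	if err := json.Unmarshal(payload, &z); err != nil {
		return InternalSpan{}, err
	}
	return InternalSpan{
		TraceID:        z.TraceID,
		ID:             z.ID,
		ParentID:       z.ParentID,
		Name:           z.Name,
		StartMicros:    z.Timestamp,
		DurationMicros: z.Duration,
		Tags:           z.Tags,
	}, nil
}

func main() {
	// Sample payload with made-up IDs and tag values.
	payload := []byte(`{"traceId":"5af7183fb1d4cf5f","id":"352bff9a74ca9ad2",
		"name":"send notification","timestamp":1687908205243000,
		"duration":1500,"tags":{"platform":"ios"}}`)
	span, err := FromZipkin(payload)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", span)
}
```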

[Image: Slack tracing architecture. Source: https://slack.engineering/tracing-at-slack-thinking-in-causal-graphs/]

Slack has opted for a simple representation of trace spans, which makes the solution more flexible and less centered on request and network tracing. Because the span structure is simple, the data can be stored in a single table, which in turn supports a wide range of queries that engineers can use to extract the data they need to answer specific questions.
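To make the single-table idea concrete, here is a minimal sketch in which every span is one flat row and a question such as "how far did this notification get?" becomes a filter and a sort. The `Span` fields, the helper functions and the event names are all hypothetical and are not taken from Slack's schema.

```go
package main

import (
	"fmt"
	"sort"
)

// Span is a hypothetical flat span record; one slice of these stands in
// for the single table described above.
type Span struct {
	TraceID        string
	ID             string
	ParentID       string
	Name           string
	StartMicros    int64
	DurationMicros int64
	Tags           map[string]string
}

// TraceTimeline returns the spans of one trace ordered by start time.
func TraceTimeline(table []Span, traceID string) []Span {
	var out []Span
	for _, s := range table {
		if s.TraceID == traceID {
			out = append(out, s)
		}
	}
	sort.Slice(out, func(i, j int) bool {
		return out[i].StartMicros < out[j].StartMicros
	})
	return out
}

// LastStage reports the name of the latest span in a trace, or "" when
// the trace has no spans. It hints at where a flow stopped.
func LastStage(table []Span, traceID string) string {
	timeline := TraceTimeline(table, traceID)
	if len(timeline) == 0 {
		return ""
	}
	return timeline[len(timeline)-1].Name
}

func main() {
	// Made-up rows; the event names are illustrative only.
	table := []Span{
		{TraceID: "n1", ID: "b", ParentID: "a", Name: "notification:received", StartMicros: 2000},
		{TraceID: "n1", ID: "a", Name: "notification:sent", StartMicros: 1000},
		{TraceID: "n2", ID: "c", Name: "notification:sent", StartMicros: 1500},
	}
	for _, s := range TraceTimeline(table, "n1") {
		fmt.Println(s.StartMicros, s.Name)
	}
	fmt.Println("n2 stopped at:", LastStage(table, "n2"))
}
```

Because each row carries its own trace ID, grouping by that ID gives the sessionization mentioned in the quote above without a separate processing job.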

About the Author

Rafal Gancarz

Rafal is an experienced technology leader and expert. He's currently helping Starbucks make its Commerce Platform scalable, resilient and cost-effective. Previously, Rafal has been involved in designing and building large-scale, distributed and cloud-based systems for Cisco, Accenture, Capita, ICE, Callsign and others. His interests span architecture & design, continuous delivery, observability and operability, as well as sociotechnical and organisational aspects of software delivery.
