Enabling Node Apps To Do More With Less

By Janani Parathasarathy and Thangamani Jayaseelan

Photo by Federico Beccari on Unsplash

Introduction:

At PayPal, we run thousands of different services on different stacks across data centers. Following the pandemic, consumers in emerging economies made the greatest shift to online shopping we have ever experienced, even surpassing holiday-season peaks. Peak traffic is now ever-increasing and unpredictable, so efficient right-sizing of apps in production is a must!

As our tech stack evolved organically, we saw huge benefits in infrastructure scalability. That's when some critical questions came up:

Can we do more with less?

Can our applications scale to handle additional traffic at current capacity without any performance degradation?

In other words, can our applications efficiently handle increased traffic:

  • without adding capacity (CPU and memory) to the current fleet
  • without hurting application performance (latency and errors)

These questions were the foundation for our Scalability Canary exercise, in which applications are subjected to traffic increases of 2x, 3x, and more for a selected duration to identify how well each app scales. The Scalability Canary was conducted on Java and Node apps for effective right-sizing.

Scalability Issues in NodeJS:

When more traffic was routed to node applications to validate scalability, we noticed an increase in application response time (95th percentile), even though most node applications had enough CPU available to scale.


There are many reasons why an application may fail to scale, and they differ from app to app: memory leaks, connection management, CPU bottlenecks, and so on. We continued our analysis to find the root cause.

Memory Leaks

After monitoring the memory patterns of the top 10 applications, we noticed frequent restarts, and most were caused by memory leaks.

When a node process consumes more than 85% of its allocated memory, PM2 (the Node process manager) restarts the process to reclaim memory. Below is a TSDB (Time Series Database) report of a sample Node app that restarted frequently due to OutOfMemory (OOM) errors.

Before Fix: The number of restarts triggered by PM2 in a 12-hour window when memory usage exceeded 85% of the allocation


After Fix: No OOM restarts observed

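The restart behavior described above is what PM2 calls a memory-based restart, usually configured through the max_memory_restart option in the process file. Below is a minimal, illustrative sketch; the app name and the assumed 1 GB allocation are hypothetical, and the exact threshold wiring at PayPal may differ.

    // ecosystem.config.js -- illustrative only; names and sizes are assumptions
    module.exports = {
      apps: [
        {
          name: 'sample-node-app',   // hypothetical app name
          script: './server.js',
          instances: 'max',          // one worker per available CPU core
          exec_mode: 'cluster',
          // PM2 restarts the process once it crosses this memory limit,
          // roughly 85% of an assumed 1 GB allocation.
          max_memory_restart: '870M'
        }
      ]
    };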

We started engaging with each product development team to analyze and fix the memory leaks in production. On-Demand Profiling (our in-house profiling solution for applications across various stacks) was used to deep-dive into the performance bottlenecks in production.
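
Outside of in-house tooling, Node's built-in v8 module offers a simple way to capture heap snapshots from a live process for leak analysis. Here is a generic sketch; the SIGUSR2 trigger is just an illustrative choice and this is not how On-Demand Profiling works.

    // Generic sketch (not PayPal's On-Demand Profiling): dump a heap snapshot
    // from a running Node process for offline leak analysis.
    const v8 = require('v8');

    // Trigger a snapshot externally with: kill -USR2 <pid>
    process.on('SIGUSR2', () => {
      const file = v8.writeHeapSnapshot(); // writes a .heapsnapshot file to the working directory
      console.log(`Heap snapshot written to ${file}`);
    });

Comparing two snapshots taken a few minutes apart in Chrome DevTools highlights object types whose retained size keeps growing, which is the usual signature of a leak.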

Now that memory leaks were ruled out, our next important questions came up…

Did node applications scale after fixing memory leaks?
and
Are memory leaks the only reason for scalability issues with Node apps?

Well, memory leaks were one of the reasons, but not the ONLY reason. Many other applications had no memory leaks at all but still could not scale.

If not memory leaks, what else?

We dug deeper and continued monitoring all node apps. We noticed that, unlike Java apps, most node applications had a CALL to CONNECT ratio of 1:1. What does the CALL to CONNECT ratio mean?

  • CALL is an event for a downstream/outbound API request.
  • CONNECT is an event for establishing a connection to an outbound/downstream service.

This pushed us to focus on connection management with Node apps.

Connection Management: Persistent Connections Not Enabled by Default


Without persistent connections, a new connection is established for every client-to-server request.

Every service-to-service communication happens over a connection. A connection can be used for a single request, or it can be long-lived and process multiple requests. The CALL:CONNECT ratio measures how often a CONNECT happens for outbound calls; the intent is to keep the number of CONNECTs very low relative to CALLs. Under high incoming traffic, applications initiate a high volume of outgoing HTTP requests, and those connections need to be reused. Reusing a connection avoids the overhead of the DNS (Domain Name System) lookup, connection setup time, and SSL (Secure Sockets Layer) handshake for every request.

The HTTP/1.0 protocol does not support persistent connections by default. To enable them, the client must send a Connection header with the value keep-alive to indicate to the server that the connection should remain open for subsequent requests.
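
In Node.js, the usual way to opt into persistent connections for outbound calls is a keep-alive http/https Agent, which handles the Connection header and socket reuse. Here is a minimal generic sketch; the downstream URL and pool sizes are made up, and this is not the PayPal framework's actual configuration.

    // Generic sketch: reuse TCP/TLS connections for outbound calls.
    const https = require('https');

    const keepAliveAgent = new https.Agent({
      keepAlive: true,     // keep sockets open and reuse them across requests
      maxSockets: 50,      // illustrative cap on concurrent sockets per host
      maxFreeSockets: 10   // illustrative number of idle sockets kept in the pool
    });

    // Pass the agent per request (or set it as the HTTP client's default).
    https.get('https://downstream.example.com/api/resource', { agent: keepAliveAgent }, (res) => {
      res.resume(); // drain the response so the socket can return to the pool
    });

With keepAlive set, the agent keeps completed sockets in a free pool and hands them to the next request for the same host, so the DNS lookup, TCP connect, and TLS handshake are paid only once per socket.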

We noticed that most of our node applications did not have persistent connections enabled, causing them to create a new CONNECT for every outbound call. This was indeed an overhead on application CPU and response time.

PayPal’s Performance Engineering team did an extensive analysis of the top node applications with persistent connections enabled (the keep-alive configuration was manually set to true in config.json).

The results proved that applications with persistent connections performed better: we saw a 20% benefit on CPU and a 3% to 28% benefit on response time (P95).


Here is the data showing P95 response times at 2x traffic, with and without keep-alive.


The higher the number of downstream calls, the higher the latency benefits with persistent connections.

Control of the keep-alive configuration was left to application owners. Because keep-alive was not configured at the framework layer, applications did not use persistent connections by default.

Taking the Solution Further: Enabling Persistent Connections at the Framework Layer

We worked with the infrastructure team to enable persistent connections by default for all node applications, while the Web Platform team rolled out the Node scalability change.

The change has been rolled out as a framework service module update, which applications are adopting. The Performance Engineering team has been monitoring CPU and latency gains for the top node apps by comparing Monday peak-time traffic before and after the release.
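
As an illustration of what a framework-level default can look like (a generic sketch, not the actual framework service module), the global agents can be swapped for keep-alive agents once at bootstrap so every outbound call that does not specify its own agent reuses connections:

    // Generic bootstrap sketch, not the actual PayPal framework module:
    // make keep-alive the default for all outbound http/https calls that
    // do not pass an explicit agent.
    const http = require('http');
    const https = require('https');

    http.globalAgent = new http.Agent({ keepAlive: true });
    https.globalAgent = new https.Agent({ keepAlive: true });

Putting the default at the framework layer gives applications connection reuse without touching their own config, while any app that truly needs per-request connections can still pass its own agent.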

Observations: up to ~18% CPU gains and up to ~14% latency gains (P95) across the top node apps.

Note:

  • The higher the number of downstream calls, the higher the latency benefits with persistent connections.
  • Apps that already had persistent connections enabled and apps with non-comparable traffic patterns were excluded from the comparison.

Next Steps

The Web Platform team mandated the scalability changes for NodeJS apps starting June 2021, and since then more apps have been adopting them. Given the CPU and latency benefits for highly concurrent node apps, we want to accelerate adoption so the changes are live for 100% of node apps.

A big shout out to the Web Platform team for partnering with us to make this happen. It was a collaborative effort all along!

