
Caveats storing large amounts of data in Elixir Agents

Source: https://www.kabisa.nl/tech/caveats-storing-large-amounts-of-data-in-elixir-agents/

Posted by Pascal Widdershoven on 25-4-2019

Recently while working on an Elixir project I ran into an interesting gotcha with Agents that caused massive amounts of resource usage. Read on to find out what happened.

What are Agents in Elixir?

Agents are a simple abstraction around state.

Often in Elixir there is a need to share or store state that must be accessed from different processes or by the same process at different points in time.

The Agent module provides a basic server implementation that allows state to be retrieved and updated via a simple API.

Elixir is an immutable language where nothing is shared by default. This has many benefits, but it also means that when you do want to share data between processes you need to do some extra work. Fortunately, Elixir provides a lot of great building blocks to achieve this like Agents, ETS and Mnesia.
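
As a quick refresher, basic usage looks like this (a minimal sketch of starting an Agent and reading and updating its state):

# Start an Agent holding an integer as its state.
{:ok, counter} = Agent.start_link(fn -> 0 end)

# Read the state; the given function runs inside the Agent process.
Agent.get(counter, fn count -> count end)
#=> 0

# Update the state.
Agent.update(counter, fn count -> count + 1 end)
Agent.get(counter, & &1)
#=> 1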

So what’s the problem?

There are two ways to use the state stored in an Agent:

  1. By operating on the data from within the Agent's process:

    # Compute in the agent/server
    def get_something(agent) do
      Agent.get(agent, fn state -> do_something_expensive(state) end)
    end
    
  2. By pulling the data into the client process and operating on it there:

    # Compute in the client
    def get_something(agent) do
      Agent.get(agent, & &1) |> do_something_expensive()
    end
    

If you look at the code, the difference is very subtle. The difference in behaviour, however, is not.

In approach #1 the data will remain in the Agent process. However, if you perform expensive operations there, the Agent will be blocked for the entire duration of the operation, meaning no other process can access the data until the operation is finished. Using this model to respond to HTTP requests kills performance.

In approach #2 the Agent will not be blocked, but the data will be copied into the process that is accessing it. When the amount of data is small this is not really a problem, but once you start storing larger amounts of data it becomes very expensive very quickly.

Real life example

The impact of this can be huge as I will demonstrate in the case below.

In the project I’m working on, we were storing a set of rules in an Agent. A rule is a struct with 27 fields, and we were storing approximately 5,000 rules in the Agent. There’s an HTTP endpoint that uses these rules to determine the response to every request.
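
To make this concrete, here is a minimal sketch of what such an Agent-backed store could look like. The State module name matches the simplified code later in this post; the Rule fields and the load_rules/0 helper are hypothetical.

defmodule Rule do
  # Hypothetical struct; the real one has 27 fields.
  defstruct [:id, :pattern, :response]
end

defmodule State do
  use Agent

  # Start the Agent holding the full list of ~5,000 rule structs.
  def start_link(_opts) do
    Agent.start_link(fn -> load_rules() end, name: __MODULE__)
  end

  # Copies the entire rule list into the calling process.
  def get do
    Agent.get(__MODULE__, & &1)
  end

  # Hypothetical: load the rules from a database, a file, etc.
  defp load_rules, do: []
end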

For a while this was fine, but when the load started increasing we noticed the server running out of memory. To debug this, I started throwing load at the endpoint using wrk. Results below:

Running 1m test @ http://localhost:4001
  4 threads and 1000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.23s   553.43ms   3.00s    57.78%
    Req/Sec    59.72    121.98   680.00     88.64%
  Latency Distribution
     50%    2.24s
     75%    2.67s
     90%    2.88s
     99%    3.00s
  5551 requests in 1.00m, 11.80MB read
  Socket errors: connect 0, read 5971, write 3, timeout 5506
  Non-2xx or 3xx responses: 4575
Requests/sec:     92.38
Transfer/sec:    201.05KB

As you can see, only 92 requests per second are handled and a lot of requests time out (take more than 3 seconds). During the test, the Elixir process consumed around 10GB of memory.

Solutions

As we’ve seen in the previous section, storing this amount of data in an Agent requires a lot of memory, and performance is frankly not great.

Looking at the code and reading the Agent documentation, I quickly realised that the root cause of this issue was the fact that all rules were copied to the process handling the HTTP request, for every request, roughly like the hypothetical handler below.
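
This is a simplified, hypothetical sketch of such a handler; the match_rules/2 and encode_matches/1 helpers are made up for illustration, while State.get/0 refers to the Agent wrapper used elsewhere in this post and Plug.Conn.send_resp/3 is Plug's own API.

# Every request copies all ~5,000 rules onto this process's heap.
def handle_request(conn, params) do
  rules = State.get()
  matches = match_rules(rules, params)
  Plug.Conn.send_resp(conn, 200, encode_matches(matches))
end

So how can we prevent this?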

‘Shared nothing’ is a core principle of Elixir/Erlang, so the short answer is that you can’t prevent the data from being copied if you want to share it between processes. This applies to all ways of storing data in memory, not just Agents.

There are workarounds, like FastGlobal. FastGlobal works by dynamically compiling a module at runtime so the data can be read without being copied onto the reading process’s heap, but it’s not without drawbacks.
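
For reference, using it looks roughly like this (a sketch based on FastGlobal’s put/get API; :rules is an arbitrary key chosen for this example):

# Storing the data recompiles a module, so writes are slow and should be rare.
FastGlobal.put(:rules, State.get())

# Reads are fast and do not copy the data onto the calling process's heap.
rules = FastGlobal.get(:rules)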

So the solution is to make sure the data does not have to be shared between processes. There are a variety of ways to do this. The approach I took was to create a pool of worker processes (with Poolboy) that handle executing the rules. When an HTTP request comes in, the rule matching is handled by one of the worker processes.

In code this looks roughly like this (simplified):

defmodule Worker do
  use GenServer
  
  def start_link(_) do
    GenServer.start_link(__MODULE__, nil, [])
  end

  def init(_) do
    rules = State.get()
    {:ok, rules}
  end

  def handle_call({:match_rules, input}, _from, rules) do
    matches = match_rules(rules, input)
    {:reply, matches, rules}
  end
end

When a worker starts, it loads (copies) the rules from the Agent (State is a module wrapping the Agent) into the worker process. Each worker process holds its own copy of the rules, so the memory usage is predictable.

If the rules change at runtime, the processes are simply killed and restarted so the new rules will be used automatically. Poolboy takes care of starting N workers and selecting a worker from the pool.
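
For completeness, wiring this up could look roughly like the sketch below. The pool name :rule_workers, the pool size, and the RuleMatcher module are assumptions for illustration; :poolboy.child_spec/2 and :poolboy.transaction/2 are Poolboy’s own API.

defmodule MyApp.Application do
  use Application

  # Start the Agent and a fixed-size pool of Worker processes.
  def start(_type, _args) do
    poolboy_config = [
      name: {:local, :rule_workers},
      worker_module: Worker,
      size: 10,
      max_overflow: 2
    ]

    children = [
      State,
      :poolboy.child_spec(:rule_workers, poolboy_config)
    ]

    Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)
  end
end

defmodule RuleMatcher do
  # Borrow a worker from the pool. The rules stay inside the worker process;
  # only the (small) input and the resulting matches cross process boundaries.
  def match_rules(input) do
    :poolboy.transaction(:rule_workers, fn pid ->
      GenServer.call(pid, {:match_rules, input})
    end)
  end
end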

End result

With that in place, wrk results started looking as follows:

Running 1m test @ http://localhost:4001
  4 threads and 1000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.04s   270.71ms   1.66s    69.89%
    Req/Sec   221.14     65.34   405.00     66.08%
  Latency Distribution
     50%    1.07s
     75%    1.25s
     90%    1.36s
     99%    1.47s
  52823 requests in 1.00m, 14.41MB read
  Socket errors: connect 0, read 1014, write 0, timeout 0
Requests/sec:    879.16
Transfer/sec:    245.55KB

As you can see, the throughput increased from 92 req/sec to 879 req/sec. Average latency went down from 2.23s to 1.04s. Memory usage went down from 10GB to 400MB.

Not bad!

Pascal Widdershoven

Full Stack Developer • Github: pascalw • Twitter: @_pascalw

