2

The surprising properties of DynamoDB pagination

 2 years ago
source link: https://advancedweb.hu/the-surprising-properties-of-dynamodb-pagination/
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
neoserver,ios ssh client

The surprising properties of DynamoDB pagination

When limits and filters mess up the results

a0b6f3-407df02723675900b32816a8a8eda7b6aa5626cd0cd2de1505e9f5ef7edf3843.jpg
Tamás Sallai
6 mins

DynamoDB uses token-based pagination. This means when a query or a scan operation potentially return more results then the result contains a LastEvaluatedKey field with some value. To fetch the next page, pass that value as the ExclusiveStartKey in a separate query. Do this until the LastEvaluatedKey is undefined, and you can be sure you fetched all items.

How it works is logical and simple. But when you start using other DynamoDB features, namely the Limit and the FilterExpression for queries and scans, things become more interesting.

In this article, we’ll look into a specific case where at first the results do not make much sense. Then we’ll look into how these features work and see why this is a feature of DynamoDB and not a bug.

The base query

Recommended book
d5ff54-6dab65823ed59e0cc72e6279dfd6a857982b32e1f9b6fd8bd30f32e2c72bf193.jpg
How to design, implement, and deploy GraphQL-based APIs on the AWS cloud
Book, 2022

Let’s run this DynamoDB query and inspect the results:

import { DynamoDBClient, QueryCommand } from "@aws-sdk/client-dynamodb";

const client = new DynamoDBClient();
const command = new QueryCommand({
	TableName: process.env.TABLE,
	IndexName: "groupId",
	KeyConditionExpression: "#groupId = :groupId",
	ExpressionAttributeNames: {"#groupId": "group_id", "#status": "status"},
	ExpressionAttributeValues: {
		":groupId": {S: "group1"},
		":status": {S: "ACTIVE"},
	},
	ScanIndexForward: false,
	FilterExpression: "#status = :status",
});
console.log(await client.send(command));

This returns a single item with some additional metadata:

{
  '$metadata': {
    "..."
  },
  Items: [
    {
      "..."
    }
  ],
  LastEvaluatedKey: undefined,
}

Resultsgroup_id (PK)last_active (SK)statusidgroup12022-01-01ACTIVEuser3

Limits

Let’s see what happens if we add different Limit arguments to the query!

Limit: 1

const command = new QueryCommand({
	TableName: process.env.TABLE,
	IndexName: "groupId",
	KeyConditionExpression: "#groupId = :groupId",
	ExpressionAttributeNames: {"#groupId": "group_id", "#status": "status"},
	ExpressionAttributeValues: {
		":groupId": {S: "group1"},
		":status": {S: "ACTIVE"},
	},
	ScanIndexForward: false,
	FilterExpression: "#status = :status",
	// Add a limit
	Limit: 1,
});

The result now has zero items!

{
	"Count": 0,
	"Items": [],
	"LastEvaluatedKey": {
		"..."
	},
	"ScannedCount": 1
}

Even when fetching the next page, there are no items, but a different LastEvaluatedKey. Only the third page returns an item.

Results1group_id (PK)last_active (SK)statusid...LastEvaluatedKey1Results2group_id (PK)last_active (SK)statusid...LastEvaluatedKey2ExclusiveStartKey:LastEvaluatedKey1Results3group_id (PK)last_active (SK)statusidgroup12022-01-01ACTIVEuser3...LastEvaluatedKey3ExclusiveStartKey:LastEvaluatedKey2Results4group_id (PK)last_active (SK)statusidExclusiveStartKey:LastEvaluatedKey3

Stranger still, the third query returns a pagination token, but the fourth query returns zero results.

Limit: 2

const command = new QueryCommand({
// ...
	Limit: 2,
});

No items, but a LastEvaluatedKey:

{
	"Count": 0,
	"Items": [],
	"LastEvaluatedKey": {
		"..."
	},
	"ScannedCount": 2
}

Results1group_id (PK)last_active (SK)statusid...LastEvaluatedKey1Results2group_id (PK)last_active (SK)statusidgroup12022-01-01ACTIVEuser3ExclusiveStartKey:LastEvaluatedKey1

Limit: 3

const command = new QueryCommand({
// ...
	Limit: 3,
});

We have a result!

{
	"Count": 1,
	"Items": [{
		"..."
	}],
	"LastEvaluatedKey": {
		"..."
	},
	"ScannedCount": 3
}

Repeating the query with the pagination token returns no items and no LastEvaluatedKey either:

{
	"Count": 0,
	"Items": [],
	"ScannedCount": 0
}

Results1group_id (PK)last_active (SK)statusidgroup12022-01-01ACTIVEuser3...LastEvaluatedKey1Results2group_id (PK)last_active (SK)statusidExclusiveStartKey:LastEvaluatedKey1

Limit: 4

const command = new QueryCommand({
// ...
	Limit: 4,
});

We have a result!

{
	"Count": 1,
	"Items": [{
		"..."
	}],
	"ScannedCount": 3
}

With this limit we came back to the original response: 1 item, no pagination token.

Results1group_id (PK)last_active (SK)statusidgroup12022-01-01ACTIVEuser3

Seeing these results, there are several pressing questions we can ask. Why the matching item not in the result when the Limit: 1? Why is there a pagination token when the response already contains all items, such as in the Limit: 3 scenario?

Investigation

First, let’s see what items are in the table!

Users tablegroup_id (PK)last_active (SK)statusidgroup12022-06-15INACTIVEuser1group22022-06-16INACTIVEuser2group12022-01-01ACTIVEuser3group12022-01-02INACTIVEuser4

The users are stored with a group_id and a status field. There is also a last_active and that is the sort key. This defines the ordering of the items.

The query expression defines the group_id, so the results only contain items where the group is group1. This is why all the returned users are in this group.

Then the filter expression makes sure that only active users are returned.

For the first query, DynamoDB fetches a page of users that match the query expression. This means all three users are fetched, then the filter expression drops inactive ones.

Fetchgroup_id (PK)last_active (SK)statusidgroup12022-06-15INACTIVEuser1group12022-01-02INACTIVEuser4group12022-01-01ACTIVEuser3Filtergroup_id (PK)last_active (SK)statusidgroup12022-01-01ACTIVEuser3

This is why this query returns 1 item and no pagination token: DynamoDB knows there are no more users, as it reached the end of the table.

Limit: 1

When the Limit is 1, DynamoDB fetches only 1 item. The query expression is effective before the fetch, it will still skip users not in group1. But since the filter expressions runs after, the result will have zero items and a pagination token.

Fetchgroup_id (PK)last_active (SK)statusidgroup12022-06-15INACTIVEuser1...LastEvaluatedKey1Filtergroup_id (PK)last_active (SK)statusid

Limit: 2

When Limit is 2, it works almost the same. The only difference is that DynamoDB fetches 2 items, then drops both.

Fetchgroup_id (PK)last_active (SK)statusidgroup12022-06-15INACTIVEuser1group12022-01-02INACTIVEuser4...LastEvaluatedKey1Filtergroup_id (PK)last_active (SK)statusid

Limit: 3

When the Limit is 3, things change a bit. DynamoDB gets 3 items, but the last one is ACTIVE. This means the filter expression drops only 2 items and returns 1. But since the query did not reach the end of the table, there will be a pagination token.

The next query returns zero results, but also no pagination token. This means DynamoDB reached the end of the table.

Fetch1group_id (PK)last_active (SK)statusidgroup12022-06-15INACTIVEuser1group12022-01-02INACTIVEuser4group12022-01-01ACTIVEuser3...LastEvaluatedKey1Filter1group_id (PK)last_active (SK)statusidgroup12022-01-01ACTIVEuser3Fetch2group_id (PK)last_active (SK)statusidExclusiveStartKey:LastEvaluatedKey1Filter2group_id (PK)last_active (SK)statusid

Limit: 4

Finally, when Limit is 4, DynamoDB fetches all results and also reaches the end of the table. This means the result will contain the single result and no pagination token.

Fetchgroup_id (PK)last_active (SK)statusidgroup12022-06-15INACTIVEuser1group12022-01-02INACTIVEuser4group12022-01-01ACTIVEuser3Filtergroup_id (PK)last_active (SK)statusidgroup12022-01-01ACTIVEuser3

Takeaways

The filter expression is run after DynamoDB fetches the results, so it is almost the same as doing it on the client-side, the only difference is the saved bytes on the network layer.

Then Limit is applied for the fetch, which is before the filtering. This means there are no guarantees that a Limit will contain all matching items. As we’ve seen previously, it is possible that a result set contains zero items and multiple steps of pagination is needed to get the first result.

This is especially important when the DynamoDB query is behind an abstraction. For example, if the query is sent by an AppSync resolver and clients only interface with the GraphQL endpoint then they might not be aware that the limit works this way. Worse still, a seemingly simple change of adding a filter expression to that resolver changes the behavior of the GraphQL query. Because of this, it is important to communicate the lack of guarantees.

Using filter expressions with limit can easily affect performance. If, let’s say, there are 10 non-matching item then a Limit: 1 query needs to send 11 requests to get the first result. Without Limit, DynamoDB usually fetches enough items that it at least won’t impact the number of roundtrips.

Then, getting a LastEvaluatedKey does not mean there are more items to return. It can happen that DynamoDB did not reach the end of the table and a subsequent request is needed to make sure no elements are left out. This might result in strange behaviors such as showing a “load more results” link but when clicked it shows nothing.

And finally, the query expression is more efficient as it does not even touch the items that do not match. So if you can organize the table or index in a way that you can integrate the filtering in the query expression, then most of the surprises caused by pagination and limiting won’t happen.


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK