
Storing and Using JSON Within PostgreSQL Part Two



We explored the basic functionality of JSON within PostgreSQL in Part One of this series. Now we will look a little deeper into some things we may want to use regularly.  Most of the functionality we talked about in part one works well when you have a simple JSON format.  However, in real life, our documents may get a bit messy.  Let’s look at part of the JSON we are using for our tests:

Example JSON:

JSON
{
  "imdb_id": "tt2395427",
  "tmdb_id": "99861",
  "douban_id": "10741834",
  "title": "Avengers: Age of Ultron (2015)",
  "genres": [
    "Action",
    "Adventure",
    "Sci-Fi"
  ],
  "country": "USA",
  "version": [
    {
      "runtime": "141 min",
      "description": ""
    }
  ],
  "imdb_rating": 7.5,
  "director": [
    {
      "id": "nm0923736",
      "name": "Joss Whedon"
    }
  ],
  "writer": [
    {
      "id": "nm0923736",
      "name": "Joss Whedon",
      "description": "written by"
    },
    {
      "id": "nm0498278",
      "name": "Stan Lee",
      "description": "based on the Marvel comics by and"
    },
    {
      "id": "nm0456158",
      "name": "Jack Kirby",
      "description": "based on the Marvel comics by"
    }
  ],
  "cast": [
    {
      "id": "nm0000375",
      "name": "Robert Downey Jr.",
      "character": "Tony Stark"
    },
    {
      "id": "nm1165110",
      "name": "Chris Hemsworth",
      "character": "Thor"
    },
    {
      "id": "nm0749263",
      "name": "Mark Ruffalo",
      "character": "Bruce Banner"
    },
    {
      "id": "nm0262635",
      "name": "Chris Evans",
      "character": "Steve Rogers"
    },
    {
      "id": "nm0424060",
      "name": "Scarlett Johansson",
      "character": "Natasha Romanoff"
    },
    {
      "id": "nm0719637",
      "name": "Jeremy Renner",
      "character": "Clint Barton"
    }
  ]
}

You can see here that we have some nested arrays and a bit of multi-dimensional flair.  If we wanted to get all the characters or actors in this movie, we would have a challenge using the basic functions.  Thankfully, PostgreSQL has a deep set of functions for interacting with JSON.

First, let’s look at how to get all the movies starring Robert Downey Jr. The easiest way is to use one of the  following:

Shell
select jsonb_column->>'title', jsonb_column->>'imdb_rating' from movies_jsonb where jsonb_column @> '{ "cast": [{ "name" : "Robert Downey Jr." }]}';
select jsonb_column->>'title', jsonb_column->>'imdb_rating' from movies_jsonb where jsonb_column @@ '$.cast.name == "Robert Downey Jr."';
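
Both of these predicates can be served by a GIN index on the JSONB column. A minimal sketch (assuming the movies_jsonb table from part one; the index name is made up):

Shell
create index idx_movies_jsonb_path on movies_jsonb using gin (jsonb_column jsonb_path_ops);

The jsonb_path_ops operator class is smaller and faster than the default jsonb_ops for @> and @@ searches, at the cost of not supporting the key-existence operators (?, ?|, ?&).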

However, what if we also need to pull out the character from the movie?  To get a full list of the actors and characters in this particular movie, we can use jsonb_to_recordset (similar to MySQL's json_table function, which we covered in the MySQL part of this series).

Shell
movie_json_test=> select imdb_id, title, imdb_rating, t.* from movies_json_generated, jsonb_to_recordset(jsonb_column->'cast') as t(id text,name text,character text)  where imdb_id = 'tt2395427' limit 15;
  imdb_id  |             title              | imdb_rating |    id     |         name         |    character    
-----------+--------------------------------+-------------+-----------+----------------------+------------------
tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm0000375 | Robert Downey Jr.    | Tony Stark
tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm1165110 | Chris Hemsworth      | Thor
tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm0749263 | Mark Ruffalo         | Bruce Banner
tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm0262635 | Chris Evans          | Steve Rogers
tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm0424060 | Scarlett Johansson   | Natasha Romanoff
tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm0719637 | Jeremy Renner        | Clint Barton
tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm0000652 | James Spader         | Ultron
tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm0000168 | Samuel L. Jackson    | Nick Fury
tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm0000332 | Don Cheadle          | James Rhodes
tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm1093951 | Aaron Taylor-Johnson | Pietro Maximoff
tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm0647634 | Elizabeth Olsen      | Wanda Maximoff
tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm0079273 | Paul Bettany         | Jarvis
tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm1130627 | Cobie Smulders       | Maria Hill
tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm1107001 | Anthony Mackie       | Sam Wilson
tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm2017943 | Hayley Atwell        | Peggy Carter
(15 rows)

This works fine – until it doesn’t.  If I do a similar search for all movies starring Robert Downey Jr., I get:

Shell
movie_json_test=> select imdb_id, title, imdb_rating, t.* from movies_json_generated, jsonb_to_recordset(jsonb_column->'cast') as t(id text,name text,character text)  where name like 'Robert Downey%' limit 10;
ERROR:  cannot call jsonb_to_recordset on a non-array

Looking at the results below, you can see why: the function expects an array, and several of our movies have no cast (a null value in the JSON).

Shell
movie_json_test=> select imdb_id, jsonb_typeof((jsonb_column->>'cast')::jsonb) from movies_json_generated limit 10;
  imdb_id  | jsonb_typeof
-----------+--------------
tt7620156 |
tt0109524 | array
tt0534466 | array
tt0111091 |
tt4167726 | array
tt0638383 | array
tt6346314 | array
tt5877808 | array
tt4098782 |
tt0365100 | array
(10 rows)

You can work around this “Null” issue in a couple of different ways.  The easiest is by converting to text, then back into JSON.  For example:

Shell
select imdb_id, title, imdb_rating, t.* from movies_json_generated, jsonb_to_recordset((jsonb_column->>'cast'::text)::jsonb) as t(id text,name varchar(100),character text) where name like 'Robert Downey%';

Remember in part one how I harped on data types and the need to cast between them when needed?  This is another example.  The ->> operator extracts the cast value as text, which turns a JSON null into a SQL NULL; casting that text back to JSONB hands jsonb_to_recordset a NULL instead of a non-array, and the function simply returns no rows for those movies.
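
If you would rather make that handling explicit, a nullif/coalesce combination does the same job in plain sight. A sketch: the JSON null becomes a SQL NULL, which coalesce replaces with an empty array that jsonb_to_recordset happily expands to zero rows:

Shell
select imdb_id, title, imdb_rating, t.* from movies_json_generated, jsonb_to_recordset(coalesce(nullif(jsonb_column->'cast', 'null'::jsonb), '[]'::jsonb)) as t(id text,name varchar(100),character text) where name like 'Robert Downey%' limit 5;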

While the text cast is the easiest way, let's look at some other interesting workarounds that highlight the functions, indexes, etc., we learned about in part one.  Because we know that some cast entries hold a null, we can filter those rows out.  For example, here is a simple check that jsonb_column->'cast' is not equal to a JSON null.

Shell
movie_json_test=> select imdb_id, title, imdb_rating, t.* from movies_json_generated, jsonb_to_recordset(jsonb_column->'cast') as t(id text,name varchar(100),character text)  where name like 'Robert Downey%' and jsonb_column->'cast'::text != 'null' limit 5;
  imdb_id  |                   title                   | imdb_rating |    id     |       name        |               character                
-----------+-------------------------------------------+-------------+-----------+-------------------+----------------------------------------
tt0266220 | The 65th Annual Academy Awards (1993)     |             | nm0000375 | Robert Downey Jr. | Himself - Presenter
tt1515091 | Sherlock Holmes: A Game of Shadows (2011) |        7.50 | nm0000375 | Robert Downey Jr. | Sherlock Holmes
tt1231583 | Due Date (2010)                           |        6.60 | nm0000375 | Robert Downey Jr. | Peter Highman
tt0343663 | Eros (2004)                               |        6.00 | nm0000375 | Robert Downey Jr. | Nick Penrose (segment "Equilibrium")
tt4420868 | The EE British Academy Film Awards (2015) |        7.40 | nm0000375 | Robert Downey Jr. | Himself - Tribute to Lord Attenborough
(5 rows)
movie_json_test=> explain analyze select imdb_id, title, imdb_rating, t.* from movies_json_generated, jsonb_to_recordset(jsonb_column->'cast') as t(id text,name varchar(100),character text)  where name like 'Robert Downey%' and jsonb_column->'cast'::text != 'null' limit 5;
                                                                QUERY PLAN                                                                
------------------------------------------------------------------------------------------------------------------------------------------
Limit  (cost=0.01..7.30 rows=5 width=332) (actual time=0.586..84.666 rows=5 loops=1)
   ->  Nested Loop  (cost=0.01..545198.71 rows=373482 width=332) (actual time=0.585..84.664 rows=5 loops=1)
         ->  Seq Scan on movies_json_generated  (cost=0.00..74611.38 rows=373482 width=272) (actual time=0.023..30.257 rows=3786 loops=1)
               Filter: ((jsonb_column -> 'cast'::text) <> 'null'::jsonb)
               Rows Removed by Filter: 258
         ->  Function Scan on jsonb_to_recordset t  (cost=0.01..1.25 rows=1 width=282) (actual time=0.014..0.014 rows=0 loops=3786)
               Filter: ((name)::text ~~ 'Robert Downey%'::text)
               Rows Removed by Filter: 24
Planning Time: 0.064 ms
Execution Time: 84.692 ms
(10 rows)

This works, but it is not terribly fast, and it is essentially working around the JSON functions rather than with them.  To speed up the query we can, of course, index this column; however, some of the data in our movie cast lists is simply too large to index directly:

Shell
movie_json_test=> create index gen_func_index_cast on movies_json_generated (((jsonb_column->>'cast')::text));
ERROR:  index row requires 10704 bytes, maximum size is 8191
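
One classic way around the row-size limit is to index a hash of the value rather than the value itself. A sketch (the index name is made up, and note that a hashed expression index only helps exact-match lookups, not LIKE searches):

Shell
create index gen_func_index_cast_md5 on movies_json_generated (md5(jsonb_column->>'cast'));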

Of course, we can work around this as well (the hashed index above, for example).  But sometimes, it's best to pivot.  Let's look at another alternative to simply searching for a NULL value.  As we learned in the first post, we can use generated columns or expression indexes to do some nice things.  Here we can either create an expression index that checks for a null or add a new column that uses a case statement to flag null casts:

Shell
movie_json_test=> alter table movies_json_generated add column cast_is_null boolean generated always as  ((case when ((jsonb_column->>'cast')::text) is null then true else false end)) stored;
ALTER TABLE
movie_json_test=> select imdb_id, title, imdb_rating, t.* from movies_json_generated, jsonb_to_recordset(jsonb_column->'cast') as t(id text,name varchar(100),character text)  where name like 'Robert Downey%' and cast_is_null is false limit 5;
  imdb_id  |                          title                          | imdb_rating |    id     |       name        |                   character                    
-----------+---------------------------------------------------------+-------------+-----------+-------------------+------------------------------------------------
tt3473134 | Off Camera with Sam Jones (TV Series 2014– )            |             | nm0000375 | Robert Downey Jr. | Himself                  2 episodes, 2014-2019
tt0092851 | Dear America: Letters Home from Vietnam (TV Movie 1987) |        7.90 | nm0000375 | Robert Downey Jr. | (voice)
tt0426841 | The 1994 Billboard Music Awards (1994)                  |             | nm0000375 | Robert Downey Jr. | Himself
tt1228705 | Iron Man 2 (2010)                                       |        7.00 | nm0000375 | Robert Downey Jr. | Tony Stark
tt0821642 | The Soloist (2009)                                      |        6.70 | nm0000375 | Robert Downey Jr. | Steve Lopez
(5 rows)
movie_json_test=> explain analyze select imdb_id, title, imdb_rating, t.* from movies_json_generated, jsonb_to_recordset(jsonb_column->'cast') as t(id text,name varchar(100),character text)  where name like 'Robert Downey%' and cast_is_null is not true limit 5;
                                                               QUERY PLAN                                                                
-----------------------------------------------------------------------------------------------------------------------------------------
Limit  (cost=0.01..8.24 rows=5 width=332) (actual time=0.912..30.550 rows=5 loops=1)
   ->  Nested Loop  (cost=0.01..309227.39 rows=187680 width=332) (actual time=0.912..30.548 rows=5 loops=1)
         ->  Seq Scan on movies_json_generated  (cost=0.00..72750.59 rows=187680 width=272) (actual time=0.007..1.069 rows=1789 loops=1)
               Filter: (cast_is_null IS NOT TRUE)
               Rows Removed by Filter: 106
         ->  Function Scan on jsonb_to_recordset t  (cost=0.01..1.25 rows=1 width=282) (actual time=0.016..0.016 rows=0 loops=1789)
               Filter: ((name)::text ~~ 'Robert Downey%'::text)
               Rows Removed by Filter: 23
Planning Time: 0.068 ms
Execution Time: 30.572 ms
(10 rows)
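
For completeness, the expression-index flavor of the same idea could look like this (a sketch; the index name is made up). The catch with expression indexes is that a query must repeat the indexed expression verbatim, e.g. where ((jsonb_column->>'cast') is null) = false, for the planner to consider it:

Shell
create index gen_expr_index_cast_null on movies_json_generated (((jsonb_column->>'cast') is null));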

You can see there are several options here for dealing with the nulls, some far easier (and cleaner) than others.  I want to highlight some of the challenges this brings up when using unstructured data within a structured system.

Evolution Upsets the Balance (or Breaks Stuff)

All of the above solutions work for the existing data, but the wonderful thing about JSON is that what you store can evolve over time.  Take the example above: say that for years, every movie fed into your system has a full cast listing of characters and actors, and then one day the feed you get your data from starts allowing movies without a cast listing.  Your application will still work, and your load scripts will still work.  But every once in a while a user will hit a weird error, or a brief error will flash by in your logs (if you are logging these), while 99.9% of queries are fine.  It is these transient issues that drive people bonkers.
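
One way to catch these outliers before your users do is a periodic sanity check against the incoming data. A sketch, counting documents where the cast key is missing, is a JSON null, or is not the array the application expects:

Shell
select count(*) from movies_json_generated where jsonb_column->'cast' is null or jsonb_typeof(jsonb_column->'cast') <> 'array';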

Looking at this problem slightly differently, what if you start adding data or changing the order of certain items?  Back to our original JSON:

JSON
{
  "imdb_id": "tt2395427",
  "tmdb_id": "99861",
  "douban_id": "10741834",
  "title": "Avengers: Age of Ultron (2015)",
  "genres": [
    "Action",
    "Adventure",
    "Sci-Fi"
  ],
  "country": "USA",
  "version": [
    {
      "runtime": "141 min",
      "description": ""
    }
  ],
  "imdb_rating": 7.5,
  "director": [
    {
      "id": "nm0923736",
      "name": "Joss Whedon"
    }
  ],
  "writer": [
    {
      "id": "nm0923736",
      "name": "Joss Whedon",
      "description": "written by"
    },
    {
      "id": "nm0498278",
      "name": "Stan Lee",
      "description": "based on the Marvel comics by and"
    },
    {
      "id": "nm0456158",
      "name": "Jack Kirby",
      "description": "based on the Marvel comics by"
    }
  ],
  "cast": [
    {
      "id": "nm0000375",
      "name": "Robert Downey Jr.",
      "character": "Tony Stark"
    },
    {
      "id": "nm1165110",
      "name": "Chris Hemsworth",
      "character": "Thor"
    },
    {
      "id": "nm0749263",
      "name": "Mark Ruffalo",
      "character": "Bruce Banner"
    },
    {
      "id": "nm0262635",
      "name": "Chris Evans",
      "character": "Steve Rogers"
    },
    {
      "id": "nm0424060",
      "name": "Scarlett Johansson",
      "character": "Natasha Romanoff"
    },
    {
      "id": "nm0719637",
      "name": "Jeremy Renner",
      "character": "Clint Barton"
    }
  ]
}
What if we wanted to add another piece of data to the cast records? Let's say we want to add a thumbs-up and thumbs-down vote count to each character, something like this:

JSON
{
  "id": "nm0749263",
  "name": "Mark Ruffalo",
  "actor_thumbs_up": 10000,
  "actor_thumbs_down": 100,
  "character": "Bruce Banner"
}

Now, does injecting the new fields into the middle of each cast object, instead of at the end, break our call to jsonb_to_recordset(jsonb_column->'cast') as t(id text,name varchar(100),character text)?  Fortunately not: jsonb_to_recordset matches JSON keys to the column names in the record definition we had to supply, so the position of a field does not matter, and character still returns "Bruce Banner".  What does matter is the key names and value types.  Rename character to role, or drop it, and that column quietly comes back NULL for the affected movies; change a value to something that no longer casts cleanly, and the application starts seeing errors.  Whether you inject the new fields in the middle or append them at the end, like this, is purely cosmetic:

JSON
{
  "id": "nm0749263",
  "name": "Mark Ruffalo",
  "character": "Bruce Banner",
  "actor_thumbs_up": 10000,
  "actor_thumbs_down": 100
}

MySQL's equivalent takes the same name-based approach but makes it explicit, requiring a path for each column: json_table(json_column, '$.cast[*]' columns (v_name varchar(200) path '$.name', v_character varchar(200) path '$.character')).  This lets you define exactly which fields you want.  That said, there are other limitations in MySQL's json_table method, so there are trade-offs either way.  The point stands: the structure and key names of your JSON matter greatly if you are going to use built-in database functions to query this data.
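
If you want to sidestep the record definition entirely, jsonb_array_elements plus the ->> operator pulls fields by key and ignores everything else in each object. A sketch using the hypothetical vote fields from the example above:

Shell
select imdb_id, elem->>'name' as name, elem->>'character' as character, (elem->>'actor_thumbs_up')::int as thumbs_up from movies_json_generated, jsonb_array_elements(jsonb_column->'cast') as elem where jsonb_typeof(jsonb_column->'cast') = 'array' limit 5;

Because fields are matched by key, this query is indifferent to where in each object the new fields land.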

To JSON or Not?  Database Design

While PostgreSQL's JSON features are pretty deep (there are dozens of other functions I did not show), they come with challenges similar to the ones above when compared to a more classic database design. JSON functionality is NOT a replacement for actually designing your database and thinking about schema and design.  It is a great tool for adding flexibility.  If you treat PostgreSQL, or any database, simply as a dumping ground for data, bad things happen.  Think about how JSON columns can augment your current database design, not replace it.

Let me give you a simple example of the dangers of forgoing database design in favor of just dumping in JSON.  We have already highlighted some of the complexity and setup challenges of the different methods for storing JSON in PostgreSQL, so I wanted to take this a step further and compare how different database designs affect performance and scalability.  For this, I built three separate designs and schemas for housing the same data (rough sketches follow the list):

A.)  Minimal database design -> single table, single column. "The dump."  With only a GIN index.

B.)  Single table -> generated columns and expression indexes where needed.

C.)  Old-school normalized tables, with the JSON stored alongside for easy use.
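
Rough sketches of the three schemas (simplified, and the names are my own shorthand; the real test tables carry more columns and indexes):

Shell
-- A.) The dump: one table, one column, GIN index only
create table movies_dump (jsonb_column jsonb);
create index idx_dump_gin on movies_dump using gin (jsonb_column);

-- B.) Single table with generated columns (expression indexes where needed)
create table movies_generated (
    jsonb_column jsonb,
    imdb_id text generated always as (jsonb_column->>'imdb_id') stored,
    title text generated always as (jsonb_column->>'title') stored,
    imdb_rating numeric generated always as ((jsonb_column->>'imdb_rating')::numeric) stored
);
create index idx_gen_imdb_id on movies_generated (imdb_id);

-- C.) Normalized tables, with the raw JSON kept alongside
create table movies (imdb_id text primary key, title text, imdb_rating numeric, jsonb_column jsonb);
create table movie_cast (imdb_id text references movies (imdb_id), actor_id text, name text, character text);
create index idx_cast_name on movie_cast (name text_pattern_ops);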

The idea here is to illustrate a few of the trade-offs around performance when picking a specific design(1):

Query                                             | A.) Minimal, no GIN idx | A.) Minimal, GIN idx | B.) Generated columns | B.) Expression indexes | C.) Normalized
--------------------------------------------------+-------------------------+----------------------+-----------------------+------------------------+---------------
Simple query for a movie title (random 10 titles) | 800ms                   | 0.3ms                | 0.2ms                 | 0.2ms                  | 0.2ms
Select top 10 movies                              | 841ms                   | 831ms                | 0.9ms                 | 0.9ms                  | 0.3ms
Select all movies for an actor                    | 1009.0ms                | 228ms                | 5588.0ms (2)          | 5588.0ms (2)           | 0.8ms
Count movies for a given actor                    | 5071ms                  | 5075ms               | 5544ms                | NA                     | 1ms

  1. Each test was repeated 100 times, and the average result is listed.  Min/max numbers are available as well.
  2. The nested JSON for our "cast info" was not easily indexed; there are things we could do to mitigate this, but they are a bit more complicated.
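
To make the normalized numbers concrete, the "all movies for an actor" query against design C is a plain indexed join (using the sketched schema above; with the text_pattern_ops index on movie_cast.name, the prefix LIKE resolves through the index instead of expanding JSON for every row):

Shell
select m.title, m.imdb_rating from movies m join movie_cast c on c.imdb_id = m.imdb_id where c.name like 'Robert Downey%';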

Interestingly, there are a few cases where using only generated columns/expression indexes runs into problems, generally because of nested data within the JSON.  We could use some tricks, like adding a GIN index and querying the JSON to find the element, or creating some funky nested generated column with an expression index, but I left those out for now.  I could also go deeper into the tests, covering inserts, updates, and, of course, assembling JSON from the normalized tables, but I didn't.  My goal here is merely to make you think about your design.  Choosing one way to implement and use JSON may be fine under certain use cases, but you may later find yourself querying or aggregating the data in ways that break.

Recap & What’s Next

A few quick takeaways:

Recapping part 1:

  • Using JSONB is probably going to be your best option in most use cases.
  • Be very careful of type conversions and making assumptions on the data within your JSON/JSONB columns.  You may get errors or odd results.
  • Use the available indexes, generated columns, and expression indexes to gain substantial performance benefits.

What we learned in part 2:

  • The more straightforward the JSON, the easier it will be to pull out and interact with the data you need.
  • Nested JSON data can be pulled out in a few different ways; jsonb_to_recordset is the easiest way I found to pull out the data I need.  However, this function and the others are very sensitive to the structure of the data in your document (data types matter!).
  • JSON whose format changes over time (elements added or removed) can make some functions difficult, if not impossible, to use.
  • JSON within a well-designed application and database can be wonderful and offer many great benefits.  JSON simply dumped into a database won't scale.  Database design still matters.

Now that we have covered the basics and a few nuances of JSON with PostgreSQL, next up, it is time to look at the same functionality and data within MongoDB before finally comparing the three databases. Stay tuned!


Author

Matt Yonkovit

Matt is the Chief Customer Officer for Percona. He has been working with relational databases since the mid-1990s as a DBA and system administrator. Before joining Percona, Matt worked at MySQL and Sun Microsystems as a Solution Architect, helping build architectures for Fortune 500 and top Alexa-rated companies.

