Storing and Using JSON Within PostgreSQL Part Two
We explored the basic functionality of JSON within PostgreSQL in Part One of this series. Now we will look a little deeper into some things we may want to use regularly. Most of the functionality we talked about in part one works well when you have a simple JSON format. However, in real life, our documents may get a bit messy. Let’s look at part of the JSON we are using for our tests:
Example JSON:
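A trimmed-down sketch of the kind of document we are working with (the title, ids, and values here are illustrative stand-ins, not the actual test data):

```json
{
  "title": "The Avengers",
  "year": 2012,
  "genres": ["Action", "Adventure", "Sci-Fi"],
  "cast": [
    { "id": "nm0000375", "name": "Robert Downey Jr.", "character": "Tony Stark" },
    { "id": "nm0749263", "name": "Mark Ruffalo", "character": "Bruce Banner" }
  ]
}
```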
You can see here that we have some nested arrays and a bit of multi-dimensional flair. If we wanted to get all the characters or actors in this movie, we would have a challenge using the basic functions. Thankfully, PostgreSQL has a deep set of functions for interacting with JSON.
First, let’s look at how to get all the movies starring Robert Downey Jr. The easiest way is to use one of the following:
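Both of these are sketches assuming a table named movies_json with a jsonb_column column (the column name appears later in the post; the table name is mine). The jsonpath variant needs PostgreSQL 12 or newer:

```sql
-- Containment: does the cast array contain an object with this name?
-- A GIN index on jsonb_column can service this operator.
SELECT jsonb_column->>'title'
FROM movies_json
WHERE jsonb_column->'cast' @> '[{"name": "Robert Downey Jr."}]';

-- Or the same check as a JSON-path predicate (PostgreSQL 12+):
SELECT jsonb_column->>'title'
FROM movies_json
WHERE jsonb_column @@ '$.cast[*].name == "Robert Downey Jr."';
```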
However, what if we also need to pull out the character along with the actor? To get a full list of actors and characters who were in this particular movie, we can use jsonb_to_recordset (similar to MySQL's json_table function, which we covered in the MySQL part of this series).
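A sketch of that call, using the same assumed table (the "character" column is quoted because character is also an SQL keyword):

```sql
SELECT t.id, t.name, t."character"
FROM movies_json,
     jsonb_to_recordset(jsonb_column->'cast')
       AS t(id text, name varchar(100), "character" text)
WHERE jsonb_column->>'title' = 'The Avengers';
```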
This works fine – until it doesn’t. If I do a similar search for all movies starring Robert Downey Jr., I get an error:
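The failing pattern looks something like this (a paraphrase of the query; the exact error text varies by PostgreSQL version):

```sql
SELECT jsonb_column->>'title' AS title, t.name, t."character"
FROM movies_json,
     jsonb_to_recordset(jsonb_column->'cast')
       AS t(id text, name varchar(100), "character" text)
WHERE t.name = 'Robert Downey Jr.';
-- Fails on the first row whose "cast" is a JSON null (a scalar)
-- rather than the array of objects the function expects.
```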
Looking at the error, you can see that the function expects an array, yet several of our movies have no cast listing at all (a JSON null in the document).
You can work around this null issue in a couple of different ways. The easiest is converting the value to text and then back into JSON. For example:
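A sketch of the workaround, using the same assumed table:

```sql
-- ->> renders "cast" as text, turning a JSON null into an SQL NULL;
-- ::jsonb parses it back, and jsonb_to_recordset emits no rows for a NULL input.
SELECT jsonb_column->>'title' AS title, t.name, t."character"
FROM movies_json,
     jsonb_to_recordset((jsonb_column->>'cast')::jsonb)
       AS t(id text, name varchar(100), "character" text)
WHERE t.name = 'Robert Downey Jr.';
```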
Remember in part one how I harped on types and making sure you cast to the right data type when needed? This is another example. Here, the ->> operator renders the value as text, which turns the JSON null into an SQL NULL; casting that back to jsonb leaves it NULL, and jsonb_to_recordset quietly returns no rows for a NULL input instead of throwing an error.
While this is the easiest way, let’s show some other interesting ways to work around the problem, to highlight some of the other functions, indexes, etc., we learned about in part one. Because we know some cast entries hold a null, we can filter those rows out before expanding the array. For example, here is a simple check that the text version of jsonb_column->'cast' is not equal to 'null':
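Something along these lines:

```sql
SELECT jsonb_column->>'title' AS title, t.name, t."character"
FROM movies_json,
     jsonb_to_recordset(jsonb_column->'cast')
       AS t(id text, name varchar(100), "character" text)
WHERE (jsonb_column->'cast')::text <> 'null'
  AND t.name = 'Robert Downey Jr.';
```

The null check references only the base table, so in practice the planner applies it at the table scan, before the offending rows ever reach jsonb_to_recordset.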
This is not terribly fast, but it does work. It is, however, essentially working around the JSON functions rather than with them. To speed the query up, we can of course index this column; unfortunately, some of the data in our movie cast lists is simply too large to index:
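Here is a sketch of the sort of index that hits the limit (the index name is mine):

```sql
CREATE INDEX idx_movies_cast_text
    ON movies_json (((jsonb_column->'cast')::text));
-- Fails for our data: b-tree entries are capped at roughly one third of a page
-- (about 2704 bytes with the default 8 kB page), and some cast lists are larger.
```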
Of course, we can work around this as well. But sometimes it’s best to pivot. Let’s look at another alternative to simply searching for a NULL value. As we learned in the first post, we can use generated columns or expression indexes to do some nice things. Here we could either create an expression index that checks for a null or add a new column that uses a case statement to flag null casts:
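Sketches of both options (the index and column names are mine; stored generated columns need PostgreSQL 12 or newer):

```sql
-- Option 1: an expression index on the null check itself
CREATE INDEX idx_movies_cast_null
    ON movies_json ( (((jsonb_column->'cast')::text) = 'null') );

-- Option 2: a stored generated column that flags null casts, plus an index
ALTER TABLE movies_json
    ADD COLUMN cast_is_null boolean
    GENERATED ALWAYS AS (
        CASE WHEN (jsonb_column->'cast')::text = 'null' THEN true ELSE false END
    ) STORED;

CREATE INDEX idx_movies_cast_is_null ON movies_json (cast_is_null);
```

Queries can then filter with WHERE NOT cast_is_null (or the matching expression) and let the index do the work.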
You can see there are several options here for dealing with the nulls, some far easier (and cleaner) than others. I want to highlight some of the challenges that come with using unstructured data inside a structured system.
Evolution Upsets the Balance (or Breaks Stuff)
All of the above solutions work for the existing data, but the wonderful thing about JSON is that you can evolve what you store over time. Take the example above: say that for years, every movie fed into your system had a full cast listing of characters and actors, and then one day the feed you get your data from starts allowing movies without a cast listing. Your application will still work and your load scripts will still run, but every once in a while a user will hit a weird error, or small failures will flash by in your logs (if you are logging them). 99.9% of queries are fine, and it is exactly these transient issues that drive people bonkers.
Looking at this problem slightly differently, what if you start adding data to, or reshaping, certain items? Think back to our original JSON above.
What if we wanted to add another piece of data to the cast records? Let’s say we want to add a thumbs-up or thumbs-down vote to each character, so something like this:
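A cast entry might then evolve into something like this (the key names and counts are illustrative):

```json
{
  "id": "nm0749263",
  "name": "Mark Ruffalo",
  "character": "Bruce Banner",
  "thumbs_up": 10000,
  "thumbs_down": 42
}
```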
Now we run into another challenge. Our call to jsonb_to_recordset(jsonb_column->'cast') AS t(id text, name varchar(100), character text) binds each output column to a JSON key by name and to a declared data type. Brand-new keys such as the thumbs votes are simply ignored, so additions like this are safe. The record definition is fragile in other ways, though: if the feed ever renames a key (say, character becomes role), the query silently starts returning NULL where it used to return Bruce Banner, and if a value changes to a type the declared column cannot absorb, the query errors outright.
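A quick sketch of both behaviors, against the same assumed table:

```sql
-- New keys are harmless: thumbs_up / thumbs_down are simply ignored here.
SELECT t.name, t."character"
FROM movies_json,
     jsonb_to_recordset(jsonb_column->'cast')
       AS t(id text, name varchar(100), "character" text);

-- But rename a key in the feed ("character" to "role", say) and this same
-- query silently returns NULL instead of 'Bruce Banner', with no error to warn you.
```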
MySQL makes this mapping explicit in its equivalent function: json_table(json_column, '$.cast[*]' COLUMNS (v_name varchar(200) PATH '$.name', v_character varchar(200) PATH '$.character')). Each output column is tied to a JSON path, so you state exactly which fields you want and where they live. That said, there are other limitations in MySQL's json_table method, so there is a trade-off either way. The point stands: the key names, types, and structure of your JSON matter greatly if you are going to use built-in database functions to query this data.
To JSON or Not? Database Design
While PostgreSQL’s JSON features are pretty deep (there are dozens of other functions I did not show), there are challenges I see when using them in place of a more classic database design. JSON functionality is NOT a replacement for actually designing your database and thinking about schema and design. It is a great tool for offering additional flexibility. If you treat PostgreSQL, or any database, simply as a dumping ground for data, bad things happen. Instead, think about how JSON columns can augment your current database designs, not replace them.
Let me give you a simple example of the dangers of forgoing database design in favor of just dumping in JSON. We have already highlighted some of the complexity and setup challenges of the different methods of storing JSON in PostgreSQL, so I wanted to take this a step further and compare how different database designs for the same data impact performance and scalability. For this, I built three separate designs and schemas housing the same data (rough sketches follow the list below):
A.) Minimal Database Design -> single table, single JSON column (“the dump”), with only a GIN index.
B.) Single Table -> Generated columns and Expression indexes where needed.
C.) Old School Normalized Tables, with JSON stored for easy use
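To make the comparison concrete, here are rough sketches of what the three schemas might look like (the table and column names are illustrative, and the real test schemas surely carried more fields):

```sql
-- A) "The dump": one jsonb column, optionally GIN-indexed
CREATE TABLE movies_dump (
    jsonb_column jsonb
);
CREATE INDEX idx_dump_gin ON movies_dump USING gin (jsonb_column);

-- B) Single table: the same document, plus generated columns where needed
CREATE TABLE movies_single (
    jsonb_column jsonb,
    title text GENERATED ALWAYS AS (jsonb_column->>'title') STORED
);
CREATE INDEX idx_single_title ON movies_single (title);
-- ...or, instead of the generated column, an expression index:
-- CREATE INDEX idx_single_title_expr ON movies_single ((jsonb_column->>'title'));

-- C) Normalized tables, with the JSON kept alongside for easy use
CREATE TABLE movies (
    movie_id     serial PRIMARY KEY,
    title        text,
    jsonb_column jsonb
);
CREATE TABLE actors (
    actor_id serial PRIMARY KEY,
    name     text
);
CREATE TABLE movie_cast (
    movie_id    int REFERENCES movies,
    actor_id    int REFERENCES actors,
    "character" text
);
```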
The idea here is to illustrate a few of the trade-offs around performance when picking a specific design(1):
| Test | A) Minimal design, no GIN idx | A) Minimal design, with GIN idx | B) Single table w/ generated columns | B) Single table w/ expression indexes | C) Normalized |
|---|---|---|---|---|---|
| Simple query for movie title (random 10 titles) | 800 ms | 0.3 ms | 0.2 ms | 0.2 ms | 0.2 ms |
| Select top 10 movies | 841 ms | 831 ms | 0.9 ms | 0.9 ms | 0.3 ms |
| Select all movies for an actor | 1009.0 ms | 228 ms | 5588.0 ms (2) | 5588.0 ms (2) | 0.8 ms |
| Count movies for a given actor | 5071 ms | 5075 ms | 5544 ms | N/A | 1 ms |
(1) Each test was repeated 100 times, with the average results listed; min/max numbers are available as well.
(2) The nested JSON for our cast info was not easily indexed; there are things we could do to mitigate this, but they are a bit more complicated.
Interestingly, there are a few cases where generated columns and expression indexes alone run into problems, generally because of nested data within the JSON. We could do some tricks, like adding a GIN index and querying the JSON to find the element, or creating some funky nested generated column with an expression index, but I left those out for now. I could also go deeper into the tests, covering inserts, updates, and, of course, assembling JSON from the normalized tables, but I didn’t. My goal here is merely to make you think about your design. Choosing one way to implement and use JSON may be fine for certain use cases, but you may later find yourself querying or aggregating the data in new ways, and things break.
Recap & What’s Next
A few quick takeaways:
Recapping part 1:
- Using JSONB is probably going to be your best option in most use cases.
- Be very careful of type conversions and making assumptions on the data within your JSON/JSONB columns. You may get errors or odd results.
- Use the available indexes, generated columns, and expression indexes to gain substantial performance benefits.
What we learned in part 2:
- The more straightforward the JSON, the easier it will be to pull out and interact with the data you need.
- Nested JSON data can be pulled out in a few different ways. jsonb_to_recordset is the easiest way I found to pull out the data I need. However, this function and others like it are very sensitive to the structure of the data in your document (data types matter!).
- JSON data whose format changes over time (elements added, renamed, or removed) may make some functions difficult, if not impossible, to use.
- JSON within a well-built, well-designed application and database can be wonderful and offer many great benefits. JSON just dumped into databases won’t scale. Database design still matters.
Now that we have covered the basics and a few nuances of JSON with PostgreSQL, next up, it is time to look at the same functionality and data within MongoDB before finally comparing the three databases. Stay tuned!
Author
Matt is the Chief Customer Officer at Percona. He has been working with relational databases as a DBA and system administrator since the mid-1990s. Before joining Percona, Matt worked at MySQL and Sun Microsystems as a Solution Architect, helping build architectures for Fortune 500 and top Alexa-ranked companies.