PostgreSQL’s JSON support

Raul Rene Lepsa
Java Devolper
@3Pillar Global

PROGRAMMING

There is an undeniable increasing need for data flexibility and scalability, this being the reason why so many have considered NoSQL databases over the last years. However, there are pros and cons and furthermore non-relational databases were never meant to replace relational ones.

Software developers and architects often have a hard time choosing one over another, especially when the format of the data is unknown or subject to change. A compromise solution is to use both relational and non-relational databases and to set up a communication system between the two. This, however, can prove to be a bigger headache than going with just one database system from the beginning.

Why not 2 in 1?

Companies such as IBM and Oracle started offering methods through which Relational Database Management Systems and non-relational ones can coexist. PostgreSQL offers an alternative, by providing special data types with a loose structure, which mimic NoSQL behavior on a RDBMS.

Starting with version 8.3, PostgreSQL introduced the hstore data type, which came in useful for rows with many attributes that are rarely examined and semi-structured data. It allows storing key-value pairs (similar to some NoSQLs) and supports different operations and functions on them. In version 9.2 the JSON-type column was introduced and it got performance improvements and also support for different functions and operators in the 9.3 beta. JSON (Javascript Object Notation) is a lightweight, human-readable, language-independent data-interchange format and it is stored by Postgres as text.

Why would you consider it?

Almost every developer has had to deal with ever-changing client requirements and felt the need for more flexibility from the data storage system, especially when dealing with multi-tenant applications. On the other hand, clients often feel the need for custom fields and demand flexibility. But what do you do when some clients want one custom field and others want four, for the same functionality? I am sure most of you have seen columns like customField1, customField2, customField3, anotherCustomField, and so on. These can be avoided by using an array, but what if you have to store pairs? Or triplets? What if the custom field has a label and a value? Or even an associated date? Things get more complicated.

Another general issue is dealing with people"s names. There is a list of 40 falsehoods about names, and just to get a feel of the point I"m trying to make, here are a few things programmers generally omit thinking about:

people"s names can contain numbers
people can have an indefinite number of names
people might not have a first name, a middle name or a last name
people can have more than one canonical fullname
people"s names are not necessarily written in ASCII, and not necessarily written in a single character set
people might not have names

Of course, you probably won"t have to deal with many of these cases, but you can never be sure that your system will never have to store Chinese names. A traditional way of storing names is a combination of first, last and middle names, where the latter is optional. But if a person does not have a first or a last name, this implementation becomes erroneous.

A possible way of doing it is using a JSON:

{
    "first_name"  : "Ronaldo",
    "mother_name": "de Assis",
    "last_name"  : "Moreira",
    "nicknames"  :["Ronaldinho", 
		  "Gaúcho"]
}

The above implementation can however add the overhead of changing the full_name whenever one of the others is changed, but it"s just an example. Generally speaking, the solution is dependent mainly on the application and the targeted clients, but the above one offers great flexibility compared with classic relational database way of storing names. Rarely will we need to query all of these names, and that"s the whole purpose. Why create a mother"s name column when most of the entities probably won"t even need one? Why create a nickname column when some entities will need more than one? Same can go for all the other names.

Functions and Operators

Although the full list of functions and operators can be found on the Postgres documentation, it is important to mention that accessing a JSON object field is done using the "->" operator (it can also be retrieved as text using the "->>" operator). For example, supposing the column is called names, accessing a last_name field from it is done by: names->>"last_name". The above operator can also be used to access an array element at a certain index: (names->"nicknames")->0 would return the first object from the nicknames array from within the names column.

Apart from the operators, starting from version 9.3 there have been a lot of functions added in order to aid the developers using this data type, such as ones for extracting objects from an JSON array, for converting a row to a JSON, for expanding the JSON object into a set of key/value pairs or for safely converting any element to a JSON.

Advantages of the JSON column type

The first advantage of this column type is that JSON is an open standard format, being language-independent. Secondly, it makes it easy to deal with ever-changing requirements because it"s flexible and scalable. It comes in useful when storing multi-level object graphs, as it provides better performance and the code itself is easier to write and maintain than usual graph implementations. The JSON column does not take much space (it"s like storing text) and allows storing up to 1 GB of data into one column. Furthermore, it prevents SQL injection by default because the JSON object is validated before it"s persisted.

Foreign keys can be avoided by data denormalization and fields in complex queries can be accessed without having to join other (possibly large) tables, thus offering a NoSQL-like behavior in a relational database management system.

It can be a plus in web applications by making it easier to transport and convert data from the client to the web controllers, and then to the persistence layer, by storing JSON objects received from the client.

These types of columns can be indexed and furthermore indices can be created inside JSON objects (e.g. for arrays inside a JSON object). The current indices supported on both hstore and JSON types are almost as performant as indices on common types, but there is work in progress for the new generation of GIN indices. These have already been implemented for hstore in the 9.4 development version and performance comparisons with MongoDB have shown that although the sequential scan is almost the same in terms of performance, the index scan if faster than in Mongo. These indices will most probably be applied to the JSON data types as well for version 9.4.

Disadvantages

The most important disadvantage of the JSON data type is that it"s not portable, as it is currently only supported by PostgreSQL. Other disadvantages include the fact that foreign key references are not supported and that queries have an odd syntax, less human readable than ordinary SQL.

Furthermore, queries that are simple on common data types can get tricky and difficult using the JSON one, especially when dealing with arrays of objects. To prove this point, consider an example in which phones are stored as an array of JSONs for a users table, as shown in Figure 1. Although this kind of setup looks better than creating columns such as home_number, primary_number, work_number and so on, querying the table in order to retrieve the primary phone numbers for example is overly complicated (the result of the query is shown in Figure 2):

SELECT users.id, phone->>"number" AS number 
FROM users INNER JOIN
(
  SELECT id, 
     json_array_elements(phones) 
     AS phone
     FROM users 
     WHERE id=users.id
) phones 
     ON phones.id = users.id
     WHERE phone->>"type"="primary"

Figure 1 - Phones stored as an array of JSON objects, each having a type and a number

Figure 2 - Primary numbers as retrieved by the query

One cannot safely assume (generally speaking) that the primary phone number is on the first position of the array. Nevertheless, this is subjective and dependent on the system, so enforcing such constraints could lead to easier queries, as the "->" operator can be directly used to access the n-th element of an array. One more thing to notice is that the above example shows a simple array of phone numbers. Queries become even more complicated when dealing with more complex objects inside the arrays and when the need of joining with elements from within the array appears.

Conclusions

As we often find ourselves searching for flexibility in databases, opting solely for a relational database management system or a NoSQL one (such as a document-based or graph-based) isn"t always a solution. Although steps are being taken in order to ease integrating these two types of databases, at least for now this can prove to be too big of a headache.

PostgreSQL offers a "compromise" solution by providing support for hstore columns that mimic key-value stores and for JSON columns that allow storing JSON objects and arrays. Moreover, it provides operators and functions in order to ease the data manipulation, while also allowing indices and inner indices on such objects.

Whether the solution is good or not is, of course, dependent on the system and on the requirements it needs to meet. What is important to remember is that the developer still has the power of the Postgres RDBMS available, independent on the fact that the JSON data type is used or not.