Analytics Tips #2 - Parsing JSONs the Right Way
Navigating Complex Data Structures with Python's json_normalize.
Hey everyone! This is Josep, back for one more week 👋🏻
This week, we have a second issue of the Analytics Tips series, where we'll explore different aspects of data science in an easy-to-understand way.
Diving into the world of data science and machine learning, one of the fundamental skills you'll encounter is the art of reading data. If you already have some experience with it, you're probably familiar with JSON (JavaScript Object Notation) - a popular format for both storing and exchanging data.
Think of how NoSQL databases like MongoDB love to store data in JSON, or how REST APIs often respond in the same format.
However, JSON, while perfect for storage and exchange, isn't quite ready for in-depth analysis in its raw form.
This is where we transform it into something more analytically friendly – a tabular format.
So, whether you're dealing with a single JSON object or a delightful array of them, in Python's terms, you're essentially handling a dict or a list of dicts.
1. Dealing with simple JSONs and lists of JSONs
1.1 Dealing with simple JSON structures
First, we import the pandas library; then we can use the pd.json_normalize() command, as follows:
import pandas as pd
pd.json_normalize(json_string)
Considering the following JSON examples:
simple_json = {
'name': 'David',
'city': 'London',
'income': 80000,
}
simple_json_2 = {
'name': 'Taylor',
'city': 'Chicago',
'income': 120000,
}
simple_json_list = [
simple_json,
simple_json_2
]
By applying this command to a JSON with a single record, we obtain the most basic table. And when our data is a bit more complex and comes as a list of JSONs, the same command works with no further complications: the output is a table with one row per record.
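Using the example records defined above, a minimal sketch of both cases:

```python
import pandas as pd

simple_json = {
    'name': 'David',
    'city': 'London',
    'income': 80000,
}
simple_json_2 = {
    'name': 'Taylor',
    'city': 'Chicago',
    'income': 120000,
}
simple_json_list = [simple_json, simple_json_2]

# A single JSON object becomes a one-row table
df_single = pd.json_normalize(simple_json)
print(df_single)  # columns: name, city, income

# A list of JSON objects becomes one row per record
df_list = pd.json_normalize(simple_json_list)
print(df_list)  # two rows, same columns
```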
Easy… right?
The next natural question is what happens when some of the values are missing.
1.2 Dealing with null values
Imagine some of the values are missing: for instance, the income record for David. When transforming our JSON into a pandas DataFrame, the corresponding value will appear as NaN.
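A small sketch of this case, reusing the records from before but with David's income removed:

```python
import pandas as pd

# Same records as before, but David's income is missing
json_list_with_nulls = [
    {'name': 'David', 'city': 'London'},
    {'name': 'Taylor', 'city': 'Chicago', 'income': 120000},
]

df = pd.json_normalize(json_list_with_nulls)
print(df)
# David's income shows up as NaN (and the column becomes a float dtype)
```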
And what about if I only want to get some of the fields?
1.3 Selecting only those columns of interest
If we only want to transform some specific fields into a tabular pandas DataFrame, the json_normalize() command does not let us choose which fields to keep. Therefore, a small preprocessing step is needed, where we filter the JSON down to just the columns of interest.
# Fields to include
fields = ['name', 'city']

# Keep only the fields of interest from each record
filtered_json_list = [
    {key: value for key, value in item.items() if key in fields}
    for item in simple_json_list
]

pd.json_normalize(filtered_json_list)
2. Dealing with multiple-level JSONs
Multi-level JSONs contain objects nested inside other objects. The procedure is the same as before, but in this case we can choose how many levels we want to flatten.
By default, the command expands all levels and generates new columns whose names concatenate the names of all the nested levels.
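A sketch with a hypothetical nested record: the default call expands everything, joining nested keys with dots, while the max_level argument limits how deep the flattening goes.

```python
import pandas as pd

nested_json = {
    'name': 'David',
    'city': 'London',
    'job': {
        'title': 'Data Scientist',
        'skills': {'primary': 'Python', 'secondary': 'SQL'},
    },
}

# Default: every level is expanded; nested keys are joined with '.'
df_full = pd.json_normalize(nested_json)
print(df_full.columns.tolist())
# ['name', 'city', 'job.title', 'job.skills.primary', 'job.skills.secondary']

# max_level=1 flattens only one level; deeper dicts stay in a single column
df_shallow = pd.json_normalize(nested_json, max_level=1)
print(df_shallow.columns.tolist())
# ['name', 'city', 'job.title', 'job.skills']
```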
3. Dealing with a nested List JSON
The last case we can find is a nested list within a JSON field. We can effectively manage this data using pandas in Python.
The pd.json_normalize() function is particularly useful here as well, but applying it directly leaves the nested list sitting in a single column. To flatten the list into one row per element, we tell the function which field holds the records and which top-level fields to carry along with them.
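A minimal sketch, assuming a hypothetical projects list nested inside the record: record_path names the field that holds the list, and meta repeats the top-level fields on every expanded row.

```python
import pandas as pd

json_with_list = {
    'name': 'David',
    'city': 'London',
    'projects': [
        {'title': 'Churn model', 'year': 2022},
        {'title': 'Sales forecast', 'year': 2023},
    ],
}

# record_path expands each list element into its own row;
# meta carries the top-level fields along with every row
df = pd.json_normalize(
    json_with_list,
    record_path='projects',
    meta=['name', 'city'],
)
print(df)
# columns: title, year, name, city -- one row per project
```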
To see the full detail and explore different techniques for each step, you can check the following article.
And this is all for now!
If you have any suggestions or preferences, please comment below or message me through my social media!
Remember you can also find me on X, Threads, Medium and LinkedIn 🤓