Analytics Tips #5 - Converting Unstructured Data into Structured Insights
From Chaos to Clarity with LLMs
Hey everyone! This is Josep, one more week 👋🏻
This week, we have a new issue of the Analytics Tips series, where we'll explore different aspects of data science in an easy-to-understand way.
In today's world, we're constantly generating information.
Yet much of it arises in unstructured formats.
This ranges from content on social media to the countless PDFs and Word documents stored across organizational networks.
Getting insights and value from these unstructured sources, whether they be text documents, web pages, or social media updates, is a considerable challenge.
However, the emergence of LLMs such as GPT or LLaMA has completely revolutionized the way we deal with unstructured data.
Setting up for our challenge
Imagine we are running an e-commerce business (Amazon in this case 😉), and we are the ones responsible for dealing with the millions of reviews that users leave on our products.
But first… let’s try to understand our challenge 👇🏻
What’s the difference between structured and unstructured data?
Structured data refers to data types that are consistently formatted and repeated, like banking transactions or airline reservations. This is great to store and work with!
Unstructured data is information that isn't organized in a “neat” way like a spreadsheet. This includes emails, social media posts, and even audio or video. This is not as great… 😔
So, what’s the main problem with unstructured data?
While structured data is well-suited to storage and management within a conventional database management system thanks to its uniform format, unstructured data is harder to integrate due to its less rigid structure.
So let’s try to deal with the most common unstructured data: TEXT.
And this brings us to the following question…
Is text genuinely unstructured, or does it possess an underlying structure that's not immediately apparent?
Computers are able to interpret simple, straightforward structures, but language, with its elaborate syntax, falls outside their field of comprehension.
So this brings us to a final question:
If computers struggle to process unstructured data efficiently, is it possible to convert this unstructured data into a structured format for better handling?
Manual conversion to structured data is time-consuming and carries a high risk of human error. Unstructured text is often a mishmash of words, sentences, and paragraphs in a wide variety of formats, which makes it difficult for machines to grasp its meaning and give it structure.
And this is precisely where LLMs play a key role.
Converting unstructured data into a structured format is essential if we want to work with it or process it in any way, including data analysis, information retrieval, and knowledge management.
1. Text Summarization
LLMs can efficiently summarize large volumes of text, such as reports, articles, or lengthy documents. This can be particularly useful for quickly understanding key points and themes in extensive data sets.
In our case, a short summary of each review is far more practical to work with than the whole review, and any LLM can produce one in seconds.
And our only, and most important, task will be crafting a good prompt.
In this case, I can tell GPT to:
Summarize the following review: "{review}" with a 3-word sentence.
And we will get something like the following 👇🏻
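Here is a minimal sketch of how a prompt like this could be wired into code. The names `build_summary_prompt`, `summarize`, and `llm` are my own assumptions for illustration: `llm` stands for any function that takes a prompt string and returns the model's reply, so you would plug in whichever client you actually use.

```python
from typing import Callable

# Prompt template based on the prompt quoted above.
SUMMARY_PROMPT = 'Summarize the following review: "{review}" with a 3-word sentence.'


def build_summary_prompt(review: str) -> str:
    """Fill the template with a single review."""
    return SUMMARY_PROMPT.format(review=review)


def summarize(review: str, llm: Callable[[str], str]) -> str:
    """Send the prompt to `llm` and return the trimmed answer.

    `llm` is hypothetical: any function mapping a prompt string to the
    model's text reply (for example, a thin wrapper around your API client).
    """
    return llm(build_summary_prompt(review)).strip()


if __name__ == "__main__":
    # Stand-in for a real model call, so the sketch runs offline.
    def fake_llm(prompt: str) -> str:
        return " Great value headphones \n"

    print(summarize("Bought these last month, sound is great for the price.", fake_llm))
```

Keeping the prompt in one template makes it easy to tweak the wording without touching the rest of the code.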
2. Sentiment Analysis
These models can be used for sentiment analysis, determining the tone and sentiment of text data such as customer reviews, social media posts, or feedback surveys.
The simplest, and most widely used, classification is polarity:
Positive reviews, or why people are happy with the product.
Negative reviews, or why they are upset.
Neutral reviews, or why people are indifferent to the product.
By analyzing these sentiments, businesses can understand public opinion, customer satisfaction, and market trends. So, instead of having a person decide for each review, we can have an LLM classify them for us.
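A polarity classifier along these lines could be sketched as follows. The prompt wording, the function names, and the `llm` callable (any function from prompt string to reply string) are assumptions of mine, not a fixed recipe. The one design choice worth noting is that the model's free-text answer gets normalized into a closed set of labels, because that is what makes the result structured.

```python
from typing import Callable

# The three polarity labels discussed above.
LABELS = ("positive", "negative", "neutral")


def build_sentiment_prompt(review: str) -> str:
    """Ask for exactly one of the allowed labels (wording is illustrative)."""
    return (
        "Classify the sentiment of the following review as positive, "
        "negative, or neutral. Answer with a single word.\n"
        f'Review: "{review}"'
    )


def parse_label(raw: str) -> str:
    """Normalize the model reply; anything unexpected becomes 'unknown'."""
    cleaned = raw.strip().strip(".!\"'").lower()
    return cleaned if cleaned in LABELS else "unknown"


def classify_sentiment(review: str, llm: Callable[[str], str]) -> str:
    """`llm` is a hypothetical callable wrapping whichever model you use."""
    return parse_label(llm(build_sentiment_prompt(review)))
```

Mapping unexpected replies to "unknown" instead of guessing keeps bad answers visible, so they can be reviewed or retried later.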
And we would obtain something like the following 👇🏻
3. Thematic Analysis
LLMs can identify and categorize themes or topics within large datasets. This is particularly useful for qualitative data analysis, where you might need to sift through vast amounts of text to understand common themes, trends, or patterns.
When analyzing reviews, it can be useful to understand the main purpose of each review. Some users will be complaining about something (service, quality, cost…), some will be rating their experience with the product (either in a good or a bad way), and others will be asking questions.
Again, doing this work manually would take many hours.
But with our new LLM friends, it only takes a few lines of code 👇🏻
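As a rough sketch of what those few lines might look like, the snippet below labels a batch of reviews with one of the three purposes mentioned above and returns one record per review. The theme names, the prompt wording, and the `llm` callable (prompt string in, reply string out) are assumptions for illustration.

```python
from typing import Callable, Dict, List

# Themes taken from the three review purposes described above.
THEMES = ("complaint", "experience", "question")


def build_theme_prompt(review: str) -> str:
    """Ask the model to pick exactly one theme (wording is illustrative)."""
    return (
        "Assign the review below to exactly one theme: complaint, "
        "experience, or question. Answer with the theme only.\n"
        f'Review: "{review}"'
    )


def label_themes(reviews: List[str], llm: Callable[[str], str]) -> List[Dict[str, str]]:
    """Return one structured record per review.

    `llm` is a hypothetical callable wrapping whichever model you use.
    Replies outside THEMES are stored as 'other' rather than trusted.
    """
    rows = []
    for review in reviews:
        answer = llm(build_theme_prompt(review)).strip().lower()
        theme = answer if answer in THEMES else "other"
        rows.append({"review": review, "theme": theme})
    return rows
```

The resulting list of dictionaries can be loaded straight into a table or a DataFrame, which is where the unstructured text finally becomes structured data.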
4. Keyword Extraction
LLMs can be used to extract keywords. This means detecting any element we ask for.
Imagine, for instance, that we want to check whether the product a review is attached to is actually the product the user is talking about. To do so, we need to detect which product the user is reviewing.
And again… we can ask our LLM to find out the main product the user is talking about.
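One way this could look in code is sketched below. Everything here is an assumption for illustration: the prompt wording, the JSON reply format, the function names, and the `llm` callable (prompt string in, reply string out). Asking for JSON makes the answer easy to parse, and the parsing step falls back to `None` when the model replies with something else.

```python
import json
from typing import Callable, Optional


def build_product_prompt(review: str) -> str:
    """Ask for the main product as a small JSON object (illustrative wording)."""
    return (
        "Identify the main product discussed in the review below. "
        'Answer only with JSON of the form {"product": "<name>"}.\n'
        f"Review: {review}"
    )


def extract_product(review: str, llm: Callable[[str], str]) -> Optional[str]:
    """Return the product name, or None if the reply is not usable JSON.

    `llm` is a hypothetical callable wrapping whichever model you use.
    """
    raw = llm(build_product_prompt(review))
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    product = data.get("product") if isinstance(data, dict) else None
    if isinstance(product, str) and product.strip():
        return product.strip()
    return None


def mentions_listed_product(extracted: Optional[str], listing_title: str) -> bool:
    """Naive check: is the extracted name contained in the listing title?"""
    if extracted is None:
        return False
    return extracted.lower() in listing_title.lower()
```

The substring comparison is deliberately simple. In practice, product names vary a lot, so a fuzzier match (or a second LLM call) may be needed.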
And we would obtain something like the following 👇🏻
Main Conclusions
In conclusion, the transformative power of Large Language Models (LLMs) in converting unstructured data into structured insights cannot be overstated. By harnessing these models, we can extract meaningful information from the vast sea of unstructured data that flows within our digital world.
The four methods discussed – text summarization, sentiment analysis, thematic analysis and keyword extraction – demonstrate the versatility and efficiency of LLMs in handling diverse data challenges.
These capabilities enable organizations to gain a deeper understanding of customer feedback, market trends, and operational inefficiencies.
To learn more you can check the following article with the corresponding pieces of code and you can go check my whole code in the following GitHub repo.
And this is all for now!
If you have any suggestions or preferences, please comment below or message me through my social media!
Remember you can also find me on X, Threads, Medium and LinkedIn 🤓
DataBites by Josep is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.