Hey you all!
It's Josep here, checking in from beautiful Luxembourg, where I'm visiting one of my closest friends.
In this edition, I want to dive into a topic that every data analyst, scientist, or enthusiast encounters sooner or later: outliers.
So stay with me for 4 minutes. Trust me, it'll be worth your time!
The most important news of the week?
The weekly cheatsheet format was a success, so for now we will keep it for the upcoming issues!
Now that we're caught up, let's get into the important stuff!
#1. Outliers: What, Why, and How?
Outliers are the peculiar data points that don't quite fit with the rest of the dataset. These anomalies can arise from:
Data entry errors
Measurement issues
Genuine rare events
While they can skew your analysis and lead to incorrect conclusions, they can also hold valuable insights (think fraud detection or identifying unique customer segments).
This leads us to today's main question...
So, what's the best way to approach them?
But before starting, I want to share the Cheatsheet of the week:
#2. Why Should You Detect Outliers?
Outlier detection serves two main purposes in data analysis:
1. To Remove Them
Outliers can distort patterns and bias machine learning models. Detecting and removing them can improve the accuracy of forecasts and trained ML models.
Examples: Training machine learning models, forecasting sales.
2. To Identify Them
Sometimes, outliers represent anomalies worth analyzing further. For instance, they might indicate fraud or highlight quality control issues.
Examples: Fraud detection, quality control, anomaly detection.
#3. How to Detect Outliers?
There are three primary approaches to detecting outliers:
Approach I: Graphical Methods
Visualization is one of the simplest ways to identify outliers.
Scatter Plot: Look for points far from the main cluster.
Box Plot: Outliers appear as points outside the whiskers.
Example:
In a scatter plot of customer spending, outliers might represent unusually high purchases during a sales campaign.
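As a rough illustration of the box plot idea, here is a minimal sketch using matplotlib. The spending values are made up for this example, and the whiskers use matplotlib's default of 1.5 times the interquartile range, which is only one common convention.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs without a display
import matplotlib.pyplot as plt

# Hypothetical customer spending values (invented for illustration)
spending = [42, 38, 51, 47, 40, 45, 39, 44, 48, 43, 41, 250]

fig, ax = plt.subplots(figsize=(4, 5))
box = ax.boxplot(spending)  # whiskers default to 1.5 * IQR
ax.set_ylabel("Spending")
ax.set_title("Customer spending")

# Points drawn beyond the whiskers are stored as "fliers"
fliers = [float(v) for v in box["fliers"][0].get_ydata()]
print("Points outside the whiskers:", fliers)

# fig.savefig("spending_boxplot.png")  # uncomment to save the plot to a file
```

Reading the fliers back from the plot object is optional. In practice you would usually just look at the picture, but it makes it easy to check which points the plot treats as outliers.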
Approach II: Statistical Methods
Statistical techniques rely on probability and data distribution:
Z-score: Measures how many standard deviations a data point is from the mean.
IQR (interquartile range): Flags points that fall more than 1.5 × IQR below the first quartile or above the third quartile.
Example:
In revenue data, a Z-score analysis might reveal months with revenues far above the average due to special events.
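Both statistical checks fit in a few lines of standard-library Python. The sketch below assumes a Z-score threshold of 3 and an IQR multiplier of 1.5, which are common conventions rather than fixed rules, and the revenue figures and function names are hypothetical.

```python
import statistics

# Hypothetical monthly revenue figures (invented for illustration)
revenue = [100, 102, 98, 101, 99, 103, 97, 100, 102, 98, 101, 180]

def zscore_outliers(values, threshold=3.0):
    """Return values whose absolute Z-score exceeds the threshold."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [v for v in values if abs(v - mean) / std > threshold]

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

print(zscore_outliers(revenue))  # [180]
print(iqr_outliers(revenue))     # [180]
```

One thing to keep in mind: the mean and standard deviation are themselves pulled by extreme values, so on small samples the Z-score can miss outliers that the IQR rule still catches.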
Approach III: Machine Learning Models
Leverage ML for more advanced outlier detection.
Isolation Forest: Isolates anomalies based on how easily a data point can be separated.
Elliptic Envelope: Fits a multivariate Gaussian distribution to identify outliers.
LOF (Local Outlier Factor): Detects density-based anomalies.
One-Class SVM (Support Vector Machine): Learns a boundary around the normal data and flags points that fall outside it.
Example:
An isolation forest can help detect fraudulent transactions in financial datasets.
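Here is a minimal sketch of that idea with scikit-learn's IsolationForest. The transaction amounts are synthetic, and the contamination value (the expected share of outliers) is an assumption you would need to tune for real data.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Synthetic transaction amounts: 200 typical values plus 3 extreme ones
typical = rng.normal(loc=50, scale=10, size=(200, 1))
extreme = np.array([[500.0], [750.0], [1000.0]])
X = np.vstack([typical, extreme])

# contamination is the assumed fraction of outliers in the data
model = IsolationForest(contamination=0.02, random_state=42)
labels = model.fit_predict(X)  # -1 marks an outlier, 1 marks an inlier

flagged = X[labels == -1].ravel()
print("Flagged amounts:", flagged)
```

Because contamination is set to 2%, the model flags roughly that share of the rows whether or not they are truly anomalous, so treat the flagged rows as candidates to review rather than confirmed fraud.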
Wrapping It Up
Outliers can be challenges or opportunities depending on how you handle them. Always remember:
Define them based on context.
Detect them using visual, statistical, or ML methods.
Decide whether to remove, analyze, or transform them.
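To make the last step concrete, here is a small sketch of the three options on the same data, using only the standard library. The values are made up, and the 1.5 × IQR fences are just one assumed way to define what counts as an outlier.

```python
import statistics

# Hypothetical values (invented for illustration)
values = [100, 102, 98, 101, 99, 103, 97, 100, 102, 98, 101, 180]

q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Remove: keep only the points inside the fences
kept = [v for v in values if lower <= v <= upper]

# Analyze: set the outliers aside for a closer look
to_review = [v for v in values if not lower <= v <= upper]

# Transform: cap extreme values at the fences instead of dropping them
capped = [min(max(v, lower), upper) for v in values]

print(len(kept), to_review, max(capped))
```

Which option is right depends on the context: capping keeps the sample size intact, while removing or reviewing makes more sense when the point is likely an error or a signal worth investigating.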
Still with me?
As a fellow data enthusiast, I'm sure you'd be eager to help me shape some impactful KPIs and take this newsletter to the next level!
So here's how you can help:
I want this newsletter to be truly valuable for you, so please share your feedback!
You can also let me know your preferences for future content, or any ideas you think would make it more valuable to you!
Before you go, tap the like button at the bottom of this email to show your support. It really helps and means a lot!
My latest articles
Understanding the Essentials of Time Series in DataBites
Using Pandas and SQL Together for Data Analysis in KDnuggets
How to Learn Databricks in DataCamp
My Favorite Resources!
I strongly recommend following the AIgents roadmaps if you want to become a full Data Scientist, ML Engineer, or Data Engineer.
You can follow the Data Engineering Zoomcamp on GitHub to become a fully proficient Data Engineer.
Want to get more of my content?
Reach me on:
LinkedIn, X (Twitter), or Threads to get daily posts about Data Science.
My Medium Blog to learn more about Data Science, Machine Learning, and AI.
Just email me at rfeers@gmail.com for any inquiries or to ask for help!
Remember that DataBites now has an official X (Twitter) account and LinkedIn page. Follow us there to stay updated and help spread the word!