<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[DataBites]]></title><description><![CDATA[Weekly curated insights to make you a better data professional 🧩]]></description><link>https://www.databites.tech</link><image><url>https://substackcdn.com/image/fetch/$s_!kyJ6!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe930fbab-b8df-40ef-9676-3d9ca5d49eae_714x714.png</url><title>DataBites</title><link>https://www.databites.tech</link></image><generator>Substack</generator><lastBuildDate>Thu, 30 Apr 2026 21:35:46 GMT</lastBuildDate><atom:link href="https://www.databites.tech/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Josep Ferrer]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[databites.hi@gmail.com]]></webMaster><itunes:owner><itunes:email><![CDATA[databites.hi@gmail.com]]></itunes:email><itunes:name><![CDATA[Josep Ferrer]]></itunes:name></itunes:owner><itunes:author><![CDATA[Josep Ferrer]]></itunes:author><googleplay:owner><![CDATA[databites.hi@gmail.com]]></googleplay:owner><googleplay:email><![CDATA[databites.hi@gmail.com]]></googleplay:email><googleplay:author><![CDATA[Josep Ferrer]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[How to Actually Get Started with HuggingFace 🤗]]></title><description><![CDATA[A clear (and human) guide to get started without drowning]]></description><link>https://www.databites.tech/p/how-to-actually-get-started-with-b80</link><guid isPermaLink="false">https://www.databites.tech/p/how-to-actually-get-started-with-b80</guid><dc:creator><![CDATA[Josep Ferrer]]></dc:creator><pubDate>Tue, 
28 Oct 2025 13:15:02 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/c711045e-5919-47eb-9c31-5b43631fe9b0_976x864.png" length="0" type="image/png"/><content:encoded><![CDATA[<p>If you still think &#129303; is just a WhatsApp emoji, <strong>you&#8217;ve missed a lot. </strong></p><p>AI isn&#8217;t stuck in research labs anymore; it&#8217;s in products, back-office flows, and tiny scripts that save hours each week. </p><p><strong>Hugging Face is the community backbone behind much of that shift:</strong> an open-source platform that has become essential for anyone working in Machine Learning (ML) and Natural Language Processing (NLP).</p><p>Whether you&#8217;re an experienced data scientist or just starting, Hugging Face offers a wide variety of tools and resources to help you bring your AI projects to life.</p><p><strong>Trust me when I say, you&#8217;ll want to be a part of it!</strong></p><p>Before we dive in, I strongly recommend checking out my previous issue on <em>How to Get Started with LLMs</em> (if you haven&#8217;t already). 
Trust me, it&#8217;s a great primer!</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;2fc20ea9-1d63-4d47-8925-b0957add7c47&quot;,&quot;caption&quot;:&quot;LLMs are moving faster than your backlog.&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;How to Actually Get Started with LLMs&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:132707413,&quot;name&quot;:&quot;Josep Ferrer&quot;,&quot;bio&quot;:&quot;Outstand using data -- Data Science, Design and Tech Tech Writer @KDnuggets @DataCamp &#128073;&#127995;Inquiries in rfeers@gmail.com&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd196b5a6-59f2-46dd-99b3-e10ab1bbd27d_604x604.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-10-08T13:33:57.035Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/aa672bbc-1acb-447c-b2a4-5259717b2089_976x864.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.databites.tech/p/how-to-actually-get-started-with&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:175617350,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:10,&quot;comment_count&quot;:0,&quot;publication_id&quot;:2143185,&quot;publication_name&quot;:&quot;DataBites&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!kyJ6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe930fbab-b8df-40ef-9676-3d9ca5d49eae_714x714.png&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><h1><strong>Hugging Face, or The 
GitHub of ML</strong></h1><p>Hugging Face is often described as the &#8220;GitHub of the ML world&#8221;, a collaborative platform with lots of pre-trained models and datasets (ready to load and use!).</p><p>But it goes beyond that description. Think of it as <strong>GitHub + model hosting + serving for AI</strong>: a massive <strong>Hub</strong> of models/datasets, the <strong>Transformers</strong> library (not just NLP anymore), easy <strong>Datasets</strong>, and simple ways to <strong>demo</strong> (Spaces) and <strong>serve</strong> (Inference Endpoints, TGI) models.</p><h4>Why you should care</h4><ul><li><p><strong>Speed:</strong> pre-trained models + one-line pipelines get you to a baseline in minutes.</p></li><li><p><strong>Breadth:</strong> text, vision, audio, multimodal, diffusion&#8212;you name it.</p></li><li><p><strong>Community:</strong> model cards, evals, PRs, and fast iteration on SOTA ideas.</p></li></ul><p><em>So&#8230; where does this company come from?</em></p><h3><strong>From Chatbot to Open-Source Powerhouse</strong></h3><p>Founded in 2016, Hugging Face originally aimed to create a chatbot targeted at teenagers. 
However, <strong>the company quickly pivoted after open-sourcing its underlying model, leading to the creation of the Transformers library in 2018.</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NXS4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f3f6056-83e3-4045-9b33-0a125a7db122_1472x518.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NXS4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f3f6056-83e3-4045-9b33-0a125a7db122_1472x518.png 424w, https://substackcdn.com/image/fetch/$s_!NXS4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f3f6056-83e3-4045-9b33-0a125a7db122_1472x518.png 848w, https://substackcdn.com/image/fetch/$s_!NXS4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f3f6056-83e3-4045-9b33-0a125a7db122_1472x518.png 1272w, https://substackcdn.com/image/fetch/$s_!NXS4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f3f6056-83e3-4045-9b33-0a125a7db122_1472x518.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NXS4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f3f6056-83e3-4045-9b33-0a125a7db122_1472x518.png" width="1456" height="512" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9f3f6056-83e3-4045-9b33-0a125a7db122_1472x518.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:512,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:416168,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.databites.tech/i/177365648?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f3f6056-83e3-4045-9b33-0a125a7db122_1472x518.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NXS4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f3f6056-83e3-4045-9b33-0a125a7db122_1472x518.png 424w, https://substackcdn.com/image/fetch/$s_!NXS4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f3f6056-83e3-4045-9b33-0a125a7db122_1472x518.png 848w, https://substackcdn.com/image/fetch/$s_!NXS4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f3f6056-83e3-4045-9b33-0a125a7db122_1472x518.png 1272w, https://substackcdn.com/image/fetch/$s_!NXS4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f3f6056-83e3-4045-9b33-0a125a7db122_1472x518.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Today, Hugging Face is a central hub for AI professionals and enthusiasts, fostering a community that continually pushes the boundaries of what&#8217;s possible with machine learning.</p><p><em>Isn&#8217;t it crazy how things change up so fast?</em></p><h2>Core pieces you&#8217;ll actually use</h2><p>One of the biggest advantages of Hugging Face is how easy it is to get started. </p><h3><strong>#1. Transformers Library</strong></h3><p>The Transformers library is a comprehensive suite of state-of-the-art ML models specially designed for NLP that contains an extensive collection of pre-trained models optimized for tasks such as text classification, language generation, translation, and summarization, among others</p><p>It abstracts common NLP tasks into a simple-to-use pipeline() method, an easy-to-use API for performing a wide variety of tasks. 
The Transformers library simplifies the implementation of NLP models in several key ways:</p><ol><li><p><strong>Abstraction of complexity:</strong> It abstracts away the complexity involved in initializing models, managing pipelines, and handling tokenization.</p></li><li><p><strong>Pre-trained models:</strong> It provides one of the largest collections of pre-trained models, reducing the time and resources required to develop NLP applications from scratch.</p></li><li><p><strong>Flexibility and modularity:</strong> The library is designed with modularity in mind, allowing users to plug in different components as required.</p></li><li><p><strong>Community and support: </strong>Hugging Face has fostered a strong community around its tools, with extensive documentation, tutorials, and forums.</p></li><li><p><strong>Continuous updates and expansion: </strong>The library is constantly updated with the latest breakthroughs in NLP, incorporating new models and methodologies.</p></li></ol><h3><strong>#2. Model Hub</strong></h3><p>The Model Hub is the public face of the community, a platform where thousands of models and datasets are at your fingertips. It allows users to share and discover models contributed by the community, promoting a collaborative approach to NLP development.</p><p>You can go check it out <a href="https://substack.com/redirect/8e3b6836-14e0-46bf-aa01-9914cb11ee26?j=eyJ1IjoiMjcwZHAxIn0.hGTR9CXb_nmPcUKqllDE9vqggNRtE3-4-yLAzGi9eWs">on their official website</a>. 
There, open the Model Hub by clicking the Models button in the navigation bar, and a view like the following should appear:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qnEg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2db3dcec-7e63-4980-bc77-a75b686fec79_1999x1129.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qnEg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2db3dcec-7e63-4980-bc77-a75b686fec79_1999x1129.png 424w, https://substackcdn.com/image/fetch/$s_!qnEg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2db3dcec-7e63-4980-bc77-a75b686fec79_1999x1129.png 848w, https://substackcdn.com/image/fetch/$s_!qnEg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2db3dcec-7e63-4980-bc77-a75b686fec79_1999x1129.png 1272w, https://substackcdn.com/image/fetch/$s_!qnEg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2db3dcec-7e63-4980-bc77-a75b686fec79_1999x1129.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qnEg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2db3dcec-7e63-4980-bc77-a75b686fec79_1999x1129.png" width="1456" height="822" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2db3dcec-7e63-4980-bc77-a75b686fec79_1999x1129.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:822,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Screenshot of Hugging Face Model Hub main view.&quot;,&quot;title&quot;:&quot;Screenshot of Hugging Face Model Hub main view.&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Screenshot of Hugging Face Model Hub main view." title="Screenshot of Hugging Face Model Hub main view." srcset="https://substackcdn.com/image/fetch/$s_!qnEg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2db3dcec-7e63-4980-bc77-a75b686fec79_1999x1129.png 424w, https://substackcdn.com/image/fetch/$s_!qnEg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2db3dcec-7e63-4980-bc77-a75b686fec79_1999x1129.png 848w, https://substackcdn.com/image/fetch/$s_!qnEg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2db3dcec-7e63-4980-bc77-a75b686fec79_1999x1129.png 1272w, https://substackcdn.com/image/fetch/$s_!qnEg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2db3dcec-7e63-4980-bc77-a75b686fec79_1999x1129.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg 
role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Screenshot of Hugging Face Model Hub main view.</figcaption></figure></div><p>As you can see, in the left-sidebar, there are multiple filters regarding the main task to be performed.</p><p>Contributing to the Model Hub is made straightforward by Hugging Face&#8217;s tools, which guide users through the process of uploading their models. Once contributed, these models are available for the entire community to use, either directly through the hub or via integration with the Hugging Face Transformers library.</p><p><em>Isn&#8217;t it exciting?</em></p><p><strong>This ease of access and contribution fosters a dynamic ecosystem where state-of-the-art models are constantly refined and expanded upon</strong>, providing a rich, collaborative foundation for NLP advancement.</p><h3><strong>#3. 
Tokenizers</strong></h3><p>Tokenizers are crucial in NLP: they convert text into a format that machine learning models can understand, which is essential for processing different languages and text structures.</p><p>They break text down into tokens&#8212;basic units like words, subwords, or characters&#8212;preparing the data for machine learning models to process. These tokens are the building blocks that enable models to understand and generate human language.</p><p>They also map tokens to the numerical representations the model expects as input, and handle padding and truncation for uniform sequence lengths.</p><p>Hugging Face provides a range of user-friendly tokenizers, optimized for their Transformers library, which are key to the seamless preprocessing of text. </p><h3><strong>#4. Datasets Library</strong></h3><p>Another key component is the Hugging Face Datasets library, a vast repository of NLP datasets that support the training and benchmarking of ML models.</p><p>This library is a crucial tool for developers in the field, as it offers a diverse collection of datasets that can be used to train, test, and benchmark NLP models across a wide variety of tasks.</p><p>One of its main benefits is its simple, user-friendly interface. 
While you can browse and explore all datasets on the Hugging Face Hub, the datasets library lets you download any of them into your code effortlessly.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fa5y!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff5383f6-2189-44fd-92d6-5ac85cabd592_1999x1217.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fa5y!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff5383f6-2189-44fd-92d6-5ac85cabd592_1999x1217.png 424w, https://substackcdn.com/image/fetch/$s_!fa5y!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff5383f6-2189-44fd-92d6-5ac85cabd592_1999x1217.png 848w, https://substackcdn.com/image/fetch/$s_!fa5y!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff5383f6-2189-44fd-92d6-5ac85cabd592_1999x1217.png 1272w, https://substackcdn.com/image/fetch/$s_!fa5y!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff5383f6-2189-44fd-92d6-5ac85cabd592_1999x1217.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fa5y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff5383f6-2189-44fd-92d6-5ac85cabd592_1999x1217.png" width="1456" height="886" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ff5383f6-2189-44fd-92d6-5ac85cabd592_1999x1217.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:886,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Screenshot of Hugging Face Datasets main view.&quot;,&quot;title&quot;:&quot;Screenshot of Hugging Face Datasets main view.&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Screenshot of Hugging Face Datasets main view." title="Screenshot of Hugging Face Datasets main view." srcset="https://substackcdn.com/image/fetch/$s_!fa5y!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff5383f6-2189-44fd-92d6-5ac85cabd592_1999x1217.png 424w, https://substackcdn.com/image/fetch/$s_!fa5y!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff5383f6-2189-44fd-92d6-5ac85cabd592_1999x1217.png 848w, https://substackcdn.com/image/fetch/$s_!fa5y!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff5383f6-2189-44fd-92d6-5ac85cabd592_1999x1217.png 1272w, https://substackcdn.com/image/fetch/$s_!fa5y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff5383f6-2189-44fd-92d6-5ac85cabd592_1999x1217.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Screenshot of Hugging Face Datasets main view.</figcaption></figure></div><p>It includes datasets for common tasks such as text classification, translation, and question-answering, as well as more specialized datasets for unique challenges in the field.</p><p>So now that we know what it is, let&#8217;s get our hands dirty &#128165;</p><h2><strong>Getting Started with Hugging Face</strong></h2><p>Before you can start exploring Hugging Face, you&#8217;ll need to install it on your local machine.</p><h3>Installation</h3><p>First, you should combine the<code> transformers</code> library with your favorite deep learning library, either <code>TensorFlow</code> or <code>PyTorch</code>.</p><p>The transformers library can be easily installed using <code>pip</code>, Python&#8217;s package 
installer.</p><pre><code><code>pip install transformers</code></code></pre><p>To get the full capability, also install the <code>datasets</code> and <code>tokenizers</code> libraries.</p><pre><code><code>pip install tokenizers datasets</code></code></pre><p>Hugging Face&#8217;s model hub offers a huge collection of pre-trained models that you can use for a wide range of NLP tasks. There are a bunch of things we can do with LLMs. </p><p><strong>The first thing we can do is directly use a pre-trained model. </strong></p><h3>1. Using Pre-trained Models</h3><h4><strong>#1 Select a Pre-trained Model</strong></h4><p>First, you need to select a pre-trained model. To do so, we go to the <strong><a href="https://huggingface.co/models">Model Hub</a></strong>.</p><p>Imagine we want to infer the sentiment corresponding to a string of text. So we can easily browse only the models that perform `Text Classification` tasks by selecting the Text Classification button on the left-sidebar.</p><p>Models on the Hub appear ordered by Trending by default. Usually, the top results are the most widely used ones. 
</p><p><em>So, we select the second result, which is the most used sentiment analysis model.</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ev9g!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faec83d08-0bfd-4a38-a633-22a6bdd7dc8c_1999x1239.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Ev9g!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faec83d08-0bfd-4a38-a633-22a6bdd7dc8c_1999x1239.png 424w, https://substackcdn.com/image/fetch/$s_!Ev9g!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faec83d08-0bfd-4a38-a633-22a6bdd7dc8c_1999x1239.png 848w, https://substackcdn.com/image/fetch/$s_!Ev9g!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faec83d08-0bfd-4a38-a633-22a6bdd7dc8c_1999x1239.png 1272w, https://substackcdn.com/image/fetch/$s_!Ev9g!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faec83d08-0bfd-4a38-a633-22a6bdd7dc8c_1999x1239.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Ev9g!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faec83d08-0bfd-4a38-a633-22a6bdd7dc8c_1999x1239.png" width="1456" height="902" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/aec83d08-0bfd-4a38-a633-22a6bdd7dc8c_1999x1239.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:902,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Screenshot of Hugging Face Model Hub main view. Selecting Text Classification models.&quot;,&quot;title&quot;:&quot;Screenshot of Hugging Face Model Hub main view. Selecting Text Classification models.&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Screenshot of Hugging Face Model Hub main view. Selecting Text Classification models." title="Screenshot of Hugging Face Model Hub main view. Selecting Text Classification models." srcset="https://substackcdn.com/image/fetch/$s_!Ev9g!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faec83d08-0bfd-4a38-a633-22a6bdd7dc8c_1999x1239.png 424w, https://substackcdn.com/image/fetch/$s_!Ev9g!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faec83d08-0bfd-4a38-a633-22a6bdd7dc8c_1999x1239.png 848w, https://substackcdn.com/image/fetch/$s_!Ev9g!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faec83d08-0bfd-4a38-a633-22a6bdd7dc8c_1999x1239.png 1272w, https://substackcdn.com/image/fetch/$s_!Ev9g!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faec83d08-0bfd-4a38-a633-22a6bdd7dc8c_1999x1239.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div 
class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Model Hub. Selecting our model. </figcaption></figure></div><p>To use it, we need to copy the corresponding name of the model. It can be found within the top section of its specific view.</p><h4><strong>#2 Load a pre-trained model</strong></h4><p>Now that we already know what model to use, let&#8217;s use it in Python. 
First, we need to import the <code>AutoTokenizer</code> and <code>AutoModelForSequenceClassification</code> classes from <code>transformers</code>, along with <code>pipeline</code>, which we will use later to run the model.</p><p>Using these Auto classes will automatically infer the model architecture from the model name.</p><pre><code><code>from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

model_name = "lxyuan/distilbert-base-multilingual-cased-sentiments-student"

# We define a model object from the pre-trained checkpoint
model = AutoModelForSequenceClassification.from_pretrained(model_name)</code></code></pre><h4><strong>#3 Prepare your input</strong></h4><p>Next, load a tokenizer for our model. The transformers library facilitates this step by inferring the tokenizer to use from the name of the model we have chosen.</p><pre><code><code># We call the tokenizer class
tokenizer = AutoTokenizer.from_pretrained(model_name)
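
# Quick illustration (this sentence is just an example): calling the
# tokenizer turns raw text into a dict with "input_ids" and an
# "attention_mask", which is what the model actually consumes
encoded = tokenizer("I love this tutorial!")
print(encoded["input_ids"])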
</code></code></pre><h4><strong>#4 Run the model</strong></h4><p>Generate a pipeline object with the chosen model, the tokenizer, and the task to be performed. In our case, sentiment analysis. If you initialize the classifier with only the task, the pipeline class fills in default values for the model and tokenizer, which is handy for quick tests but not recommended in production.</p><pre><code><code># Initializing a classifier with a model and a tokenizer
classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
# When passing only the task, pipeline infers both the model and the tokenizer.
classifier = pipeline("sentiment-analysis")
</code></code></pre><p>We can execute this model by introducing some input.</p><pre><code><code>output = classifier("I've been waiting for this tutorial all my life!")
print(output)</code></code></pre><p>And we will obtain the results right away!</p><p><em>Which leads to the following (and final) step&#8230;</em></p><h4><strong>#5 Interpret the outputs</strong></h4><p>The model will return an object containing various elements depending on the model&#8217;s class. For this sentiment analysis example, we will get:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!j97a!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c045ace-ae69-445e-aa72-b0e984c15ac3_1062x60.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!j97a!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c045ace-ae69-445e-aa72-b0e984c15ac3_1062x60.png 424w, https://substackcdn.com/image/fetch/$s_!j97a!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c045ace-ae69-445e-aa72-b0e984c15ac3_1062x60.png 848w, https://substackcdn.com/image/fetch/$s_!j97a!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c045ace-ae69-445e-aa72-b0e984c15ac3_1062x60.png 1272w, https://substackcdn.com/image/fetch/$s_!j97a!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c045ace-ae69-445e-aa72-b0e984c15ac3_1062x60.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!j97a!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c045ace-ae69-445e-aa72-b0e984c15ac3_1062x60.png" width="1062" height="60" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7c045ace-ae69-445e-aa72-b0e984c15ac3_1062x60.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:60,&quot;width&quot;:1062,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Obtained output.&quot;,&quot;title&quot;:&quot;Obtained output.&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Obtained output." title="Obtained output." srcset="https://substackcdn.com/image/fetch/$s_!j97a!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c045ace-ae69-445e-aa72-b0e984c15ac3_1062x60.png 424w, https://substackcdn.com/image/fetch/$s_!j97a!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c045ace-ae69-445e-aa72-b0e984c15ac3_1062x60.png 848w, https://substackcdn.com/image/fetch/$s_!j97a!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c045ace-ae69-445e-aa72-b0e984c15ac3_1062x60.png 1272w, https://substackcdn.com/image/fetch/$s_!j97a!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c045ace-ae69-445e-aa72-b0e984c15ac3_1062x60.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>In this instance, the input string has 
been classified with the &#8220;Positive&#8221; label, with a confidence score of 0.579. This score reflects the model&#8217;s certainty in its classification.</p><p><strong>A second task we can do using HF is fine-tuning a model. </strong></p><h3>2. Fine-tuning models</h3><p>Fine-tuning is the process of taking a pre-trained model and updating its parameters by training on a dataset specific to your task. This allows you to leverage the model&#8217;s learned representations and adapt them to your use case.</p><p>Imagine we need to use a text-classifier model to infer sentiments from a list of tweets. One natural question that comes to mind is: </p><p><em>Will this pre-trained model work properly?</em></p><p>To make sure it does, we can take advantage of fine-tuning by training a pre-trained Hugging Face model with a dataset containing tweets and their corresponding sentiments so its performance improves.</p><p><strong>Here&#8217;s a basic example of fine-tuning a model for sequence classification:</strong></p><h4><strong>#1. Choose a pre-trained model and a dataset</strong></h4><p>Select a model architecture suitable for your task. In this case, we want to keep using the same sentiment analysis model. </p><p><strong>However, now we need some data to train our model. </strong>And this is precisely where the <code>datasets</code> library kicks in. We can browse all the datasets on the Hugging Face Hub and find the one that fits best.</p><p><strong>In my case, I&#8217;ll be using the twitter-sentiment-analysis dataset. 
</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ovdM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe88a8a37-c53c-4e0f-8603-a0efd5446606_1999x1206.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ovdM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe88a8a37-c53c-4e0f-8603-a0efd5446606_1999x1206.png 424w, https://substackcdn.com/image/fetch/$s_!ovdM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe88a8a37-c53c-4e0f-8603-a0efd5446606_1999x1206.png 848w, https://substackcdn.com/image/fetch/$s_!ovdM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe88a8a37-c53c-4e0f-8603-a0efd5446606_1999x1206.png 1272w, https://substackcdn.com/image/fetch/$s_!ovdM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe88a8a37-c53c-4e0f-8603-a0efd5446606_1999x1206.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ovdM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe88a8a37-c53c-4e0f-8603-a0efd5446606_1999x1206.png" width="1456" height="878" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e88a8a37-c53c-4e0f-8603-a0efd5446606_1999x1206.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:878,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Screenshot 
of Hugging Face Datasets Hub main view. Selecting Sentiment analysis datasets.&quot;,&quot;title&quot;:&quot;Screenshot of Hugging Face Datasets Hub main view. Selecting Sentiment analysis datasets.&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Screenshot of Hugging Face Datasets Hub main view. Selecting Sentiment analysis datasets." title="Screenshot of Hugging Face Datasets Hub main view. Selecting Sentiment analysis datasets." srcset="https://substackcdn.com/image/fetch/$s_!ovdM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe88a8a37-c53c-4e0f-8603-a0efd5446606_1999x1206.png 424w, https://substackcdn.com/image/fetch/$s_!ovdM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe88a8a37-c53c-4e0f-8603-a0efd5446606_1999x1206.png 848w, https://substackcdn.com/image/fetch/$s_!ovdM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe88a8a37-c53c-4e0f-8603-a0efd5446606_1999x1206.png 1272w, https://substackcdn.com/image/fetch/$s_!ovdM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe88a8a37-c53c-4e0f-8603-a0efd5446606_1999x1206.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"></svg></button></div></div></div></a><figcaption class="image-caption">Datasets section.</figcaption></figure></div><p>Now that we know which dataset to use, we can initialize both the model and the dataset.</p><pre><code><code>from datasets import load_dataset

model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Loading the dataset to train our model
dataset = load_dataset("mteb/tweet_sentiment_extraction")
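
# Peek at the splits: the object is a DatasetDict with "train" and "test"
print(dataset)

# Preview the training subset as a pandas DataFrame
print(dataset["train"].to_pandas().head())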
</code></code></pre><p>If we check the dataset we just downloaded, it is a dictionary containing a subset for training and a subset for testing. If we convert the training subset to a DataFrame, it looks as follows:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0YAl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7359fe2f-137a-4fed-b24e-b60401e03c1d_1246x874.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0YAl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7359fe2f-137a-4fed-b24e-b60401e03c1d_1246x874.png 424w, https://substackcdn.com/image/fetch/$s_!0YAl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7359fe2f-137a-4fed-b24e-b60401e03c1d_1246x874.png 848w, https://substackcdn.com/image/fetch/$s_!0YAl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7359fe2f-137a-4fed-b24e-b60401e03c1d_1246x874.png 1272w, https://substackcdn.com/image/fetch/$s_!0YAl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7359fe2f-137a-4fed-b24e-b60401e03c1d_1246x874.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0YAl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7359fe2f-137a-4fed-b24e-b60401e03c1d_1246x874.png" width="1246" height="874" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7359fe2f-137a-4fed-b24e-b60401e03c1d_1246x874.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:874,&quot;width&quot;:1246,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The data set to be used.&quot;,&quot;title&quot;:&quot;The data set to be used.&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The data set to be used." title="The data set to be used." srcset="https://substackcdn.com/image/fetch/$s_!0YAl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7359fe2f-137a-4fed-b24e-b60401e03c1d_1246x874.png 424w, https://substackcdn.com/image/fetch/$s_!0YAl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7359fe2f-137a-4fed-b24e-b60401e03c1d_1246x874.png 848w, https://substackcdn.com/image/fetch/$s_!0YAl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7359fe2f-137a-4fed-b24e-b60401e03c1d_1246x874.png 1272w, https://substackcdn.com/image/fetch/$s_!0YAl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7359fe2f-137a-4fed-b24e-b60401e03c1d_1246x874.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" 
stroke="var(--color-fg-primary)" xmlns="http://www.w3.org/2000/svg"></svg></button></div></div></div></a><figcaption class="image-caption">The dataset we are using.</figcaption></figure></div><h4><strong>#2. Prepare Your dataset</strong></h4><p>Now that we have our dataset, we need a tokenizer to prepare it so the model can parse it. The <code>text</code> column of our dataset needs to be tokenized before we can use it to fine-tune our model.</p><p>This is why the second step is to load a pre-trained tokenizer and tokenize our dataset so it can be used for fine-tuning.</p><pre><code><code>tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)</code></code></pre><h4><strong>#3. Build a PyTorch dataset with encodings</strong></h4><p>The third step is to generate a training and a testing dataset. The training set will be used to fine-tune our model, while the testing set will be used to evaluate it.</p><p>Usually, the fine-tuning process takes a lot of time. </p><p><em>(To keep this tutorial fast, we randomly sample both subsets so your computation time is lower.)</em></p><pre><code><code># Randomly sample 1,000 examples from each split
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
</code></code></pre><h4><strong>#4. Fine-tune the model</strong></h4><p>Our final step is to set up the training arguments and start the training process. The transformers library contains the <code>Trainer</code> class, which takes care of everything.</p><p>We first define the training arguments together with the evaluation strategy. Once everything is defined, we can train the model with the <code>train()</code> method.</p><pre><code><code>from transformers import Trainer, TrainingArguments
import numpy as np
import evaluate

training_args = TrainingArguments(output_dir="trainer_output", evaluation_strategy="epoch")

metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
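
# Sanity check of the argmax step with toy logits (no model needed):
# row 0 predicts class 1 and row 1 predicts class 0, so against labels
# [1, 1] the accuracy comes out to 0.5
toy_logits = np.array([[0.1, 0.9], [0.8, 0.2]])
toy_preds = np.argmax(toy_logits, axis=-1)
print((toy_preds == np.array([1, 1])).mean())  # 0.5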


trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)
trainer.train()
</code></code></pre><h4><strong>#5. Evaluate the model</strong></h4><p>After training, evaluate the model&#8217;s performance on a validation or test set. Again, the <code>Trainer</code> class already provides an <code>evaluate()</code> method that takes care of this.</p><pre><code><code># Returns a dict of metrics (e.g. eval_loss and eval_accuracy) on the eval set
trainer.evaluate()
</code></code></pre><p>Our fine-tuned model achieves an accuracy of 70%.</p><p>Now that we have improved our model, how can we share it with the community? </p><p><em>This brings us to our final step&#8230;</em></p><h4>#6. Sharing Models</h4><p>Once we&#8217;ve fine-tuned our new model, the best idea is to share it with the community.</p><p>Hugging Face makes this process straightforward. First, we need to install the <code>huggingface_hub</code> library.</p><p>A requirement for this final step is an active token to connect to your Hugging Face account. <strong><a href="https://huggingface.co/docs/hub/security-tokens">You can easily get one following this guideline.</a></strong> When working in a Jupyter Notebook, we can import the <code>notebook_login</code> function.</p><pre><code><code>from huggingface_hub import notebook_login

notebook_login()
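
# Once the login above succeeds, the fine-tuned model and tokenizer can be
# uploaded with push_to_hub ("my-finetuned-sentiments" is a hypothetical
# repository name; pick your own)
model.push_to_hub("my-finetuned-sentiments")
tokenizer.push_to_hub("my-finetuned-sentiments")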
</code></code></pre><p>This will display a login prompt within our Jupyter Notebook. We just need to submit our token, and our notebook will be connected to our Hugging Face account.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SKD6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ef073e1-e2fa-4ee0-b5f9-e3534d98cb7b_855x501.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SKD6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ef073e1-e2fa-4ee0-b5f9-e3534d98cb7b_855x501.png 424w, https://substackcdn.com/image/fetch/$s_!SKD6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ef073e1-e2fa-4ee0-b5f9-e3534d98cb7b_855x501.png 848w, https://substackcdn.com/image/fetch/$s_!SKD6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ef073e1-e2fa-4ee0-b5f9-e3534d98cb7b_855x501.png 1272w, https://substackcdn.com/image/fetch/$s_!SKD6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ef073e1-e2fa-4ee0-b5f9-e3534d98cb7b_855x501.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!SKD6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ef073e1-e2fa-4ee0-b5f9-e3534d98cb7b_855x501.png" width="855" height="501" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5ef073e1-e2fa-4ee0-b5f9-e3534d98cb7b_855x501.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:501,&quot;width&quot;:855,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Hugging Face login dialogue&quot;,&quot;title&quot;:&quot;Hugging Face login dialogue&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Hugging Face login dialogue" title="Hugging Face login dialogue" srcset="https://substackcdn.com/image/fetch/$s_!SKD6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ef073e1-e2fa-4ee0-b5f9-e3534d98cb7b_855x501.png 424w, https://substackcdn.com/image/fetch/$s_!SKD6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ef073e1-e2fa-4ee0-b5f9-e3534d98cb7b_855x501.png 848w, https://substackcdn.com/image/fetch/$s_!SKD6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ef073e1-e2fa-4ee0-b5f9-e3534d98cb7b_855x501.png 1272w, https://substackcdn.com/image/fetch/$s_!SKD6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ef073e1-e2fa-4ee0-b5f9-e3534d98cb7b_855x501.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" 
stroke="var(--color-fg-primary)" xmlns="http://www.w3.org/2000/svg"></svg></button></div></div></div></a></figure></div><p><strong>After this, the model will be available for everyone in our Hugging Face profile.</strong></p><h3><strong>4 use-cases you can start doing today</strong></h3><p>If we want to standardize any NLP process, Hugging Face makes it incredibly simple, allowing us to build any pipeline in just three steps:</p>
      <p>
          <a href="https://www.databites.tech/p/how-to-actually-get-started-with-b80">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[How to Actually Get Started with SQL]]></title><description><![CDATA[CS16 - A clear (and human) guide to get started without drowning]]></description><link>https://www.databites.tech/p/how-to-actually-get-started-with-228</link><guid isPermaLink="false">https://www.databites.tech/p/how-to-actually-get-started-with-228</guid><dc:creator><![CDATA[Josep Ferrer]]></dc:creator><pubDate>Wed, 22 Oct 2025 10:02:42 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/e79bace7-d73f-409e-9830-c05b7103c75a_976x704.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Many of you have been asking how to get started in the data world. I know it can seem <strong>complex</strong> and <strong>intimidating</strong>, <em>but fear often clouds our vision. </em></p><p>That&#8217;s why I want to remind you all that SQL is still the number one data language and the easiest one to learn. </p><p>If you&#8217;re looking to break into this field, there&#8217;s no better advice than&#8230;</p><blockquote><p>START</p><p>LEARNING</p><p>SQL</p><p>RIGHT</p><p>&#8230;</p></blockquote>
      <p>
          <a href="https://www.databites.tech/p/how-to-actually-get-started-with-228">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[How to Actually Get Started with Python]]></title><description><![CDATA[CS15 - A clear (and human) guide to get started without drowning]]></description><link>https://www.databites.tech/p/how-to-actually-get-started-with-5e7</link><guid isPermaLink="false">https://www.databites.tech/p/how-to-actually-get-started-with-5e7</guid><dc:creator><![CDATA[Josep Ferrer]]></dc:creator><pubDate>Tue, 14 Oct 2025 10:02:52 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/eaa3293c-87f6-405a-985f-7f92a93b27f2_976x704.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>You&#8217;ve wanted to learn Python for a while&#8230;</strong></p><p><em>Too many tabs, not enough progress? </em></p><blockquote><p>This guide cuts the noise and gives you a shippable path. </p></blockquote><p><strong>Only the pieces that actually move you forward.</strong></p><h1>Why this, why now</h1><p>Python is the most versatile &#8220;one language, many careers&#8221; tool: analytics, ML, web, scripting, automation, LLM apps&#8212;you name it.<br>If you learn it now, you co&#8230;</p>
      <p>
          <a href="https://www.databites.tech/p/how-to-actually-get-started-with-5e7">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[How to Actually Get Started with LLMs]]></title><description><![CDATA[A clear (and human) guide to get started with LLMs without drowning]]></description><link>https://www.databites.tech/p/how-to-actually-get-started-with</link><guid isPermaLink="false">https://www.databites.tech/p/how-to-actually-get-started-with</guid><dc:creator><![CDATA[Josep Ferrer]]></dc:creator><pubDate>Wed, 08 Oct 2025 13:33:57 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/aa672bbc-1acb-447c-b2a4-5259717b2089_976x864.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>LLMs are moving faster than your backlog. </p><p><strong>Feeling behind?</strong> You&#8217;re not.<br>Today&#8217;s issue compresses the essentials (what matters, what doesn&#8217;t) into a buildable path. </p><blockquote><p>Minimal theory, maximum leverage.</p></blockquote><p><strong>Following my Transformers cheat sheets (<a href="https://www.databites.tech/p/cs8-the-transformers-architectur">architecture</a>, <a href="https://www.databites.tech/p/cs9-the-transformers-architecture">encoder</a>, <a href="https://www.databites.tech/p/cs10-understanding-the-decoder-part">decoder</a>), today we go end-to-end: </strong></p><ol><li><p><strong>What to learn</strong></p></li><li><p><strong>What to build first</strong></p></li><li><p><strong>How to avoid the rabbit holes.</strong></p></li></ol><p>&#9888;&#65039; <em>It&#8217;s a longer, denser issue &#8212; but it&#8217;s meant to be a keeper. 
Bookmark it, steal the prompts, and ship something this week.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.databites.tech/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.databites.tech/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><h2>TL;DR (paste this in your notes)</h2><ul><li><p><strong>LLMs &#8800; magic.</strong> Learn the <em>Transformer + tokens + pretrain&#8594;post-train&#8594;inference</em> pipeline.</p></li><li><p><strong>Start &#8220;outside in.&#8221;</strong> Ship value via APIs or open models first; fine-tune later.</p></li><li><p><strong>Leverage &gt; novelty.</strong> Framing, evaluation, and alignment matter more than training a giant from scratch.</p></li></ul><div><hr></div><h2>Why this, why now</h2><p><strong>Understanding LLMs and GenAI is crucial for everyone, from seasoned data professionals to beginners, as they are set to revolutionize text data processing and our future. </strong>With new models and applications constantly emerging, it&#8217;s essential to stay updated and maintain sharp skills in this rapidly evolving field.</p><h2>#1 <strong>Understanding the Basics</strong></h2><h4>What are LLMs?</h4><p>Large Language Models are a type of artificial intelligence trained on extensive text datasets. These models can generate human-like text, understand context, and even carry on conversations. 
They&#8217;re used in various applications, from chatbots to content creation and beyond.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!f8Sn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7caff1e8-11f1-4f6a-add4-2d01e181c18a_3327x876.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!f8Sn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7caff1e8-11f1-4f6a-add4-2d01e181c18a_3327x876.png 424w, https://substackcdn.com/image/fetch/$s_!f8Sn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7caff1e8-11f1-4f6a-add4-2d01e181c18a_3327x876.png 848w, https://substackcdn.com/image/fetch/$s_!f8Sn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7caff1e8-11f1-4f6a-add4-2d01e181c18a_3327x876.png 1272w, https://substackcdn.com/image/fetch/$s_!f8Sn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7caff1e8-11f1-4f6a-add4-2d01e181c18a_3327x876.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!f8Sn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7caff1e8-11f1-4f6a-add4-2d01e181c18a_3327x876.png" width="574" height="150.9903846153846" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7caff1e8-11f1-4f6a-add4-2d01e181c18a_3327x876.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:383,&quot;width&quot;:1456,&quot;resizeWidth&quot;:574,&quot;bytes&quot;:145258,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!f8Sn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7caff1e8-11f1-4f6a-add4-2d01e181c18a_3327x876.png 424w, https://substackcdn.com/image/fetch/$s_!f8Sn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7caff1e8-11f1-4f6a-add4-2d01e181c18a_3327x876.png 848w, https://substackcdn.com/image/fetch/$s_!f8Sn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7caff1e8-11f1-4f6a-add4-2d01e181c18a_3327x876.png 1272w, https://substackcdn.com/image/fetch/$s_!f8Sn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7caff1e8-11f1-4f6a-add4-2d01e181c18a_3327x876.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><strong>So&#8230; why are they so popular?</strong></p><p>LLMs are popular due to their ability to generate coherent, contextually relevant, and grammatically accurate text. 
<strong>Their exceptional performance on diverse language tasks and the accessibility of pre-trained models have democratized AI-powered natural language understanding and generation.</strong></p><h4>LLMs&#8217; core components</h4><p>Key concepts of LLMs include:</p><ul><li><p><strong>Transformer Architecture: </strong>The backbone of LLMs, featuring self-attention mechanisms that let the model weigh the importance of each word in a sentence.</p></li><li><p><strong>Tokenization</strong>: Breaking text down into manageable pieces, or tokens. This is performed by <strong>tokenizers</strong>. </p></li><li><p><strong>Pre-training:</strong> Training the model on a large corpus of text to learn language patterns, grammar, and context.</p></li><li><p><strong>Fine-tuning:</strong> Adapting the pre-trained model to specific tasks using smaller, task-specific datasets.</p></li><li><p><strong>NLU (Natural Language Understanding):</strong> The ability to understand and interpret human language.</p></li><li><p><strong>NLG (Natural Language Generation):</strong> The ability to generate coherent and contextually relevant text.</p></li><li><p><strong>Prompt Engineering: </strong>Crafting input prompts to guide the model towards generating desired outputs, essential when working with models via API access.</p></li></ul><h4>Main Differences between LLMs and Deep Learning Models</h4><p>LLMs differ from other deep learning models primarily due to their size and their use of self-attention mechanisms. 
Key differentiators include:</p><ul><li><p><strong>Transformer Architecture:</strong> This revolutionary design underpins LLMs and has transformed natural language processing.</p></li><li><p><strong>Contextual Understanding:</strong> LLMs capture long-range dependencies in text, enhancing their contextual comprehension.</p></li><li><p><strong>Versatility:</strong> They excel in various language tasks, including text generation, translation, summarization, and question-answering.</p></li></ul><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.databites.tech/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.databites.tech/subscribe?"><span>Subscribe now</span></a></p><h2>#2 <strong>How to get started with LLMs?</strong></h2><h4>1. Understanding the Transformer Architecture in LLMs</h4><p>Now that you&#8217;re familiar with LLMs, let&#8217;s delve into the Transformer architecture that powers these models. 
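Before the formal tour, it can help to see the core operation in code. Here is a minimal single-head sketch of scaled dot-product attention in NumPy (a simplification for intuition: real implementations add learned query/key/value projections, multiple heads, and masking):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each position mixes the value vectors V, weighted by how well
    its query matches every key (softmax of scaled dot products)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (seq, seq) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V                               # contextualized outputs

# Toy example: 3 tokens with 4-dimensional embeddings,
# self-attention means Q = K = V = the input itself.
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # (3, 4)
```

In a real Transformer, Q, K, and V come from learned linear projections of the input, and many such heads run in parallel.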
The original Transformer, introduced in the paper <em><a href="https://arxiv.org/abs/1706.03762">Attention Is All You Need</a></em>, revolutionized natural language processing.</p><h4>Key Features:</h4><ul><li><p><strong>Self-Attention Layers:</strong> Allow the model to focus on different parts of the input sequence.</p></li><li><p><strong>Multi-Head Attention:</strong> Enables the model to attend to information from different representation subspaces.</p></li><li><p><strong>Feed-Forward Neural Networks:</strong> Process the output from the attention mechanism.</p></li><li><p><strong>Encoder-Decoder Architecture:</strong> Facilitates tasks like translation.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!F4X4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d80fe6d-d7d9-4435-a5f6-e594d7ef9c1c_4133x4529.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!F4X4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d80fe6d-d7d9-4435-a5f6-e594d7ef9c1c_4133x4529.png 424w, https://substackcdn.com/image/fetch/$s_!F4X4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d80fe6d-d7d9-4435-a5f6-e594d7ef9c1c_4133x4529.png 848w, https://substackcdn.com/image/fetch/$s_!F4X4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d80fe6d-d7d9-4435-a5f6-e594d7ef9c1c_4133x4529.png 1272w, 
https://substackcdn.com/image/fetch/$s_!F4X4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d80fe6d-d7d9-4435-a5f6-e594d7ef9c1c_4133x4529.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!F4X4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d80fe6d-d7d9-4435-a5f6-e594d7ef9c1c_4133x4529.png" width="508" height="556.8461538461538" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1d80fe6d-d7d9-4435-a5f6-e594d7ef9c1c_4133x4529.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1596,&quot;width&quot;:1456,&quot;resizeWidth&quot;:508,&quot;bytes&quot;:870800,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!F4X4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d80fe6d-d7d9-4435-a5f6-e594d7ef9c1c_4133x4529.png 424w, https://substackcdn.com/image/fetch/$s_!F4X4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d80fe6d-d7d9-4435-a5f6-e594d7ef9c1c_4133x4529.png 848w, https://substackcdn.com/image/fetch/$s_!F4X4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d80fe6d-d7d9-4435-a5f6-e594d7ef9c1c_4133x4529.png 1272w, 
https://substackcdn.com/image/fetch/$s_!F4X4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d80fe6d-d7d9-4435-a5f6-e594d7ef9c1c_4133x4529.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Transformers Architecture</figcaption></figure></div><p>Remember, you can learn more about it <a href="https://www.databites.tech/p/cs8-the-transformers-architecture">in the following article</a> about the Transformers Architecture. </p><h4>2. 
Pre-training LLMs</h4><p>Now that you understand the fundamentals of LLMs and the transformer architecture, it&#8217;s time to explore pre-training LLMs. Pre-training is crucial for enabling LLMs to grasp human language by exposing them to huge amounts of text. </p><p><strong>This part is (usually) performed by companies like OpenAI, Google, DeepSeek, Meta, or Anthropic. </strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rKfp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd92e1989-5bd2-4b5d-a9de-0e4b565c4721_3347x1752.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rKfp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd92e1989-5bd2-4b5d-a9de-0e4b565c4721_3347x1752.png 424w, https://substackcdn.com/image/fetch/$s_!rKfp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd92e1989-5bd2-4b5d-a9de-0e4b565c4721_3347x1752.png 848w, https://substackcdn.com/image/fetch/$s_!rKfp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd92e1989-5bd2-4b5d-a9de-0e4b565c4721_3347x1752.png 1272w, https://substackcdn.com/image/fetch/$s_!rKfp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd92e1989-5bd2-4b5d-a9de-0e4b565c4721_3347x1752.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rKfp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd92e1989-5bd2-4b5d-a9de-0e4b565c4721_3347x1752.png" width="462" 
height="241.78846153846155" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d92e1989-5bd2-4b5d-a9de-0e4b565c4721_3347x1752.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:762,&quot;width&quot;:1456,&quot;resizeWidth&quot;:462,&quot;bytes&quot;:265118,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!rKfp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd92e1989-5bd2-4b5d-a9de-0e4b565c4721_3347x1752.png 424w, https://substackcdn.com/image/fetch/$s_!rKfp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd92e1989-5bd2-4b5d-a9de-0e4b565c4721_3347x1752.png 848w, https://substackcdn.com/image/fetch/$s_!rKfp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd92e1989-5bd2-4b5d-a9de-0e4b565c4721_3347x1752.png 1272w, https://substackcdn.com/image/fetch/$s_!rKfp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd92e1989-5bd2-4b5d-a9de-0e4b565c4721_3347x1752.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h4>Key Concepts:</h4><ul><li><p><strong>Objectives of Pre-training:</strong> LLMs learn language patterns, grammar, and context through exposure to extensive text corpora. <strong>Key tasks include masked language modeling and next sentence prediction.</strong></p></li><li><p><strong>Text Corpus for Pre-training:</strong> LLMs are trained on diverse and massive datasets, including web articles, books, and more, with billions to trillions of text tokens. 
Common datasets include C4, BookCorpus, The Pile, and OpenWebText.</p></li><li><p><strong>Training Procedure:</strong> Understand the technical aspects such as optimization algorithms, batch sizes, and training epochs, and learn about challenges like mitigating data biases.</p></li></ul><p>For further learning, <a href="https://stanford-cs324.github.io/winter2022/lectures/training/">check out the module on LLM training from CS324: Large Language Models.</a> </p><p>As training an LLM from scratch requires a lot of resources, we can access pre-trained models directly via API (OpenAI, Google&#8230;) or use open-source models from Hugging Face. </p><h4>3. Accessing and Using LLMs</h4><p>In today&#8217;s landscape, accessing and utilizing LLMs has become easier than ever, thanks to both commercial APIs and open-source platforms. </p><h5>Using Commercial APIs </h5><p>The most common provider is OpenAI with its GPT models, but others, like Anthropic, offer similar APIs. </p><ul><li><p><strong>API Access:</strong> OpenAI provides robust API access to its models, such as GPT-4 and ChatGPT, allowing developers to integrate powerful language capabilities into their applications.</p></li><li><p><strong>Ease of Use: </strong>With simple HTTP requests, you can send text prompts to the API and receive generated responses. 
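Concretely, such a request is just a small JSON document. A sketch of how you might assemble one (the model name, parameter names, and shape here mirror common chat-completion APIs but are placeholders; always confirm against your provider's documentation):

```python
import json

def build_chat_request(prompt: str, model: str = "gpt-4o-mini",
                       temperature: float = 0.7, max_tokens: int = 256) -> dict:
    """Assemble the JSON body for a typical chat-completions request.
    Field names follow the common chat-API convention; treat them as
    illustrative and check the provider's reference before relying on them."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,   # higher = more random sampling
        "max_tokens": max_tokens,     # cap on generated tokens
    }

body = build_chat_request("Summarize self-attention in one sentence.")
print(json.dumps(body, indent=2))
# Sending it is then a single authenticated POST to the provider's
# chat endpoint, typically via the official SDK.
```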
The API supports various parameters to fine-tune the behavior of the model, such as temperature, max tokens, and more.</p></li><li><p><strong>Applications: </strong>This API is versatile and can be used for chatbots, content generation, summarization, translation, and other NLP tasks.</p></li></ul><h5>Using Open-Source Models (Hugging Face)</h5><ul><li><p><strong>Model Hub: </strong>Hugging Face offers a vast repository of open-source models, including versions of GPT, BERT, T5, Mistral, Meta&#8217;s Llama and many more, which can be accessed for specific tasks.</p></li><li><p><strong>Transformers Library:</strong> The Transformers library by Hugging Face provides a comprehensive toolkit for using and fine-tuning these models. It supports multiple frameworks, including TensorFlow and PyTorch.</p></li><li><p><strong>Ease of Use: </strong>With Hugging Face, you can load pre-trained models with just a few lines of code and fine-tune them on your dataset. The library also offers utilities for tokenization, training, and deploying models.</p></li></ul><h4>4. Fine-Tuning LLMs</h4><p>Once we know how to access and use pre-trained LLMs, the next step is understanding the process of fine-tuning and how to train them for specific tasks. Fine-tuning tailors pre-trained models to perform tasks like sentiment analysis, question answering, or translation with greater accuracy and efficiency.</p><h5>Why Fine-Tune LLMs?</h5><ul><li><p><strong>Task-Specific Performance:</strong> While pre-trained LLMs have a general understanding of language, fine-tuning is essential to excel in specific tasks by learning their unique nuances.</p></li><li><p><strong>Efficiency:</strong> Fine-tuning leverages the pre-trained model&#8217;s knowledge, reducing the data and computation needed compared to training from scratch. 
This process requires a much smaller dataset.</p></li></ul><h5>Fine-Tuning LLMs with access to their weights</h5><ol><li><p><strong>Choose the Pre-trained LLM:</strong> Select a pre-trained model that suits your task. For instance, for question-answering, choose a model designed for natural language understanding.</p></li><li><p><strong>Data Preparation:</strong> Prepare a labeled dataset for your specific task, ensuring it is properly formatted.</p></li><li><p><strong>Fine-Tuning Process:</strong></p><ul><li><p>Use parameter-efficient techniques to fine-tune the model, considering LLMs have tens of billions of parameters.</p></li><li><p>If you don&#8217;t have access to the weights, explore alternative approaches or frameworks that facilitate fine-tuning without direct weight manipulation.</p></li></ul></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4z6w!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F094475f9-dd64-4ad9-8a28-71376ea39355_4793x1778.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4z6w!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F094475f9-dd64-4ad9-8a28-71376ea39355_4793x1778.png 424w, https://substackcdn.com/image/fetch/$s_!4z6w!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F094475f9-dd64-4ad9-8a28-71376ea39355_4793x1778.png 848w, https://substackcdn.com/image/fetch/$s_!4z6w!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F094475f9-dd64-4ad9-8a28-71376ea39355_4793x1778.png 1272w, 
https://substackcdn.com/image/fetch/$s_!4z6w!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F094475f9-dd64-4ad9-8a28-71376ea39355_4793x1778.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4z6w!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F094475f9-dd64-4ad9-8a28-71376ea39355_4793x1778.png" width="1456" height="540" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/094475f9-dd64-4ad9-8a28-71376ea39355_4793x1778.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:540,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:381128,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!4z6w!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F094475f9-dd64-4ad9-8a28-71376ea39355_4793x1778.png 424w, https://substackcdn.com/image/fetch/$s_!4z6w!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F094475f9-dd64-4ad9-8a28-71376ea39355_4793x1778.png 848w, https://substackcdn.com/image/fetch/$s_!4z6w!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F094475f9-dd64-4ad9-8a28-71376ea39355_4793x1778.png 1272w, 
https://substackcdn.com/image/fetch/$s_!4z6w!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F094475f9-dd64-4ad9-8a28-71376ea39355_4793x1778.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>By following these steps, you can adapt pre-trained LLMs to achieve optimal performance on your desired tasks. <a href="https://www.kdnuggets.com/7-steps-to-mastering-large-language-model-fine-tuning">You can read more about it here. 
</a></p><h5>Fine-Tuning LLMs Without Access to Model Weights</h5><p>When you don&#8217;t have access to an LLM&#8217;s weights and must use an API, you can still adapt the model&#8217;s behavior using in-context learning and prompt tuning.</p><ol><li><p><strong>In-Context Learning:</strong> Leverage the LLM&#8217;s ability to learn from provided examples. By giving input-output examples within the prompt, the model can perform tasks without explicit fine-tuning.</p></li><li><p><strong>Prompt Tuning:</strong></p><ul><li><p><strong>Hard Prompt Tuning:</strong> Modify the input tokens directly in the prompt to guide the model&#8217;s output.</p></li><li><p><strong>Soft Prompt Tuning:</strong> Concatenate the input embedding with a learnable tensor. Prefix tuning is a related approach where learnable tensors are used with each Transformer block, not just the input embeddings.</p></li></ul></li><li><p><strong>Parameter-Efficient Fine-Tuning Techniques (PEFT):</strong></p><ul><li><p><strong>LoRA and QLoRA:</strong> These techniques allow fine-tuning by introducing a small set of learnable parameters, called adapters, instead of updating the entire weight matrix. QLoRA, for instance, enables fine-tuning a 4-bit quantized LLM on a single consumer GPU with minimal performance loss. Note that LoRA-style methods do require loading a model&#8217;s weights locally, so they apply to open models rather than API-only ones.</p></li></ul></li></ol><p>By using these methods, you can adapt LLMs for specific tasks efficiently, even without direct access to the model&#8217;s weights. 
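To make in-context learning concrete: it amounts to packing demonstrations into the prompt itself. A minimal sketch (the sentiment task, labels, and formatting are invented for illustration):

```python
def few_shot_prompt(examples, query,
                    instruction="Classify the sentiment as positive or negative."):
    """Build a few-shot prompt: an instruction, then input->output
    demonstrations, then the new input the model should complete."""
    lines = [instruction, ""]
    for text, label in examples:
        lines.append(f"Review: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    lines.append(f"Review: {query}")
    lines.append("Sentiment:")  # the model continues from here
    return "\n".join(lines)

demos = [("Loved every minute of it.", "positive"),
         ("A tedious, predictable mess.", "negative")]
prompt = few_shot_prompt(demos, "Surprisingly fun and heartfelt.")
print(prompt)
```

The assembled string is sent as an ordinary prompt; no weights change, which is exactly why this works through an API.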
Here are some resources to explore further: </p><ul><li><p><strong><a href="https://www.datacamp.com/tutorial/quantization-for-large-language-models">Quantization for LLMs: Reduce AI Model Sizes Efficiently</a></strong></p></li><li><p><strong><a href="https://medium.com/geekculture/prompt-engineering-course-openai-inferring-transforming-expanding-chatgpt-chatgpt4-e5f63132f422">Prompt Engineering Course by OpenAI &#8212; Inferring, Transforming, and Expanding with ChatGPT</a></strong></p></li></ul><p>And don&#8217;t forget to check out my webinar about fine-tuning DistilBERT and Mistral 7B!</p><div id="youtube2-SnGXzb0adLQ" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;SnGXzb0adLQ&quot;,&quot;startTime&quot;:&quot;1s&quot;,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/SnGXzb0adLQ?start=1s&amp;rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><h4>5. Alignment and Post-Training in LLMs</h4><p>LLMs can sometimes generate content that is harmful, biased, or misaligned with user expectations. 
Alignment involves adjusting an LLM&#8217;s behavior to align with human preferences and ethical standards, aiming to reduce the risks of biased, controversial, or harmful content.</p><h5>Techniques to Explore:</h5><ul><li><p><strong>Reinforcement Learning from Human Feedback (RLHF):</strong> This method uses human annotations on LLM outputs to train a reward model, guiding the model to produce more desirable outputs.</p></li><li><p><strong>Contrastive Post-Training:</strong> This technique leverages contrastive methods to automatically create preference pairs, refining the model&#8217;s responses to better match user expectations.</p></li></ul><p>By employing these techniques, you can enhance the alignment of LLMs, ensuring they produce content that is safe, ethical, and aligned with human values.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.databites.tech/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.databites.tech/subscribe?"><span>Subscribe now</span></a></p><h4>6. Evaluating LLMs</h4><p>Evaluating the performance of LLMs is crucial to assess their effectiveness and identify areas for improvement. Key aspects of LLM evaluation include:</p><ol><li><p><strong>Task-Specific Metrics:</strong> Select appropriate metrics for your specific task. 
For example:</p><ul><li><p><strong>Text Classification:</strong> Use metrics like accuracy, precision, recall, and F1 score.</p></li><li><p><strong>Language Generation:</strong> Metrics such as perplexity and BLEU scores are commonly used.</p></li></ul></li><li><p><strong>Human Evaluation:</strong> Have experts or crowdsourced annotators assess the quality of generated content or model responses in real-world scenarios.</p></li><li><p><strong>Bias and Fairness:</strong> Evaluate LLMs for biases and fairness, especially when deploying them in real-world applications. Analyze performance across different demographic groups and address any disparities.</p></li><li><p><strong>Robustness and Adversarial Testing:</strong> Test the LLM&#8217;s robustness by subjecting it to adversarial attacks or challenging inputs to uncover vulnerabilities and enhance model security.</p></li></ol><h4>7. Continuous Learning and Adaptation</h4><p>To keep LLMs updated with new data and tasks, consider these strategies:</p><ol><li><p><strong>Data Augmentation:</strong> Continuously augment your dataset to prevent performance degradation due to outdated information.</p></li><li><p><strong>Retraining:</strong> Periodically retrain the LLM with new data and fine-tune it for evolving tasks to ensure the model stays current.</p></li><li><p><strong>Active Learning:</strong> Implement active learning techniques to identify instances where the model is uncertain or likely to make errors. 
Collect annotations for these instances to refine the model.</p></li></ol><p>Additionally, to mitigate common issues like hallucinations, explore techniques such as retrieval augmentation.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.databites.tech/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.databites.tech/subscribe?"><span>Subscribe now</span></a></p><h2>#3 Building and Deploying LLM Applications</h2><p>Once you&#8217;ve developed and fine-tuned an LLM for specific tasks, the next step is to build and deploy applications that harness the LLM&#8217;s capabilities. This involves creating practical, real-world solutions that make the most of your LLM&#8217;s potential.</p><h4>Building LLM Applications</h4><p>When developing applications that leverage Large Language Models (LLMs), consider the following:</p><ol><li><p><strong>Task-Specific Application Development:</strong></p><p>Tailor your applications to meet specific use cases, such as web interfaces, mobile apps, chatbots, or integrations into existing software systems.</p></li><li><p><strong>User Experience (UX) Design:</strong></p><p>Prioritize user-centered design to ensure your LLM application is intuitive, user-friendly, and meets the needs of your target audience.</p></li><li><p><strong>API Integration:</strong></p><p>If your LLM acts as a language model backend, create RESTful APIs or GraphQL endpoints to facilitate seamless interaction with other software components.</p></li><li><p><strong>Scalability and Performance:</strong></p><p>Design your applications to handle varying levels of traffic and demand. 
Optimize for performance and scalability to provide a smooth and reliable user experience.</p></li></ol><h4>Deploying LLM Applications</h4><p>Now that you&#8217;ve developed your LLM application, it&#8217;s time to deploy it to production. Here are key considerations for a successful deployment:</p><ol><li><p><strong>Cloud Deployment:</strong></p><p>Deploy your LLM applications on cloud platforms like AWS, Google Cloud, or Azure. These platforms offer scalability, reliability, and easy management of resources.</p></li><li><p><strong>Containerization:</strong></p><p>Use containerization technologies such as Docker and Kubernetes to package your applications. This ensures consistent deployment across various environments and simplifies scaling and management.</p></li><li><p><strong>Monitoring:</strong></p><p>Implement robust monitoring solutions to track the performance of your deployed LLM applications. This allows you to detect and address issues in real time, ensuring optimal performance and reliability.</p></li></ol><p>Practical experience is crucial. 
Here&#8217;s how you can get hands-on:</p><ul><li><p><strong><a href="https://www.youtube.com/watch?v=l4HTEf0_s70&amp;list=PLuI8kc1bqP2junKkKVD-5441I8G7oXDTm">Welcome to the Hands-on LLM Course</a></strong> by Pau Labarta Bajo</p></li><li><p><strong><a href="https://www.youtube.com/watch?v=Ku9PM26Cc2c">Hugging Face and PyTorch Lightning</a></strong> by <a href="https://www.linkedin.com/in/jonkrohn/">Jon Krohn</a>.</p></li></ul><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.databites.tech/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.databites.tech/subscribe?"><span>Subscribe now</span></a></p><h3><strong>A final note</strong></h3><p>If you&#8217;ve made it this far, you&#8217;ve already taken the first step: understanding that this isn&#8217;t about knowing everything&#8212;it&#8217;s about moving forward bit by bit.</p><p><strong>With patience, curiosity, and consistency.</strong></p><p>No one starts out knowing.</p><p>But we all start in the same place: by taking the first step.</p><p><em>Are you in?</em></p><p>Hope to see you in the community soon!</p><p>Sincerely,</p><p>&#8212; Josep</p><div><hr></div><h2>Your turn</h2><p>Some final resources to check:</p><ul><li><p><a href="https://arxiv.org/abs/1706.03762">Attention Is All You Need</a> (must read)</p></li><li><p>My illustrated Transformers saga (<a href="https://www.databites.tech/p/cs8-the-transformers-architectur">architecture</a>, <a href="https://www.databites.tech/p/cs9-the-transformers-architecture">encoder</a>, <a href="https://www.databites.tech/p/cs10-understanding-the-decoder-part">decoder</a>)</p></li><li><p><a href="https://stanford-cs324.github.io/winter2022/lectures/modeling/">Module on Modeling from Stanford CS324: Large Language Models</a></p></li><li><p><a 
href="https://huggingface.co/learn/nlp-course/chapter1/1">HuggingFace Transformers Course</a></p></li></ul><div><hr></div><h2><strong>Are you still here? &#129488;</strong></h2><p>&#128073;&#127995; I want this newsletter to be useful, so please let me know your feedback!</p><div class="poll-embed" data-attrs="{&quot;id&quot;:387267}" data-component-name="PollToDOM"></div><div><hr></div><p>Before you go,<strong> tap the &#128154; and the restack buttons at the bottom of this email to show your support</strong>&#8212;<em>it really helps and means a lot!</em></p><p><em>P.S. Share with the coworker who thinks self-attention is a personality trait.</em></p><p><strong>Any doubt? Let&#8217;s start a conversation! &#128071;&#127995;</strong></p>]]></content:encoded></item><item><title><![CDATA[You’re Using ChatGPT Wrong (According to 700M Users)]]></title><description><![CDATA[Notes #13 - Why asking > doing, and how to turn prompts into business decisions.]]></description><link>https://www.databites.tech/p/why-most-people-dont-use-chatgpt</link><guid isPermaLink="false">https://www.databites.tech/p/why-most-people-dont-use-chatgpt</guid><dc:creator><![CDATA[Josep Ferrer]]></dc:creator><pubDate>Tue, 16 Sep 2025 10:02:45 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!T-e9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb51a0eef-c638-4f47-9523-5bb7a5d551b5_830x784.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Hey everyone! 
&#128075;&#127996;</strong></p><p>Josep here, back with your weekly bite of career insights and encouragement &#10024;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!T-e9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb51a0eef-c638-4f47-9523-5bb7a5d551b5_830x784.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!T-e9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb51a0eef-c638-4f47-9523-5bb7a5d551b5_830x784.png 424w, https://substackcdn.com/image/fetch/$s_!T-e9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb51a0eef-c638-4f47-9523-5bb7a5d551b5_830x784.png 848w, https://substackcdn.com/image/fetch/$s_!T-e9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb51a0eef-c638-4f47-9523-5bb7a5d551b5_830x784.png 1272w, https://substackcdn.com/image/fetch/$s_!T-e9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb51a0eef-c638-4f47-9523-5bb7a5d551b5_830x784.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!T-e9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb51a0eef-c638-4f47-9523-5bb7a5d551b5_830x784.png" width="830" height="784" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b51a0eef-c638-4f47-9523-5bb7a5d551b5_830x784.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:784,&quot;width&quot;:830,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1723419,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.databites.tech/i/173735987?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb51a0eef-c638-4f47-9523-5bb7a5d551b5_830x784.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!T-e9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb51a0eef-c638-4f47-9523-5bb7a5d551b5_830x784.png 424w, https://substackcdn.com/image/fetch/$s_!T-e9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb51a0eef-c638-4f47-9523-5bb7a5d551b5_830x784.png 848w, https://substackcdn.com/image/fetch/$s_!T-e9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb51a0eef-c638-4f47-9523-5bb7a5d551b5_830x784.png 1272w, https://substackcdn.com/image/fetch/$s_!T-e9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb51a0eef-c638-4f47-9523-5bb7a5d551b5_830x784.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Enjoying a biking day from Rotterdam to Delft! &#128154;</figcaption></figure></div><h3>A quick gut-check:</h3><p>When you picture ChatGPT, what&#8217;s the first image that pops up?<br>Someone cranking out SQL? Debugging Python? Auto-drafting emails?</p><p>That was my picture too, until I dug into a new OpenAI study cover&#8230;</p>
      <p>
          <a href="https://www.databites.tech/p/why-most-people-dont-use-chatgpt">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[How to Become a Data Scientist]]></title><description><![CDATA[A clear (and human) guide to get started without getting lost]]></description><link>https://www.databites.tech/p/how-to-become-a-data-scientist</link><guid isPermaLink="false">https://www.databites.tech/p/how-to-become-a-data-scientist</guid><dc:creator><![CDATA[Josep Ferrer]]></dc:creator><pubDate>Mon, 15 Sep 2025 13:56:07 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/afb8123c-7433-4503-8ab0-f17ef22a36a9_1465x1296.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>If you&#8217;re reading this, you probably suspect it already: <strong>data science is a fascinating field&#8230; and also overwhelming. </strong></p><p>With so many languages, tools, and possible paths, it&#8217;s easy not to know where to start.</p><p>That&#8217;s why one of the questions I get most is: </p><blockquote><p>How do you become a data scientist?</p></blockquote><p><strong>This article is my attempt to answer it clearly. </strong></p><p>I won&#8217;t promise mag&#8230;</p>
      <p>
          <a href="https://www.databites.tech/p/how-to-become-a-data-scientist">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[SQL COURSE PROBLEM #4]]></title><description><![CDATA[SQL Crash Course - Managing Financial Services Database]]></description><link>https://www.databites.tech/p/sql-course-problem-4</link><guid isPermaLink="false">https://www.databites.tech/p/sql-course-problem-4</guid><dc:creator><![CDATA[Josep Ferrer]]></dc:creator><pubDate>Fri, 06 Jun 2025 11:22:18 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!RryE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29d8ee22-a790-4cbe-9cac-63125f0c89d7_1380x962.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p></p><div class="pullquote"><p><em>All the course material is stored in the <strong><a href="https://github.com/CornelliusYW/SQL-Crash-Course">SQL Crash Course repository</a></strong>.</em></p></div><p>Hi everyone! <strong>Josep</strong> and <a href="https://open.substack.com/users/6000855-cornellius-yudha-wijaya?utm_source=mentions">Cornellius Yudha Wijaya</a> from <a href="https://open.substack.com/pub/cornellius">Non-Brand Data</a> here &#128075;&#127995;</p><p>As promised, today we are publishing the next two issues of our <a href="https://www.databites.tech/p/launching-the-sql-crash-course-from">SQL Crash Course &#8211; From Zero to Hero!</a> &#128640;</p><p>I am sure you are here to continue our <strong>SQL Crash Course Journey!&#128218;</strong></p><p>If this is your first time or you&#8217;ve for&#8230;</p>
      <p>
          <a href="https://www.databites.tech/p/sql-course-problem-4">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[The Importance of Context]]></title><description><![CDATA[Notes #12 - Don&#8217;t just show data. Tell the story that moves people.]]></description><link>https://www.databites.tech/p/the-importance-of-context</link><guid isPermaLink="false">https://www.databites.tech/p/the-importance-of-context</guid><dc:creator><![CDATA[Josep Ferrer]]></dc:creator><pubDate>Tue, 03 Jun 2025 10:02:41 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!wI6A!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7eed8149-b89a-4c91-8589-e386d2bf2761_1294x1296.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Hey everyone! &#128075;&#127996;</strong></p><p>Josep here, back with your weekly bite of career insights and encouragement &#10024;</p><p>Last week, we unpacked how to position ourselves for luck.</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;9bcfd1a8-d51b-42ec-a7b0-d3ba66dec9ba&quot;,&quot;caption&quot;:&quot;Hey everyone! 
&#128075;&#127996;&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot; Position Yourself for Luck&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:132707413,&quot;name&quot;:&quot;Josep Ferrer&quot;,&quot;bio&quot;:&quot;Outstand using data -- Data Science, Design and Tech Tech Writer @KDnuggets @DataCamp &#128073;&#127995;Inquiries in rfeers@gmail.com&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd196b5a6-59f2-46dd-99b3-e10ab1bbd27d_604x604.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-05-28T10:02:48.977Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12d60872-4d4b-4344-b5e4-f28d501e8a47_1646x1644.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.databites.tech/p/position-yourself-for-luck&quot;,&quot;section_name&quot;:&quot;Josep's Notes &#128640;&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:164624046,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:9,&quot;comment_count&quot;:3,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;DataBites&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe930fbab-b8df-40ef-9676-3d9ca5d49eae_714x714.png&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p>This week, let me ask you this:<br>Have you ever built a chart that <em>technically</em> made sense&#8230; but no one seemed to get it?<br>You 
showed the data, but it didn&#8217;t land. </p><p>It didn&#8217;t inspire action. </p><p>It didn&#8217;t spark con&#8230;</p>
      <p>
          <a href="https://www.databites.tech/p/the-importance-of-context">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[SQL COURSE PROBLEM #1]]></title><description><![CDATA[SQL Crash Course - Managing your own newsletters]]></description><link>https://www.databites.tech/p/sql-course-problem-1</link><guid isPermaLink="false">https://www.databites.tech/p/sql-course-problem-1</guid><dc:creator><![CDATA[Josep Ferrer]]></dc:creator><pubDate>Thu, 29 May 2025 12:02:52 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!YPzg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5ec4a4b-f38b-46cf-92a0-4287d031a25e_1380x962.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p></p><div class="pullquote"><p><em>All the course material is stored in the <strong><a href="https://github.com/CornelliusYW/SQL-Crash-Course">SQL Crash Course repository</a></strong>.</em></p></div><p>Hi everyone! <strong>Josep</strong> and <a href="https://open.substack.com/users/6000855-cornellius-yudha-wijaya?utm_source=mentions">Cornellius Yudha Wijaya</a> from <a href="https://open.substack.com/pub/cornellius">Non-Brand Data</a> here &#128075;&#127995;</p><p>As promised, today we are publishing the next two issues of our <a href="https://www.databites.tech/p/launching-the-sql-crash-course-from">SQL Crash Course &#8211; From Zero to Hero!</a> &#128640;</p><p>I am sure you are here to continue our <strong>SQL Crash Course Journey!&#128218;</strong></p><p>If this is your first time or you&#8217;ve for&#8230;</p>
      <p>
          <a href="https://www.databites.tech/p/sql-course-problem-1">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[ Position Yourself for Luck]]></title><description><![CDATA[Notes #11 - How to shift your environment, act with agency, and make better decisions that attract opportunity.]]></description><link>https://www.databites.tech/p/position-yourself-for-luck</link><guid isPermaLink="false">https://www.databites.tech/p/position-yourself-for-luck</guid><dc:creator><![CDATA[Josep Ferrer]]></dc:creator><pubDate>Wed, 28 May 2025 10:02:48 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!zKxA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12d60872-4d4b-4344-b5e4-f28d501e8a47_1646x1644.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Hey everyone! &#128075;&#127996;</strong></p><p>Josep here, back with your weekly bite of career insights and encouragement &#10024;</p><p>Last week, we unpacked why <strong>mindset beats raw talent</strong> &#8212; and how the smallest shift in belief can unlock big transformation.</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;683b3d39-d958-4890-9b40-303c12f9de07&quot;,&quot;caption&quot;:&quot;Hey everyone! 
&#128075;&#127996;&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Why Resilience Is the New Hard Skill&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:132707413,&quot;name&quot;:&quot;Josep Ferrer&quot;,&quot;bio&quot;:&quot;Outstand using data -- Data Science, Design and Tech Tech Writer @KDnuggets @DataCamp &#128073;&#127995;Inquiries in rfeers@gmail.com&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd196b5a6-59f2-46dd-99b3-e10ab1bbd27d_604x604.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-05-20T10:02:33.375Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35c24ac9-df4f-4491-b8cb-e474f1c2914d_890x892.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.databites.tech/p/why-resilience-is-the-new-hard-skill&quot;,&quot;section_name&quot;:&quot;Josep's Notes &#128640;&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:163989029,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:17,&quot;comment_count&quot;:1,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;DataBites&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe930fbab-b8df-40ef-9676-3d9ca5d49eae_714x714.png&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p>This week, we go even deeper. 
</p><p>&#127744; <strong>What do you do when the world is changing faster than your plans can keep up? </strong>You improve your <em>position</em> &#8212; &#8230;</p>
      <p>
          <a href="https://www.databites.tech/p/position-yourself-for-luck">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Understanding Autoencoders]]></title><description><![CDATA[TheGuestBites #2 - Daniel - Compress, Clean, and Discover Patterns with Neural Networks]]></description><link>https://www.databites.tech/p/understanding-autoencoders</link><guid isPermaLink="false">https://www.databites.tech/p/understanding-autoencoders</guid><dc:creator><![CDATA[Daniel]]></dc:creator><pubDate>Sat, 24 May 2025 10:02:24 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ClpM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F951ba09a-d8d6-4e4a-98de-8e281d953508_1332x1320.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p></p><div><hr></div><p>We&#8217;re kicking things off with Daniel Garc&#237;a, machine learning educator and creator of <em><a href="https://iamdgarcia.substack.com/">The Learning Curve</a></em><a href="https://iamdgarcia.substack.com/">.</a> In this issue, Daniel dives into the world of <strong>autoencoders</strong> &#8212; a neural network technique that goes far beyond just copying data.</p><p>Through real-world demos and crisp explanations, he reveals how autoencoders help us <strong>compress complex data, clean up noisy signals, detect anomalies, and even generate new content</strong>. 
Whether you're curious about smarter data workflows or the magic behind generative models, this breakdown makes a foundational concept feel both approachable and powerful.</p><p>&#8212; Josep</p><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ClpM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F951ba09a-d8d6-4e4a-98de-8e281d953508_1332x1320.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ClpM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F951ba09a-d8d6-4e4a-98de-8e281d953508_1332x1320.png 424w, https://substackcdn.com/image/fetch/$s_!ClpM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F951ba09a-d8d6-4e4a-98de-8e281d953508_1332x1320.png 848w, https://substackcdn.com/image/fetch/$s_!ClpM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F951ba09a-d8d6-4e4a-98de-8e281d953508_1332x1320.png 1272w, https://substackcdn.com/image/fetch/$s_!ClpM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F951ba09a-d8d6-4e4a-98de-8e281d953508_1332x1320.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ClpM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F951ba09a-d8d6-4e4a-98de-8e281d953508_1332x1320.png" width="1332" height="1320" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/951ba09a-d8d6-4e4a-98de-8e281d953508_1332x1320.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1320,&quot;width&quot;:1332,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1265545,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.databites.tech/i/164146477?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F951ba09a-d8d6-4e4a-98de-8e281d953508_1332x1320.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ClpM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F951ba09a-d8d6-4e4a-98de-8e281d953508_1332x1320.png 424w, https://substackcdn.com/image/fetch/$s_!ClpM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F951ba09a-d8d6-4e4a-98de-8e281d953508_1332x1320.png 848w, https://substackcdn.com/image/fetch/$s_!ClpM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F951ba09a-d8d6-4e4a-98de-8e281d953508_1332x1320.png 1272w, https://substackcdn.com/image/fetch/$s_!ClpM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F951ba09a-d8d6-4e4a-98de-8e281d953508_1332x1320.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Hi! I&#8217;m <a href="http://www.linkedin.com/in/iamdgarcia">Daniel Garc&#237;a</a>, the tinkerer behind <em><a href="https://iamdgarcia.substack.com/">The Learning Curve</a></em>, where I break down machine learning concepts into real-world demos and practical insights for anyone curious about AI. </p><p>In this issue, we&#8217;ll explore a tool that&#8217;s both powerful and deceptively simple: <strong>autoencoders</strong>.</p><p>While they&#8217;re often introduced as models that &#8220;just copy their input,&#8221; the truth is much richer. </p><p>By learning to compress and reconstruct data, autoencoders unlock everything from smarter data cleaning to generative art. 
If you&#8217;ve ever wondered how machines learn to see structure in chaos, this one&#8217;s for you.</p><h2><strong>What is an Autoencoder?</strong></h2><p>An autoencoder is a type of neural network designed to take in data, compress it, and then reconstruct it as closely as possible to the original. But it&#8217;s not just copying &#8212; the key feature is a <strong>bottleneck</strong>, a deliberately small hidden layer that limits how much information the model can store.</p><p>This constraint forces the model to <em>learn the most important features</em> of the input data. It must filter out noise, redundancy, and irrelevant detail in order to create a useful summary &#8212; called the <strong>latent representation</strong>.</p><p>Once trained, autoencoders can be used for compression, denoising, anomaly detection, and &#8212; in some variants &#8212; even creative generation of new data.</p><h3><strong>1. Why Should You Care?</strong></h3><p>Let&#8217;s take a closer look at how autoencoders show up in the real world and why they&#8217;re worth adding to your machine learning toolbox.</p><h4><strong>1.1 Compression (Dimensionality Reduction)</strong></h4><p>Autoencoders are one of the most flexible tools for dimensionality reduction. When working with high-dimensional data like images, sensor arrays, or audio signals, storing or processing that data in full can be expensive. Autoencoders solve this by learning a more compact version.</p><p>Instead of manually engineering features, the network automatically learns a compressed form that keeps the structure but discards unnecessary detail. 
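</p>

<p>To make the bottleneck idea concrete, here is a minimal sketch (assuming PyTorch; the 784-input and 32-unit latent sizes are illustrative choices, not from this issue):</p>

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # Encoder: squeeze the input through a small bottleneck
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )
        # Decoder: reconstruct the original from the latent code
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)      # latent representation
        return self.decoder(z)   # reconstruction

model = Autoencoder()
x = torch.randn(16, 784)         # a dummy batch of 16 samples
x_hat = model(x)                 # reconstructed batch, same shape as x
loss = nn.functional.mse_loss(x_hat, x)  # reconstruction error to minimize
```

<p>The encoder output <code>z</code> is the compressed summary: each 784-dimensional input is forced through just 32 numbers, so training drives the network to keep only what matters for reconstruction.</p>

<p>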
This is especially useful for speeding up downstream models or visualizing data in two or three dimensions.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!r3wZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a8228e8-5828-423b-8018-0b7ba2bc3f6e_1600x1463.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!r3wZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a8228e8-5828-423b-8018-0b7ba2bc3f6e_1600x1463.png 424w, https://substackcdn.com/image/fetch/$s_!r3wZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a8228e8-5828-423b-8018-0b7ba2bc3f6e_1600x1463.png 848w, https://substackcdn.com/image/fetch/$s_!r3wZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a8228e8-5828-423b-8018-0b7ba2bc3f6e_1600x1463.png 1272w, https://substackcdn.com/image/fetch/$s_!r3wZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a8228e8-5828-423b-8018-0b7ba2bc3f6e_1600x1463.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!r3wZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a8228e8-5828-423b-8018-0b7ba2bc3f6e_1600x1463.png" width="488" height="446.1043956043956" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1a8228e8-5828-423b-8018-0b7ba2bc3f6e_1600x1463.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1331,&quot;width&quot;:1456,&quot;resizeWidth&quot;:488,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!r3wZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a8228e8-5828-423b-8018-0b7ba2bc3f6e_1600x1463.png 424w, https://substackcdn.com/image/fetch/$s_!r3wZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a8228e8-5828-423b-8018-0b7ba2bc3f6e_1600x1463.png 848w, https://substackcdn.com/image/fetch/$s_!r3wZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a8228e8-5828-423b-8018-0b7ba2bc3f6e_1600x1463.png 1272w, https://substackcdn.com/image/fetch/$s_!r3wZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a8228e8-5828-423b-8018-0b7ba2bc3f6e_1600x1463.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image by Author (Daniel Garc&#237;a from The Learning Curve)</figcaption></figure></div><h4><strong>1.2 Denoising (Data Cleaning)</strong></h4><p>A common real-world problem is noisy data. Whether you&#8217;re scanning documents, recording audio, or collecting signals from sensors, the data you get is rarely clean.</p><p>A <strong>denoising autoencoder</strong> solves this by training the model to take in noisy input and predict the clean version. This forces the network to ignore irrelevant variations and reconstruct only the core signal.</p><p>It&#8217;s a data-driven way to clean inputs without needing to handcraft filtering rules. 
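A minimal, framework-free sketch of that setup in NumPy (the random-phase sine waves, tiny linear encoder/decoder, noise level, and learning rate are all illustrative stand-ins; a real denoiser would be a deeper, nonlinear network trained on your actual data). The one detail that defines a denoising autoencoder is the loss: the model reads the noisy input but is scored against the clean target.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "clean" data: random-phase sine waves (a stand-in for real signals).
n, d, k = 512, 32, 8                      # samples, input dim, bottleneck dim
t = np.linspace(0, 2 * np.pi, d)
freqs = rng.integers(1, 4, size=(n, 1))
phases = rng.uniform(0, 2 * np.pi, size=(n, 1))
clean = np.sin(freqs * t + phases)
noisy = clean + rng.normal(0, 0.3, size=clean.shape)

# A tiny linear autoencoder trained by plain gradient descent.
W1 = rng.normal(0, 0.1, size=(d, k))      # encoder weights
W2 = rng.normal(0, 0.1, size=(k, d))      # decoder weights
lr, losses = 0.1, []
for step in range(500):
    z = noisy @ W1                        # encode the NOISY input
    recon = z @ W2                        # decode back to signal space
    err = recon - clean                   # ...but score against the CLEAN target
    losses.append(float((err ** 2).mean()))
    grad = err * (2.0 / err.size)         # d(loss)/d(recon)
    gW2 = z.T @ grad                      # backprop through the decoder
    gW1 = noisy.T @ (grad @ W2.T)         # backprop through the encoder
    W1 -= lr * gW1
    W2 -= lr * gW2

print("loss at start:", losses[0])
print("loss at end:  ", losses[-1])
```

Swap `clean` for `noisy` in the error line and you are back to a vanilla autoencoder; that single change is the whole trick.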
If you&#8217;re dealing with messy datasets, this can make a huge difference in performance downstream.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!iOQ5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febff34ab-8b64-4730-b407-53956dfc17d9_860x414.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!iOQ5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febff34ab-8b64-4730-b407-53956dfc17d9_860x414.png 424w, https://substackcdn.com/image/fetch/$s_!iOQ5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febff34ab-8b64-4730-b407-53956dfc17d9_860x414.png 848w, https://substackcdn.com/image/fetch/$s_!iOQ5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febff34ab-8b64-4730-b407-53956dfc17d9_860x414.png 1272w, https://substackcdn.com/image/fetch/$s_!iOQ5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febff34ab-8b64-4730-b407-53956dfc17d9_860x414.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!iOQ5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febff34ab-8b64-4730-b407-53956dfc17d9_860x414.png" width="860" height="414" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ebff34ab-8b64-4730-b407-53956dfc17d9_860x414.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:414,&quot;width&quot;:860,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!iOQ5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febff34ab-8b64-4730-b407-53956dfc17d9_860x414.png 424w, https://substackcdn.com/image/fetch/$s_!iOQ5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febff34ab-8b64-4730-b407-53956dfc17d9_860x414.png 848w, https://substackcdn.com/image/fetch/$s_!iOQ5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febff34ab-8b64-4730-b407-53956dfc17d9_860x414.png 1272w, https://substackcdn.com/image/fetch/$s_!iOQ5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febff34ab-8b64-4730-b407-53956dfc17d9_860x414.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image from the <a href="http://en.wikipedia.org/wiki/Autoencoder">Autoencoder Wikipedia website.</a></figcaption></figure></div><p></p><h4><strong>1.3 Anomaly Detection</strong></h4><p>Autoencoders are also a powerful tool for spotting things that don&#8217;t belong. By training a model on normal data &#8212; for example, regular sensor readings or typical user behavior &#8212; it becomes very good at reconstructing those patterns.</p><p>But when an unusual input comes along, the autoencoder struggles. Its reconstruction will be poor, and the <strong>reconstruction error</strong> will spike. 
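The scoring logic can be sketched in a few lines of NumPy. Here a linear autoencoder fit in closed form via SVD (which makes it equivalent to PCA) stands in for a trained network, and the "sensor" data is synthetic; both are illustrative assumptions, but the reconstruction-error idea is identical for a deep model.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in "normal" sensor data: 5 correlated channels driven by 2 latent factors.
latent = rng.normal(size=(2000, 2))
mix = rng.normal(size=(2, 5))
normal = latent @ mix + rng.normal(scale=0.05, size=(2000, 5))

# "Train" on normal data only: the top-2 principal directions play the role
# of the learned encoder/decoder weights (tied).
mean = normal.mean(axis=0)
_, _, Vt = np.linalg.svd(normal - mean, full_matrices=False)
V = Vt[:2].T

def reconstruction_error(x):
    z = (x - mean) @ V            # encode into the 2-dim latent space
    recon = z @ V.T + mean        # decode back to sensor space
    return float(((x - recon) ** 2).sum())

typical = normal[0]
anomaly = typical + np.array([0.0, 0.0, 3.0, 0.0, 0.0])  # one channel spikes

print("error on typical reading:  ", reconstruction_error(typical))
print("error on anomalous reading:", reconstruction_error(anomaly))
```

Flagging an input is then just a threshold on this error, chosen from the error distribution of held-out normal data.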
That spike is a useful signal: something about this input is different from the norm.</p><p>This technique is widely used in fraud detection, predictive maintenance, cybersecurity, and monitoring for system failures.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Z1tT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67bc6ee7-d04e-422e-9f84-758f12142edf_548x295.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Z1tT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67bc6ee7-d04e-422e-9f84-758f12142edf_548x295.png 424w, https://substackcdn.com/image/fetch/$s_!Z1tT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67bc6ee7-d04e-422e-9f84-758f12142edf_548x295.png 848w, https://substackcdn.com/image/fetch/$s_!Z1tT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67bc6ee7-d04e-422e-9f84-758f12142edf_548x295.png 1272w, https://substackcdn.com/image/fetch/$s_!Z1tT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67bc6ee7-d04e-422e-9f84-758f12142edf_548x295.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Z1tT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67bc6ee7-d04e-422e-9f84-758f12142edf_548x295.png" width="668" height="359.5985401459854" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/67bc6ee7-d04e-422e-9f84-758f12142edf_548x295.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:295,&quot;width&quot;:548,&quot;resizeWidth&quot;:668,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Z1tT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67bc6ee7-d04e-422e-9f84-758f12142edf_548x295.png 424w, https://substackcdn.com/image/fetch/$s_!Z1tT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67bc6ee7-d04e-422e-9f84-758f12142edf_548x295.png 848w, https://substackcdn.com/image/fetch/$s_!Z1tT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67bc6ee7-d04e-422e-9f84-758f12142edf_548x295.png 1272w, https://substackcdn.com/image/fetch/$s_!Z1tT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67bc6ee7-d04e-422e-9f84-758f12142edf_548x295.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image by Author. Daniel Garc&#237;a from The Learning Curve. </figcaption></figure></div><h4><strong>1.4 Generating New Samples</strong></h4><p>Not all autoencoders are just for cleaning or compressing data. Some are built to <em>generate new data</em> entirely.</p><p><strong>Variational Autoencoders (VAEs)</strong> treat the latent space as a probability distribution rather than a fixed point. This allows the model to sample new points in that space and decode them into plausible new outputs.</p><p>In practice, this enables you to create new images, sounds, or sequences based on the structure learned from training data. 
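The two ingredients behind that generative ability fit in a few lines of NumPy: the reparameterization trick used to sample a latent point, and the closed-form KL penalty that keeps the latent distribution close to a standard Gaussian. The `mu` and `log_var` values below are illustrative; in a real VAE they come out of the trained encoder, and the sampled `z` would then be passed through the trained decoder to produce a new output.

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_latent(mu, log_var):
    """Draw z ~ N(mu, sigma^2) via z = mu + sigma * eps, eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

mu = np.array([0.5, -1.0])        # example encoder outputs
log_var = np.array([-0.2, 0.1])

z = sample_latent(mu, log_var)
print("sampled latent point:", z)
print("KL penalty:", kl_to_standard_normal(mu, log_var))

# A latent code that is already standard normal pays no KL penalty:
print("KL at mu=0, log_var=0:", kl_to_standard_normal(np.zeros(2), np.zeros(2)))
```

The reparameterization step is what keeps sampling differentiable, so the whole model can still be trained by backpropagation.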
It&#8217;s one of the most creative and experimental branches of unsupervised learning.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!g6-I!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f73967e-73c1-4793-8e40-e31fa0aa4c9b_511x453.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!g6-I!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f73967e-73c1-4793-8e40-e31fa0aa4c9b_511x453.png 424w, https://substackcdn.com/image/fetch/$s_!g6-I!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f73967e-73c1-4793-8e40-e31fa0aa4c9b_511x453.png 848w, https://substackcdn.com/image/fetch/$s_!g6-I!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f73967e-73c1-4793-8e40-e31fa0aa4c9b_511x453.png 1272w, https://substackcdn.com/image/fetch/$s_!g6-I!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f73967e-73c1-4793-8e40-e31fa0aa4c9b_511x453.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!g6-I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f73967e-73c1-4793-8e40-e31fa0aa4c9b_511x453.png" width="511" height="453" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5f73967e-73c1-4793-8e40-e31fa0aa4c9b_511x453.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:453,&quot;width&quot;:511,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!g6-I!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f73967e-73c1-4793-8e40-e31fa0aa4c9b_511x453.png 424w, https://substackcdn.com/image/fetch/$s_!g6-I!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f73967e-73c1-4793-8e40-e31fa0aa4c9b_511x453.png 848w, https://substackcdn.com/image/fetch/$s_!g6-I!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f73967e-73c1-4793-8e40-e31fa0aa4c9b_511x453.png 1272w, https://substackcdn.com/image/fetch/$s_!g6-I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f73967e-73c1-4793-8e40-e31fa0aa4c9b_511x453.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image by Author (Daniel Garc&#237;a from The Learning Curve)</figcaption></figure></div><h3><strong>2. How Autoencoders Work</strong></h3><p>The architecture of an autoencoder is made up of two main parts:</p><ul><li><p><strong>Encoder</strong>: This compresses the input data into a smaller internal representation (the latent vector). It&#8217;s like summarizing a paragraph into a sentence.</p></li><li><p><strong>Decoder</strong>: This takes that compressed summary and tries to recreate the original data as closely as possible. The goal is to make the output look just like the input.</p></li></ul><p>The model is trained by minimizing a <strong>reconstruction loss</strong>, a measure of how different the output is from the original. This could be mean squared error (for continuous data like images) or binary cross-entropy (for normalized or binary data).</p><p>Without the bottleneck, the network would just memorize and copy the data. 
But with the bottleneck, it is forced to <em>learn patterns and compress meaningfully</em>.</p><h3><strong>3. Autoencoder Variants</strong></h3><p>Autoencoders come in many forms. Here&#8217;s a quick guide to the most common variants and what they&#8217;re useful for:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Xclz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F930db4a5-f498-44e3-aa83-e5cb1118169d_1318x448.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Xclz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F930db4a5-f498-44e3-aa83-e5cb1118169d_1318x448.png 424w, https://substackcdn.com/image/fetch/$s_!Xclz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F930db4a5-f498-44e3-aa83-e5cb1118169d_1318x448.png 848w, https://substackcdn.com/image/fetch/$s_!Xclz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F930db4a5-f498-44e3-aa83-e5cb1118169d_1318x448.png 1272w, https://substackcdn.com/image/fetch/$s_!Xclz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F930db4a5-f498-44e3-aa83-e5cb1118169d_1318x448.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Xclz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F930db4a5-f498-44e3-aa83-e5cb1118169d_1318x448.png" width="1318" height="448" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/930db4a5-f498-44e3-aa83-e5cb1118169d_1318x448.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:448,&quot;width&quot;:1318,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:102034,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.databites.tech/i/164146477?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F930db4a5-f498-44e3-aa83-e5cb1118169d_1318x448.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Xclz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F930db4a5-f498-44e3-aa83-e5cb1118169d_1318x448.png 424w, https://substackcdn.com/image/fetch/$s_!Xclz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F930db4a5-f498-44e3-aa83-e5cb1118169d_1318x448.png 848w, https://substackcdn.com/image/fetch/$s_!Xclz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F930db4a5-f498-44e3-aa83-e5cb1118169d_1318x448.png 1272w, https://substackcdn.com/image/fetch/$s_!Xclz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F930db4a5-f498-44e3-aa83-e5cb1118169d_1318x448.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Table by Author (Daniel Garcia from the Learning Curve)</figcaption></figure></div><p>Let&#8217;s dig a little deeper into the more interesting ones:</p><h4><strong>3.1 Sparse Autoencoders</strong></h4><p>In this version, we don&#8217;t necessarily make the latent space smaller &#8212; instead, we encourage the model to activate only a few neurons at a time. This sparsity leads to representations where different neurons specialize in detecting different features.</p><p>We add a penalty during training that discourages the model from turning on too many neurons. The result is a more interpretable and often more robust model that can still extract useful features.</p><h4><strong>3.2 Contractive Autoencoders</strong></h4><p>These are designed to be resistant to tiny changes in the input. 
They include a penalty that discourages the model from making big changes in the encoding in response to small changes in the input.</p><p>This is useful for tasks where inputs may be noisy or jittery, but we want the model to focus on the stable patterns.</p><h4><strong>3.3 Variational Autoencoders (VAEs)</strong></h4><p>VAEs change the way we think about the latent space. Instead of mapping inputs to a single point, they map them to a probability distribution &#8212; usually Gaussian. This enables you to sample new points and generate new outputs.</p><p>To make this work, VAEs add an extra penalty term that encourages the latent space to stay well-behaved (smooth, continuous, and compact). This is what allows them to generate data that looks convincingly real.</p><h3><strong>4. Three Mini-Projects to Try</strong></h3><p>Let&#8217;s make this practical. Here are three hands-on projects you can try to explore different use cases of autoencoders.</p><p><strong>Project 1 &#8211; Compression with MNIST</strong></p><p>Train a basic autoencoder on MNIST, a dataset of grayscale images of handwritten digits. 
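A framework-free sketch of that training loop is below. Synthetic 8x8 "digit-like" arrays stand in for MNIST so the sketch runs offline (swap in the real dataset via Keras or torchvision); the layer sizes, learning rate, and epoch count are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic stand-in for MNIST: 10 random 8x8 prototypes plus pixel noise,
# flattened to 64 values in [0, 1].
protos = rng.random((10, 64))
labels = rng.integers(0, 10, size=256)
images = np.clip(protos[labels] + rng.normal(0, 0.05, (256, 64)), 0.0, 1.0)

d, k = 64, 16                              # input dim, bottleneck dim
W1 = rng.normal(0, 0.1, (d, k))            # encoder weights
W2 = rng.normal(0, 0.1, (k, d))            # decoder weights (sigmoid output)
b2 = np.zeros(d)
lr, epochs, snapshots, losses = 2.0, 5, [], []

for epoch in range(epochs):
    for _ in range(200):                   # full-batch gradient steps
        z = images @ W1                    # encode
        recon = 1.0 / (1.0 + np.exp(-(z @ W2 + b2)))  # sigmoid decode
        err = recon - images
        dpre = err * recon * (1.0 - recon) * (2.0 / err.size)
        gW2, gb2 = z.T @ dpre, dpre.sum(axis=0)
        gW1 = images.T @ (dpre @ W2.T)
        W1 -= lr * gW1
        W2 -= lr * gW2
        b2 -= lr * gb2
    losses.append(float((err ** 2).mean()))
    snapshots.append(recon[:8].copy())     # save a few reconstructions per epoch

print("per-epoch loss:", [round(l, 4) for l in losses])
```

Plotting the saved `snapshots` side by side with the originals gives you the epoch-by-epoch view described next.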
After each training epoch, save a set of reconstructions and compare them to the originals.</p><p>You&#8217;ll be able to visually track how the model learns to compress and reconstruct the data over time &#8212; starting with blurry blobs and ending with recognizable digits.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!T8CZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4edaf2b3-085e-47f5-ba97-d935e135ae8a_600x154.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!T8CZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4edaf2b3-085e-47f5-ba97-d935e135ae8a_600x154.png 424w, https://substackcdn.com/image/fetch/$s_!T8CZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4edaf2b3-085e-47f5-ba97-d935e135ae8a_600x154.png 848w, https://substackcdn.com/image/fetch/$s_!T8CZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4edaf2b3-085e-47f5-ba97-d935e135ae8a_600x154.png 1272w, https://substackcdn.com/image/fetch/$s_!T8CZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4edaf2b3-085e-47f5-ba97-d935e135ae8a_600x154.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!T8CZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4edaf2b3-085e-47f5-ba97-d935e135ae8a_600x154.png" width="600" height="154" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4edaf2b3-085e-47f5-ba97-d935e135ae8a_600x154.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:154,&quot;width&quot;:600,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!T8CZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4edaf2b3-085e-47f5-ba97-d935e135ae8a_600x154.png 424w, https://substackcdn.com/image/fetch/$s_!T8CZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4edaf2b3-085e-47f5-ba97-d935e135ae8a_600x154.png 848w, https://substackcdn.com/image/fetch/$s_!T8CZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4edaf2b3-085e-47f5-ba97-d935e135ae8a_600x154.png 1272w, https://substackcdn.com/image/fetch/$s_!T8CZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4edaf2b3-085e-47f5-ba97-d935e135ae8a_600x154.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Image by Author (Daniel Garc&#237;a from The Learning Curve)</figcaption></figure></div><p><strong>Project 2 &#8211; Audio Denoising</strong></p><p>Record yourself saying a short phrase like "machine learning rocks" while there&#8217;s background noise (e.g. fan or vacuum cleaner). 
Alternatively, add artificial Gaussian noise to a clean recording.</p><p>Convert the audio into a spectrogram, then train a denoising autoencoder to reconstruct the clean signal. The model will learn to ignore the noise and focus on the speech signal.</p><p><strong>Project 3 &#8211; Anomaly Detection in Sensor Data</strong></p><p>Use a dataset of sensor readings from an industrial process or simulated IoT environment. Train an autoencoder only on normal data. Then introduce some outlier readings (e.g., spikes, drops, or irregular behavior).</p><p>Monitor the reconstruction error over time. When it spikes, it&#8217;s likely an anomaly. This is a powerful technique for predictive maintenance and safety monitoring.</p><h3><strong>5. Common Questions About Autoencoders</strong></h3><h4><strong>Is this just fancy PCA?</strong></h4><p>They&#8217;re related &#8212; both compress data &#8212; but PCA is linear and deterministic. Autoencoders are nonlinear and can be scaled and customized for many more types of input.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AlCI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5d6b349-687c-4566-a068-c6676a1653d2_497x265.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AlCI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5d6b349-687c-4566-a068-c6676a1653d2_497x265.png 424w, https://substackcdn.com/image/fetch/$s_!AlCI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5d6b349-687c-4566-a068-c6676a1653d2_497x265.png 848w, 
https://substackcdn.com/image/fetch/$s_!AlCI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5d6b349-687c-4566-a068-c6676a1653d2_497x265.png 1272w, https://substackcdn.com/image/fetch/$s_!AlCI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5d6b349-687c-4566-a068-c6676a1653d2_497x265.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!AlCI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5d6b349-687c-4566-a068-c6676a1653d2_497x265.png" width="497" height="265" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f5d6b349-687c-4566-a068-c6676a1653d2_497x265.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:265,&quot;width&quot;:497,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!AlCI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5d6b349-687c-4566-a068-c6676a1653d2_497x265.png 424w, https://substackcdn.com/image/fetch/$s_!AlCI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5d6b349-687c-4566-a068-c6676a1653d2_497x265.png 848w, 
https://substackcdn.com/image/fetch/$s_!AlCI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5d6b349-687c-4566-a068-c6676a1653d2_497x265.png 1272w, https://substackcdn.com/image/fetch/$s_!AlCI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5d6b349-687c-4566-a068-c6676a1653d2_497x265.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Image by Author (Daniel Garc&#237;a from The Learning Curve)</figcaption></figure></div><ol><li><p><strong>How do I pick 
the latent size?<br></strong>A good rule of thumb is to start with log base 2 of your input size. From there, adjust based on how well the model reconstructs data and whether it generalizes well.</p></li><li><p><strong>Why are my reconstructions blurry or inaccurate?<br></strong>Check your latent size, your loss function, and whether your train/test split is correct. Blurry outputs often mean your model doesn&#8217;t have enough capacity or hasn&#8217;t trained long enough.</p></li><li><p><strong>Can I generate new data with a vanilla autoencoder?<br></strong>Not reliably. You&#8217;ll need a Variational Autoencoder (VAE) or a GAN if you want to generate novel samples.</p></li></ol><h3><strong>6. Wrapping Up</strong></h3><p>Autoencoders are a foundational tool in the machine learning world. While they&#8217;re often described as models that "just copy the input," the reality is that they learn how to compress and represent the essence of your data. Once trained, they can do much more than reconstruction &#8212; they can clean, compress, detect, and even create.</p><p>Over the coming weeks, I&#8217;ll be publishing walkthroughs on <em>The Learning Curve</em> showing exactly how to build each of the three mini-projects above, step by step. That means you&#8217;ll not only understand the theory &#8212; you&#8217;ll get working code, visualizations, and practical insights to make it your own.</p><div><hr></div><p><a href="https://www.linkedin.com/in/iamdgarcia/">Daniel</a> is an <strong>ML engineer and writer</strong> of <a href="https://iamdgarcia.substack.com/">The Learning Curve</a>, a newsletter that makes AI make sense&#8212;no hype, no jargon. 
<strong>He&#8217;s been through every stage of academia</strong> (yes, <em>all the way to a PhD</em>), worked in startups and consulting, and now shares the kind of lessons he wishes he&#8217;d had when he started: <strong>clear, practical, and fluff-free.</strong></p>]]></content:encoded></item><item><title><![CDATA[SQL Crash Course – Getting into Practice! 👨🏻‍💻]]></title><description><![CDATA[SQL Crash Course Theory Ends, and Hands-On Begins! &#128640;]]></description><link>https://www.databites.tech/p/sql-crash-course-getting-into-practice</link><guid isPermaLink="false">https://www.databites.tech/p/sql-crash-course-getting-into-practice</guid><dc:creator><![CDATA[Josep Ferrer]]></dc:creator><pubDate>Fri, 23 May 2025 10:02:50 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!0uya!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c7f40ab-eb8f-4cc0-b9be-c1a1e2b0808b_2870x1465.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Hey everyone! Josep and Cornellius here with one more week of SQL learning! 
&#128075;&#127995;</p><p>Can you believe it&#8217;s already been <strong>2 months</strong> since <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Cornellius Yudha Wijaya&quot;,&quot;id&quot;:6000855,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8981f4c2-dc6b-42f7-a2bd-9751653af571_512x512.jpeg&quot;,&quot;uuid&quot;:&quot;9c7a043d-f76d-4e3a-964b-8e495ce09f42&quot;}" data-component-name="MentionToDOM"></span> (from <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Non-Brand Data&quot;,&quot;id&quot;:37262,&quot;type&quot;:&quot;pub&quot;,&quot;url&quot;:&quot;https://open.substack.com/pub/cornellius&quot;,&quot;photo_url&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/6c0e1cde-d120-4029-8ffd-2a8c7c6e4504_1280x1280.png&quot;,&quot;uuid&quot;:&quot;24794f1a-7a13-4a47-b5c0-1e53363fa81a&quot;}" data-component-name="MentionToDOM"></span> ) and I (from <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;DataBites&quot;,&quot;id&quot;:2143185,&quot;type&quot;:&quot;pub&quot;,&quot;url&quot;:&quot;https://open.substack.com/pub/rfeers&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e930fbab-b8df-40ef-9676-3d9ca5d49eae_714x714.png&quot;,&quot;uuid&quot;:&quot;560fa677-00b7-46c2-a9e4-66e5e062ef0b&quot;}" data-component-name="MentionToDOM"></span> ) kicked off our <strong>SQL Crash Course</strong>? 
&#127881;<br>What started as a fun idea quickly turned into a full-blown series &#8212; and today, we&#8217;re sharing a <strong>recap of everything we&#8217;ve covered&#8230; and what&#8217;s coming next!</strong> &#128588;&#127995;</p><h2>&#128293; What&#8217;s inside the course?</h2><p>We&#8217;ve structured the course into <strong>7 key modules</strong> to take you from zero to SQL hero:</p><ol><li><p><strong>Introduction</strong> &#8211; What SQL is and why it matters</p></li><li><p><strong>SQL Fundamentals</strong> &#8211; Basic commands, filtering, and aggregation</p></li><li><p><strong>Intermediate SQL</strong> &#8211; Joins, unions, and functions</p></li><li><p><strong>Advanced SQL</strong> &#8211; Subqueries, CTEs, recursion, and views</p></li><li><p><strong>Database Operations</strong> &#8211; CRUD, schema changes, and optimization</p></li><li><p><strong>Crafting Good SQL Queries</strong> &#8211; Best practices for writing efficient queries</p></li><li><p><strong>Real-world Problems</strong> &#8211; Applying SQL to practical challenges</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0uya!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c7f40ab-eb8f-4cc0-b9be-c1a1e2b0808b_2870x1465.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0uya!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c7f40ab-eb8f-4cc0-b9be-c1a1e2b0808b_2870x1465.png 424w, https://substackcdn.com/image/fetch/$s_!0uya!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c7f40ab-eb8f-4cc0-b9be-c1a1e2b0808b_2870x1465.png 848w, 
https://substackcdn.com/image/fetch/$s_!0uya!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c7f40ab-eb8f-4cc0-b9be-c1a1e2b0808b_2870x1465.png 1272w, https://substackcdn.com/image/fetch/$s_!0uya!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c7f40ab-eb8f-4cc0-b9be-c1a1e2b0808b_2870x1465.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0uya!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c7f40ab-eb8f-4cc0-b9be-c1a1e2b0808b_2870x1465.png" width="1456" height="743" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5c7f40ab-eb8f-4cc0-b9be-c1a1e2b0808b_2870x1465.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:743,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:490572,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.databites.tech/i/158754537?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c7f40ab-eb8f-4cc0-b9be-c1a1e2b0808b_2870x1465.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!0uya!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c7f40ab-eb8f-4cc0-b9be-c1a1e2b0808b_2870x1465.png 424w, 
https://substackcdn.com/image/fetch/$s_!0uya!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c7f40ab-eb8f-4cc0-b9be-c1a1e2b0808b_2870x1465.png 848w, https://substackcdn.com/image/fetch/$s_!0uya!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c7f40ab-eb8f-4cc0-b9be-c1a1e2b0808b_2870x1465.png 1272w, https://substackcdn.com/image/fetch/$s_!0uya!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c7f40ab-eb8f-4cc0-b9be-c1a1e2b0808b_2870x1465.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><h3>&#9989; What We&#8217;ve Covered So Far</h3><p>We&#8217;ve completed <strong>Modules 1 through 6</strong> &#8212; all the theory and best practices you need.<br>You can catch up on each lesson here:</p><h4><strong>1&#65039;&#8419; Introduction to SQL</strong></h4><ul><li><p> <strong>#1. What is SQL? &#8594; <a href="https://www.nb-data.com/p/2-what-is-sql">link</a></strong></p></li><li><p><strong>#2. Why Learn SQL? &#8594; <a href="https://www.databites.tech/p/2-why-learn-sql">link</a></strong></p></li><li><p><strong>#3. Relational Data &amp; Models &#8594; <a href="https://www.databites.tech/p/3-relational-data-and-models">link</a></strong></p></li></ul><h4><strong>2&#65039;&#8419; SQL Fundamentals</strong></h4><ul><li><p><strong>#4. Basic Commands (SELECT, FROM, WHERE) &#8594; <a href="https://www.nb-data.com/p/4-sql-basic-commands">link</a></strong></p></li><li><p><strong>#5. Sorting &amp; Limiting (ORDER BY, LIMIT) &#8594;</strong> <strong><a href="https://www.databites.tech/p/5-sorting-and-limiting">link</a></strong></p></li><li><p><strong>#6. Aggregate Functions (SUM, AVG, COUNT, etc.) &#8594; <a href="https://www.nb-data.com/p/6-aggregate-functions">link</a></strong></p></li></ul><h4><strong>3&#65039;&#8419; Intermediate SQL</strong></h4><ul><li><p><strong>#7. JOINS (INNER, LEFT, RIGHT, FULL) &#8594; <a href="https://www.databites.tech/p/7-joins-left-right-inner-and-full">link</a> </strong></p></li><li><p><strong>#8. UNION &amp; UNION ALL &#8594;</strong> <strong><a href="https://www.nb-data.com/p/8-union-and-union-all">link</a></strong></p></li><li><p><strong>#9. Case Expressions &#8594; <a href="https://www.databites.tech/p/9-case-expressions">link</a></strong></p></li><li><p><strong>#10. 
Functions (String, Date, Numeric) &#8594; <a href="https://www.nb-data.com/p/10-functions-string-date-numeric">link</a></strong></p></li></ul><h4><strong>4&#65039;&#8419; Advanced SQL</strong></h4><ul><li><p><strong>#11. Subqueries &#8594; <a href="https://www.nb-data.com/p/11-subqueries">link</a></strong></p></li><li><p><strong>#12. Common Table Expressions (CTEs) &#8594;</strong> <strong><a href="https://www.databites.tech/p/12-common-table-expressions-ctes">link</a></strong></p></li><li><p><strong>#13. Recursion &#8594;</strong> <strong><a href="https://www.nb-data.com/p/13-recursion">link</a></strong></p></li><li><p><strong>#14. Views  &#8594;</strong> <strong><a href="https://www.databites.tech/p/14-views">link</a></strong></p></li></ul><h4><strong>5&#65039;&#8419; Database Operations</strong></h4><ul><li><p><strong>#15. CRUD operations (INSERT, UPDATE, DELETE) &#8594;</strong><em> </em><strong><a href="https://www.nb-data.com/p/15-crud-operations">link</a></strong></p></li><li><p><strong>#16. Database modifications (ALTER, DROP, CREATE) &#8594;</strong> <strong><a href="https://www.databites.tech/p/16-database-modifications">link</a></strong></p></li><li><p><strong>#17. Indexing &amp; Optimization &#8594; <a href="https://www.nb-data.com/p/17-indexing-and-optimization">link</a></strong></p></li></ul><h4><strong>6&#65039;&#8419; Crafting Good SQL queries</strong></h4><ul><li><p><strong>#18. Modular Code &#8594;<a href="https://www.databites.tech/p/18-generating-modular-code">link</a></strong></p></li><li><p><strong>#19. SQL Execution Order &#8594; <a href="https://www.databites.tech/p/19-sql-execution-order">link</a></strong></p></li><li><p><strong>#20. Query Optimization &#8594; <a href="https://www.nb-data.com/p/20-query-optimization">link</a></strong> </p></li></ul><p>So the following question is&#8230; </p><h3>&#128284; What&#8217;s Next?</h3><p>Now it&#8217;s time to <strong>put theory into practice</strong>! 
Over the next few weeks, we&#8217;ll release hands-on exercises and projects to help you apply what you&#8217;ve learned:</p><h4>7&#65039;&#8419; Real-World Problems (Quick Wins)</h4><p>Easy-level problems, perfect for 30&#8211;60 minutes of practice:</p><ul><li><p>Problem 1 &#8594; 29th May</p></li><li><p>Problem 2 &#8594; 29th May</p></li><li><p>Problem 3 &#8594; 5th June</p></li><li><p>Problem 4 &#8594; 5th June</p></li></ul><h4>&#129514; Mini-Projects (Deeper Dives)</h4><p><strong>Medium-difficulty projects</strong> to consolidate your skills:</p><ul><li><p>Mini-Project 1 &#8594; 19th June</p></li><li><p>Mini-Project 2 &#8594; 19th June</p></li></ul><h4>&#128163; Final Projects (End-to-End Challenges)</h4><p><strong>Advanced, real-life projects</strong> released in multiple parts:</p><ul><li><p>Project 1 &#8594; 26th June (with multiple issues in the following weeks)</p></li><li><p>Project 2 &#8594; 26th June (with multiple issues in the following weeks)</p></li></ul><p>We&#8217;ll share full project briefs and walkthroughs &#8212; so stay tuned!</p><div><hr></div><h2>&#128161; Where to Follow Along?</h2><p>We&#8217;ll continue posting weekly updates in our newsletters:</p><ul><li><p><strong><a href="https://www.databites.tech">DataBites</a></strong> (<em>by Josep</em>)</p></li><li><p><strong><a href="#">Non-Brand Data</a></strong> (<em>by Cornellius</em>)</p></li></ul><p>&#128073; Check out the <strong><a href="https://github.com/CornelliusYW/SQL-Crash-Course">GitHub repo</a></strong> and stay tuned for the first post!</p><div><hr></div><p>Let&#8217;s dive in and <strong>make SQL less scary, more fun, and way more useful!</strong> &#128640;</p><p><strong>Josep &amp; Cornellius</strong></p>
]]></content:encoded></item><item><title><![CDATA[Why Resilience Is the New Hard Skill]]></title><description><![CDATA[Notes #10 - The skill that keeps you growing when plans fall apart]]></description><link>https://www.databites.tech/p/why-resilience-is-the-new-hard-skill</link><guid isPermaLink="false">https://www.databites.tech/p/why-resilience-is-the-new-hard-skill</guid><dc:creator><![CDATA[Josep Ferrer]]></dc:creator><pubDate>Tue, 20 May 2025 10:02:33 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!K1vK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35c24ac9-df4f-4491-b8cb-e474f1c2914d_890x892.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Hey everyone! &#128075;&#127996;</strong></p><p>Josep here, back with your weekly bite of career insights and encouragement &#10024;</p><p>Last week, we explored how <strong>mindset, not talent, shapes your future, and how a small shift in belief can unlock big growth.</strong></p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;64324a62-8bc1-42aa-bb2a-e6fe0b0d9cec&quot;,&quot;caption&quot;:&quot;Hey everyone! 
&#128075;&#127996;&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Fixed Mindset vs Groth Mindset&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:132707413,&quot;name&quot;:&quot;Josep Ferrer&quot;,&quot;bio&quot;:&quot;Outstand using data -- Data Science, Design and Tech Tech Writer @KDnuggets @DataCamp &#128073;&#127995;Inquiries in rfeers@gmail.com&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd196b5a6-59f2-46dd-99b3-e10ab1bbd27d_604x604.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-05-13T10:02:28.108Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3116b939-1d96-4d79-94a2-7f33d6839e88_1028x1036.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.databites.tech/p/fixed-mindset-vs-groth-mindset&quot;,&quot;section_name&quot;:&quot;Josep's Notes &#128640;&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:163454232,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:9,&quot;comment_count&quot;:1,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;DataBites&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe930fbab-b8df-40ef-9676-3d9ca5d49eae_714x714.png&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p>This week, we go even deeper. 
</p><p>&#127744;  <strong>What happens when the world keeps changing faster than your plans can keep up?</strong></p>
      <p>
          <a href="https://www.databites.tech/p/why-resilience-is-the-new-hard-skill">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[#19 SQL Execution Order]]></title><description><![CDATA[SQL Crash Course #19]]></description><link>https://www.databites.tech/p/19-sql-execution-order</link><guid isPermaLink="false">https://www.databites.tech/p/19-sql-execution-order</guid><dc:creator><![CDATA[Josep Ferrer]]></dc:creator><pubDate>Thu, 15 May 2025 11:03:01 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!nQRO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b59985c-3a44-4fde-b097-eab424ab26d1_1380x962.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p></p><div class="pullquote"><p><em>All the course material is stored in the <strong><a href="https://github.com/CornelliusYW/SQL-Crash-Course">SQL Crash Course repository</a></strong>.</em></p></div><p>Hi everyone! <strong>Josep</strong> and <a href="https://open.substack.com/users/6000855-cornellius-yudha-wijaya?utm_source=mentions">Cornellius Yudha Wijaya</a> from <a href="https://open.substack.com/pub/cornellius">Non-Brand Data</a> here &#128075;&#127995;</p><p>As promised, today we are publishing the next two issues of our <a href="https://www.databites.tech/p/launching-the-sql-crash-course-from">SQL Crash Course &#8211; From Zero to Hero!</a> &#128640;</p><p>I am sure you are here to continue our <strong>SQL Crash Course Journey!&#128218;</strong></p><p>If this is your first time or you&#8217;ve for&#8230;</p>
      <p>
          <a href="https://www.databites.tech/p/19-sql-execution-order">
              Read more
          </a>
      </p>
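Execution order is easiest to internalize by running it. As a companion to the lesson linked above, here is a minimal sketch using Python's built-in sqlite3 module (the toy table and values are invented for illustration, they are not course material): WHERE filters rows before GROUP BY forms the groups, while HAVING filters the groups afterwards.

```python
import sqlite3

# Toy data, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ana", 10), ("ana", 200), ("bob", 50), ("bob", 60), ("cat", 5)],
)

# WHERE runs before GROUP BY: small rows never reach the groups.
where_first = conn.execute(
    "SELECT customer, SUM(amount) AS total "
    "FROM orders WHERE amount > 20 "
    "GROUP BY customer ORDER BY customer"
).fetchall()
print(where_first)  # ana loses her 10, cat disappears entirely

# HAVING runs after GROUP BY: groups are built from all rows, then filtered.
having_after = conn.execute(
    "SELECT customer, SUM(amount) AS total "
    "FROM orders GROUP BY customer "
    "HAVING total > 100 ORDER BY customer"
).fetchall()
print(having_after)
```

Comparing the two result sets makes the FROM, WHERE, GROUP BY, HAVING, SELECT, ORDER BY pipeline concrete: ana's total is 200 in the first query but 210 in the second, because WHERE removed one of her rows before the sum was computed.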
   ]]></content:encoded></item><item><title><![CDATA[Fixed Mindset vs Growth Mindset]]></title><description><![CDATA[Notes #9 - Why a Growth Mindset Will Take You Further Than Talent Ever Could]]></description><link>https://www.databites.tech/p/fixed-mindset-vs-groth-mindset</link><guid isPermaLink="false">https://www.databites.tech/p/fixed-mindset-vs-groth-mindset</guid><dc:creator><![CDATA[Josep Ferrer]]></dc:creator><pubDate>Tue, 13 May 2025 10:02:28 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!dzj0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3116b939-1d96-4d79-94a2-7f33d6839e88_1028x1036.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Hey everyone! &#128075;&#127996;</strong></p><p>Josep here, back with your weekly bite of career insights and encouragement &#10024;</p><p>Last week, we explored <strong>how </strong><em><strong>soft skills</strong></em><strong> like emotional intelligence are becoming the new superpowers in an automated world.</strong></p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;eb6bc240-76c7-4733-ad51-da0ce23a8179&quot;,&quot;caption&quot;:&quot;Hey everyone! 
&#128075;&#127996;&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Why Soft Skills Are the New Hard Skills&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:132707413,&quot;name&quot;:&quot;Josep Ferrer&quot;,&quot;bio&quot;:&quot;Outstand using data -- Data Science, Design and Tech Tech Writer @KDnuggets @DataCamp &#128073;&#127995;Inquiries in rfeers@gmail.com&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd196b5a6-59f2-46dd-99b3-e10ab1bbd27d_604x604.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-05-06T10:02:25.908Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bc29622-1d0f-4b1b-9958-6873fa224109_1042x1342.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.databites.tech/p/why-soft-skills-are-the-new-hard&quot;,&quot;section_name&quot;:&quot;Josep's Notes &#128640;&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:162956699,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:14,&quot;comment_count&quot;:4,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;DataBites&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe930fbab-b8df-40ef-9676-3d9ca5d49eae_714x714.png&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p>This week, we go deeper.<br>&#127744;  <strong>Beneath every skill&#8212;technical or human&#8212;lies the mindset that fuels 
it.</strong></p>
      <p>
          <a href="https://www.databites.tech/p/fixed-mindset-vs-groth-mindset">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[#18 Generating Modular Code]]></title><description><![CDATA[SQL Crash Course #18]]></description><link>https://www.databites.tech/p/18-generating-modular-code</link><guid isPermaLink="false">https://www.databites.tech/p/18-generating-modular-code</guid><dc:creator><![CDATA[Josep Ferrer]]></dc:creator><pubDate>Fri, 09 May 2025 11:02:23 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!QHOl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99261a6f-e2fa-4c74-be13-00baae534558_1380x962.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p></p><div class="pullquote"><p><em>All the course material is stored in the <strong><a href="https://github.com/CornelliusYW/SQL-Crash-Course">SQL Crash Course repository</a></strong>.</em></p></div><p>Hi everyone! <strong>Josep</strong> and <a href="https://open.substack.com/users/6000855-cornellius-yudha-wijaya?utm_source=mentions">Cornellius Yudha Wijaya</a> from <a href="https://open.substack.com/pub/cornellius">Non-Brand Data</a> here &#128075;&#127995;</p><p>As promised, today we are publishing the next two issues of our <a href="https://www.databites.tech/p/launching-the-sql-crash-course-from">SQL Crash Course &#8211; From Zero to Hero!</a> &#128640;</p><p>I am sure you are here to continue our <strong>SQL Crash Course Journey!&#128218;</strong></p><p>If this is your first time or you&#8217;ve for&#8230;</p>
      <p>
          <a href="https://www.databites.tech/p/18-generating-modular-code">
              Read more
          </a>
      </p>
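In the spirit of the modular-code lesson linked above: CTEs let you name each step of a query and build the next step on top of it, instead of burying subqueries inside one another. A minimal sqlite3 sketch (table and numbers invented for illustration, not course material):

```python
import sqlite3

# Toy sales table, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 100), ("north", 50), ("south", 30)],
)

# Each CTE is a named, readable step; the final SELECT reads like a summary.
query = """
WITH region_totals AS (
    SELECT region, SUM(amount) AS total
    FROM sales
    GROUP BY region
),
big_regions AS (
    SELECT region, total FROM region_totals WHERE total > 60
)
SELECT region, total FROM big_regions ORDER BY region
"""
result = conn.execute(query).fetchall()
print(result)
```

The same logic could be written as one nested subquery, but the named steps are easier to read, test, and reuse, which is the core of modular SQL.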
   ]]></content:encoded></item><item><title><![CDATA[Why Soft Skills Are the New Hard Skills]]></title><description><![CDATA[Notes #8 - Why mastering emotional intelligence, adaptability, and empathy is your ultimate career advantage in the age of AI.]]></description><link>https://www.databites.tech/p/why-soft-skills-are-the-new-hard</link><guid isPermaLink="false">https://www.databites.tech/p/why-soft-skills-are-the-new-hard</guid><dc:creator><![CDATA[Josep Ferrer]]></dc:creator><pubDate>Tue, 06 May 2025 10:02:25 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Q_r7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bc29622-1d0f-4b1b-9958-6873fa224109_1042x1342.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Hey everyone! &#128075;&#127996;</strong></p><p>Josep here, back with your weekly bite of career insights and encouragement &#10024;</p><p>Last week, we explored <strong>how showing up&#8212;even when uninspired&#8212;can unlock your best creative work.</strong></p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;3ffc21fa-9c22-4a78-8def-69d016105f03&quot;,&quot;caption&quot;:&quot;Hey everyone! 
&#128075;&#127996;&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;You Don&#8217;t Need to Feel Inspired to Get Things Done&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:132707413,&quot;name&quot;:&quot;Josep Ferrer&quot;,&quot;bio&quot;:&quot;Outstand using data -- Data Science, Design and Tech Tech Writer @KDnuggets @DataCamp &#128073;&#127995;Inquiries in rfeers@gmail.com&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd196b5a6-59f2-46dd-99b3-e10ab1bbd27d_604x604.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-04-29T10:01:51.393Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1897a58-27aa-4791-9c63-0168599748f7_1040x1040.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.databites.tech/p/you-dont-need-to-feel-inspired-to&quot;,&quot;section_name&quot;:&quot;Josep's Notes &#128640;&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:162354465,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:9,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;DataBites&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe930fbab-b8df-40ef-9676-3d9ca5d49eae_714x714.png&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p>This week, we&#8217;re shifting focus to a different kind of edge: the human one.<br>&#127744; <strong>our 
most valuable skills might not be technical&#8212;they&#8217;re human.</strong></p>
      <p>
          <a href="https://www.databites.tech/p/why-soft-skills-are-the-new-hard">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[ML - What It Is, How It Works & Why It Matters]]></title><description><![CDATA[CS14 - An ML beginner-friendly guide with a visual cheatsheet.]]></description><link>https://www.databites.tech/p/ml-what-it-is-how-it-works-and-why</link><guid isPermaLink="false">https://www.databites.tech/p/ml-what-it-is-how-it-works-and-why</guid><dc:creator><![CDATA[Josep Ferrer]]></dc:creator><pubDate>Mon, 05 May 2025 14:02:28 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/736707c9-9eda-46d2-ab01-ff9b20f5ef19_1465x1057.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Machine learning is all around us, from your Netflix recommendations to the voice behind your phone&#8217;s assistant.<br>But how does it work? </p><p>And how can you get started?</p><p>Today I&#8217;m bringing a simple introduction to <strong>Machine Learning</strong>, <strong>its</strong> <strong>types</strong>, and <strong>real-world examples</strong> &#8212; plus giving you a cheatsheet to keep things crystal clear. &#128588;&#127995;</p><p>So let&#8217;s get started with the full-resolution cheatsheet &#128071;&#127995;</p>
      <p>
          <a href="https://www.databites.tech/p/ml-what-it-is-how-it-works-and-why">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[#16 Database Modifications ]]></title><description><![CDATA[SQL Crash Course #16]]></description><link>https://www.databites.tech/p/16-database-modifications</link><guid isPermaLink="false">https://www.databites.tech/p/16-database-modifications</guid><dc:creator><![CDATA[Josep Ferrer]]></dc:creator><pubDate>Fri, 02 May 2025 10:02:27 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!2Y-t!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75944a1c-58ad-4eba-8285-b85dda072d6a_1380x962.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p></p><div class="pullquote"><p><em>All the course material is stored in the <strong><a href="https://github.com/CornelliusYW/SQL-Crash-Course">SQL Crash Course repository</a></strong>.</em></p></div><p>Hi everyone! <strong>Josep</strong> and <a href="https://open.substack.com/users/6000855-cornellius-yudha-wijaya?utm_source=mentions">Cornellius Yudha Wijaya</a> from <a href="https://open.substack.com/pub/cornellius">Non-Brand Data</a> here &#128075;&#127995;</p><p>As promised, today we are publishing the next two issues of our <a href="https://www.databites.tech/p/launching-the-sql-crash-course-from">SQL Crash Course &#8211; From Zero to Hero!</a> &#128640;</p><p>I am sure you are here to continue our <strong>SQL Crash Course Journey!&#128218;</strong></p><p>If this is your first time or you&#8217;ve for&#8230;</p>
      <p>
          <a href="https://www.databites.tech/p/16-database-modifications">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[You Don’t Need to Feel Inspired to Get Things Done]]></title><description><![CDATA[Notes #7 - Why maniacal momentum beats waiting for the perfect moment, EVERY SINGLE TIME.]]></description><link>https://www.databites.tech/p/you-dont-need-to-feel-inspired-to</link><guid isPermaLink="false">https://www.databites.tech/p/you-dont-need-to-feel-inspired-to</guid><dc:creator><![CDATA[Josep Ferrer]]></dc:creator><pubDate>Tue, 29 Apr 2025 10:01:51 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!rNRG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1897a58-27aa-4791-9c63-0168599748f7_1040x1040.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Hey everyone! &#128075;&#127996;</strong></p><p>Josep here, back with your weekly bite of career insights and encouragement &#10024;</p><p>Last week, we dove into how <strong>GenAI is </strong><em><strong>not</strong></em><strong> the real threat</strong>, but how <em>you</em> adapt to it might be.</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;23be9ec7-dcb2-49bb-841f-8ec9814d80e4&quot;,&quot;caption&quot;:&quot;Hey everyone! 
&#128075;&#127996;&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;GenAI Won&#8217;t Replace You &#8212; But Someone Who Knows How to Use It Will&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:132707413,&quot;name&quot;:&quot;Josep Ferrer&quot;,&quot;bio&quot;:&quot;Outstand using data -- Data Science, Design and Tech Tech Writer @KDnuggets @DataCamp &#128073;&#127995;Inquiries in rfeers@gmail.com&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd196b5a6-59f2-46dd-99b3-e10ab1bbd27d_604x604.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-04-22T10:01:25.219Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F025a7b7b-0665-42e5-b456-adffc08f6741_1040x1464.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.databites.tech/p/genai-wont-replace-you-but-someone&quot;,&quot;section_name&quot;:&quot;Josep's Notes &#128640;&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:161863295,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:7,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;DataBites&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe930fbab-b8df-40ef-9676-3d9ca5d49eae_714x714.png&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p>This week, we&#8217;re taking a different kind of plunge &#8212; one that&#8217;s a little more personal, a 
little more uncomfortable, and incredibly important:<br>&#127744; <strong>Your work doesn&#8217;t have t&#8230;</strong></p>
      <p>
          <a href="https://www.databites.tech/p/you-dont-need-to-feel-inspired-to">
              Read more
          </a>
      </p>
   ]]></content:encoded></item></channel></rss>