Hey you all!
It’s Josep here, one more week! 👋🏻
I've now spent a full week in NYC, and I’m loving my time in the Big Apple! 🍎
There’s not much to report on the professional front, but I did get to experience something incredibly exciting—I had a demo of the Vision Pro! 🎉
Now that we’ve caught up on life updates, let's dive into the important stuff 👨🏻💻.
Last week, I introduced you to the basics of Hugging Face. Today, let’s build on that foundation by diving into how to get started with this powerful tool.
That’s why, before we dive in, I strongly recommend checking out my previous issue on Understanding Hugging Face - The AI Community's Open-Source Oasis (if you haven't already).
That way, we’re all on the same page and properly understand the core concepts behind Hugging Face.
Today’s issue will walk you briefly through the basics of getting started with Hugging Face, including installation, using pre-trained models, fine-tuning, and sharing your models with the community.
Sounds exciting, right?
Let’s get started 💥
Getting Started with Hugging Face
Before you can start exploring Hugging Face, you’ll need to install it on your local machine.
Installation
First, you should combine the `transformers` library with your favorite deep learning library, either TensorFlow or PyTorch.
The `transformers` library can be easily installed using `pip`, Python's package installer.
pip install transformers
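Installing `transformers` does not pull in a deep learning backend by itself, so if you don't have one yet you can add it with pip as well (PyTorch is shown here as an example; TensorFlow works too):
pip install torch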
To get the full capability, you should also install the `datasets` and `tokenizers` libraries.
pip install datasets tokenizers
Hugging Face's Model Hub offers a huge collection of pre-trained models that you can use for a wide range of NLP tasks, and there is plenty we can do with them.
The first thing we can do is use a pre-trained model directly.
1. Using Pre-trained Models
#1 Select a Pre-trained Model
First, you need to select a pre-trained model. To do so, we go to the Model Hub.
Imagine we want to infer the sentiment of a string of text. We can browse only the models that perform `Text Classification` tasks by selecting the Text Classification button on the left sidebar.
Hugging Face models are ordered by Trending by default, and the top results are usually the most widely used ones.
So, we select the second result, which is the most used sentiment analysis model.
To use it, we need to copy the model's name, which can be found at the top of its page.
#2 Load a pre-trained model
Now that we know which model to use, let’s use it in Python. First, we need to import the `AutoTokenizer` and `AutoModelForSequenceClassification` classes from `transformers`.
Using these AutoModel classes will automatically infer the model architecture from the model name.
from transformers import AutoTokenizer, AutoModelForSequenceClassification
model_name = "lxyuan/distilbert-base-multilingual-cased-sentiments-student"
# Define a model object from the pre-trained checkpoint
model = AutoModelForSequenceClassification.from_pretrained(model_name)
#3 Prepare your input
Next, load a tokenizer for our model. The `transformers` library makes this easy: it infers which tokenizer to use from the name of the model we have chosen.
# Load the tokenizer that matches our model
tokenizer = AutoTokenizer.from_pretrained(model_name)
#4 Run the model
Generate a pipeline object with the chosen model, the tokenizer, and the task to be performed, in our case sentiment analysis. If you initialize the classifier object with only the task, the pipeline class fills in default values for the model and tokenizer; this works, but it is not recommended in production.
from transformers import pipeline

# Initializing a classifier with a model and a tokenizer
classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)

# When passing only the task, the pipeline infers both the model and the tokenizer
classifier = pipeline("sentiment-analysis")
We can now run the model by passing it some input.
output = classifier("I've been waiting for this tutorial all my life!")
And we will obtain the results right away!
Which leads to the following (and final) step…
#5 Interpret the outputs
The model returns an object whose contents depend on the model's class. For this sentiment analysis pipeline, we get:
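For a sentiment analysis pipeline, the output is a list with one dictionary per input, roughly along these lines (values from this example):
[{'label': 'positive', 'score': 0.579}]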
In this instance, the input string has been classified as positive with a confidence score of 0.579, which reflects the model's certainty in its classification.
A second task we can do using HF is fine-tuning a model.
2. Fine-tuning models
Fine-tuning is the process of taking a pre-trained model and updating its parameters by training on a dataset specific to your task. This allows you to leverage the model's learned representations and adapt them to your use case.
Imagine we need to use a text-classifier model to infer sentiments from a list of tweets. One natural question that comes to mind is:
Will this pre-trained model work properly?
To make sure it does, we can take advantage of fine-tuning: we further train a pre-trained Hugging Face model on a dataset of tweets and their corresponding sentiments so that its performance on this kind of data improves.
Here's a basic example of fine-tuning a model for sequence classification:
#1. Choose a pre-trained model and a dataset
Select a model architecture suitable for your task. In this case, we want to keep using the same sentiment analysis model.
However, now we need some data to train our model, and this is precisely where the `datasets` library kicks in. We can browse all the available datasets on the Hugging Face Hub and find the one that best fits our needs.
In my case, I’ll be using the tweet sentiment extraction dataset (mteb/tweet_sentiment_extraction).
Now that we know which dataset to use, we can simply initialize both the model and the dataset.
from datasets import load_dataset

model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Loading the dataset to train our model
dataset = load_dataset("mteb/tweet_sentiment_extraction")
If we check the dataset we just downloaded, it is a dictionary containing a training subset and a testing subset. If we convert the training subset to a dataframe, it looks as follows:
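If you want to reproduce that quick look yourself, a minimal way is the dataset's to_pandas() method (the split exposes, among other columns, the tweet text and its sentiment label):
# Convert the training split to a pandas DataFrame for a quick inspection
train_df = dataset["train"].to_pandas()
print(train_df.head())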
#2. Prepare Your dataset
Now that we have our dataset, we need a tokenizer to prepare it for our model. The text column of the dataset needs to be tokenized so we can use it to fine-tune our model.
That is why the second step is to load a pre-trained tokenizer and tokenize our dataset so it can be used for fine-tuning.
tokenizer = AutoTokenizer.from_pretrained(model_name)
def tokenize_function(examples):
    # Tokenize the tweet text, padding/truncating to the model's maximum length
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)
#3. Build a PyTorch dataset with encodings
The third step is to generate training and testing datasets. The training set will be used to fine-tune our model, while the testing set will be used to evaluate it.
Fine-tuning usually takes a long time, so to keep this tutorial light we randomly sample both subsets to reduce computation time.
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
#4. Fine-tune the model
Our final step is to set up the training arguments and start the training process. The `transformers` library contains the `Trainer` class, which takes care of everything. We first define the training arguments together with the evaluation strategy. Once everything is defined, we can train the model with the `train()` command.
from transformers import Trainer, TrainingArguments
import evaluate
import numpy as np

training_args = TrainingArguments(output_dir="trainer_output", evaluation_strategy="epoch")

# Accuracy metric used to evaluate the model
metric = evaluate.load("accuracy")
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)
trainer.train()
#5. Evaluate the model
After training, evaluate the model's performance on a validation or test set. Again, the `Trainer` class already provides an `evaluate()` method that takes care of this.
trainer.evaluate()
Our fine-tuned model achieves an accuracy of 70%.
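For reference, evaluate() returns a dictionary of metrics; with the compute_metrics function defined above, the accuracy shows up under an eval_-prefixed key, roughly like this (values are illustrative):
results = trainer.evaluate()
print(results)  # e.g. {'eval_accuracy': 0.70, 'eval_loss': ..., ...}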
Now that we have already improved our model, how can we share it with the community?
This brings us to our final step…
#6. Sharing Models
Once we’ve fine-tuned our new model, the best idea is to share it with the community.
Hugging Face makes this process straightforward. First, we need to install the `huggingface_hub` library.
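If it isn't already in your environment, it can be installed with pip as well:
pip install huggingface_hub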
A requirement for this final step is to have an active token so we can connect to our Hugging Face account; you can easily get one by following this guideline. When working in a Jupyter Notebook, we can simply import the `notebook_login` function.
from huggingface_hub import notebook_login
notebook_login()
This will display a login prompt within our Jupyter Notebook. We just need to submit our token, and the notebook will be connected to our Hugging Face account.
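From there, a straightforward way to publish the fine-tuned model is the push_to_hub() method that both models and tokenizers provide (the repository name below is just a placeholder, so pick your own):
# Push the fine-tuned model and its tokenizer to your Hugging Face profile
model.push_to_hub("my-finetuned-sentiment-model")
tokenizer.push_to_hub("my-finetuned-sentiment-model")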
After this, the model will be available to everyone on our Hugging Face profile.
And that’s all for now! I hope this first basic tutorial was useful in helping you get started with Hugging Face. If you have any further questions, feel free to ask! 😉
Are you still here? 🧐
👉🏻 I want this newsletter to be useful, so please let me know your feedback!
Additionally, you can let me know any preference for future content or any idea you think might add more value to you!
My latest articles 📝
Nice articles (my weekly favs)! ♥
Referrals: The Most Effective Way to Land Data Science Interviews by
The life cycle in a Data Science project by
Grid Search vs. Random Search vs. Bayesian Optimization by
My Favorite Resources! 👨🏻💻
I strongly recommend following AIgents roadmaps if you want to become a full Data Scientist, ML Engineer, or Data Engineer.
Understand ML models following my MLBasics series!
You can follow the Data Engineering Zoomcamp on GitHub to become a fully proficient Data Engineer.
Want to learn GCP? You can follow The Cloud Girl and learn using her intuitive illustrations!
Looking for your new Copilot tool? You can get started today with Pieces for free!
Want to get more of my content? 🙋🏻♂️
Reach me on:
LinkedIn, X (Twitter), or Threads to get daily posts about Data Science.
My Medium Blog to learn more about Data Science, Machine Learning, and AI.
Just email me at rfeers@gmail.com for any inquiries or to ask for help! 🤓
Remember that DataBites now has an official X (Twitter) account and LinkedIn page. Follow us there to stay updated and help spread the word! 🙌🏻