Hey you all!
It’s Josep here, back for another week! 👋🏻
This time, I'm writing to you from NYC! Despite a bit of lingering jetlag, I'm excited to explore this incredible city with my family!
But I never forget my promises to you—so here I am, crafting this newsletter with some pretty great views! 😉
Stick with me for 8 minutes (reading time) — I promise it’s worth it!
The most important news of the week?
I just received my very first book to review, courtesy of Packt! The book is titled Transformers for Natural Language Processing and Computer Vision, written by Denis Rothman. I'm excited to dive into it, and I'll do my best to have the review ready by next week.
Besides that, there's not much else to share. As I mentioned in my previous issue, these coming weeks are going to be pretty laid-back.
Now that we’ve caught up on life updates, let's dive into the important stuff 👨🏻‍💻.
Even though the title might have given it away, today I want to dive into… Hugging Face!
For those of you picturing the 🤗 emoji… you're close!
But let’s set the stage a bit.
AI has moved far beyond research labs—it's now an integral part of our everyday lives.
One of the driving forces behind this revolution is Hugging Face, an open-source platform that has become essential for anyone working in Machine Learning (ML) and Natural Language Processing (NLP).
Whether you're an experienced data scientist or just starting, Hugging Face offers a wide variety of tools and resources to help you bring your AI projects to life.
Trust me when I say, you’ll want to be a part of it!
Before we dive in, I strongly recommend checking out my previous issue on How to Get Started with LLMs (if you haven't already). Trust me, it’s a great primer! 👇🏻
Hugging Face, or The GitHub of ML
Hugging Face is often described as the "GitHub of the ML world": a collaborative platform hosting thousands of pre-trained models and datasets (ready to be loaded and used!!).
What sets it apart is its community-driven approach, making AI more accessible and democratized.
The platform's most notable contribution is the Transformers library, which provides a comprehensive suite of state-of-the-art NLP models. These models, which come pre-trained, significantly simplify the process of deploying NLP applications, allowing developers to focus more on fine-tuning and less on the underlying complexities.
But wait… where does this company come from? 🤔
From Chatbot to Open-Source Powerhouse
Founded in 2016, Hugging Face originally aimed to create a chatbot targeted at teenagers. However, the company quickly pivoted after open-sourcing its underlying model, leading to the creation of the Transformers library in 2018.
Today, Hugging Face is a central hub for AI professionals and enthusiasts, fostering a community that continually pushes the boundaries of what's possible with machine learning.
Isn’t it crazy how fast things can change?
The core features behind Hugging Face
One of the biggest advantages of Hugging Face is how easy it is to get started. After installing the necessary libraries, you can quickly begin using pre-trained models for a variety of tasks, from sentiment analysis to text generation (we will talk more about this in the coming issues!!).
The platform also supports fine-tuning, allowing you to adapt models to your specific needs with minimal effort.
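To give you a sense of just how little code this takes, here's a minimal sketch (assuming a recent Python environment; the example sentence and printed output are illustrative):

```python
# Install the core library first:
# pip install transformers

from transformers import pipeline

# With no model specified, the pipeline picks a sensible default
# for the task (a DistilBERT model fine-tuned for sentiment,
# at the time of writing)
classifier = pipeline("sentiment-analysis")

result = classifier("I love writing this newsletter!")
print(result)  # e.g. [{'label': 'POSITIVE', 'score': 0.9998}]
```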
Some core components of Hugging Face are:
#1. Transformers Library
The Transformers library is a comprehensive suite of state-of-the-art ML models designed primarily for NLP. It contains an extensive collection of pre-trained models optimized for tasks such as text classification, language generation, translation, and summarization, among others.
The Transformers library abstracts common NLP tasks into the pipeline() method, an easy-to-use API for performing a wide variety of tasks.
These pipelines allow users to easily apply complex models to real-world problems. If you want to learn more about what’s behind this library, I strongly recommend reading the article An Introduction to Using Transformers and Hugging Face.
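To illustrate the same API on a second task, here's a hedged sketch of text generation (gpt2 is just one example of a model ID you could pass in):

```python
from transformers import pipeline

# The same pipeline() API, this time with an explicit model from the Hub
generator = pipeline("text-generation", model="gpt2")

output = generator(
    "Hugging Face makes NLP",
    max_length=30,           # total length: prompt + generated tokens
    num_return_sequences=1,  # how many completions to return
)
print(output[0]["generated_text"])
```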
The Transformers library simplifies the implementation of NLP models in several key ways:
Abstraction of complexity: It abstracts away the complexity involved in initializing models, managing pipelines, and handling tokenization.
Pre-trained models: By providing one of the largest collections of pre-trained models available, the library reduces the time and resources required to develop NLP applications from scratch.
Flexibility and modularity: The library is designed with modularity in mind, allowing users to plug in different components as required.
Community and support: Hugging Face has fostered a strong community around its tools, with extensive documentation, tutorials, and forums.
Continuous updates and expansion: The library is constantly updated with the latest breakthroughs in NLP, incorporating new models and methodologies.
#2. Model Hub
The Model Hub is the community-facing side of the platform, where thousands of models and datasets are at your fingertips. It allows users to share and discover models contributed by the community, promoting a collaborative approach to NLP development.
You can check it out on the official website. There, you can reach the Model Hub by clicking the Models button in the navigation bar, and a view like the following should appear:
As you can see, the left sidebar offers multiple filters based on the main task you want the model to perform.
Contributing to the Model Hub is made straightforward by Hugging Face's tools, which guide users through the process of uploading their models. Once contributed, these models are available for the entire community to use, either directly through the hub or via integration with the Hugging Face Transformers library.
Isn’t it exciting?
This ease of access and contribution fosters a dynamic ecosystem where state-of-the-art models are constantly refined and expanded upon, providing a rich, collaborative foundation for NLP advancement.
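To make that concrete: any model ID you find on the Hub can be pulled down in a couple of lines. A minimal sketch (the model ID below is one popular sentiment model; swap in whichever model you discover):

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Any model ID from the Model Hub works here
model_id = "distilbert-base-uncased-finetuned-sst-2-english"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

# Contributing works the same way in reverse: after authenticating
# (huggingface-cli login), you can publish your own fine-tuned model:
# model.push_to_hub("my-username/my-model")  # hypothetical repo name
```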
#3. Tokenizers
Tokenizers are crucial in NLP: they convert text into a format that machine learning models can understand, which is essential for processing different languages and text structures.
They do this by breaking text down into tokens, basic units like words, subwords, or characters, preparing the data for models to process. These tokens are the building blocks that enable models to understand and generate human language.
They also facilitate the transformation of tokens into vector representations for model input and handle padding and truncation for uniform sequence lengths.
Hugging Face provides a range of user-friendly tokenizers, optimized for their Transformers library, which are key to the seamless preprocessing of text. You can read more about Tokenization in a separate article.
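Here's a small sketch of what that looks like in practice (bert-base-uncased is just an example model; the printed tokens are approximate):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Hugging Face tokenizers are great!"

# Break the text into subword tokens
print(tokenizer.tokenize(text))
# e.g. ['hugging', 'face', 'token', '##izer', '##s', 'are', 'great', '!']

# Padding and truncation give every sequence the same length
encoded = tokenizer(text, padding="max_length", max_length=12, truncation=True)
print(encoded["input_ids"])  # token IDs, padded up to length 12
```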
#4. Datasets Library
Another key component is the Hugging Face Datasets library, a vast repository of NLP datasets that support the training and benchmarking of ML models.
This library is a crucial tool for developers in the field, offering a diverse collection of datasets that can be used to train, test, and benchmark NLP models across a wide variety of tasks.
One of its main benefits is its simple, user-friendly interface: while you can browse and explore all the datasets on the Hugging Face Hub, the Datasets library lets you download any of them into your code effortlessly.
It includes datasets for common tasks such as text classification, translation, and question-answering, as well as more specialized datasets for unique challenges in the field.
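Loading one of these datasets really is a one-liner. A minimal sketch using the classic IMDB movie-review dataset (just one of the many text-classification datasets on the Hub):

```python
# Install it first: pip install datasets
from datasets import load_dataset

# Downloads and caches the IMDB movie-review dataset from the Hub
dataset = load_dataset("imdb")

print(dataset)              # a DatasetDict with its splits (train, test, ...)
print(dataset["train"][0])  # a single example: {'text': ..., 'label': ...}
```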
Use Cases and Applications
Hugging Face models are versatile and can be applied to numerous NLP tasks, including:
Text Classification: Categorizing text into predefined classes.
Text Generation: Producing human-like text from a given prompt.
Question Answering: Providing answers to questions based on a given context (see the sketch after this list).
Translation: Converting text from one language to another with high accuracy.
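And since all of these tasks are exposed through the same pipeline() API, trying any of them looks nearly identical. A hedged sketch of question answering (the question and context are made-up examples):

```python
from transformers import pipeline

# With no model specified, a default extractive QA model is used
qa = pipeline("question-answering")

answer = qa(
    question="Where was this newsletter written?",
    context="Josep wrote this week's issue while visiting New York City.",
)
print(answer["answer"])  # e.g. 'New York City'
```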
Next week we will get started with Hugging Face, so stay tuned! 💥
Are you still here? 🧐
👉🏻 I want this newsletter to be useful, so please let me know your feedback!
Additionally, let me know any preferences for future content, or any ideas you think might add more value for you!
My latest articles 📝
Optimizing Your LLM for Performance and Scalability in KDnuggets.
How ChatGPT is Changing the Face of Programming in KDnuggets.
The Transformers Architecture - What’s the magic behind LLMs? in AIgents.
Nice articles (my weekly favs)! ♥
Being a Data Scientist Expat: An American Working In Stockholm by
📖 My Best Data Science Resources by
My Favorite Resources! 👨🏻💻
I strongly recommend following AIgents roadmaps if you want to become a fully-fledged Data Scientist, ML Engineer, or Data Engineer.
Understand ML models by following my MLBasics series!
You can follow the Data Engineering Zoomcamp on GitHub to become a fully proficient Data Engineer.
Want to learn GCP? You can follow The Cloud Girl and learn using her intuitive illustrations!
Want to get more of my content? 🙋🏻‍♂️
Reach me on:
LinkedIn, X (Twitter), or Threads to get daily posts about Data Science.
My Medium Blog to learn more about Data Science, Machine Learning, and AI.
Just email me at rfeers@gmail.com for any inquiries or to ask for help! 🤓
Remember that DataBites now has an official X (Twitter) account and LinkedIn page. Follow us there to stay updated and help spread the word! 🙌🏻
Welcome to the Big Apple, Josep!!
Let me know how the new book is!! Always looking out for new books on NLP!!
Always love your articles on Hugging Face!! It's my most used and favorite tool!!!
Thanks for the shoutout!!
Thanks for this post, Josep!
I've seen the platform's name come up quite a few times recently, and this was a really helpful primer to understand it better.