CS11 - Understanding Data Collection with APIs
A hands-on guide to collecting structured data with Python and real-world APIs
In today’s world, knowing how to collect clean, relevant, and timely data is essential for any data professional. While there are many ways to gather data, one of the most reliable and scalable methods is through APIs (Application Programming Interfaces).
Cheatsheet & code at the end ‼️
In this issue, we’ll break down the essentials of using APIs for data collection, from how they work to how you can start using them in Python, complete with a practical example using Eurostat, the EU’s official statistics portal.
Let’s dive in! 👇🏻
🤔 What’s an API, and Why Should You Care?
An API (Application Programming Interface) is a set of rules that lets two software systems communicate.
Imagine a restaurant 🧑🏻‍🍳
You don’t go into the kitchen to prepare your food—you place your order with the waiter, who passes your request to the chef and brings your food back.
Similarly, an API receives your request, fetches the data from a source, and returns it in a structured format (usually JSON or XML).
This makes it easy to integrate external data into your apps or analyses.
Anatomy of an API Call 🔧
Every API interaction typically includes:
Client: The software requesting the data (e.g., your Python script).
Request: The structure of what you’re asking for.
Headers: Extra information sent along with the request, such as API keys.
API Server: The system that responds to your request.
API Endpoint: The URL used to access specific data or actions.
Response: The result you get back, often in a machine-readable format.
This communication allows applications to share information or functionalities efficiently, enabling tasks like fetching data from a database or interacting with third-party services.
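To make these pieces concrete, here is a minimal sketch that builds a request object without sending it, so each part can be inspected. It uses only Python's standard library; the endpoint, the parameters, and the `YOUR_API_KEY` placeholder are illustrative assumptions, not a real service.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint and values, used only for illustration
endpoint = "https://api.example.com/data"
params = {"query": "example", "limit": 10}
headers = {"Authorization": "Bearer YOUR_API_KEY"}

# Build the request object; nothing is sent over the network here
request = Request(f"{endpoint}?{urlencode(params)}", headers=headers, method="GET")

print("Method:  ", request.get_method())                # the kind of request
print("Server:  ", request.host)                        # the API server
print("Endpoint:", request.full_url)                    # endpoint plus query parameters
print("Header:  ", request.get_header("Authorization")) # extra info such as the API key
```

The client is the script itself, and the response is whatever the server would send back once the request is actually dispatched.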
Why Use APIs for Data Collection?
APIs offer several advantages for data collection:
Efficiency: They provide direct access to data, eliminating the need for manual data gathering.
Real-time Access: APIs often deliver up-to-date information, which is essential for time-sensitive analyses.
Automation: They enable automated data retrieval processes, reducing human intervention and potential errors.
Scalability: APIs can handle large volumes of requests, making them suitable for extensive data collection tasks.
Implementing API Calls in Python
Making an introductory API call in Python is one of the easiest and most practical exercises to get started with data collection. The popular requests library makes it simple to send HTTP requests and handle responses.
A simple API call looks as follows 👇🏻
import requests

# Define the API endpoint
url = "https://api.example.com/data"

# Optional headers (e.g., for authentication)
headers = {
    "Authorization": "Bearer YOUR_API_KEY",
    "Content-Type": "application/json"
}

# Optional parameters or payload
params = {
    "query": "example",
    "limit": 10
}

# Make the GET request
response = requests.get(url, headers=headers, params=params)

# Print the response
if response.status_code == 200:
    print("Success:", response.json())
else:
    print("Error:", response.status_code, response.text)
Example 1: Using the Random User API
To demonstrate how it works, we'll use the Random User Generator API, a free service that provides dummy user data in JSON format, perfect for testing and learning.
Here’s a step-by-step guide to making your first API call in Python.
STEP 1 - Install the Requests Library:
pip install requests
STEP 2 - Import the Libraries:
import requests
import pandas as pd
STEP 3 - Check the documentation page
Before making any requests, it's important to understand how the API works. This includes reviewing available endpoints, parameters, and response structure. Start by visiting the Random User API documentation.
STEP 4 - Define the API Endpoint and Parameters:
Based on the documentation, we can construct a simple request. In this example, we fetch user data limited to users from the United States:
url = 'https://randomuser.me/api/'
params = {'nat': 'us'}
STEP 5 - Make the GET Request:
Use the requests.get() function with the URL and parameters:
response = requests.get(url, params=params)
STEP 6 - Handle the Response:
Check whether the request was successful, then process the data:
if response.status_code == 200:
    data = response.json()
    # Process the data as needed
else:
    print(f"Error: {response.status_code}")
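Status codes other than 200 carry useful information too. The sketch below is one possible way to turn a code into a short, readable message using the standard library's `http.HTTPStatus`. The `classify_status` helper and its wording are assumptions for illustration, not part of any API.

```python
from http import HTTPStatus

def classify_status(code):
    """Return a short description of an HTTP status code (hypothetical helper)."""
    phrase = HTTPStatus(code).phrase  # raises ValueError for unknown codes
    if 200 <= code < 300:
        outcome = "success"
    elif code == 429:
        outcome = "rate limited, retry later"
    elif 400 <= code < 500:
        outcome = "client error, fix the request"
    elif 500 <= code < 600:
        outcome = "server error, retry later"
    else:
        outcome = "informational or redirect"
    return f"{code} {phrase}: {outcome}"

for code in (200, 404, 429, 503):
    print(classify_status(code))
```

In a real script, you could pass `response.status_code` to a helper like this before deciding whether to parse the body.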
STEP 7 - Convert our data into a dataframe
To work with the data easily, we can convert it into a pandas DataFrame:
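if response.status_code == 200:
    data = response.json()
    # The user records live under the "results" key; flatten them into columns
    df = pd.json_normalize(data["results"])
    print(df.head())
else:
    print(f"Error: {response.status_code}")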
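Responses from this API nest fields such as the name inside each user record, so turning them into table columns means flattening them first. pandas offers `pd.json_normalize` for this; the sketch below shows the same idea by hand with the standard library only. The `flatten` helper and the sample payload are invented for illustration and only mimic the shape of a real response.

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dictionaries into a single level with dotted keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the parent key as a prefix
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat

# Invented payload that mimics the shape of a Random User response
sample_response = {
    "results": [
        {
            "gender": "female",
            "name": {"title": "Ms", "first": "Jane", "last": "Doe"},
            "nat": "US",
        },
    ]
}

rows = [flatten(user) for user in sample_response["results"]]
print(rows[0])
```

Each flattened dictionary becomes one row, and keys such as `name.first` become column names.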
Now, let’s apply this to a real case.
Example 2: Working with the Eurostat API
Eurostat is the statistical office of the European Union. It provides high-quality, harmonized statistics on a wide range of topics such as economics, demographics, environment, industry, and tourism—covering all EU member states.
Through its API, Eurostat offers public access to a vast collection of datasets in machine-readable formats, making it a valuable resource for data professionals, researchers, and developers interested in analyzing European-level data.
STEP 0 - Understand the data contained in the API
If you check the Data section of the Eurostat website, you will find a navigation tree with several sections.
We can try to identify some data of interest in the following subsections:
Detailed Datasets: Full Eurostat data in multi-dimensional format.
Selected Datasets: Simplified datasets with fewer indicators, in 2–3 dimensions.
EU Policies: Data grouped by specific EU policy areas.
Cross-cutting: Thematic data compiled from multiple sources.
STEP 1 - Checking its documentation
Always start with the documentation. You can find Eurostat’s API guide [here](https://wikis.ec.europa.eu/display/EUROSTATHELP/API+-+Getting+started+with+statistics+API). It explains the API structure, available endpoints, and how to form valid requests.
STEP 2 - Generating the first call request
To generate an API request using Python, the first step is installing and importing the requests library. Remember, we already installed it in the previous example. Then, we can easily generate a request using a demo dataset from the Eurostat documentation.
# We import the requests library
import requests
# Define the URL endpoint -> We use the demo URL from the Eurostat API documentation.
url = "https://ec.europa.eu/eurostat/api/dissemination/statistics/1.0/data/DEMO_R_D3DENS?lang=EN"
# Make the GET request
response = requests.get(url)
# Print the status code and response data
print(f"Status Code: {response.status_code}")
print(response.json()) # Print the JSON response
Pro tip: We can split the URL into the base URL and its parameters to make it easier to understand what data we are requesting from the API.
# We import the requests library
import requests
# Define the URL endpoint -> We use the demo URL from the Eurostat API documentation.
url = "https://ec.europa.eu/eurostat/api/dissemination/statistics/1.0/data/DEMO_R_D3DENS"
# Define the parameters -> We define the parameters to add to the URL.
params = {
    'lang': 'EN'  # Specify the language as English
}
# Make the GET request
response = requests.get(url, params=params)
# Print the status code and response data
print(f"Status Code: {response.status_code}")
print(response.json()) # Print the JSON response
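To see that both forms describe the same request, here is a small sketch that splits the demo URL into a base URL and a parameter dictionary, then rebuilds it. It relies only on `urllib.parse` from the standard library and makes no network call.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

full_url = (
    "https://ec.europa.eu/eurostat/api/dissemination/statistics/1.0/data/"
    "DEMO_R_D3DENS?lang=EN"
)

# Split the URL into its components
parts = urlsplit(full_url)
base_url = f"{parts.scheme}://{parts.netloc}{parts.path}"
params = dict(parse_qsl(parts.query))

# Rebuild the full URL from the base URL and the parameters
rebuilt = f"{base_url}?{urlencode(params)}"

print(base_url)
print(params)
print(rebuilt == full_url)
```

Keeping the parameters in a dictionary makes it easy to add or change filters later without editing the URL string by hand.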
STEP 3 - Determining what dataset to call
Instead of using the demo dataset, you can select any dataset from the Eurostat database. For example, let's query the dataset TOUR_OCC_ARN2, which contains tourism accommodation data.
# We import the requests library
import requests
# Define the URL endpoint -> We combine the base URL with the code of the dataset we want.
base_url = "https://ec.europa.eu/eurostat/api/dissemination/statistics/1.0/data/"
dataset = "TOUR_OCC_ARN2"
url = base_url + dataset
# Define the parameters -> We define the parameters to add to the URL.
params = {
    'lang': 'EN'  # Specify the language as English
}
# Make the GET request -> we generate the request and obtain the response
response = requests.get(url, params=params)
# Print the status code and response data
print(f"Status Code: {response.status_code}")
print(response.json())  # Print the JSON response
STEP 4 - Understand the response
Eurostat’s API returns data in JSON-stat format, a standard for multidimensional statistical data. You can save the response to a file and explore its structure:
import requests
import json
# Define the URL endpoint and dataset
base_url = "https://ec.europa.eu/eurostat/api/dissemination/statistics/1.0/data/"
dataset = "TOUR_OCC_ARN2"
url = base_url + dataset
# Define the parameters to add to the URL
params = {
    'lang': 'EN',  # Specify the language as English
    "time": 2019   # Keep only the observations for 2019
}
# Make the GET request and obtain the response
response = requests.get(url, params=params)
# Check the status code and handle the response
if response.status_code == 200:
    # Parse the JSON response
    data = response.json()
    # Generate a JSON file and write the response data into it
    with open("eurostat_response.json", "w") as json_file:
        json.dump(data, json_file, indent=4)  # Save JSON with pretty formatting
    print("JSON file 'eurostat_response.json' has been successfully created.")
else:
    print(f"Error: Received status code {response.status_code} from the API.")
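Before opening the saved file, it helps to know roughly what a JSON-stat dataset looks like. The sketch below uses a tiny hand-made dataset with invented values and only the keys used in this guide (`id`, `size`, `dimension`, `value`); a real Eurostat response contains more metadata. The `summarize` helper is a hypothetical name.

```python
# Hand-made miniature of a JSON-stat dataset (values are invented for illustration)
sample = {
    "id": ["geo", "time"],   # dimension names, in order
    "size": [2, 3],          # number of categories per dimension
    "dimension": {
        "geo": {"category": {
            "index": {"ES": 0, "FR": 1},
            "label": {"ES": "Spain", "FR": "France"},
        }},
        "time": {"category": {
            "index": {"2017": 0, "2018": 1, "2019": 2},
            "label": {"2017": "2017", "2018": "2018", "2019": "2019"},
        }},
    },
    # Flat positions mapped to observations; position "3" is missing (no data)
    "value": {"0": 10.0, "1": 11.0, "2": 12.0, "4": 21.0, "5": 22.0},
}

def summarize(dataset):
    """Report the dimensions and how many cells actually hold a value."""
    total_cells = 1
    for size in dataset["size"]:
        total_cells *= size
    return {
        "dimensions": dict(zip(dataset["id"], dataset["size"])),
        "total_cells": total_cells,
        "filled_cells": len(dataset["value"]),
    }

print(summarize(sample))
```

The key point is that `value` is a flat collection: each observation is stored under a single position rather than under a (country, year) pair.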
STEP 5 - Transform the response into usable data.
Now that we have the data, we can convert it into a tabular format (CSV) to make it easier to analyze.
import requests
import pandas as pd
# Step 1: Make the GET request to the Eurostat API
base_url = "https://ec.europa.eu/eurostat/api/dissemination/statistics/1.0/data/"
dataset = "TOUR_OCC_ARN2" # Tourist accommodation statistics dataset
url = base_url + dataset
params = {'lang': 'EN'} # Request data in English
# Make the API request
response = requests.get(url, params=params)
# Step 2: Check if the request was successful
if response.status_code == 200:
    data = response.json()
    # Step 3: Extract the dimensions and metadata
    dimensions = data['dimension']
    dimension_order = data['id']  # e.g. ['geo', 'time', 'unit', 'indic', etc.]
    # Extract labels for each dimension dynamically
    dimension_labels = {dim: dimensions[dim]['category']['label'] for dim in dimension_order}
    # Step 4: Determine the size of each dimension
    dimension_sizes = {dim: len(dimensions[dim]['category']['index']) for dim in dimension_order}
    # Step 5: List the category codes of each dimension, ordered by their position in 'index'
    # (ordering by 'index' avoids relying on the key order of the 'label' dictionary)
    index_labels = {
        dim: sorted(dimensions[dim]['category']['index'], key=dimensions[dim]['category']['index'].get)
        for dim in dimension_order
    }
    # Step 6: Create a list of rows for the CSV
    rows = []
    for key, value in data['value'].items():
        # `key` is a flat position such as '123'; decode it into one code per dimension
        index = int(key)  # Convert string index to integer
        # Walk the dimensions from last to first, since the last one varies fastest
        indices = {}
        for dim in reversed(dimension_order):
            dim_index = index % dimension_sizes[dim]
            indices[dim] = index_labels[dim][dim_index]
            index //= dimension_sizes[dim]
        # Construct a row with codes and labels from all dimensions
        row = {f"{dim.capitalize()} Code": indices[dim] for dim in dimension_order}
        row.update({f"{dim.capitalize()} Name": dimension_labels[dim][indices[dim]] for dim in dimension_order})
        row["Value (Tourist Accommodations)"] = value
        rows.append(row)
    # Step 7: Create a DataFrame and save it as CSV
    if rows:
        df = pd.DataFrame(rows)
        csv_filename = "eurostat_tourist_accommodation.csv"
        df.to_csv(csv_filename, index=False)
        print(f"CSV file '{csv_filename}' has been successfully created.")
    else:
        print("No valid data to save as CSV.")
else:
    print(f"Error: Received status code {response.status_code} from the API.")
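The trickiest part of the script is turning a flat position into one position per dimension. Here is that step in isolation, as a minimal sketch with made-up dimension sizes. The `unravel` helper is a hypothetical name; it assumes the last dimension varies fastest, which is how JSON-stat orders its values.

```python
def unravel(flat_index, sizes):
    """Convert a flat position into one position per dimension (last varies fastest)."""
    positions = []
    for size in reversed(sizes):
        positions.append(flat_index % size)  # position within this dimension
        flat_index //= size                  # move on to the next dimension
    return list(reversed(positions))

# Made-up example: 2 countries x 3 years x 2 units = 12 cells
sizes = [2, 3, 2]
for flat_index in (0, 7, 11):
    print(flat_index, "->", unravel(flat_index, sizes))
```

You can check a result by hand: position 7 decodes to [1, 0, 1], and 1 * (3 * 2) + 0 * 2 + 1 gives 7 again.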
STEP 6 - Generate a specific view
In our case, imagine we only want to keep the records corresponding to campsites, apartments, or hotels. We can generate a final table with this condition and obtain a pandas DataFrame we can work with.
# Check the unique values in the 'Nace_r2 Name' column
set(df["Nace_r2 Name"])
# List of options to filter
options = ['Camping grounds, recreational vehicle parks and trailer parks',
           'Holiday and other short-stay accommodation',
           'Hotels and similar accommodation']
# Filter the DataFrame based on whether the 'Nace_r2 Name' column values are in the options list
df = df[df["Nace_r2 Name"].isin(options)]
df
Best Practices When Working with APIs
Read the Docs: Always check the official API documentation to understand endpoints and parameters.
Handle Errors: Use conditionals and logging to gracefully handle failed requests.
Respect Rate Limits: Avoid overwhelming the server—check if rate limits apply.
Secure Credentials: If the API requires authentication, never expose your API keys in public code.
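Two of these practices are easy to sketch in code: reading credentials from the environment and backing off when the server asks you to slow down. This is a minimal illustration, not a definitive implementation. The helper names, the `MY_API_KEY` variable, and the list of retryable status codes are all assumptions, and `send` stands for any function that performs the request and returns a status code and a body.

```python
import os
import time

def load_api_key(env_var="MY_API_KEY"):
    """Read the key from an environment variable instead of hard-coding it."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key

def call_with_retries(send, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call send() again after a growing pause while it returns a retryable status.

    send is any zero-argument callable returning (status_code, body).
    """
    retryable = (429, 500, 502, 503, 504)
    for attempt in range(max_attempts):
        status, body = send()
        if status not in retryable:
            return status, body
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...
    return status, body

# Usage with a fake sender, so the example runs without any network access
fake_responses = iter([(429, ""), (503, ""), (200, "ok")])
waits = []
result = call_with_retries(lambda: next(fake_responses), sleep=waits.append)
print(result, waits)
```

With the requests library, `send` could be a small function that calls `requests.get(...)` and returns `response.status_code` and `response.text`.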
And now… what you have been waiting for all along… Here goes our weekly cheatsheet 👇🏻
Conclusion
Eurostat’s API is a powerful gateway to a wealth of structured, high-quality European statistics. By learning how to navigate its structure, query datasets, and interpret responses, you can automate access to critical data for analysis, research, or decision-making—right from your Python scripts.
You can check the corresponding code in my brand-new DataBites GitHub repository, where I’ll share the associated code for upcoming code-alongs and projects.
Just email me at rfeers@gmail.com for any inquiries or to ask for help! 🤓