Data wrangling, often considered a preliminary but essential step in data analysis, involves converting and mapping raw data into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes. This post aims to unravel the layers of this process, providing a comprehensive guide from the initial stages of defining research questions to the final steps of data visualization. It’s a journey through the best practices and tools essential for any data scientist, ensuring that your data is prepared for analysis.
Define Research Questions
Ensuring that a research question is clearly stated at the beginning of any data analysis is crucial for several reasons:
- Guidance: The research question guides your entire data analysis process. It determines what data is needed, what kind of analyses to run, and how to interpret the results.
- Clarity: Clearly stating the research question at the beginning helps anyone reading the notebook understand the purpose of the analysis. It sets the context and allows the reader to follow the logic of the study.
- Focus: The research question keeps the analysis focused. It’s easy to get lost in the data and go off on tangents. A well-defined research question prevents this by reminding you of the main goal of your analysis.
Useful Links
- From Research Question to Exploratory Analysis
- The Importance of Structure, Coding Style, and Refactoring in Notebooks
- Seven Tricks for Better Data Storytelling: Part I
- Seven Tricks for Better Data Storytelling: Part II
Import Packages
Based on the PEP8 style guide, imports should be grouped in the following order:
- Standard library imports.
- Related third-party imports.
- Local application/library-specific imports.
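For example, an import cell following this convention might look like the sketch below (the specific packages are illustrative):
# Standard library imports
import json
import os

# Related third-party imports
import numpy as np
import pandas as pd
import requests

# Local application/library-specific imports (hypothetical module name)
# from my_project import helpers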
Common packages for data wrangling:
- NumPy
- pandas
- requests
- BeautifulSoup
- Pillow (PIL)
- SQLAlchemy
- Scikit-learn
- Openpyxl
Updating packages like pandas within a Jupyter Notebook is not a good practice. Here’s why:
- Reproducibility: Updating packages in the notebook makes it hard to reproduce the environment. It’s better to state the required versions of all packages at the beginning of your notebook.
- Clarity: Notebooks should tell a clear story. Including code to update packages can distract from the main narrative.
- Potential Errors: Updating a package in the notebook could break the code in earlier cells. It’s safer to update packages outside the notebook (e.g., with pip or conda).
Best Practices:
- Always manage package versions outside the notebook, in a separate environment setup process. This could be done using virtual environments, docker containers, or conda environments.
- At the beginning of the notebook, state the versions of all major packages used. This can be done in a markdown cell for human readers, and optionally also in a code cell that prints out the actual versions in use when the notebook is run. This can be achieved using the watermark extension (documentation), for example, as shown here:
%load_ext watermark
%watermark -p pandas,numpy,matplotlib
Structure + Documentation
Best Practices for Documentation:
- Use a consistent style and format throughout your documentation.
- Use comments to explain what your code is doing.
- Write documentation for your data, including the data sources, the data types, and the meaning of each variable.
- Write documentation for your analysis, including the methods used and the rationale behind the choices you made.
- Write documentation for your results, including the key findings and the implications for the business.
- Use diagrams and visualizations to help explain your findings.
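As a minimal, hypothetical sketch of these practices, a single cleaning step could be documented like this (the column name and date format are illustrative):
import pandas as pd

def standardize_dates(df, column='order_date'):
    """Convert a date column from DD-MM-YYYY strings to datetime.

    Data note: 'order_date' comes from the raw export as a string;
    downstream analysis expects a datetime64 column.
    """
    # Rationale: an explicit format avoids silently swapping day and month
    df[column] = pd.to_datetime(df[column], format='%d-%m-%Y')
    return df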
Consider Using Pre-Built Templates:
- CRISP-DM: documentation template (Common traditional data science life cycle)
- Microsoft’s TDSP: Team Data Science Process (More Modern: A combo of Scrum and CRISP-DM)
- Model Cards: Google’s framework for standardized, transparent model reporting
- Cookiecutter: A command-line utility that creates projects from a library of project templates
Useful Articles
- How To Structure a Data Science Project: A Step-by-Step Guide
- How to organize your Python data science project
- 15 Data Science Documentation Best Practices
- Best Practices For Data Science Project Workflows and File Organizations
1. Gather Data
Data gathering is an important step in any data science project as it involves collecting the data you need from various sources to use in further analysis. In Python, there are a number of ways to gather data:
- Download Data Manually:
import pandas as pd
df = pd.read_csv('file.csv')
print(df.head())
- Programmatically Download Files:
import requests
url = 'https://example.com/file.csv'
response = requests.get(url)
with open('file.csv', 'wb') as f:
    f.write(response.content)
- Accessing APIs:
import requests
import json
url = 'https://api.github.com/users/octocat'
response = requests.get(url)
data = response.json()
print(json.dumps(data, indent=4))
- Web Scraping with BeautifulSoup:
import requests
from bs4 import BeautifulSoup
url = 'https://example.com'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.prettify())
- Extracting Data from a SQL Database:
import sqlite3
conn = sqlite3.connect('example.db')
cursor = conn.cursor()
cursor.execute("SELECT * FROM table_name")
rows = cursor.fetchall()
for row in rows:
    print(row)
Useful Articles
- Downloading Files using Python
- Data Collection & Storage
- 4 Data Collection Libraries in Python That You Should Know
- Data Version Control With Python: A Comprehensive Guide for Data Scientists
2. Assess Data
Data assessment involves inspecting your datasets for two things:
- Data quality issues: such as missing values, duplicates, or incorrect data. This is called dirty data.
- Data structural issues: such as data that isn’t organized with one variable per column and one observation per row. This is called messy data.
You can search for these issues in two ways:
- Visually by scrolling
- Programmatically using code
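For the programmatic route, a quick first pass might look like this (assuming the dataset is already loaded into df):
# Inspect structure, dtypes, and non-null counts
df.info()
# Summary statistics for numeric columns
print(df.describe())
# Look at a random sample of rows rather than just the head
print(df.sample(5))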
Dimensions of Data Quality:
Completeness is a metric that helps you understand whether your data is sufficient to answer interesting questions or solve your problem.
# Check for missing values
missing_values = df.isnull().sum()
print(missing_values)
# Drop or fill missing values
df = df.dropna()
# Or
df = df.fillna(value)  # replace 'value' with an appropriate fill value
Validity is a metric that helps you understand how well your data conforms to a defined set of rules for data, also known as a schema.
# Convert string to datetime
df['date_column'] = pd.to_datetime(df['date_column'], format='%d-%m-%Y')
Accuracy is a metric that helps you understand whether your data accurately represents the reality it aims to depict.
# Here, df_truth is the source of truth
accuracy = np.mean(df['column_name'] == df_truth['column_name'])
print(accuracy)
Consistency is a metric that helps you understand two things: whether your data follows a standard format and whether your data’s info matches with information from other data sources.
# Here, df1 and df2 are two different datasets
consistency = np.mean(df1['column_name'] == df2['column_name'])
print(consistency)
Uniqueness is a metric that helps you understand whether there are duplicate or overlapping values in your data.
# Check for duplicates
duplicates = df.duplicated().sum()
print(duplicates)
# Drop duplicates
df = df.drop_duplicates()
Removing columns or variables that are unnecessary for your analysis is called Dimensionality Reduction. Transforming variables’ measurements or deriving new ratios to improve your analysis is called Feature Engineering. Both of these are common data preprocessing steps that can be used to improve the quality of tidy data, but they don’t count as quality or tidiness issues.
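As a small illustration (the column names are hypothetical), dropping an unused column and deriving a ratio feature might look like:
# Dimensionality reduction: drop a column not needed for the analysis
df = df.drop(columns=['internal_id'])
# Feature engineering: derive a new ratio from two existing columns
df['price_per_unit'] = df['total_price'] / df['quantity']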
Requirements for Tidiness:
- Each variable forms a column
- Each observation forms a row
- Each type of observational unit forms a table
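For instance, a wide table with one column per year breaks the first rule; pandas’ melt can reshape it into tidy form (a sketch with made-up column names):
# Wide (messy) layout: country | 2020 | 2021
tidy_df = df.melt(id_vars=['country'],
                  value_vars=['2020', '2021'],
                  var_name='year',
                  value_name='sales')
# Tidy layout: one row per country-year observation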
Useful Links
- Python for Data Science — Tidy data
- What is Data Quality Assessment? Why it is Important?
- Data Quality Assessment Framework
- Using Missingno to Diagnose Data Sparsity
3. Clean Data
This is the stage where we solve the problems identified during data assessment.
The Data Cleaning Process
Programmatic data cleaning has three steps:
- Define: write a data cleaning plan by converting your assessments into cleaning tasks.
- Code: translate the data cleaning plan to code and run it to clean the data.
- Test: test the dataset, often using code, to ensure the cleaning code works and revisit some of the elements from the assessment phase.
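Applied to a single issue, the pattern might look like this sketch (the issue and column name are hypothetical):
import pandas as pd

# Define: 'date_column' is stored as DD-MM-YYYY strings; convert it to datetime.

# Code
clean_df = df.copy()
clean_df['date_column'] = pd.to_datetime(clean_df['date_column'], format='%d-%m-%Y')

# Test
assert clean_df['date_column'].dtype == 'datetime64[ns]'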
Dealing with Outliers
- Set up a range manually if you already know the valid range, using pandas indexing.
- Identify outliers automatically using the standard deviation method, based on the output of pandas’ df.describe() function.
- Drop outliers using df.drop(index=...).
- Identify the impact on summary statistics after dealing with outliers.
- Instead of dropping outliers, you can choose to keep them separate:
# If we already know the valid range of x (known_min and known_max are placeholders)
low_bound = known_min
upper_bound = known_max
# If we don't: use the standard deviation method (here, mean ± 1 std)
summaries = df.describe().loc[['mean', 'std']]
low_bound = summaries['x']['mean'] - summaries['x']['std']
upper_bound = summaries['x']['mean'] + summaries['x']['std']
# Separate the rows that fall outside the bounds
violating_rows = df[(df['x'] < low_bound) | (df['x'] > upper_bound)]
Dealing with Duplicates
- Review duplicates
duplicate_rows = clean_df.loc[clean_df.duplicated(['x','y'])]
- Remove rows or convert to NaN
# Keep 1st occurrence
remove_dups = clean_df.drop_duplicates(subset=['x','y'])
# Keep last occurrence
remove_dups = clean_df.drop_duplicates(subset=['x','y'], keep='last')
# Remove specific occurrences by index label
clean_df = clean_df.drop(index=[...])  # replace [...] with the index labels of the rows to drop
# Convert duplicates to NaN
duplicate_index = clean_df[clean_df.duplicated(['x', 'y'])].index
clean_df.loc[duplicate_index, 'x'] = np.nan
Dealing with Missing Data
- Replace non-null missing values found during assessment (e.g. ‘#’)
clean_df.loc[clean_df['x'].isin(['#'])]  # inspect the affected rows
# OR
clean_df['x'] = clean_df['x'].replace({'#':np.nan})
- Check missing total
df.isna().sum()
- Drop values or fill values
# Drop rows with missing values
df = df.dropna()
# Drop a column
df = df.drop('x', axis=1)
# Fill Values
## Impute with the mean
df['x'] = df['x'].fillna(df['x'].mean())
## Forward Fill
df = df.ffill()
## Back Fill
df = df.bfill()
Testing Methods
Visual:
- Heatmaps to visualize missing data
- Histograms to visualize the range of data
- Box plots to visualize outliers
Programmatic:
- Use Python assert statements to check the data types, number of NA values, etc.
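For example, a few such checks after cleaning might look like this (the expectations are illustrative):
# No missing values should remain in the cleaned dataframe
assert clean_df.isna().sum().sum() == 0
# No duplicate rows should remain
assert clean_df.duplicated().sum() == 0
# The date column should have been converted to datetime
assert clean_df['date_column'].dtype == 'datetime64[ns]'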
Useful Links
- Pythonic Data Cleaning With pandas and NumPy
- Data Wrangling in Python: Tips and Tricks
- How to Perform Data Cleaning for Machine Learning with Python
4. Store Data
Maintaining different versions of your data, particularly the raw data and the cleaned data, has several advantages:
- It allows you to trace back to the original data if something goes wrong during the data-cleaning process.
- It provides a clear understanding of the transformations applied to the raw data.
- It facilitates the reproducibility of your analysis, which is a key aspect of good data science practices.
Save data to a CSV file using Pandas:
df.to_csv('demo/2023_ad_cleaned.csv', index=False, encoding="utf-8")
Save data to a SQL database using SQLAlchemy:
df.to_sql('clean_data', con=connection, if_exists='append', index_label='ID')
It is important to have a structure when storing and publishing your data. A sample structure is:
- A doc folder for documentation
- A data folder for raw data
- A src folder for source code
- A result folder for cleaned data and analysis
This is particularly helpful for reproducible workflows, where downstream stakeholders can understand the process and replicate your work on their setups.
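If you want to create this layout programmatically, a small sketch using pathlib (with the folder names from the sample structure above) could be:
from pathlib import Path

for folder in ['doc', 'data', 'src', 'result']:
    # Create each project folder if it doesn't already exist
    Path(folder).mkdir(parents=True, exist_ok=True)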
Save data to a pickle file:
Sometimes during cleaning, you may have adjusted the dtypes of variables to resolve validity issues. However, saving the cleaned data to .csv files strips certain data types, which means they’ll be different when you next import them.
If you need to preserve your data types, you can save your dataframe as a pickle file using:
df.to_pickle("./FILE_NAME.pkl")
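The saved dataframe, with its dtypes intact, can later be reloaded with:
df = pd.read_pickle("./FILE_NAME.pkl")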
Useful Links
- Introduction to Data Wrangling and Data Quality
- Python Data Wrangling Guide
- Best Practices for Cleaning and Storing Data
5. Visualization
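As a minimal sketch of this stage (assuming matplotlib is installed and a cleaned dataframe clean_df with a numeric column 'x'), a histogram and a box plot are a reasonable starting point:
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
# Histogram to visualize the distribution and range of 'x'
axes[0].hist(clean_df['x'].dropna(), bins=30)
axes[0].set_title('Distribution of x')
# Box plot to highlight potential outliers in 'x'
axes[1].boxplot(clean_df['x'].dropna())
axes[1].set_title('Box plot of x')
plt.tight_layout()
plt.show()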
Useful Links
- Python Data Visualization Tutorials
- Document Analysis Guide: Definition and How To Perform It
- Principles of Documenting Data
- 15 Data Science Documentation Best Practices
- Reflective writing introduction
- How to Write Conclusions
- How to Present the Limitations of the Study
Conclusion
As we reach the conclusion of “Data Wrangling 101: A Jupyter Workflow,” it’s clear that the process of transforming raw data into a clean, structured format is no small feat. This comprehensive guide has walked you through the essential steps of this intricate process, emphasizing the importance of a well-defined research question, the selection of appropriate tools and packages, and the implementation of best practices in data cleaning and storage.
Remember, data wrangling is more than just a pre-processing step; it is the foundation upon which insightful analysis is built. With the knowledge and tools provided in this guide, you’re now better equipped to tackle the complexities of data science projects, transforming raw data into powerful stories and actionable insights. Keep exploring, keep analyzing, and let your data tell its story.