The AI Edition

NLP Project Series: Data Cleaning and Data Preparation in the real world

Deep Dive into the Substack Newsletters Data Cleaning and Data Preparation Process

Nov 22, 2022


Hello everyone!

This week, we're diving deep into the Substack Newsletter data we've crawled over the last two weeks.

Join me on our journey to develop a realistic Natural Language Processing (NLP) project from scratch!

Human Language Technology
NLP Project Series: Finalizing the web scraper for the Substack newsletter understanding tool

Some announcements to get you excited about this week’s content!

Black Friday Deal

I'm giving a 40% discount on both monthly and yearly subscriptions for my Human Language Technology Newsletter until Saturday.

Get up-to-date on the latest in Artificial Intelligence and Natural Language Processing and land a high-paying NLP engineer job.

Happy Thanksgiving everyone!

Get 40% off now!

I’m giving out a Black Friday Deal for my other product as well.

For a limited time, get 50% off my Code Highlighter for Substack Chrome plug-in!

With this plug-in, you can easily read and use the code you see on Substack, with colors similar to your code editor. Change the font size and copy code into your editor with a single button click. Reading and managing the code you see in my Substack will be easier than ever!

Read this testimonial from a happy Code Highlighter user:

Testimonial for Code Highlighter for Substack Chrome plug-in by happy user BowTiedDevil

Get 50% off Code Highlighter

Now, back to the main content.

Data, data, data

Now that we have collected data from all 28,257 newsletters on Substack, we can move on to the fun part.

Data is the fuel of every machine learning model, and this is especially true for natural language processing. In every successful project I have seen, the developers behind the project spend some time understanding, cleaning, and preprocessing data to make sure it is in the best condition to get the most performance from the machine learning model.

This is also a good time to remember that data quality is one of the factors that affects performance the most. There is a great slide from Andrew Ng showing how cleaning data improved accuracy by 10%. This is a massive gain! Check out the BowTied_Raptor post covering Andrew Ng's Data-Centric AI view in more detail.

Andrew Ng on the performance of a machine learning model with clean vs. noisy data

When we say "data cleaning", what does that entail?

I won't lie and say that "data cleaning" is some sort of magic buzzword that will make all of your performance issues go away. In most cases, data cleaning can improve your results (which is why it's such a popular concept) but it's important to understand what data cleaning actually entails.

So, what happens in a typical natural language processing data collection scenario? You have collected the data by asking annotators to label each document with a corresponding topic, and now you would like to clean it. This is what you do:

  1. You need to do Quality Assurance (QA) to make sure that the label assigned to the document actually matches the guidelines.

    1. This is typically done by another set of annotators and checking the agreement rate.

  2. You need to check the trends and do the data analysis of the dataset.

    1. Perhaps you will discover that you have too many examples for a certain topic and need to balance your data.

  3. Finally, you need to remove any strange artifacts (for example, empty sentences or repeated datapoints).

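Step 3 above is easy to sketch in code. Here is a minimal, hypothetical example of dropping empty sentences and repeated datapoints; the `(text, label)` tuple shape is an assumption for illustration, not the project's actual data format:

```python
# Minimal sketch: remove empty documents and exact duplicates
# from a small labeled dataset (hypothetical shape).
dataset = [
    ("Great intro to NLP", "tech"),
    ("", "tech"),                     # empty document: drop
    ("Great intro to NLP", "tech"),   # repeated datapoint: drop
    ("Weekly market recap", "finance"),
]

seen = set()
cleaned = []
for text, label in dataset:
    normalized = text.strip()
    if not normalized:       # skip empty sentences
        continue
    if normalized in seen:   # skip repeated datapoints
        continue
    seen.add(normalized)
    cleaned.append((normalized, label))

print(cleaned)
# [('Great intro to NLP', 'tech'), ('Weekly market recap', 'finance')]
```

In a real pipeline you would also normalize whitespace and casing before deduplicating, so near-identical documents are caught as well.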

What about our project?

In our case, we don’t have any human-labeled data per se.

The data is collected programmatically by scraping all Substack newsletters. So, unless there is a bug in my scraping code (for example, a post is not assigned to the right newsletter), there should be no issues with the data.

Instead, we are going to focus on data analysis. Based on what the analysis reveals, we will "clean" and "process" our data appropriately.

These are the types of questions we will be answering in the sections below.

  • What is the type of text data we collected?

    • How long is the typical title?

    • Any strange artifacts in the title and subtitle text?

  • Are there a lot of missing posts for many newsletters?

  • Any other discoveries we make before applying a machine learning model.
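The questions above can be answered with a few lines of exploratory code. This is a minimal sketch, assuming each scraped post is a dict with `"newsletter"` and `"title"` keys (a hypothetical shape, with toy values, for illustration only):

```python
# Minimal sketch of the exploratory checks above on a toy sample.
from collections import Counter
from statistics import mean

posts = [
    {"newsletter": "The AI Edition", "title": "NLP Project Series: Data Cleaning"},
    {"newsletter": "The AI Edition", "title": "Finalizing the web scraper"},
    {"newsletter": "Quiet Letters", "title": ""},
]

# How long is the typical title (in words)?
title_lengths = [len(p["title"].split()) for p in posts if p["title"]]
print(mean(title_lengths))  # 4.5

# Any strange artifacts, e.g. empty titles?
empty_titles = [p for p in posts if not p["title"].strip()]
print(len(empty_titles))  # 1

# Are posts missing for many newsletters? Count posts per newsletter.
posts_per_newsletter = Counter(p["newsletter"] for p in posts)
print(posts_per_newsletter)
```

On the full 28,257-newsletter crawl, the same counts and length distributions point you at which newsletters are under-represented and which fields need cleaning.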

If you want to master how actual data processing and cleaning is done in the industry, this post is for you!

This post is for paid subscribers

© 2025 hal