NLP Project Series: Finalizing the web scraper for the Substack newsletter understanding tool
Join me as I build an advanced Substack newsletter understanding tool from scratch!
Hi all,
This is the second part of the hands-on walkthrough series covering the entire development of a realistic Natural Language Processing (NLP) project from scratch.
In the first part of the series, we introduced the project we will build: a tool that offers an advanced understanding of all the newsletters available on Substack. The tool is meant to help casual readers by analyzing, searching, and recommending Substack newsletters through the power of natural language processing.
We outlined some high-level plans and timelines to accomplish our goal.
We also started collecting newsletter data from Substack with web scraping.
Where did we stop last week?
I coded the full working version of the web scraper (code link from last week) that does two things:
1. Lists all the newsletters currently available on the Substack website (thanks to sitemap.xml; see the sketch after this list).
2. Scrapes the titles of the most recently released posts for each newsletter to form a general understanding of the theme behind that particular Substack.
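For context, here is a minimal sketch of how step 1 can work in Ruby. The sitemap URL and its structure are assumptions on my part; last week's post contains the actual discovery code.

```ruby
require 'open-uri'
require 'nokogiri'

# Hypothetical sitemap location; the real discovery logic lives in last week's script.
sitemap = Nokogiri::XML(URI.open('https://substack.com/sitemap.xml'))
sitemap.remove_namespaces! # sitemap entries sit under a default XML namespace

# Sitemaps list page URLs inside <loc> tags.
newsletter_urls = sitemap.css('loc').map(&:text)
puts "Found #{newsletter_urls.size} newsletter URLs"
```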
The issue was that step #2 was too slow.
As I started scraping each newsletter last week, I hit a problem: Substack started timing out my requests after every few pages, which forced me to add a four-second delay between calls.
This four-second delay makes it impossible to quickly scrape all the web pages.
That works out to 28,000 (approximate number of newsletters as of mid-November 2022) * 5 (maximum posts we scrape per newsletter) * 4 (seconds of delay) = 560,000 seconds spent just waiting so that Substack doesn’t time us out. That is roughly six and a half days during which the code does nothing except wait for permission to scrape the next page. It is a lot of wasted time.
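As a quick sanity check of that estimate:

```ruby
newsletters = 28_000      # approximate number of newsletters (mid-Nov 2022)
posts_per_newsletter = 5  # maximum posts scraped per newsletter
delay_seconds = 4         # forced delay between requests

wasted_seconds = newsletters * posts_per_newsletter * delay_seconds
puts wasted_seconds                       # => 560000
puts (wasted_seconds / 86_400.0).round(1) # => 6.5 (days)
```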
Building a much faster Substack web scraper
To get around the timeout issue, we need to figure out how to rotate proxies during scraping (i.e. open the Substack website from different IP addresses). This way, whenever Substack blocks one IP address from scraping a page, we simply rotate to another IP address and continue scraping.
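To make the idea concrete, here is a conceptual sketch of proxy rotation in plain Ruby. The proxy addresses are placeholders, and this is not the approach used in the rest of the post (we switch to scrape.do below).

```ruby
require 'open-uri'
require 'nokogiri'

# Placeholder proxy addresses; a real list would come from a proxy provider.
PROXIES = [
  'http://203.0.113.10:8080',
  'http://203.0.113.20:8080'
].freeze

def fetch_with_rotation(url, attempt = 0)
  proxy = PROXIES[attempt % PROXIES.size]
  Nokogiri::HTML(URI.open(url, proxy: proxy, read_timeout: 10))
rescue OpenURI::HTTPError, Net::ReadTimeout
  raise if attempt >= PROXIES.size - 1
  fetch_with_rotation(url, attempt + 1) # try the next proxy on failure
end
```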
I wasn’t sure how to set up rotating proxies with the Watir library from last week's post. Instead, I used the scrape.do API, which provides web scraping functionality with a rotating proxy included.
![](https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F15ce7e9f-a7f4-419a-994a-ceb6f7f52f62_3680x2334.png)
It's easy to scrape the latest five posts from each Substack newsletter using the scrape.do API. We just need to pass the URL and our personal scrape.do API token in a GET request.
In Ruby, this only takes two lines of code.
```ruby
scrape_uri = 'http://api.scrape.do?token=' + ENV["SCRAPEDO_TOKEN"] + '&url=' + url
html_out = Nokogiri::HTML(URI.open(scrape_uri)) # call the scrape_uri with a GET request
```
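To run those two lines on their own, the sketch below adds the necessary requires and a hypothetical target URL. URL-encoding the target with CGI.escape is my own addition; the complete script at the end of the post is the reference version.

```ruby
require 'open-uri'
require 'nokogiri'
require 'cgi'

# Hypothetical post URL; in the scraper it comes from the newsletter listing.
url = 'https://example.substack.com/p/some-post'

scrape_uri = 'http://api.scrape.do?token=' + ENV['SCRAPEDO_TOKEN'] +
             '&url=' + CGI.escape(url)
html_out = Nokogiri::HTML(URI.open(scrape_uri))
puts html_out.css('h1.post-title').text
```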
The scrape.do API has the added advantage of allowing us to scrape multiple pages concurrently. The hobby plan allows up to 5 concurrent requests, which works well with our goal of scraping the 5 latest posts of each Substack newsletter.
Putting it all together, we get the code snippet below, which scrapes the title, subtitle, and Substack newsletter name for the 5 latest posts in parallel. The complete code can be found at the end of the post.
```ruby
urls.each do |url|
  threads << Thread.new do
    scrape_uri = 'http://api.scrape.do?token=' + ENV["SCRAPEDO_TOKEN"] + '&url=' + url
    html_out = Nokogiri::HTML(URI.open(scrape_uri))

    # get h1 with class post-title
    post_title = html_out.css('h1.post-title').text
    # get h3 with class subtitle
    post_subtitle = html_out.css('h3.subtitle').text
    # get h1 with class navbar-title
    substack_name = html_out.css('h1.navbar-title').text

    # puts 'Post title: ' + post_title
    # puts 'Post subtitle: ' + post_subtitle
    # puts 'Substack name: ' + substack_name

    outs << {
      url: url,
      post_title: post_title,
      post_subtitle: post_subtitle,
      substack_name: substack_name
    }
  end
end
```
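The snippet above assumes that urls, threads, and outs are set up elsewhere in the script. Below is a rough sketch of how that scaffolding could look while staying within the hobby plan's limit of 5 concurrent requests. The scrape_post helper and the post_urls variable are my own illustration, not names from the actual script, which remains the reference.

```ruby
require 'open-uri'
require 'nokogiri'

# scrape_post stands in for the body of the thread in the snippet above.
def scrape_post(url)
  scrape_uri = 'http://api.scrape.do?token=' + ENV['SCRAPEDO_TOKEN'] + '&url=' + url
  html_out = Nokogiri::HTML(URI.open(scrape_uri))
  {
    url: url,
    post_title: html_out.css('h1.post-title').text,
    post_subtitle: html_out.css('h3.subtitle').text,
    substack_name: html_out.css('h1.navbar-title').text
  }
end

outs = []
post_urls = ['https://example.substack.com/p/post-1'] # hypothetical input list
post_urls.each_slice(5) do |urls|                     # at most 5 concurrent requests
  threads = urls.map { |url| Thread.new { outs << scrape_post(url) } }
  threads.each(&:join)                                # finish the batch before moving on
end
```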
Something to keep in mind.
While the scrape.do API is very convenient for web scraping, it costs $29 per month for 150,000 scraping requests (hobby plan). I think this is a reasonable price to pay for such a service if you need to scrape certain webpages on a regular basis.
However, for someone like me who might only need to scrape certain webpages occasionally, it might be worth spending more time to figure out how to use a rotating proxy without an external API like scrape.do.
Even if that ends up taking more time and effort than using the scrape.do API, I will simply become more proficient in web scraping and move on. But that’s for another time!
Deploying the Substack web scraper on a remote server
Even with the scrape.do API, we need a few seconds (3 seconds at most in my experience) to scrape the 5 latest posts of each Substack newsletter. Across 28,000 newsletters, that is 28,000 * 3 = 84,000 seconds, which adds up to roughly 24 hours.
Quite a big improvement over the version of the code we had last week: we reduced the total scraping time from about a week to 24 hours.
The last step that remains is to figure out how to run the code for 24 hours without interruption. Running it on a personal laptop is a viable option, but I prefer turning my laptop off at night. Additionally, I want the code to be able to withstand an increase in the number of Substack newsletters over the next few years.
If we run the Substack web scraper on a remote server, it will be able to handle a much larger number of newsletters. Plus, we can keep our laptops off at night.
We will use Heroku as our remote server. Heroku is a classic example of a Platform as a Service (PaaS): they maintain the servers, and we just deploy code to them.
I won’t cover the Heroku setup in detail here, but I recommend following these steps to run the Substack scraper script remotely on a Heroku server. Here are the Gemfile and Gemfile.lock to use in order to set up the dependencies properly for the Substack scraper Heroku app.
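For orientation, a minimal Gemfile for this scraper could look roughly like the sketch below. This is my guess based on the libraries used in this post (Nokogiri for parsing, the AWS SDK for the S3 backup); the linked Gemfile and Gemfile.lock are the authoritative versions.

```ruby
source 'https://rubygems.org'

gem 'nokogiri'   # HTML/XML parsing
gem 'aws-sdk-s3' # backing up the scraped data to S3
```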
If you have any issues with the Heroku setup, please leave a comment.
Once you have everything set up, simply run the following script (script link):
```
heroku run:detached "ruby scraper.rb"
```
This scraper script will run in the background on the Heroku server.
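To check on the detached process later, the standard Heroku CLI commands work, for example:

```
heroku ps           # list running dynos, including the detached run
heroku logs --tail  # stream the scraper's output
```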
Remember to set the environment variables with your AWS and scrape.do credentials on Heroku. This ensures that 1) the Substack newsletter data is backed up to AWS S3 storage and 2) the scrape.do API runs properly.
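Setting those environment variables is a one-time step with the Heroku CLI. SCRAPEDO_TOKEN matches the variable used in the code above; the AWS variable names below are the conventional ones and an assumption on my part, so adjust them to whatever the script expects.

```
heroku config:set SCRAPEDO_TOKEN=your_scrape_do_token
heroku config:set AWS_ACCESS_KEY_ID=your_key_id AWS_SECRET_ACCESS_KEY=your_secret_key
```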
And that's it!
What’s Next?
You can take a look at the main scraper.rb script and the associated Gemfile and Gemfile.lock to replicate my run. The main script is pretty small, only about 200 lines of code.
In the next post, we will take a deep dive into the data: analyze it and clean it up to set it up properly for our NLP experiments. You can also take a look at the complete Substack newsletter data, covering more than 28,000 newsletters, that I scraped in the meantime.
See you soon!