hey all,
after a hiatus from the "building a real nlp project" series, we are back!
wanted to give a shout-out to andriy mulyar who helped prepare the visualizations and write this post. look at how beautiful and insightful the map of all substack posts is.
in the previous posts, we did the tedious part: scraping substack data, cleaning it, and preprocessing it. it can feel monotonous, but we shouldn’t forget that data is the secret ingredient behind all the ai advances (including the recently much-hyped generative ai).
now comes the exciting part: visualizing and understanding the data using nlp!
what is atlas?
atlas offers a powerful and easy-to-use solution for visualizing and exploring large collections of text.
by using non-linear dimensionality reduction techniques, atlas gives you a bird’s-eye view of the topics, themes, and trends that are most prevalent across substack posts.
what is non-linear dimensionality reduction?
non-linear dimensionality reduction techniques are a family of methods that are used to reduce the complexity of high-dimensional datasets. in many cases, datasets can have thousands of features or variables, making them difficult to work with and visualize.
think of a single substack post: the number of possible word combinations in it is practically endless, so a numeric representation of the post can easily need thousands of dimensions.
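to make the "thousands of dimensions" point concrete, here is a minimal sketch that turns a few made-up post snippets into word-count vectors. the post does not say how atlas represents text internally, so word counts are only an illustrative assumption, and the sample snippets and function names are hypothetical.

```python
from collections import Counter

# Made-up snippets standing in for real Substack posts.
posts = [
    "daos and web3 reshape decentralized work",
    "crypto investing and volatility in web3",
    "a poem about creativity and poetry",
]


def build_vocabulary(texts):
    """Collect every distinct word across all texts, in sorted order."""
    return sorted({word for text in texts for word in text.lower().split()})


def to_count_vector(text, vocabulary):
    """One dimension per vocabulary word; the value is how often it occurs."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


vocabulary = build_vocabulary(posts)
vectors = [to_count_vector(post, vocabulary) for post in posts]

# Every post becomes a vector with one entry per distinct word.
print("dimensions per post:", len(vocabulary))
```

even three one-line snippets already need one dimension per distinct word. with full-length posts the vocabulary, and therefore the dimensionality, grows into the thousands.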
non-linear dimensionality reduction techniques aim to reduce the dimensionality of such datasets by discovering meaningful and informative low-dimensional representations of the data.
most importantly, non-linear dimensionality reduction techniques aim to preserve the key features and relationships of the original data in the lower-dimensional (2d in our case) space.
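here is a small sketch of the "preserve relationships in 2d" idea. this post does not state which algorithm atlas uses, so the code swaps in a simple stand-in: a stress-minimizing embedding in the spirit of metric multidimensional scaling, fitted with plain gradient descent. the toy data, step count, and learning rate are all assumptions chosen for illustration.

```python
import math
import random


def pairwise_distances(points):
    """Euclidean distance between every pair of high-dimensional points."""
    return [[math.dist(p, q) for q in points] for p in points]


def random_layout(n, seed=0):
    """Random 2D starting positions, seeded so runs are repeatable."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)] for _ in range(n)]


def stress(target, coords):
    """Sum of squared gaps between 2D distances and original distances."""
    n = len(coords)
    return sum(
        (math.dist(coords[i], coords[j]) - target[i][j]) ** 2
        for i in range(n)
        for j in range(i + 1, n)
    )


def embed_2d(points, steps=1000, learning_rate=0.02, seed=0):
    """Move 2D points by gradient descent until their mutual distances
    resemble the distances between the original points."""
    target = pairwise_distances(points)
    n = len(points)
    coords = random_layout(n, seed)
    for _ in range(steps):
        grads = [[0.0, 0.0] for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                dist = math.dist(coords[i], coords[j]) or 1e-9
                # Gradient of (dist - target)^2 with respect to coords[i].
                scale = 2.0 * (dist - target[i][j]) / dist
                grads[i][0] += scale * (coords[i][0] - coords[j][0])
                grads[i][1] += scale * (coords[i][1] - coords[j][1])
        for i in range(n):
            coords[i][0] -= learning_rate * grads[i][0]
            coords[i][1] -= learning_rate * grads[i][1]
    return coords


# Toy data: two groups of "posts" in a 5-dimensional feature space.
points = [
    [0.0, 0.1, 0.0, 0.2, 0.1],
    [0.1, 0.0, 0.2, 0.0, 0.0],
    [0.0, 0.2, 0.1, 0.1, 0.0],
    [5.0, 5.1, 4.9, 5.0, 5.2],
    [5.1, 5.0, 5.0, 4.8, 5.0],
    [4.9, 5.2, 5.1, 5.0, 4.9],
]
target = pairwise_distances(points)
layout = embed_2d(points)
print("stress before:", round(stress(target, random_layout(len(points))), 2))
print("stress after:", round(stress(target, layout), 2))
```

a lower stress means the 2d distances agree more closely with the original ones, which is the property that lets similar posts end up near each other on a map. production tools typically use more scalable methods than this toy loop.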
enough technical concepts; let’s dive into the fun part:
exploring trends in substack posts
make sure to drag your mouse around, zoom in, and explore all the topics.
i would highlight some of the high-level topics:
web3, daos
crypto investing
ukraine, politics
war
black culture
poetry
humanities
basketball
creativity
zooming in on the web3/dao topic, you see relevant subtopics like economics, volatility, ecosystem, decentralized work, and more!
if you would like to reproduce this and create similar maps, it is very easy:
feed the substack data into atlas’ python api and out comes a map.
code: https://github.com/nomic-ai/maps/blob/main/maps/substack.py
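the linked script is the reference for what was actually run. as a rough sketch of the general shape, the snippet below prepares post records and hands them to the nomic python client. the record keys (`title`, `body`), the `max_chars` cutoff, and the helper names are hypothetical. the `atlas.map_data(data=..., indexed_field=...)` call is an assumption about recent client releases (older releases exposed a differently named text-mapping function), so check the docs for your installed version.

```python
def build_records(posts, max_chars=2000):
    """Turn scraped posts into flat records with a single text field.

    `posts` is assumed to be an iterable of dicts with hypothetical
    'title' and 'body' keys; posts with an empty body are skipped.
    """
    records = []
    for post in posts:
        body = (post.get("body") or "").strip()
        if not body:
            continue
        records.append({"title": post.get("title", ""), "text": body[:max_chars]})
    return records


def upload_to_atlas(records):
    """Send records to Atlas. Assumes `pip install nomic`, a logged-in
    account, and a client version that provides `atlas.map_data`."""
    from nomic import atlas  # imported lazily so the rest runs without it

    return atlas.map_data(data=records, indexed_field="text")


# Made-up sample posts; real data would come from the scraping step.
sample_posts = [
    {"title": "on daos", "body": "daos and web3 reshape decentralized work"},
    {"title": "empty draft", "body": ""},
]
records = build_records(sample_posts)
print("records ready for upload:", len(records))
# upload_to_atlas(records)  # uncomment once the nomic client is configured
```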
see you soon!