TL;DR: We find a way to scrape a lot of images from Reddit, and then format them to a standard 128x128 size. Code here.

Yesterday, inspired by this post, I wanted to try my hand at writing a Generative Adversarial Network (GAN) and generating synthetic images. I figured I could just reuse their code to obtain my own dataset of images to train on.

Problem 1

The script used to download images from Reddit no longer worked, due to a performance optimization made by Reddit’s team. Originally, Reddit would allow a user to view arbitrarily many submissions by simply clicking ‘Next’. However, now Reddit caches the first 1000 submissions for every category, and doesn’t allow immediate access to any other submissions. In other words, regardless of whether a user sorts by ‘New’ or ‘Hot’ or ‘Top’, Reddit will always show at most 1000 posts.

Up to a year ago, developers were able to get around this by making more complex search queries. Reddit’s search function could still return a list of posts submitted within a given range of time. As a result, a scraper could issue one search query per time interval, for as many intervals as it needed, and so reach as many submissions as wanted.

Problem 2

Recent improvements to Reddit’s search function removed the feature which allowed search queries by timestamp, and as a result this method of obtaining posts no longer works. Luckily, pushshift.io has been collecting Reddit data for a while, and allows for timestamp-based queries. Using their API, we’re able to access URLs for as many images as we want.
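The timestamp-based querying described above can be sketched roughly as follows. This is a minimal illustration, not the scraper's actual code: the endpoint URL, the `subreddit`/`after`/`before`/`size` parameters, and the `data` key in the response are assumptions based on how the Pushshift submission search was publicly documented, and the helper names (`time_windows`, `build_query`, `fetch_submissions`, `image_urls`) are made up for this example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: Pushshift's submission search endpoint as historically documented.
PUSHSHIFT_URL = "https://api.pushshift.io/reddit/search/submission/"


def time_windows(start, end, step):
    """Yield (after, before) epoch-second pairs that cover [start, end)."""
    t = start
    while t < end:
        yield t, min(t + step, end)
        t += step


def build_query(subreddit, after, before, size=100):
    """Build the query URL for one time window (parameter names are assumed)."""
    params = {"subreddit": subreddit, "after": after, "before": before, "size": size}
    return PUSHSHIFT_URL + "?" + urlencode(params)


def fetch_submissions(url):
    """Download one page of results; assumes the JSON payload has a 'data' list."""
    with urlopen(url) as response:
        return json.load(response).get("data", [])


def image_urls(submissions, extensions=(".jpg", ".jpeg", ".png")):
    """Keep the 'url' field of submissions that link directly to an image file."""
    return [
        s["url"]
        for s in submissions
        if s.get("url", "").lower().endswith(extensions)
    ]
```

A scraper along these lines would loop over `time_windows(...)`, call `fetch_submissions(build_query(...))` for each window, and collect the results of `image_urls(...)`. Splitting the range into windows is what gets past the 1000-post cap, since each window is queried on its own.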

Problem 3

After downloading a lot of images and resizing them, I found out that quite a few images uploaded to Reddit look like this:

[image]

To get around this, we can just check the size of each image we download and make sure that its dimensions don’t match this specific image’s dimensions, which are 130x60.
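The dimension check could look something like the sketch below. It assumes Pillow is installed for reading image sizes; the function names are hypothetical, and only the 130x60 figure comes from the post itself.

```python
# Dimensions of the unwanted image, as stated in the post.
PLACEHOLDER_SIZE = (130, 60)


def is_placeholder(size, placeholder=PLACEHOLDER_SIZE):
    """Return True if a (width, height) pair matches the unwanted image."""
    return tuple(size) == placeholder


def keep_image(path):
    """Return True if the file at `path` is a readable image worth keeping."""
    from PIL import Image  # Pillow; assumed to be installed

    try:
        with Image.open(path) as img:
            # Pillow reports size as (width, height).
            return not is_placeholder(img.size)
    except OSError:
        # Unreadable or non-image downloads are dropped as well.
        return False
```

Filtering on dimensions alone is crude, since a legitimate 130x60 image would also be discarded, but at this size that is unlikely to cost much.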

End Result

Now that all these problems are fixed, we’re able to collect a large number of images from the subreddit of our choice. My scraper code can be found here. We resize each image to fit within a 128x128 square, and zero-pad it when necessary. I chose to scrape the subreddit /r/CatsStandingUp as an example, and now I have a collection of 10,000 images of standing cats. We’ll see soon if a GAN is able to learn from these images and synthesize new ones.
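For reference, the resize-and-pad step can be sketched as below. This is an illustration under a few assumptions rather than the scraper's exact code: it uses Pillow, centers the image on the black canvas, and scales small images up as well as large ones down. The function names are invented for the example.

```python
def fit_within(width, height, target=128):
    """Scale (width, height) so the longer side equals `target`.

    Returns (new_width, new_height, left, top), where (left, top) is the
    offset that centers the scaled image on a target x target canvas.
    """
    if width >= height:
        new_w, new_h = target, max(1, round(height * target / width))
    else:
        new_w, new_h = max(1, round(width * target / height)), target
    left = (target - new_w) // 2
    top = (target - new_h) // 2
    return new_w, new_h, left, top


def resize_and_pad(path, target=128):
    """Load an image, fit it inside a square, and zero-pad the remainder."""
    from PIL import Image  # Pillow; assumed to be installed

    with Image.open(path) as img:
        img = img.convert("RGB")
        new_w, new_h, left, top = fit_within(*img.size, target)
        resized = img.resize((new_w, new_h))

    # A black canvas supplies the zero padding around the resized image.
    canvas = Image.new("RGB", (target, target), (0, 0, 0))
    canvas.paste(resized, (left, top))
    return canvas
```

Keeping the aspect ratio and padding with zeros avoids stretching the cats, at the cost of black bars that the GAN will also have to learn.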