Patreon is a delight to scrape. Actually, scraping is the wrong word for it – the Patreon frontend is a React application that calls a number of very sensibly designed JSON endpoints. Call the same endpoints and you get delightfully clean JSON that exactly matches what gets displayed on the site.
A disclaimer – as far as I can tell this is undocumented; the publicly documented API (a JS implementation) is targeted at creators and provides access to private information only visible to creators. All of this could change at any point.
I was all ready to parse HTML, but looking at the source I found a beautiful JS object containing all the data needed to display most pages:
"data": {
"attributes": {
"created_at": "2016-04-30T13:58:22+00:00",
"creation_name": "Entertainment",
"display_patron_goals": false,
"earnings_visibility": null,
...
Even better, at the tail end of the long object is the call to fetch just the JSON:
"links": {
<span id="line535"></span> "self": "https://api.patreon.com/campaigns/355645"
}
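A minimal sketch of hitting that endpoint with Python's requests library; the fetch_campaign name and the printed field are my own choices, and the live API may well require extra headers or the json-api-version parameter that shows up in the explore URL below:

import requests

API_ROOT = "https://api.patreon.com"

def fetch_campaign(campaign_id):
    """Fetch the raw JSON for a single campaign from the undocumented endpoint."""
    response = requests.get(f"{API_ROOT}/campaigns/{campaign_id}")
    response.raise_for_status()
    return response.json()

# The campaign ID taken from the "self" link above.
campaign = fetch_campaign(355645)
print(campaign["data"]["attributes"]["creation_name"])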
So as long as I can get the ID of a campaign, I can get all the information about it in an easily processed format. The explore pages, plus a bit of network monitoring, reveal calls to the following URL:
https://api.patreon.com/explore/category/12?include=creator.null&fields[user]=full_name,image_url,url&fields[campaign]=creation_name,patron_count,pledge_sum,is_monthly,earnings_visibility&page[count]=20&json-api-version=1.0
Inspecting the different category pages shows that the number after ‘/category/’ runs from 1 through 14, with 99 for the ‘All’ category. This way, I can fetch the top campaigns in each category and then use the campaign API to retrieve detailed information about each one.
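A sketch of how one might walk those category pages and collect campaign IDs, assuming the response follows the usual JSON:API shape (a top-level "data" list whose entries carry an "id"); the helper name and the set-based bookkeeping are mine:

import requests

EXPLORE_URL = "https://api.patreon.com/explore/category/{}"
EXPLORE_PARAMS = {
    "include": "creator.null",
    "fields[user]": "full_name,image_url,url",
    "fields[campaign]": "creation_name,patron_count,pledge_sum,is_monthly,earnings_visibility",
    "page[count]": 20,
    "json-api-version": "1.0",
}

# Categories 1-14 plus 99 ('All'), as observed on the explore pages.
CATEGORY_IDS = list(range(1, 15)) + [99]

def top_campaign_ids():
    """Collect campaign IDs from each explore category listing."""
    ids = set()
    for category in CATEGORY_IDS:
        response = requests.get(EXPLORE_URL.format(category), params=EXPLORE_PARAMS)
        response.raise_for_status()
        for campaign in response.json().get("data", []):
            ids.add(campaign["id"])
    return ids

Each ID collected this way can then be fed to the campaign endpoint above to pull the detailed record.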
An interesting note – the data structures reveal a lot about how the site has been designed and where complexity can be added later: multiple campaigns per user, links between campaigns, and so on.
Full code for my scraper is after the break – I’ll be diving into the analysis next.