scraping google patents — the smart(er) way

Posted on saturday, 12th july, 2025

recently, i was helping a friend search for technical patterns on google patents — and wow, was it tedious. every search gave us a wall of results, and each abstract had to be opened manually to even check if it was remotely relevant. after hours of scrolling and clicking, we had maybe 2 or 3 useful patents. not great.

so i started looking into ways to automate the process. my first attempt was with firecrawl — decent results, but it turned out to be too expensive for the volume i needed.

that’s when i got curious about how google patents actually fetches its data. after a bit of network inspection, i realized that when you search, the results are fetched via an `xhr/query` endpoint, and when you click on a result, the abstract is buried in raw html, not in any clean API format.
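
here’s roughly what that request looks like in python. consider it a minimal sketch: the `search_page` name is mine, and the url-encoded `url` parameter plus the json keys are reconstructed from what i saw in devtools, so they may need adjusting.

```python
import requests

HEADERS = {"User-Agent": "Mozilla/5.0"}  # google tends to reject bare clients

def search_page(query: str, page: int = 0) -> list[str]:
    """hit the xhr/query endpoint and return the publication ids on one page."""
    resp = requests.get(
        "https://patents.google.com/xhr/query",
        # the search query travels inside a url-encoded "url" parameter;
        # this encoding is my best guess from network inspection
        params={"url": f"q=({query})&page={page}"},
        headers=HEADERS,
        timeout=30,
    )
    resp.raise_for_status()
    # response shape reconstructed from devtools; the keys may differ
    results = resp.json()["results"]["cluster"][0]["result"]
    return [r["patent"]["publication_number"] for r in results]
```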

so, i wrote a scraper that:

  • hits the query endpoint with custom search params
  • uses the publication id to fetch each result’s HTML
  • extracts just the title, abstract, and URL

the result? clean data.
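
the fetch-and-extract step might look something like this. the selectors are guesses from inspecting a patent page (title in a `DC.title` meta tag, abstract in a `div.abstract`), so treat them as placeholders rather than gospel:

```python
import requests
from bs4 import BeautifulSoup

def fetch_patent(pub_id: str) -> dict:
    """fetch a patent page and pull out just the title, abstract, and url."""
    url = f"https://patents.google.com/patent/{pub_id}/en"
    resp = requests.get(url, headers=HEADERS, timeout=30)  # HEADERS from the sketch above
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    title_tag = soup.find("meta", attrs={"name": "DC.title"})  # selector is a guess
    abstract_tag = soup.find("div", class_="abstract")         # so is this one
    return {
        "title": title_tag["content"].strip() if title_tag else "",
        "abstract": abstract_tag.get_text(strip=True) if abstract_tag else "",
        "url": url,
    }
```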

i then passed these abstracts into openai’s api with a focused prompt that assessed each one’s relevance. this made the search smarter and faster, but of course, there was pagination to deal with.
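
the relevance check itself is just one chat completion per batch. something like the sketch below, using the current openai python sdk; the model name, prompt wording, and `rank_abstracts` helper are placeholders, not the exact code i ran:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def rank_abstracts(batch: list[dict], criteria: str) -> str:
    """ask the model which patents in a batch actually match the search intent."""
    listing = "\n\n".join(
        f"[{i}] {p['title']}\n{p['abstract']}" for i, p in enumerate(batch)
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any chat model works
        messages=[
            {"role": "system", "content": "you filter patent abstracts for relevance."},
            {
                "role": "user",
                "content": f"criteria: {criteria}\n\npatents:\n{listing}\n\n"
                           "list the indices of the relevant patents, with a one-line reason each.",
            },
        ],
    )
    return resp.choices[0].message.content
```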

the query api only returns 10 results per page, so i wrote a loop that auto-increments the page number and pulls abstracts until all results are collected. then, to avoid token limits, i chunked the abstracts into smaller batches before sending them to the model.
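
in code, that’s a loop that stops at the first empty page, plus a tiny batching helper. the batch size is arbitrary; tune it to your model’s context window:

```python
def collect_all(query: str, max_pages: int = 100) -> list[dict]:
    """walk the paginated endpoint until a page comes back empty."""
    patents = []
    for page in range(max_pages):
        ids = search_page(query, page)
        if not ids:
            break  # ran past the last page of results
        patents.extend(fetch_patent(pid) for pid in ids)
    return patents

def chunked(items: list, size: int = 20):
    """yield fixed-size batches so each prompt stays under the token limit."""
    for i in range(0, len(items), size):
        yield items[i:i + size]
```

stopping at the first empty page keeps the loop independent of whatever total-count field the response may or may not expose.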

finally, i wrapped all of it into a simple internal API: you give it a query and a custom prompt, it returns the top recommended patents — filtered and summarized. it now takes seconds to get relevant insights that used to take hours.
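
a minimal version of that wrapper, using fastapi, could look like the following. the route name and payload shape are invented for illustration; the real internal API may look nothing like this:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PatentQuery(BaseModel):
    query: str
    prompt: str

@app.post("/patents/recommend")
def recommend(body: PatentQuery):
    """run the whole pipeline: search, scrape, then llm-filter in batches."""
    patents = collect_all(body.query)
    verdicts = [rank_abstracts(batch, body.prompt) for batch in chunked(patents)]
    return {"total_scraped": len(patents), "recommendations": verdicts}
```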

sometimes, all it takes is a bit of digging to turn a painful workflow into something smooth and powerful. i’ll probably open source this soon — let me know if you’d be interested!