
scraping google patents — the smart(er) way

Posted on friday, 12th july, 2025

recently, i was helping a friend look for relevant patent data on google patents — and i quickly realized just how slow and repetitive the process really was. for every keyword search, we had to manually open each result and scan the abstract just to figure out if it was even worth saving. after hours of digging, we walked away with maybe two or three useful ones.

i figured there had to be a better way. so i first tried using firecrawl to scrape patent pages — and while it technically worked, it wasn’t sustainable cost-wise, especially with the number of pages i needed to process.

so i decided to take a closer look at how google patents actually loads its data. i opened devtools and noticed something interesting: instead of server-rendering results, google uses an xhr/query endpoint behind the scenes. that endpoint takes a structured query and returns paginated patent result metadata in a JSON-like format.

for example, it builds a URL like:

https://patents.google.com/xhr/query?url=q%3D%28CD47%29%26oq%3DCD47&exp=&peid=...

each response includes around 10 items per page, and each item contains a unique patent ID (like patent/US11723348B2/en) that you can use to fetch the full detail page.
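
to make that concrete, here's a rough python sketch of what hitting that endpoint looks like. the page parameter, the empty exp param, and the json key names below are assumptions (google doesn't document this endpoint), so treat it as a starting point rather than a stable api:

```python
import requests

def search_patents(query: str, page: int = 0) -> dict:
    """hit the xhr/query endpoint for one page of keyword results."""
    # the `url=` value wraps the user-facing query string; the page param
    # and the empty `exp=` are assumptions based on what devtools shows
    params = {"url": f"q=({query})&page={page}", "exp": ""}
    resp = requests.get(
        "https://patents.google.com/xhr/query",
        params=params,
        headers={"User-Agent": "Mozilla/5.0"},  # bare requests tend to get blocked
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

def extract_ids(payload: dict) -> list[str]:
    """pull publication ids like 'patent/US11723348B2/en' out of one page.
    the key names here are guesses -- check the real payload in devtools."""
    ids = []
    for cluster in payload.get("results", {}).get("cluster", []):
        for item in cluster.get("result", []):
            pub = item.get("patent", {}).get("publication_number")
            if pub:
                ids.append(f"patent/{pub}/en")
    return ids
```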

here's the tricky part: clicking a result doesn’t return a clean JSON object — it serves a full HTML document. so to extract the abstract, i had to write a scraper that:

  • hits the /xhr/query endpoint with a given search query
  • parses the response to extract patent publication IDs and their respective detail URLs
  • fetches the full HTML for each patent page and scrapes the abstract, title, and original URL using selector logic (see the sketch below)
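
for the selector step, one approach is to read the dublin core meta tags in the page head rather than the visible markup. the tag names below are assumptions and worth verifying against a live patent page before relying on them:

```python
import requests
from bs4 import BeautifulSoup

def scrape_patent(patent_id: str) -> dict:
    """fetch a patent detail page (e.g. 'patent/US11723348B2/en') and pull
    out the title and abstract. the meta tag names are assumptions --
    google's markup can change, so double-check against a live page."""
    url = f"https://patents.google.com/{patent_id}"
    resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
    resp.raise_for_status()

    soup = BeautifulSoup(resp.text, "html.parser")
    title_tag = soup.find("meta", attrs={"name": "DC.title"})
    abstract_tag = soup.find("meta", attrs={"name": "DC.description"})

    return {
        "id": patent_id,
        "url": url,
        "title": title_tag["content"].strip() if title_tag else None,
        "abstract": abstract_tag["content"].strip() if abstract_tag else None,
    }
```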

once i had a collection of titles and abstracts, i passed them into the OpenAI API with a carefully written custom prompt. the prompt helped evaluate whether the abstract was topically relevant, and gave each one a score or summary.
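
the relevance pass looks roughly like this. the model name, prompt framing, and output format below are placeholders rather than what the tool actually ships with:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def score_relevance(entries: list[dict], relevance_prompt: str, model: str = "gpt-4o-mini") -> str:
    """ask the model to judge each abstract against a custom relevance prompt.
    model name and output format are placeholders, not the real configuration."""
    numbered = "\n\n".join(
        f"[{i}] {e['title']}\n{e['abstract']}" for i, e in enumerate(entries)
    )
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": relevance_prompt},
            {
                "role": "user",
                "content": (
                    "for each numbered abstract below, reply with its number, "
                    "a 0-10 relevance score, and a one-line justification.\n\n" + numbered
                ),
            },
        ],
    )
    return response.choices[0].message.content
```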

of course, there was another obstacle: pagination. the query API only returns 10 results per page, and doesn’t give you a direct “next” link — but it does return the total count of results on the first page. i used that to calculate how many pages i needed, and wrote a loop that updated the page parameter to scrape all results.
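
the pagination loop is then just arithmetic on that total count. the page parameter and the total-count key are both guesses about the payload shape, so verify them in devtools first:

```python
import math

def search_all_pages(query: str, page_size: int = 10) -> list[str]:
    """walk every result page, reusing the helpers from the earlier sketch.
    the 'total_num_results' key and the page param are assumptions."""
    first = search_patents(query)
    total = first.get("results", {}).get("total_num_results", 0)
    num_pages = math.ceil(total / page_size)

    ids = extract_ids(first)
    for page in range(1, num_pages):
        ids.extend(extract_ids(search_patents(query, page=page)))
    return ids
```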

with dozens (sometimes hundreds) of abstracts collected, i started hitting token limits on OpenAI. so, i batched the abstracts into smaller chunks before sending them to the model.
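
the batching itself is nothing fancy, just slicing the list before each call. a real version would count tokens (tiktoken works well for this) instead of assuming a fixed number of abstracts fits in the context window:

```python
def batch(items: list, size: int = 20):
    """yield fixed-size chunks; 20 abstracts per chunk is an arbitrary guess,
    a token-counting version would be more robust."""
    for i in range(0, len(items), size):
        yield items[i : i + size]

# usage: score each chunk separately to stay under the token limit
# verdicts = [score_relevance(chunk, my_prompt) for chunk in batch(entries)]
```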

in the end, i wrapped the whole flow in a simple internal API that takes:

  • a search query (e.g., "CD47")
  • a custom relevance prompt

and returns a cleaned list of patent entries — each with title, abstract, and URL — filtered by relevance. now, instead of manually reviewing 50 tabs, i get distilled insights in seconds.
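
the internal api isn't public yet, so here's only a guess at its shape, sketched with fastapi and reusing the helpers from the earlier sketches. the route, field names, and response format are all hypothetical:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PatentSearchRequest(BaseModel):
    query: str              # e.g. "CD47"
    relevance_prompt: str   # custom instructions for the relevance filter

@app.post("/patents/search")
def patent_search(req: PatentSearchRequest) -> list[dict]:
    """tie the pieces together: search -> scrape -> score.
    this route and its shapes are hypothetical, not the real internal api."""
    ids = search_all_pages(req.query)
    entries = [scrape_patent(pid) for pid in ids]
    results = []
    for chunk in batch(entries):
        verdict = score_relevance(chunk, req.relevance_prompt)
        # in practice you'd parse the model's verdict and keep only the
        # entries it marks relevant; this sketch just attaches the raw text
        results.append({"entries": chunk, "verdict": verdict})
    return results
```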

this project went from a frustrating search problem to an enjoyable automation exercise. it’s still internal right now, but i’m planning to open source it soon. if you’d like to try it or contribute, let me know!