Sunday, November 13, 2011

Google Scholar (still) sucks

(This is a follow-up to my previous post on the topic.)

I was encouraged by the appearance of two R-based Scholar-scrapers within a week of each other. One, by Kay Cichini, converts the page URLs into text mode and scrapes from there (there's a slightly hacked version by Tony Breyal on GitHub). The other, by Tony Breyal (GitHub version here), uses XPath.

I started poking around with these functions -- they each do some things I like and have some limitations.
  • Cichini's version:
    • is based on plain old text-scraping, which is easy for me to understand.
    • has a nice loop for fetching multiple pages of results automatically.
    • has (to me) a silly output format -- her code automatically generates a word cloud, and can dump a csv file to disk if requested. It would be very easy and make more sense to break this up into separate functions: a scraper which returned a data frame and a wordcloud creator which accepted a data frame as input ...



  • Breyal's version:
    • is based on XPath, which seems more magical to me but is probably more robust in the long run
    • extracts numbers of citations
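
The split suggested above for Cichini's function (a scraper that returns a table, plus a separate word-cloud step that accepts one) might look roughly like this. It is only a sketch, written in Python rather than R, with a list of dicts standing in for a data frame. The function names, the one-title-per-line input format, and the sample titles are all made up for illustration; none of this is Cichini's actual code.

```python
import re
from collections import Counter


def scrape_titles(page_text):
    """Hypothetical scraper step: one record per non-empty line of
    already-fetched text (assumed format: one result title per line)."""
    return [{"title": line.strip()}
            for line in page_text.splitlines() if line.strip()]


def word_frequencies(records, min_length=4):
    """Word-cloud input: count words of at least `min_length` letters
    across the titles in `records`."""
    counts = Counter()
    for rec in records:
        words = re.findall(r"[a-z]+", rec["title"].lower())
        counts.update(w for w in words if len(w) >= min_length)
    return counts


# Made-up titles, for illustration only.
page_text = """
Mixed models in ecology
Generalized linear mixed models: a practical guide
"""

records = scrape_titles(page_text)   # could be dumped to CSV instead
freqs = word_frequencies(records)    # could be handed to a plotting function
print(freqs.most_common(3))
```

With the two steps separated, anyone who only wants the table can stop after the first call.
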
Neither of them does what I really want, which is to extract the full bibliographic information. However, when I looked more closely at what GS actually gives you, I got frustrated again. The full title is available, but the rest of the bibliographic information is only available in a severely truncated form: the author list and publication (source) title are both truncated if they are too long (!!: e.g. check out this search).

Since the "save to [reference manager]" links are available on the page (e.g. this link to BibTeX information: see these instructions on setting a fake cookie), one could in principle go and visit them all, but ... this is where we run into trouble. Google Scholar's robots.txt file contains the line Disallow: /scholar, which according to the definition of the robot-exclusion protocol technically means that we're not allowed to use a script to visit links starting with http://scholar.google.ca/scholar.bib... as in the example above. Google Scholar does block IP addresses that do too many rapid queries (this is mentioned on the GS help page, and on the aforementioned Python scraper page). It would be easy to circumvent this by pausing appropriately between retrievals, but I'm not comfortable with writing general-purpose code to do that.

So: Google Scholar offers a reduced amount of information on the page they return, and prohibits us from spidering the page to retrieve the full bibliographic information. Argh.

As a side effect of this, I did take a quick look for existing bibliographic information-handling packages in R (with sos::findFn("bibliograph*")) and found:
  • CITAN: a Scopus-centric package that uses a SQLite backend and does heavy-duty bibliometric analysis (h-indices, etc.)
  • RISmed: a PubMed-centric package that defines a Reference class (seems sensible but geared pretty narrowly towards article-type references). It imports RIS format (a common tagged format used by ISI and others)
  • ris: a similar (?) package without the PubMed interface
  • bibtex: parses BibTeX files
  • RMendeley from the ROpenSci project
So: there's a little more infrastructure out there, but nothing (it seems) that will do what I want without breaking or bending rules.
  • ISI is big and evil and explicitly disallows scripted access.
  • PubMed doesn't cover ecology as well as I'd like.
  • I might be able to use Scopus but would prefer something Open (this is precisely why GS's cripplage annoys me so much).
  • Mendeley is nice, and perhaps has most of what I really want, but ideally I would prefer something with systematic coverage [my understanding is that the Mendeley databases would have everything that everyone has bothered to include in their personal databases ...]
  • I wonder if JSTOR would like to play ... ?
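
As for the rule that gets in the way with GS: the effect of the Disallow: /scholar line can be checked mechanically. Here is a minimal sketch using Python's standard urllib.robotparser. It assumes the rule sits under a "User-agent: *" line (the post only quotes the Disallow line), it uses only that one rule rather than the real file, and the agent name and the second URL are made up.

```python
from urllib.robotparser import RobotFileParser

# Only the rule quoted in the post. The "User-agent: *" line is an
# assumption, and the real robots.txt file contains more than this.
rules = [
    "User-agent: *",
    "Disallow: /scholar",
]

parser = RobotFileParser()
parser.parse(rules)


def allowed(url, agent="my-scraper"):
    """True if `agent` (a hypothetical name) may fetch `url` under `rules`."""
    return parser.can_fetch(agent, url)


# URL prefix from the post; the elided remainder is left out.
bib_url = "http://scholar.google.ca/scholar.bib"
# Hypothetical URL whose path does not start with /scholar.
other_url = "http://scholar.google.ca/some/other/page"

print(allowed(bib_url))
print(allowed(other_url))
```

Disallow rules are matched as path prefixes, which is why /scholar.bib is caught by a rule written as /scholar.
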
If anyone's feeling really bored, here are the features I'd like:
  • scrape or otherwise save information to a variety of useful fields (author, date, source title, title, keywords, abstract?)
  • save/identify various types (e.g. article/book chapter etc.)
  • allow dump to CSV file
  • citation information would be cool -- e.g. to generate co-citation graphs -- but might get big
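
To make the first three items of the wish list concrete, here is one possible shape for a reference record and a CSV dump. This is a sketch in Python, not an existing package: the Reference class, its field names, and the placeholder entry are all hypothetical, and citation links are left out.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields


@dataclass
class Reference:
    """Hypothetical record type; field names are illustrative only."""
    ref_type: str        # e.g. "article", "chapter", "book"
    authors: str         # full (untruncated) author list, "; "-separated
    year: str
    title: str
    source: str          # journal or book title
    keywords: str = ""
    abstract: str = ""


def to_csv(refs, handle):
    """Write references to an open text handle, one row per reference."""
    writer = csv.DictWriter(handle,
                            fieldnames=[f.name for f in fields(Reference)])
    writer.writeheader()
    for ref in refs:
        writer.writerow(asdict(ref))


# Placeholder entry, not a real publication.
refs = [Reference("article", "Author, A.; Author, B.", "2011",
                  "A placeholder title", "Journal of Placeholders")]

buf = io.StringIO()   # stands in for a file opened with newline=""
to_csv(refs, buf)
csv_text = buf.getvalue()
print(csv_text)
```

Keeping the type as an ordinary column means articles, chapters and books can all live in the same table.
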

I wonder if it's worth complaining to Google?

7 comments:

  1. VERY interesting post.
    Thank you for both linking out to people and for explaining your beef with Google Scholar - I do think you should complain to Google, and I've forwarded this to people there that I know (though I don't know how close they are to the Google Scholar thing).

    With regards,
    Tal

  2. Ben,

    thanks for illustrating the issue here again.
    The main purpose of my function is to retrieve titles - just because these are the best we can get for drawing conclusions about the topics covered. Abstracts are truncated and thus shouldn't be used for meta-analysis. Titles are also truncated, as you said, and there is no way around it, though not as often or as severely as abstracts.

    The CSV is optional; the df with word frequencies and the word cloud are always returned - for any other output one can easily add some appropriate lines to the script.

    My opinion:
    My function is good for a quick summary and illustration of a query-result.

    Tony's function is evidently better if you want to pull all fields of a given query (authors, titles, abstracts, ...)

    I wonder if you came across ROpenSci? I guess that might be very interesting for you!

    Last remark: Of course, a Google Scholar API would resolve all our problems in this regard.

    Best,
    Kay

  3. Have you taken a look at CiteSeerX? http://ksuseer1.ist.psu.edu/index

    CiteSeerX uses an OAI protocol (http://csxstatic.ist.psu.edu/about/data), for which R has a package (http://cran.r-project.org/web/packages/OAIHarvester/index.html). I've tried to use R to pull CiteSeerX biblio info but I didn't get very far. I'd be curious whether you could get further than I did.

    Replies
    1. Sadly CiteSeerX does not cover ecology.

  4. I keep coming back to the question of Google's motivation. They've clearly gone out of their way to make scraping difficult (as opposed to simply not going out of their way to make it easy), so they must have a reason.

    Any complaint to Google would have to address that reason, so I'd like to have some sense of what it is. The information you're trying to scrape is citation and abstract info, not full text, so IP shouldn't be an issue. What else could it be? I'm at a loss.

  5. Amazing post. I have opened a related question on Stack Overflow: http://stackoverflow.com/questions/10536601/how-to-retrieve-calculate-citation-counts-and-or-citation-indices-from-a-list-of

    Thanks for the effort!

  6. This is a very interesting post and covers (I think) something I am looking into at the moment. I have been asked to compile a literature review of all environmental literature that has been written about Cornwall, UK, to be presented on a website. We do not have the capacity to host the papers, so it would need to be able to query Scholar. I have searched and searched the help pages but cannot see a simple way of retrieving this type of information relatively quickly! I am envisaging something similar to how Google Scholar Citations works with personal profiles, but with environmentally related topics instead.
    Is this impossible? Any help would be greatly appreciated!
    Many thanks
