Wednesday, June 20, 2012

The importance of your metadata

I attended a webinar today presented by BePress (Berkeley Electronic Press) and Google Scholar. Ann Taylor moderated the session, and Darcy Dapra, Partner Manager at Google Scholar, spoke for an engaging 45 minutes about how bots crawl and index material on the web, and in particular how Google Scholar's bots index institutional repositories so that their contents appear in Scholar searches.

The session covered the basic technology behind web crawling and focused on using appropriate metadata to make the crawlers' work as efficient as possible. The methodology behind excellent web search services involves three steps: crawling, indexing and ranking.

Crawling is typically done by the GS bots in parallel: a pool of sites is created, and the bots then fetch and analyze pages from those sites simultaneously. Bots are stopped in their automated work when they run into navigation that depends on JavaScript or on form (POST) submissions rather than plain links. File organization is also important to efficient crawling. A branching hierarchy, in which each page links to several others, is crawled faster than one long linear list, because the number of reachable pages grows with each layer. To address these problems, it is recommended that sites use plain HTML HREF links that require no POST data, and that articles to be indexed sit no more than two layers in from the homepage.
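
As a rough way to check the "two layers in from the homepage" advice, here is a minimal sketch that measures how many link hops each page is from the homepage using a breadth-first search. The site map, the paths, and the function names are all hypothetical; a real check would build the link map from your own repository's pages.

```python
from collections import deque


def click_depths(links, home):
    """Return the minimum number of link hops from `home` to each reachable page."""
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:  # first visit is the shortest path in a BFS
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


def too_deep(links, home, max_depth=2):
    """List pages that sit more than `max_depth` hops from the homepage."""
    depths = click_depths(links, home)
    return sorted(page for page, depth in depths.items() if depth > max_depth)


# Hypothetical site map: each page maps to the pages it links to via plain HREFs.
site = {
    "/": ["/collections", "/browse"],
    "/collections": ["/article-1", "/series"],
    "/series": ["/article-2"],
    "/browse": ["/article-3"],
}

print(too_deep(site, "/"))  # → ['/article-2']
```

Here /article-2 is three hops in, so it would be a candidate for a more direct link from a collection or browse page.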

Indexing the documents requires excellent metadata. Indexing is interrupted or weakened when:
  • sufficient bibliographic metadata cannot be identified
  • selected metadata schema does not provide enough relevant detail
  • included data is incorrect or incomplete
  • documents are split across several files under outdated structure and size guidelines, such as those once intended to speed download times
General-purpose schemas can describe anything, but they do not describe everything well. Scholarly articles need:
  1. Journal, volume, issue and page number
  2. ISSN
  3. Publication date
  4. Monographic series info for conference proceedings, etc. (ISBN)
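
The list above can be turned into a simple completeness check. The sketch below flags missing bibliographic fields in a record and renders the ones present as HTML meta tags. Note the assumptions: the webinar notes above do not name any tags, so the citation_* names are my understanding of the Highwire Press-style tags described in Google Scholar's inclusion guidelines and should be verified there. The record layout, field names, and sample values are hypothetical.

```python
from html import escape

# Assumed mapping from record fields to citation_* meta tag names (verify
# against Google Scholar's inclusion guidelines before relying on it).
FIELD_TO_TAG = {
    "title": "citation_title",
    "journal": "citation_journal_title",
    "volume": "citation_volume",
    "issue": "citation_issue",
    "first_page": "citation_firstpage",
    "issn": "citation_issn",
    "isbn": "citation_isbn",
    "publication_date": "citation_publication_date",
}

# Fields the list above says a journal article needs.
REQUIRED = ("journal", "volume", "issue", "first_page", "issn", "publication_date")


def missing_fields(record):
    """Return the required fields that are absent or blank in `record`."""
    return [f for f in REQUIRED if not str(record.get(f, "")).strip()]


def to_meta_tags(record):
    """Render the fields that are present as one <meta> tag per line."""
    lines = [
        f'<meta name="citation_author" content="{escape(author)}">'
        for author in record.get("authors", [])
    ]
    for field, tag in FIELD_TO_TAG.items():
        value = str(record.get(field, "")).strip()
        if value:
            lines.append(f'<meta name="{tag}" content="{escape(value)}">')
    return "\n".join(lines)


# Hypothetical record with the ISSN left out.
record = {
    "authors": ["Jane Q. Example"],
    "title": "An Example Article",
    "journal": "Journal of Examples",
    "volume": "12",
    "issue": "3",
    "first_page": "45",
    "publication_date": "2012/06/20",
}

print(missing_fields(record))  # → ['issn']
print(to_meta_tags(record))
```

Running a check like this over a repository export would show which records a crawler is likely to index poorly before it ever visits them.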

Best practices for metadata include giving authors' full names without abbreviations, giving full institutional names so that one institution is not confused with another, keeping each complete document in a single file (no partitions), and linking directly to documents without intervening registration or copyright-acceptance pages. When pages move, redirects should be used, HTTP 301 (Moved Permanently) in particular, and should remain in place for at least 12 months after a site migration, since a crawler reads the 301 as a signal to replace the old page URL in its index with the new one.
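
To make the redirect advice concrete, here is a minimal sketch of a WSGI application that answers requests for moved pages with an HTTP 301 and the new location. The paths in the table are hypothetical, and in practice most sites would configure this in the web server or repository platform rather than in application code.

```python
# Hypothetical table of old paths and their new homes after a site migration.
MOVED = {
    "/old/article-1.pdf": "/repository/article-1.pdf",
    "/old/article-2.pdf": "/repository/article-2.pdf",
}


def redirect_app(environ, start_response):
    """WSGI app: send a permanent redirect for moved paths, 404 otherwise."""
    path = environ.get("PATH_INFO", "")
    new_path = MOVED.get(path)
    if new_path is not None:
        # 301 tells clients and crawlers that the move is permanent.
        start_response("301 Moved Permanently", [("Location", new_path)])
        return [b""]
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"Not found"]
```

For local experimentation, this can be served with the standard library's wsgiref.simple_server.make_server; the table would stay in place for the twelve months or more recommended above.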

Ranking methodologies were not discussed while I was on the webinar call.

The above information is likely basic for most readers, and yet it brought me back up to speed on how crawlers work and how they read metadata. As our institution continues work on its own institutional repository, it is important for me to have this understanding. We are in the process of interviewing candidates for a metadata librarian position. I hope this review can give others a foundation for building their own understanding of the importance of appropriate metadata.

For additional information, the speaker suggested a review of the About Google Scholar pages as a starting point.
