The session was informative about the basic technology behind web crawling and focused on using appropriate metadata to maximize efficiency for crawlers. The methodology behind excellent web search services involves three steps: Crawling, Indexing, and Ranking.
Crawling is typically done by the Google Scholar bots in parallel: a pool of sites is created, and simultaneous search-retrieve-analysis is performed across it. Bots are interrupted in their automated work when they run into non-navigable JavaScript or POST data in URLs. File organization also matters for efficient crawling: non-linear (tree-like) browsing is faster than linear (list) browsing, because each hop can branch out to many pages at once. To address these problems, structure files and links so that plain HTML href links are used without POST information, and so that articles to be indexed sit no more than two layers in from the homepage.
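The two-layers-from-the-homepage rule can be checked mechanically. Below is a minimal sketch, assuming a site's link graph is available as a plain dictionary (a real crawler would fetch and parse pages in parallel instead); the site paths are invented for illustration:

```python
from collections import deque

def pages_within_depth(link_graph, homepage, max_depth=2):
    """Breadth-first walk of a site's link graph, returning a dict of
    pages reachable within max_depth clicks of the homepage, mapped to
    their click depth."""
    seen = {homepage: 0}
    queue = deque([homepage])
    while queue:
        page = queue.popleft()
        depth = seen[page]
        if depth == max_depth:
            continue  # do not follow links out of pages at the depth limit
        for target in link_graph.get(page, []):
            if target not in seen:
                seen[target] = depth + 1
                queue.append(target)
    return seen

# Hypothetical site: the article page is three clicks deep, so a crawler
# observing a two-click budget never reaches it.
site = {
    "/": ["/journals"],
    "/journals": ["/journals/vol1"],
    "/journals/vol1": ["/journals/vol1/article1"],
}
reachable = pages_within_depth(site, "/")
```

Flattening the structure (for example, linking volumes directly from the homepage) brings the article pages inside the crawl budget.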
Indexing the documents requires excellent metadata. Indexing is interrupted or weakened when:
- sufficient bibliographic metadata cannot be identified
- the selected metadata schema does not provide enough relevant detail
- included data is incorrect or incomplete
- documents are partitioned under outdated structure and size guidelines, such as those once intended to speed download times
The bibliographic metadata a crawler needs to identify includes:
- Journal, volume, issue, and page number
- ISSN
- Publication date
- Monographic series info for conference proceedings, etc. (ISBN)
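These elements map onto the Highwire Press `citation_*` meta tags that Google Scholar's documentation recommends embedding in each article page. A minimal sketch of rendering a bibliographic record as those tags (the record values are invented for illustration):

```python
from html import escape

def highwire_meta_tags(record):
    """Render a bibliographic record as Highwire Press <meta> tags,
    one of the schemes Google Scholar recommends for article pages."""
    tag_map = {
        "title": "citation_title",
        "author": "citation_author",        # repeated once per author
        "journal": "citation_journal_title",
        "volume": "citation_volume",
        "issue": "citation_issue",
        "firstpage": "citation_firstpage",
        "date": "citation_publication_date",
        "issn": "citation_issn",
        "isbn": "citation_isbn",
    }
    lines = []
    for field, tag in tag_map.items():
        values = record.get(field, [])
        if isinstance(values, str):
            values = [values]
        for value in values:
            lines.append(f'<meta name="{tag}" content="{escape(value)}">')
    return "\n".join(lines)

# Invented example record; note full author names, no abbreviations.
record = {
    "title": "Metadata and Crawler Efficiency",
    "author": ["Smith, Jane", "Doe, John"],
    "journal": "Journal of Repository Studies",
    "volume": "12",
    "issue": "3",
    "firstpage": "45",
    "date": "2011/05/01",
    "issn": "1234-5678",
}
tags = highwire_meta_tags(record)
```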
Best practices for metadata include identifying authors in full, without abbreviations; spelling out institutional names completely to avoid confusing one institution for another; keeping each document complete in a single file (no partitions); and linking to documents directly, without intervening registration or copyright-acceptance pages. After a site migration, redirects should be used, HTTP 301 in particular, and should remain in place for at least 12 months, since a crawler reads a 301 as a signal to replace the old page URL with the new one in its index.
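The 301-after-migration advice can be sketched as a tiny WSGI application; the migration map and paths are hypothetical, and a production site would normally configure this in the web server rather than in application code:

```python
def migration_redirect_app(old_to_new):
    """Build a WSGI app that answers requests for migrated paths with a
    301 (Moved Permanently) redirect, telling crawlers to update their
    index with the new URL; other paths get a 404."""
    def app(environ, start_response):
        path = environ.get("PATH_INFO", "/")
        if path in old_to_new:
            start_response("301 Moved Permanently",
                           [("Location", old_to_new[path])])
            return [b""]
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"not found"]
    return app

# Hypothetical migration map, kept in place for 12+ months after the move.
app = migration_redirect_app({"/old/article1": "/articles/1"})
```

A 302 (temporary) redirect would not carry the same signal: crawlers keep the old URL indexed, which is why the permanent 301 is recommended here.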
Ranking methodologies were not discussed during the portion of the webinar I attended.
The above information is likely basic for most readers, but it brought me back up to speed on the fundamentals of crawling and on how crawlers read metadata. As our institution continues work on its own institutional repository, this understanding is important for me; we are currently interviewing candidates for a metadata librarian position. I hope this review helps others build a foundation for understanding the importance of appropriate metadata.
For additional information, the speaker suggested a review of the About Google Scholar pages as a starting point.