By Evan Meehan
Data, whether found in a museum or online, is only useful if it can be found. In museums this means that uncatalogued artifacts sit in a bin on a shelf collecting dust until someone has time to brush the dust off, figure out what the artifact is, write down the information, and put the bin on a different shelf. At some point in the future the items in that bin may become part of an exhibit. Online it is much the same. As a rephrasing of a Zen koan goes: if a website is published but cannot be found on Google, does it contain any words?
While the systems by which museums catalogue information are rigorously (albeit variously) defined, the internet is more free-form. Of even greater concern for the internet than for museums is the number of uncatalogued, or poorly catalogued, websites. In 1999, for instance, search engines left 84% of the internet uncatalogued. More important, however, is how the remaining 16% was catalogued. The typical search engine then (and now) employed web crawlers, or spiders, which recorded some basic information about a website to make it searchable. Yahoo, on the other hand, employed a much more resource-intensive method in which living, breathing, and paid human beings traveled the web and fit it into their own organizational schema. According to Danny Sullivan, this system did not just work; it worked so much better than the spider-based systems that other companies adopted the technique.
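The spider's job can be sketched in a few lines. This is an illustrative toy, not any real engine's code: the miniature "web" and the `fetch()` function are stand-ins for actual network requests, and the only "basic information" recorded per page is its set of words.

```python
import re
from collections import deque

# Hypothetical miniature web: url -> HTML (a stand-in for real pages).
WEB = {
    "a.html": "<title>Museums</title><a href='b.html'>next</a> artifacts in bins",
    "b.html": "<title>Search</title> spiders index the web",
}

def fetch(url):
    """Stand-in for an HTTP request."""
    return WEB.get(url, "")

def crawl(seed):
    """Breadth-first crawl: follow links, record searchable words per page."""
    seen, queue, records = set(), deque([seed]), {}
    while queue:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        html = fetch(url)
        # Strip markup and keep the words -- the "basic information"
        # that makes the page searchable.
        text = re.sub(r"<[^>]+>", " ", html.lower())
        records[url] = set(re.findall(r"[a-z]+", text))
        # Queue every outgoing link for later visiting.
        queue.extend(re.findall(r"href='([^']+)'", html))
    return records

records = crawl("a.html")
```

Starting from `a.html`, the spider discovers `b.html` on its own by following the link, which is exactly why crawling scales where human editing does not.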
Ah, but you, the intrepid reader who has discovered this blog post, did not use Yahoo to get here, did you? Or, if you did, you would probably find the proposition that Yahoo employs a veritable army of editors to constantly scour the web for new blog posts like this one at least slightly dubious. Unless the AI singularity has come and gone, humans have learned from their mistakes, and robots are relegated to the emergent holographic historical horror genre, the odds are very good that the search engine of your choice uses web crawlers. Why? Because we figured out how to make a better robot, and robots really are quite good at these things.
And yet… at least for now, humans still offer a way to organize data online efficiently (or at least cheaply) and perhaps even well. This method is what Thomas Vander Wal described as “folksonomies,” or user-generated tagging systems. Like Yahoo’s human-generated search indexes, folksonomies rely on people to tag webpages with relevant metadata. Alexis Wichowski writes about the theoretical issues of a user-generated tagging system, such as the value of tags created for personal use (like ‘mom’ or ‘read later’) in the broader scope of search indexing.
From an efficacy perspective, Wichowski notes that while “traditional subject directories…[outperform] folksonomies in precision and recall…folksonomies were a close second (Morrison, 2007). Further, when folksonomies were combined with the directories with controlled vocabularies, precision and recall results were higher than in searches using the controlled vocabularies alone.” This hybrid mode of search, relying partly on spiders and partly on user-generated tagging, then seems to be the ideal paradigm.
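Why would adding tags to a crawler's index raise recall? Because a tag can attach a term to a page whose text never uses that term. The sketch below is a minimal illustration under assumed data (the pages and tags are hypothetical, not drawn from any real system): one index is built from page text alone, the hybrid one merges in folksonomy tags.

```python
from collections import defaultdict

def build_index(pages, tags=None):
    """Map each term -> set of page ids, from page text plus optional tags."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for term in text.lower().split():
            index[term].add(page_id)
    # Folksonomy layer: user-supplied tags join the same index.
    for page_id, user_tags in (tags or {}).items():
        for tag in user_tags:
            index[tag.lower()].add(page_id)
    return index

# Hypothetical pages and user tags, for illustration only.
pages = {
    "p1": "museum catalogue of artifacts",
    "p2": "the engineering behind a search experience",
}
tags = {"p2": ["yahoo", "folksonomy"]}

text_only = build_index(pages)
hybrid = build_index(pages, tags)
```

A query for "folksonomy" finds nothing in the text-only index but retrieves `p2` in the hybrid one; the page is findable under a word that appears nowhere in its text, which is the recall gain Wichowski describes.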
In internet years, Wichowski’s article is nearly old enough to receive discount tickets at the movie theater, but user-generated tagging is still with us, and hybrid searching is widespread. Three years after Wichowski published her article, Twitter was handling 1.6 billion search queries per day, combining user-generated hashtags with more traditional computer-based search methods. Whether or not academia wishes to accept the legitimacy of this organizational method is a moot point: the resources to reorganize it are not readily available, and arguably it could not be done much better.
Danny Sullivan, “Once The Most Powerful Person In Search, Srinija Srinivasan Leaves Yahoo” (2010).
Twitter Search Team, “The Engineering Behind Twitter’s New Search Experience,” Twitter Engineering Blog (May 31, 2011). Retrieved February 27, 2017.