Clay Shirky on classification, browsing, tagging, and search

My contrarian streak makes me not want to publicize well-exposed figures like Clay Shirky, but sometimes the content is too good to pass up. I have been delving into issues of knowledge representation and ontologies for my own dissertation research, and found this very thought-provoking essay on classification, browsing, tagging, and search:

http://www.shirky.com/writings/ontology_overrated.html

Of Cards and Catalogs

The periodic table gets my vote for the best categorization scheme ever, but libraries have the best-known categorization schemes. The experience of the library catalog is probably what people know best as a high-order categorized view of the world, and those cataloging systems contain all kinds of odd mappings between the categories and the world they describe.

Here’s the first top-level category in the Soviet library system:

 

A: Marxism-Leninism
A1: Classic works of Marxism-Leninism
A3: Life and work of C.Marx, F.Engels, V.I.Lenin
A5: Marxism-Leninism Philosophy
A6: Marxist-Leninist Political Economics
A7/8: Scientific Communism

Some of those categories are starting to look a little bit dated.

Or, my favorite — this is the Dewey Decimal System’s categorization for religions of the world, which is the 200 category.

 

Dewey, 200: Religion
210 Natural theology
220 Bible
230 Christian theology
240 Christian moral & devotional theology
250 Christian orders & local church
260 Christian social theology
270 Christian church history
280 Christian sects & denominations
290 Other religions

How much is this not the categorization you want in the 21st century?

This kind of bias is rife in categorization systems. Here’s the Library of Congress’ categorization of History. These are all the top-level categories — all of these things are presented as being co-equal.

 

D: History (general)
DA: Great Britain
DB: Austria
DC: France
DD: Germany
DE: Mediterranean
DF: Greece
DG: Italy
DH: Low Countries
DJ: Netherlands
DK: Former Soviet Union
DL: Scandinavia
DP: Iberian Peninsula
DQ: Switzerland
DR: Balkan Peninsula
DS: Asia
DT: Africa
DU: Oceania
DX: Gypsies

I’d like to call your attention to three of these: The Balkan Peninsula. Asia. Africa.

And just, you know, to review the geography:

 


[ Spot the difference? ]

Yet, for all the oddity of placing the Balkan Peninsula and Asia in the same level, this is harder to laugh off than the Dewey example, because it’s so puzzling. The Library of Congress — no slouches in the thinking department, founded by Thomas Jefferson — has a staff of people who do nothing but think about categorization all day long. So what’s being optimized here? It’s not geography. It’s not population. It’s not regional GDP.

What’s being optimized is number of books on the shelf. That’s what the categorization scheme is categorizing. It’s tempting to think that the classification schemes that libraries have optimized for in the past can be extended in an uncomplicated way into the digital world. This badly underestimates, in my view, the degree to which what libraries have historically been managing is an entirely different problem.

The musculature of the Library of Congress categorization scheme looks like it’s about concepts. It is organized into non-overlapping categories that get more detailed at lower and lower levels — any concept is supposed to fit in one category and in no other categories. But every now and again, the skeleton pokes through, and the skeleton, the supporting structure around which the system is really built, is designed to minimize seek time on shelves.

The essence of a book isn’t the ideas it contains. The essence of a book is “book.” Thinking that library catalogs exist to organize concepts confuses the container for the thing contained.

The categorization scheme is a response to physical constraints on storage, and to people’s inability to keep the location of more than a few hundred things in their mind at once. Once you own more than a few hundred books, you have to organize them somehow. (My mother, who was a reference librarian, said she wanted to reshelve the entire University library by color, because students would come in and say “I’m looking for a sociology book. It’s green…”) But however you do it, the frailty of human memory and the physical fact of books make some sort of organizational scheme a requirement, and hierarchy is a good way to manage physical objects.

The “Balkans/Asia” kind of imbalance is simply a byproduct of physical constraints. It isn’t the ideas in a book that have to be in one place; a book can be about several things at once. It is the book itself, the physical fact of the bound object, that has to be in one place, and if it’s in one place, it can’t also be in another place. And this in turn means that a book has to be declared to be about some main thing. A book which is equally about two things breaks the ‘be in one place’ requirement, so each book needs to be declared to be about one thing more than others, regardless of its actual contents.
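
To make the 'be in one place' constraint concrete, here is a minimal Python sketch. It is not from Shirky's essay: the book titles, subject labels, and the helper names `shelve` and `tag_index` are all invented for illustration. It contrasts a shelf, where each book is filed under a single declared subject, with a tag index, where a book is listed under every subject it covers.

```python
from collections import defaultdict

# Hypothetical catalog (made up for this sketch): each book is
# "about" several subjects at once.
BOOKS = {
    "Art of the Balkans": ["Balkans", "Art"],
    "A History of Asia": ["Asia", "History"],
}

def shelve(books):
    """Physical shelf: each book gets exactly one location.

    Assumption for this sketch: the first-listed subject is the one
    the cataloguer declares to be the book's main subject.
    """
    return {title: subjects[0] for title, subjects in books.items()}

def tag_index(books):
    """No shelf: each book is listed under every subject it covers."""
    index = defaultdict(set)
    for title, subjects in books.items():
        for subject in subjects:
            index[subject].add(title)
    return dict(index)

shelf = shelve(BOOKS)
tags = tag_index(BOOKS)

# On the shelf, nothing is filed under "Art"; in the tag index, the
# same book shows up under both "Art" and "Balkans".
books_shelved_under_art = [t for t, s in shelf.items() if s == "Art"]
```

On the shelf, someone browsing "Art" never finds the Balkans book, even though it is half about art. The tag index has no such blind spot, because nothing forces the record to live in only one place.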

People have been freaking out about the virtuality of data for decades, and you’d think we’d have internalized the obvious truth: there is no shelf. In the digital world, there is no physical constraint that’s forcing this kind of organization on us any longer. We can do without it, and you’d think we’d have learned that lesson by now.

And yet.

The Parable of the Ontologist, or, “There Is No Shelf”

A little over ten years ago, a couple of guys out of Stanford launched a service called Yahoo that offered a list of things available on the Web. It was the first really significant attempt to bring order to the Web. As the Web expanded, the Yahoo list grew into a hierarchy with categories. As the Web expanded more they realized that, to maintain the value in the directory, they were going to have to systematize, so they hired a professional ontologist, and they developed their now-familiar top-level categories, which go to subcategories, each subcategory contains links to still other subcategories, and so on. Now we have this ontologically managed list of what’s out there.

Here we are in one of Yahoo’s top-level categories, Entertainment.


[ Yahoo’s Entertainment Category ]

You can see what the sub-categories of Entertainment are, whether or not there are new additions, and how many links roll up under those sub-categories. Except, in the case of Books and Literature, that sub-category doesn’t tell you how many links roll up under it. Books and Literature doesn’t end with a number of links, but with an “@” sign. That “@” sign is telling you that the category of Books and Literature isn’t ‘really’ in the category Entertainment. Yahoo is saying “We’ve put this link here for your convenience, but that’s only to take you to where Books and Literature ‘really’ are.” To which one can only respond — “What’s real?”

Yahoo is saying “We understand better than you how the world is organized, because we are trained professionals. So if you mistakenly think that Books and Literature are entertainment, we’ll put a little flag up so we can set you right, but to see those links, you have to ‘go’ to where they ‘are’.” (My fingers are going to fall off from all the air quotes.) When you go to Literature — which is part of Humanities, not Entertainment — you are told, similarly, that booksellers are not ‘really’ there. Because they are a commercial service, booksellers are ‘really’ in Business.


[ ‘Literature’ on Yahoo ]

Look what’s happened here. Yahoo, faced with the possibility that they could organize things with no physical constraints, added the shelf back. They couldn’t imagine organization without the constraints of the shelf, so they added it back. It is perfectly possible for any number of links to be in any number of places in a hierarchy, or in many hierarchies, or in no hierarchy at all. But Yahoo decided to privilege one way of organizing links over all others, because they wanted to make assertions about what is “real.”

The charitable explanation for this is that they thought of this kind of a priori organization as their job, and as something their users would value. The uncharitable explanation is that they thought there was business value in determining the view the user would have to adopt to use the system. Both of those explanations may have been true at different times and in different measures, but the effect was to override the users’ sense of where things ought to be, and to insist on the Yahoo view instead.

File Systems and Hierarchy

It’s easy to see how the Yahoo hierarchy maps to technological constraints as well as physical ones. The constraints in the Yahoo directory describe both a library categorization scheme and, obviously, a file system. The file system is both a powerful tool and a powerful metaphor, and we’re all so used to it that it seems natural.


[ Hierarchy ]

There’s a top level, and subdirectories roll up under that. Subdirectories contain files or further subdirectories and so on, all the way down. Both librarians and computer scientists hit the same next idea, which is “You know, it wouldn’t hurt to add a few secondary links in here” — symbolic links, aliases, shortcuts, whatever you want to call them.


[ Plus Links ]

The Library of Congress has something similar in its second-order categorization — “This book is mainly about the Balkans, but it’s also about art, or it’s mainly about art, but it’s also about the Balkans.” Most hierarchical attempts to subdivide the world use some system like this.

Then, in the early 90s, one of the things that Berners-Lee showed us is that you could have a lot of links. You don’t have to have just a few links, you could have a whole lot of links.


[ Plus Lots of Links ]

This is where Yahoo got off the boat. They said, “Get out of here with that crazy talk. A URL can only appear in three places. That’s the Yahoo rule.” They did that in part because they didn’t want to get spammed, since they were doing a commercial directory, so they put an upper limit on the number of symbolic links that could go into their view of the world. They missed the end of this progression, which is that, if you’ve got enough links, you don’t need the hierarchy anymore. There is no shelf. There is no file system. The links alone are enough.


[ Just Links (There Is No Filesystem) ]

One reason Google was adopted so quickly when it came along is that Google understood there is no shelf, and that there is no file system. Google can decide what goes with what after hearing from the user, rather than trying to predict in advance what it is you need to know.

Let’s say I need every Web page with the word “obstreperous” and “Minnesota” in it. You can’t ask a cataloguer in advance to say “Well, that’s going to be a useful category, we should encode that in advance.” Instead, what the cataloguer is going to say is, “Obstreperous plus Minnesota! Forget it, we’re not going to optimize for one-offs like that.” Google, on the other hand, says, “Who cares? We’re not going to tell the user what to do, because the link structure is more complex than we can read, except in response to a user query.”
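
The “obstreperous plus Minnesota” query can be sketched in a few lines of Python. This is only an illustration of answering a query after the fact, using a plain keyword index; it is not a description of how Google actually works, which the essay ties to link structure. The page URLs, page text, and the function names `build_index` and `search` are assumptions made up for the example.

```python
import re
from collections import defaultdict

# Hypothetical pages; the URLs and text are invented for illustration.
PAGES = {
    "example.org/a": "An obstreperous crowd gathered in Minnesota.",
    "example.org/b": "Minnesota winters are long.",
    "example.org/c": "The obstreperous toddler refused to nap.",
}

def build_index(pages):
    """Map each lowercase word to the set of pages that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(url)
    return index

def search(index, *terms):
    """Return the pages that contain every one of the given terms."""
    sets = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*sets) if sets else set()

index = build_index(PAGES)
results = search(index, "obstreperous", "Minnesota")
```

Nobody had to decide in advance that “obstreperous things in Minnesota” was a category worth maintaining. The grouping exists only for the moment the query is asked, which is the contrast with the cataloguer that the passage draws.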

Browse versus search is a radical increase in the trust we put in link infrastructure, and in the degree of power derived from that link structure. Browse says the people making the ontology, the people doing the categorization, have the responsibility to organize the world in advance. Given this requirement, the views of the catalogers necessarily override the user’s needs and the user’s view of the world. If you want something that hasn’t been categorized in the way you think about it, you’re out of luck.

The search paradigm says the reverse. It says nobody gets to tell you in advance what it is you need. Search says that, at the moment that you are looking for it, we will do our best to service it based on this link structure, because we believe we can build a world where we don’t need the hierarchy to coexist with the link structure.

A lot of the conversation that’s going on now about categorization starts at a second step — “Since categorization is a good way to organize the world, we should…” But the first step is to ask the critical question: Is categorization a good idea? We can see, from the Yahoo versus Google example, that there are a number of cases where you get significant value out of not categorizing. Even Google adopted DMOZ, the open source version of the Yahoo directory, and later they downgraded its presence on the site, because almost no one was using it. When people were offered search and categorization side-by-side, fewer and fewer people were using categorization to find things.
