The diminishing returns on data
This surprised me because there's a fairly widespread assumption out there that Google's search scale is an important source of its competitive advantage. Varian seems to be talking only about the effects of data scale on the quality of results and ads (there are other possible scale advantages, such as the efficiency of the underlying computing infrastructure), but if he's right that Google long ago hit the point of diminishing returns on data, that's going to require some rethinking of a few basic orthodoxies about competition on the web.
I was reminded, in particular, of one of Tim O'Reilly's fundamental beliefs about the business implications of Web 2.0: that a company's scale of data aggregation is crucial to its competitive success. As he recently wrote: "Understanding the dynamics of increasing returns on the web is the essence of what I called Web 2.0. Ultimately, on the network, applications win if they get better the more people use them. As I pointed out back in 2005, Google, Amazon, ebay, craigslist, wikipedia, and all other Web 2.0 superstar applications have this in common." (The italics are O'Reilly's.)
I don't see how he can lump together what are clearly several different things in this post. To take just two:
1) Google makes "dumb" connections: not even its vaunted algorithms are as smart as a human being

2) O'Reilly is still correct because he is talking about the connections made by human intelligence -- and he is even right about Google, in that it takes humans clicking on results to improve ranking accuracy
However, this (A) ...
But Varian's argument goes much further than that. He's saying that the assumption of an increasing returns dynamic in data collection -- what O'Reilly calls "the essence" of Web 2.0 -- is "pretty bogus." The benefit from aggregating data is actually subject to decreasing returns, thanks to the laws of statistics.
... ties in with this (B): When less is more
Because something can be done does not always mean it should be, though. Back in the 1980s, Richard Gabriel, an expert on Lisp programming, noted that quality in software development does not necessarily increase with functionality. "Worse is better" was the phrase he coined in a seminal essay on Lisp. There comes a point, he argued, where less functionality ("worse") is a more desirable ("better") optimisation of usefulness. In other words, a software program that is limited in scope but easy to use is generally better than one that is more comprehensive but harder to use.
Mr Gabriel's paradox was really an attack on "bloatware" -- in particular, the kind of feature-creep that forced Apple to abandon its Copland operating system and buy NeXT for the Unix software that became Macintosh OS X. In the process, "worse is better" has become one of the pillars of efficient software design and much else. Regrettably, it is not practised as much as it should be. But when it is, the process embodies simplicity, correctness, consistency and completeness.
But what passage A lacks, and passage B takes into account, is the human element.
Dr. Edward de Bono put it best:
We produce value through design.
There is no natural route to simplicity.
Design and simplicity are human creations. To expect our coarse software tools of today to produce elegance and intuitiveness is to invite disaster and to look for shortcuts that might never exist. (Sue me: I favor human intelligence and imagination over their by-product -- algorithms.)
Out of France: Towards the convergence of bibliographic formats? [Google English link] -- which is about the rise of ONIX. Although nuances are lost in the translation, I was surprised to learn that bibliographic data "over there" (France, and Europe more broadly) is not handled the same way it's been done in the United States. So much for my thinking that librarianship and archival practice had become universal.
What troubles me is this bit:
It seems that XML is much more promising in this regard, as evidenced by the implementation of EAD in the world of archives, which is based on an XML declaration. Created in 1993 at the library of the University of California at Berkeley, EAD is a DTD for encoding archival finding aids and records -- we would insist on the "and" -- which proves that the same format can be used for structuring both the primary information and the secondary information.
EAD, as you will shortly see, is not considered the best flavor to bet on. There's RDF. I don't know if they can co-exist or if one must topple the other.
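For what it's worth, the two need not topple each other: a record encoded in an EAD-style XML hierarchy can, at least in principle, be re-expressed as RDF-style triples. Here is a minimal sketch of that mapping in Python -- the element names, the `record:1` subject, and the flattening rule are simplified stand-ins of my own, not the actual EAD schema or an RDF vocabulary:

```python
# Toy illustration: the same archival fact expressed first as
# EAD-style XML, then flattened into RDF-style triples.
# Element names and the "record:1" URI are simplified stand-ins.
import xml.etree.ElementTree as ET

ead_fragment = """
<archdesc>
  <unittitle>Berkeley Library Papers</unittitle>
  <unitdate>1993</unitdate>
  <origination>UC Berkeley</origination>
</archdesc>
"""

def xml_to_triples(xml_text, subject="record:1"):
    """Flatten a simple XML record into (subject, predicate, object) triples."""
    root = ET.fromstring(xml_text)
    return [(subject, child.tag, child.text.strip()) for child in root]

for triple in xml_to_triples(ead_fragment):
    print(triple)
```

The point of the sketch is only that the hierarchical encoding and the triple encoding carry the same information, so coexistence via conversion is conceivable.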
Via @doctorlaura and someone else on Twitter: Linked Data and Archival Description: Confluences, Contingencies, and Conflicts, presented at the Encoded Archival Description Roundtable at the Society of American Archivists Annual Meeting, August 12, 2009. A slideshow from which I am extracting some of the interesting slides:
What also must be taken into account is third-party client software that can create new relationships on the fly. To use another Wall Street analogy: think of how hedge funds take the raw data and metadata of finance and create proprietary trading systems. They see things others don't. So it will be with book metadata.
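The "hedge fund" point can be made concrete: give a client two independent metadata feeds and it can join them to surface a relationship neither feed states on its own. A toy sketch, where every field name and record is invented for illustration:

```python
# Toy sketch: cross two independent book-metadata feeds to derive
# a relationship ("bestseller") that neither feed states explicitly.
# All field names, records, and the threshold are invented.

publisher_feed = [
    {"isbn": "978-0-00-000001-1", "title": "Literary Machines"},
    {"isbn": "978-0-00-000002-8", "title": "Some Other Title"},
]

sales_feed = [
    {"isbn": "978-0-00-000001-1", "units_sold": 1200},
    {"isbn": "978-0-00-000002-8", "units_sold": 4500},
]

def derive_bestsellers(publishers, sales, threshold=2000):
    """Join the feeds on ISBN and attach a derived property: 'bestseller'."""
    sold = {row["isbn"]: row["units_sold"] for row in sales}
    return [
        {**book, "bestseller": sold.get(book["isbn"], 0) >= threshold}
        for book in publishers
    ]

for book in derive_bestsellers(publisher_feed, sales_feed):
    print(book["title"], "->", book["bestseller"])
```

The derived property lives only in the client; that is the "proprietary trading system" of the analogy -- seeing things others don't by recombining raw metadata.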
It looks forbidding when it's illustrated like that. But the thing is, it's all built one step at a time.
Yes. And being able to see the underlying definition helps to ascertain the original assumptions that were made!
I'm not so sure about that. If that's true, then something is possibly wrong. Again Dr. Edward de Bono:
Patterns are asymmetric. The route from A to B is not the same as the route from B to A.
A chart such as that makes me think of my reaction to reading Theodor Nelson's mind-blowing Literary Machines back in the early 1980s. People laughed at me when I grasped that information was becoming spherical in nature. Well, that's a flat sphere.
This next slide is for @doctorlaura:
She raises many questions (in a Comment here) about how to do all this. I think such questions are asking for answers before all the questions themselves are known. Plus, we're not looking at a Big Bang phenomenon here. It's accretive, like the Internet itself.
Including the assumptions behind the labels!
Inheritance of concepts is an interesting idea.
But will everything necessarily be hierarchical in nature?
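Not necessarily, and that's the interesting part: in linked-data-style models, "broader than" relations form a graph rather than a strict tree, because a concept can inherit from more than one parent. A sketch of transitive concept inheritance over such a graph -- the concept names and the `broader` mapping are invented for illustration:

```python
# Toy sketch: concept "inheritance" as transitive closure over a
# broader-than graph. Note that "e-book" has TWO parents, so this
# is a poly-hierarchy (a graph), not a strict tree.
# Concept names are invented for illustration.

broader = {
    "e-book": {"book", "digital object"},
    "book": {"publication"},
    "digital object": {"file"},
}

def ancestors(concept, graph):
    """All concepts a given concept transitively inherits from."""
    seen = set()
    stack = [concept]
    while stack:
        for parent in graph.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

print(sorted(ancestors("e-book", broader)))
# "e-book" inherits from book, digital object, file, and publication
```

So inheritance of concepts survives even when the structure isn't a clean hierarchy; the traversal just has to tolerate multiple parents.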
What I need someone to show me -- or to create (for everyone!) -- is a flowchart showing the hierarchy of metadata production, current methods, and the proposals vying to become standards. Beginning with what publishers use, then bookstores, then libraries, then other archival outlets, and finally where book metadata would fit in (somewhere right below publisher, I think) and how that would flow to everything/one else.
Smart Digital Books Metadata Notes #3
Smart Digital Books Metadata Notes #2
Smart Digital Books Metadata Notes #1
Dumb eBooks Must Die, Smart eBooks Must Live