Friday, August 14, 2009

Smart Digital Books Metadata Notes #4

Another throw-it-all-in-the-pot post.

The diminishing returns on data
This surprised me because there's a fairly widespread assumption out there that Google's search scale is an important source of its competitive advantage. Varian seems to be talking only about the effects of data scale on the quality of results and ads (there are other possible scale advantages, such as the efficiency of the underlying computing infrastructure), but if he's right that Google long ago hit the point of diminishing returns on data, that's going to require some rethinking of a few basic orthodoxies about competition on the web.

I was reminded, in particular, of one of Tim O'Reilly's fundamental beliefs about the business implications of Web 2.0: that a company's scale of data aggregation is crucial to its competitive success. As he recently wrote: "Understanding the dynamics of increasing returns on the web is the essence of what I called Web 2.0. Ultimately, on the network, applications win if they get better the more people use them. As I pointed out back in 2005, Google, Amazon, ebay, craigslist, wikipedia, and all other Web 2.0 superstar applications have this in common." (The italics are O'Reilly's.)

I don't see how he can lump together what are clearly several different things in this post. To take just two:

1) Google makes "dumb" connections; not even its vaunted algorithms are as smart as a human being

2) O'Reilly is still correct because he is talking about the connections made by human intelligence -- and he is even right about Google in that it takes humans clicking on results to improve ranking accuracy

However, this (A) ...
But Varian's argument goes much further than that. He's saying that the assumption of an increasing returns dynamic in data collection -- what O'Reilly calls "the essence" of Web 2.0 -- is "pretty bogus." The benefit from aggregating data is actually subject to decreasing returns, thanks to the laws of statistics.

... ties in with this (B): When less is more
Just because something can be done does not always mean it should be, though. Back in the 1980s, Richard Gabriel, an expert on Lisp programming, noted that quality in software development does not necessarily increase with functionality. "Worse is better" was the phrase he coined in a seminal essay on Lisp. There comes a point, he argued, where less functionality ("worse") is a more desirable ("better") optimisation of usefulness. In other words, a software program that is limited in scope but easy to use is generally better than one that is more comprehensive but harder to use.

Mr Gabriel's paradox was really an attack on "bloatware" -- in particular, the kind of feature-creep that forced Apple to abandon its Copland operating system and buy NeXT for the Unix software that became Mac OS X. In the process, "worse is better" has become one of the pillars of efficient software design and much else. Regrettably, it is not practised as much as it should be. But when it is, the process embodies simplicity, correctness, consistency and completeness.
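An aside on the A bit: Varian's appeal to the laws of statistics is easy to make concrete. The precision of an estimate grows only with the square root of the sample size, so each new pile of data buys less improvement than the last. A minimal sketch (the 5% click-through rate is an invented figure, purely for illustration):

    import random

    # Simulate estimating a 5% click-through rate from samples of
    # increasing size. The standard error shrinks as 1/sqrt(n), so each
    # tenfold increase in data cuts the error by a factor of only ~3.2.
    random.seed(42)
    TRUE_RATE = 0.05

    for n in (1_000, 10_000, 100_000, 1_000_000):
        clicks = sum(random.random() < TRUE_RATE for _ in range(n))
        estimate = clicks / n
        std_error = (TRUE_RATE * (1 - TRUE_RATE) / n) ** 0.5
        print(f"n={n:>9,}  estimate={estimate:.4f}  std. error={std_error:.5f}")

Going from a thousand observations to a million multiplies the data a thousandfold but improves the precision only about thirtyfold. Past some point, more data simply stops mattering.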

But what the A bit lacks that the B takes into account is the human element.

Dr. Edward de Bono put it best:
We produce value through design.

And:
There is no natural route to simplicity.

Design and simplicity are human creations. To expect our coarse software tools of today to produce elegance and intuitiveness is to invite disaster and to look for shortcuts that might never exist. (Sue me: I favor human intelligence and imagination over their by-product -- algorithms.)

Out of France: Towards the convergence of bibliographic formats? [Google English link] -- which is about the rise of ONIX. Although nuances are missing in the translation, I was surprised to learn that bibliographic data "over there" (France and Europe broadly) is not handled the same way it is in the United States. So much for my thinking that librarianship and archival practice had become universal.

What troubles me is this bit:
XML seems much more promising in this regard, as evidenced by the adoption of EAD in the world of archives, which is based on XML [8]. Created in 1993 at the library of the University of California at Berkeley, EAD is a DTD for encoding archival finding aids and records -- and we would insist on that "and" -- which proves that the same format can be used for structuring both primary information and secondary information.

EAD, as you will shortly see, is not considered the best flavor to bet on. There's RDF. I don't know if they can co-exist or if one must topple the other.
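For what it's worth, the two aren't necessarily rivals: EAD is an XML markup format, while RDF is a data model that can be serialized in XML or in anything else. A minimal sketch of pulling RDF-style triples out of an EAD-like record (the element names and URI scheme here are invented for illustration, not the real EAD DTD):

    import xml.etree.ElementTree as ET

    # A toy fragment loosely shaped like an archival finding aid.
    # (Invented element names -- not actual EAD markup.)
    doc = """
    <findingaid id="fa-001">
      <title>Papers of Jane Doe</title>
      <creator>Jane Doe</creator>
      <repository>University Library</repository>
    </findingaid>
    """

    root = ET.fromstring(doc)
    subject = f"urn:example:{root.get('id')}"  # hypothetical URI scheme

    # Each child element becomes a (subject, predicate, object) triple.
    triples = [(subject, f"urn:example:prop/{child.tag}", child.text)
               for child in root]

    for t in triples:
        print(t)

The same record can serve archivists as marked-up XML and linked-data tools as triples; the fight, if there is one, is over which view gets to be primary.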

Via @doctorlaura and someone else on Twitter: Linked Data and Archival Description: Confluences, Contingencies, and Conflicts, presented at the Encoded Archival Description Roundtable at the Society of American Archivists Annual Meeting, August 12, 2009. A slideshow from which I am extracting some of the more interesting slides:

[slide]

What also must be taken into account is third-party client software that can create new relationships on the fly. To use another Wall Street analogy: think of how hedge funds take the raw data and metadata of finance and create proprietary trading systems. They see things others don't. So it will be with book metadata.
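A toy version of what such client software might do: take flat metadata records and derive relationships nobody bothered to encode. The records and fields here are invented for illustration:

    from collections import defaultdict
    from itertools import combinations

    # Flat metadata records, as a publisher or retailer might expose them.
    books = [
        {"title": "Book A", "subjects": {"hypertext", "publishing"}},
        {"title": "Book B", "subjects": {"hypertext", "libraries"}},
        {"title": "Book C", "subjects": {"libraries", "archives"}},
    ]

    # Derive a new relationship on the fly: two books are "related"
    # if they share at least one subject. No one encoded these edges;
    # the client creates them from the raw metadata.
    related = defaultdict(set)
    for a, b in combinations(books, 2):
        if a["subjects"] & b["subjects"]:
            related[a["title"]].add(b["title"])
            related[b["title"]].add(a["title"])

    print(dict(related))

The "related" edges never existed in the source metadata; the client manufactured them on the fly -- which is exactly the hedge-fund move.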

[slide]

It looks forbidding when it's illustrated like that. But the thing is, it's all built one step at a time.

[slide]

Yes. And being able to see what the underlying definition is helps to ascertain the original assumptions that were made!

[slide]

I'm not so sure about that. If that's true, then something is possibly wrong. Again Dr. Edward de Bono:
Patterns are asymmetric. The route from A to B is not the same as the route from B to A.

[slide]

A chart such as that makes me think of my reaction to reading Theodor Nelson's mind-blowing Literary Machines back in the early 1980s. People laughed at me for grasping that information became spherical in nature. Well, that's a flat sphere.

[slide]

This next slide is for @doctorlaura:

[slide]

She raises many questions (in a Comment here) about how to do all this. I think such questions are asking for answers before all the questions themselves are known. Plus, we're not looking at a Big Bang phenomenon here. It's accretive, like the Internet itself.

[slide]

Including the assumptions behind the labels!

[slide]

Inheritance of concepts is an interesting idea.
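A sketch of what inheritance might buy, assuming the usual broader/narrower concept chains (the vocabulary here is made up): a book tagged with a narrow concept becomes automatically findable under every broader one.

    # A made-up concept scheme: each concept points to its broader parent.
    broader = {
        "sonnets": "poetry",
        "poetry": "literature",
        "novels": "literature",
    }

    def ancestors(concept):
        """Walk up the broader-than chain, yielding inherited concepts."""
        while concept in broader:
            concept = broader[concept]
            yield concept

    # A book tagged "sonnets" inherits "poetry" and "literature",
    # so a search for "literature" can still find it.
    tags = {"sonnets"}
    expanded = set(tags)
    for tag in tags:
        expanded.update(ancestors(tag))

    print(expanded)  # {'sonnets', 'poetry', 'literature'}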

[slide]

But will everything necessarily be hierarchical in nature?

What I need someone to show me -- or to create (for everyone!) -- is a flowchart showing the hierarchy of metadata production, current methods, and the proposals vying to become standards. Beginning with what publishers use, then bookstores, then libraries, then other archival outlets, and finally where book metadata would fit in (somewhere right below publisher, I think) and how that would flow to everything (and everyone) else.
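Until someone draws that flowchart, here is the chain this paragraph guesses at, as a crude sketch. The ordering is my assumption, not any actual standard:

    from collections import defaultdict

    # A guessed-at flow of book metadata production, top to bottom.
    # (My assumption from the paragraph above -- not a standard.)
    feeds = defaultdict(list)
    for src, dst in [
        ("publisher", "book metadata"),
        ("book metadata", "bookstores"),
        ("book metadata", "libraries"),
        ("libraries", "other archival outlets"),
    ]:
        feeds[src].append(dst)

    def downstream(node, depth=0):
        """Print everything that receives metadata from `node`."""
        print("  " * depth + node)
        for nxt in feeds.get(node, []):
            downstream(nxt, depth + 1)

    downstream("publisher")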

Previously here:

Smart Digital Books Metadata Notes #3

Smart Digital Books Metadata Notes #2
Smart Digital Books Metadata Notes #1
Dumb eBooks Must Die, Smart eBooks Must Live

2 comments:

laura said...

One thing that struck me as I read this follows on directly from your closing paragraph:

When thinking about these book-related metadata standards, it seems to me that one of the main stumbling blocks is the different objectives of the various actors: publishers, distributors, retailers, librarians, and readers.

Publishers and retailers are primarily interested in selling books, so their main concern is that sellers and readers can find and purchase them. Librarians are also interested in search and retrieval, although their primary interest is more archival than commercial in nature.

None of these are primarily interested in making books more "useful" to readers, which is what I understand as your main interest.

It seems to me that web actors such as the W3C or Google are interested in exactly such enhancements, albeit only for digital works, and (as you point out in your 5th installment) arguably only to the extent that making them more useful is good for web business.

Mike Cane said...

>>>None of these are primarily interested in making books more "useful" to readers, which is what I understand as your main interest.

Yes, this is true. This is a failure of print publishing leadership -- and of the absence of such leadership, too. It took Editis in France to commission a video envisioning a digital book that wasn't flat, static ePub!

>>>It seems to me that web actors such W3C or Google are interested in exactly such enhancements

I don't know about W3C, but certainly Google is -- which is why I oppose the Google Book Search settlement.