This is a transcript of the presentation “What Brian Cant Never Taught You About Metadata”, by Drew McLellan from Geek in the Park 2008.
About the Speaker
Drew is a proponent of the lower-case semantic web, a strong advocate of best practises, and is currently a Group Lead for the Web Standards Project. He is currently expending energies in the direction of the microformats movement. He blogs at all in the head and, with a little help from his friends, at 24ways. Drew has been hacking on the web since around 1996 and since then he's spread himself between both front- and back-end development projects, and now works as a Web Developer for edgeofmyseat.com in Maidenhead, UK.
[0:00] A quick show of hands -- who is under about 25 years of age?
[0:05] OK. Have you heard of Brian Cant?
[0:11] Yeah, yeah. So I suddenly thought on the way here I might be coming a little unstuck with this. But we will get onto who Brian Cant is.
[0:20] What I need you to do throughout this, if you see anywhere on the slides the word “Brain”, where it should be Brian, please just shout out “brain” because I need to make a note and correct them.
[0:34] I called this presentation “What Brian Cant Never Taught You about Metadata”. A possible subtitle is “Everything You Know about Metadata is Wrong” or “How I Learned to Stop Worrying and Love the Data”.
[0:49] Obviously this is “Geek in the Park”. You have come here to hear about metadata, HTML, robots, 1970s and ’80s children’s television programs, tofu, truth, honesty and some made-up rules stated as absolute.
[1:04] But I probably ought to tell you who I am. My name is Drew McLellan. As Bruce pointed out, I run, with Rachel Andrew, a Web development agency called edgeofmyseat.com. I blog at allinthehead.com. For the last couple of years, I have been group leader of the Web Standards Project. I am a community admin at microformats.org. So that is where I am coming from.
[1:30] First, let’s talk about Brian Cant. This is Brian Cant, bless him, a children’s television presenter from the 1970s, ’80s. He presented “Play School”, before that, “Play Away” and all sorts of things -- a narration for Camberwick Green, Trumpton, Chigley, Windy Miller, Pugh, Pugh, Barney McGrew and all that sort of stuff -- the voice of Brian Cant.
[1:58] When I was growing up... This is a photo from 1982. This was sort of about the time I was watching. Brian Cant taught us an awful lot of things. Back then children’s television presenters wore shirts. This was a really good, moral time where television was not like it is now back then.
[2:24] The stories they put across had good, strong morals. I really learned almost everything I know from Brian Cant.
[2:31] Here is a graph.
[2:33] This is everything I know. 80 percent was taught to me by Brian Cant. 20 percent other sources.
[2:43] Of that 80 percent, if we break that down, you can see that one percent is about metadata and 99 percent other important stuff. But one percent, of all of the stuff that Brian Cant taught me, only one percent was metadata. That means there was more than Brian was letting on.
[3:08] So this presentation is to have a look at some of that stuff that Brian never taught us that I think we probably ought to know.
[3:16] But first, what did Brian teach us? One of the things was he taught us to share. Sharing is very important. It is one of the fundamental things that is a building block of society if you like. As human beings, we have to learn to share to get on with one another, share the good things and the bad things.
[3:42] The Web is really all about sharing. That is why it exists. It is for publishing information that other people might want to consume.
[3:52] I say just ask Humpty and Jemima here. Who remembers Humpty and Jemima? Big Ted, little Ted? Hamble? Hamble was a bit scary. But Humpty -- a fantastic dresser I think.
[4:06] So the Web is all about sharing. We use the Web to share stuff, like Brian taught us to. That is really its primary job. That is why it is there.
[4:21] And there are all sorts of things that we share on the Web. There are all sorts of types of data. Some of it is very common, some of it less so. Some of it has a very wide appeal, some quite a narrow appeal.
[4:35] Of the common items of data, you have got obviously things like names, addresses, dates and times, things that you are selling, reviews, all this sort of thing that has general, wide appeal, easy to understand and to consume.
[4:51] Then you have got more obscure things like your aunty publishing a hat collection. People maintaining Wikipedia pages about every place that Paul McCartney has sneezed since 1962. People publishing data about how many days their web server has been up.
[5:06] In fact my favorite Wikipedia page is simply titled “List of Chairs”.
[5:12] It has quite an extraordinary list of chairs on there. So people are sharing all sorts of data.
[5:20] The reason people share is because all data is potentially useful to someone else other than yourself. Rather than keeping that information locked up, you put it out on the Web. Other people can use it and other people can do good things with it. That is why we publish.
[5:38] So Brian Cant taught us to share. He also taught us to tell the truth. These two things go hand-in-hand really, because when we share stuff it is only really useful if we are sharing stuff that is true. So data is only useful if it is correctly described, so people know what it is and how to use it. We are going to come back to that in just a little bit.
[6:05] This is the Play School clock from about 1963, or something. I don’t know. It wasn’t the 1980s version. That was a bit too difficult to cut out. Clearly I didn’t do that great of a job with this one.
[6:19] Anyway. So metadata. Who is familiar with the term metadata? OK. Metadata is just data about other data. It enables you to unlock what is there.
[6:37] Data in itself is pretty uninteresting and pretty difficult to understand unless you know what it is. Now, metadata is not a new concept. And it is certainly not a Web thing. It is something that exists in all walks of life.
[6:53] This is a bus timetable from the Island of Phuket. Here is the data. It is obviously pretty uninteresting. I can see a 17 here. There is a 10 and a 20. Of course, this data doesn’t make any sense without the descriptions at the top and at the side to actually put that data into context and make it mean something.
[7:18] A really obvious example is in audio. MP3 files and the like have metadata embedded in them. I picked up some at random from iTunes and it actually turned out to be “Brian’s Song”, about which I was quite pleased.
[7:35] So against an MP3 here we have got name of the album, name of the artist, where it falls in a collection, track one of 12, the genre, all this sort of information about the music track.
[7:50] Again, in photographs, particularly JPEGs and also raw files and things, you get information about the camera embedded in them. So you can see... I realize now this is a stupid one because I took it with a LensBaby, so there is no information about the aperture or the focal length.
[8:07] So this is misleading. This is lying metadata. So you shouldn’t do this. But there is all sorts of stuff there about the picture that helps us put into context and helps us to understand it a bit better.
[8:26] Metadata is pretty much everywhere. It is often hidden away, which actually is a bad thing. Metadata is so useful that it deserves prominence. The more you hide data away, the less useful it becomes. So conversely, the more you expose your metadata, the more useful it is and more useful the original data becomes, because of it being enhanced by the metadata.
[8:59] This brings us onto my first opinion stated as an absolute, which is...
[9:12] I think my Mac has crashed.
[9:19] Sorry about that.
[9:21] Metadata is everywhere. It’s hidden, and it shouldn’t be hidden. It should be exposed because the more it’s exposed, the more useful it is and the more it enhances the data that you’re already publishing.
[9:33] Rule Number One is: Beware of dark data.
[9:38] Dark data is data that’s hidden away under the surface that people can’t see. It’s a problem because when it’s not exposed, it’s not as useful because people don’t know it’s there.
[9:51] Hidden data gets forgotten, and over time it can go out of date or it can become inaccurate. Because it’s not clearly visible in front of everyone, people forget it’s there and don’t update it when the rest of the data is updated.
[10:08] So dark data is something to say “no” to. Any type of metadata that’s hidden away is bad.
[10:20] But metadata generally isn’t complicated. In fact, this concept is an awful lot simpler than it sounds as a concept.
[10:27] Here we have a little bit of data: “sunny”. That doesn’t make an awful lot of sense on its own. If we add to it “yesterday’s weather”, now we’ve certainly got a bit of a picture building up. Sadly not today’s weather.
[10:46] “SL68AJ”, a string of letters and numbers, but if I put “post code” it starts to make more sense.
[11:03] This is obviously a date. But when you actually put it in context with “date of birth”, then it provides more information and can be used more meaningfully.
[11:17] Information is data that is put into context. Data on its own is grand. But without context, you can’t use it to inform and to make decisions.
[11:32] Metadata is what you use to put that data into context and turn that data into information. And information is better than data. Do you know how we know that information is better than data? Because if we look at this graph we see that information is three times better than data! On the scale of betterness, data rates a mere 20 where information gets 80 for information.
[11:58] Now, metadata isn’t new to the Web. It’s been around on the Web for an awful long time. There are all sorts of examples of how metadata exists on the Web. XML is a good example of something, but it’s also a good example of the way metadata works.
[12:20] You can see this small chunk of XML I’ve mocked up. The P.O. data in there is “orange house one zero”. But wrapping it in metadata, we can tell that we’re talking about a building. We can see that its color is orange. The type of building is a house. It has one door and no windows.
[12:44] XML is quite good in that way. It lets you define your own schema and describe the data that you have in the way that is going to have the most meaning to it.
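The slide with the XML is not reproduced in this transcript, so the following is only a sketch of what such a chunk might look like and how a program could read it. The element names (`building`, `color`, `type`, `doors`, `windows`) are assumptions made for illustration, not the markup shown in the talk.

```python
import xml.etree.ElementTree as ET

# Hypothetical reconstruction of the mocked-up XML; element names are assumed.
BUILDING_XML = """
<building>
  <color>orange</color>
  <type>house</type>
  <doors>1</doors>
  <windows>0</windows>
</building>
"""


def bare_data(xml_text):
    """Return only the text content, with the describing tags stripped away."""
    root = ET.fromstring(xml_text)
    return " ".join(root.itertext()).split()


def describe(xml_text):
    """Return a dict mapping each tag (the metadata) to its text (the data)."""
    root = ET.fromstring(xml_text)
    return {child.tag: child.text.strip() for child in root}


data_only = bare_data(BUILDING_XML)   # just the values, no context
building = describe(BUILDING_XML)     # the same values, each with a label

print(data_only)
print(building)
```

Without the tags, the values are just a list of words and numbers. With them, each value can be looked up by what it means, for example `building["doors"]`.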
[12:56] Now you might be thinking about that and thinking, “OK. Well, we’ve got tags and we’re using the most appropriate tags to describe things. This is just like semantics in HTML.” Well, semantics and metadata aren’t identical concepts. They’re different ideas. But with the way we use things in the Web, there’s actually a lot of overlap between the two.
[13:22] Let’s think about HTML for a little bit. HTML’s got a pretty basic set of elements that we can use. There are tags that allow us to communicate meaning, semantics. There are tags that let us put data into context, metadata. And often, once you’re putting data into context, these two are quite often the same thing.
[13:55] These are a couple of examples of tags that communicate meaning, so semantics, but are not necessarily great for metadata, like “paragraph”. It actually doesn’t tell you anything about the data, but it tells you its use in the document. The same with the “headings”, they really do not tell you anything about the meaning of it. They do tell you about the meaning, but not the... Whatever.
[14:26] These give you context. Here are a couple of examples: “title”. So this is more like metadata. It’s not just telling you what its meaning is in a document. It’s actually taking the data that that tag contains and giving it a context. So you can say, “This is the title for the page” or “This is the contact information for this page or this chunk of a page with the address on it.”
[14:53] HTML has got all sorts of ways to enable us to add metadata to the data in our document. The HTML class attribute, which people often mistakenly think is just a CSS thing, almost every tag they use inside the body of the web page can accept the class attribute. And it gives the element a sort of a classification, if you like. That’s what it’s there for.
[15:21] And what we can do is we can use this to say, in this example here, we’ve got my name, “Drew” in a document. I can wrap that in something like a span and put “class=name”. We’ve added some metadata. We’ve said, “This isn’t the word ’drew’ as in drawing. It’s somebody’s name.” This is quite a useful technique.
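As a minimal sketch of why that class attribute is useful to software, the Python below picks out the text of any element carrying a given class. The sample sentence and the `ClassTextFinder` helper are made up for illustration. It uses only the standard library's `html.parser`, and it does not try to handle void elements such as `<br>` inside a match.

```python
from html.parser import HTMLParser


class ClassTextFinder(HTMLParser):
    """Collect the text inside elements that carry a given class name."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.depth = 0    # greater than 0 while inside a matching element
        self.found = []   # one string per matching element

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or self.wanted in classes:
            self.depth += 1
            if self.depth == 1:
                self.found.append("")

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.found[-1] += data


# Illustrative sentence: the same four letters appear twice, once as a name.
SAMPLE = ('<p>This page was written by <span class="name">Drew</span>, '
          'who drew the diagrams.</p>')

finder = ClassTextFinder("name")
finder.feed(SAMPLE)
names = finder.found
print(names)
```

The program never has to guess which "drew" is a person. It only reports the one the author marked up as a name.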
[15:42] It really makes HTML very flexible indeed, in terms of metadata. Flexibility is good because it helps us [inaudible 15:50] proof things.
[15:55] Now when we think of metadata and HTML, there’s a really, really, really obvious example. Hands up if you have thought of a really obvious example of metadata in HTML. Nobody? It’s HTML metatags.
[16:17] These have been around since the very beginning, well since HTML 2. And they’re quite interesting. People often specify things like keywords, descriptions, author, copyright date, use Dublin core properties in the head of a document to say something about the page.
[16:37] The HTML spec actually doesn’t list any legal values for using these, so it’s actually pretty open and you can use it for anything that you like. And people do.
[16:47] Here’s an example of how those are implemented. “Meta name=content vacation in Greece sunshine. Meta name=description content my whole day in Greece”, that’s the country, not just covered in lard. “Meta name author, content Drew McLellan. Copyright, blah. Date.” Here’s a lovely date. And I have no idea what that is.
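The slide's markup is not included in the transcript. As a rough illustration of the kind of meta tags being read out, and of how a program might collect them, here is a sketch using Python's standard `html.parser`. The page content and the exact attribute values are assumptions.

```python
from html.parser import HTMLParser


class MetaReader(HTMLParser):
    """Collect <meta name="..." content="..."> pairs from a document."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if "name" in attributes:
            # Lower-case the name so "Author" and "author" are treated alike.
            self.meta[attributes["name"].lower()] = attributes.get("content") or ""


# Illustrative page; the values are assumptions, not the slide's markup.
PAGE = """
<html><head>
<title>Greece</title>
<meta name="keywords" content="vacation, Greece, sunshine">
<meta name="description" content="My holiday in Greece">
<meta name="author" content="Drew McLellan">
</head><body><p>Visible copy goes here.</p></body></html>
"""

reader = MetaReader()
reader.feed(PAGE)
meta = reader.meta
print(meta)
```

Note that nothing in the visible body of the page has to agree with these values, which is exactly what makes them easy to neglect.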
[17:17] The use of metatags really, in HTML, hasn’t been plain sailing. It’s got a pretty checkered history, if we’re honest.
[17:30] Many web designers, and developers, don’t know how to use them properly. And this leads to inconsistent use. Everybody uses them slightly differently. With keywords, do you separate them with commas? Do you separate them with spaces? People aren’t ever quite sure.
[17:49] Do you need to specify the singular and the plural properties? Does case have any bearing? Is it case-sensitive or case-insensitive? So nobody really knows how to use them. And they’re also dark data. They’re data that’s hidden away in the head of a document. Nobody sees it.
[18:09] If the boss of the company looks at his corporate website and goes through it, he’s going to spot if there’s something amiss in the copy of the home page. But he’s not going to spot if the metatags in the head of the document have just not been maintained in so long that they’re completely inaccurate. They’re dark data, and therefore they’re bad.
[18:33] Many also misunderstand the purpose of them. And yes, we’re talking about web marketeers and SEO experts, so-called. Metatags aren’t for search engines, but they are used by search engines. Metatags are actually for describing the data in the page.
[18:57] In fact the HTML spec goes a little bit further, and it says their purpose is “to provide the means to discover dataset that exists and how it may be obtained or accessed, and to document the content quality and the feature of dataset, indicating its fitness for use.” But basically, just describe the data.
[19:17] So this leads us on to rule number two: The more you lie, the less you can be trusted, and the less valuable the information you’re providing becomes.
[19:29] This is something that Brian Cant taught us.
[19:36] This leads very quickly then into rule three: The fewer distinct consumers there are of metadata, the less valuable the metadata becomes over time.
[19:47] Let’s sort of have a look at these a little bit.
[19:52] The sort of process people go through is this: only search engines really use meta keywords and descriptions. Therefore authors start writing their keywords and descriptions targeted for search engines. As the search engine market sort of leveled out and Google really took over, people were then writing for Google.
[20:15] And they start then writing with an approach of how do I get my site well-ranked. Not how do I describe the data that’s in this page. And so search engines can no longer trust the keywords or descriptions, because people are writing them to try to game the system. And so it just spoils it for everyone. Brian Cant never said anything about that.
[20:44] So it’s important to be truthful in what we write and not write specifically for something like a search engine or any particular consumer, but just to describe the data that you have.
[21:00] The third rule is that if you’re only writing for one particular consumer, the data has a tendency to become less useful.
[21:12] So, what have we learned so far? We’ve learned that sharing is good. And the Web is made for sharing. Metadata isn’t new in real life or on the Web. HTML gives us ways to express metadata. But all this only really works if we tell the truth.
[21:38] Which brings us on to part two: We need them robots on our side. Robots in this sense, I’m talking about software out on the Web that might be something like a search engine crawler that’s reading pages. It might be an extension or toolbar on somebody’s browser. It could be a bit of desktop software. But anything that uses metadata that’s in the Web and embedded. These are all, for purposes of discussion, robots.
[22:14] So robots are really either with us or against us. And to be quite frank, we don’t want them against us, so we’d better cooperate. But they can actually save us time and save us effort, and mean that we have to do less work. So it’s in their interest and our interests to try to pander to them as much as possible, and this makes them happy.
[22:41] Now Tofu robot says that data is everywhere. And he’s right. And robots try their hardest to consume the data, but we have to kind of feed it to them in a reasonable way. Now as human beings we have lots of idioms for how we describe data.
[23:04] There are all sorts of things, for example opening times; you spot a list of opening times on a website for a shop. It’s pretty obvious what that is. It makes sense. It’s an idiom. We understand it even though if it’s taken completely out of context, a bit of software might not be able to understand it.
[23:22] Things like event details, addresses, you know, as human beings we can spot an address just because of the way it’s laid out. And the sort of mixture of words and numbers and things, it just makes sense to us as an address. We understand the pattern of short things with line breaks, not big flowing long sentences, and what have you.
[23:42] So from our point of view, idioms are good, and they don’t always have to be formal. Quite often they’re pretty informal. Things don’t have to be formal to be understood, in terms of data. Not everything has to be XML to be readable by a machine.
[24:01] So, you know, informal works for us, and informal is good. Informal actually works for robots too, because what’s important is consistency. And let’s have a look why.
[24:13] Humans are pretty quick to adapt. This chap has adapted to the rain with his plastic bag. We can easily reevaluate and adjust. If something is in a format we weren’t expecting, we can work out what it is. We can climb stairs without a trip to the workshop. We’re pretty adaptable. And that’s a good thing.
[24:36] But robots prefer patterns. They like, in fact, they rely on known patterns. Patterns can be formal or informal, it doesn’t matter. But they have to be consistent and they have to be repeatable patterns.
[24:52] As it turns out humans like patterns too. We like routine. We like repeating patterns. Robots like patterns because they’re repeatable. We like patterns because we don’t have to think. And thinking is hard and uncomfortable and inconvenient.
[25:08] If something is the same every time we do it, then we don’t have to engage our brains. We just repeat it. It’s like cleaning your teeth. You don’t have to think about it. You just do it. And because you’re cleaning your teeth the same every time, it’s not as much of an effort as it might be if it was a new task every time you approached it.
[25:27] So now we want to avoid thinking at all costs. If we look at thinking, it’s actually 21 percent hard, four percent uncomfortable, 29 percent inconvenient, and quite importantly, 45 percent prone to error. So thinking is really bad, and we want to avoid that pretty much at all costs.
[25:49] So as it turns out, what’s good for them robots is good for us too. They like repeatable patterns. We like repeatable patterns. So we can get along. So if we’re going to publish our metadata on the Web, we’ve got a bit of a challenge.
[26:09] So criteria. Metadata is good so we want to use it. It helps put our data into context, and it helps us use it as information. But we need to embrace reusable patterns. We need to avoid dark data. We want to avoid hiding things, because that leads to stuff going out of date. And it leads to lying.
[26:31] We want to avoid specific data for any consumer. So we don’t want to provide metadata only Google is reading, for example, because then we start targeting it towards Google. And that results in lying. We want to make it easy to be truthful.
[26:50] We want to embrace existing idioms. There’s no point in reinventing the wheel. We’ve got loads of ways of expressing data. Let’s use them. And we also want to reuse existing technology, because writing new technology is busy work and a waste of time and has to be avoided.
[27:09] So do you remember this from a while before our crash? Use of the class attribute in HTML to add metadata to a simple bit of data on the page.
[27:24] As I said this is really flexible, and it’s really powerful. Of all criteria, it helps us to avoid dark data, because the data that we’re adding the metadata to is just in the page. It’s not hidden. It’s visible. We’re taking the data that’s already there and adding the meaning to it.
[27:43] It helps us avoid specific data for any consumer, because right at the outset, we’re already providing this data for our normal web users and for any robots. So we’ve got two users from the outset. So it’s not specific. We’ve guarded ourselves against that.
[28:02] Make it easy to be truthful, because if you’re going to lie, you’re lying in front of all your users, because it’s right there on the page. So, if you’re saying this page is about crash diets, and it’s not, it’s about horses, then your users are going to be able to see.
[28:22] You’re embracing existing idioms, because you’re already publishing this data, and all you’re doing is taking the data that’s on your page and adding the metadata to it. So there are no new idioms. You’re just taking the existing idioms and just adding the metadata to it. And you’re reusing existing technology, because HTML is pretty existing. I think you’ll agree.
[28:43] So, what about the last one: The need to embrace reusable patterns? Well, this is where you find out you’ve been duped all along. What I’m really talking about is microformats.
[28:57] Hands up if you’ve heard of microformats. Fantastic.
[29:02] Microformats are just a bunch of patterns for doing exactly what I’ve just described. So, if you’ve got data in your page and you want to describe it by adding some HTML class attributes, microformats are just patterns that say, “Here’s the attributes to add.” And this is great because it means you don’t have to think. You just take them and use them, and somebody else has done all the thinking for you.
[29:32] There are sets of classes for like names and addresses. The format is called hCard. It’s based on a format called vCard, which is what your address book stuff uses, pretty much all address books, Outlook and everything uses. And it just gives us simple class names: “given-name”, “family-name”, “email”, “URL”, “tel”, “title”, “org” for organization, “street-address”, “locality”. Somebody’s just assembled these class names for you, and you can use them.
[30:03] I won’t go too much into the implementation, because I guess you’re probably all reasonably familiar with the implementation. But here we’ve got this simple paragraph that says, “An announcement: Fire caused by Apple chief executive Steve Jobs earlier this year,” blah blah blah blah blah.
[30:19] And all we’ve done is taken these class names. So we’ve got this one, this vCard which says this chunk of the page is a set of address information, contact details, or what have you. We’ve marked an organization, Apple, a role of chief executive, and a formatted name of Steve Jobs. So, just by adding that little bit of metadata, we’ve created a reusable contact card.
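A minimal sketch of how a "robot" might lift that contact card out of a paragraph is shown below. The sample sentence is a stand-in for the slide, which is not in the transcript. The reader handles only the three properties mentioned here (`org`, `role`, `fn`) inside a `vcard` root, so it is far from a complete hCard parser.

```python
from html.parser import HTMLParser

# Only the properties mentioned in the talk; real hCard defines many more.
HCARD_PROPS = {"fn", "org", "role"}


class HCardReader(HTMLParser):
    """Extract a few hCard properties from elements inside class="vcard"."""

    def __init__(self):
        super().__init__()
        self.cards = []   # one dict per vcard found
        self.stack = []   # open elements inside a card: property name or None

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if not self.stack:
            if "vcard" in classes:
                self.cards.append({})
                self.stack.append(None)   # the card root carries no property
            return
        prop = next((c for c in classes if c in HCARD_PROPS), None)
        if prop:
            self.cards[-1].setdefault(prop, "")
        self.stack.append(prop)

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        # Credit text to the nearest enclosing property element, if any.
        for prop in reversed(self.stack):
            if prop:
                self.cards[-1][prop] += data
                break


# Illustrative paragraph; the wording is an assumption.
SAMPLE = ('<p class="vcard">An announcement by <span class="org">Apple</span> '
          '<span class="role">chief executive</span> '
          '<span class="fn">Steve Jobs</span> earlier this year.</p>')

reader = HCardReader()
reader.feed(SAMPLE)
cards = reader.cards
print(cards)
```

The visible sentence is unchanged for human readers. The class names are the only addition, and they are enough for the program to rebuild a structured contact record.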
[30:42] There’s stuff in there for events and dates. hCalendar is based on the iCal format. It’s an existing format that’s used by calendaring software. It has “date-time start”, “date-time end”, “summary”, “location”, “URL”, “description”. So all the things you need for describing pretty much any sort of event.
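In hCalendar markup the start and end times are written as the class names `dtstart` and `dtend`, and a common pattern is to put the machine-readable date in the `title` of an `abbr` element while the visible text stays friendly. The sketch below reads an event marked up that way. The event itself (name, venue, date) is invented for illustration, and the reader covers only a handful of properties.

```python
from html.parser import HTMLParser

EVENT_PROPS = {"summary", "location", "dtstart", "dtend", "description", "url"}


class HCalendarReader(HTMLParser):
    """Extract a few hCalendar properties from elements inside class="vevent"."""

    def __init__(self):
        super().__init__()
        self.events = []  # one dict per vevent found
        self.stack = []   # open elements inside an event: property name or None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if not self.stack:
            if "vevent" in classes:
                self.events.append({})
                self.stack.append(None)
            return
        prop = next((c for c in classes if c in EVENT_PROPS), None)
        if prop and tag == "abbr" and attributes.get("title"):
            # Machine-readable value lives in the title attribute.
            self.events[-1][prop] = attributes["title"]
            prop = None   # do not also collect the human-readable text
        elif prop:
            self.events[-1].setdefault(prop, "")
        self.stack.append(prop)

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        for prop in reversed(self.stack):
            if prop:
                self.events[-1][prop] += data
                break


# Hypothetical event; every detail here is made up for the example.
SAMPLE = ('<div class="vevent"><span class="summary">Example Geek Meetup</span>: '
          '<abbr class="dtstart" title="2008-06-01T19:00">1 June, 7pm</abbr> to '
          '<abbr class="dtend" title="2008-06-01T22:00">10pm</abbr> at '
          '<span class="location">The Example Arms</span></div>')

reader = HCalendarReader()
reader.feed(SAMPLE)
events = reader.events
print(events)
```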
[31:01] There are things for reviews. Again, the item, who you are reviewing it, your description of it, the rating that you’re giving something.
[31:11] And relationships. This is actually used in the “rel” attribute on a link to describe the relationship between the page you’re on and the page you’re linking to, where those pages represent people. So you can say someone is a coworker or an acquaintance or a friend, or the person you’re linking to is you.
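As a rough sketch of how software could read those relationships, the Python below collects links whose `rel` attribute contains XFN values and groups them by relationship. The URLs are placeholders, and `XFN_VALUES` is only a subset of the values XFN defines.

```python
from html.parser import HTMLParser

# A subset of XFN relationship values, enough for this example.
XFN_VALUES = {"friend", "acquaintance", "contact", "met",
              "co-worker", "colleague", "me"}


class XFNReader(HTMLParser):
    """Collect (href, [relationships]) pairs from links that use XFN values."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        rels = [r for r in (attributes.get("rel") or "").split() if r in XFN_VALUES]
        if rels and attributes.get("href"):
            self.links.append((attributes["href"], rels))


def group_by_relationship(links):
    """Turn [(href, [rel, ...]), ...] into {rel: [href, ...]}."""
    grouped = {}
    for href, rels in links:
        for rel in rels:
            grouped.setdefault(rel, []).append(href)
    return grouped


# Placeholder blogroll; the people and URLs are invented.
PAGE = """
<ul>
  <li><a href="https://example.org/alice" rel="friend met">Alice</a></li>
  <li><a href="https://example.org/bob" rel="co-worker">Bob</a></li>
  <li><a href="https://example.org/me-elsewhere" rel="me">My other site</a></li>
  <li><a href="https://example.org/news">Just a link</a></li>
</ul>
"""

reader = XFNReader()
reader.feed(PAGE)
by_rel = group_by_relationship(reader.links)
print(by_rel)
```

A link with no `rel` value is simply ignored, so ordinary links and relationship links can sit side by side in the same list.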
[31:30] And there’s lots, lots more. Licenses, tags, date-based feeds (it’s modeled on Atom, but for these sort of RSS-feed type things), directories, products, payments, geo-location. Lots, lots more in the works.
[31:46] So, how does this deal with our criteria for publishing metadata? We’re already avoiding dark data. We’re not hiding anything in the page. It’s already just on the page.
[31:56] We’re avoiding data for specific consumers by publishing it at the same time for robots and for our web users, the same data.
[32:03] We’re making it easy to be truthful because the data’s out in the open, and if you’re lying, your visitors are going to see it and your boss is going to see it.
[32:13] You can embrace existing idioms the way you’re already publishing. You can reuse existing technology, in the way that microformats, also, as well as using HTML for the implementation is reusing, wherever it’s possible to do so, technologies like iCal and vCard that are out there. Why reinvent the wheel?
[32:33] So the need to embrace reusable patterns. Well, microformats are these patterns, so that’s the last criteria.
[32:42] Now, Brian Cant never knew this, but I bet he’d be thrilled.
[32:46] He’s an old man now, but if anyone wants to volunteer to go and explain it to him, that would be great.
[32:52] So, microformats are good. They’re a humane way of using metadata today with existing technology on the Web. They’re easy for us to implement. And they’re readable by our robotic friends.
[33:10] Just a few quick examples. This is an event page for “Oxford Geek Night”, a month or so ago. They’re already publishing, down at the bottom, the address of the event: the Jericho Tavern there. This is actually marked up as an hCard. And when you click this “add to address book” link, now it’s filing through a script, so handing it over to a little robot helper, and that little robot helper is passing back, from the data that’s in this page, that address information in a vCard format that you can save to your address book, pull it into your phone, do whatever.
[33:47] “Radio Times”, they publish lots of information program listings on radio and television. All this here is marked up as hCalendar. So, what you can do is you can actually export this information to your Google Calendar or to your phone or what have you.
[34:06] And you can also subscribe to it. So you can say, “Refresh this data from my calendar application once an hour,” or whatever, so you get all the latest updates on a rolling basis.
[34:20] Anyone seen the Google Social Graph API demonstration application? This is using the XFN microformat that describes relationships between people. You can put in a URL that represents a person, say, “Here, I’ve got adactio.com, Jeremy Keith’s website.” You click “find connections” and it lists all the people, first, who link to you as a contact, who link to you as a friend, who link to you as a coworker. And it’s crawling all that data and bringing back what they call a social graph, ugh, of your connections between people.
[34:57] Now, how powerful is this, if you sign up to a new social-networking site, to just say, “Here’s my URL. Find my friends”? That’s pretty useful.
[35:07] Yahoo have got a new search product called SearchMonkey. I think it’s a labs-type product at the moment. But this is a search result for a search for a model of Canon camera, EOS 400D, and a search for reviews of it. So it’s found, hopefully, not just pure shopping sites.
[35:31] You know how, when you search for reviews of a product, you get 100 different shopping sites that all say, “Be the first to add a review!” Because they’ve got the word “review”. It’s turn on me.
[35:40] Well, this is actually searching things that have reviews marked up with hReview. So, not only is it a review, it’s from a site where somebody’s actually taken some care to mark the data up well. So, the chances of it being a high-quality result is much higher.
[35:57] And of course, as we’re publishing more and more of this data, there are more and more uses. It’s still pretty early days, but we’re getting to the point where things like this are actually a reality.
[36:09] So, if you’re a robot master, there are places you can go. You can go to microformats.org/wiki/parsers. That lists all the bits of software that you can download and run as part of your web application, or what have you, to help you parse microformats. I’ve also got some things up on tools.microformatic.com, little sort of online demonstrations and what have you.
[36:36] If you’re a human, you can go to microformats.org. Admittedly, you have to be a fairly hardy human. It’s not the most digestible of websites, but if you do stick with it, there’s some really good stuff there.
[36:46] If you’re not quite made of that type of material, O’Reilly have a PDF book, I think it’s about $10, by a chap called Brian Suda, which is a really good introduction. And there’s also this book by John Allsopp from Friends of ED that is a really good introduction to microformats. And it has a fantastic reference section in it, which means it’s a great one to keep in the corner of your desk when you’re thinking, “What’s the class name for that?” And you can just flip to it and find it.
[37:18] So that’s what Brian Cant never taught you about metadata. And that’s all I’ve got to say.