I had the pleasure of attending the first day of the O’Reilly Tools of Change (TOC) conference this year. It’s an annual conference that mainly covers digital publishing, and it’s nearly as good as Books in Browsers.
There were a lot of great talks and some interesting startups and companies on exhibit, but I’m going to focus this post on data. It’s a little technical, but it’s a topic that fascinates me, because understanding and applying data can lead to some really cool results. (For those of you who may be interested in what else happened at TOC, O’Reilly has posted some videos of the conference on YouTube and most of the speaker slides.)
First, some interesting tidbits I learned while circling the exhibits. According to Nielsen, rich metadata means more sales (more on metadata later). Also, Shindig is a new company that publishers (and authors) can use to host series events and virtual book tours. And lastly, a representative at Qbend, which creates digital books, told me that all of the schools in the state of Florida (not including colleges and universities) use e-textbooks. Either students own their own tablet, or they rent one from their school. Amazing! (For more on digital textbooks, read my post, “The Coming of Age of E-Textbooks?”)
Also, mobile is getting more important.
And now, on to some highlights from the talks about data!
Mashups, Models, and Monetization: Making your indexes go all the way
Pilar Wyman from Wyman Indexing gave a really interesting talk on the importance of indexes in ebooks. Turns out, indexes can do a lot more than be a helpful guide in the back of a non-fiction book.
Wyman pointed out that there are hardware and software issues with current ebook indexing. Many indexes are inconsistent, and some things that could be great indexes are currently too inconsistent to be of much use (for example, Twitter hashtags). She thinks that linked indexes should be active features. It’s also important to have vocabulary control, especially across multiple platforms and publications–also called a thesaurus or taxonomy.
When you put all the indexes together, you can see better as a whole and control the quality, she said. Also, indexes are sets of metadata tagged to help curate aspects of your book, and one of the basic tenets of indexing is to consolidate the information for the reader.
Indexing is important because it affects how you search the book. According to Wyman, there is precision versus recall. Ideally, indexes should help with both, by showing results only specific to the topic searched for.
“I only want what I’m looking for, not related topics within the book,” Wyman said.
Ultimately, it’s best to marry indexes with full text search and get both precision and total recall. There are different stages of search, so it would be good to serve readers at all stages, Wyman said. For example, a reader may want to answer a question, find a specific concept they know is in there, narrow a broad term, or find out if a concept is in a book. People change their focus, and how and what they search for, as they read. To make sales, you want to have a conversation with buyers, she said. That’s what an index allows you to do after the ebook has left your hands.
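To make the precision versus recall trade-off concrete, here is a minimal sketch in Python. The chunk IDs and result sets are made up for illustration and do not come from Wyman’s talk; the point is only to show how the two numbers are computed.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved chunk IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical data: chunks of a book that genuinely discuss "metadata".
relevant = {"c2", "c5", "c9"}
# Full text search returns every chunk that merely contains the word.
full_text_hits = {"c1", "c2", "c3", "c5", "c7", "c9"}
# The index returns only the chunks an indexer chose to tag.
index_hits = {"c2", "c5"}

full_text = precision_recall(full_text_hits, relevant)
index_only = precision_recall(index_hits, relevant)
print("full text:", full_text)
print("index only:", index_only)
```

In this toy case the full text search finds everything but half of its results are noise, while the index returns only good results but misses one chunk. That is one way to read the advice to combine the two.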
Publishers also need to know and appreciate learning styles as well as subject matter, Wyman said. Reading is only a small part of what students are doing. Publishers can also do text tagging to sell chunks of content (rethinking and repurposing) with the index metadata. The index can also be used for marketing and advertising, and it’s useful for readers.
Screen-based indexes are as important as search, Wyman said. They are a crucial piece of accessibility into non-fiction content. The index is also a navigational tool.
Reading electronically is fine for fiction, she said, but for non-fiction, readers need better navigation and better recognition to help them place themselves in the text. Readers should be able to keep track of where they are in the book and get back to the information as they read it.
Metadata needs to be as specific as possible too, she said. And Amazon’s “look inside the book” feature should be included in all downloads for ebooks to show readers what’s coming. It’s a great sales tool, because it lets readers know what they’re getting.
Traditional indexes in onscreen environments can work really well, and do for software help, Wyman said.
“We’ve been searching and tagging digital information for over 40 years,” she said. “The digital environment is just another interface, as is paper.”
Publishers need to take advantage of digital usability, enabling readers to type in a term and let the software finish by showing what’s available in the book related to that term. People tend to find better information using an index rather than through find or search, so ideally search or find should be combined with the index into one search feature, she said.
This may mean including the index in the preliminary part of the book for search tools (the front matter), but it’s giving context for what’s available, and Wyman said that redundancy is the antidote to confusion.
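The “type in a term and let the software finish” idea can be sketched as a prefix lookup over the index headings. This is my own minimal illustration, not something shown in the talk; the headings and the function name `complete` are hypothetical.

```python
from bisect import bisect_left


def complete(headings, prefix, limit=5):
    """Return up to `limit` index headings that start with `prefix`."""
    ordered = sorted(headings, key=str.lower)
    keys = [h.lower() for h in ordered]
    wanted = prefix.lower()
    # Binary search for the first heading that could match the prefix.
    start = bisect_left(keys, wanted)
    matches = []
    for heading, key in zip(ordered[start:], keys[start:]):
        if not key.startswith(wanted):
            break
        matches.append(heading)
        if len(matches) == limit:
            break
    return matches


# Hypothetical index headings.
headings = ["metadata", "Metrics, alternative", "indexes",
            "EPUB 3", "taxonomy", "thesaurus"]
print(complete(headings, "met"))
```

Because the suggestions come from the index rather than from the raw text, the reader only ever sees terms the indexer decided were worth finding.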
“We need repetition and redundancy to learn,” she said.
Indexes can cross reference via breadcrumbs, and searches can contain contents listings and table of contents. The index can be a safety net, where readers can search for any word. There should also be multiple access points so the reader has a good experience with the text and doesn’t leave it.
Basically, indexes are just collections of metadata, Wyman said. They can map to book chapters, link categories, and be used to build the structure of an online portal. The subject categories at the chapter level can help readers build and customize their own text for purchase. Upon purchase, the readers could get the whole index. This could all also be subscription based, and the index could be how readers come in and decide what books to buy, Wyman said.
So, she said, the question becomes why do the metadata work separately if you already have the metadata in the index? In a precision index search, even subentries are used as metadata, she said. She recommended that anyone who manages intranet content use index metadata.
As companies grow, it becomes difficult to find the exact metadata people are searching for, and search engines are not enough. Readers need structural and editorial tags applied to the content. The result becomes an index masher, and can grow into a fun, useful tool for finding information. Locators can be coded to tell you which book the information comes from, which gives you discoverability of books as well as monetization, she said. Also, knowing how often a topic is indexed lets you know visually how helpful a book is.
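Here is a rough sketch of what an “index masher” could look like in code, assuming each book’s index is a simple mapping from heading to chunk IDs. The data layout, book IDs, and function names are my own assumptions for illustration.

```python
from collections import Counter, defaultdict


def mash_indexes(indexes):
    """Merge per-book indexes; every locator records its source book."""
    merged = defaultdict(list)
    for book_id, index in indexes.items():
        for heading, chunk_ids in index.items():
            for chunk_id in chunk_ids:
                merged[heading].append((book_id, chunk_id))
    return dict(merged)


def coverage(merged, heading):
    """Count how many locators each book contributes for a heading."""
    return Counter(book_id for book_id, _ in merged.get(heading, []))


# Hypothetical indexes for two books.
indexes = {
    "book-a": {"metadata": ["a-03", "a-07"], "ONIX": ["a-07"]},
    "book-b": {"metadata": ["b-01"]},
}
merged = mash_indexes(indexes)
print(merged["metadata"])
print(coverage(merged, "metadata"))
```

The `coverage` count is a crude stand-in for the idea that how often a topic is indexed hints at how helpful a book is on that topic.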
For ebooks, multiple pages or references can be color-coded, so there’s no need to lose any visual cues (such as page numbers or italics). But, it’s important to have a unique ID for each chunk of information.
One tool for indexing is Syndex, though new tools are constantly being introduced. Excel, Word, and InDesign are other examples of tools, though Wyman said each has its issues. The default is to write the index first and then go back and merge it with the content.
Tagging can be rote work, Wyman said, but you need a human brain to understand and recognize where the content is so that tagging can be precise to the right chunk of information. Several issues can prevent a successful mashup of indexes: inconsistent capitalization (lowercase versus uppercase), missing cross references (sometimes you want links to other sites or books for additional sales), inconsistent terminology, and IDs that are not always simple, unique, or stable.
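Two of these mashup problems, case differences and inconsistent terminology, are the kind of thing a small script can at least flag, along with duplicate IDs. The sketch below is only an illustration under my own assumptions: the controlled vocabulary and the ID format are invented.

```python
def normalize_heading(heading, vocabulary):
    """Fold case and whitespace, then map variants to a preferred term."""
    key = " ".join(heading.split()).casefold()
    return vocabulary.get(key, key)


def find_duplicate_ids(chunk_ids):
    """Return IDs that appear more than once, so they can be made unique."""
    seen, duplicates = set(), set()
    for chunk_id in chunk_ids:
        if chunk_id in seen:
            duplicates.add(chunk_id)
        seen.add(chunk_id)
    return sorted(duplicates)


# Hypothetical controlled vocabulary: variant term -> preferred term.
vocabulary = {"e-books": "ebooks", "electronic books": "ebooks"}

print(normalize_heading("E-Books", vocabulary))
print(normalize_heading("  Electronic   books ", vocabulary))
print(find_duplicate_ids(["a-01", "a-02", "a-01"]))
```

A script like this does not replace the human judgment Wyman describes; it only catches the mechanical inconsistencies before indexes are merged.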
Wyman recommends including indexes not only as chapters in ebooks, but also with some link in the beginning of the ebook to help ensure they get searched first, and that indexes are married to full text search. The “look inside” feature should be included as downloadable content, and publishers should consider indexing fiction, at least for certain titles.
Indexes should be free, because they can help with sales, Wyman said. EPUB 3 allows for chapter-like index recommendations. The index should be accessible from every page of the content, as a header, footer or icon, and not force the reader to go back to the front of the book to find it.
When showing the results, Wyman said publishers should let readers see snippets of text to get an idea of what is in the book and know if it’s the right section. It would also be helpful to see what’s above, below, and closely related; Wyman said she wants to see context around headings in the index. Cross references can also help readers refine their phrasing, and there should be reciprocal linking to let the reader go back to wherever they were before the search.
EPUB 3 also has a pop up option and reverse index (what links to a specific term, what terms in the index link to a map, heading, etc.).
Wyman said she had a wish list for future indexes, and it included being able to open any index for browsing to see all metadata, locator, and generic cross references, and to filter based on decorations (visual cues such as italics, etc.). Also, explanatory notes should always be available for viewing, not just at certain places of the index, the index should always be universally accessible and usable, and while reading, it’d be nice to be able to get a pop up view of the index showing a phrase or term. Lastly, it’d be great to have search in a display and pop up view.
More on Metadata
To see an example of how powerful metadata is, and how it can be applied, see Tim Carmody’s article in The Verge on how Google tracks the flu outbreak. It’s about how Google uses searches on its site to show patterns. A large surge in searches can tell a story, such as where in the country the most flu outbreaks are. This can also be applied to elements in a book, such as plots and characters.
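As a toy illustration of how a surge in searches can tell a story, the sketch below flags regions whose latest weekly search count is well above their recent average. This is not Google’s method; the regions, counts, and threshold are all made up.

```python
from statistics import mean


def surging_regions(weekly_counts, factor=2.0):
    """Flag regions whose latest week is at least `factor` times baseline."""
    flagged = []
    for region, counts in weekly_counts.items():
        *history, latest = counts
        if not history:
            continue  # no baseline to compare against
        baseline = mean(history)
        if baseline > 0 and latest >= factor * baseline:
            flagged.append(region)
    return sorted(flagged)


# Hypothetical weekly counts of flu-related searches, oldest first.
weekly_counts = {
    "Northeast": [100, 110, 90, 260],
    "Southwest": [80, 85, 90, 95],
}
print(surging_regions(weekly_counts))
```

The same pattern could be pointed at searches for a character or plot element inside a book rather than at flu symptoms.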
Creating Powerful Metadata
Renee Register, from DataCurate, gave a talk on metadata. I’ve written a fair number of posts on metadata too, which can be found here.
Register said that there are two kinds of book metadata: metadata that goes to websites (for example, using ONIX to distribute information about a book to Amazon), and metadata that is embedded in an ebook (which reading devices, such as Kindle, use to display information, such as title, cover, etc.).
Below is a list of recommended metadata to include:
- Identifier
- Product form/format
- Title/subtitle
- Contributor(s)
- Language
- Extent (page count, file size, etc.)
- Publisher/Imprint/Brand Name
- Subject(s)
- Intended Audience
- Textual Description
- Publisher Status Code
- Publication Date
- Return Code
- Product Availability Code
- Price
- Digital Image of Product
- Territorial Rights
More involved metadata:
- Edition
- Country of Publication
- Series/Set Information
- Strict on sale date
- Age range (for children’s books and YA)
- Distributor/Vendor of record
- Related products
- DRM/Usage constraints
- Software/hardware requirements
Even more metadata (enhanced):
- Author/Contributor Biography
- Illustration Details
- Book Excerpt
- Prizes and Awards
- Reviews
- Original Publication Date
- Reading age (for children’s books)
- Grade range (for children’s books)
- Keywords
- Digital formats (EPUB, Mobi, etc.)
- Digital Product Description
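One practical use of lists like these is a completeness check before a record goes out to retailers. The sketch below covers only the first (recommended) list; the field names are my own shorthand, not ONIX tag names, and the sample record is invented.

```python
# Hypothetical field names mirroring the recommended list, not ONIX tags.
CORE_FIELDS = [
    "identifier", "product_form", "title", "contributors", "language",
    "extent", "publisher", "subjects", "audience", "description",
    "publisher_status", "publication_date", "return_code",
    "availability", "price", "cover_image", "territorial_rights",
]


def missing_core_fields(record):
    """Return the recommended fields that are absent or empty."""
    return [field for field in CORE_FIELDS if not record.get(field)]


# Invented sample record with only a few fields filled in.
record = {
    "identifier": "ISBN-PLACEHOLDER",
    "title": "An Example Title",
    "contributors": ["A. Author"],
    "language": "eng",
    "price": "",
}
print(missing_core_fields(record))
```

An empty string counts as missing here, which is a deliberate choice: a blank price field is no more useful to a retailer than no price field at all.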
Publisher Evolution: Embracing Change Through Partnerships
In these speaker slides, Laura Baldwin from O’Reilly Media and Phil Ollila from Ingram shared that “Digital is not just about ebooks. It’s about reach and content leverage.”
This is important, and it means digital books should be in multiple languages, have a global reach, and be accessible to blind people and other people who have issues with reading accessibility. It’s also good to engage users and readers, especially as authors.
What Are Altmetrics And What Can They Do For Me?
Todd Carpenter from NISO gave a talk on alternative metrics (altmetrics), which I think has the most interesting applications.
According to Carpenter, standards are familiar, even if you don’t notice. Also, machines gather data about user behavior.
The founders of Google originally focused on analyzing web behavior using bibliometrics. Their project assessed the quality of websites, and then became Google. Carpenter said that the PageRank algorithm is derived from citation analysis, though citations aren’t always what we think they are. Carpenter said that they are references but have no assessment value, and that it’s a numbers game.
For example, you can appear highly ranked on Google’s algorithm because a lot of people link to you, but that doesn’t mean your link is high quality. If you rely on citations, he said, you are waiting for years to know what’s important now. So, it’s better to rely on other metrics–altmetrics.
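To show why link counting is a numbers game, here is a simplified textbook version of PageRank using power iteration. It is not Google’s production algorithm, and the tiny link graph is invented; the point is that a page ranks highly simply because many pages link to it.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Simplified PageRank by power iteration over a dict of outlinks."""
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outlinks spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical graph: three pages all link to "popular".
links = {"a": ["popular"], "b": ["popular"], "c": ["popular"],
         "popular": ["a"]}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))
```

Nothing in the calculation looks at what “popular” actually says, which is Carpenter’s point about citations being references without assessment value.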
There are better ways to assess the quality of content and methodologies for distribution, such as usage based metrics. These include the number of times something is downloaded, and page views. There is also clickstream analysis, and user interface, such as how people approach content in the age of content.
Social media metrics (how to quantify and analyze social media and not simply count followers, etc.) as well as behavioral metrics, such as how people engage with digital content and physical content, count as altmetrics. An example of a behavioral metric is tracking people moving through supermarkets.
Carpenter also suggested multivariate analysis, which combines metrics to draw insight from the data.
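A full multivariate analysis is beyond a blog post, but one very simple way to combine metrics is to standardize each one and average the results. The sketch below does only that; the titles and numbers are invented, and equal weighting of the metrics is my own assumption.

```python
from statistics import mean, pstdev


def composite_scores(items, metrics):
    """Average the z-scores of several metrics for each item."""
    totals = {name: 0.0 for name in items}
    for metric in metrics:
        values = [items[name][metric] for name in items]
        mu, sigma = mean(values), pstdev(values)
        for name in items:
            # A metric with no spread contributes nothing to the score.
            z = (items[name][metric] - mu) / sigma if sigma else 0.0
            totals[name] += z
    return {name: total / len(metrics) for name, total in totals.items()}


# Hypothetical usage and social metrics for three titles.
items = {
    "title-a": {"downloads": 1200, "page_views": 9000, "shares": 40},
    "title-b": {"downloads": 300, "page_views": 5000, "shares": 10},
    "title-c": {"downloads": 600, "page_views": 4000, "shares": 25},
}
scores = composite_scores(items, ["downloads", "page_views", "shares"])
print(max(scores, key=scores.get))
```

Standardizing first keeps a metric with large raw numbers, such as page views, from drowning out one with small numbers, such as shares.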
An example of altmetrics in action in the publishing world is Amazon’s “if you like x, then y” feature, which shows data from user behavior.
Netflix is another example. The company created content (Netflix-only shows) based on user data.
Carpenter also made an interesting point that the people we recognize often take what’s good in one field and draw connections with another to branch out into something new. In a project that mapped the world of science, the data showed which journals particular scientists read and how they moved between them, and made connections based on user behavior. Researchers could then identify publication connections and overlap points, where innovation is probably happening. It’s real time trend spotting, based on usage data. (This project reminded me of the novel Mr. Penumbra’s 24-Hour Bookstore.)
This kind of thing can help improve products and customer service, and reveal people’s barriers to content. Carpenter said there can be editorial applications, such as showing which organizations are percolating and which communities should be engaged with. Publishers could use altmetrics to get involved and solicit editorial content.
Altmetrics could also be used for author services, telling authors how their book is doing. They can also have marketing applications.
Flickr, for example, has a feature called Interestingness. It uses how many people commented on and favorited photos, as well as tonal variation, to determine interestingness.
Tynt is another example. It’s the 9th largest data aggregation site on the web, behind the AOL advertising network, and second behind Facebook sharing app. Tynt allows people to track copy-and-paste activity and links back to things, and copy and paste is still the number one sharing methodology.
Ex Libris, in the library community, also uses altmetrics. They have bx, a search tool that provides users something akin to Amazon’s “x then y.” It shows that the people who looked at x article also looked at y article. The site tracks which articles are read in a session and presents that information to the library patron (it’s not a search methodology based on metadata, but based on what other people read).
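The “people who looked at x also looked at y” idea can be sketched with simple co-occurrence counts over reading sessions. This is a generic illustration, not how bx or Amazon actually work, and the sessions are invented.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_co_views(sessions):
    """Count how often each pair of articles appears in the same session."""
    co_views = defaultdict(Counter)
    for session in sessions:
        for first, second in combinations(sorted(set(session)), 2):
            co_views[first][second] += 1
            co_views[second][first] += 1
    return co_views


def also_viewed(co_views, article, top_n=3):
    """Return the articles most often read alongside `article`."""
    neighbours = co_views.get(article, Counter())
    return [other for other, _ in neighbours.most_common(top_n)]


# Hypothetical reading sessions, each a list of article IDs.
sessions = [["x", "y", "z"], ["x", "y"], ["y", "z"], ["x", "w"]]
co_views = build_co_views(sessions)
print(also_viewed(co_views, "x", top_n=1))
```

Notice that no metadata about the articles is used at all; the recommendation comes entirely from what other readers did.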
PLOS is another example. On every article, it exposes all the metrics it can find: how many people cite the paper, how often it’s linked, how many people point back, how often it’s posted on social bookmarks, and how often it has been downloaded and in what form. This helps give better analysis and better service to authors about how well content is being distributed (open access).
Sourcebooks is a publishing company that uses a lot of altmetrics and A/B testing. They analyze cover images, metadata, responses to covers, genre categorization, and how to position the book, and they constantly update based on real time data (for example, changing the back cover text).
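For readers curious about what an A/B test on two covers might look like numerically, here is a minimal sketch using a two-proportion z-test. I don’t know what statistics Sourcebooks actually uses; the test choice and the click and view counts are my own assumptions.

```python
from math import erf, sqrt


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) comparing two conversion rates."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 0.0, 1.0  # no variation, so no evidence of a difference
    z = (rate_b - rate_a) / std_err
    # Standard normal CDF written with the error function.
    cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    return z, 2 * (1 - cdf)


# Hypothetical results: cover A got 100 clicks from 2000 views,
# cover B got 150 clicks from 2000 views.
z, p_value = two_proportion_z_test(100, 2000, 150, 2000)
print(round(z, 2), round(p_value, 4))
```

A small p-value suggests the difference between the covers is unlikely to be chance alone, though with real time data a publisher would also need to be careful about checking results too early.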
Carpenter said there are issues we face with alternative metrics. What do we measure and how is a complicated question, he said. But the bigger issue in the scientific community is authorship and what roles people play as contributors. Also, how to disambiguate collective?
Basic definitions are needed so we are all talking about the same thing, Carpenter said, so we know how to rank best seller lists and provide useful industry stats. We need more open exchange of component data, he said, and it needs to be audited in a way so people aren’t gaming the system.
According to Carpenter, successful altmetrics require the following:
- Large amounts of data
- Good analysis
- Implementation strategy
- Creativity
- Standards
It doesn’t require a billion dollar company, but rather A/B testing for authors (for example), and semantic taxonomies.
Altmetrics is still in early days, Carpenter said. It’s a valuable system that focuses not just on the journal, but also on the researcher who contributed. To find out more, visit altmetrics.org.
*Update:
Not mentioned in the altmetrics talk, but still another great example of a company using altmetrics is ImpactStory, which shows metrics around individuals and the impact they are having.