Hands-on Metadata

Welcome to the hands-on metadata session of meta|morphosis! As we discussed in yesterday's session (Practical Metadata), you are already familiar with many kinds of metadata. Today's session focuses on one particular metadata format that may be less familiar: XML.

XML: eXtensible Markup Language

Example

  <book isbn="978-0898713619">
    <title>Numerical Linear Algebra</title>
    <author>Lloyd N. Trefethen</author>
    <author>David Bau, III</author>
    <publisher>SIAM</publisher>
    <favorite />
    <callnumber lcc="QA184" ddc="512.5" />
  </book>

Vocabulary

XML code is full of elements which can serve a number of purposes. Some are fields to which you can assign data. Others indicate the structure of these fields in relationship to each other. Elements from the above example include:

book
title
author
publisher
favorite
callnumber

Many elements have start and end tags. Start tags look like:

      <title>
      <author>
      <publisher>

End tags look like (note the /):

      </title>
      </author>
      </publisher>

A value may be assigned to an element by putting it between the start and end tags. For example, the code:

      <title>Numerical Linear Algebra</title>

means that the title is Numerical Linear Algebra.

Other elements may be empty or self-closing; that is, rather than having separate start and end tags, they just have one tag with a / at the end. Some self-closing elements from the above example are:

      <favorite />
      <callnumber lcc="QA184" ddc="512.5" />

Elements may also have attributes associated with them. These appear in the tag along with the element name. In the above example, the following lines have attributes:

      <book isbn="978-0898713619">
      <callnumber lcc="QA184" ddc="512.5" />

These lines mean that:

The book has the isbn 978-0898713619.
The book has two call numbers: an lcc that is QA184 and a ddc that is 512.5.

Note that data about an element may appear between the start and end tags or in an attribute. There is no real reason to make the code start with

      <book isbn="978-0898713619">

rather than

      <book>
        <isbn>978-0898713619</isbn>
        …

except that it was the choice I made when making up this particular XML language.
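The difference between the two kinds of data shows up when a program reads the document. The following is a minimal sketch, assuming Python and its standard-library xml.etree.ElementTree module (neither is part of this session's tools); it reads the book example and pulls out both attribute values and element values.

```python
import xml.etree.ElementTree as ET

# The book record from the example above.
BOOK_XML = """
<book isbn="978-0898713619">
  <title>Numerical Linear Algebra</title>
  <author>Lloyd N. Trefethen</author>
  <author>David Bau, III</author>
  <publisher>SIAM</publisher>
  <favorite />
  <callnumber lcc="QA184" ddc="512.5" />
</book>
"""

book = ET.fromstring(BOOK_XML)

# Data stored in an attribute is read with .get().
isbn = book.get("isbn")

# Data stored between start and end tags is read with findtext() or .text.
title = book.findtext("title")
authors = [author.text for author in book.findall("author")]

# An empty element carries no text; its presence is the information.
is_favorite = book.find("favorite") is not None

# All attributes of an element are available as a dictionary.
call_numbers = dict(book.find("callnumber").attrib)

print(isbn, title, authors, is_favorite, call_numbers)
```

Had the isbn been written as a child element instead, the only change on the reading side would be to use findtext("isbn") rather than get("isbn").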

Note that XML is not any one particular language made of elements and attributes, like HTML or the books language used in the example above. Rather, it is a set of rules that a language must follow to be called an XML application or XML language. It is called extensible because anybody can make up such a language, either from scratch or by adding rules to an existing language. This does not mean that you can just add whatever elements you like to an XML document and have it still be valid.

A document that follows the rules of XML is said to be well-formed. If in addition to that, a document is written in a specific XML language, and it follows the rules of that language, it is said to be valid.
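To make the distinction concrete, here is a rough sketch in Python (standard library only). The parser decides whether a document is well-formed. The "validity" check is a toy: the ALLOWED_CHILDREN set is an assumption invented for this illustration, since the books language above was never formally specified. Real XML languages state their rules in a DTD or schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules for the made-up "books" language: the child elements
# a <book> may contain. This set is an assumption for illustration only.
ALLOWED_CHILDREN = {"title", "author", "publisher", "favorite", "callnumber"}


def classify(xml_text):
    """Return 'not well-formed', 'well-formed but not valid', or 'valid'."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        # The document breaks the rules of XML itself.
        return "not well-formed"
    if root.tag != "book":
        return "well-formed but not valid"
    for child in root:
        if child.tag not in ALLOWED_CHILDREN:
            # Fine as XML, but not allowed by the (toy) language rules.
            return "well-formed but not valid"
    return "valid"


print(classify("<book><title>A</title></book>"))
print(classify("<book><flavor>mint</flavor></book>"))
print(classify("<book><title>A</book>"))
```

The second document is perfectly good XML, but it uses an element that the toy rules do not know about, so it is well-formed without being valid.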

Some Rules of XML

There are about a hundred rules for XML, but here are some of the most common ones that will affect your documents:

Elements must have both start and end tags or be "self-closing".

The following is well-formed:

  <notesStmt>
    <note>
      <p>This is a note in a TEI document.</p>
    </note>
  </notesStmt>

The following is not well-formed:

  <notesStmt>
    <note>
      <p>This is a note in a TEI document.
    </note>
  </notesStmt>

Elements must not overlap; they must be well-nested.

The following is well-formed:

  <poem>
    <title>A poem</title>
    <stanza>
      <line>This stanza's just one line long.</line>
    </stanza>
  </poem>

The following is not well-formed:

  <poem>
    <title>A poem</title>
    <stanza>
      <line>This stanza's just one line long.</stanza>
    </line>
  </poem>

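Values of attributes must be in quotes (double quotes are the usual choice, though single quotes are also allowed), and every attribute must have a value.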

The following is well-formed:

  <select name="animal">
    <option value="dog">Dog</option>
    <option value="cat" selected="selected">Cat</option>
  </select>

The following is not well-formed:

  <select name=animal>
    <option value=dog>Dog</option>
    <option value=cat selected>Cat</option>
  </select>

As mentioned above, there are many other rules for XML, but these are some of the most common ones that come up when composing documents.
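If you have Python available, its built-in XML parser can check these rules for you. This is only a sketch under that assumption; the sample strings are shortened versions of the examples above, and the parser reports well-formedness only, not validity.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


samples = {
    "missing end tag": "<note><p>This is a note.</note>",
    "overlapping elements": "<stanza><line>One line.</stanza></line>",
    "unquoted attribute": "<select name=animal></select>",
    "well-formed": '<select name="animal"><option value="cat">Cat</option></select>',
}

results = {label: is_well_formed(text) for label, text in samples.items()}
for label, ok in results.items():
    print(f"{label}: {'well-formed' if ok else 'not well-formed'}")
```

Only the last sample passes; each of the first three breaks exactly one of the rules listed in this section.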

XML tools

There are quite a few special purpose programs you can use to edit and manipulate XML documents. oXygen and XMLSpy are examples of popular editors. Many XML editors have special features such as allowing you to validate or transform your document. Others may just color the text of your document (syntax highlighting) in a way that helps you spot where it is not well-formed.

You can see from the above examples that XML is plain text; thus, you can edit it with any text editor! You do not need to pay for a program to write, edit or even validate your XML. For example, Xerces is a family of programs which can be downloaded and used to parse and validate any XML language that you have a specification for. There are also freely available resources that can be used for specific XML languages. For example:

W3C Markup Validation Service - a site which validates a number of languages including XHTML, MathML and SVG
The OAC BPG validator for EAD - Online Archive of California's downloadable program which not only validates Encoded Archival Description (EAD) finding aids, but also verifies that they follow Best Practices Guidelines.
Library of Congress MARCXML stylesheets - stylesheets for managing MARCXML documents, both for validating them and for converting them to other bibliographic metadata formats.

For our examples today, we will not be using an XML-specific editor, but rather a plain text editor, such as a text field in a web browser or the Windows program WordPad. We will also be using freely available tools to check our documents for validity.

Example: Movie reviews

One effective way of creating valid XML documents, especially documents in an XML language with which you are not very familiar, is to start with documents that are already valid and just substitute one's own information for the values already there. For an example of this, visit the MovieXML site, look at other people's movie reviews, and add your own!

Example: XHTML

HTML is the language in which web pages are written. Putting data in an HTML element may have a different kind of meaning depending on the element. For example:

      <title>my web page</title>

means that my web page is the title of the page, and should be displayed in the title bar at the top of the browser. The code:

      i <i>love</i> <b>movies</b>.

means that the word love should be displayed in italics and the word movies should be displayed in bold.

There are many web pages on the internet with imperfect HTML. Web browsers tend to be forgiving of broken or ambiguous code, and do their best to display it anyway. For example, in the following HTML code:

      <b>howdy</i>

should the word howdy be made bold or italicized? Or both? Different browsers may make different guesses about how the document should look, which means that different people viewing the page would see different things.
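The contrast between forgiving HTML handling and strict XML handling can be seen with two parsers from the Python standard library. This is a sketch for illustration: html.parser is not what a browser uses, but it is similarly tolerant. The EventRecorder class name is made up for this example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Record the tags and text that the lenient HTML parser reports."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("text", data))


snippet = "<b>howdy</i>"

# The HTML parser reports what it sees and never complains about the mismatch.
recorder = EventRecorder()
recorder.feed(snippet)
recorder.close()
print(recorder.events)

# The XML parser refuses the same text outright.
try:
    ET.fromstring(snippet)
    xml_accepts = True
except ET.ParseError:
    xml_accepts = False
print(xml_accepts)
```

The lenient parser leaves the decision about what the mismatched tags mean to whoever consumes its output, which is exactly where browsers can start to disagree.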

Why XHTML?

There are many reasons to standardize the HTML code on your web pages into an XML format.

To make your web pages appear more consistently across various browsers
To make your code more consistent and easier to maintain across your site
To make code more accessible and readable by devices other than standard computers running standard web browsers, such as PDAs, mobile phones, and screen readers
To proudly display a logo showing that your page validates

How XHTML?

Here are some things you may need to check when standardizing your web pages:

Many web pages do not have the header information they need. Unless you are doing fancy things with your web page (style sheets, scripts) you can replace the top of your web page (everything before the body tag) with the following incantation:
```
  <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
  <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
      <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
      <title>My web page</title>
    </head>
```
and then change the contents of the title element to what the title of the web page should be. This header information indicates not only the type of document you are making (XHTML) but also the type of characters you are using (utf-8).
All tags (except DOCTYPE) should be lowercase.
Text may not be free-floating in the document -- it must appear in a block element, such as a <p> (paragraph).
The document must be well-formed XML -- that is:
- tags must have matching end tags or be self-closing
- values of attributes must be in quotes
- tags must not overlap
There will also be random other things to change -- the validator will let you know what they are!
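A few items on this checklist can be checked mechanically. The following is a rough sketch in Python (standard library only); it is not a substitute for a real validator, and the function names are made up for this example. It checks well-formedness, lowercase tag names, and text sitting directly inside the body.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def checklist_problems(xhtml_text):
    """Return a list of problems found by three simple checks."""
    try:
        root = ET.fromstring(xhtml_text)
    except ET.ParseError as err:
        return [f"not well-formed: {err}"]

    problems = []
    for element in root.iter():
        name = local_name(element.tag)
        if name != name.lower():
            problems.append(f"tag is not lowercase: {name}")

    # Text directly inside body is either body.text or the tail of a child.
    body = root.find(f"{XHTML_NS}body")
    if body is not None:
        loose = [body.text] + [child.tail for child in body]
        if any(text and text.strip() for text in loose):
            problems.append("free-floating text directly inside body")
    return problems


GOOD = ('<html xmlns="http://www.w3.org/1999/xhtml"><head><title>t</title></head>'
        '<body><p>hello</p></body></html>')
BAD = ('<html xmlns="http://www.w3.org/1999/xhtml"><head><title>t</title></head>'
       '<body>hello <P>there</P></body></html>')

print(checklist_problems(GOOD))
print(checklist_problems(BAD))
```

Anything this sketch does not catch (and there is plenty) is a job for the validator.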

Exercise: Fix the web page on the flash drive!

Plug your flash drive into a USB port on your computer, and look at the web page in the directory web_page. Double-click the HTML file to view it in a web browser. Notice that (though it is not at all fancy!) it renders reasonably well.

Now edit the HTML document with WordPad; to do this, right-click the file and choose Open With... and choose WordPad. Can you find ways in which it is not well-formed or valid?

The W3C Markup Validation Service is a great way to check your document.

Example: NDNP

The National Digital Newspaper Program has a set of specifications for digitizing historic newspapers. Some of these are specifications for newspaper page images, such as TIFF, JP2 and PDF. Others are XML specifications for metadata about microfilm reels, newspaper issues and OCR (describing the position of words on each page).

The Library of Congress also provides a tool, the Digital Viewer and Validator (DVV), for validating digital newspaper collections against that specification, and viewing them logically as well. In this exercise, we will use the DVV to validate (and correct) a small batch of NDNP data (one issue of the Bourbon News), and then view the batch to make sure that the data represents what we want it to.

Browse to the directory on the flash drive called batch_ky_20080925_apple. Open the file inside of it called babybatch.xml. You can see that it is a very small batch file which contains one issue.

Start the validator from the LOC DVV link on your desktop. Load the batch into the validator by clicking File -> Open and then browsing to the E drive to find the batch, and select babybatch.xml.

To validate the batch, choose Batch -> Validate All. After about a minute, you should get the following validation error:

Validation error - Original related item must have a physicalDescription child;
/mets[1]/dmdSec[2]/mdWrap[1]/xmlData[1]/mods:mods[1]/mods:relatedItem[1] for
E:\batch_ky_20080924_apple\00100479473\1904081601\1904081601.xml (after modification by validator)

This error is fairly friendly in that it mentions the file that the error is in, and also gives instructions about how to locate the error in the file and how to fix it. The location of the error is given as an XPath:

/mets[1]/dmdSec[2]/mdWrap[1]/xmlData[1]/mods:mods[1]/mods:relatedItem[1]

This means the error is in the first mets element, the second dmdSec element inside that, the first mdWrap element inside that, etc. Browse down the directory structure to find the file mentioned in the error. Double-click the file to open it in a web browser and see if you can find the location mentioned.
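The same lookup can be done by a program. The following is a sketch in Python using the limited XPath support in xml.etree.ElementTree. The SKELETON document is made up: it only mimics the shape of the path in the error message, its ID values are hypothetical, and the real issue file probably puts the mets elements in a namespace of their own, which is left out here.

```python
import xml.etree.ElementTree as ET

# MODS namespace URI; the "mods" prefix is bound to it for the lookups below.
NS = {"mods": "http://www.loc.gov/mods/v3"}

# A made-up skeleton with the same shape as the path in the error message.
SKELETON = """
<mets xmlns:mods="http://www.loc.gov/mods/v3">
  <dmdSec ID="first"/>
  <dmdSec ID="second">
    <mdWrap>
      <xmlData>
        <mods:mods>
          <mods:relatedItem type="original">
            <mods:identifier type="reel number">00100479473</mods:identifier>
          </mods:relatedItem>
        </mods:mods>
      </xmlData>
    </mdWrap>
  </dmdSec>
</mets>
"""

root = ET.fromstring(SKELETON)  # root corresponds to /mets[1]

# The rest of the path, written relative to the root element.
# Positions in XPath count from 1, so dmdSec[2] is the second dmdSec.
path = "./dmdSec[2]/mdWrap[1]/xmlData[1]/mods:mods[1]/mods:relatedItem[1]"
item = root.find(path, NS)

print(item.get("type"))
print(item.find("mods:physicalDescription", NS) is None)
```

The second line printed is True for this skeleton, meaning the relatedItem has no physicalDescription child, which is the condition the validator complained about.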

The NDNP Newspaper issue template indicates that inside each page, there should be a mods:physicalDescription child at this location, and that it should look like this (since the page is from microfilm):

    <mods:physicalDescription>
        <mods:form type="microfilm" />
    </mods:physicalDescription>

You can fix the error by editing that XML file with WordPad. Just paste the above code snippet into the code for each page at the beginning of the relatedItem element:

    <mods:relatedItem type="original">
        <mods:physicalDescription>
            <mods:form type="microfilm" />
        </mods:physicalDescription>
        <mods:identifier type="reel number">00100479473</mods:identifier>
	…
    </mods:relatedItem>

With that code added, the issue should now be valid. Validate the issue again to make sure.
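Before going back to the validator, you could check your own edit with a few lines of code. This is a sketch in Python, not part of the DVV; the function name and the pages wrapper element in the sample are made up for illustration.

```python
import xml.etree.ElementTree as ET

MODS = "http://www.loc.gov/mods/v3"
NS = {"mods": MODS}


def missing_physical_description(root):
    """Return original relatedItem elements lacking a physicalDescription child."""
    missing = []
    for item in root.iter(f"{{{MODS}}}relatedItem"):
        if item.get("type") != "original":
            continue
        if item.find("mods:physicalDescription", NS) is None:
            missing.append(item)
    return missing


# Two made-up page records: the first has been fixed, the second has not.
SAMPLE = """
<pages xmlns:mods="http://www.loc.gov/mods/v3">
  <mods:relatedItem type="original">
    <mods:physicalDescription>
      <mods:form type="microfilm" />
    </mods:physicalDescription>
    <mods:identifier type="reel number">00100479473</mods:identifier>
  </mods:relatedItem>
  <mods:relatedItem type="original">
    <mods:identifier type="reel number">00100479473</mods:identifier>
  </mods:relatedItem>
</pages>
"""

root = ET.fromstring(SAMPLE)
print(len(missing_physical_description(root)))
```

A count of zero would mean every page has been fixed; the sample above deliberately leaves one page unfixed.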

The DVV may also be used as a viewer. View the issue to see if there is anything unusual.

You may notice that there are two copies of page 3. This is likely the result of two exposures of the same page on the original microfilm. The filmer may have had two copies of the physical paper. They may have changed lighting between exposures. They may have been interrupted during filming and forgotten whether they had filmed a particular page.

There is some disagreement in the NDNP community about whether both exposures should be digitized and included with the deliverables, or whether a single best copy should be chosen and the other discarded. On one hand, it may be difficult to determine which copy of the page is best. The pages may yield different OCR text, both of which may be required for full recall. However, others view including both copies as just preserving the mistakes of the filmer, and prefer to present just one best copy for the digital version of the intellectual content of the paper.

For the next exercise, assume you agree with the second point of view and want to delete the second instance of page 3. You can make this change by editing the issue XML file and deleting the image files associated with that page.

Files: Note in the DVV that the images for the page you wish to delete have filenames like 0115.tif, 0115.jp2, etc. Make note of these file names, and then delete the files from the directory.

FILE SECTION: Open the 1904081601.xml file with a text editor such as WordPad and search for the FILE SECTION. This section indicates which files have information about the various newspaper pages. Find the <fileGrp> that has the 0115.* files. Before you delete the fileGrp, make note of its ID attribute and the ID attributes of the files in it. Then delete the fileGrp.

STRUCTURAL MAP: Next find the STRUCTURAL MAP. This section describes how the pages of the newspaper are grouped together (possibly into sections) to form an issue. Find the div that contains the files with FILEIDs from the fileGrp you just deleted. Before you delete the div, make note of its DMDID attribute. Then delete the div.

DESCRIPTIVE METADATA: Next find the DESCRIPTIVE METADATA section of the code. Find the dmdSec whose ID attribute matches the DMDID attribute of the div you just deleted. Before you delete the dmdSec, make note of the page sequence number that appears in the code like this:

  <mods:extent unit="pages">
    <mods:start>4</mods:start>
  </mods:extent>

Delete the dmdSec.
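The three sections are tied together by ID values, and the chain you just followed by hand can be sketched in code. The following Python example is only an illustration: the ISSUE document is made up and heavily simplified, every ID value and the 0114 file name are hypothetical, and the namespaces that the real file uses on its mets elements are left out.

```python
import xml.etree.ElementTree as ET

XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# A made-up, heavily simplified issue file. All ID values are hypothetical.
ISSUE = """
<mets xmlns:xlink="http://www.w3.org/1999/xlink">
  <dmdSec ID="pageModsBib3"/>
  <dmdSec ID="pageModsBib4"/>
  <fileSec>
    <fileGrp ID="pageFileGrp3">
      <file ID="masterFile3"><FLocat xlink:href="./0114.tif"/></file>
    </fileGrp>
    <fileGrp ID="pageFileGrp4">
      <file ID="masterFile4"><FLocat xlink:href="./0115.tif"/></file>
      <file ID="serviceFile4"><FLocat xlink:href="./0115.jp2"/></file>
    </fileGrp>
  </fileSec>
  <structMap>
    <div>
      <div DMDID="pageModsBib3"><fptr FILEID="masterFile3"/></div>
      <div DMDID="pageModsBib4">
        <fptr FILEID="masterFile4"/><fptr FILEID="serviceFile4"/>
      </div>
    </div>
  </structMap>
</mets>
"""


def trace_page(root, stem):
    """Follow the IDs from image file names to the fileGrp, div and dmdSec."""
    # FILE SECTION: find the fileGrp whose files are named like stem.*
    for grp in root.iter("fileGrp"):
        hrefs = [loc.get(XLINK_HREF, "") for loc in grp.iter("FLocat")]
        if any(h.rsplit("/", 1)[-1].startswith(stem + ".") for h in hrefs):
            break
    else:
        return None
    file_ids = {f.get("ID") for f in grp.iter("file")}
    # STRUCTURAL MAP: find the div that points at those files.
    for div in root.iter("div"):
        if {p.get("FILEID") for p in div.findall("fptr")} & file_ids:
            dmdid = div.get("DMDID")
            # DESCRIPTIVE METADATA: the dmdSec ID must equal the div's DMDID.
            dmd = next((d for d in root.iter("dmdSec") if d.get("ID") == dmdid), None)
            return {
                "fileGrp": grp.get("ID"),
                "files": sorted(file_ids),
                "div_DMDID": dmdid,
                "dmdSec_found": dmd is not None,
            }
    return None


print(trace_page(ET.fromstring(ISSUE), "0115"))
```

The sketch only reports what would need to be deleted; it does not change the file, so the deletions themselves are still done by hand as described above.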

Page Sequence: Several different numbers appear in the descriptive metadata, and each of them can look like a "page number".

Page number: The page number is the number that is actually printed on the page and visible in the page image. It appears in the code as:
```
  <mods:detail type="page number">
    <mods:number>5</mods:number>
  </mods:detail>
```
If there is no page number printed on the page, this code may be omitted from the dmdSec. Note in the example file that not all pages have a page number.
Reel sequence number: For an image scanned from microfilm, the reel sequence number indicates the position on the reel where the page was scanned from. For example, the 110th image has reel sequence number 110. It appears in the code as:
```
<mods:identifier type="reel sequence number">110</mods:identifier>
```
Not all reel sequence numbers need to be represented in the metadata! For example, if you don't scan images 1-100 of a reel (maybe because they are from another title) but you do scan 101-150, you need only have descriptive metadata records for images with reel sequence numbers 101-150. Similarly, if you choose to omit image 125 (because it is a duplicate page, for example) you need not have a descriptive metadata record for that image.
Page sequence number: The page sequence number describes the reading order of a newspaper issue. That is, the first page you should read has page sequence number 1, the second has page sequence number 2, etc. This appears in the code as:
```
  <mods:extent unit="pages">
    <mods:start>6</mods:start>
  </mods:extent>
```
Note that the page sequence numbers for the pages in an issue must form a sequence like 1,2,3... They do not need to be in order in the file, but every page must have a page sequence number, every number in the sequence must be present, and each number may appear only once.

Look again at the issue we are editing. We have just removed the descriptive metadata for a page, so we have just removed a number from the page sequence. We can correct this by subtracting 1 from every page sequence number that is greater than the one we removed.
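The rule and the correction can be written down in a few lines. This is a sketch in Python with made-up function names; the list of numbers is a hypothetical issue, not the contents of the file on the flash drive.

```python
def is_valid_page_sequence(numbers):
    """True if the numbers are exactly 1..n, each appearing once, in any order."""
    return sorted(numbers) == list(range(1, len(numbers) + 1))


def close_gap(numbers, removed):
    """Subtract 1 from every page sequence number greater than the removed one."""
    return [n - 1 if n > removed else n for n in numbers]


# Hypothetical issue: six pages remain after the page with
# page sequence number 4 has been deleted.
remaining = [1, 2, 3, 5, 6, 7]
print(is_valid_page_sequence(remaining))

renumbered = close_gap(remaining, removed=4)
print(renumbered)
print(is_valid_page_sequence(renumbered))
```

The first check fails because 4 is missing; after renumbering, the sequence is complete again.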

Validate the batch again and make sure that it still validates!

View the batch to make sure the extra copy of page 3 is gone.

Skim back through the NDNP issue metadata file -- hopefully the parts look familiar, as you have edited most parts of them now!

Do you have any questions?