Hands-on Metadata

Welcome to the hands-on metadata session of meta|morphosis! As we discussed in yesterday's session (Practical Metadata), you are already familiar with many different kinds of metadata. In today's session we will focus on a particular metadata format that may be less familiar to you: XML.

XML: eXtensible Markup Language

Example

  <book isbn="978-0898713619">
    <title>Numerical Linear Algebra</title>
    <author>Lloyd N. Trefethen</author>
    <author>David Bau, III</author>
    <publisher>SIAM</publisher>
    <favorite />
    <callnumber lcc="QA184" ddc="512.5" />
  </book>

Vocabulary

XML code is made up of elements, which can serve a number of purposes. Some are fields to which you can assign data. Others indicate the structure of these fields in relationship to each other. Elements from the above example include book, title, author, publisher, favorite and callnumber.

Many elements have start and end tags. Start tags look like:

      <title>
      <author>
      <publisher>

End tags look like (note the /):

      </title>
      </author>
      </publisher>

A value may be assigned to an element by putting it between the start and end tags. For example, the code:

      <title>Numerical Linear Algebra</title>

means that the title is Numerical Linear Algebra.

Other elements may be empty or self-closing; that is, rather than having separate start and end tags, they just have one tag with a / at the end. Some self-closing elements from the above example are:

      <favorite />
      <callnumber lcc="QA184" ddc="512.5" />

Elements may also have attributes associated with them. These appear in the tag along with the element name. In the above example, the following lines have attributes:

      <book isbn="978-0898713619">
      <callnumber lcc="QA184" ddc="512.5" />

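These lines mean that the book's ISBN is 978-0898713619, and that its call number is QA184 in the Library of Congress Classification (lcc) and 512.5 in the Dewey Decimal Classification (ddc).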

Note that data about an element may appear between the start and end tags or in an attribute. There is no real reason to make the code start with

      <book isbn="978-0898713619">

rather than

      <book>
        <isbn>978-0898713619</isbn>
        …

except that it was the choice I made when making up this particular XML language.
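To see how a program reads the two kinds of data, here is a minimal sketch in Python using the standard library module xml.etree.ElementTree. The function names describe_book and isbn_of are made up for this illustration and are not part of any books language. Data between the tags is read as element text, while data in the tag itself is read as an attribute.

```python
import xml.etree.ElementTree as ET

BOOK_XML = """<book isbn="978-0898713619">
  <title>Numerical Linear Algebra</title>
  <author>Lloyd N. Trefethen</author>
  <author>David Bau, III</author>
  <publisher>SIAM</publisher>
  <favorite />
  <callnumber lcc="QA184" ddc="512.5" />
</book>"""

# The alternative design from the text: the ISBN as a child element.
ISBN_AS_CHILD = "<book><isbn>978-0898713619</isbn></book>"


def describe_book(xml_text):
    """Collect the data in a <book> document into a plain dictionary."""
    book = ET.fromstring(xml_text)
    callnumber = book.find("callnumber")
    return {
        "isbn": book.get("isbn"),                       # attribute on <book>
        "title": book.findtext("title"),                # text between the tags
        "authors": [a.text for a in book.findall("author")],
        "favorite": book.find("favorite") is not None,  # empty element: present or absent
        "lcc": callnumber.get("lcc"),
        "ddc": callnumber.get("ddc"),
    }


def isbn_of(xml_text):
    """Return the ISBN whether it was stored as an attribute or as a child element."""
    book = ET.fromstring(xml_text)
    return book.get("isbn") or book.findtext("isbn")


print(describe_book(BOOK_XML))
print(isbn_of(BOOK_XML), isbn_of(ISBN_AS_CHILD))
```

Either design carries the same information; the program simply has to know which one the language chose.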

Note that XML is not any one particular language made of elements and attributes, like HTML or the books language used in the example above. Rather, it is a set of rules that a language must follow to be called an XML application or XML language. It is called extensible because anybody can make up such a language, either from scratch or by adding rules to an existing language. This does not mean that you can just add whatever elements you like to an XML document and have it still be valid.

A document that follows the rules of XML is said to be well-formed. If in addition to that, a document is written in a specific XML language, and it follows the rules of that language, it is said to be valid.
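The difference between the two terms can be sketched in Python. This is only an illustration: the well-formedness check relies on the standard library parser, while the "validity" rules below are hypothetical rules invented here for the made-up books language (a real XML language would state its rules in a DTD or schema, and a validating parser would check them).

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """True if the text follows the general rules of XML."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


def is_valid_book(xml_text):
    """True if the text is well-formed AND follows our hypothetical book rules.

    Assumed rules (not from any real specification): the root element is
    <book>, it has an isbn attribute, and it contains exactly one <title>.
    """
    if not is_well_formed(xml_text):
        return False
    root = ET.fromstring(xml_text)
    return (
        root.tag == "book"
        and root.get("isbn") is not None
        and len(root.findall("title")) == 1
    )


GOOD = '<book isbn="978-0898713619"><title>Numerical Linear Algebra</title></book>'
WELL_FORMED_ONLY = "<book><poem>Not part of the books language</poem></book>"
BROKEN = "<book><title>Numerical Linear Algebra</book>"

for text in (GOOD, WELL_FORMED_ONLY, BROKEN):
    print(is_well_formed(text), is_valid_book(text))
```

Note that the second document is perfectly good XML; it just is not a document in the books language.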

Some Rules of XML

There are about a hundred rules for XML, but here are some of the most common ones that will affect your documents:

Elements must have both start and end tags or be "self-closing".

The following is well-formed:

  <notesStmt>
    <note>
      <p>This is a note in a TEI document.</p>
    </note>
  </notesStmt>

The following is not well-formed:

  <notesStmt>
    <note>
      <p>This is a note in a TEI document.
    </note>
  </notesStmt>

Elements must not overlap; they must be well-nested.

The following is well-formed:

  <poem>
    <title>A poem</title>
    <stanza>
      <line>This stanza's just one line long.</line>
    </stanza>
  </poem>

The following is not well-formed:

  <poem>
    <title>A poem</title>
    <stanza>
      <line>This stanza's just one line long.</stanza>
    </line>
  </poem>

Values of attributes must be in quotes. Double quotes are the usual choice, though XML also accepts single quotes.

The following is well-formed:

  <select name="animal">
    <option value="dog">Dog</option>
    <option value="cat" selected="selected">Cat</option>
  </select>

The following is not well-formed:

  <select name=animal>
    <option value=dog>Dog</option>
    <option value=cat selected>Cat</option>
  </select>

As mentioned above, there are many other rules for XML, but these are some of the most common ones that come up when composing documents.
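If you would like a program to catch these mistakes for you, the following Python sketch feeds shortened versions of the three broken examples to the standard library parser and collects the error it reports for each. The dictionary labels and the function name find_problems are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Shortened versions of the "not well-formed" examples above.
BROKEN = {
    "missing end tag":
        "<note><p>This is a note in a TEI document.</note>",
    "overlapping elements":
        "<stanza><line>This stanza's just one line long.</stanza></line>",
    "unquoted attribute value":
        "<select name=animal><option value=dog>Dog</option></select>",
}


def find_problems(documents):
    """Return {label: parser message} for every document that is not well-formed."""
    problems = {}
    for label, text in documents.items():
        try:
            ET.fromstring(text)
        except ET.ParseError as err:
            problems[label] = str(err)
    return problems


for label, message in find_problems(BROKEN).items():
    print(f"{label}: {message}")
```

The messages include a line and column number, which is usually enough to find the offending tag in a text editor.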

XML tools

There are quite a few special-purpose programs you can use to edit and manipulate XML documents. oXygen and XMLSpy are examples of popular editors. Many XML editors have special features such as allowing you to validate or transform your document. Others may just color the words in your document in such a way that you can tell if it is not well-formed.

You can see from the above examples that XML is plain text; thus, you can edit it with any text editor! You do not need to pay for a program to write, edit or even validate your XML. For example, Xerces is a family of programs which can be downloaded and used to parse and validate any XML language that you have a specification for. There are also freely available resources that can be used for specific XML languages. For example:

For our examples today, we will not be using an XML-specific editor, but rather a plain text editor, such as a text field in a web browser or the Windows program WordPad. We will also be using freely available tools to check our documents for validity.

Example: Movie reviews

One effective way of creating valid XML documents, especially documents in an XML language with which you are not very familiar, is to start with documents that are already valid and just substitute one's own information for the values already there. For an example of this, visit the MovieXML site, look at other people's movie reviews, and add your own!

Example: XHTML

HTML is the language in which web pages are written. Putting data in an HTML element may have a different kind of meaning depending on the element. For example:

      <title>my web page</title>

means that my web page is the title of the page, and should be displayed in the title bar at the top of the browser. The code:

      i <i>love</i> <b>movies</b>.

means that the word love should be displayed in italics and the word movies should be displayed in bold like this: i love movies.

There are many web pages on the internet with imperfect HTML. Web browsers tend to be forgiving of broken or ambiguous code, and do their best to display it anyway. For example, in the following HTML code:

      <b>howdy</i>

should the word howdy be made bold or italicized? Or both? Different browsers may make different guesses about how the document should look, which means that different people viewing the page would see different things.
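The contrast between a forgiving HTML reader and a strict XML parser can be sketched in Python. Python's html.parser module is a lenient tokenizer, not a web browser, so this only illustrates the general attitude: it reports the mismatched tags without complaint, while the XML parser refuses the same text. The class and function names are made up for this illustration.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class TagLogger(HTMLParser):
    """Record every start and end tag the lenient HTML parser reports."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def html_events(markup):
    """Tokenize markup leniently and return the tag events seen."""
    logger = TagLogger()
    logger.feed(markup)
    logger.close()
    return logger.events


def xml_accepts(markup):
    """True only if a strict XML parser can read the markup."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


print(html_events("<b>howdy</i>"))   # the mismatch is reported, not rejected
print(xml_accepts("<b>howdy</i>"))   # the XML parser rejects it
print(xml_accepts("<b>howdy</b>"))
```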

Why XHTML?

There are many reasons to standardize the HTML code on your web pages into an XML format.

How XHTML?

Here are some things you may need to check when standardizing your web pages:

Exercise: Fix the web page on the flash drive!

Plug your flash drive into a USB port on your computer, and look at the web page in the directory web_page. Double-click the HTML file to view it in a web browser. Notice that (though it is not at all fancy!) it renders reasonably well.

Now edit the HTML document with WordPad; to do this, right-click the file and choose Open With... and choose WordPad. Can you find ways in which it is not well-formed or valid?

The W3C Markup Validation Service is a great way to check your document.

Example: NDNP

The National Digital Newspaper Program has a set of specifications for digitizing historic newspapers. Some of these are specifications for newspaper page images, such as TIFF, JP2 and PDF. Others are XML specifications for metadata about microfilm reels, newspaper issues and OCR (describing the position of words on each page).

The Library of Congress also provides a tool, the Digital Viewer and Validator (DVV), for validating digital newspaper collections against that specification, and viewing them logically as well. In this exercise, we will validate (and correct) a small batch of NDNP data (one issue of the Bourbon News) with the DVV. We will also view it to make sure that the data represents what we want it to.

Browse to the directory on the flash drive called batch_ky_20080925_apple. Open the file inside of it called babybatch.xml. You can see that it is a very small batch file which contains one issue.

Start the validator from the LOC DVV link on your desktop. Load the batch into the validator by clicking File -> Open, browsing to the E drive to find the batch, and selecting babybatch.xml.

To validate the batch, choose Batch -> Validate All. After about a minute, you should get the following validation error:

Validation error - Original related item must have a physicalDescription child;
/mets[1]/dmdSec[2]/mdWrap[1]/xmlData[1]/mods:mods[1]/mods:relatedItem[1] for
E:\batch_ky_20080924_apple\00100479473\1904081601\1904081601.xml (after modification by validator)

This error is fairly friendly in that it mentions the file that the error is in, and also gives instructions about how to locate the error in the file and how to fix it. The location of the error is given as an XPath:

/mets[1]/dmdSec[2]/mdWrap[1]/xmlData[1]/mods:mods[1]/mods:relatedItem[1]

This means the error is in the first mets element, the second dmdSec element inside that, the first mdWrap element inside that, etc. Browse down the directory structure to find the file mentioned in the error. Double-click the file to open it in a web browser and see if you can find the location mentioned.
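If you prefer to let a program follow the XPath for you, here is a minimal Python sketch. It assumes a heavily simplified stand-in for the issue file: the SAMPLE document and its ID values are invented for illustration, and only the mods prefix is bound to a namespace (the real NDNP files are larger and use more namespaces). The standard library supports only a small subset of XPath, and its paths are relative to the root element, so the helper drops the leading /mets[1] step.

```python
import xml.etree.ElementTree as ET

NS = {"mods": "http://www.loc.gov/mods/v3"}

# Hypothetical, simplified stand-in for the issue file.
SAMPLE = """<mets xmlns:mods="http://www.loc.gov/mods/v3">
  <dmdSec ID="issueModsBib"/>
  <dmdSec ID="pageModsBib1">
    <mdWrap>
      <xmlData>
        <mods:mods>
          <mods:relatedItem type="original">
            <mods:identifier type="reel number">00100479473</mods:identifier>
          </mods:relatedItem>
        </mods:mods>
      </xmlData>
    </mdWrap>
  </dmdSec>
</mets>"""

ERROR_XPATH = ("/mets[1]/dmdSec[2]/mdWrap[1]/xmlData[1]"
               "/mods:mods[1]/mods:relatedItem[1]")


def locate(root, xpath):
    """Follow an absolute XPath like the one in the error, starting at the root.

    The first step names the root element itself, so it is dropped and the
    remaining steps are searched relative to the root.
    """
    steps = xpath.strip("/").split("/")
    relative = "/".join(steps[1:])
    return root.find(relative, NS)


root = ET.fromstring(SAMPLE)
item = locate(root, ERROR_XPATH)
print(item.tag, item.attrib)
```

The numbers in square brackets count from 1, so dmdSec[2] is the second dmdSec child, not the third.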

The NDNP Newspaper issue template indicates that inside each page, there should be a mods:physicalDescription child at this location, and that it should look like this (since the page is from microfilm):

    <mods:physicalDescription>
        <mods:form type="microfilm" />
    </mods:physicalDescription>

You can fix the error by editing that XML file with WordPad. Just paste the above code snippet into the code for each page at the beginning of the relatedItem element:

    <mods:relatedItem type="original">
        <mods:physicalDescription>
            <mods:form type="microfilm" />
        </mods:physicalDescription>
        <mods:identifier type="reel number">00100479473</mods:identifier>
        …
    </mods:relatedItem>

With that code added, the issue should now be valid. Validate the issue again to make sure.
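The same fix can be sketched in Python for cases where there are too many pages to edit by hand. This is only an illustration under stated assumptions: the PAGE fragment is a cut-down relatedItem element, and the function name add_physical_description is made up here. The new element is inserted as the first child, matching the placement shown above.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mods", MODS_NS)  # keep the mods: prefix when writing output

# Cut-down relatedItem element that is missing its physicalDescription.
PAGE = """<mods:relatedItem xmlns:mods="http://www.loc.gov/mods/v3" type="original">
  <mods:identifier type="reel number">00100479473</mods:identifier>
</mods:relatedItem>"""


def add_physical_description(related_item, form_type="microfilm"):
    """Insert a physicalDescription as the first child, unless one already exists.

    Returns True if the element was added, False if nothing was changed.
    """
    tag = f"{{{MODS_NS}}}physicalDescription"
    if related_item.find(tag) is not None:
        return False
    description = ET.Element(tag)
    ET.SubElement(description, f"{{{MODS_NS}}}form", {"type": form_type})
    related_item.insert(0, description)
    return True


item = ET.fromstring(PAGE)
add_physical_description(item)
print(ET.tostring(item, encoding="unicode"))
```

Because the function checks for an existing physicalDescription first, running it twice does not add a duplicate.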

The DVV may also be used as a viewer. View the issue to see if there is anything unusual.

You may notice that there are two copies of page 3. This is likely the result of two exposures of the same page on the original microfilm. The filmer may have had two copies of the physical paper. They may have changed lighting between exposures. They may have been interrupted during filming and forgotten whether they had filmed a particular page.

There is some disagreement in the NDNP community about whether both exposures should be digitized and included with the deliverables, or whether a single best copy should be chosen and the other discarded. On one hand, it may be difficult to determine which copy of the page is best. The pages may yield different OCR text, both of which may be required for full recall. However, others view including both copies as just preserving the mistakes of the filmer, and prefer to present just one best copy for the digital version of the intellectual content of the paper.

For the next exercise, assume you agree with the second point of view and want to delete the second instance of page 3. You can make this change by just editing the issue XML file and deleting the image files associated with that page.

Files: Note in the DVV that the images for the page you wish to delete have filenames like 0115.tif, 0115.jp2, etc. Make note of these file names, and then delete the files from the directory.

FILE SECTION: Open the 1904081601.xml file with a text editor such as WordPad and search for the FILE SECTION. This section indicates which files have information about the various newspaper pages. Find the <fileGrp> that has the 0115.* files. Before you delete the fileGrp, make note of its ID attribute and the ID attributes of the files in it. Then delete the fileGrp.

STRUCTURAL MAP: Next find the STRUCTURAL MAP. This section describes how the pages of the newspaper are grouped together (possibly into sections) to form an issue. Find the div that contains the files with FILEIDs from the fileGrp you just deleted. Before you delete the div, make note of its DMDID attribute. Then delete the div.

DESCRIPTIVE METADATA: Next find the DESCRIPTIVE METADATA section of the code. Find the dmdSec whose ID attribute matches the DMDID of the div you just deleted. Before you delete the dmdSec, make note of the page sequence number that appears in the code like this:

  <mods:extent unit="pages">
    <mods:start>4</mods:start>
  </mods:extent>

Delete the dmdSec.
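The chain of cross-references you just followed by hand (file names to file IDs, file IDs to a div, the div's DMDID to a dmdSec) can be sketched in Python. The ISSUE document below is a hypothetical, heavily simplified stand-in: its ID values are invented, it has no namespaces, and it stores each file name in a plain href attribute, whereas the real NDNP issue file is namespaced and more deeply nested. Treat it as an outline of the logic rather than a tool to run on real data.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the issue file.
ISSUE = """<mets>
  <dmdSec ID="pageModsBib3"/>
  <dmdSec ID="pageModsBib4"/>
  <fileSec>
    <fileGrp ID="pageFileGrp3"><file ID="masterFile3" href="0114.tif"/></fileGrp>
    <fileGrp ID="pageFileGrp4"><file ID="masterFile4" href="0115.tif"/></fileGrp>
  </fileSec>
  <structMap>
    <div TYPE="issue">
      <div DMDID="pageModsBib3"><fptr FILEID="masterFile3"/></div>
      <div DMDID="pageModsBib4"><fptr FILEID="masterFile4"/></div>
    </div>
  </structMap>
</mets>"""


def remove_where(root, predicate):
    """Remove every element for which predicate(element) is true; return them."""
    removed = []
    for parent in list(root.iter()):
        for child in list(parent):
            if predicate(child):
                parent.remove(child)
                removed.append(child)
    return removed


def delete_page(root, filename_prefix):
    """Delete the fileGrp, div and dmdSec belonging to one page."""
    # FILE SECTION: find the fileGrp by file name and note its file IDs.
    groups = remove_where(root, lambda el: el.tag == "fileGrp" and any(
        f.get("href", "").startswith(filename_prefix) for f in el.findall("file")))
    file_ids = {f.get("ID") for g in groups for f in g.findall("file")}

    # STRUCTURAL MAP: find the div pointing at those files and note its DMDID.
    divs = remove_where(root, lambda el: el.tag == "div" and any(
        p.get("FILEID") in file_ids for p in el.findall("fptr")))
    dmd_ids = {d.get("DMDID") for d in divs}

    # DESCRIPTIVE METADATA: delete the dmdSec with a matching ID.
    remove_where(root, lambda el: el.tag == "dmdSec" and el.get("ID") in dmd_ids)
    return file_ids, dmd_ids


root = ET.fromstring(ISSUE)
file_ids, dmd_ids = delete_page(root, "0115")
print(file_ids, dmd_ids)
```

Each step records the identifiers it needs before deleting, which is the same reason the instructions ask you to make notes before each deletion.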

Page Sequence: There are a variety of numbers that appear in the descriptive metadata that all seem like "page number".

Look again at the issue we are editing. We have just removed the descriptive metadata for a page, so we have just removed a number from the page sequence. We can correct this by subtracting 1 from all page sequence numbers that come after this one.
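The renumbering step can be sketched in Python as follows. The PAGES document is a made-up fragment containing only the extent elements, and the function name close_sequence_gap is invented for this illustration; in the real issue file these elements sit inside each page's dmdSec.

```python
import xml.etree.ElementTree as ET

NS = {"mods": "http://www.loc.gov/mods/v3"}

# Made-up fragment: the page with sequence number 4 has already been deleted.
PAGES = """<mets xmlns:mods="http://www.loc.gov/mods/v3">
  <mods:extent unit="pages"><mods:start>3</mods:start></mods:extent>
  <mods:extent unit="pages"><mods:start>5</mods:start></mods:extent>
  <mods:extent unit="pages"><mods:start>6</mods:start></mods:extent>
</mets>"""


def close_sequence_gap(root, deleted_sequence):
    """Subtract 1 from every page sequence number greater than the deleted one.

    Returns how many numbers were changed.
    """
    changed = 0
    for start in root.iterfind(".//mods:extent[@unit='pages']/mods:start", NS):
        number = int(start.text)
        if number > deleted_sequence:
            start.text = str(number - 1)
            changed += 1
    return changed


root = ET.fromstring(PAGES)
changed = close_sequence_gap(root, deleted_sequence=4)
print(changed, [s.text for s in root.iterfind(".//mods:start", NS)])
```

Pages before the deleted one keep their numbers; only the later pages shift down.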

Validate the batch again and make sure that it still validates!

View the batch to make sure the extra copy of page 3 is gone.

Skim back through the NDNP issue metadata file -- hopefully the parts look familiar, as you have edited most parts of them now!

Do you have any questions?
