Extensible Markup Language (XML)

Introduction

Although XML looks similar to HTML, it was developed to serve a different purpose. XML was designed to describe data and to focus on what data is; HTML was designed to display data and to focus on how data looks.

Some Quick Bullets about XML

  • XML stands for EXtensible Markup Language
  • XML is a markup language much like HTML
  • XML was designed to describe data
  • XML tags are not predefined. You must define your own tags
  • XML uses a Document Type Definition (DTD) or an XML Schema to describe the data
  • XML with a DTD or XML Schema is designed to be self-descriptive
  • XML is a W3C Recommendation

XML was really designed to do nothing more than carry data. Let's look at a simple example that describes a note:

<note>
<to>Bill</to>
<from>Joe</from>
<heading>Reminder</heading>
<body>Don't forget to grade my assignments</body>
</note>

The note has a header and a message body. It also has sender and receiver information. But still, this XML document does not DO anything. It is just pure information wrapped in XML tags. Someone must write a piece of software to send, receive or display it.
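As a rough illustration of such a piece of software, here is a minimal sketch that reads the note and displays it. The choice of Python and its standard xml.etree.ElementTree module is my own assumption, as are the variable names; XML itself does not care which language reads it.

```python
import xml.etree.ElementTree as ET

NOTE_XML = """<note>
<to>Bill</to>
<from>Joe</from>
<heading>Reminder</heading>
<body>Don't forget to grade my assignments</body>
</note>"""

# Parse the text into a tree; `note` is the root <note> element.
note = ET.fromstring(NOTE_XML)

# Collect each child element's tag name and text into a dictionary.
fields = {child.tag: child.text for child in note}

print(f"{fields['heading']}: {fields['from']} -> {fields['to']}")
print(fields["body"])
```

Nothing in the XML told the program how to display the note. That decision lives entirely in the two print statements.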

XML tags are not predefined. You must "invent" your own tags.

The tags used to mark up HTML documents and the structure of HTML documents are predefined. The author of HTML documents can only use tags that are defined in the HTML standard (like <p>, <h1>, etc.).

XML allows the author to define his own tags and his own document structure.

The tags in the example above (like <to> and <from>) are not defined in any XML standard. These tags are "invented" by the author of the XML document.

All XML elements must have a closing tag

With XML, it is illegal to omit the closing tag.

In HTML some elements do not have to have a closing tag. The following code is legal in HTML:

<p>This is a paragraph
<p>This is another paragraph

I should point out that XHTML, the XML-based form of HTML, follows the XML rule: all tags require a closing tag.

XML tags are case sensitive

Unlike HTML tags, XML tags are case sensitive.

With XML, the tag <Letter> is different from the tag <letter>.

Opening and closing tags must therefore be written with the same case:

<Message>This is incorrect</message>
<message>This is correct</message>
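To see that names differing only in case really are different elements, here is a small sketch using Python's standard xml.etree.ElementTree module. The <mail> wrapper element and the text values are made up for the illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document holding both spellings side by side.
doc = ET.fromstring("<mail><Letter>first</Letter><letter>second</letter></mail>")

# Lookups are case sensitive, so each name finds only its own element.
upper = doc.findtext("Letter")
lower = doc.findtext("letter")
missing = doc.find("LETTER")  # no element has this exact name

print(upper, lower, missing)
```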

All XML elements must be properly nested

Improper nesting of tags makes no sense to XML.

In HTML some elements can be improperly nested within each other like this:

<b><i>This text is bold and italic</b></i>

In XML all elements must be properly nested within each other like this:

<b><i>This text is bold and italic</i></b>

All XML documents must have a root element

All XML documents must contain a single tag pair to define a root element.

All other elements must be within this root element.

All elements can have sub elements (child elements). Sub elements must be correctly nested within their parent element:

<root>
   <child>
      <subchild>
         ....
      </subchild>
    </child>
</root>
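Because every element sits inside exactly one parent, a program can walk the whole document starting from the root. The sketch below does this with Python's standard xml.etree.ElementTree module; the outline function is a helper I made up for the illustration, and the text inside <subchild> stands in for the "...." above.

```python
import xml.etree.ElementTree as ET

TREE_XML = "<root><child><subchild>text</subchild></child></root>"

def outline(element, depth=0):
    """Return one indented line per element, parents before children."""
    lines = ["   " * depth + element.tag]
    for child in element:  # iterating an element yields its direct children
        lines.extend(outline(child, depth + 1))
    return lines

print("\n".join(outline(ET.fromstring(TREE_XML))))
```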

 

Attribute values must always be quoted

With XML, it is illegal to omit quotation marks around attribute values. 

XML elements can have attributes in name/value pairs just like in HTML. In XML the attribute value must always be quoted. Study the two XML documents below. The first one is incorrect, the second is correct:

<?xml version="1.0" encoding="ISO-8859-1"?>
<note date=12/11/2022>
<to>Bill</to>
<from>Joe</from>
</note>

<?xml version="1.0" encoding="ISO-8859-1"?>
<note date="12/11/2022">
<to>Bill</to>
<from>Joe</from>
</note>

The error in the first document is that the date attribute in the note element is not quoted.

This is correct: date="12/11/2022". This is incorrect: date=12/11/2022.
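Here is a short sketch of how a parser treats the two documents, using Python's standard xml.etree.ElementTree module. I have left out the XML declaration and shortened the documents to one line each; the variable names are my own.

```python
import xml.etree.ElementTree as ET

quoted = '<note date="12/11/2022"><to>Bill</to><from>Joe</from></note>'
unquoted = "<note date=12/11/2022><to>Bill</to><from>Joe</from></note>"

# The quoted version parses, and the attribute value comes back as a string.
note = ET.fromstring(quoted)
date_value = note.get("date")
print(date_value)

# The unquoted version is rejected before any data can be read.
try:
    ET.fromstring(unquoted)
    unquoted_accepted = True
except ET.ParseError as err:
    unquoted_accepted = False
    print("rejected:", err)
```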

With XML, white space is preserved

With XML, the white space in your document is not truncated.

This is unlike HTML. With HTML, a sentence like this:

Hello              my name is Bill,

will be displayed like this:

Hello my name is Bill,

because HTML reduces multiple, consecutive white space characters to a single white space.
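The sketch below shows the XML side of this comparison with Python's standard xml.etree.ElementTree module. The <greeting> element name and the count of 14 spaces are assumptions made for the illustration.

```python
import xml.etree.ElementTree as ET

sentence = "Hello" + " " * 14 + "my name is Bill,"
greeting = ET.fromstring(f"<greeting>{sentence}</greeting>")

# The run of spaces reaches the application unchanged.
print(repr(greeting.text))

# Collapsing white space, as a browser does when displaying HTML, is a
# separate step that the application has to perform itself.
collapsed = " ".join(greeting.text.split())
print(collapsed)
```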

 

With XML, CR / LF is converted to LF

With XML, a new line is always stored as LF.

Do you know what a typewriter is? Well, a typewriter is a mechanical device which was used last century to produce printed documents. :-)

After you have typed one line of text on a typewriter, you have to manually return the printing carriage to the left margin position and manually feed the paper up one line.

In Windows applications, a new line is normally stored as a pair of characters: carriage return (CR) and line feed (LF). The character pair bears some resemblance to the typewriter actions of setting a new line. In Unix applications, a new line is normally stored as a LF character. Older Macintosh applications (before Mac OS X) used only a CR character to store a new line. An XML parser converts all of these to a single LF before it hands the text to the application.
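A small sketch of this conversion, using Python's standard xml.etree.ElementTree module: the same two lines of text are written with each of the three line-ending conventions and then parsed. The <text> element and its content are made up for the illustration.

```python
import xml.etree.ElementTree as ET

# The same content with Windows (CR LF), Unix (LF) and old Macintosh (CR)
# line endings.
windows = ET.fromstring("<text>line one\r\nline two</text>")
unix = ET.fromstring("<text>line one\nline two</text>")
old_mac = ET.fromstring("<text>line one\rline two</text>")

# After parsing, the application sees the same text in all three cases.
for element in (windows, unix, old_mac):
    print(repr(element.text))
```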

Comments in XML

The syntax for writing comments in XML is similar to that of HTML.

<!-- This is a comment -->

XML Elements are Extensible

XML documents can be extended to carry more information.

Look at the following XML NOTE example:

<note>
<to>Bill</to>
<from>Joe</from>
<body>Don't forget to grade my assignments!</body>
</note>

 

Let's imagine that we created an application that extracted the <to>, <from>, and <body> elements from the XML document to produce this output:

MESSAGE

To: Bill
From: Joe

Don't forget to grade my assignments!

Imagine that the author of the XML document added some extra information to it:

<note>
<date>2002-08-01</date>
<to>Bill</to>
<from>Joe</from>
<heading>Reminder</heading>
<body>Don't forget to grade my assignments!</body>
</note>

Should the application break or crash?

No. The application should still be able to find the <to>, <from>, and <body> elements in the XML document and produce the same output.

That is what it means for XML documents to be extensible.
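Here is a sketch of such an application, written with Python's standard xml.etree.ElementTree module. The render function is my own invention for the illustration, and I use the same body text in both documents. Because it looks elements up by name instead of by position, the extra <date> and <heading> elements do not disturb it.

```python
import xml.etree.ElementTree as ET

ORIGINAL = """<note>
<to>Bill</to>
<from>Joe</from>
<body>Don't forget to grade my assignments!</body>
</note>"""

EXTENDED = """<note>
<date>2002-08-01</date>
<to>Bill</to>
<from>Joe</from>
<heading>Reminder</heading>
<body>Don't forget to grade my assignments!</body>
</note>"""

def render(note_xml):
    """Build the MESSAGE output from <to>, <from> and <body> only."""
    note = ET.fromstring(note_xml)
    return (
        "MESSAGE\n\n"
        f"To: {note.findtext('to')}\n"
        f"From: {note.findtext('from')}\n\n"
        f"{note.findtext('body')}"
    )

print(render(EXTENDED))
print(render(ORIGINAL) == render(EXTENDED))
```

An application that instead assumed "the first child is <to>" would break on the extended document, so looking elements up by name is the safer habit.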

XML Elements have Relationships

Elements are related as parents and children.

To understand XML terminology, you have to know how relationships between XML elements are named, and how element content is described.

Imagine that this is a description of a book:

My First XML

Introduction to XML

  • What is HTML
  • What is XML

XML Syntax

  • Elements must have a closing tag
  • Elements must be properly nested

Imagine that this XML document describes the book:

<book>
<title>My First XML</title>
<prod id="33-657" media="paper"></prod>
<chapter>Introduction to XML
<para>What is HTML</para>
<para>What is XML</para>
</chapter>

<chapter>XML Syntax
<para>Elements must have a closing tag</para>
<para>Elements must be properly nested</para>
</chapter>

</book>

Book is the root element. Title, prod, and chapter are child elements of book. Book is the parent element of title, prod, and chapter. Title, prod, and chapter are siblings (or sister elements) because they have the same parent.
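These relationships can be read straight off the parsed document. The following is a minimal sketch with Python's standard xml.etree.ElementTree module; the parent_of dictionary is a helper I build for the illustration, because this particular library does not store a link from a child back to its parent.

```python
import xml.etree.ElementTree as ET

BOOK_XML = """<book>
<title>My First XML</title>
<prod id="33-657" media="paper"></prod>
<chapter>Introduction to XML
<para>What is HTML</para>
<para>What is XML</para>
</chapter>
<chapter>XML Syntax
<para>Elements must have a closing tag</para>
<para>Elements must be properly nested</para>
</chapter>
</book>"""

book = ET.fromstring(BOOK_XML)

# Iterating over an element yields its direct children, in document order.
children = [child.tag for child in book]
print(children)

# Map every element to its parent by visiting each parent's children.
parent_of = {child: parent for parent in book.iter() for child in parent}
first_para = book.find("chapter/para")
print(parent_of[first_para].tag)
```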

Elements have Content

Elements can have different content types.

An XML element is everything from (including) the element's start tag to (including) the element's end tag.

An element can have element content, mixed content, simple content, or empty content. An element can also have attributes.

In the example above, book has element content, because it contains other elements. Chapter has mixed content because it contains both text and other elements. Para has simple content (or text content) because it contains only text. Prod has empty content, because it carries no information.

In the example above only the prod element has attributes. The attribute named id has the value "33-657". The attribute named media has the value "paper".
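The four content types can be told apart mechanically. Below is a sketch using Python's standard xml.etree.ElementTree module; the content_type function is hypothetical, and it assumes that text consisting only of white space (such as the line breaks between child elements) does not count as text content.

```python
import xml.etree.ElementTree as ET

BOOK_XML = """<book>
<title>My First XML</title>
<prod id="33-657" media="paper"></prod>
<chapter>Introduction to XML
<para>What is HTML</para>
<para>What is XML</para>
</chapter>
</book>"""

def content_type(element):
    """Classify an element as element, mixed, simple or empty content."""
    has_children = len(element) > 0
    # Text can sit before the first child (.text) or after any child (.tail).
    pieces = [element.text] + [child.tail for child in element]
    has_text = any(piece and piece.strip() for piece in pieces)
    if has_children:
        return "mixed" if has_text else "element"
    return "simple" if has_text else "empty"

book = ET.fromstring(BOOK_XML)
prod = book.find("prod")
for element in (book, book.find("chapter"), book.find("chapter/para"), prod):
    print(element.tag, "->", content_type(element))

# Attributes are available as a dictionary on the element.
print(prod.attrib)
```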

Element Naming

XML elements must follow these naming rules:

  • Names can contain letters, digits, and a few other characters such as the underscore, hyphen, and period
  • Names must start with a letter or an underscore, not with a digit or another punctuation character
  • Names must not start with the letters xml (or XML, or Xml, etc)
  • Names cannot contain spaces

Take care when you "invent" element names and follow these simple rules:

Any name can be used and no words are reserved (apart from the xml prefix mentioned above), but the idea is to make names descriptive. Names with an underscore separator are nice.

Examples: <first_name>, <last_name>.

Avoid "-" and "." in names. For example, if you name something "first-name," it could be a mess if your software tries to subtract name from first. Or if you name something "first.name," your software may think that "name" is a property of the object "first."

Element names can be as long as you like, but don't exaggerate. Names should be short and simple, like this: <book_title> not like this: <the_title_of_the_book>. 

XML documents often have a corresponding database, in which fields exist corresponding to elements in the XML document. A good practice is to use the naming rules of your database for the elements in the XML documents.

Non-English letters like éòá are perfectly legal in XML element names, but watch out for problems if your software vendor doesn't support them.

The ":" should not be used in element names because it is reserved to be used for something called namespaces.
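The rules and advice above can be turned into a rough checker. This is only a sketch in Python: the check_name function is hypothetical, and its pattern is deliberately simplified to ASCII, so it would wrongly reject the non-English letters that XML allows and it treats ":" as illegal instead of as a namespace separator.

```python
import re

# Simplified, ASCII-only pattern: a letter or underscore first, then
# letters, digits, underscores, hyphens or periods.
_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.-]*")

def check_name(name):
    """Return a list of problems with a proposed element name."""
    problems = []
    if not _NAME.fullmatch(name):
        problems.append("not a legal (ASCII) XML name")
    if name.lower().startswith("xml"):
        problems.append("starts with the reserved letters xml")
    if "-" in name or "." in name:
        problems.append("contains '-' or '.', which are best avoided")
    return problems

for candidate in ("first_name", "first-name", "1st_name", "first name", "xmlData"):
    print(candidate, "->", check_name(candidate) or "ok")
```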

XML Attributes

XML elements can have attributes.

From HTML you will remember this: <IMG SRC="computer.gif">. The SRC attribute provides additional information about the IMG element.

In HTML (and in XML) attributes provide additional information about elements:

<img src="computer.gif">
<a href="demo.aspx">

Attributes often provide information that is not a part of the data. In the example below, the file type is irrelevant to the data, but important to the software that wants to manipulate the element:

<file type="gif">computer.gif</file>

Quote Styles, "female" or 'female'?

Attribute values must always be enclosed in quotes, but either single or double quotes can be used. For a person's sex, the person tag can be written like this:

<person sex="female">

or like this:

<person sex='female'>

Note: If the attribute value itself contains double quotes, the simplest fix is to put single quotes around it (the alternative is to write each inner double quote as &quot;), as in this example:

<gangster name='George "Shotgun" Ziegler'>

Note: If the attribute value itself contains single quotes, the simplest fix is to put double quotes around it (the alternative is to write each inner single quote as &apos;), as in this example:

<gangster name="George 'Shotgun' Ziegler">
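When a program writes the XML, it can leave the choice of quote style to a library. As a sketch, Python's standard library has a quoteattr function in xml.sax.saxutils that picks the quotes for you; the names list is just the two example values from above.

```python
from xml.sax.saxutils import quoteattr

names = ['George "Shotgun" Ziegler', "George 'Shotgun' Ziegler"]

# quoteattr() returns the value already wrapped in suitable quotes.
tags = [f"<gangster name={quoteattr(name)}>" for name in names]
print("\n".join(tags))

# A value containing both kinds of quote gets double quotes around it,
# with the inner double quotes escaped as &quot;.
both = quoteattr("""George "Shotgun" O'Ziegler""")
print(both)
```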

Use of Elements vs. Attributes

Data can be stored in child elements or in attributes.

Take a look at these examples:

<person sex="female"> <firstname>Anna</firstname> <lastname>Smith</lastname> </person>

<person> <sex>female</sex> <firstname>Anna</firstname> <lastname>Smith</lastname> </person>

In the first example sex is an attribute. In the second, sex is a child element. Both examples provide the same information.
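From the reading program's point of view the two forms differ only in which call retrieves the value. A minimal sketch with Python's standard xml.etree.ElementTree module (the variable names are my own):

```python
import xml.etree.ElementTree as ET

as_attribute = ET.fromstring(
    '<person sex="female"><firstname>Anna</firstname>'
    "<lastname>Smith</lastname></person>"
)
as_element = ET.fromstring(
    "<person><sex>female</sex><firstname>Anna</firstname>"
    "<lastname>Smith</lastname></person>"
)

from_attribute = as_attribute.get("sex")    # attribute: read with get()
from_element = as_element.findtext("sex")   # child element: read with findtext()
print(from_attribute, from_element)
```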

There are no rules about when to use attributes, and when to use child elements. My experience is that attributes are handy in HTML, but in XML you should try to avoid them. Use child elements if the information feels like data.

A Better Way?

It may be better to store data in child elements.

The following three XML documents contain exactly the same information:

A date attribute is used in the first example:

<note date="12/11/2020">
<to>Bill</to>
<from>Joe</from>
<heading>Reminder</heading>
<body>Don't forget to grade my assignments!</body>
</note>

A date element is used in the second example:

<note>
<date>12/11/2020</date>
<to>Bill</to>
<from>Joe</from>
<heading>Reminder</heading>
<body>Don't forget to grade my assignments!</body>
</note>

An expanded date element is used in the third (this is my favorite):

<note>
<date>
<day>12</day>
<month>11</month>
<year>2020</year>
</date>
<to>Bill</to>
<from>Joe</from>
<heading>Reminder</heading>
<body>Don't forget to grade my assignments!</body>
</note>
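One practical reason to like the expanded form: a program can build a real date from it without guessing whether "12/11" means 12 November or December 11. Here is a sketch with Python's standard xml.etree.ElementTree and datetime modules, using a shortened copy of the third note.

```python
import datetime
import xml.etree.ElementTree as ET

note = ET.fromstring(
    "<note><date><day>12</day><month>11</month><year>2020</year></date>"
    "<to>Bill</to><from>Joe</from></note>"
)

date_element = note.find("date")

# Each part is already its own element, so no string splitting is needed.
when = datetime.date(
    int(date_element.findtext("year")),
    int(date_element.findtext("month")),
    int(date_element.findtext("day")),
)
print(when.isoformat())
```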

Avoid using attributes?

Should you avoid using attributes?

Some of the problems with using attributes are:

  • attributes cannot contain multiple values (child elements can)
  • attributes are not easily expandable (for future changes)
  • attributes cannot describe structures (child elements can)
  • attributes are more difficult to manipulate by program code
  • attribute values are not easy to test against a Document Type Definition (DTD) - which is used to define the legal elements of an XML document

If you use attributes as containers for data, you end up with documents that are difficult to read and maintain. Try to use elements to describe data. Use attributes only to provide information that is not relevant to the data.

Don't end up like this (this is not how XML should be used):

<note day="12" month="11" year="2006"
to="Bill" from="Joe" heading="Reminder"
body="Don't forget to grade my assignments">
</note>

 

An Exception to My Attribute Rule

Rules always have exceptions.

My rule about attributes has one exception:

Sometimes I assign ID references to elements. These ID references can be used to access XML elements in much the same way as the NAME or ID attributes in HTML. This example demonstrates this:

<messages>
<note id="p501">
<to>Bill</to>
<from>Joe</from>
<heading>Reminder</heading>
<body>Don't forget to grade my assignments!</body>
</note>

<note id="p502">
<to>Joe</to>
<from>Bill</from>
<heading>Re: Reminder</heading>
<body>I will take care of it</body>
</note>
</messages>

The ID in these examples is just a counter, or a unique identifier, to identify the different notes in the XML file, and not a part of the note data.

What I am trying to say here is that metadata (data about data) should be stored as attributes, and that data itself should be stored as elements.
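To show how such an ID can be used to get at one particular note, here is a sketch with Python's standard xml.etree.ElementTree module. The note_by_id function is a helper I made up; it simply compares the id attribute of each <note> in turn.

```python
import xml.etree.ElementTree as ET

MESSAGES_XML = """<messages>
<note id="p501">
<to>Bill</to>
<from>Joe</from>
<heading>Reminder</heading>
<body>Don't forget to grade my assignments!</body>
</note>
<note id="p502">
<to>Joe</to>
<from>Bill</from>
<heading>Re: Reminder</heading>
<body>I will take care of it</body>
</note>
</messages>"""

def note_by_id(root, note_id):
    """Return the <note> whose id attribute equals note_id, or None."""
    for note in root.iter("note"):
        if note.get("id") == note_id:
            return note
    return None

messages = ET.fromstring(MESSAGES_XML)
reply = note_by_id(messages, "p502")
print(reply.findtext("heading"))
```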

"Well Formed" XML documents

A "Well Formed" XML document has correct XML syntax.

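A "Well Formed" XML document is a document that conforms to the XML syntax rules that were described in the sections above:

  • All elements must have a closing tag
  • Tags are case sensitive, so opening and closing tags must match exactly
  • All elements must be properly nested
  • The document must have a single root element
  • Attribute values must always be quoted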

You can check the syntax of your XML documents by opening them with a web browser like Internet Explorer. If the document displays correctly, it is well formed. If not, the browser will display an appropriate error.
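The same check can be done from a program. The sketch below uses Python's standard xml.etree.ElementTree module; the syntax_error function and the sample strings are my own, chosen to break one syntax rule each. It only tests whether a document is well formed, not whether it matches a DTD or XML Schema.

```python
import xml.etree.ElementTree as ET

def syntax_error(text):
    """Return None if text is well formed, else the parser's message."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        return str(err)  # the message includes the line and column
    return None

samples = {
    "good": "<b><i>bold and italic</i></b>",
    "bad nesting": "<b><i>bold and italic</b></i>",
    "case mismatch": "<Message>text</message>",
    "missing close": "<doc><p>This is a paragraph</doc>",
    "unquoted attribute": "<note date=12/11/2022></note>",
    "two roots": "<note></note><note></note>",
}

for label, text in samples.items():
    print(f"{label}: {syntax_error(text) or 'well formed'}")
```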