1. UNDERSTAND MARKUP LANGUAGE

A markup language is used to add extra information to a text document to control the interpretation of the data it contains. A special character set, such as < and >, is used to separate the markup information from the document data. Interpretation of this extra information naturally requires programs specifically adapted to the particular markup language.

Much programming effort is devoted to creating or interpreting marked up documents, so this lesson takes a detour into markup country before continuing with programming languages.

Markup languages control interpretation of data. You're probably familiar with the way HTML (Hypertext Markup Language) controls the appearance of a marked up page and the possible hypertext link interactions a user can have with the page.

The other highly significant markup language in modern computing is XML (Extensible Markup Language). XML is important in the interpretation, transmission, and storage of all sorts of data. It's extremely likely that you will run into XML marked up documents more and more.

Just a Little History

Markup languages have been created for many purposes. Probably every time some programmer got really frustrated with the effort required to control the exact appearance of printed text or reuse some existing text, they thought about creating a markup language.

A famous example comes from the programming field. Computer scientist Donald Knuth got annoyed with the poor quality of the typesetting of mathematical expressions in the first volumes of his monumental series of books on programming algorithms. He set about creating a typesetting markup language to improve the situation. The language he invented, called TeX, is widely used in publishing today.

The direct forerunner of HTML and XML was SGML (Standard Generalized Markup Language). GML, which evolved into SGML, was invented at IBM to make the creation of complex program documentation easier. SGML is widely used in industries that have large technical documentation problems, such as aircraft manufacture. It enables you to keep one master document that can be processed (by various programs) in a number of ways to meet a number of different needs. There's that tie-in between markup languages and programming again!

Unfortunately, SGML is so complex that working with it is a full-time job, and the programs used to display and edit SGML documents are real monsters. The basic idea, however, served as an inspiration for HTML, the markup language of the Web.

2. HTML TAGS AND ELEMENTS

You can use a simple HTML page to get a grasp on several important principles.

<html>
  <head>
    <title>A really simple page</title>
  </head>
  <body>
    <p>This is the body text in a paragraph</p>
    <!-- this is a comment -->
  </body>
</html>

The parts enclosed in the < and > pairs are HTML tags that tell a browser how to treat the text between the tags. Extra indenting spaces are added to the example to show how lines are enclosed in pairs of tags, but this indenting is not an HTML requirement. Tags in HTML can be in either uppercase or lowercase to be interpreted correctly by browsers; however, future versions of HTML that are closer to the XML standard will have to keep tags in lowercase.

The entire document is enclosed in the <html> . . . </html> tag pair. The first is called the opening tag and the second is the closing tag. The <html> . . . </html> tag pair forms the root element of the document. The <head> . . . </head> and <body> . . . </body> tag pairs form the head and body elements respectively, which are the only elements that can appear directly inside the root element. The head and body elements are nested inside the html or root element.

Note the special markup starting with <!-- and ending with --> character sequences. This is a comment, not an element, and won't be displayed by a browser.

The HTML specification clearly defines what sort of elements can appear inside other elements. It's this strict control over what can appear where in an HTML document that makes it possible for Web pages to be displayed in different Web browsers without too much variation. In contrast, XML, being extensible, doesn't have this sort of standardization. This is discussed later in the lesson.

Tag Pairs and Empty Tags

When using HTML tag pairs to define elements, any nested tags inside the pair must be closed before the element closing tag appears. In the example, the <p> . . . </p> tag pair must be completely inside the <body> . . . </body> tag pair. An HTML document that adheres to this rule is said to be well formed.
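
The nesting rule can be checked mechanically with a stack of open element names. The following is a minimal sketch, not a real HTML or XML parser: isProperlyNested is a hypothetical helper name, and the sketch assumes empty tags are written in the XML style ending with />, so a bare <br> would be reported as unclosed.

```javascript
// Toy nesting checker: a sketch under the assumptions stated above.
// isProperlyNested is a hypothetical helper, not part of any standard API.
function isProperlyNested(markup) {
  // Strip comments first so that tags inside them are ignored.
  var text = markup.replace(/<!--[\s\S]*?-->/g, "");
  // Matches opening tags, closing tags, and empty tags ending in "/>".
  var tagPattern = /<(\/?)([A-Za-z][A-Za-z0-9]*)[^>]*?(\/?)>/g;
  var open = []; // stack of element names still waiting for a closing tag
  var match;
  while ((match = tagPattern.exec(text)) !== null) {
    var isClosing = match[1] === "/";
    var name = match[2];
    var isEmpty = match[3] === "/";
    if (isEmpty) {
      continue; // an empty tag encloses nothing, so it never goes on the stack
    }
    if (!isClosing) {
      open.push(name);
    } else if (open.pop() !== name) {
      return false; // closing tag does not match the most recent opening tag
    }
  }
  return open.length === 0; // every opening tag must have been closed
}

console.log(isProperlyNested("<body><p>text</p></body>"));
console.log(isProperlyNested("<body><p>text</body></p>"));
```

The first call passes a correctly nested fragment; the second swaps the two closing tags, which breaks the rule.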

HTML tags can appear in two ways -- paired as in the previous example, and as a singleton such as the <br> tag that causes a break in the text. Singleton tags are also referred to as empty tags because they don't enclose anything. Unfortunately, HTML adopts a convention for empty tags that is contrary to that used in XML. This convention is covered later in this lesson.

It'd be advantageous to programmers if both HTML and XML followed the exact same rules for forming tags, and there's a determined effort to create a version of HTML that follows the XML rules. The W3C (World Wide Web Consortium) has a standard called XHTML that accomplishes this, and establishes a clear road for future HTML extensions.

Tag Attributes

An opening tag can have additional information called attributes that take the form of name="value". You already saw these in action in Lesson 3 where an input element in a form was defined with the following empty tag:

<input name="msg" type="text" size="40" value=""/>

Conventions dictate that the value of an attribute always be enclosed in quotes, although Web browsers aren't strict about this. Typical uses for attributes are to set color, fonts, and element locations.
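
To see how a program might pull the name="value" pairs out of a tag, here is a small sketch. parseAttributes is a hypothetical helper written for this illustration; it assumes every value is enclosed in double quotes and makes no attempt to handle the looser forms that browsers tolerate.

```javascript
// parseAttributes is a hypothetical helper for illustration only.
// It collects every name="value" pair found in a tag into an object.
function parseAttributes(tag) {
  var attributes = {};
  // Each match is name="value"; the value may be empty.
  var pattern = /([A-Za-z_][\w\-]*)\s*=\s*"([^"]*)"/g;
  var match;
  while ((match = pattern.exec(tag)) !== null) {
    attributes[match[1]] = match[2];
  }
  return attributes;
}

var attrs = parseAttributes('<input name="msg" type="text" size="40" value=""/>');
console.log(attrs.name, attrs.type, attrs.size);
```

Note that every attribute value comes back as a string, so size is the text "40" rather than a number.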

HTML Standardization and the DOM

The W3C is the organization responsible for defining the specifications for HTML and creating many other Web-related standards. Many W3C projects are related to standardizing and improving the information carried by markup languages. One of the most fascinating projects is the Semantic Web -- an attempt to make the resources presented on the Web more directly usable by programs.

Many W3C standards are in a nearly constant state of revision, reflecting the incredibly rapid rate of innovation in the World Wide Web as thousands of ideas jostle for acceptance.

The DOM (Document Object Model) is a W3C standard defining how scripting languages and programs can address and modify the various elements of HTML and XML documents. You saw an example in Lesson 3, but the following section takes a look at that again.

3. HTML AND SCRIPTING LANGUAGES

Here is the JavaScript used in the Lesson 3 example of responding to an event:

function doReverse() {
  var s = ctrl.msg.value ;
  ctrl.msg.value = "" ; // erase input
  for( var i = s.length - 1 ; i >= 0 ; i-- ){ // start at the last character
    ctrl.msg.value += s.charAt( i );
  }
}

The notation that this code uses to address the value that is attached to the form field named msg in the form named ctrl is using the DOM. The DOM enables JavaScript to address many elements and properties of an HTML page.
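
The reversal logic can be tried without a browser by separating it from the form field. The sketch below uses a hypothetical function name, reverseText, and takes the string as a parameter instead of reading it through the DOM; the loop walks from the last character index, s.length - 1, down to 0.

```javascript
// reverseText is a hypothetical, DOM-free version of the reversal loop.
function reverseText(s) {
  var reversed = "";
  // The last character of a string is at index length - 1.
  for (var i = s.length - 1; i >= 0; i--) {
    reversed += s.charAt(i);
  }
  return reversed;
}

console.log(reverseText("markup"));
```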

You may see the term DHTML (Dynamic HTML) applied to pages that use JavaScript to change the appearance of a Web page on the fly.

JavaScript can also be used to add content to an HTML page as it's being loaded. The following is an example in which the browser executes the script in the <body> element, and the script writes the entire contents of the body. This results in the display shown in Figure 5-1.

<html>
<head><title>Weekly Planner</title>
<script language="javascript">
 var dayName = new Array ("Sunday","Monday","Tuesday",
  "Wednesday", "Thursday", "Friday","Saturday");
 var today = new Date() ;
</script>
</head>
<body>
<script language="javascript" type="text/javascript">
document.write("<h2>Today is " + 
    dayName[ today.getDay() ] + "</h2>" );
if( today.getDay() == 0 || today.getDay() == 6 ){
 document.write("<p>Goof off all day</p>"); 
} else {
 document.write("<p>Look busy!</p>");
}
</script>
</body>
</html>
Figure 5-1: An example of client-side dynamically created Web page content.

Note that in this example, the JavaScript code is in two pieces -- one in the <head> element of the page, and one in the <body>. The code in the head defines two variables -- an array of day names, and the today variable. The today variable is interesting because it's an object that contains complete information about the instant the object was created. A Date object has functions that can interpret that time instant as various parts of a calendar date.

The code in the body of the page uses the getDay function to get the day of the week and then uses that to compose the text that gets written to the body of the HTML page. The if statement checks to see if the day of the week is 0 (Sunday) or 6 (Saturday). The || symbol is what JavaScript uses for the Boolean logic OR.

In the document.write statements, the document is the way the DOM refers to the object that contains the entire page, and the function call write says to output this text at the current writing point of the page. The document object has ways of addressing all parts of the page, whether they are named or not.
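
The day-of-week logic can also be exercised outside a browser if the markup is returned as a string rather than written with document.write. This is only a sketch of that idea: plannerHtml is a hypothetical function name, and the fixed date is chosen just to make the weekend branch visible.

```javascript
var dayName = ["Sunday", "Monday", "Tuesday", "Wednesday",
  "Thursday", "Friday", "Saturday"];

// plannerHtml is a hypothetical name; it returns the markup instead of writing it.
function plannerHtml(date) {
  var day = date.getDay(); // 0 = Sunday through 6 = Saturday
  var html = "<h2>Today is " + dayName[day] + "</h2>";
  if (day == 0 || day == 6) {
    html += "<p>Goof off all day</p>";
  } else {
    html += "<p>Look busy!</p>";
  }
  return html;
}

console.log(plannerHtml(new Date(2024, 0, 7))); // 7 January 2024, a Sunday
console.log(plannerHtml(new Date()));           // whatever today happens to be
```

Passing the date in as a parameter makes both branches of the if statement easy to try without waiting for the weekend.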

It's important to make a distinction between code that executes in the Web browser or client environment and that which operates on the Web server. On the client side, a scripting language is limited in what it can do by browser security settings. A program on the server side doesn't have these limitations.

HTML and Server-Side Scripting

When HTML was first developed, people immediately figured out that it would be cool to combine the static content of a page with dynamic data created on the fly. As is typical in technological revolutions, many different approaches were tried. All of these different schemes were based on some kind of markup language mixing special tags with the HTML tags and using special processing programs on the Web server.

Essentially, these special tags form a kind of scripting language. Some of these developments survived; some have been replaced. But you may recognize one or more in this list: server-side JavaScript, ColdFusion, PHP (Hypertext Preprocessor), iHTML (inline HTML), ASP (Active Server Pages), and JSP (JavaServer Pages).

The basic idea of server-side scripting languages is that when the server is called upon to send a file, it looks at the file type to see if that file gets special processing. Typically, a marked up file will have plain HTML with special tags that produce output from the scripting language. The following is a simple example marked up with JSP tags that start with <% and end with %>.

<html>
<head><title>Dynamic Page Content Example</title>
</head>
<body>
<h2>The time now: 
<%@ page language="java" import="java.util.Date"%>
<%= new Date() %></h2>
<hr/>
</body>
</html>

When a JSP-capable Web server sees a request for that page, it actually creates a small Java program that outputs the HTML and adds in the current date, resulting in the page shown in Figure 5-2.
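
The general idea of mixing fixed markup with computed values can be imitated in a few lines of JavaScript. This is a deliberately simplified sketch and not how a JSP server works internally: a real server compiles the page into a Java program, whereas the hypothetical renderTemplate helper below just substitutes text, and it only understands expression tags that name a function supplied by the caller.

```javascript
// renderTemplate is a hypothetical helper; it imitates the effect of
// expression tags by plain text substitution.
function renderTemplate(template, expressions) {
  // Drop directive tags such as <%@ page ... %>; they produce no output.
  var withoutDirectives = template.replace(/<%@[\s\S]*?%>/g, "");
  // Replace each <%= name %> with the result of calling the function of that name.
  return withoutDirectives.replace(/<%=\s*([\s\S]*?)\s*%>/g, function (whole, name) {
    if (Object.prototype.hasOwnProperty.call(expressions, name)) {
      return String(expressions[name]());
    }
    return whole; // leave unknown expressions untouched
  });
}

var page = "<h2>The time now: <%= now %></h2>";
console.log(renderTemplate(page, { now: function () { return new Date(); } }));
```

Each request would call renderTemplate again, so the value is computed fresh every time while the surrounding HTML stays fixed.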

Figure 5-2: Server-side dynamic content from JSP.

Scripting languages using specialized markup on the server side are an essential part of the modern Web. With simple markup tags, you can access databases or draw on resources distributed all over the Internet to create a custom page for your customer.

4. XML BASICS

Similar to HTML, XML is a markup language that follows a standard defined by the W3C. The difference is that the standard only defines the syntax conventions, and a few basic rules. You can invent your own tag vocabulary to apply to your own particular problem. As long as you follow the rules, your XML formatted documents can be processed with standard toolkits that are available in many languages.

The XML standard was derived by simplifying SGML and eliminating much complexity and inconsistencies that make working with SGML a full-time job. The W3C sponsored the XML development effort because they realized the true potential of the Web for communication could only be reached if people had a flexible way to describe just about any kind of data. The use of XML, however, has spread well beyond the Internet and can now be found in many computing applications.

XML has the same requirement for the nesting of tag pairs that HTML does. An XML document that has tags nested correctly and conforms to a few other requirements is said to be well formed. Empty tags in XML are required to end with the /> character sequence, so a program reading XML can be sure that it doesn't have to look for a matching closing tag. In XHTML, the <br> tag must be written <br/> to comply.

All tag names in XML are case sensitive, so <br/>, <Br/>, and <BR/> are three different tags. In XHTML, all tags used like HTML tags must be in lowercase, so only <br/> would produce a text break in an XHTML compliant browser. Another name requirement is that you must not use tag names beginning with the letters xml, in any combination of uppercase and lowercase. These names are reserved by the standard.

Although it is not required, a complete XML document typically begins with a statement in a specialized format that states the version of the standard it conforms to. The following is an example:

<?xml version="1.0"?>

Complete details on XML formatting are beyond the scope of this course. You can find the formal documents at the W3C Web site, and any search engine can help you locate XML tutorials.

In addition to the concept of well-formedness, XML documents can be valid. The content and arrangement of elements in a valid XML document conform to a specification in a Document Type Definition (DTD) or an XML Schema. An XML document may be well formed and not valid, but it can never be valid and not well formed.

An Example XML Document

Suppose you want to store all of the text in one of these lessons as an XML document. The following is a skeleton of the document structure you could use, showing just the first page and leaving out the bulk text.

<?xml version="1.0"?>
<lesson number="5" author="Brogden">
<title>Markup Languages</title>
  <page pagetitle="What is a Markup Language?">
    <paragraph>A markup language is used to .etc.
    </paragraph>
   <!-- more paragraphs go here -->
  </page>
   <!-- more pages go here -->
</lesson>

That sure looks very similar to the HTML you might use to display the lesson, but instead of the standard HTML tags you get to make up your own. Why would you want to do that? The following section takes a look at some possibilities.

5. XML AND DATA MANIPULATION

The following are some tasks that can be done by processing the XML document representing this lesson:

  • Build a table of contents for the lesson by extracting the pagetitle attributes from the <page> elements.
  • Reformat the document as a single HTML page.
  • Reformat the document as a series of HTML pages, one per <page> element.
  • Reformat the document in the widely used PDF (Portable Document Format).
  • Reformat the document as a series of small pages in WML (Wireless Markup Language), used by WAP-enabled cell phones.
  • Store each <page> element in an XML enabled database.

Libraries of code for processing XML documents exist in many computer languages. There are basically two approaches -- parsing into separate tags and parts of elements, and building objects that contain entire elements.

In the separate parts approach, a parsing program takes the document apart at the points where tags appear and hands the programmer the separate parts in the order of the original document. Your code has to decide what to do with each part of the document on the fly. This approach is particularly good if your program has to extract a small chunk out of the whole document; for example, extracting the pagetitle attributes to build a table of contents.
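
The separate parts approach can be sketched with a callback that is invoked once per opening tag, in document order. The scanStartTags function below is a hypothetical, regular-expression stand-in for a real event-driven parser, so it only copes with simple, well-formed input. The second page title in the sample data is made up for the illustration.

```javascript
// scanStartTags is a hypothetical stand-in for an event-driven XML parser.
// It calls onStartTag(name, attributes) for each opening or empty tag.
function scanStartTags(xml, onStartTag) {
  var text = xml.replace(/<!--[\s\S]*?-->/g, ""); // ignore comments
  var tagPattern = /<([A-Za-z][\w\-]*)((?:\s+[\w\-]+\s*=\s*"[^"]*")*)\s*\/?>/g;
  var match;
  while ((match = tagPattern.exec(text)) !== null) {
    var attributes = {};
    var attrPattern = /([\w\-]+)\s*=\s*"([^"]*)"/g;
    var attr;
    while ((attr = attrPattern.exec(match[2])) !== null) {
      attributes[attr[1]] = attr[2];
    }
    onStartTag(match[1], attributes); // hand each part to the caller in document order
  }
}

// Sample data: the second page title is an assumption for this example.
var lessonXml =
  '<?xml version="1.0"?>' +
  '<lesson number="5" author="Brogden">' +
  '<title>Markup Languages</title>' +
  '<page pagetitle="What is a Markup Language?"><paragraph>Sample text</paragraph></page>' +
  '<page pagetitle="HTML Tags and Elements"><paragraph>Sample text</paragraph></page>' +
  '</lesson>';

// Build a table of contents by keeping only the part we care about.
var contents = [];
scanStartTags(lessonXml, function (name, attributes) {
  if (name === "page") {
    contents.push(attributes.pagetitle);
  }
});
console.log(contents.join("\n"));
```

The callback never sees the whole document at once; it decides what to keep as each tag goes by, which is why this style suits extracting a small piece from a large file.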

In the building objects approach, your program is handed an object that contains the whole document, organized so you can rapidly locate any part of it. This is essentially the document object model approach used in JavaScript manipulation of a Web page. It's handy when your program has to access various parts of the document repeatedly; for example, when creating and editing.
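
For comparison, the building objects approach can be sketched as a function that returns a tree of nested objects. parseToTree and childrenNamed are hypothetical names, and this toy version assumes simple, well-formed input with no entities, CDATA sections, or namespaces; a real program would use an existing XML library instead. The paragraph text in the sample data is a placeholder.

```javascript
// parseToTree is a hypothetical toy parser that returns the root element as
// an object of the form { name, attributes, children, text }.
function parseToTree(xml) {
  // Remove the XML declaration and comments before tokenizing.
  var text = xml.replace(/<\?[\s\S]*?\?>/g, "").replace(/<!--[\s\S]*?-->/g, "");
  var tokenPattern =
    /<(\/?)([A-Za-z][\w\-]*)((?:\s+[\w\-]+\s*=\s*"[^"]*")*)\s*(\/?)>|([^<]+)/g;
  var root = { name: "#document", attributes: {}, children: [], text: "" };
  var stack = [root]; // the element currently being filled is on top
  var token;
  while ((token = tokenPattern.exec(text)) !== null) {
    var current = stack[stack.length - 1];
    if (token[5] !== undefined) {
      current.text += token[5].trim(); // character data between tags
    } else if (token[1] === "/") {
      if (stack.length > 1) {
        stack.pop(); // closing tag: return to the parent element
      }
    } else {
      var node = { name: token[2], attributes: {}, children: [], text: "" };
      var attrPattern = /([\w\-]+)\s*=\s*"([^"]*)"/g;
      var attr;
      while ((attr = attrPattern.exec(token[3])) !== null) {
        node.attributes[attr[1]] = attr[2];
      }
      current.children.push(node);
      if (token[4] !== "/") {
        stack.push(node); // not an empty tag, so later tokens nest inside it
      }
    }
  }
  return root.children[0];
}

// Return the child elements of a node that have the given name.
function childrenNamed(node, name) {
  return node.children.filter(function (child) { return child.name === name; });
}

var lesson = parseToTree(
  '<?xml version="1.0"?>' +
  '<lesson number="5" author="Brogden">' +
  '<title>Markup Languages</title>' +
  '<page pagetitle="What is a Markup Language?">' +
  '<paragraph>Sample text</paragraph></page>' +
  '</lesson>'
);
console.log(lesson.attributes.author);
console.log(childrenNamed(lesson, "title")[0].text);
console.log(childrenNamed(lesson, "page")[0].attributes.pagetitle);
```

Once the tree exists, any part of it can be visited as often as needed without re-reading the document, at the cost of holding the whole thing in memory.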

XML and the Future of Programming

It's safe to say that in the future, almost all programmers will need to know the basics of processing XML, and almost all creators of content for the Web will need to know the basics of creating XML documents.

Web Services

If you pay any attention to the business news, you know that something called Web Services is a hot topic. The idea with a Web Service is that a company provides a special sort of Web server that can respond automatically to requests for particular kinds of data. Both the request and the response are formatted as XML documents that can be read and interpreted by programs.

For example, suppose vendors of auto parts provide their entire catalog as Web Services. You could have a program that automatically searches all vendors for the best price and availability for a particular part. By using XML, it is easy for both humans and programs to understand the process.
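
As a rough illustration of the auto parts scenario, the sketch below builds a request document and picks the lowest price from two responses. Everything specific in it is an assumption made up for this example: the priceRequest and priceResponse vocabularies, the function names, the vendor names, and the prices. A real Web Service would define its own formats, and the responses would arrive over the network rather than from an array.

```javascript
// Escape the characters that have special meaning in XML text and attribute values.
function escapeXml(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// buildPriceRequest and the <priceRequest> vocabulary are hypothetical.
function buildPriceRequest(partNumber) {
  return '<?xml version="1.0"?>' +
    '<priceRequest><part number="' + escapeXml(partNumber) + '"/></priceRequest>';
}

// Pull the vendor and price out of a hypothetical <priceResponse> document.
function readPriceResponse(xml) {
  var match = /<priceResponse vendor="([^"]*)" price="([^"]*)"\s*\/>/.exec(xml);
  return match ? { vendor: match[1], price: parseFloat(match[2]) } : null;
}

// Made-up responses standing in for what two vendors might send back.
var responses = [
  '<priceResponse vendor="Vendor A" price="19.95"/>',
  '<priceResponse vendor="Vendor B" price="17.50"/>'
];
var best = responses.map(readPriceResponse).reduce(function (a, b) {
  return b.price < a.price ? b : a;
});

console.log(buildPriceRequest("AB-100"));
console.log(best.vendor);
```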

XML versus EDI

Corporations have been using EDI (Electronic Data Interchange) formats to send documents such as bids, orders, and invoices back and forth for years. The big problem with EDI formats has been that because they were created when transmission speeds were slow and long distance connectivity was expensive, data gets transmitted in a compact form using esoteric codes that can't be read by humans. Now that connectivity is cheap and transmission rapid, it makes more sense to use human-readable XML formats.

XML-Enabled Office Applications

Microsoft has announced that the applications in the 2003 generation of Microsoft Office will be capable of reading, writing, and processing XML documents, abandoning the previous proprietary formats. This move by Microsoft is necessary because more and more businesses are using XML as a standard for document storage and interchange. Having corporate data in XML format makes it easy to communicate with your customers through Web Services or to generate the content of your corporate Web site.

The Open Office project, which provides free programs for typical office productivity applications, provides the option of using XML for documents. For users of Linux and those who don't want to pay Microsoft's licensing fees, the Open Office project is a viable alternative.

Moving On

This lesson introduced you to features of the markup languages HTML and XML, and showed how markup languages are intimately involved with today's programming tools. You also got an indication of how marked up documents can be treated as objects. To understand modern programming, you have to know something about markup languages. The next lesson delves into storing data in memory as well as file manipulation and memory management.

Before you move on, be sure to complete the assignment and quiz for this lesson. Don't forget to drop by the Message Board to see what your fellow students have to say.