Jay Greenspan



Related Topics: XML Magazine


XML Flashback: Understanding XML, Anno 1996


  • Read this article at WebMonkey.com in its original form
  • Visit Jay Greenspan's Trans-City.com site 

    In his early days at Intel, Andy Grove was approached by an employee who suggested the company start work on a personal computer based on its chips. Skeptical, he asked what a personal computer might do. The employee, searching for a good example, said it could be used to store recipes. Grove thought about the millions he'd have to spend on research, development, and marketing, then considered the imperfect but steady quality of an alphabetized loose-leaf binder. He finally passed on the idea and decided to concentrate on the lucrative business of supplying chips for traffic lights.

    What a maroon. Any dolt should have been able to recognize the potential when it was presented. Right? Probably not. Andy Grove, no matter what you think of him, has proven to be a fairly bright guy - and certainly capable of making decisions that profit his company. But in the 1970s, it was impossible for him to envision the potential of a personal computer. If he could have traveled forward in time and seen Excel, Quark, Photoshop, Oracle, or the current use of the Net, he would have understood that putting a powerful processor on the desktop would eventually allow for software to be written for nearly everything.

    But how, without having seen it at work, would you explain "everything"? With a typewriter, adding machine, and pencil as your basis for comparison, how would you explain the PC and its uses?

    Similar problems come up in trying to explain eXtensible Markup Language. It's not really like anything else out there, so there isn't a good comparison to be made. That's why the forced metaphors are so uninformative. You may have heard that XML is the replacement for HTML or that XML is like HTML, except that you make up your own tags. Both of these statements are more or less accurate, but only in the same way that "a PC is a recipe repository" might be true.

    So then what exactly is XML? I think I can best explain it by telling about my latest money-making idea. Read on.

    My mother is a killer cook, and I'm sure that if I position her recipes right, I - uh, I mean my family - could make a truckload of money.

    I need to start cheap. So I open up my text editor and start coding some HTML. Real quicklike, I type up a file:


    <HTML>

    <H1 ALIGN=CENTER>Recipe</H1>

    <FONT FACE size=2>Chocolate Chip Bars</FONT>


    After adding a couple dozen more lines, I manage to throw together a functional page that shows off my mother's culinary mastery. Then what do I have? A plain old measly Web page. What am I going to do when the Joy of Cooking people approach me, looking for a last-minute entry? Send them to my URL and ask them to strip out the <P>s and <FONT FACE size=2>s? That's going to take time, and I want to dive right into contract negotiations.

    My life would be a lot easier if XML were fully implemented. Take a look at some potential XML markup:


          <author>Carol Schmidt</author>

          <recipe_name>Chocolate Chip Bars</recipe_name>


    In XML, tags can be invented that best describe the contents. This way, I can be pretty certain that anyone searching for an occurrence of "Chocolate Chip" within a <recipe_name> tag would come across mom's recipe. Furthermore, if my information was surrounded by tags like these - tags that make sense - I could tell other programs what to do with them. With just a touch of coding, I could pull the contents of the <recipe_name> tag into an appropriate database field, then output it to a hard copy for my book. Or better yet, I could use an XML-enabled word processor to make paper publishing of this information a breeze.
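
    As a minimal sketch of that "touch of coding," assuming Python and its standard-library xml.etree.ElementTree module (neither is part of the original example), the search for "Chocolate Chip" inside a <recipe_name> tag might look like this. The enclosing <recipe> element and the find_recipe_names helper are hypothetical, added only so the fragment has a single root and something to call:

```python
import xml.etree.ElementTree as ET

# Hypothetical recipe markup using the tags invented above,
# wrapped in a <recipe> element so the fragment has a single root.
RECIPE_XML = """
<recipe>
  <author>Carol Schmidt</author>
  <recipe_name>Chocolate Chip Bars</recipe_name>
</recipe>
"""


def find_recipe_names(xml_text, phrase):
    """Return the text of every <recipe_name> element that contains `phrase`."""
    root = ET.fromstring(xml_text)
    return [
        element.text
        for element in root.iter("recipe_name")
        if element.text and phrase in element.text
    ]


matches = find_recipe_names(RECIPE_XML, "Chocolate Chip")
print(matches)
```

    The returned strings could just as easily be written to a database field as printed; the point is only that the tag name tells the program which piece of text it is looking at.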

    That's what XML is about: using markup that is readable by both humans and machines. When that occurs, really good things start to happen. But before we get to that, you should have an understanding of what's involved in coding XML.

    An XML file is judged on two criteria: it must be well formed, and, if it is checked against a DTD, it must also be valid. We'll start by making a well-formed document.

    I invented some tags that describe a complete recipe, and then went about structuring them in a way that's reasonable and readable. This may not be the best possible markup, but it will work for this example.


    <?xml version="1.0"?>

    <list>

       <recipe>

          <author>Carol Schmidt</author>

          <recipe_name>Chocolate Chip Bars</recipe_name>

          <meal>Dinner

             <course>Dessert</course>

          </meal>

          <ingredients>

             <item>2/3 C butter</item>

             <item>2 C brown sugar</item>

             <item>1 tsp vanilla</item>

             <item>1 3/4 C unsifted all-purpose flour</item>

             <item>1 1/2 tsp baking powder</item>

             <item>1/2 tsp salt</item>

             <item>3 eggs</item>

             <item>1/2 C chopped nuts</item>

             <item>2 cups (12-oz pkg.) semi-sweet choc. chips</item>

          </ingredients>

          <directions>

    Preheat oven to 350 degrees. Melt butter;
    combine with brown sugar and vanilla in large mixing bowl.

    Set aside to cool.  Combine flour, baking powder, and salt; set aside.

    Add eggs to cooled sugar mixture; beat well.  Stir in reserved dry
    ingredients, nuts, and chips.

    Spread in greased 13-by-9-inch pan. Bake for 25 to 30 minutes until golden
    brown; cool.  Cut into squares.

          </directions>

       </recipe>

    </list>


    There you go. One perfectly acceptable XML document. It's not really very exotic, just a series of tags with corresponding closing tags, but a quick look will give you a pretty good idea of what XML is all about: structuring data in a way that makes sense.

    Though the tags look a lot like HTML, there is, of course, one important difference. In this file there is no information to indicate how this data should be presented. Layout instructions, when we're ready, will come from elsewhere. It's the same principle behind putting your address book in fields and records of a database instead of a list in a word processor. The database gives you the ability to merge your addresses into labels, envelopes, letters, or whatever you want. Eventually, that's what you're going to do with this recipe file, merge it into a presentational language like HTML or CSS.
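
    To make the "merge" idea concrete, here is a rough sketch, again assuming Python's standard library rather than any tool named in this article. The recipe_to_html function and the particular HTML layout are illustrative assumptions; the recipe data is a shortened copy of the file above:

```python
import xml.etree.ElementTree as ET
from html import escape

# A shortened copy of the recipe markup; it carries no layout information.
RECIPE_XML = """<recipe>
  <author>Carol Schmidt</author>
  <recipe_name>Chocolate Chip Bars</recipe_name>
  <ingredients>
    <item>2/3 C butter</item>
    <item>2 C brown sugar</item>
  </ingredients>
</recipe>"""


def recipe_to_html(xml_text):
    """Render one <recipe> as an HTML fragment.

    All presentation decisions (heading, byline, bullet list) live here,
    not in the XML.
    """
    recipe = ET.fromstring(xml_text)
    lines = [
        "<h1>%s</h1>" % escape(recipe.findtext("recipe_name", default="")),
        "<p>By %s</p>" % escape(recipe.findtext("author", default="")),
        "<ul>",
    ]
    for item in recipe.iter("item"):
        lines.append("  <li>%s</li>" % escape(item.text or ""))
    lines.append("</ul>")
    return "\n".join(lines)


html_page = recipe_to_html(RECIPE_XML)
print(html_page)
```

    A second function could turn the same data into mailing labels or a print layout without touching the XML, which is the database-style separation described above.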

    As I said, an XML document must be well formed. This means the file must follow three basic rules:

      • The document starts with an XML declaration, <?xml version="1.0"?>.
      • There is a root element in which all others are contained: the <list> and </list> tags from the above code, for example.
      • All elements must be properly nested. No overlapping is permitted.

    In my example above, there are several <item> elements properly nested within the <ingredients> and </ingredients> tags. The following markup, however, would be a serious problem:


    <ingredients><item></ingredients>chocolate chips</item>


    Here the item "chocolate chips" has not been closed within the ingredients list. Therefore, the document is no longer well formed. This might not be a big deal in HTML because the browsers have been designed to handle it.

    But in XML, such errors are fatal.

    This is not my way of being dramatic; it's the terminology adopted by the writers of the XML spec. A file that is not well formed will create what is dubbed a "fatal error." This means, essentially, that applications will refuse to process the file.
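
    As a small illustration of that refusal, assuming Python's standard-library parser as a stand-in for "applications" in general, the overlapping markup above is rejected outright while the properly nested version is accepted. The is_well_formed helper is a hypothetical name used only for this sketch:

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False on a fatal error."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        # The parser stops here; no partial tree is handed back.
        return False
    return True


good = "<ingredients><item>chocolate chips</item></ingredients>"
bad = "<ingredients><item></ingredients>chocolate chips</item>"

print(is_well_formed(good))
print(is_well_formed(bad))
```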

    From the above example, being well formed seems like a piece of cake, and it pretty much is. But it can get a tad more involved. Let's take a closer look.

    OK, so my mom's list is going to have dozens of recipes. Hundreds even. That list will be a bear to troubleshoot if it produces a fatal error. Going through line after line of code looking for a missing closing tag will cut into contract-negotiations time. If I decided on a format that had several additional layers of nesting, finding errors would be really nasty.

    There is excellent help available, however. Parsers, applications that examine XML code and report forming errors, are available for free on the Web. The most visible of these is Lark, written by Tim Bray, the technical editor and head cheerleader of the XML specification, and one of the smarter men on earth.

    I passed the document below, complete with forming error, through Lark. Notice that the item "chocolate chips" and its closing tag appear on the wrong side of the </ingredients> tag:


    <?xml version="1.0"?>

    <list>

       <recipe>

          <author>Carol Schmidt</author>

          <recipe_name>Chocolate Chip Bars</recipe_name>

          <meal>Dinner

             <course>Dessert</course>

          </meal>

       <ingredients>

             <item>2/3 C butter</item>

             <item>2 C brown sugar</item>

             <item>1 tsp vanilla</item>

             <item>1 3/4 C unsifted all-purpose flour</item>

             <item>1 1/2 tsp baking powder</item>

             <item>1/2 tsp salt</item>

             <item>3 eggs</item>

             <item>1/2 C chopped nuts</item>

             <item>

          </ingredients>2 cups (12-oz pkg.) semi-sweet choc.

    chips</item>

          <directions>

    Preheat oven to 350 degrees. Melt  butter;
    combine with brown sugar and vanilla in large mixing bowl.

    Set aside to cool.  Combine flour, baking powder, and salt; set aside.

    Add eggs to cooled sugar mixture; beat well.  Stir in reserved dry
    ingredients, nuts, and chips.

    Spread in greased 13-by-9-inch pan. Bake for 25 to 30 minutes
    until golden brown; cool.  Cut into squares.

          </directions>

       </recipe>

    </list>


    Here's what the parser returns:


    Error Report
    Line 17, column 22: Encountered </ingredients> expected </item>

     ... assumed </item>

    Line 18, column 36: Encountered </item> with no start-tag.
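
    A comparable report can be coaxed out of other parsers. The sketch below assumes Python's standard-library xml.etree.ElementTree rather than Lark, and runs it on a shortened, hypothetical version of the broken file; the first_error helper is an illustrative name, not part of any library:

```python
import xml.etree.ElementTree as ET

# A shortened version of the broken file: the last <item> is closed
# on the wrong side of </ingredients>.
BROKEN_LINES = [
    "<list>",
    "  <recipe>",
    "    <ingredients>",
    "      <item>3 eggs</item>",
    "      <item>",
    "    </ingredients>2 cups semi-sweet choc. chips</item>",
    "  </recipe>",
    "</list>",
]


def first_error(xml_text):
    """Return (line, column, message) for the first fatal error, or None."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        line, column = err.position
        return line, column, str(err)
    return None


report = first_error("\n".join(BROKEN_LINES))
print(report)
```

    Unlike the Lark report above, this parser gives up at the first problem instead of assuming a closing tag and carrying on, so it reports only the stray </ingredients> tag.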


    With this info, finding the error should be no problem, and that's all that's involved in being well formed. Now what about making XML files valid?

    Eventually we're going to do something with the information in our well-formed XML document. In fact, we're going to do lots of things with it. But a danger still lurks: There is a possibility that this XML file, though well formed, is missing key information. Take a look at the following:


    <recipe>

       <author>Carol Schmidt</author>

       <recipe_name>Chocolate Chip Bars</recipe_name>

       <meal>Dinner

         <course>Dessert</course>

       </meal>

       <ingredients>

       </ingredients>

       <directions>Melt  butter; combine with, etc. ... </directions>

    </recipe>


    If this were one entry of several dozen it might be tough to notice that this recipe consists of no ingredients, and because it's well formed, the Lark parser will have no problem with it. Anyone who's ever administered even the most modest database knows that we humans need to be protected from ourselves at every turn. If given the opportunity, we'll omit crucial information and include extraneous nonsense. That's why the XML creators, being benevolent and understanding of human frailty, included the Document Type Definition, or DTD. The DTD provides a way to make sure the XML is more or less like you want it.

    Let's take a look at a DTD that could be used for the recipe file.

     


    <!DOCTYPE list [

       <!ELEMENT list (recipe+)>

       <!ELEMENT recipe (author, recipe_name, meal, ingredients, directions)>

       <!ELEMENT ingredients (item+)>

       <!ELEMENT meal (#PCDATA|course)*>

       <!ELEMENT item (#PCDATA|sub_item)*>

       <!ELEMENT recipe_name (#PCDATA)>

       <!ELEMENT author (#PCDATA)>

       <!ELEMENT course (#PCDATA)>

       <!ELEMENT sub_item (#PCDATA)>

       <!ELEMENT directions (#PCDATA)>

    ]>

     


    It's not reader-friendly at first, but when deconstructed it makes sense. Let's go over this in greater detail:

    <!DOCTYPE list [

    This line says that whatever is inside the brackets is the DTD for a document with root element <list>. As mentioned earlier, the root element contains all other elements.

    <!ELEMENT list (recipe+)>

    This line declares the root element itself. The plus sign says that a <list> must contain one or more <recipe> elements.

    <!ELEMENT recipe (author, recipe_name, meal, ingredients, directions)>

    This line defines the <recipe> tag. The parentheses say that these five other sets of tags must appear inside the <recipe> tags, in that particular order.

    <!ELEMENT meal (#PCDATA|course)*>

    This line requires a bit more explanation. I've allowed the following construction:


    <meal>Here's some text that describes the meal

       <course>A course element may appear after the text

    </course>

    </meal>


    I've done this because, to my way of thinking, lunch items don't need to be specified in terms of courses, but with dinner items, one might want to indicate appetizer, main course, or dessert. This line accomplishes that by first allowing #PCDATA, which stands for parsed character data (i.e., anything other than binary data, such as an image). In this case, the #PCDATA will be text - for instance, "dinner."

    The asterisk following the closing parenthesis says that zero or more pieces of text and <course> tags may appear within the <meal> tags, in any order. If you think about it for a minute, you might see that this isn't the perfect definition for this data: it would also allow several courses, or no text at all. But we'll have to live with it. DTDs do have their limitations, but that's the subject for another article. Let's move on.
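
    To see what this mix of text and tags looks like to a program, here is a minimal sketch assuming Python's standard-library xml.etree.ElementTree, which does not read the DTD at all and is used here only to show how the parsed pieces come apart:

```python
import xml.etree.ElementTree as ET

MEAL_XML = """<meal>Dinner
  <course>Dessert</course>
</meal>"""

meal = ET.fromstring(MEAL_XML)

# Text that appears directly inside <meal>, before the first child element.
meal_text = (meal.text or "").strip()

# Zero or more <course> children, as the asterisk allows.
courses = [course.text for course in meal.findall("course")]

print(meal_text, courses)

# A lunch entry with no <course> at all is handled the same way.
lunch = ET.fromstring("<meal>Lunch</meal>")
lunch_courses = [course.text for course in lunch.findall("course")]
print(lunch.text, lunch_courses)
```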

    OK, now what about this:

    <!ELEMENT ingredients (item+)>

    Here the plus sign requires that one or more sets of <item> tags occur within the <ingredients> tag.
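
    Python's xml.etree.ElementTree does not validate against a DTD, so the following is only a hand-rolled stand-in for this one rule, written to show what "one or more" means in practice. The check_ingredients function and its messages are hypothetical, and it ignores everything else a real validating parser would check:

```python
import xml.etree.ElementTree as ET


def check_ingredients(recipe_xml):
    """Hand-check the single rule <!ELEMENT ingredients (item+)>.

    Returns a list of problems; an empty list means the rule is satisfied.
    This is not a DTD validator: it looks only at the child elements of
    each <ingredients> element.
    """
    recipe = ET.fromstring(recipe_xml)
    problems = []
    for ingredients in recipe.iter("ingredients"):
        children = list(ingredients)
        if not children:
            problems.append("ingredients has no item elements")
        for child in children:
            if child.tag != "item":
                problems.append("unexpected <%s> inside ingredients" % child.tag)
    return problems


empty = "<recipe><ingredients></ingredients></recipe>"
filled = "<recipe><ingredients><item>3 eggs</item></ingredients></recipe>"

print(check_ingredients(empty))
print(check_ingredients(filled))
```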

    The final line of interest is:

    <!ELEMENT item (#PCDATA|sub_item)*>

    I've added sub_item as a safeguard. In addition to allowing text for each item, I want to allow for an accounting of each item's contents.

    Now let's put it all together and see what we get.

    Here's a complete example. For good measure, I've added another recipe to the body of the file and comments to the DTD. Notice that I've used sub-items in the second recipe.


    <?xml version="1.0"?>

    <!--This starts the DTD. The first five declarations address document structure

    -->

     


    <!DOCTYPE list [

       <!ELEMENT list (recipe+)>

       <!ELEMENT recipe (author, recipe_name, meal, ingredients,
    directions)>

       <!ELEMENT ingredients (item+)>

       <!ELEMENT meal (#PCDATA|course)*>

       <!ELEMENT item (#PCDATA|sub_item)*>

       <!--These are the remaining elements of the recipe tag -->

       <!ELEMENT recipe_name (#PCDATA)>

       <!ELEMENT author (#PCDATA)>

       <!ELEMENT directions (#PCDATA)>

       <!--The remaining element of the meal tag -->

       <!ELEMENT course (#PCDATA)>

       <!--The remaining element of the item tag -->

       <!ELEMENT sub_item (#PCDATA)>

    ]>

    <list>

       <recipe>

          <author>Carol Schmidt</author>

          <recipe_name>Chocolate Chip Bars</recipe_name>

          <meal>Dinner

             <course>Dessert</course>

          </meal>

          <ingredients>

             <item>2/3 C butter</item>

             <item>2 C brown sugar</item>

             <item>1 tsp vanilla</item>

             <item>1 3/4 C unsifted all-purpose flour</item>

             <item>1 1/2 tsp baking powder</item>

             <item>1/2 tsp salt</item>

             <item>3 eggs</item>

             <item>1/2 C chopped nuts</item>

             <item>2 cups (12-oz pkg.) semi-sweet
    choc. chips</item>

          </ingredients>

          <directions>

    Preheat oven to 350 degrees. Melt  butter; combine
    with brown sugar and vanilla in large mixing bowl.

    Set aside to cool.  Combine flour, baking powder, and
    salt; set aside.

    Add eggs to cooled sugar mixture; beat well.  Stir in
    reserved dry ingredients, nuts, and chips.

    Spread in greased 13-by-9-inch pan. Bake for 25 to 30
    minutes until golden brown; cool.  Cut into squares.

          </directions>

       </recipe>

       <recipe>

         <author>Carol Schmidt</author>

         <recipe_name>Pasta with Tomato Sauce</recipe_name>

         <meal>Dinner

            <course>Entree</course>

         </meal>

         <ingredients>

            <item>1 lb spaghetti</item>

            <item>1 16-oz can diced tomatoes</item>

            <item>4 cloves garlic</item>

            <item>1 diced onion</item>

            <item>Italian seasoning

               <sub_item>oregano</sub_item>

               <sub_item>basil</sub_item>

               <sub_item>crushed red pepper</sub_item>


            </item>

         </ingredients>

         <directions>

    Boil pasta. Saute garlic and onion. Add tomatoes.
    Serve hot.

         </directions>

      </recipe>

    </list>


    Now that there is a DTD, this document can be checked to see that everything within the tags adheres to the limitations imposed by the DTD. In other words, we can make sure the document is valid.

    To accomplish this, we need to pull out another tool: the validating parser. Microsoft's IE5 comes with a pretty good validating parser built in. Just save your document with a .xml extension, open it in IE5, and it will be validated. But if I try to pass a recipe where the ingredients tags contain no items, as in the empty-ingredients example shown earlier, I get the following message:

    ingredients is not complete. Expected elements [item].

    And that about wraps it up.

    So now you know XML. Sure, structures can and will get a whole lot more complex, and DTDs have all kinds of options to specify more precisely what a document can contain. But that's pretty much it.

    The benefits to my ingenious cash-generating plan are significant, but there are other XML functions that are, arguably, even more important.

    Consider an industry where interchange of data is vital, such as banking. Banks use proprietary systems to track transactions internally, but if they use a common XML format over the Web, then they'd be able to describe transaction information to another institution or an application (like Quicken or MS Money). Of course, they'd also be able to present the data in a pretty Web page. FYI: This markup does exist. It's called OFX, the Open Financial Exchange format.

    Under certain circumstances, if IE 4 on the PC comes across a <SOFTPKG> tag with the proper contents, a function is started that gives a user the opportunity to update installed software. If you're using Windows 98, it's possible that you've seen this process in action without knowing it was an XML application.

    Here we have three applications of XML that seem as different as an adding machine, a typewriter, and a pencil would have seemed to Andy Grove in the 1970s. But like the applications that eventually came to the PC, XML's benefits can also be described in a very general statement, like: "When you use human- and machine-readable tags to describe your data, good things will happen."

    What are all these good things? I have no idea. But I also have no idea what the next great program for my PC is going to be. As long as data is marked up in this way, different uses can be invented for it.

    Are you starting to get an idea of what extensibility is all about?

    There are some concrete uses of XML we can talk about, and I'll be discussing them in the near future. Since we're all Web people, the next one will be about eXtensible Style Language (XSL).

    By the way, that recipe really is my mom's, and it's outstanding. If you're into it, add half a cup of shredded coconut.


    A consultant, writer, and editor, Jay Greenspan is co-author of a book detailing applications development with PHP and MySQL. For two years he produced Wired Digital's Webmonkey, and has contributed to the sites run by Zend Technologies, Wired News, guru.com, and atomz.com. He is a contributing author for Apple Developer Connection and principal of Trans-City.com, a service providing marketing text and documentation for companies that need to reach a developer audience.
