BigAdmin System Administration Portal

HowTos

This content is archived from Sun's Dot-Com Builder Web Site.
These are the Best Practices > How To's archives.

Storing XML Data
January 7, 2002

by Todd Sundsted

Before you decide how to store your XML data, ask yourself why you want to store your XML data. You might think the answer to this question is self-evident. After all, storage is storage, right? Not quite.

There are many different ways to store XML data, and there are several ways to use the stored XML data. Asking questions like the following will help you select a storage strategy that meets your needs:

  • Do you need long-term persistence?
  • Do you intend to run queries against the stored data?
  • Do you need to retrieve the XML data in exactly the same form as it was stored?

The answers to these questions may reveal that your XML data storage solution doesn't have to be fancy -- a file system might suffice. Consider XML data from which Web pages are generated via an XSLT transformation. If the XML data is static, a file system may provide perfectly adequate storage, as it does for HTML pages. "Well, that's obvious!" you might object. "But what about XML data used in a B2B transaction? You can't simply store that on disk."

Perhaps you can. The storage mechanism you select depends on how you will use the XML data. So again, ask yourself why you need to store the B2B XML data. If you need it for long-term bulk persistence rather than to support queries, you'll require a different type of storage technology than you would if queries were the goal.

Consider a scenario that involves an XML router. A router accepts messages from one entity that are intended for delivery to another entity, and typically guarantees one-time delivery. To guarantee delivery even if the router itself fails, the router must persistently store the messages it receives. However, since a router does not need to query the stored data, the persistence mechanism can be very simple. The file system is an extremely simple but completely adequate solution.
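To make the router scenario concrete, here is a minimal sketch of a file-system spool in Python. The function names (store_message, pending_messages, mark_delivered) and the one-file-per-message layout are assumptions made for illustration, not part of any particular router product. The sketch covers only persistence across a crash; it does not by itself guarantee one-time delivery.

```python
import os
import tempfile
import uuid
from pathlib import Path


def store_message(spool_dir: Path, xml_bytes: bytes) -> Path:
    """Durably write one message to its own file and return the final path."""
    spool_dir.mkdir(parents=True, exist_ok=True)
    final_path = spool_dir / f"{uuid.uuid4().hex}.xml"
    # Write to a temporary file in the same directory, then rename it, so a
    # crash never leaves a half-written message under its final name.
    fd, tmp_name = tempfile.mkstemp(dir=str(spool_dir), suffix=".tmp")
    with os.fdopen(fd, "wb") as tmp:
        tmp.write(xml_bytes)
        tmp.flush()
        os.fsync(tmp.fileno())  # ask the OS to push the bytes to disk
    os.replace(tmp_name, final_path)
    return final_path


def pending_messages(spool_dir: Path) -> list[Path]:
    """List messages still awaiting delivery, for example after a restart."""
    return sorted(spool_dir.glob("*.xml"))


def mark_delivered(path: Path) -> None:
    """Remove a message once the recipient has acknowledged it."""
    path.unlink()
```

Notice that nothing here parses the XML. The router treats each message as opaque bytes, which is exactly why the file system is enough.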

Before you immerse yourself further in the technology, take a look at the different types of XML information and then revisit the question of how XML information can be used.

The Fundamental Distinction: Data versus Document

The most fundamental distinction that can be drawn between types of XML content is whether the XML content is to be used as data or as a document.

An XML document is information generated by people for use by people. It consists of both text and markup. This is XML content being used as it was originally intended to be used. Since XML documents contain information intended for our consumption, the structure of the information tends to be loose and irregular, much like our writing and speech.

XML data, on the other hand, is information generated by machines for machines. This is XML content performing the role of a platform-independent data-exchange message. The information source is seldom static, and the XML content itself is highly regular and structured, as is appropriate for information coming out of a structured information repository or report-generation tool.

How Will the Data Be Used?

The essential difference between document and data concerns both the origin of the XML content and its intended use -- presentation versus data transport. Let's consider that use in more detail.

Ask yourself the following questions:

  • Will the stored XML document ever be retrieved as an XML document?
  • If it is retrieved as an XML document, will it be retrieved in the same form or in a different form?

You may not need to retrieve stored content as an XML document if you're planning to use traditional tools to generate reports from the data once it's stored. Likewise, data may be written as one kind of XML and read as another -- XML orders from customers may go in, and XML commands for the shipping system may go out. The second question should lead you to think up a few more:

  • What are acceptable modifications to a stored XML document?
  • Is it safe to remove the DTD, comments, and processing instructions?
  • What should be done with the CDATA sections?

These questions are important, because storing XML content as rows in a relational database may result in the retrieved content looking different from the original.
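A small experiment shows the kind of change these questions are about. The sketch below uses Python's standard xml.etree.ElementTree module as a stand-in for any store that keeps only the parsed content; the sample note document is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Invented sample: it has a DTD, a comment, and a CDATA section.
original = """<?xml version="1.0"?>
<!DOCTYPE note [<!ELEMENT note (#PCDATA)>]>
<!-- reviewed by shipping -->
<note><![CDATA[5 < 7]]></note>"""

# Parse, then serialize again, as a store that keeps only elements,
# attributes, and text would do on retrieval.
root = ET.fromstring(original)
round_tripped = ET.tostring(root, encoding="unicode")

print(round_tripped)
```

With the default parser, the text content survives (root.text is still "5 < 7"), but the DOCTYPE and the comment are gone and the CDATA section comes back as ordinary escaped text. Whether that matters depends entirely on your answers to the questions above.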

What about queries? Does your application need to query the information in an XML document? If so, you won't be able to store the XML document as an opaque chunk of bytes (a "BLOB" in database terminology). What kind of queries will you need to support? XPath-like queries? Relational queries?

Considering queries is important because different storage technologies vary greatly in their support for queries, as you'll learn next.

Before you can select a storage solution, you need answers to questions like the previous ones. Asking these questions -- and answering them -- is a step you should not skip.

Your Options

XML data is used in many ways. Luckily, there are nearly as many ways to store it -- each with associated benefits and costs. What follows is a summary of options currently available, along with the benefits provided and the pitfalls associated with each. It's not a product comparison, but it should give you enough information to make an educated choice:

  • Storage as BLOBs
  • Tables in Relational Databases
  • Object Databases
  • Native XML Databases

Storage as BLOBs
A BLOB is a binary large object -- an opaque bag of bytes containing data. In the database world, from which the term "BLOB" came, a BLOB is unstructured information -- it can be manipulated as an opaque object, but cannot be inspected, queried, and so on. From a practical standpoint, a file in a file system is a BLOB. In fact, any storage mechanism that treats XML data as a single, large entity and does not expose the XML at a greater level of detail fits the category.

BLOBs are easy to use -- what you put in is what you get back -- but they don't permit access to the details of the data stored. If you plan to store XML as a BLOB, store it as a file in the file system. In most cases this solution will perform better than storing the data as a BLOB in a relational database.
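The "what you put in is what you get back" property is easy to see in a sketch. This one uses SQLite through Python's standard sqlite3 module purely for illustration; the documents table and its column names are assumptions, and the same behavior holds for a plain file, which is the option recommended above.

```python
import sqlite3

# Invented sample document, including a comment that must survive storage.
xml_bytes = (
    b'<?xml version="1.0"?>\n'
    b"<!-- kept exactly as written -->\n"
    b'<order id="42"><item sku="A1"/></order>'
)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (name TEXT PRIMARY KEY, body BLOB)")
conn.execute("INSERT INTO documents VALUES (?, ?)", ("order-42.xml", xml_bytes))

# The only handle on the document is its name; the database knows nothing
# about the elements and attributes inside the body column.
(stored,) = conn.execute(
    "SELECT body FROM documents WHERE name = ?", ("order-42.xml",)
).fetchone()

identical = stored == xml_bytes
print(identical)
```

The bytes come back unchanged, comment and all, but a question such as "which orders contain item A1?" can only be answered by fetching and parsing every document yourself.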

Tables in Relational Databases
Even if storage in a relational database is not your first choice, it's worth your time to learn how to approach this technology since relational databases are everywhere.

While storing XML as a BLOB in a relational database has its drawbacks, storing XML information in the tables and rows of a relational database is a solution that has stood the test of time. Relational database technology is mature, stable, and ubiquitous; and there are toolkits that automate much of the work.

This approach is not without its problems, however. There is a fundamental mismatch in the way information is modeled in XML and relational databases. XML models data as a tree or hierarchy of elements. Relationships between data objects are typically indicated by containment. Relational databases model data as tables. Relationships between data objects are captured in foreign keys.

In spite of the mismatch between tables and trees, relational databases work well, particularly when applications need to access the stored data in formats other than XML. Remember, XML is a relatively new data format. For every application that understands XML, there are thousands that do not. However, many of these older applications know how to access the data in a relational database, making the database -- not XML -- the data exchange technology.
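As a rough illustration of the tree-to-table mapping, the following sketch loads a small, invented orders document into two SQLite tables. Containment in the XML (an item inside an order) becomes a foreign key in the items table. The schema and element names are assumptions; a real toolkit would generate or configure this mapping rather than hand-code it.

```python
import sqlite3
import xml.etree.ElementTree as ET

ORDERS_XML = """
<orders>
  <order id="1" customer="acme">
    <item sku="A1" qty="2"/>
    <item sku="B7" qty="1"/>
  </order>
  <order id="2" customer="globex">
    <item sku="A1" qty="5"/>
  </order>
</orders>
"""

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT NOT NULL);
    CREATE TABLE items (
        order_id INTEGER NOT NULL REFERENCES orders(id),
        sku TEXT NOT NULL,
        qty INTEGER NOT NULL
    );
""")

# Each <order> becomes a row; each contained <item> becomes a row that
# points back at its parent order through order_id.
for order in ET.fromstring(ORDERS_XML).findall("order"):
    order_id = int(order.get("id"))
    conn.execute(
        "INSERT INTO orders VALUES (?, ?)", (order_id, order.get("customer"))
    )
    for item in order.findall("item"):
        conn.execute(
            "INSERT INTO items VALUES (?, ?, ?)",
            (order_id, item.get("sku"), int(item.get("qty"))),
        )

# Once the data is relational, any SQL-speaking tool can slice it in ways
# the original hierarchy never anticipated.
totals = conn.execute(
    "SELECT sku, SUM(qty) FROM items GROUP BY sku ORDER BY sku"
).fetchall()
print(totals)
```

The final query totals quantities per SKU across all orders, a view that cuts across the document's order-by-order hierarchy and requires no XML awareness from the application running it.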

Object Databases
In many ways, object databases are a natural match for XML data. First and foremost, object databases more easily model the hierarchical structure implied by XML. This simplifies the problem of getting XML data in and out of the database.

There is a downside, however. Object databases, while useful in specific applications, never replaced relational databases -- though vendors, analysts, and other pundits predicted they would. In fact, the entire object database market is basically a niche market today. Object databases failed to catch on when other object-isms did because relational databases make it easier to look at data in many different ways. Object databases tend to orient themselves toward a limited number of views of the data -- a potential liability when making business decisions.

Native XML Databases
A native XML database stores XML as XML -- at least from an application's perspective. The best-of-breed solutions in this space allow applications to access the stored information through XPath (an XML specification for addressing information in an XML document) and query it with XQuery (an XML specification for extracting information from an XML document) -- a nice plus.

While this level of native support for XML might seem like a boon, it can be limiting. Like object databases, native XML databases have trouble presenting stored data in many different, sometimes ad hoc, forms. The hierarchical model imposed by native XML databases locks applications into a single view of the data, unless you incorporate transformation technology such as XSLT.
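To give a feel for XPath-style access, here is a sketch using Python's standard xml.etree.ElementTree module, which supports only a limited subset of XPath. It is not a native XML database, and the catalog document is invented; the point is only to show what addressing data by its place in the hierarchy looks like.

```python
import xml.etree.ElementTree as ET

CATALOG = ET.fromstring("""
<catalog>
  <book lang="en"><title>XML Basics</title><price>30</price></book>
  <book lang="de"><title>XML Grundlagen</title><price>25</price></book>
  <book lang="en"><title>Storing Data</title><price>45</price></book>
</catalog>
""")

# Path expression: every <title> inside a <book> whose lang attribute is "en".
english_titles = [t.text for t in CATALOG.findall("./book[@lang='en']/title")]
print(english_titles)
```

The query follows the document's own shape: catalog, then book, then title. That is convenient when the shape matches the question you are asking, and it is the source of the single-view limitation described above when it does not.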

Though native XML databases are not as mature as relational and object databases, their future looks bright. Analysts predict the market will double or even triple in the next couple of years. However, it's still not clear whether native XML databases will match relational databases in terms of flexibility. Current products seem most useful when used to store semistructured data such as documents.

Given the role relational databases play in enterprise computing, you will have to contend with storing XML data in a relational database at some point. The next section addresses a couple of issues you will have to confront.

Mismatch Between XML and Relational Models

The mismatch between the XML model and the relational model may look like a tough problem, but it's not difficult to overcome. The industry has a long history of dealing with a very similar clash: the one between object persistence and relational databases.

Mapping Relationships
A related issue concerns mapping the relationships between elements in an XML document to the relationships between tables in a relational database. As mentioned earlier, XML models relationships between objects through containment. If you have the luxury of defining your own XML schema, however, you can use ID and IDREF attributes to model relationships beyond containment. It's often easier to map these types of relationships onto relational database tables.
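The following sketch shows why reference-style relationships map so directly. It assumes an invented document in which the id attributes are declared as type ID and the customer attribute on each order as type IDREF (the declarations themselves would live in a DTD, which is not shown). Each IDREF then becomes a foreign key column, here in SQLite via Python's standard sqlite3 module.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Orders refer to customers by ID instead of being nested inside them.
DOC = ET.fromstring("""
<data>
  <customer id="c1" name="Acme"/>
  <customer id="c2" name="Globex"/>
  <order id="o1" customer="c1"/>
  <order id="o2" customer="c1"/>
  <order id="o3" customer="c2"/>
</data>
""")

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # have SQLite enforce the references
conn.executescript("""
    CREATE TABLE customers (id TEXT PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id TEXT PRIMARY KEY,
        customer_id TEXT NOT NULL REFERENCES customers(id)
    );
""")

# ID attributes become primary keys; IDREF attributes become foreign keys.
for c in DOC.findall("customer"):
    conn.execute("INSERT INTO customers VALUES (?, ?)", (c.get("id"), c.get("name")))
for o in DOC.findall("order"):
    conn.execute("INSERT INTO orders VALUES (?, ?)", (o.get("id"), o.get("customer")))

orders_per_customer = conn.execute("""
    SELECT c.name, COUNT(*)
    FROM orders AS o JOIN customers AS c ON c.id = o.customer_id
    GROUP BY c.name ORDER BY c.name
""").fetchall()
print(orders_per_customer)
```

Because the XML already expresses the relationship as a key and a reference to that key, the loader copies values straight across instead of inventing keys to represent nesting.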

When selecting a storage technology for your XML data, you have a number of options to choose from, each with benefits and costs. Before you select a particular technology, make sure you have asked the necessary questions about how you intend to use the stored XML data and why you need to store it.
