BigAdmin System Administration Portal

HowTos

Archived from Sun's Dot-Com Builder Web Site
This content is archived from Sun's Dot-Com Builder Web Site.
These are the Best Practices > How To's archives.

Some of these pages may contain links that are no longer available. If you see these, you can report it through the Suggestions link and we will remove the link and leave the name (for reference).

Back to Dot-Com Builder How-Tos Archive

How to Build a Data Warehouse
January 31, 2001

by Lori MacVittie

A data warehouse is a subject-oriented, integrated, non-volatile collection of data used to support the corporate decision-making process. It comprises a database and the end-user applications that allow query and reporting on the data stored in the database. A closer look at a typical enterprise data warehousing architecture reveals a more detailed list of components:

The Components

Operational Database/External Database Layer
The goal of building a data warehouse is to offer access to information that is stored in operational databases and to mix it with information from other, often external, sources of data. This information includes demographic, econometric, competitive, and purchasing trends and may be bought from a third-party or collected by other enterprise applications.

Information Access Layer
This is the layer end users deal with directly. It represents the tools that end users utilize day-to-day, such as Excel, Lotus 1-2-3, Access, and SAS. This layer also includes the hardware and software that display and print reports, spreadsheets, graphs, and charts for analysis and presentation. Increasingly, this layer includes Web-based access to the data warehouse.

Data Access Layer
This layer enables the information access layer to talk to the operational layer. Today, almost all applications requiring database access utilize structured query language (SQL). The layer not only spans different database management systems (DBMS) and file systems on the same hardware, but it also spans manufacturers' and network protocols as well. Universal data access is one of the keys to a successful data warehousing strategy. This means users, regardless of their location or the tools used to access the data warehouse, should be able to get at any or all of the data in the enterprise they need to do their jobs.

Data Directory (Metadata) Layer
To provide for universal data access, it is essential to maintain some kind of data directory or repository of metadata information. Metadata describes data within the enterprise. To have a fully functional warehouse, there must be a variety of metadata available -- information about the end users' views of data and about the operational databases. Ideally, users should be able to access data from the warehouse without having to know where that information resides or the form in which it is stored.

Process Management Layer
This layer pertains to scheduling the tasks that go into building and maintaining the data warehouse and data directory information. The process management layer can be thought of as the scheduler or the high-level job control for the many procedures that keep the data warehouse up-to-date.

Application Messaging Layer
This layer transports information around the enterprise computing network. Application messaging is often referred to as "middleware," though it can involve more than just networking protocols. It can be used to collect transactions or messages and deliver them to a certain location at a specific time.

Data Warehouse (Physical) Layer
The data warehouse is where the actual data used primarily for informational purposes resides. In many instances, the data warehouse may not actually contain "live" data but may contain copies of operational and/or external data. This setup prevents modification of mission-critical data and accommodates the need to change the layout of those data tables in the warehouse for performance reasons. Increasingly, data warehouses are stored on client/server platforms, but they are often stored on mainframes as well.

Data Staging Layer
Data staging is the final component of the data warehouse architecture. It is also called copy management or replication management, and in fact it includes all of the processes necessary to select, edit, summarize, combine, and load data warehouse and information access data from operational and/or external databases.

The Process: Planning and Implementation

An enterprise data warehouse is a highly complex environment; development costs are steep and full of risk. A bottom-up development methodology is suggested in conjunction with a step-by-step implementation. A top-down, large development effort is expensive and unsound and is likely to extend the return on investment (ROI) for many years.

The following development phases are commonly accepted as necessary for a successful data warehouse implementation:

Data Warehouse Planning
  • Business Discovery - Identify opportunities to create business benefits. The deliverable should be a detailed analysis containing prioritization of business issues and quantifiable benefits of the data warehouse as bottom-line impact to the business.
  • Information Discovery - Map business requirements and determine potential solutions. Show customers how data can be turned into useful information, as well as how that information may be used to address critical business needs via data modeling.
  • Logical Modeling - Specify the data content for the warehouse. There are several deliverables that should be created during this phase:

    • Validated Business Solution Model - This is a graphical depiction of primary information required by business functions as relational tables. It should show the major data entities, their relationships, and the keys of the entities.
    • Documentation of Business Rules - Entity relationship diagrams, as well as interaction with business rules, should be well documented.
    • Identification of Data Sources - This needs to include both external and internal sources. If new sources will be used, they should be listed here.


  • Architecture Design - Determine the technical and architectural components of the data warehouse. Decisions made in this phase should include the following:

    • Data Warehouse Location - Determine if it will be centralized, distributed, or in data marts.
    • Application Tools - Include tools necessary for loading and applications required by end users.
    • Sources of Data - Document the databases within the organization, as well as possible external sources of data (often demographics from marketing or government contacts).
    • Business Rules and Data Relationships - Provide documentation, often graphically, of the relationships between tables and business entities. Also include descriptions of the basic business rules by which data is accessed.
    • Warehouse Management -- Identify hardware, software, network, support, restore, and archival requirements.

[Page 1]   [Page 2]  


BigAdmin