IN THIS ARTICLE  
  Tobacco Master Settlement Mandates Data Preservation  
  Functional Requirements Pose Data Management Challenge  
  Combining Open Source Software with Infrastructure Investments  
  Effective Planning and Development Deliver Tangible Results  
  Key Tobacco Archive Learnings  

Tobacco Master Settlement Mandates Data Preservation
The 1998 Master Settlement Agreement between the National Association of Attorneys General (NAAG), 5 major tobacco companies, and 2 industry organizations required that millions of documents be made available to the public in electronic format on the companies' Web sites until 2010.

To ensure a lasting public record of this information on tobacco industry research, manufacturing, marketing, advertising, and sales of cigarettes, the American Legacy Foundation established a grant to create a permanent electronic archive and unified search interface. The University of California San Francisco (UCSF) Library and Center for Knowledge Management was awarded the project in 2001 and began building the Legacy Tobacco Documents Library (LTDL) from 40 tapes supplied by the NAAG, which contained the tobacco companies' image and document index files.

Building a massive digital archive with public access to over 40 million pages of tobacco industry data from the largest civil settlement in history required creativity and commitment. Since going live in January 2002, the UCSF Library has normalized and indexed a total of 4 gigabytes of metadata and warehoused over 6.8 million documents. The result is open, free, searchable, worldwide public access 24x7 to extensive tobacco industry and health-related information.

Functional Requirements Pose Data Management Challenge
The demanding functional requirements for the archive and unpredictable data issues posed significant technical challenges for the UCSF Library team. Top priorities for managing the data included:
  • Reading 24 million image files from the tapes and transferring them to target directories on Sun servers and storage devices.

  • Identifying data errors, nonstandard fields, and other inconsistencies in disparate files from multiple sources.

  • Normalizing data by changing file name cases, generating XML, and creating an index for the search engine.

  • Building a search engine that integrates and manages 7 different metadata schemas, delivered in files from different sources, so that searches can be performed across collections through an easy-to-use interface.

  • Automating interrogation of 7 tobacco industry Web sites to harvest the latest documents to add to the archive, including error handling and interpreting nonstandard fields.
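
To make the schema integration priority concrete, the following is a minimal sketch of mapping source-specific field names onto one common record and emitting it as XML. It is an illustration only, not the LTDL code: the collection names, field names, and the `<record>` layout are all hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

final class SchemaFederation {

    // Hypothetical mappings from each collection's field names to common names.
    private static final Map<String, Map<String, String>> FIELD_MAPS = Map.of(
        "collectionA", Map.of("DOCTYPE", "type", "TI", "title", "DT", "date"),
        "collectionB", Map.of("document_type", "type", "doc_title", "title", "doc_date", "date"));

    /** Renames known fields to the common schema and drops unmapped ones. */
    static Map<String, String> toCommon(String collection, Map<String, String> source) {
        Map<String, String> mapping = FIELD_MAPS.get(collection);
        if (mapping == null) {
            throw new IllegalArgumentException("Unknown collection: " + collection);
        }
        // TreeMap keeps the output fields in a stable, alphabetical order.
        Map<String, String> common = new TreeMap<>();
        for (Map.Entry<String, String> field : source.entrySet()) {
            String commonName = mapping.get(field.getKey());
            if (commonName != null) {
                common.put(commonName, field.getValue());
            }
        }
        return common;
    }

    /** Serializes a common record as a flat XML element (assumed layout). */
    static String toXml(Map<String, String> record) {
        StringBuilder xml = new StringBuilder("<record>");
        for (Map.Entry<String, String> field : record.entrySet()) {
            xml.append('<').append(field.getKey()).append('>')
               .append(escape(field.getValue()))
               .append("</").append(field.getKey()).append('>');
        }
        return xml.append("</record>").toString();
    }

    // Escapes the characters that are unsafe in XML element text.
    private static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        Map<String, String> raw = Map.of("doc_title", "R&D memo", "document_type", "memo");
        System.out.println(toXml(toCommon("collectionB", raw)));
    }
}
```

Once every collection is expressed in the same field names, a single index can serve searches across all of them, which is the effect the search engine requirement describes.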

It is hard to gauge the challenge posed by the constantly changing data issues that the UCSF Library team faced. For example, they discovered over 700 different names for document types (such as memo, letter, interview, and fax), 50 different spellings for the word "cigarette," and numerous conventions for document dates. As explained by UCSF Library Programmer Analyst Bob Mason, "No one anticipated the huge scale of this project in terms of harvesting and processing the data, and dealing with so many data errors and inconsistencies. This challenge required ongoing human effort and problem solving by our team beyond just the technology involved."
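
The kind of cleanup described here can be sketched as a lookup table for document type names plus a list of date formats tried in order. This is an assumed approach for illustration, not the team's actual method; the variant names and date patterns below are hypothetical samples, and unrecognized values are returned as empty so they can be routed to human review.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

final class MetadataNormalizer {

    // Hypothetical variant-to-canonical table; the real data had 700+ type names.
    private static final Map<String, String> DOCUMENT_TYPES = Map.of(
        "memo", "memo",
        "memorandum", "memo",
        "letter", "letter",
        "ltr", "letter",
        "fax", "fax",
        "facsimile", "fax");

    // Hypothetical date conventions, tried in the order listed.
    private static final List<DateTimeFormatter> DATE_FORMATS = List.of(
        DateTimeFormatter.ofPattern("yyyyMMdd"),
        DateTimeFormatter.ofPattern("MM/dd/yyyy"),
        DateTimeFormatter.ofPattern("yyyy-MM-dd"));

    /** Returns the canonical type name, or empty if the variant is unknown. */
    static Optional<String> normalizeType(String raw) {
        String key = raw.trim().toLowerCase(Locale.ROOT);
        return Optional.ofNullable(DOCUMENT_TYPES.get(key));
    }

    /** Returns the first successful parse, or empty if no format matches. */
    static Optional<LocalDate> normalizeDate(String raw) {
        String text = raw.trim();
        for (DateTimeFormatter format : DATE_FORMATS) {
            try {
                return Optional.of(LocalDate.parse(text, format));
            } catch (DateTimeParseException e) {
                // This convention did not match; try the next one.
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(normalizeType(" Memorandum "));
        System.out.println(normalizeDate("01/15/1998"));
        System.out.println(normalizeDate("circa 1998"));
    }
}
```

Returning an empty result instead of guessing reflects the point made in the quote: values that no rule covers still need a person to decide how they should be handled.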

Combining Open Source Software with Infrastructure Investments
In public projects like the LTDL, operating within a tight budget is the norm. UCSF Library looked to open source software to find the functionality and tools required to build and maintain the archive while conserving cash. Other than the Xpat indexing tool, part of the Digital Library Extension Service (DLXS) from the University of Michigan, and the Solaris operating system from Sun, all key software components selected were open source.

To keep the archive current, LTDL invested considerable effort in adapting a "spider" written in the Java™ programming language. A programmer at the University of Sydney in Australia originally wrote it to look up records related to Asia on the Philip Morris Web site. Explains UCSF Library Programmer Analyst Bob Mason, "We generalized the spider to work against all the records on 7 industry sites, which involved 5 entirely different web interfaces and 6 different database structures. Adapting and refining the program has been a challenge because the behavior of the industry sites is unpredictable and new issues are always arising."
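
One common way to generalize a harvester across sites with different interfaces is to hide each site behind a small adapter. The sketch below illustrates that pattern only; it is not the LTDL spider, and the `SiteAdapter` interface, its method names, and `HarvestResult` are hypothetical. It performs no network access, so each adapter's fetch logic is left to the implementer.

```java
import java.util.ArrayList;
import java.util.List;

final class HarvestSketch {

    /** Hypothetical adapter that hides one site's interface and record layout. */
    interface SiteAdapter {
        String siteName();

        List<String> fetchNewDocumentIds() throws Exception;
    }

    /** Outcome of one harvest run: collected IDs plus the sites that failed. */
    static final class HarvestResult {
        final List<String> documentIds = new ArrayList<>();
        final List<String> failedSites = new ArrayList<>();
    }

    /** Visits every site; a failure on one site does not stop the others. */
    static HarvestResult harvest(List<SiteAdapter> sites) {
        HarvestResult result = new HarvestResult();
        for (SiteAdapter site : sites) {
            try {
                result.documentIds.addAll(site.fetchNewDocumentIds());
            } catch (Exception e) {
                // Record the failure so the site can be retried or investigated.
                result.failedSites.add(site.siteName());
            }
        }
        return result;
    }
}
```

Isolating failures per site matters when, as the quote notes, site behavior is unpredictable: a change on one site then costs one adapter fix rather than a stalled harvest.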

The key hardware selected for LTDL includes Sun production and Web servers, Sun data storage devices, and Sun technical workstations used by programmers. To improve system performance, a dedicated development server and Storage Area Network (SAN), also from Sun, are being deployed.

 
Effective Planning and Development Deliver Tangible Results
The Legacy Tobacco Documents Library delivers a permanent, online archive of tobacco documents released through the Master Settlement Agreement. Efficient and effective planning, software, and systems development by LTDL extend important benefits to the public through the archive:
  • Project completed on time and within budget, delivering the tobacco archive to the public one year from start-up.

  • Searchable data repository houses over 6.8 million tobacco industry documents (40 million pages), 4 gigabytes of metadata, and 1.5 terabytes of image data.

  • Technology and process in place for constantly updating the archive with the latest documents (1.4 million new documents were added in 2003 alone).

  • Impressive system uptime of 99.999% ensures 24x7 public access.

  • One convenient, coherent archive federates disparate data from multiple document collections using XML.

  • Positive Web user experience features intuitive site design, easy-to-use search functionality, and display of data as TIFF, GIF, and PDF files.

Since the LTDL launch on January 31, 2002, there have been over 4 million page views of the archive and almost 1.5 million documents served to the public. Users of the archive include public health and policy makers, educators, students, lawyers, scientists, and medical professionals from over 100 nations.

 
Key Tobacco Archive Learnings
  • The largest IT challenge was unpredictability: new and changing issues kept arising in processing documents and normalizing the data.

  • Preservation of documents as fixed content (i.e., stored once in a digital format and never changed) can be problematic when objectives include error checking and normalizing data for integrity and integration with data from other sources.

  • Incremental indexing is important to efficiently update archives when new data is being sourced and integrated on a regular basis.

  • Multi-threaded software can significantly improve system performance by allowing the software to use multiple CPUs concurrently.

  • Writing software in Java speeds development work and adds important flexibility.
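
The incremental indexing and multi-threading points above can be sketched together: only documents that are not yet indexed are processed, and that work is spread across a thread pool sized to the available CPUs. This is a minimal illustration under stated assumptions, not the LTDL implementation; `IncrementalIndexer` and `indexOne` are hypothetical names, and the in-memory set stands in for a real search index.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class IncrementalIndexer {

    // IDs already indexed; a concurrent set so worker threads can add safely.
    private final Set<String> indexed = ConcurrentHashMap.newKeySet();

    // Hypothetical per-document work; a real indexer would parse and store terms.
    private void indexOne(String documentId) {
        indexed.add(documentId);
    }

    /** Indexes only IDs not seen before, in parallel; returns how many were new. */
    int update(List<String> harvestedIds) {
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            Set<String> queuedThisRun = new HashSet<>();
            for (String id : harvestedIds) {
                // Skip documents indexed in earlier runs or repeated in this batch.
                if (!indexed.contains(id) && queuedThisRun.add(id)) {
                    pending.add(pool.submit(() -> indexOne(id)));
                }
            }
            for (Future<?> task : pending) {
                task.get(); // Wait for completion and surface any worker failure.
            }
            return pending.size();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Indexing interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Indexing failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    int size() {
        return indexed.size();
    }

    public static void main(String[] args) {
        IncrementalIndexer indexer = new IncrementalIndexer();
        System.out.println(indexer.update(List.of("doc1", "doc2", "doc1")));
        System.out.println(indexer.update(List.of("doc2", "doc3")));
    }
}
```

The second call to `update` touches only the one new document, which is the saving incremental indexing offers when new material arrives regularly.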

 

For more information on the UCSF Legacy Tobacco Documents Library and related resources, visit www.legacy.library.ucsf.edu


©2004 Sun Microsystems, Inc. All rights reserved. Sun, Sun Microsystems, the Sun logo, Java, Solaris, Ultra, Sun Fire, Sun Enterprise, Sun Blade and StorEdge are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries.

