E-mail is currently the mother of all electronic records problems. Much of our current migration towards electronic record keeping began with some old backup tapes of the White House's professional office systems and Oasis e-mail systems. A curious bystander found copies of e-mail memos that eventually led to the Iran-Contra investigations and focused various public advocacy groups on the importance of e-mail as a means of conducting business in the federal government.
The result has been an almost continuous stream of litigation aimed at getting the federal government to preserve electronically generated record material in electronic media. So far, the courts have sided with the public advocacy groups and are pressing the National Archives and Records Administration to come up with a plan for managing electronic records. When they do, I expect some fairly significant changes to the way we do business.
We're now in the same unenviable position as a deer looking into the headlights of an oncoming high-speed bullet train. We can see the problem, but it's very different from the problems we're used to dealing with. If we don't move soon, we'll get run over.
So let's move.
I suggested in an earlier Chips article (April 98) that we should save all our e-mail in one giant archive and not worry about weeding out the chaff. I got a few responses (in e-mail, of course) that generated some stimulating discussion.
First, let's look at some of the estimates we've done here at Strategic Command. Our LAN managers estimate that we send an average of 5,000 e-mails per day on the unclassified LAN, and receive about an equal number from outside our network. That adds up to 10,000 unique messages on the network per day.
Second, the average size of an e-mail, including attachments, is 25 kilobytes. That adds up to 250 megabytes of e-mail per day.
So, if we save one copy of what we send from within our organization and one copy of each e-mail that comes in through the firewall, our e-mail collection grows by more than 3.6 million messages per year, occupying approximately 90 to 100 gigabytes.
Ouch.
Storage, however, is getting less expensive every day. Finding room for all that e-mail wouldn't be that hard. Just buy a terabyte file server and store eight years' worth of e-mail on it. Eight years is about the longest period of time we normally keep record material in our local files.
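As a back-of-the-envelope check, the estimates above can be worked through in a few lines of Python. This is only a sketch: it assumes a 365-day year and decimal units (1,000 KB per MB), neither of which is stated in the original estimates, and the variable names are mine.

```python
# Estimates quoted in the text
MESSAGES_PER_DAY = 5_000 + 5_000   # sent internally + received from outside
AVG_SIZE_KB = 25                   # average message size, attachments included
RETENTION_YEARS = 8                # longest normal local retention

# Assumptions (not from the text): 365 days per year, decimal units
DAYS_PER_YEAR = 365

messages_per_year = MESSAGES_PER_DAY * DAYS_PER_YEAR
mb_per_day = MESSAGES_PER_DAY * AVG_SIZE_KB / 1000
gb_per_year = mb_per_day * DAYS_PER_YEAR / 1000
gb_for_retention = gb_per_year * RETENTION_YEARS

print(f"{messages_per_year:,} messages per year")
print(f"{mb_per_day:.0f} MB per day, {gb_per_year:.2f} GB per year")
print(f"{gb_for_retention:.0f} GB for {RETENTION_YEARS} years")
```

Under those assumptions, eight years of traffic comes to roughly 730 gigabytes, which is why a single terabyte server would do.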
Record material is normally scheduled based on minimum retention. There are very few types of records we are absolutely required to destroy when their retention period is up. We may want to dispose of the material because we need room for more or because having someone else find it would be potentially embarrassing, but most retention periods are a minimum, not a maximum.
There will, of course, be a few e-mails that we'll have to keep longer than eight years. However, these categories of information are easy enough to identify that action officers already routinely file information that belongs to them for long-term retention. Manually sorting out those few e-mails that require long-term archiving shouldn't be that onerous.
Discarding year-sized blocks of e-mail during the ninth year based on file date makes disposition for most of our e-mail fairly simple. However, storage and disposition are the easy parts. The bulk of our problem, given the volume of individual documents that we'll be dealing with, is organizing our e-mail records so we can retrieve information. After all, what's the point of keeping it if you can't find anything specific on demand?
I believe, though, that there should be a better way to do this. The idea of keeping tons of e-mail that really isn't worth anything just to facilitate retention of real records isn't just inelegant, it's ugly. There has to be a better way.
Enter Zippy. Actually, it was more of a return. He and Zippette, who had finally tied the marital knot, had been honeymooning in France for two weeks.
(It was originally only supposed to be a long-weekend flying visit to Paris, but everyone at the office chipped in to make sure he stayed away from work for at least two weeks.)
Zippy returned from France with a renewed interest in e-mail. He'd seen a commercial for a popular pain reliever on French television that starts off with a high speed train whizzing by. According to Zippy, the commercial then shows a computer screen with an announcer explaining that "e-mail travels at 300,000 miles per second."
Thanks to this commercial, we now know that e-mail travels at approximately 1.6 times the speed of light. This is faster than anything else in the known universe - faster than anything is actually ever supposed to be able to travel. I have to wonder if this particular advertising company has also been doing ads for most of the major computer software and hardware manufacturers. This assertion is strikingly consistent with many other scientific claims made by the computer industry over the past 30 years.
(Then again, Zippy could have just mistranslated kilometers into miles, but he does speak passable French and, more importantly, Zippette vouched for him.)
In any event, Zippy somehow melded the image of a speeding train traveling faster than light with the number of e-mails generated each day in STRATCOM and had himself a little epiphany.
"You know, Dale," he said, "there's no way we're ever going to be able to archive all our e-mail manually. There's just too much of it coming and going too fast."
I was, of course, forced to agree with this ineluctable logic. "Yes, Zippy, that's true," I replied. "We know what we need to save, but we just don't have the time to do it if we expect to do other work."
"Well then," answered Zippy, "why don't we just teach a computer to do it for us?"
"Because" My response died on my lips as a little voice in the back of my brain echoed: "Yeah, why don't we?"
Over the next few hours, Zippy and I kicked around a variety of ideas, concepts and theories about how we could do automated record keeping. Just to make sure I wasn't suffering from some debilitating hallucinatory condition, I pulled our Command Records Manager (CRM) into the conversation.
Normally, our CRM greets Zippy about the same way Van Helsing would Dracula: with a sharp wooden stake in one hand and a large mallet in the other. This time, however, even he had to admit that Zippy's idea might be our only feasible alternative to an electronic tsunami.
The answer is to develop artificial intelligence (AI), or at least something that resembles it, for sorting electronic records. We need to translate the process we use to categorize and file information into an algorithm that can become the core of our records management applications.
The first requirement is knowing what categories of information we need to save and for how long. We've actually spent most of the last 100 years defining this and developing records management programs with specific filing and disposition instructions for each category. Our records categories and series are not based on media, but on mission, business or legal requirements. Therefore, they should still be valid no matter what media we use to archive our records.
As we move our business from paper-based to electronic systems, I do not expect our information requirements to change that radically. That gives us a fairly structured framework around which to build our AI.
However, while we have the categories, what we currently lack are detailed, documented semantic models for how we sort information into those categories. Because the human mind is capable of making intuitive judgements based on the perceived meaning of words, we don't have much trouble distinguishing between what records go into the General Correspondence or Nuclear Counter-Proliferation Operations file folders.
Computers, on the other hand, need very explicit instructions on what to do with individual documents. We will need to develop detailed lexicons for all the records series we intend to archive electronically. This will not be a trivial project, as there are hundreds of records series to choose from.
The potential payoff, however, is huge. How much would it be worth to us, in man hours alone, to have our networks automatically file our work for us into a system that would also allow immediate sorting and retrieval of entire categories of information?
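To make the lexicon idea a little more concrete, here is a minimal sketch of what lexicon-based filing might look like. The two series names come from the folders mentioned above, but the term lists are invented purely for illustration, and the scoring rule (count the lexicon words that appear in the message) is a deliberately crude stand-in for a real semantic model.

```python
import re

# Hypothetical lexicons: the terms are illustrative, not an official schedule
LEXICONS = {
    "General Correspondence": {"meeting", "schedule", "announcement", "thanks"},
    "Nuclear Counter-Proliferation Operations": {
        "nuclear", "proliferation", "treaty", "warhead",
    },
}


def classify(text, lexicons=LEXICONS, min_hits=1):
    """Return the records series whose lexicon best matches the text.

    Returns None when no series reaches min_hits, so that a person
    can file the message by hand instead.
    """
    words = set(re.findall(r"[a-z]+", text.lower()))
    scores = {series: len(words & terms) for series, terms in lexicons.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= min_hits else None


print(classify("Treaty update on nuclear proliferation"))
print(classify("Lunch?"))
```

A real lexicon would need phrases, weights and some way to break ties, which is a large part of why building one for hundreds of series is not a trivial project.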
What would you think of an application that could scan 100,000 one-page documents in less than two hours, build a topographical map that shows clusters of highly related documents, identify them by name and let you retrieve them?
My first thought was, "Gee, this might help solve our e-mail records problem." Zippy had somehow managed to acquire a copy of an application named SPIRE (Spatial Paradigm for Information Retrieval and Exploration) from Pacific Northwest National Laboratory (PNNL). He had been using it, unbeknownst to the rest of us, to sort his e-mail from his Star Trek trivia Usenet group. This is somewhat akin to buying a Porsche and then never driving faster than 20 miles per hour.
According to Mr. Dennis McQuerry, one of the developers at PNNL, SPIRE is not a neural net type of AI. The initial charter to develop SPIRE came from the Office of Research and Development (R&D) at the CIA. One of the criteria was that there would be no manual key-wording and no training data needed. This led PNNL to rule out artificial neural nets, so they resorted to statistical methods.
SPIRE creates a visual representation of unstructured text by selecting specific features from each document, building an n-dimensional vector for each document (with n in the range of ~200), creating clusters of highly related documents, and then building a topographical map of the n-dimensional Galaxies representation for the user. It can process 100,000 scientific abstracts (average length about one page) on a medium sized Unix machine (Sun Ultra 2, 200 MHz) in about 1.5 hours.
Since each document is represented as a single star in the Galaxies representation, SPIRE is well-suited for mapping relationships between short, focused documents, such as news stories, e-mails, patent claims, etc. Long documents which cover a lot of different themes are more difficult to map, as the Galaxies projection will only allow a single document to reside with a single cluster of other documents. This can cause the system to ignore some of the thematic relationships between a document and other potentially related documents in other clusters unless it is segmented into shorter, more focused sub-documents (volumes, chapters, parts, etc.) that can be scanned and mapped independently.
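For readers who have not met document vectors before, the general idea can be sketched in a few lines. To be clear, this is not SPIRE's algorithm, which I have not seen: it swaps in plain word counts and cosine similarity, a common textbook technique, and the sample documents are made up.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Turn a document into a sparse word-count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 means no shared words."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest(name, docs):
    """Return the name of the document most similar to docs[name]."""
    target = vectorize(docs[name])
    others = (n for n in docs if n != name)
    return max(others, key=lambda n: cosine(target, vectorize(docs[n])))


# Hypothetical sample documents
DOCS = {
    "flu-1": "Flu shots start next week at the clinic",
    "flu-2": "Reminder: the clinic flu shots schedule has changed",
    "lan-1": "The unclassified LAN will be down for server maintenance",
}

print(nearest("flu-1", DOCS))
```

Documents that land close together under a measure like this are the ones that would end up as neighboring stars in the same cluster.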
According to Mr. McQuerry, PNNL also has a product called Webtheme, a Java-based version of SPIRE that runs in a skinny-client-server configuration. They use it primarily to harvest documents off the web, though it can also be used as a means of sharing a SPIRE data set with a work group.
However, one limitation of SPIRE is that it creates vectors that describe each document only in terms of its relationship to all the other documents in the set. As a result, SPIRE vectors are only relevant within the context of the other documents that make up the body of the mapped archives.
Another approach, according to Mr. McQuerry, and the one I find very interesting in relation to our problems with electronic records, is to use a predefined set of abstract entities to shape the information visualization process. PNNL has another system, called Hypercube, that builds a portable vector for each document in the archive. These vectors can be used as metadata (data about data) to describe the document regardless of what other collections you place the document in.
Their current work with this system is based on the Unified Medical Language System, but it is probable that the methodology could be applied to any group of information for which sufficient metadata exists. I believe our current records series, expanded to include more robust metadata, would be a good start for building that ontology.
In the end, all the gee-whiz computer tools notwithstanding, the root of any records management system is a basic understanding of what information you need to record, and how long you need to keep it.
We can build the metadata we need for an intelligent system from our existing framework of records management information. We should, at the same time, be looking at technologies like SPIRE and Hypercube and see where we can meld them together with our business requirements for managing electronic records.
If you want to take a closer look at what PNNL has done in this area, their WWW site is located at http://www.pnl.gov/infoviz. This is not bleeding edge technology, but a proven implementation. PNNL received an R&D 100 award for SPIRE in 1996. Also, since SPIRE was developed with government money, PNNL does not charge a licensing fee to government clients. They do, however, charge for installation, training and support.
Let's say we get this magical mystery archive of all our e-mail sorted by an intelligent computer agent. Now we get into the real crux of records management: retrieval, retention and disposition.
Who should have access to the e-mail archive? Everyone does not need access to everyone else's e-mail. However, there are legitimate reasons for people to pull up compilations of e-mail messages that document particular projects or programs.
For lack of any type of integrated relational system that included access to organizational and personnel data, I would restrict access to the e-mail archive to records managers and technicians. If I want all the e-mails that someone who works for me has exchanged with the LAN Program Management Office, I can have my records technician retrieve them for me.
My ideal, however, would be to have retrieval systems sophisticated enough to know that I'm Sharon's boss and, as such, entitled to retrieve her e-mail records by comparing reporting chains and command or administrative relationships documented in the personnel management database. While we're a long way from that type of capability in practice, the tools, mechanisms and theory to build these types of systems exist today.
We just have to buckle down and build that Enterprise Database we've been talking about all these years.
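The reporting-chain check itself is not complicated once the personnel data exists. Here is a minimal sketch, assuming a simple table that records who reports to whom. The table, the function names and "Pat" are all hypothetical; a real system would pull these relationships from the personnel management database.

```python
# Hypothetical reporting relationships: employee -> immediate supervisor
REPORTS_TO = {"Sharon": "Dale", "Dale": "Pat"}


def is_in_chain_above(requester, owner, reports_to=REPORTS_TO):
    """True if requester sits anywhere above owner in the reporting chain."""
    seen = set()
    current = reports_to.get(owner)
    while current is not None and current not in seen:  # guard against loops
        if current == requester:
            return True
        seen.add(current)
        current = reports_to.get(current)
    return False


def may_retrieve(requester, owner, reports_to=REPORTS_TO):
    """Allow people to retrieve their own records or those of subordinates."""
    return requester == owner or is_in_chain_above(requester, owner, reports_to)


print(may_retrieve("Dale", "Sharon"))   # supervisor asking for a subordinate's mail
print(may_retrieve("Sharon", "Dale"))   # subordinate asking for a supervisor's mail
```

The hard part is not the lookup but keeping the underlying relationships accurate.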
Retrieving files from a collection of over three million e-mails will depend on a rock solid indexing system. Fortunately, e-mail has well-defined data in the header for who sent it, who received it, the subject and when things happened. And indexing engines exist that are more than capable of compiling a full-text index of the remainder of the messages.
So, we should be able to retrieve all the messages sent between, from or to specific people or groups that discuss certain subjects between or on specific dates. It's a less complex indexing problem than what Lycos, Excite, AltaVista and Yahoo deal with every day on the WWW.
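As a rough illustration of how far the header data alone will take us, here is a sketch using the e-mail parsing tools in Python's standard library. The addresses and messages are invented, and a linear scan like this is only a stand-in for a real indexing engine, which would build its index once rather than re-reading every message per query.

```python
from datetime import datetime, timezone
from email import message_from_string
from email.utils import getaddresses, parsedate_to_datetime

# Hypothetical sample messages
RAW_MESSAGES = [
    "From: j070@example.org\n"
    "To: all-hands@example.org\n"
    "Subject: Flu shots\n"
    "Date: Mon, 05 Oct 1998 09:00:00 +0000\n"
    "\n"
    "Flu shots start next week.\n",
    "From: lan-admin@example.org\n"
    "To: j070@example.org\n"
    "Subject: Server outage\n"
    "Date: Tue, 10 Nov 1998 14:30:00 +0000\n"
    "\n"
    "The file server will be down Friday.\n",
]
MESSAGES = [message_from_string(raw) for raw in RAW_MESSAGES]


def addresses(msg, *headers):
    """Collect the lower-cased addresses found in the named headers."""
    values = []
    for header in headers:
        values.extend(msg.get_all(header, []))
    return {addr.lower() for _, addr in getaddresses(values)}


def search(messages, sender=None, recipient=None, text=None, start=None, end=None):
    """Return messages matching every criterion that was supplied."""
    hits = []
    for msg in messages:
        sent = parsedate_to_datetime(msg["Date"])
        searchable = f"{msg['Subject']} {msg.get_payload()}".lower()
        if sender and sender.lower() not in addresses(msg, "From"):
            continue
        if recipient and recipient.lower() not in addresses(msg, "To", "Cc"):
            continue
        if text and text.lower() not in searchable:
            continue
        if start and sent < start:
            continue
        if end and sent > end:
            continue
        hits.append(msg)
    return hits


october = search(
    MESSAGES,
    text="flu shot",
    start=datetime(1998, 10, 1, tzinfo=timezone.utc),
    end=datetime(1998, 10, 31, tzinfo=timezone.utc),
)
print([msg["Subject"] for msg in october])
```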
One thing that does complicate filing e-mail, however, is that people frequently discuss more than one subject in a single e-mail. For example, I sometimes send my boss e-mail with status on three or four different issues in the same transmission. It's more efficient than sending four separate transmissions. However, that means if we're keeping paper record sets on all four, we'd have to print out four different copies of the message to file in the different folders. With a relational archive, that one e-mail can be referenced in four different file folders without being duplicated.
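The "one message, four folders" arrangement is a classic many-to-many relationship. A minimal sketch using Python's built-in SQLite support might look like the following. The table and folder names are hypothetical and the schema is pared down to the bare minimum.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE message (id INTEGER PRIMARY KEY, subject TEXT, body TEXT);
    CREATE TABLE folder  (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    -- One row per (message, folder) reference; the message is stored once
    CREATE TABLE filing (
        message_id INTEGER REFERENCES message(id),
        folder_id  INTEGER REFERENCES folder(id),
        PRIMARY KEY (message_id, folder_id)
    );
""")


def file_message(conn, subject, body, folder_names):
    """Store one copy of the message and reference it from each folder."""
    cur = conn.execute(
        "INSERT INTO message (subject, body) VALUES (?, ?)", (subject, body)
    )
    message_id = cur.lastrowid
    for name in folder_names:
        conn.execute("INSERT OR IGNORE INTO folder (name) VALUES (?)", (name,))
        folder_id = conn.execute(
            "SELECT id FROM folder WHERE name = ?", (name,)
        ).fetchone()[0]
        conn.execute(
            "INSERT INTO filing (message_id, folder_id) VALUES (?, ?)",
            (message_id, folder_id),
        )
    return message_id


def messages_in(conn, folder_name):
    """List the subjects of every message referenced from a folder."""
    rows = conn.execute(
        "SELECT m.subject FROM message m "
        "JOIN filing f ON f.message_id = m.id "
        "JOIN folder d ON d.id = f.folder_id "
        "WHERE d.name = ?",
        (folder_name,),
    )
    return [row[0] for row in rows]


file_message(
    conn,
    "Weekly status",
    "Status on four different issues.",
    ["Budget", "Training", "LAN Upgrade", "Records Management"],
)
print(messages_in(conn, "Training"))
```

The message body lives in one row; each folder holds only a reference to it.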
With simple index retrieval, we will probably pick up some extraneous material during an e-mail search, just as we do running a search on the WWW. However, e-mail is a much more structured medium than the Web. It won't be as complex a system as some people might think. Add in some of the sorting capabilities of the SPIRE or Hypercube systems we discussed earlier, and I think we have a winner.
The two concepts that distinguish records management from any other type of information management are retention and disposition. We have requirements for how long we must maintain information in particular categories.
We keep records for one of three reasons: we want to, we have to, or we've forgotten they're there. In all cases, there is a cost associated with retention. Paper and microfilm need some type of physical storage. Electronic records need digital media like disks or tape.
While we can, as I mentioned earlier, buy a terabyte server and store everything, we shouldn't have to. What we should get is a records management application (RMA) to manage the information we've sorted into our virtual folders.
Any RMA we buy should be able to manage the contents of our folders, just like we manage the folders in our filing cabinets. If the RMA can reliably dispose of records when they pass their retention period, it should significantly reduce the amount of storage we need. It's a simple concept; I would like to think the execution of it will be as straightforward.
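In its simplest form, the disposition rule an RMA has to apply is a lookup and a date comparison. The sketch below makes several assumptions of my own: the series names and retention periods are invented, a period of None stands for permanent retention, and the clock is assumed to start at the end of the calendar year in which the record was filed. Real schedules and cutoff rules will differ.

```python
from datetime import date

# Hypothetical retention schedule, in years; None means keep permanently
RETENTION_YEARS = {
    "General Correspondence": 2,
    "Policy Coordination": 8,
    "Command History": None,
}


def is_disposable(series, filed_on, today, schedule=RETENTION_YEARS):
    """True once a record has passed its minimum retention period.

    Assumption: retention is counted from the end of the calendar
    year in which the record was filed.
    """
    years = schedule[series]
    if years is None:
        return False
    cutoff = date(filed_on.year + years + 1, 1, 1)
    return today >= cutoff


print(is_disposable("General Correspondence", date(1996, 3, 1), date(1999, 1, 1)))
print(is_disposable("Command History", date(1950, 1, 1), date(1999, 1, 1)))
```

Since most retention periods are minimums, a check like this only says when a record may go, not that it must.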
Having said we can save all the e-mail we send, I don't think most of us are ready to go quite that far at present. We can, however, start practicing immediately with one small subset of e-mail as a prototype project: organizational e-mail.
Organizational e-mail messages are those sent from an organizational e-mail account instead of an individual one. In addition, organizational e-mail may task, direct or commit the sending organization or other organizations. Individual e-mail does not inherently carry that authority, though individuals may still task, direct or commit other individuals.
My boss, for example, has three official e-mail identities. She can send as herself, J070 or Command Secretariat. Each has a different meaning and function.
E-mail sent from her individual account is sent from her personally. In theory, she can task, direct or commit me and the other people who work for her from her individual e-mail account, but not the organization as a whole.
(In practice, however, it wouldn't really matter what account she sent it from. It would be rather silly of me to tell her we weren't going to comply until she sends it from the correct e-mail account.)
As J070, she, or anyone else in her office authorized to use that identity, can officially task, direct or commit subordinate organizations or provide input for coordination or authorization. An address based on an office's formal functional address symbol is what I would refer to as the official address for an organization. These accounts will become increasingly important as we adopt work processes that rely on e-mail as a delivery backbone.
The other organizational e-mail identity, Command Secretariat, is what I call a super-official address, one based on a title rather than an office symbol. As Command Secretariat, my boss assumes the role of the Commander-in-Chief's representative for administrative matters, including the authority to task and suspense every other organization in the command on behalf of the CINC and issue command-level announcements and policy.
Similar roles apply to our CINC, DCINC and Chief of Staff, who each have personal, official and super-official e-mail identities. Other super-official accounts would include the help desk, CERT or LAN Administration - all entities that conduct official business via e-mail. However, super-official e-mail identities are comparatively rare. Most organizations only have one organizational e-mail identity, which will be based on their functional address symbol.
Common uses for organizational e-mail accounts may include automated message distribution, tasker/suspense systems, policy coordination and transmission of any other staff work that we currently perform using paper-based routing boxes. There are several sophisticated workflow systems that can use individual and organizational e-mail accounts as the basis for whole new systems of work.
At present, however, the volume and complexity of our organizational e-mail traffic is much lower than for individual e-mail accounts. Much of what we're handling is either rerouted AUTODIN messages (which are already saved elsewhere), congratulatory public announcements or public address messages (e.g., flu shots start next week). Your mileage may vary, but unless you already have an e-mail-based workflow system in place or you've dumped AUTODIN and have already implemented the Defense Messaging System, most organizations probably send fewer than 100 messages from organizational e-mail accounts per week.
This low volume makes organizational e-mail a good starting point for prototyping an e-mail record keeping system. All the messages sent from organizational accounts are, by definition, federal records. However, unless you've already installed a full-scale workflow system, most of the material is neither critical nor controversial.
If you'd like to get a head start on the aforementioned oncoming train headlight, here are some suggestions:
1. Set aside space on the network for a prototype e-mail archive; 100 MB should do for a start.
2. Start with your super-official e-mail accounts. Either have the e-mail server automatically archive a copy of anything sent from those accounts to the storage area or manually save copies from the Sent Mail box. (Note: automatic will always be the best choice here, if your server will cooperate.)
3. Build a retrieval system for the e-mail archive. Windows Explorer or File Manager will allow minimal access, but really don't answer the mail here. My best offer, at present, is to point a web-based search engine at the storage area and build an HTML interface to accept queries and display the results. As all of the messages in the archive were probably broadcast to everyone in the first place, I wouldn't worry about controlling the re-release of any information in the archive. So, if I remember reading a message about the flu shot schedule a month ago but deleted it and want to find the information, I should be able to go to the web page, type "flu shot" into the search box, and retrieve a link to any archived messages that refer to flu shots.
4. When you are satisfied that the system is stable and reliable, add in the rest of your unclassified organizational e-mail. There may be cases where you'll want to restrict access to some of these messages based on Privacy Act or other concerns. Experiment with and concentrate on developing solid access control schemes.
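The steps above can be prototyped in miniature before anyone touches a real mail server. The sketch below is one way to do it, under stated assumptions: the account address and sample messages are invented, messages arrive as raw text, and the archive is just a folder of files searched by brute force. A production setup would hook into the mail server and a proper search engine instead.

```python
import tempfile
from email import message_from_string
from email.utils import parseaddr
from pathlib import Path

# Hypothetical super-official account to archive
ARCHIVED_ACCOUNTS = {"command.secretariat@example.org"}

SAMPLE_OFFICIAL = (
    "From: Command Secretariat <command.secretariat@example.org>\n"
    "To: all-hands@example.org\n"
    "Subject: Flu shots\n"
    "\n"
    "Flu shots start next week.\n"
)
SAMPLE_PERSONAL = (
    "From: someone@example.org\n"
    "To: someone.else@example.org\n"
    "Subject: Lunch\n"
    "\n"
    "Noon?\n"
)


def archive_if_organizational(raw, archive_dir, accounts=ARCHIVED_ACCOUNTS):
    """Save a copy of the message if it was sent from an archived account."""
    msg = message_from_string(raw)
    _, sender = parseaddr(msg.get("From", ""))
    if sender.lower() not in accounts:
        return None
    archive_dir.mkdir(parents=True, exist_ok=True)
    sequence = len(list(archive_dir.glob("*.eml"))) + 1
    path = archive_dir / f"{sequence:06d}.eml"
    path.write_text(raw, encoding="utf-8")
    return path


def search_archive(archive_dir, phrase):
    """Return the archived files whose text contains the phrase."""
    return sorted(
        path
        for path in archive_dir.glob("*.eml")
        if phrase.lower() in path.read_text(encoding="utf-8").lower()
    )


with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp) / "archive"
    archive_if_organizational(SAMPLE_OFFICIAL, archive)
    archive_if_organizational(SAMPLE_PERSONAL, archive)   # ignored
    print([path.name for path in search_archive(archive, "flu shot")])
```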
We've covered a lot of ground in the ERM article series over the last year, but we've still barely scratched the surface. Migrating to a paperless office, enterprise information system or any of the other Holy Grails we subscribe to will take a sustained organizational commitment and a well developed (and well marketed) concept of operations.
It was Oscar Wilde (or maybe a records manager having a bad day) who said, "It is a very sad thing that nowadays there is so little useless information." Our common store of data, information and knowledge is growing at an amazing rate. While the rules of the game should not change just because we're going electronic, we will have to employ and trust some form of automation to help us deal with the flood.
About the Author: Maj Dale Long, USAF