
January 17, 2008
Volume 5, Issue 7


Document & Media Exploitation

The DOMEX challenge is to turn digital bits into actionable intelligence.

SIMSON L. GARFINKEL, PH.D.


A computer used by Al Qaeda ends up in the hands of a Wall Street Journal reporter. A laptop from Iran is discovered that contains details of that country’s nuclear weapons program. Photographs and videos are downloaded from terrorist Web sites.

As evidenced by these and countless other cases, digital documents and storage devices hold the key to many ongoing military and criminal investigations. The most straightforward approach to using these media and documents is to explore them with ordinary tools: open the Word files with Microsoft Word, view the Web pages with Internet Explorer, and so on.

Although this straightforward approach is easy to understand, it can miss a lot. Deleted and invisible files can be made visible using basic forensic tools. Programs called carvers can locate information that isn’t even a complete file and turn it into a form that can be readily processed. Detailed examination of e-mail headers and log files can reveal where a computer was used and other computers with which it came into contact. Linguistic tools can discover multiple documents that refer to the same individuals, even though names in the different documents have different spellings and are in different human languages. Data-mining techniques such as cross-drive analysis can reconstruct social networks, automatically determining, for example, if the computer’s previous user was in contact with known terrorists. This sort of advanced analysis is the stuff of DOMEX, the little-known intelligence practice of document and media exploitation.

The U.S. intelligence community defines DOMEX as “the processing, translation, analysis, and dissemination of collected hard-copy documents and electronic media, which are under the U.S. government’s physical control and are not publicly available.”1 That definition goes on to exclude “the handling of documents and media during the collection, initial review, and inventory process.” DOMEX is not about being a digital librarian; it’s about being a digital detective.

Although very little has been disclosed about the government’s DOMEX activities, in recent years academic researchers, particularly those concerned with electronic privacy, have learned a great deal about the general process of electronic document and media exploitation. My interest in DOMEX started while studying data left on hard drives and memory sticks after files had been deleted or the media had been “formatted.” I built a system to automatically copy the data off the hard drives, store it on a server, and search for confidential information. In the process I built a rudimentary DOMEX system. Other recent academic research in the fields of computer forensics, data recovery, machine translation, and data mining is also directly applicable to DOMEX.

This article introduces electronic document and media exploitation from that academic perspective. It presents a model for performing this kind of exploitation and discusses some of the relevant academic research. Properly done, DOMEX goes far beyond recovering documents from hard drives and storing them in searchable archives. Understanding this engineering problem gives insight that will be useful for designing any system that works with large amounts of unstructured, heterogeneous data.

Why “Exploitation?”

When researchers say that their work is focused on information or document “exploitation,” eyebrows invariably rise. The word exploitation is provocative, attracting unwarranted attention to a process that could just as easily be classified as “computer forensics” or even “data recovery.” But, in fact, the word is apropos.

The words exploit and exploitation imply using something in a manner that is “unfair or selfish.”2 And it’s true. People who are in the business of document and media exploitation really do seek to make unfair use of computer files and electronic storage devices. Fair, after all, means following the rules. The “rules” of a computer system are the APIs, the data-storage standards, the file permissions, and other interfaces that were intended to be used by the file’s creator. Once a file in the computer’s electronic trash is deleted by “emptying the trash,” the rules say that the file’s contents should no longer be accessible. The “undelete” command that is part of every forensic toolkit takes advantage of the fact that computer systems typically do not overwrite the contents of deleted files. This is a general problem in computer systems, affecting not only deleted files in file systems but also deleted paragraphs in word processors and even unallocated pages in virtual memory systems.

Computer forensic practitioners working for police departments and litigation support firms also make their living by recovering intentionally deleted data, but even these processes follow rules, though those involved in exploitation may choose to disregard them. The goal of computer forensics is to assist in some sort of investigation, which typically begins because a crime was committed and, hopefully, ends with the perpetrator being convicted in a court of law. With conviction as a goal, forensic practitioners must be concerned with evidentiary integrity and chain of custody, and they must limit their search to information that is relevant to that investigation. In many cases the evidence will have been obtained under a search warrant or discovery procedure, the terms of which may limit the forensic examiner’s actions or even which kinds of files may be examined. Evidence obtained by breaking the rules may even be suppressed.

For example, in the case of U.S. v. Carey, an investigator executing a warrant on narcotics discovered files with a JPG extension that contained child pornography. Carey was indicted and convicted for possession of child pornography, but the appellate court reversed the ruling and remanded the case back to the trial court, saying that “the seizure of evidence was beyond the scope of the warrant.”3 The evidence should have been suppressed.

Unlike the investigators in the Carey case, those engaged in document and media exploitation are not bound by any rules other than the laws of physics and nature. The goal of information exploitation is to get and use the data: the ends justify the means. It’s OK if these results aren’t good enough for a conviction. Exploitation rarely seeks to prove or disprove the details of a case; instead, it seeks to make the fullest use of all the data that has been obtained. The standard of success is the usefulness of the result, not the reliability of the process.

If you find the preceding paragraph alarming, remember that DOMEX is about exploiting data, not people. “Exploitation” is precisely the mindset that you want when you take a crashed hard drive to a data-recovery firm. If you’ve just lost the only copy of a 400-page manuscript, it’s probably OK with you if the firm is able to recover the first 200 pages of the September 20 version and the last 180 pages of the August 19 version. Although a good defense lawyer might be able to suppress a document that was made by stitching together those two halves, you probably don’t care about that if you are the author and the alternative is rewriting the 400 pages from memory. Likewise, if you are using some kind of desktop search tool to index the documents on your hard drive, you don’t mind if the product makes a mistake or two and shows you files that you aren’t “allowed” to see, just as long as you find what you’re looking for.

The Two DOMEX Problems

Broadly speaking, DOMEX addresses two problems, which we will call “deep” and “broad.”

The deep problem is the easier of the two to understand. Some kind of document or data-storage device (for example, a hard drive, DVD, or cellphone) becomes available for analysis. Because of the way that this object was obtained, we know that it is of interest. The goal is to find out everything possible about it.

A good example of the deep problem is the analysis of two hard drives stolen from Al Qaeda’s central office in Kabul on November 12, 2001. The hard drives were in a laptop and desktop that Alan Cullison, a war correspondent working for the Wall Street Journal, purchased in Kabul.4 Analysis revealed that the desktop had been used mostly by Ayman al-Zawahiri, one of Al Qaeda’s top leaders. After Cullison confirmed that the computers were legitimate, he turned them over to U.S. intelligence officials. The analysts who were given those devices presumably wanted to know everything possible about them: not just the documents, but the application programs, the configuration settings, the other computers with which these machines had come into contact, and so on. Although few details of how these computers were analyzed have been made public, it would be reasonable to assume that every applicable forensic and document analytic tool in the U.S. arsenal was applied to the machines.

Another example of the deep problem is the analysis of a stolen Iranian laptop obtained by the U.S. government in July 2005. According to the New York Times, the laptop contained “more than a thousand pages of Iranian computer simulations and accounts of experiments” that “showed a long effort to design a nuclear warhead.”5 Once again, an analyst faced with analyzing this laptop would want to know everything about it that was technically and humanly possible.

The broad DOMEX problem flips things around. Instead of having unlimited resources to spend on a particular document, analysts are given a large number of digital objects and a limited amount of time to find something useful to an investigation. In recent years the amount of digital information seized during the course of law enforcement, intelligence, and even civil litigation has exploded. “Ten years ago, a case would involve a couple of computer hard drives,” e-discovery expert Jack Seward said in 2005. “Now a case is often hundreds of hard drives, many servers, and tape archives.”6 Indeed, a single case handled by the FBI’s North Texas Regional Computer Forensics Laboratory in 2002 required more than 8.5 terabytes of storage and more than a month of computer time to process.7

This avalanche of digital media makes the broad problem quite compelling from both a national security and a commercial perspective: a system that can reliably find the “good stuff” can save money, time, and perhaps even lives.

Although these two problems may seem on the surface to be quite different, both require many of the same tools and technologies. Applying either approach to a hard drive requires software that can interpret disk structures for a wide variety of operating systems and their different versions. The naïve way to do this is by mounting the disk partition read-only; a better approach is to use forensic file system recovery software such as The Sleuth Kit.8 Such software knows how to decode on-disk file system structures, can recover deleted files, and is tolerant of data structures that might be missing or corrupt.

File recovery is just one of many required technical capabilities. Once files are recovered, software needs to extract important “names and entities” such as human names, e-mail addresses, physical addresses, and so on. The software needs to be able to recognize different spellings or codings for the same information. The system will probably need to construct some kind of hypothesis about what kinds of processes inside the computer system created the stored data in the first place. Finally, the software must be able to organize the information systematically so that it can be automatically processed.

Human-Generated Content vs. Technical Content

The intelligence community’s emphasis on translation, analysis, and dissemination in its DOMEX definition is no accident. Much of the work on DOMEX grows out of previous work on DOCEX (document exploitation). Commercial DOCEX systems have been available to the U.S. government since the 1990s.9

Today there is still significant emphasis on documents, and on the information created by human beings. This is especially true when DOMEX information is presented in a criminal or civil trial. In a courtroom the prosecution can easily take a printout of an e-mail message or a digital photograph found on a hard drive and enter it into evidence. Indeed, one reason that the Al Qaeda hard drive was valuable is that it contained correspondence with Osama bin Laden and other Al Qaeda leaders.

Technical content can be equally valuable. For example, an analyst might discover a link between two women because both women have photographs of the same man on their respective hard drives. Another way of discovering that link might be by determining that the two women both have digital photographs that came from the same digital camera (as determined by a serial number in an EXIF file) or because their copies of Windows XP were activated with the same stolen serial number. Information created by a computer, such as a digital camera’s serial number embedded in a JPEG EXIF record, can be vital in establishing a link between two individuals. Unlike the analyst who recognizes the same man, the technical link can be made automatically, even if the two hard drives are analyzed at two different locations, provided that there is a correlation step done at a central location.

Extracting technical information is difficult because many file formats are either proprietary or poorly documented. This kind of analysis is also rare among today’s commercial forensic tools, which tend to focus on document recovery and presentation of data from a single drive. For example, a data-mining algorithm that discovers that an unprintable fragment of a PDF file has a common “ancestor” with another PDF file would probably not be useful in a court of law: explaining to a jury what such a match actually means would be difficult. Finding one of those PDF files on a captured laptop and another on a terrorist Web site, however, might be useful in helping an analyst understand how information flows through an organization.

The analysis of technical content is likely to grow more important in the coming years as the widespread availability of disk and file encryption makes human-generated content harder to access, just as the widespread use of encryption for communications increases the importance of traffic analysis for communications intelligence.10

Automated DOMEX

In the remainder of this article, I present an architecture for performing automated document and media exploitation and show how the architecture can be applied to both the deep and broad problems of DOMEX. Although I use the example of a hard drive that arrives for exploitation, much of the discussion could apply equally well to a DVD or USB flash storage device.

Step 1: Imaging and Storage

When a hard drive first becomes available for exploitation, its condition is generally unknown. The drive might be in perfect working order. On the other hand, the drive may have been damaged or be about to fail, and may have only a few minutes of working life left. Therefore, when a drive arrives for exploitation, the drive’s contents are usually copied to a high-capacity storage device such as a RAID or SAN (storage area network). This process is called imaging, and the tool that performs this task is called a disk imager.

A number of disk imagers have been developed for use by police departments and other computer forensic investigators. These programs make a sector-for-sector disk copy into one or more evidence files on the storage system.

Most forensic disk imagers will also calculate an MD5 or SHA-1 cryptographic hash of both the original disk and the image: by comparing the two, the investigator can establish the fidelity of the copy. In a criminal investigation this hash is recorded in a police or investigator’s report; if the disk image is later provided to an expert working for the defense, that expert can verify that the disk image the defense team received is the same as the one acquired by the police.
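The hash comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular imager: the function names hash_image and images_match are hypothetical, and the sketch assumes that both the original and the copy can be opened and read like ordinary files.

```python
import hashlib


def hash_image(path, chunk_size=1024 * 1024):
    """Return (md5_hex, sha1_hex) for the file at path, read in fixed-size chunks."""
    md5 = hashlib.md5()
    sha1 = hashlib.sha1()
    with open(path, "rb") as f:
        # Read in chunks so that a multi-gigabyte image never has to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
            sha1.update(chunk)
    return md5.hexdigest(), sha1.hexdigest()


def images_match(original_path, copy_path):
    """True if both files produce identical MD5 and SHA-1 digests."""
    return hash_image(original_path) == hash_image(copy_path)
```

The digests returned by hash_image are the values that would be written into the investigator’s report; a defense expert could rerun the same function on the copy received and compare the strings.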

A comprehensive list of disk imagers is available on the Forensics Wiki.11

The following additional features are desirable when a disk imager is used for DOMEX:

• The imager should be as automated as possible because of the potentially large number of disks that must be processed.
• The imager should capture metadata about the hard drive such as its serial number, manufacturer, and firmware version, and handle bad sectors. (Some so-called bad sectors can nevertheless be read by turning off error correction; others can’t. Some bad drives can even crash your host computer when you try to spin up the drive.)
• The imager should incorporate workflow automation features such as choosing a file name and storage location for the image file and detecting if the same drive has been inadvertently imaged before.
• In some applications, encrypting the image file with a public key may be desirable so that the contents cannot be decrypted except in a secure facility.

Even though imaging is well-understood technology, many improvements are possible. Today’s imagers need to be faster, more highly automated, and better able to handle disk errors. There is also a need for handheld imagers and covert imagers, and for tools that can begin analysis before imaging is complete.

Step 2: File System Analysis

A 60-gigabyte hard drive has roughly 120 million 512-byte sectors, but thinking about the drive this way isn’t terribly efficient or useful. Most hard drives have one or more partitions that may contain one or more file systems. Each file system, in turn, has files that are resident and files that have been deleted but are nevertheless recoverable. This kind of extraction can be done with an open source tool such as The Sleuth Kit.
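To make the sector arithmetic and the partition step concrete, here is a minimal Python sketch. It assumes 512-byte sectors and a legacy MBR partition table in the first sector of the image; GPT disks and extended partitions are not handled. The function names are hypothetical, and this is not how The Sleuth Kit itself is implemented.

```python
import struct

SECTOR_SIZE = 512  # assumed sector size


def sector_count(capacity_bytes, sector_size=SECTOR_SIZE):
    """Number of whole sectors in a drive of the given capacity."""
    return capacity_bytes // sector_size


def parse_mbr(sector0):
    """Return (type, first_sector, num_sectors) for each non-empty primary partition."""
    if len(sector0) < 512 or sector0[510:512] != b"\x55\xaa":
        raise ValueError("no MBR boot signature found")
    partitions = []
    for i in range(4):
        # Four 16-byte entries start at byte offset 446.
        entry = sector0[446 + 16 * i : 446 + 16 * (i + 1)]
        # Layout: status(1) CHS start(3) type(1) CHS end(3) LBA start(4) sector count(4)
        _status, _chs_start, ptype, _chs_end, lba_start, count = struct.unpack(
            "<B3sB3sII", entry
        )
        if ptype != 0:  # type 0 marks an unused slot
            partitions.append((ptype, lba_start, count))
    return partitions
```

With decimal gigabytes, sector_count(60 * 10**9) gives 117,187,500 sectors, consistent with the figure of about 120 million quoted above. Each tuple returned by parse_mbr tells a later stage where a file system might begin.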

Once the file system metadata is extracted, it should be intelligently processed and stored in a common database. Documents can also be tokenized and indexed. Such a system makes it possible to rapidly search hundreds or thousands of disks by typing a single command.
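As a rough illustration of such a common database, the following Python sketch stores per-file metadata from many drives in a single SQLite table and searches all of them with one query. The schema, column names, and function names are assumptions made for this example only; a production system would also need a tokenized full-text index.

```python
import sqlite3


def open_catalog(path=":memory:"):
    """Open (or create) the shared metadata catalog."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        " drive_id TEXT, path TEXT, size INTEGER, mtime TEXT, deleted INTEGER)"
    )
    return db


def add_file(db, drive_id, path, size, mtime, deleted=False):
    """Record one file (resident or deleted) recovered from one drive."""
    db.execute(
        "INSERT INTO files VALUES (?, ?, ?, ?, ?)",
        (drive_id, path, size, mtime, int(deleted)),
    )


def find_by_name(db, fragment):
    """Return (drive_id, path, deleted) rows whose path contains fragment."""
    cur = db.execute(
        "SELECT drive_id, path, deleted FROM files"
        " WHERE path LIKE ? ORDER BY drive_id, path",
        ("%" + fragment + "%",),
    )
    return cur.fetchall()
```

SQLite’s LIKE operator is case-insensitive for ASCII text by default, which suits this kind of search; the sketch does not escape the wildcard characters % and _ if they appear in the search fragment.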

After the disk’s metadata has been extracted, a potentially large amount of data may nonetheless remain. This data comes from the sectors found between or at the end of partitions, sectors that were not ascribable to any file, and even bytes in the slack space at the ends of sectors and clusters. Forensic investigators who come up empty looking for incriminating information among the disk’s files will typically use a carving tool such as Scalpel or Foremost to search through this additional space for digital images, Word documents, and whatever other kinds of useful information they can find.
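The basic idea behind carving can be shown with a simple header-and-footer search. The sketch below looks for the JPEG start-of-image and end-of-image markers in a block of raw bytes; it is an illustration of the general technique only, not the algorithm used by Scalpel or Foremost. The function name and the max_size limit are assumptions, and real JPEGs with embedded thumbnails would need more careful handling.

```python
JPEG_HEADER = b"\xff\xd8\xff"  # start-of-image marker plus the first byte of the next marker
JPEG_FOOTER = b"\xff\xd9"      # end-of-image marker


def carve_jpegs(data, max_size=10 * 1024 * 1024):
    """Return (offset, candidate_bytes) for each header-to-footer run found in data."""
    found = []
    pos = data.find(JPEG_HEADER)
    while pos != -1:
        # Look for the first footer after this header, within max_size bytes.
        end = data.find(JPEG_FOOTER, pos + len(JPEG_HEADER), pos + max_size)
        if end != -1:
            found.append((pos, data[pos : end + len(JPEG_FOOTER)]))
            pos = data.find(JPEG_HEADER, end + len(JPEG_FOOTER))
        else:
            # No footer in range: skip this header and keep scanning.
            pos = data.find(JPEG_HEADER, pos + 1)
    return found
```

Each candidate would then be handed to an image decoder to weed out false positives, since two-byte and three-byte signatures occur by chance in large volumes of unallocated data.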

Although file-system analysis is a part of virtually every civil and criminal forensic investigation today, most of today’s tools are designed for interactive analysis and do not work well in a batch environment. This is an area where research, engineering, and product development can have significant impact.

Another area where research is needed is in improving performance. Today’s analysis tools, much like today’s file systems, generally rely on the head of the computer’s hard drive for seeking information. Processing the contents of an entire hard drive (or hard-drive image) might involve a seek to every directory and then to every file, and that’s before the carving starts. The problem here is that both disk capacities and data-transfer rates are increasing much faster than the speed with which hard drives can seek. As a result, a highly fragmented disk that can be imaged in one hour might typically require 15 to 20 hours for the initial analysis, even if the image is stored on a high-performance SAN. A good research problem in the area of file-system analysis is the development of analysis software that operates in a streaming mode, reading the disk image from beginning to end and performing all necessary data analysis as the data flies by.
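One way to picture streaming mode is a single front-to-back read loop that hands every sector to a set of independent analyzers, so that no analyzer ever seeks. The sketch below is a toy version of that idea with two trivial analyzers; the names stream_image, Hasher, and BlankSectorCounter are hypothetical, and it assumes 512-byte sectors and a buffered stream that returns short reads only at the end of the image.

```python
import hashlib

SECTOR_SIZE = 512  # assumed sector size


def stream_image(stream, analyzers, sectors_per_read=2048):
    """Read the image once, in order, passing each sector to every analyzer."""
    offset = 0
    while True:
        block = stream.read(SECTOR_SIZE * sectors_per_read)
        if not block:
            break
        for i in range(0, len(block), SECTOR_SIZE):
            sector = block[i : i + SECTOR_SIZE]
            for analyze in analyzers:
                analyze(offset + i, sector)
        offset += len(block)


class Hasher:
    """Accumulates a SHA-1 digest of the whole image as it streams past."""

    def __init__(self):
        self.sha1 = hashlib.sha1()

    def __call__(self, offset, sector):
        self.sha1.update(sector)


class BlankSectorCounter:
    """Counts sectors that consist entirely of zero bytes."""

    def __init__(self):
        self.total = 0
        self.blank = 0

    def __call__(self, offset, sector):
        self.total += 1
        if not any(sector):
            self.blank += 1
```

Real analyzers, such as a carver or a feature extractor, would need to keep state across sector boundaries, which is where most of the difficulty of the streaming approach lies.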

Step 3: File Analysis and Feature Extraction

Once the files are found, they need to be analyzed, automatically if possible.

Today’s computer forensic systems excel at document analysis, but only when used by a trained operator. Commercial “file filter” software is available that can understand, display on the screen, and extract the text from literally hundreds of different kinds of application data file formats. Once data is extracted, it can be processed with linguistic tools that can detect the language in which the document is written, translate the text into English (if necessary), or transliterate names and addresses into a standardized English spelling. The original language, the translations, and the transliterations can then all be stored in a full-text search engine, making it easy for a human analyst to rapidly search thousands of processed hard drives for a particular word or term.

Full-blown automated exploitation can go much further than simple indexing, of course. For example, hidden data from previous editing sessions is commonly left in Microsoft Word files; this data can also be automatically extracted and indexed.12 Other information found in the metadata includes the time the edits took place and the registered name of the person performing the edits. JPEG image files record such details as the serial number of the camera that was used and the time of day; the JPEG format also has provisions for recording the GPS location of each photograph. Log files found on virtually every hard drive can be used to construct a network-centric map of the computer’s electronic neighborhood. All of this information can be faked, but usually it isn’t. Analysts can extract, archive, and exploit all this information.

For work that involved documenting privacy violations on discarded hard drives,13 we wrote a program that could automatically find character sequences that had a high probability of being credit card numbers. Applying this program to a corpus of 150 hard drives, we could rapidly distinguish the few drives that contained thousands of credit card numbers from the large number of drives that had hardly any. We then focused our examination on these “hot drives.” One of these drives had been used in an ATM before it was sold on eBay; another drive had been taken from a computer used for processing credit cards at a supermarket. Neither disk had been erased before being sold.
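A feature extractor of this kind can be approximated with a regular expression plus a checksum. The following Python sketch is not the program described above, whose details are not given here; it assumes that candidate numbers are 13 to 19 digits long, optionally separated by single spaces or hyphens, and it uses the standard Luhn checksum to discard most random digit strings.

```python
import re

# 13 to 19 digits, each optionally followed by one space or hyphen,
# not embedded in a longer run of digits.
CANDIDATE_RE = re.compile(r"(?<!\d)(?:\d[ -]?){12,18}\d(?!\d)")


def luhn_valid(digits):
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_card_numbers(text):
    """Return the digit strings in text that look like valid card numbers."""
    hits = []
    for match in CANDIDATE_RE.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            hits.append(digits)
    return hits
```

Counting the hits per drive and sorting the drives by that count is then enough to separate a handful of “hot drives” from a large corpus. About one in ten random digit strings passes the Luhn test, so a low count on a drive means little by itself.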

A surprising amount of both applied and basic research needs to be done in this area. Although some commercial and open source tools are available for data extraction, nearly all of them focus on extracting human-readable text rather than metadata that might be useful for secondary analysis. What’s more, extraction software invariably lags behind the file formats used by commercial applications. For example, many open source programs can now process the OLE format used by Microsoft Word, Excel, and PowerPoint. Unfortunately, Microsoft is now moving to Office XML.

People engaged in criminal or terrorist activity may employ obsolete or obscure word processors, spreadsheets, and image file formats as alternatives to using encryption. This is because the presence of encrypted data might be a red flag, attracting the attention of an investigator. Data in oddball file formats, on the other hand, might simply be ignored by the average investigator unless there is reason to dig deeper. Thus, oddball file formats provide a kind of plausible deniability to those who are trying to hide the contents of their communications.

Another research problem is to develop automated software that can understand the data files on a hard drive the first time they are encountered, without requiring someone to sit down and write a parser. Although this sounds like a fantasy, it really isn’t. That’s because the typical hard drive contains more than just data files: it also has the programs that process those data files. In theory, it should be possible to load those programs into a virtual machine, run them, and then have the programs read and process the data files. Many security researchers are now using this kind of approach for malware analysis. It should be usable for DOMEX as well.

Step 4: Anomaly Detection and Social Network Analysis

At this point in the process, the data from the hard drive has been extracted, sliced, refactored, analyzed into multiple representations, and stored in multiple databases. Now the real work begins.

For the deep DOMEX problem, automated software should be able to perform an analysis that’s at least as thorough as an analysis produced by one or more humans. This is because the software can have access to a much greater store of forensic knowledge and techniques than even the most renowned investigator. Automated software, running with an appropriate database, can know virtually every version of every program that has ever been sold commercially. It can create a detailed hypothesis of the way that the suspect’s hard drive must have been used, then look for additional evidence on the drive (or on the Internet) to support that hypothesis. Unlike an expensive forensic investigator, this automated software could be widely deployed within both the intelligence and law enforcement communities, assuming, of course, that someone would write it.

Automated software should also be able to excel at the broad DOMEX problem. A DOMEX facility that stores features from thousands of hard drives in a single database could perform large-scale correlations of features such as e-mail addresses or credit card numbers. This approach, called cross-drive analysis,14 can determine if a particular hard drive was used by a person who has connections to a previously identified terrorist network. Alternatively, cross-drive analysis could be used to discover a terrorist network in a sea of data from captured drives.
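The correlation step can be illustrated with a simple pairwise count of shared features. The sketch below is only one plausible scoring scheme, a raw count of features in common, and is not claimed to be the weighting used in the cited cross-drive analysis work; the function name and the max_drives_per_feature cutoff are assumptions for this example.

```python
from collections import defaultdict
from itertools import combinations


def cross_drive_links(features_by_drive, max_drives_per_feature=None):
    """Score each pair of drives by the number of features they share.

    features_by_drive maps a drive identifier to a set of extracted
    features (for example, e-mail addresses found on that drive).
    """
    # Invert the mapping: which drives contain each feature?
    drives_by_feature = defaultdict(set)
    for drive, features in features_by_drive.items():
        for feature in features:
            drives_by_feature[feature].add(drive)

    scores = defaultdict(int)
    for feature, drives in drives_by_feature.items():
        if max_drives_per_feature and len(drives) > max_drives_per_feature:
            continue  # too common to say anything about a specific pair
        for a, b in combinations(sorted(drives), 2):
            scores[(a, b)] += 1
    return dict(scores)
```

Pairs with high scores are candidates for a closer look. The optional cutoff exists because a feature that appears on nearly every drive, such as a software vendor’s support address, says little about who knows whom.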

This database of the current information environment can also improve deep analysis. For example, finding scanned pages from an Al Qaeda training manual on a hard drive might be an important event, unless it is the manual that was found by the Manchester (England) Metropolitan Police and now resides on the U.S. Department of Justice Web site.15 On the other hand, finding a document that matches the first 25 pages of the Department of Justice manual but then has divergent text might be exceedingly important.
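A comparison of this kind could be sketched by hashing each page of a known document and of a newly found one, then counting how many leading pages agree. The Python below assumes that page text has already been extracted (for scanned pages, by OCR), and it normalizes only whitespace and letter case; the function names and the three result labels are hypothetical.

```python
import hashlib


def page_hashes(pages):
    """Hash each page after collapsing whitespace and case differences."""
    return [
        hashlib.sha1(" ".join(page.split()).lower().encode("utf-8")).hexdigest()
        for page in pages
    ]


def shared_prefix_pages(known_pages, found_pages):
    """Count how many leading pages of the two documents are identical."""
    n = 0
    for a, b in zip(page_hashes(known_pages), page_hashes(found_pages)):
        if a != b:
            break
        n += 1
    return n


def classify(known_pages, found_pages):
    """Label a found document relative to a known, already-cataloged one."""
    n = shared_prefix_pages(known_pages, found_pages)
    if n == len(known_pages) == len(found_pages):
        return "known document"
    if n > 0:
        return "shares first %d pages" % n
    return "unrelated"
```

A found document labeled as sharing only its first pages with a widely available one is the interesting case described above. Exact hashing is brittle against OCR noise, so a practical system would likely need a fuzzier page comparison.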

Step 5: Reporting

Once the automated analysis is complete, the results need to be made available to others: investigators, analysts, or even the ultimate consumer of the intelligence product. Today these reports are created by human analysts who tailor the report for the needs and knowledge of the intended recipient. Not surprisingly, generating a report can be time consuming, sometimes more so than the actual analysis.

An automated DOMEX system could generate its own reports. These reports could be superior to today’s forensic reports, taking into account not just the subject matter and the report’s intended consumer, but also what information has already been reported to the consumer. That is, the DOMEX system could track each user’s knowledge and fill in the gaps as necessary.
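At its simplest, tracking what each consumer already knows amounts to remembering which findings have been delivered to whom. The toy Python class below illustrates only that bookkeeping; the class and method names are hypothetical, findings are treated as opaque hashable values, and nothing here addresses the much harder problem of tailoring the prose of a report.

```python
from collections import defaultdict


class ReportTracker:
    """Remembers which findings each recipient has already been sent."""

    def __init__(self):
        self._seen = defaultdict(set)  # recipient -> findings already delivered

    def report_for(self, recipient, findings):
        """Return only the findings this recipient has not yet received."""
        fresh = [f for f in findings if f not in self._seen[recipient]]
        self._seen[recipient].update(fresh)
        return fresh
```

Calling report_for twice for the same recipient with overlapping findings returns the overlap only once, while a different recipient still receives everything.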

Search and Research

Each successive step in this hypothetical automated DOMEX system is further and further removed from the current state of the art. Open source imaging, document extraction, and file-carving software are available from a number of Web sites, but the reporting scenario described here is many years from being a usable technology.

Some civil libertarians have said they have reservations about the moral legitimacy of this work. Automated DOMEX systems, they fear, could easily become better surveillance tools for the masses. DOMEX software could be run secretly on desktop computers by large corporations, for example. Software that has the potential to be this invasive should not be developed, they argue.

Automated DOMEX software, however, actually has the power to improve privacy, not so much for the general public, but for people who are targets of investigation. Today there are far more disk drives to be analyzed than there are investigators to work with them. The result is delays that can both dangerously impede an investigation and damage the civil liberties of innocent suspects.

For example, in 2005 the United Kingdom passed legislation extending the time that terrorism suspects could be held without being charged from 14 days to 90 days, in part because the two weeks provided by the previous terrorism legislation did not provide sufficient time for the forensic analysis of a typical hard drive.16 A high-confidence automated DOMEX system could give police the tools they need to clear a suspect in days, if not hours.


As framed here, the DOMEX problem is highly unstructured. You have a pile of data that intuition tells you is important. The challenge is to do something useful with it, ideally with as much automation as possible.

This kind of broad, unstructured problem makes scientists uncomfortable, because there is no hypothesis to test. It makes businesspeople uncomfortable because there is no obvious metric to measure success or failure. But this kind of unstructured problem dominates many of today’s information-rich environments.

We have the data, but getting the data isn’t the hard part; it’s just the beginning.

References

Intelligence Community Directive Number 302. 2007. Document and Media Exploitation (July 6).
Oxford American Dictionaries. 2005.
US v. Carey, 98-3077, 172 F.3d 1268 (10th Cir. 1999).
Davis, M., Manes, G., Shenoi, S. 2005. A network-based architecture for storing digital evidence. In Advances in Digital Forensics, ed. M. Pollitt and S. Shenoi. IFIP International Conference on Digital Forensics, National Center for Forensic Science, Orlando, Florida (February 13-16).
Diffie, W., Landau, S. 1998. Privacy on the Line: The Politics of Wiretapping and Encryption. MIT Press.
Garfinkel, S., Shelat, A. 2003. Remembrance of data passed: a study of disk sanitization practices. IEEE Security and Privacy (January/February).
Garfinkel, S. 2006. Forensic feature extraction and cross-drive analysis. Digital Forensic Research Workshop, Lafayette, Indiana (August 14-16).

SIMSON L. GARFINKEL is an associate professor at the Naval Postgraduate School in Monterey, California, and a fellow at the Center for Research on Computation and Society at Harvard University. His research interests include computer forensics, the emerging field of usability and security, and personal information management.


The views expressed in this article are solely those of the author and do not necessarily reflect the positions or policies of the Naval Postgraduate School or the U.S. government. This article describes the author’s research and is issued to further discussion.

Originally published in vol. 5, no. 7.