One of the most popular email services in use today is Gmail. Kivu initiated a project to determine the most efficient and defensible process for collecting Gmail account data. This blog post is the second in a series of articles evaluating Gmail collection options for computer forensic purposes.

A common email client that can be incorporated into a forensic email collection is (shock horror) Microsoft Outlook. Outlook is included in the Microsoft Office package, and for many years it was king of email clients in the business environment. As the popularity of mobile phones and web-based clients has increased, however, Microsoft Outlook’s use has declined.

We will be using the latest version, Outlook 2013, for our collection of forensic data. While not usually seen as part of the forensic investigator’s tool kit, Microsoft Outlook has some interesting attributes: its behavior can be verified in use and its output can be tested. You just need to know what you’re doing and (as in all forensic work) be able to confirm the veracity of the data.

Outlook’s IMAP setup includes automatic testing of account credentials: Outlook sends a test email from the account to itself to ensure that the credentials are correct. Outlook 2010 allows this test to be disabled, but in Outlook 2013 the option is greyed out and the test email is sent automatically. If intrusion into the account needs to be kept to a minimum, this is worth keeping in mind.

How to Use Microsoft Outlook for Gmail Collection, Step-by-Step

Change Microsoft Outlook Settings

To start your Gmail collection, confirm that IMAP is enabled in the target Gmail account’s settings. Then open the email account settings, either through Outlook via File>Info>Account Settings or through Control Panel>Mail>Email accounts. Selecting New… in the Email tab will prompt you for the service you wish to set up. Check E-mail Account, click Next, and then select Manual Setup. Click Next again.

Unlike Gmvault, which we evaluated in the first article in this series, Outlook requires a bit more work to ensure a smooth email collection. In addition to the user name and password, Outlook requests both the incoming and outgoing servers for the IMAP account.

User Information
Your Name: (Top Level Email Name)
Email Address: (Collection Gmail address)

Server Information
Account Type: IMAP
Incoming mail server: imap.gmail.com
Outgoing mail server (SMTP): smtp.gmail.com

Logon Information
User Name: (Collection Gmail address)
Password: (Collection Gmail password)

Click on More Settings to open the Internet email settings. Under Outgoing Server, check the box indicating that the outgoing server (SMTP) requires authentication and choose to use the same settings as the incoming mail server. Click on the Advanced tab and change the server port numbers to 993 for incoming and 465 for outgoing. Select SSL as the encryption type for both, and set the server timeout to 5 minutes. These are Google’s recommended settings for using the Outlook client with Gmail accounts.
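
Before walking through the wizard, it can be worth confirming from the collection machine that both Google servers are reachable on those ports over SSL. Below is a minimal Python sketch, assuming outbound access on ports 993 and 465 is permitted; it only tests connectivity and does not touch the account.

import socket
import ssl

def check_tls(host, port):
    # Open a TCP connection, complete a TLS handshake and report the negotiated version.
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            print(f"{host}:{port} reachable, {tls.version()}")

check_tls("imap.gmail.com", 993)   # Google-recommended incoming (IMAP over SSL)
check_tls("smtp.gmail.com", 465)   # Google-recommended outgoing (SMTP over SSL)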

Start Gmail Collection

Go to the Send/Receive tab, click on the drop-down list for Send/Receive Groups and select Define Send/Receive Groups…. In the pop-up window, select the All Accounts group and click Edit on the right-hand side of the window. Check all boxes except Send mail items and select Download complete items… If you want to collect only specific folders, use the custom behavior option to select the folders you want to collect. Click OK and click OK again. Then you can either select the group from the Send/Receive drop-down menu or use the shortcut key (F9).

Track Gmail Collection

Once the collection has started, there are a few options and settings that can help minimize intrusion and track the collection – again, crucial steps if you are hoping to achieve a forensically sound collection. By default, Outlook marks an email as “Read” – whenever you select a new email, the previously viewed email is marked as read. To change this setting, go into the reading pane options either via File>Options>Mail>Outlook panes>Reading Pane… or via the View tab’s Reading Pane drop-down menu. In the options screen, uncheck all of the boxes. Now Outlook will not mark the emails you view as read as you look through them.

For tracking, to ensure that you have reviewed the correct number of emails, you’ll need to tell Outlook to show all items in a folder rather than just the unread items. Unfortunately, this can only be done folder by folder. Right-click on a folder and select Properties, select the option Show total number of items, then click OK. Repeat for all of the folders that you are collecting. If a folder does not show a number, there are 0 emails in that folder. Compare the folder counts with the counts you can view online at mail.google.com. Once all of the folder counts match, the collection is finished.
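
The same counts can also be pulled straight from Gmail over IMAP as an independent cross-check. Below is a minimal Python sketch, assuming IMAP access is enabled on the account and that credentials Gmail will accept for IMAP (for example an app password) are available; the account name and password shown are placeholders.

import imaplib

ACCOUNT = "collection.account@gmail.com"   # placeholder
PASSWORD = "app-password-here"             # placeholder

conn = imaplib.IMAP4_SSL("imap.gmail.com", 993)
conn.login(ACCOUNT, PASSWORD)
status, mailboxes = conn.list()                  # every folder/label exposed over IMAP
for entry in mailboxes:
    # Entries look like: b'(\\HasNoChildren) "/" "INBOX"'
    name = entry.decode().split(' "/" ')[-1].strip('"')
    status, data = conn.select(f'"{name}"', readonly=True)   # readonly avoids changing flags
    if status == "OK":
        print(f"{name}: {data[0].decode()} messages")
conn.logout()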

Working with Offline Email Storage

Outlook uses the Offline Storage Table (OST) format to store emails from IMAP and other web-based email accounts locally when the Internet is not available. When server access resumes, the accounts are synced back to the cloud. Outlook also uses Personal Storage Table (PST) files to back up and transfer email files and accounts. While some forensic processing tools can extract data from OST files, almost all of them can extract data from PST files, and PST files can be opened on any computer with Outlook installed.

To export the collected email to a PST file, select File>Open & Export>Import/Export, choose Export to a file, and then select Outlook Data File (.pst). Browse to where you want the file to be saved and select Allow duplicate items to be created so that all items are exported. Once the PST has been backed up and you have verified that the item count is correct, you can remove the account from the account settings and undo any options changed in the Gmail account. Then inform your client that they can now access their email and should consider changing their password.
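
That item-count verification can also be scripted rather than clicked through folder by folder. Below is a minimal Python sketch for a Windows examination machine with Outlook and the pywin32 package installed; the PST path is hypothetical, and the script simply mounts the exported PST and totals the items in every folder.

import win32com.client

PST_PATH = r"E:\Evidence\collection.pst"   # hypothetical path to the exported PST

def count_items(folder):
    # Recursively total the items in a folder and all of its subfolders.
    total = folder.Items.Count
    for sub in folder.Folders:
        total += count_items(sub)
    return total

ns = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI")
ns.AddStore(PST_PATH)               # mounts the PST as a store in the running Outlook profile
store_root = ns.Folders.GetLast()   # the most recently added store
print("Items in exported PST:", count_items(store_root))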

Following are the Pros and Cons of Using Microsoft Outlook for Forensic Investigation:

Pros

• The wide availability of Outlook
• Once all options are set, processing is simple and quick
• Native PST export

Cons

• Options are expansive and sometimes unintuitive
• Can be intrusive – Outlook sends test emails during setup and may mark unread mail as read

About Kivu

Kivu is a licensed California private investigations firm, which combines technical and legal expertise to deliver investigative, discovery and forensic solutions worldwide. The author, Thomas Larsen, is a data analyst in Kivu’s San Francisco office. For more information about how to retrieve and store Gmail messages for forensic investigation, please contact Kivu.

Social media has become a notable source of potential forensic evidence, with social media giant Facebook being a primary source of interest. With over 1.35 billion monthly active users as of September 30, 2014 [1], Facebook is considered the largest social networking platform.

Kivu is finding that forensic collection of Facebook (and other sources of social media evidence) can be a significant challenge because of these factors:

1. Facebook content is not a set of static files, but rather a collection of rendered database content and active programmatic scripts. It’s an interactive application delivered to users via a web browser, and each page of delivered Facebook content is uniquely created for a user on a specific device and browser. Even setting aside the authentication and legal evidentiary issues, screen prints or PDF printouts of Facebook web pages often do not suffice for collecting this type of information – they simply miss parts of what would have been visible to the user, including, interestingly, the unique ads that were tailored to the specific user based on their preferences and prior viewing habits.

2. Most forensic collection tools have limitations in the capture of active Internet content, and this includes Facebook. Specialized tools, such as X1 Social Discovery and PageFreezer, can record and preserve Internet content, but gaps remain in the use of such tools. The forensic collection process must adapt to address the gaps (e.g., X1 Social Discovery does not capture all forms of video).

Below are guidelines that we at Kivu have developed for collecting Facebook account content as forensic evidence:

1. Identify the account or accounts that will be collected – Determine whether or not the custodian has provided their Facebook account credentials. If no credentials have been provided, the investigation is a “public collection” – that is, the collection needs to be based on what a Facebook user who is not “friends” with the target individual (or friends with any of the target individual’s friends, depending on how the target individual has set up their privacy settings) can access. If credentials have been provided, it is considered a “private collection,” and the investigator will need to confirm the scope of the collection with attorneys or the client, including what content to collect.

2. Verify the ownership of the account – Verifying an online presence through a collection tool as well as a web browser is a good way to validate the presence of the target account.

3. Identify whether friends’ details will be collected.

4. Determine the scope of collection – (e.g. the entire account or just photos).

5. Determine how to perform the collection – which tool or combination of tools will be most effective? Make sure that your tool of choice can access and view the target profile. X1 Social Discovery, for example, uses the Facebook API to collect information from Facebook. The Facebook API is documented and provides a foundation for consistent collection, unlike a custom-built application that may not be entirely validated. Further, Facebook collections from other sources, such as cached Google pages, provide a method of cross-validating the data targeted for collection.

6. Identify gaps in the collection methodology.

a. If photos are of importance and there is a large volume of photos to be collected, a batch script that can export all photos of interest can speed up the collection process. One method of doing so is a mouse recording tool; a minimal download-and-hash sketch is shown after this list.

b. Videos do not render properly while being downloaded for preservation, even when using forensic capture tools such as X1 Social Discovery. If videos are an integral part of an investigation, the investigator will need to capture videos in their native format in addition to testing any forensic collection tool. It should be noted that there are tools, such as downvids.net, to download the videos, and these tools in combination with forensic collection tools such as X1 Social Discovery provide the capability to authenticate and preserve video-based evidence.

7. Define the best method to deliver the collection – If there are several hundred photos to collect, determine whether all photos can be collected. Identify whether an automated screen capture method is needed.

8. If the collection is ongoing (e.g., once a week), define the recurring collection parameters.
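
For item 6a above, here is a minimal Python sketch of what such a batch export could look like, assuming a list of direct photo URLs has already been identified and exported from the collection tool; the file names are hypothetical, and the MD5 log is kept so each exported photo can be re-verified later.

import hashlib
import pathlib
import requests

URL_LIST = "photo_urls.txt"        # hypothetical input: one direct image URL per line
OUT_DIR = pathlib.Path("photos")
OUT_DIR.mkdir(exist_ok=True)

with open(URL_LIST) as urls, open("photo_hashes.csv", "w") as log:
    log.write("url,filename,md5\n")
    for i, url in enumerate(line.strip() for line in urls if line.strip()):
        data = requests.get(url, timeout=30).content
        name = f"photo_{i:05d}.jpg"
        (OUT_DIR / name).write_bytes(data)
        # Record a hash alongside each file to support later integrity checks.
        log.write(f"{url},{name},{hashlib.md5(data).hexdigest()}\n")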

Kivu is a licensed California private investigations firm, which combines technical and legal expertise to deliver investigative, discovery and forensic solutions worldwide. Author Katherine Delude is a Digital Forensic Analyst in Kivu’s San Francisco office. To learn more about forensically preserving Facebook content, please contact Kivu.

[1] http://newsroom.fb.com/company-info/ Accessed 11 December 2014.

Internet technology presents a substantial challenge to the collection and preservation of data, and of metadata (data that describes and gives information about other data) in particular. This blog post from Kivu explains the factors to consider when using web pages in forensic investigations.

The challenge stems from the complexity of source-to-endpoint content distribution. Originating content for a single website may be stored on one or more servers and then collectively called and transmitted to an endpoint such as a laptop or tablet. For example, a mobile phone in Germany may receive different content from the same website than a phone in the United States. As content is served (e.g., sent to a tablet), it may be routed through different channels and re-packaged before reaching the final destination (e.g., an online magazine delivered as an iPhone application).

From a forensics perspective, this dynamic Internet technology increases the difficulty of identifying and preserving content that is presented to a user through a browser or mobile application. To comprehend the issues concerning forensics and Internet technology, we need to understand what web pages are and the differences between the two types of web pages: fixed content (static web pages) and web pages with changing content (dynamic web pages).

What is a Web Page?

A web page is a file that contains content (e.g., a blog article) and links to other files (e.g., an image file). The content within the web page is structured with Hypertext Markup Language (HTML), a formatting protocol that was developed to standardize the display of content in an Internet browser. To illustrate HTML, let’s look at the following example. The web page’s title, “Web Page Example,” is identified by an HTML <title> tag and the page content “Hello World” is bolded using a <b> tag.
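
The markup behind that example would look something like this minimal reconstruction (only the two tags described above, plus the surrounding page structure):

<html>
  <head>
    <title>Web Page Example</title>
  </head>
  <body>
    <b>Hello World</b>
  </body>
</html>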

Web pages that are accessible on the Internet reside on a web server and are accessible through a website address known as a Uniform Resource Locator, or URL (e.g., http://kivuconsulting.com/). The web server distributes web pages to a user as the user navigates through a website. Most visitors reach a website by entering the domain in a URL bar or by typing keywords into a search engine.

Static versus Dynamic Web Pages

Web pages may be classified as static or dynamic. The difference between static and dynamic web pages stems from the level of interactivity within a web page.

A static web page is an HTML page that is “delivered exactly as it is stored,” meaning that the content stored within the HTML page on the source server is the same content that is delivered to an end-user. A static web page may:

• Contain image(s)
• Link to other web pages
• Have some user interactivity such as a form page used to request information
• Employ formatting files, known as Cascading Style Sheets (CSS)

A dynamic web page is an HTML page that is generated on demand as a user visits a web page. A dynamic page is derived from a combination of:

• Programmatic code file(s)
• Files that define formatting
• Static files such as image files
• Data source(s) such as a database

A dynamic web page behaves like a software application delivered in a web browser. Dynamic web page content can vary by numerous factors, including user, device, geographic location and account type (e.g., paid versus free). The underlying software code may exist on the client side (stored on a user’s device), the server side (stored on a remote server) or both. From a user’s perspective, a single dynamic web page is a hidden combination of complex software code, content, images and other files. Finally, the website delivering dynamic web page content can manage multiple concurrent user activities on the same device, or multiple dynamically generated web pages during one user session on a single device. This behind-the-scenes management hides the underlying complexity of the numerous activities within a single user session.
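
To make the contrast with a static page concrete, the toy Python sketch below (assuming the Flask package, which is not part of the article’s toolset) generates a page on demand. The HTML each visitor receives depends on the request itself, so no single stored file on the server corresponds to what any particular visitor saw.

from datetime import datetime
from flask import Flask, request

app = Flask(__name__)

@app.route("/")
def home():
    # Details of the request shape the response; here, the browser's language header.
    language = request.headers.get("Accept-Language", "unknown")
    # This HTML is generated on demand rather than read from a static file.
    return (f"<html><body><b>Hello World</b>"
            f"<p>Rendered at {datetime.utcnow().isoformat()} "
            f"for a browser reporting: {language}</p></body></html>")

if __name__ == "__main__":
    app.run()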

Web Pages Stored on a User Device as Forensics Evidence

To a forensic examiner, web page artifacts that are stored on a user device may have significant value as evidence in an investigation. Web page artifacts are one type of Internet browser artifact. Other Internet artifacts include: Internet browser history, downloaded files and cookie files. If the device of interest is a mobile device, evidence may also reside in database files such as SQLite files.

Forensic examiners review Internet artifacts to answer specific questions such as, “Was web-mail in use?” or “Is there evidence of file transfer?” Forensic analysis may be used to create a timeline of user activity, locate web-based email communications, identify an individual’s geographic location based on Internet use, or establish theft of corporate data using cloud-based storage such as Dropbox.
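
As one illustration of that kind of review, the Python sketch below lists the most visited URLs from a browser history file, assuming a Chromium-style History SQLite database (the schema must be confirmed for the browser actually in use) and working from a copy, never the original evidence file.

import sqlite3

HISTORY_DB = r"exports\History"    # hypothetical working copy of the browser history database

conn = sqlite3.connect(HISTORY_DB)
rows = conn.execute(
    "SELECT url, title, visit_count FROM urls ORDER BY visit_count DESC LIMIT 20"
)
for url, title, visits in rows:
    print(visits, title, url)
conn.close()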

Web Content Stored on a Server as Forensics Evidence

Depending on the type of investigation (e.g., a computer hacking investigation), a forensic examiner may search for evidence on servers. Server-side content may be composed of stored files such as log files, software code, style sheets and data sources (e.g., databases).

Server-side content may directly or indirectly relate to web pages or files on a user device. If a user downloaded an Adobe PDF file, for example, the file on the server is likely to match the downloaded file on the user’s device. If the evidence on a user device is a dynamic web page, however, there may be a number of individual files that collectively relate as evidence, including: images, scripts, style sheets and log files.

The individual server-side files are component parts of a web page. A forensic examiner would analyze server-side files by investigating the relationship between the web page content on a user device and related server-side files. A forensic examiner may also review server logs for artifacts such as IP address and user account activity.

Factors to Consider in Web Page Forensics Investigations

1. Analyze the domain associated with web page content (a minimal lookup sketch follows this list). Collect information on:

a. Owner of the domain – WHOIS database lookup.
b. Domain registry company – e.g., GoDaddy.
c. Location of domain – IP address and location of web server.

2. Conduct a search using a search engine such as Google, Yahoo or Bing. Review the first page of search results and then review an additional 2 to 10 pages.

a. Depending on the scope of the content, it may be worth filtering search results by date or other criteria.
b. It may be worth using specialty search tools that focus on blogs or social media.
c. Consider searching sites that track plagiarism.

3. Examine the impact of geo-location filtering. Many companies filter individuals by location in order to provide targeted content.

a. Searches may need to be carried out in different countries.
b. Consider using a proxy server account to facilitate international searches.

4. Use caution when examining server-side metadata. Website files are frequently updated, and the updates change file metadata. A limited number of file types, such as image files, may provide some degree of historical metadata.

5. There is a small possibility that archival sites, such as The Wayback Machine, may contain web page content. However, archival sites may be limited in the number of historical records, unless a paid archiving service is used.
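
Referring back to item 1, below is a minimal Python sketch of the basic domain lookups, assuming a Unix-like examination machine with the standard whois command-line tool installed; the domain shown is a placeholder.

import socket
import subprocess

domain = "example.com"   # placeholder

# Registrant and registrar details come from the WHOIS database.
whois_record = subprocess.run(["whois", domain], capture_output=True, text=True).stdout
print(whois_record[:1000])

# Resolve the domain to an IP address; the address can then be geolocated or
# attributed to a hosting provider with a separate lookup.
print("Resolves to:", socket.gethostbyname(domain))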

Kivu is a licensed California private investigations firm, which combines technical and legal expertise to deliver investigative, discovery and forensic solutions worldwide. The author, Megan Bell, directs data analysis projects and manages business development initiatives at Kivu. For more information about using web pages in forensic investigations, please contact Kivu.

The cloud is becoming an ever-increasing repository for email storage. One of the more popular email services is Gmail, with its 15 GB of free storage and easy access from anywhere for users with an Internet connection. Because of the great number of email accounts, the potential for large amounts of data, and the lack of direct income, Google throttles bulk retrieval such as backups to lessen the burden on its servers worldwide.

This blog post is the start of a series of articles that will review Gmail collection options for computer forensic purposes. Kivu initiated a project to find the most efficient and defensible process to collect Gmail account information. The methods tested were Microsoft Outlook, Gmvault, X1 Social Discovery and Google scripts.

All four programs were run through two Gmail collection processes, with a focus on:

  • Discovering how the program stores emails.
  • Identifying whether the program encounters throttling and, if so, how it deals with it.
  • Determining if current forensic tools can process the emails collected.
  • Measuring how long the program takes to process the email, and the level of examiner involvement necessary.

Kivu employees created two Google email accounts for this analysis. Each email account had over 30,000 individual emails, which is a sufficient amount for Google throttling to occur and differences in speed to become apparent. The data included attachments as well as multi-recipient emails to incorporate a wide range of options and test how the programs collect and sort variations in emails. Our first blog post focuses on Gmvault.

What is Gmvault and How Does It Work?

Gmvault is a third-party Gmail backup application that can be downloaded at Gmvault.org. Gmvault uses the IMAP protocol to retrieve and store Gmail messages for backup and onsite storage, and it has built-in protocols that help bypass most of the common issues with retrieving email from Google. The process is scriptable and can run on a set schedule to ensure a constant backup in case of disaster. The file system database created by Gmvault can be uploaded to any other Gmail account for either consolidation or migration.

During a forensic investigation, Gmvault can be used to collect Gmail account data with minimal examiner contact with the collected messages. The program requires user interaction with the account twice – once to allow application access to the account and again at the end to remove the access previously granted. Individual emails can be viewed without worrying about changing metadata such as read status or folders/labels, because this information is stored in a separate file with a .meta file extension.
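
Those sidecar files are simple to inspect. Below is a minimal Python sketch, assuming the .meta files are JSON (which should be confirmed against the Gmvault version in use); the path shown is a hypothetical entry in a Gmvault database directory.

import json

META_FILE = r"gmvault-db\db\2014-11\1484658021592688375.meta"   # hypothetical path

with open(META_FILE) as f:
    meta = json.load(f)

# Labels and flags (including read status) live in the .meta file, not in the .eml
# message itself, so reviewing the message body does not alter them.
print("Labels:", meta.get("labels"))
print("Flags:", meta.get("flags"))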

How to Use Gmvault for Forensic Investigation

Gmvault needs very little user input and can be initiated with this command:

$> gmvault sync [email address]

We suggest using the following options:

$> gmvault sync -d [Destination Directory] --no-compression [email address]

“-d” enables the user to change where the download will be stored, allowing the data extraction to go directly to an evidence drive (default: the gmvault-db folder in the user’s home directory).

“--no-compression” downloads plain .eml files rather than the gzip-compressed default. Compression comes with a rare chance of data corruption during both the compression and decompression processes, so unless size is an issue it is better to use the “--no-compression” option. Download speed is unaffected by the compression, although compressed files are roughly 50% of the uncompressed size.

Next, sign in to the Gmail account to authorize Gmvault access. The program will create 3 folders in the destination drive you set, and emails will be stored by month. The process is largely automated, and Gmvault manages Google throttling. It accomplishes this by disconnecting from Google, waiting a predetermined number of seconds and retrying. If this fails 4 times, the email is skipped, and Gmvault moves on to the next set of emails. When finished with the email backup, Gmvault checks for chats and downloads them as well.

When Gmvault is finished, a summary of the sync is displayed in the cmd shell. Gmvault performs a check to see if any of the emails were deleted from the account and removes them from the database. This should not be a problem for initial email collections, but it will need to be noted on further syncs for the same account. The summary shows the total time for the sync, number of emails quarantined, number of reconnects, number of emails that could not be fetched, and emails returned by Gmail as blank.

To obtain the emails that could not be fetched by Gmvault, simply run the same cmd line again:

$> gmvault sync -d [Destination Directory] --no-compression [email address]

Gmvault will check whether each email is already in the database, skip those that are, and then download the items that were skipped during the previous sync. It may take up to 10 passes to recover all skipped emails, but the process can usually be completed within 5 minutes.
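
Those repeat passes are easy to automate. Below is a minimal Python sketch that simply re-runs the documented sync command a bounded number of times; the destination directory and address are placeholders, and the text checked in the output is based on the summary described above, so it may need adjusting to match the exact wording printed by the Gmvault version in use.

import subprocess

DEST = r"E:\Evidence\gmvault-db"            # placeholder evidence path
ACCOUNT = "collection.account@gmail.com"    # placeholder account

for attempt in range(10):                   # up to 10 passes, as suggested above
    result = subprocess.run(
        ["gmvault", "sync", "-d", DEST, "--no-compression", ACCOUNT],
        capture_output=True, text=True,
    )
    # Stop early once the summary no longer reports emails that could not be fetched.
    if "could not be fetched" not in result.stdout:
        break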

Be sure to remove authorization once the collection is complete.

Now you should have all of the emails from the account in .eml format, stored by date in multiple folders. Gmvault can then be used to export these files into a more usable storage format. The database can be exported as offlineimap, dovecot, maildir or mbox (default). Here’s how:

gmvault-shell> gmvault export -d [Destination Directory] [Export Directory]
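
Once the export is complete, the message count can be cross-checked against the sync summary. Below is a minimal Python sketch, assuming the default mbox export type; the path is a placeholder, and the exact layout of the export directory depends on the export type chosen.

import mailbox

MBOX_FILE = r"E:\Evidence\gmvault-export\mbox"   # placeholder path to one exported mbox file

# mailbox.mbox parses the file and len() returns the number of messages it contains.
count = len(mailbox.mbox(MBOX_FILE))
print("Messages in exported mbox:", count)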

Following are the Pros and Cons of Using Gmvault:

Pros:

  • Easy to set up and run
  • Counts total and collected emails so missing emails are quickly apparent
  • Optional compression reduces storage to roughly 50% of the original size
  • Can be scripted to collect multiple accounts

Cons:

  • No friendly UI
  • Needs further processing to produce a user-friendly deliverable
  • Will sometimes not retrieve the last few emails

The enduring onslaught of data breach events, such as the theft of 4.5 million health records from Community Health Systems or the recent staggering loss of information for 76 million JP Morgan accounts, continues to highlight the need for robust information security and the ability to proactively prevent and redress potential security incidents. In response, organizations have increased investment in better information security programs and supporting technologies. However, while more organizations may be better positioned to cope with data breach events, information security continues to lack appropriate coverage of cloud and mobile device technology risks.

Lags in InfoSec Deployment

According to the 2014 Global State of Information Security® Survey of information executives and security practitioners, organizational leaders expressed confidence in their information security activities (nearly three-quarters of study respondents reported being somewhat or very confident). However, the survey reveals gaps in the application of information security for cloud and mobile technologies. Nearly half of respondents reported that their organizations used cloud computing services, but only 18% reported having governance policies for cloud services. Furthermore, less than half of respondents reported having a mobile security strategy or mobile device security measures such as protection for email/calendaring on employee-owned devices.

Real Issue is Lack of Knowledge

Gaps in cloud and mobile information security reflect a broader trend that exists even in regulated industries. For example, in the 2013 Ponemon report, “The Risk of Regulated Data on Mobile Devices & in the Cloud,” 80% of IT professionals could not define the proportion of regulated data stored in the cloud and on mobile devices. The gap in information security does not appear to be limited to the deployment of policies and controls. Instead, the potential issues with cloud and mobile information security stem from a lack of knowledge concerning the storage and use of data. As noted in the study “Data Breach: The Cloud Multiplier Effect,” respondents described their organizations as having low effectiveness in securing data and applications in the cloud.

Reducing Cloud and Mobile Technology Risks

Developing an appropriate security posture for cloud and mobile technologies should begin with the realization that information security requirements for these technologies differ from traditional IT infrastructure. For example, the responsibility for storage and use of data in the cloud is shared by a greater number of parties—organization, employees, external vendors, etc. Additionally, contracts and written policies for cloud applications must specify more granular coverage for access, use, tracking and management of data. In the event of a potential security incident, possible sources of evidence, such as security logs, are stored externally and may require the assistance of specific employees or service providers.

The following considerations provide a starting point for the development of information security practices that are relevant to cloud and mobile technologies.

1. Identify security measures that are commensurate with cloud and mobile technologies.

a. Use security features that are built into cloud and mobile technologies, including access controls and encryption. Frequently, security features that would have prevented major cloud-based breaches (such as multi-factor authentication and text-to-cellphone warnings of suspicious activity) are already made available by cloud service providers. However, users of these services, whether individuals or large corporate clients, frequently delay full implementation of available security options due to cost or organizational concerns.

b. Implement additional security tools or services to address gaps in specific cloud and mobile technologies. For example, software-based firewalls to manage traffic flow may also provide logging capability that is missing from a cloud service provider’s capabilities.

2. If possible, use comprehensive solutions for user, device, account, and data management.

a. Manage mobile devices and their contents. Mobile device management (MDM) solutions enable organizations to coordinate the use of applications and control organizational data across multiple users and mobile devices.

b. Use available tools in the cloud. Cloud service providers such as Google Apps provide tools for IT administration to manage users, data and specific services such as Google Drive data storage. Unfortunately, many organizations do not utilize these tools and take risks such as losing control over email account access and content.

3. Maintain control over organizational data.

a. IT should control applications used for file-sharing and collaboration. Cloud-based tools such as Dropbox provide a robust method of sharing data. Unfortunately, Dropbox accounts often belong to the employee and not the organization, and in the case of a security incident, IT may be locked out of an employee’s personal account.

b. Users should not be responsible for security. Organizations often entrust employees and business partners with sensitive data, including responsibility for maintaining security requirements such as the use of encryption and strong passwords. The organization that owns the data (usually through its IT department) should have responsibility for security, and this includes organizational data stored outside of the organization’s internal IT infrastructure.

c. Encryption keys should be secured and available to IT in the case of a potential incident. With the advent of malware such as ransomware that holds data captive, and the risk of employees destroying encryption keys, securing encryption keys has become a vital step in the potential recovery of data. If IT does not maintain master control over encryption keys, important organizational data could be rendered inaccessible during a security incident.

4. Actively evaluate InfoSec response and readiness in the cloud.

a. IT should have a means to access potential sources of organizational data. If data is stored on an employee’s tablet or at a third-party data storage provider, IT should have a vetted plan for access and retrieval of organizational data. Testing should not occur when a potential security incident arises.

b. Important digital assets should be accessible from more than one source and should be available within hours and not days. IT should have backup repositories of corporate data, in particular for data stored in cloud environments. This may include using a combination of cloud providers to store data and having an explicit agreement on the timing and costs required to retrieve data (in the event of an incident).

c. Audit systems should be turned on and used. Cloud providers often have built-in auditing capability that ranges from data field tracking (e.g., a phone number) to file revision history. The responsibility for setting up audit capability belongs to the organization. As part of using a cloud provider’s technology, the use of auditing should be defined, documented and implemented.

d. IT staff should have the knowledge and skills to access and review log files. The diversity and complexity of log files have grown with the number of technologies in use by an organization. Cross-correlating log files across differing technology platforms requires specialized knowledge and advanced training. If an organization lacks the skill to analyze log files, the ability to detect and investigate potential security events may be severely compromised.

5. Incident response plans and investigation practices should cover scenarios where data is stored in the cloud or on mobile devices.

Hackers have become more aggressive in seeking out data repositories. As organizations continue to adopt cloud and mobile technologies, information security must keep pace and extend the same internal focus on information security to external sources of organizational data. In particular, incident response plans should cover an increasing phenomenon—where attackers infiltrate an organization’s physical network solely to gain the keys to its cloud data repository.

The Wayback Machine is a digital archive of Internet content, consisting of snapshots of web pages across time. The frequency of web page snapshots is variable, so not all website updates are recorded; there are sometimes intervals of several weeks or even years between snapshots. Web page snapshots usually become available and searchable on the Internet more than 6 months after they are archived. Kivu uses information archived in The Wayback Machine in its computer forensics investigations.

The Wayback Machine was founded in 1996 by Brewster Kahle and Bruce Gilliat, who were also the founders of a company known as Alexa Internet, now an Amazon company. Alexa is a search engine and analytics company that serves as a primary aggregator of Internet content sources (domains) for the Wayback Machine. Individuals may also upload and publish a web page to The Wayback Machine for archiving.

Content accumulated within the Wayback Machine’s repository is collected using spidering or web-crawling software. The Wayback Machine’s spidering software identifies a domain, often derived from Alexa, and then follows a series of rules to catalog and retrieve content. The content is captured and stored as web pages.

The snapshots available for a specific domain can be viewed by using the Uniform Resource Locator (URL) formula below. The term DOMAIN.COM is changed to the domain name of interest and the resulting address is entered into a browser’s Uniform Resource Identifier (URI) address field:

https://web.archive.org/web/*/DOMAIN.COM
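
Snapshot availability can also be checked programmatically. Below is a minimal Python sketch using the Internet Archive’s public availability endpoint; the domain and timestamp are placeholders, and the response only reflects snapshots the archive chooses to expose.

import requests

resp = requests.get(
    "https://archive.org/wayback/available",
    params={"url": "example.com", "timestamp": "20140101"},   # placeholders
    timeout=30,
)
closest = resp.json().get("archived_snapshots", {}).get("closest", {})
print("Closest snapshot:", closest.get("timestamp"), closest.get("url"))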

 

The Wayback Machine does not record everything on the Internet

A website’s robots.txt file identifies rules for spidering its content. If a domain does not permit crawling, the Wayback Machine does not index the domain’s content. In place of content, the Wayback Machine records a “no crawl” message in its archive snapshot for the domain.

The Wayback Machine does not capture content as a user would see it in a browser. Instead, the Wayback Machine extracts content from where it is stored on a server, often HTML files. For each web page of content, the Wayback Machine captures content that is directly stored in the web page and, if possible, content that is stored in related external files (e.g., image files).

The Wayback Machine searches web pages in a domain by following hyperlinks to other content within the same domain. Hyperlinks to content outside of the domain are not indexed. Even so, the Wayback Machine may not capture all content within the same domain. In particular, dynamic web pages may be missing content, as spidering may not be able to retrieve all software code, images or other files.

The Wayback Machine works best at cataloging standard HTML pages. However, there are many cases where it does not catalog all content within a web page, and a web page may appear incomplete. Images that are restricted by a robots.txt file appear gray. Dynamic content such as flash applications or content that is reliant on server-side computer code may not be collected.

The Wayback Machine may attempt to compensate for the missing content by linking to other sources (originating from the same domain). One method to substitute missing content is linking to similar content in other Wayback Machine snapshots. A second method is linking to web pages on the “live” web, currently available web pages at the source domain. There are also cases where the Wayback Machine displays an “X”, such as for missing images, or presents what appears to be a blank web page.

HTML or other source code is also archived

The Wayback Machine may capture the links associated with the page content but not acquire all of the content to fully re-create a web page. In the case of a blank archived web page, for example, HTML and other software code can be examined to determine the contents of the page. A review of the underlying HTML code might reveal that the page content is a movie or a flash application. (Underlying software code can be examined using the “View Source” functionality within a browser.)

Wayback Machine data is archived in the United States

The Wayback Machine archives are stored in a Santa Clara, California data center. For disaster recovery purposes, a copy of the Wayback Machine is mirrored to Bibliotheca Alexandrina in Alexandria, Egypt.

Kivu is a licensed California private investigations firm, which combines technical and legal expertise to deliver investigative, discovery and forensic solutions worldwide. The author, Megan Bell, directs data analysis projects and manages business development initiatives at Kivu.

For more information about The Wayback Machine and how it is used in computer forensics investigations, please contact Kivu.