Internet technology poses a substantial challenge to the collection and preservation of data, and of metadata (data that describes and gives information about other data) in particular. This blog post from Kivu explains the factors to consider when using web pages in forensic investigations.
The challenge stems from the complexity of source-to-endpoint content distribution. The content for a single website may be stored on one or more servers, then assembled and transmitted to an endpoint such as a laptop or tablet. The content can also vary by recipient: a mobile phone in Germany may receive different content from the same website than a phone in the United States. As content is served (e.g., sent to a tablet), it may be routed through different channels and repackaged before reaching its final destination (e.g., an online magazine delivered as an iPhone application).
From a forensics perspective, this dynamic Internet technology makes it harder to identify and preserve the content that is presented to a user through a browser or mobile application. To understand the issues concerning forensics and Internet technology, we need to know what web pages are and how the two types differ: web pages with fixed content (static web pages) and web pages with changing content (dynamic web pages).
What is a Web Page
A web page is a file that contains content (e.g., a blog article) and links to other files (e.g., an image file). The content within the web page is structured with Hypertext Markup Language (HTML), a markup language that was developed to standardize the display of content in an Internet browser. To illustrate HTML, consider a minimal page in which the title, “Web Page Example,” is identified by an HTML <title> tag and the page content, “Hello World,” is bolded using a <b> tag.
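A minimal sketch of such a page, reconstructed from the description above, can be inspected with Python's standard-library html.parser module. The markup string is an assumption based on that description, and the class name TitleAndBoldParser is hypothetical.

```python
from html.parser import HTMLParser

# Assumed markup, reconstructed from the description in the text.
PAGE = """<html>
<head><title>Web Page Example</title></head>
<body><b>Hello World</b></body>
</html>"""


class TitleAndBoldParser(HTMLParser):
    """Collects the text found directly inside <title> and <b> tags."""

    def __init__(self):
        super().__init__()
        self._open_tags = []   # stack of tags that are currently open
        self.title = ""
        self.bold_text = []

    def handle_starttag(self, tag, attrs):
        self._open_tags.append(tag)

    def handle_endtag(self, tag):
        if self._open_tags and self._open_tags[-1] == tag:
            self._open_tags.pop()

    def handle_data(self, data):
        if not self._open_tags:
            return
        current = self._open_tags[-1]
        if current == "title":
            self.title += data
        elif current == "b":
            self.bold_text.append(data)


parser = TitleAndBoldParser()
parser.feed(PAGE)
print(parser.title)
print(parser.bold_text)
```

The tags themselves are not displayed to the user; the browser uses them to decide how the enclosed text is presented.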
Web pages that are available on the Internet reside on a web server and are reached through a website address known as a Uniform Resource Locator, or URL (e.g., http://kivuconsulting.com/). The web server distributes web pages to a user as the user navigates through a website. Most visitors reach a website by entering the domain in a URL bar or by typing keywords into a search engine.
Static versus Dynamic Web Pages
Web pages may be classified as static or dynamic. The difference between static and dynamic web pages stems from the level of interactivity within a web page.
A static web page is an HTML page that is “delivered exactly as it is stored,” meaning that the content stored within the HTML page on the source server is the same content that is delivered to an end-user. A static web page may:
• Contain image(s)
• Link to other web pages
• Have some user interactivity such as a form page used to request information
• Employ formatting files, known as Cascading Style Sheets (CSS)
A dynamic web page is an HTML page that is generated on demand as a user visits a web page. A dynamic page is derived from a combination of:
• Programmatic code file(s)
• Files that define formatting
• Static files such as image files
• Data source(s) such as a database
A dynamic web page behaves like a software application delivered in a web browser. Dynamic web page content can vary by numerous factors, including user, device, geographic location and account type (e.g., paid versus free). The underlying software code may exist on the client side (stored on a user’s device), the server side (stored on a remote server) or both. From a user’s perspective, a single dynamic web page is a hidden combination of complex software code, content, images and other files. Finally, the website delivering dynamic web page content can manage multiple concurrent user activities on the same device, or manage multiple dynamically generated web pages during one user session on a single device. This behind-the-scenes management of user activity hides the underlying complexity of the numerous activities within a single user session.
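The distinction can be illustrated with a small sketch. Everything in it is hypothetical (the Visitor fields, the page text, the function names); it only shows that a static page hashes identically for every visitor, while a dynamically generated page does not.

```python
import hashlib
from dataclasses import dataclass

STATIC_PAGE = "<html><body><p>Welcome to our site.</p></body></html>"


@dataclass(frozen=True)
class Visitor:
    """Hypothetical request context used only for this illustration."""
    name: str
    country: str        # e.g. "DE" or "US"
    account_type: str   # "paid" or "free"


def serve_static(visitor: Visitor) -> str:
    # A static page is delivered exactly as stored, whoever asks for it.
    return STATIC_PAGE


def serve_dynamic(visitor: Visitor) -> str:
    # A dynamic page is assembled on demand from code plus request context.
    greeting = {"DE": "Willkommen", "US": "Welcome"}.get(visitor.country, "Hello")
    banner = "" if visitor.account_type == "paid" else "<div>Upgrade today</div>"
    return f"<html><body><p>{greeting}, {visitor.name}.</p>{banner}</body></html>"


def digest(page: str) -> str:
    return hashlib.sha256(page.encode("utf-8")).hexdigest()


anna = Visitor("Anna", "DE", "paid")
bob = Visitor("Bob", "US", "free")

static_same = digest(serve_static(anna)) == digest(serve_static(bob))
dynamic_same = digest(serve_dynamic(anna)) == digest(serve_dynamic(bob))
print(static_same, dynamic_same)
```

For an examiner, the practical consequence is that a copy of a dynamic page captured in one context may not match what a different user, device or location was shown.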
Web Pages Stored on a User Device as Forensics Evidence
To a forensic examiner, web page artifacts that are stored on a user device may have significant value as evidence in an investigation. Web page artifacts are one type of Internet browser artifact. Other Internet artifacts include: Internet browser history, downloaded files and cookie files. If the device of interest is a mobile device, evidence may also reside in database files such as SQLite files.
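As a rough sketch of how history stored in a SQLite file might be read, the example below builds a small in-memory database and lists visits in chronological order. The visits table, its columns and the sample rows are hypothetical and simplified; real browsers and mobile applications use their own table layouts and timestamp encodings, which would need to be confirmed for the artifact in question.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical, simplified schema; real artifacts differ by browser and version.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (url TEXT, title TEXT, visited_at INTEGER)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?)",
    [
        ("https://mail.example.com/inbox", "Inbox", 1400000600),
        ("https://www.example.com/", "Home", 1400000000),
        ("https://files.example.com/upload", "Upload", 1400001200),
    ],
)


def visit_timeline(connection):
    """Return (UTC timestamp, url) pairs ordered from oldest to newest."""
    rows = connection.execute(
        "SELECT visited_at, url FROM visits ORDER BY visited_at ASC"
    ).fetchall()
    return [
        (datetime.fromtimestamp(ts, tz=timezone.utc).isoformat(), url)
        for ts, url in rows
    ]


timeline = visit_timeline(conn)
for when, url in timeline:
    print(when, url)
```

In practice an examiner would typically run such queries against a forensic copy of the database rather than the original file, so that the evidence itself is not modified.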
Forensic examiners review Internet artifacts to answer specific questions such as, “Was web-mail in use?” or “Is there evidence of file transfer?” Forensic analysis may be used to create a timeline of user activity, locate web-based email communications, identify an individual’s geographic location based on Internet use, or establish theft of corporate data using cloud-based storage such as Dropbox.
Web Content Stored on a Server as Forensics Evidence
Depending on the type of investigation (e.g., a computer hacking investigation), a forensic examiner may search for evidence on servers. Server-side content may be composed of stored files such as log files, software code, style sheets and data sources (e.g., databases).
Server-side content may directly or indirectly relate to web pages or files on a user device. If a user downloaded an Adobe PDF file, for example, the file on the server is likely to match the downloaded file on the user’s device. If the evidence on a user device is a dynamic web page, however, there may be a number of individual files that collectively relate as evidence, including: images, scripts, style sheets and log files.
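One common way to test whether a file on a server matches a file on a user's device is to compare cryptographic hashes. The sketch below uses SHA-256 from Python's hashlib; the file names and contents are stand-ins created only for the demonstration.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=65536):
    """Hash a file in chunks so large evidence files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_match(path_a, path_b):
    return sha256_of_file(path_a) == sha256_of_file(path_b)


# Stand-in files for demonstration; the names and contents are hypothetical.
with tempfile.TemporaryDirectory() as workdir:
    server_copy = os.path.join(workdir, "server_report.pdf")
    device_copy = os.path.join(workdir, "device_report.pdf")
    altered_copy = os.path.join(workdir, "altered_report.pdf")
    for path, data in [
        (server_copy, b"%PDF-1.4 sample"),
        (device_copy, b"%PDF-1.4 sample"),
        (altered_copy, b"%PDF-1.4 sample (edited)"),
    ]:
        with open(path, "wb") as handle:
            handle.write(data)

    same = files_match(server_copy, device_copy)
    different = files_match(server_copy, altered_copy)

print(same, different)
```

A matching hash supports the conclusion that the two files have identical content. A dynamic web page offers no single server-side file to compare against, which is why its component files have to be examined individually.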
The individual server-side files are component parts of a web page. A forensic examiner would analyze server-side files by investigating the relationship between the web page content on a user device and related server-side files. A forensic examiner may also review server logs for artifacts such as IP address and user account activity.
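As an illustration of reviewing server logs for IP address and user account activity, the sketch below parses a few lines in the widely used Common Log Format. The log lines are invented for the example, and a real server may use a different or extended format, so the pattern should be treated as an assumption to verify against the actual log configuration.

```python
import re
from collections import Counter

# Common Log Format: host ident authuser [date] "request" status bytes
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
)

# Illustrative log lines; the addresses come from documentation-reserved ranges.
SAMPLE_LOG = """\
203.0.113.7 - alice [10/Oct/2014:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326
203.0.113.7 - alice [10/Oct/2014:13:56:01 +0000] "GET /reports/q3.pdf HTTP/1.1" 200 48213
198.51.100.23 - - [10/Oct/2014:14:02:17 +0000] "POST /login HTTP/1.1" 401 512
"""


def parse_log(text):
    """Yield one dict per log line that matches the expected format."""
    for line in text.splitlines():
        match = LOG_PATTERN.match(line)
        if match:
            yield match.groupdict()


entries = list(parse_log(SAMPLE_LOG))
requests_per_ip = Counter(entry["ip"] for entry in entries)
alice_requests = [e["request"] for e in entries if e["user"] == "alice"]

print(requests_per_ip)
print(alice_requests)
```

Grouping requests by IP address or account name in this way gives a starting point for relating server-side activity to artifacts found on a user device.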
Factors to Consider in Web Page Forensics Investigations
1. Analyze the domain associated with web page content. Collect information on:
a. Owner of the domain – WHOIS database lookup.
b. Domain registry company – e.g., GoDaddy.
c. Location of domain – IP address and location of web server.
2. Conduct a search using a search engine such as Google, Yahoo or Bing. Review the first page of search results and then review an additional 2 to 10 pages.
a. Depending on the scope of the content, it may be worth filtering search results by date or other criteria.
b. It may be worth using specialty search tools that focus on blogs or social media.
c. Consider searching sites that track plagiarism.
3. Examine the impact of geo-location filtering. Many companies filter individuals by location in order to provide targeted content.
a. Searches may need to be carried out in different countries.
b. Consider using a proxy server account to facilitate international searches.
4. Use caution when examining server-side metadata. Website files are frequently updated, and the updates change file metadata. A limited number of file types such as image files may provide some degree of historical metadata.
5. Check archival sites such as the Wayback Machine, although the likelihood that they contain the relevant web page content is small. Archival sites may hold only a limited number of historical records unless a paid archiving service is used.
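For the last factor, the Internet Archive offers a public availability endpoint that reports the archived snapshot closest to a given date. The sketch below only builds the query URL and parses a response body; it makes no network request, the sample response is illustrative rather than a real capture record, and the function names are hypothetical.

```python
import json
from urllib.parse import urlencode

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_query(page_url, timestamp=None):
    """Build the availability query; timestamp (e.g. YYYYMMDD) is optional."""
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp
    return f"{AVAILABILITY_ENDPOINT}?{urlencode(params)}"


def closest_snapshot(response_text):
    """Return (timestamp, archive_url) for the closest snapshot, or None."""
    payload = json.loads(response_text)
    closest = payload.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest["timestamp"], closest["url"]


# Illustrative response body, shaped like the endpoint's JSON output.
SAMPLE_RESPONSE = json.dumps({
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20130919044612/http://example.com/",
            "timestamp": "20130919044612",
            "status": "200",
        }
    }
})

query_url = build_query("example.com", "20140101")
snapshot = closest_snapshot(SAMPLE_RESPONSE)
missing = closest_snapshot('{"archived_snapshots": {}}')

print(query_url)
print(snapshot)
print(missing)
```

An empty archived_snapshots object means no capture was found, which is the more likely outcome for pages that were never crawled.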