
 
 
   
 
3. Domains
 

In this chapter:

Anatomy of a URL

Determining Domain Ownership

Recovering Missing Webpages

 

 

 

 

 

Anatomy of a URL

Understanding the anatomy of a Uniform Resource Locator (URL) is essential when conducting online research. Frequent users have become accustomed to seeing URLs in their browser’s address bar, but not everyone knows what they mean. The following URL will be used as an example:

http://www.sample.com/conferences/2004/sessions.html#A

This single address contains five distinct segments, and each of them provides some basic information.

Protocol. In the example above, the protocol is http, which stands for Hypertext Transfer Protocol (Bell, 2003). There are other protocols as well; common ones include ftp and telnet. In URLs like these, the protocol is followed by a colon and two forward slashes (http:// or telnet:// for example). HTTP is the protocol used for the overwhelming majority of the web content that most users are familiar with. Investigators may also encounter the https:// protocol. The added "S" indicates that Secure Sockets Layer (SSL) encryption is being used. This protocol is frequently used for secure ecommerce transactions and is accompanied by a key or lock symbol in the lower right-hand corner of the browser window. For more information on SSL, visit Introduction to SSL, provided by Netscape.

Host server. The next segment identifies the server where the files are hosted (Bell, 2003). Computers recognize host servers by their IP addresses, not by the text-based web addresses that most users are familiar with. However, since IP addresses are strings of numbers that are difficult for humans to remember, the Domain Name System (DNS) was implemented to automatically translate the textual names that we use into the IP address numbers that computers need (Webopedia, 2003, DNS).
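The translation from name to number can be sketched in a few lines of Python. This is a minimal illustration, not part of the original guide: the function name resolve and its resolver parameter are hypothetical, added so the lookup step can be swapped out. By default it asks the operating system's DNS resolver through the standard socket module.

```python
import ipaddress
import socket


def resolve(hostname, resolver=socket.gethostbyname):
    """Translate a host name into an IP address object.

    `resolver` is a hypothetical hook for this sketch; by default the
    system's DNS resolver is asked for the IPv4 address.
    """
    return ipaddress.ip_address(resolver(hostname))


# Example (requires network access, so it is left commented out):
# print(resolve("www.sample.com"))
```

The same lookup happens behind the scenes every time a URL is typed into the address bar.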

In our example, the host server is www.sample.com. Note the .com extension. This is the Top Level Domain (TLD), and it signifies that the website belongs to a domain of a certain type. The Internet Corporation for Assigned Names and Numbers (ICANN) is responsible for controlling these TLDs and has standard requirements for certain TLD extensions. Identifying the type of TLD is important when conducting investigations. It helps to identify the author and to uncover any potential bias the author may have had in posting certain information. Companies might have a profit motive, while educational websites might post material simply for its research value. The TLD can help an investigator determine why information has been made available on the Internet. Listed below are some common TLDs (Webopedia, 2003, TLD):

  .com    Company          .mil    Military
  .org    Organization     .biz    Business
  .gov    Government       .us     United States
  .edu    Educational      .name   Personal

For a complete listing of TLD extensions, see AllWhois.
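As a rough sketch, the table above can be turned into a small lookup that labels a URL by its TLD. The names TLD_TYPES and tld_type are hypothetical, and the mapping covers only the eight extensions listed in this section; anything else is reported as "unknown".

```python
from urllib.parse import urlsplit

# Only the extensions listed in the table above; not a complete registry.
TLD_TYPES = {
    "com": "Company",
    "org": "Organization",
    "gov": "Government",
    "edu": "Educational",
    "mil": "Military",
    "biz": "Business",
    "us": "United States",
    "name": "Personal",
}


def tld_type(url):
    """Return (tld, description) for the host named in a URL."""
    if "//" not in url:
        url = "//" + url  # let urlsplit treat a bare host name as the host
    host = urlsplit(url).hostname or ""
    tld = host.rsplit(".", 1)[-1].lower()
    return tld, TLD_TYPES.get(tld, "unknown")


print(tld_type("http://www.sample.com/conferences/2004/sessions.html#A"))
```

A label like this is only a starting point; it says what kind of registration the domain has, not who actually wrote the page.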

Directory path. There may be folders located on the host server. If the file being accessed resides in a folder, there will be several more forward slashes, each indicating a sublevel in a folder hierarchy. In our example, there are two folders, referenced by /conferences/2004/. The file we are accessing resides within the 2004 folder, which is within the conferences folder. Note that punctuation and capitalization count when entering URLs in the address bar. Different directory structures and operating systems running on the web server handle capitalization, spaces, and other punctuation differently. Therefore, if you are entering a URL directly into the address bar, as opposed to clicking a link, it is important to type it exactly as it is written. Also look for the word “users” or a tilde (~) in the URL. This frequently denotes a website or webpage hosted by an ISP and means that the author of the page most likely does not own the whole domain (Barker, 2003, Glossary).
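The “users” and tilde check can be sketched as a simple test on the directory path. This is an illustration only: the function name and the example host names are hypothetical, and a match is a hint that a page is ISP-hosted, not proof.

```python
from urllib.parse import unquote, urlsplit


def looks_like_personal_page(url):
    """Return True if the path has a ~folder or a folder named 'users'."""
    # unquote() turns an encoded tilde (%7E) back into "~" before checking
    path = unquote(urlsplit(url).path).lower()
    folders = [part for part in path.split("/") if part]
    return any(part.startswith("~") or part == "users" for part in folders)


# Hypothetical addresses used only to show the two outcomes
print(looks_like_personal_page("http://www.isp.example/~cjohnson/index.html"))
print(looks_like_personal_page("http://www.sample.com/conferences/2004/sessions.html"))
```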

Filename. There are many different types of files that can be accessed over the Internet. The most common are HTML files, which usually have the extension .htm or .html. In our example, the URL points to the file sessions.html. This file contains the content that is shown in the browser window. Other types of files that might be displayed in the browser include text files (e.g. sample.txt), Adobe Portable Document Format files (e.g. sample.pdf), and so on.

Bookmark or Anchor.  In this example, the URL contains some additional characters not always seen in Internet addresses.  The #A in the URL is a bookmark, also known as an anchor (Bell, 2003).  The author of the page has created bookmarks within the page to tell the browser to scroll to a certain point on the webpage.  The title of the bookmark, "A" in our example, is set by the webpage author.
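The five segments described in this section can also be pulled apart programmatically. The sketch below uses Python's standard urllib.parse module on the example address assembled from the segments above; the function name url_segments and the dictionary keys are hypothetical labels chosen to match the headings in this section.

```python
import posixpath
from urllib.parse import urlsplit


def url_segments(url):
    """Split a URL into the five segments described in this section."""
    parts = urlsplit(url)
    directory, filename = posixpath.split(parts.path)
    if not directory.endswith("/"):
        directory += "/"  # show the folder path with its trailing slash
    return {
        "protocol": parts.scheme,
        "host": parts.hostname,
        "directory": directory,
        "filename": filename,
        "anchor": parts.fragment,
    }


for name, value in url_segments(
    "http://www.sample.com/conferences/2004/sessions.html#A"
).items():
    print(f"{name:10} {value}")
```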


 

Determining Domain Ownership

Whois

It is not uncommon for a claimant who has access to the Internet to have his or her own website.  Increasingly, Internet Service Providers (ISPs) offer free website hosting as part of internet service plans.  Many times individuals will include an “About Me” section which generally provides personal details and photos.  Investigators can find useful information about functionality and daily activities in these areas.  One claimant even had color photos of his surgery posted on his own website!  Claimants operating businesses may have a website for the business as well.  Identifying unreported income or functionality that is inconsistent with the stated limitations associated with a particular diagnosis is key information during a disability claim fraud investigation.  If a website suspected to belong to the claimant is encountered, it is important to determine who the true owner of the website is.

Now that we know what the components of a URL are, how do we find out who owns a website?   Sometimes, the domain owner seems obvious.  Everyone knows that the microsoft.com domain belongs to Microsoft Corporation.  However, making assumptions here is dangerous.  For example, many unsuspecting users have visited the whitehouse.com domain looking for information about the President but found a website featuring adult content instead.

Fortunately, there is a way to verify the owner of a website. Whois (pronounced "who is") utilities provide information about the registered owners of domain names. There are many whois services, and users are encouraged to try several when attempting to verify registration information. As an example, searching for the domain smith.com with Network Solutions Whois returns:

 

Registrant:
Smith International, Inc. (SMITH20-DOM)
16740 Hardy Street
Houston, TX 77032
US

Domain Name: SMITH.COM

Administrative Contact:
Wheatley, Sherry (SW704) swheatley@SMITH-INTL.COM
Smith International
16740 Hardy Street, Mailbox #26PO Box
Houston, TX 77205-0068
US
7132335362 fax: 7132335237

Technical Contact:
Handley, Rick (RH1966) rhandley@SMITH.COM
Smith International
60068
Houston, TX 77205-0068
US
713-233-5101 fax: 713-233-5237

Record expires on 06-Dec-2006.
Record created on 07-Dec-1998.
Database last updated on 5-Nov-2003 20:59:16 EST.

Domain servers in listed order:

NS.SMITH-INTL.COM 206.229.216.2
NS3-AUTH.SPRINTLINK.NET 144.228.255.10
NS2-AUTH.SPRINTLINK.NET 144.228.254.10
NS1-AUTH.SPRINTLINK.NET 206.228.179.10

 

In this example, Smith International, Inc. is the registered domain owner and contact information is also provided.  An administrative contact is listed.  Note that the administrative contact person is from the same company.  This may not always be the case.  Sometimes, a company or organization may outsource the administration of their website to another agency.  However, this information is usually representative of the true owner and person responsible for the website content.  This is key information for the investigator.  Also take note that, in this case, an email address is provided.

A technical contact is also listed.  This person or group is generally responsible for the actual hosting of the website and may often be an ISP.  Frequently, this person or agency may not be responsible for the content of the site in any way but rather is only responsible for hosting the site on a web server.

Dates of creation and registration expiration along with domain server names are noted near the bottom of the record.
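When many records need to be reviewed, the dates and domain servers can be extracted with a short script. The sketch below assumes the record layout shown in the smith.com example above; whois output varies between services, so the patterns here are assumptions that would need adjusting for other formats. The name parse_record is hypothetical.

```python
import re
from datetime import datetime

# Excerpt of the smith.com record shown above
SAMPLE = """\
Domain Name: SMITH.COM

Record expires on 06-Dec-2006.
Record created on 07-Dec-1998.

Domain servers in listed order:

NS.SMITH-INTL.COM 206.229.216.2
NS3-AUTH.SPRINTLINK.NET 144.228.255.10
"""


def parse_record(text):
    """Pull the creation date, expiration date, and name servers from a record."""
    info = {"created": None, "expires": None}
    for key in ("created", "expires"):
        match = re.search(rf"Record {key} on (\d{{1,2}}-[A-Za-z]{{3}}-\d{{4}})", text)
        if match:
            info[key] = datetime.strptime(match.group(1), "%d-%b-%Y").date()
    # Name server lines: a host name followed by a dotted IP address
    info["name_servers"] = re.findall(
        r"^(\S+)[ \t]+(\d{1,3}(?:\.\d{1,3}){3})[ \t]*$", text, re.MULTILINE
    )
    return info


print(parse_record(SAMPLE))
```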

Other popular whois services include AllWhois and FasterWhois. Most of these search the common TLDs like .com, .net, and .edu. Some domains are restricted and require their own whois search. Visit the ICANN website for more information on how to search restricted domains.
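Web-based whois services sit on top of a simple text protocol: a client connects to a whois server on TCP port 43, sends the domain name followed by a line ending, and reads the reply until the server closes the connection. The following is a minimal sketch of that exchange. The function names are hypothetical, and the server to query is left to the caller because it differs by TLD and registrar.

```python
import socket


def build_query(domain):
    """Format a domain name as a whois query line."""
    return domain.strip().lower().encode("ascii") + b"\r\n"


def whois_lookup(domain, server, port=43, timeout=10):
    """Send a whois query to `server` and return the raw text reply."""
    with socket.create_connection((server, port), timeout=timeout) as conn:
        conn.sendall(build_query(domain))
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:  # the server closes the connection when done
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


# Example (requires network access; the server name is a placeholder):
# print(whois_lookup("smith.com", "whois.registry.example"))
```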

 

Hidden HTML

It may also be useful to examine the HTML code for a webpage. Sometimes web developers insert comments, which are hidden from view unless the actual code is displayed. These comments are generally used for development purposes, such as notes to other developers or references to a particular programming method. However, they may contain references to the author. To view the HTML code for the webpage currently being viewed in a browser:

Internet Explorer 5.x click View > Source

Netscape Navigator 4.x click View > Page Source

To identify comments in HTML code, look for the Less Than and Exclamation Point characters followed by two dashes. For example, <!-- Hello --> is an HTML comment. The <!-- sequence denotes the beginning of the comment, and the --> sequence denotes the end. The text in between is hidden from webpage viewers but is visible in the actual HTML code. A comment might look like:

<!--       Sample Webpage       -->

<!-- Created by Christy Johnson -->

<!--      December 5, 2001      -->

HTML comments do not always contain useful intelligence, but they may contain helpful identifying information and are worth a quick look. When viewing the code, use the Find feature on the Edit menu to search for <! to find comments buried in the code.
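For longer pages, the search for comments can be automated. The sketch below uses Python's standard html.parser module to collect every comment in a page; the class and function names are hypothetical, and the sample page simply wraps the three example comments shown above.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Collects the text of every HTML comment it encounters."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        # data is the text between <!-- and -->
        self.comments.append(data.strip())


def find_comments(html):
    """Return a list of the comments found in a page's HTML code."""
    collector = CommentCollector()
    collector.feed(html)
    collector.close()
    return collector.comments


PAGE = """<html><head><title>Sample</title></head>
<body>
<!--       Sample Webpage       -->
<!-- Created by Christy Johnson -->
<!--      December 5, 2001      -->
<p>Welcome!</p>
</body></html>"""

for comment in find_comments(PAGE):
    print(comment)
```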

 

File Properties

Another simple method of getting some historical information about a webpage is checking the properties of the HTML document that is being viewed. If a user is looking at a webpage in a browser, the page properties can be viewed by:

Internet Explorer 5.x click File > Properties

Netscape Navigator 4.x click View > Page Info

The information contained in the properties box includes the date the file was created and last modified.  These dates may be useful in determining when a person was actively involved in editing or changing a website and also when to go back and view a historical copy of a website.  However, be careful when using this information as it may not be accurate (Barker, 2003, Evaluating).
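One likely source for a page's modified date is the Last-Modified header that a web server may send along with the file; this is an assumption for illustration, not something the browsers' property boxes document. The sketch below parses such a header into a date. The function name and the sample header values are hypothetical, and, as noted above, the date is only as reliable as the server that reports it.

```python
from email.utils import parsedate_to_datetime


def last_modified(headers):
    """Return the Last-Modified header as a datetime, or None if absent."""
    value = headers.get("Last-Modified")
    if value is None:
        return None
    return parsedate_to_datetime(value)


# Hypothetical response headers for illustration
sample_headers = {
    "Content-Type": "text/html",
    "Last-Modified": "Fri, 05 Dec 2003 14:30:00 GMT",
}

print(last_modified(sample_headers))
```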

 

Other Techniques

There are some less official methods for checking the identity of a website or webpage owner. Certain sections of a website are likely to list information about the author. Obviously, investigators should look for “About Me” or “Resume” pages. Also be on the lookout for identifying information at the bottom of pages. Credits such as “Website created by,” “designed by,” “maintained by,” or “hosted by” could be useful in identifying the person who maintains or updates the site. Be sure to make note of any dates available. Other key pages to look for include Guestbook, Comments, Feedback, and Contact. These pages often list comments submitted by other website visitors, and in these comments visitors may reference the claimant. For example, “Great site Christy. Thanks for emailing me with those fantastic pictures!” gives the investigator a good clue that Christy is likely the person who is actually operating the website. Guestbooks usually list the email addresses or contact information of the people who post messages, giving investigators a list of people who may have interacted with the claimant. However, beware of relying heavily on this information, as it is easily manipulated or falsified and may not be accurate. Web logs, commonly referred to as “blogs,” are publicly accessible online journals that may also provide detailed information about a person.


 

Recovering Missing Webpages

The Internet is ever-changing.  Unfortunately for investigators who often look into the past, changes to websites occur frequently and what was posted yesterday may not be available tomorrow (Cohen, 2003, Conducting).  Anyone who has encountered the familiar  HTTP 404 Page Not Found! error knows how frustrating this is.  Fortunately, there are ways to view the web historically. 

 

Drill up & Drill Down

Since websites are always changing and search engine indexes are not immediately updated, sometimes the pages we expect to see are no longer available. Occasionally, this can be due to a site reorganization. In such cases, the same material may still be available, but the page address or URL has changed. There are two ways to attempt to find the desired information: drilling up and drilling down. As mentioned in the Anatomy of a URL section above, the slashes in a URL indicate subfolders and are part of the path to an HTML file. If that file has moved, searchers can drill up a level. To do this, simply truncate the text to the right of the last slash in the URL (Barker, 2003, Evaluating). If the address was:

http://www.sample.com/conferences/2004/sessions.html#Intro

then, try truncating the last section, leaving only the following:

http://www.sample.com/conferences/2004/

If this also gives an error, continue the process until arriving at a webpage that works.

http://www.sample.com/conferences/

http://www.sample.com/

Notice that in this example, we drilled up all the way to the homepage to find a working page.  As an alternative to the drill-up technique, users can drill down by going directly to the homepage and looking for new links to the information desired.  In this case, there might be a link to "Conferences" on the homepage that would bring the user to the information that is desired.
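The drill-up steps above can be generated automatically. The sketch below takes a URL and returns each shorter address to try, in order, ending at the homepage; the function name drill_up is hypothetical. It only builds the list of addresses and does not check which of them actually work.

```python
from urllib.parse import urlsplit, urlunsplit


def drill_up(url):
    """Return the addresses to try, one folder level up at a time."""
    parts = urlsplit(url)
    path = parts.path
    candidates = []
    while path not in ("", "/"):
        # Drop everything to the right of the last slash; if the path
        # already ends in a slash, drop the last folder instead.
        trimmed = path.rstrip("/")
        path = trimmed[: trimmed.rfind("/") + 1]
        # The bookmark (#Intro) is discarded along with the filename
        candidates.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return candidates


for address in drill_up("http://www.sample.com/conferences/2004/sessions.html#Intro"):
    print(address)
```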

 

Cached Pages

Like other search engines, Google creates an index of sites that users can search. But unlike most other search engines, Google also generates a cached snapshot of each webpage in the index. That is, Google takes a picture of the page as it looks when it is indexed (Google, 2003, Cached). This snapshot is saved in the index and made available to users. The cached snapshot is updated frequently, as frequently as the index itself. Therefore, the cached version of the page is usually not very old. To use this feature, conduct a query using Google. When the results are listed, look for a "Cached" link under each result.

Clicking this link will take you to Google's cached version of the page that is saved in the index. An extra hint: searching for a URL with Google provides some additional options. Try it out. For example, search for www.microsoft.com.

 

Internet Archive Wayback Machine

Another service that provides historical views of webpages is the Internet Archive Wayback Machine. When a URL is entered in the search box provided, the user is presented with a directory of stored copies of the webpage by date. Clicking a date brings up the stored version. Sometimes graphics may be unavailable, but often the page is shown intact. This tremendous resource catalogues some webpages back to 1996. While it might be an interesting novelty to see what Yahoo.com looked like back in 1996, the Internet Archive can be used as a powerful investigative tool to see what a website looked like days, months, or years ago. It may also be helpful to note the progression of changes that occur on a particular website over a period of time. For example, check a claimant’s personal website for what was posted immediately before and after the disabling event.
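Stored copies can also be requested by date directly in the address bar. The sketch below builds Wayback Machine addresses of the form web.archive.org/web/ followed by a date stamp and the original URL; the archive then shows the stored copy closest to that date, if one exists. The function names, the 30-day window, and the example date are assumptions chosen for illustration.

```python
from datetime import date, timedelta

WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(url, when):
    """Build the archive address for a page as of a given date."""
    return f"{WAYBACK_PREFIX}/{when.strftime('%Y%m%d')}/{url}"


def before_and_after(url, event, days=30):
    """Addresses for the copies nearest `days` before and after an event."""
    window = timedelta(days=days)
    return wayback_url(url, event - window), wayback_url(url, event + window)


# Hypothetical event date used only to show the output format
for address in before_and_after("http://www.sample.com/", date(2003, 6, 15)):
    print(address)
```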


 

 


 

   
  © 2003-2004 James D. Ruotolo.  All rights reserved.

last updated November, 2003