In this chapter:
Anatomy of a URL
Determining Domain Ownership
Recovering Missing Webpages
Understanding the anatomy of a Uniform Resource
Locator (URL) is essential when conducting online research. Frequent
users have become accustomed to seeing these URLs in their browser’s
address bar but not everyone knows what they mean. The following URL
will be used as an example:

This single address actually contains five
distinct segments and each of them provides some very basic information.
Protocol. In the example
described above, the protocol is http,
which stands for Hypertext Transfer Protocol
(Bell, 2003).
There are other types of protocols as well. Common ones include
ftp and telnet. The protocol in a URL is always followed by a colon
and two forward slashes (http:// or
telnet:// for
example). HTTP is the protocol used for the overwhelming
majority of web content that most users are familiar with.
Investigators may also encounter the https://
protocol. The addition of an "S" indicates Secure Sockets Layer (SSL)
encryption is being used. This protocol is frequently used for
secure ecommerce transactions and is accompanied by a key or lock symbol
in the lower right-hand corner of the browser window. For more
information on SSL, visit
Introduction to SSL, provided by Netscape.
TLD   | Meaning      | TLD   | Meaning
.com  | Company      | .mil  | Military
.org  | Organization | .biz  | Business
.gov  | Government   | .us   | United States
.edu  | Educational  | .name | Personal
Bookmark or Anchor. In this example, the URL contains some
additional characters not always seen in Internet addresses. The
#A
in the URL is a bookmark, also known as an anchor (Bell, 2003). The author of the page has created bookmarks
within the page to tell the browser to scroll to a certain point on the
webpage. The title of the bookmark, "A" in our example, is set by
the webpage author.
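As a rough illustration of how an address breaks apart, the sketch below splits a URL with Python's standard urllib.parse module. The sample address is borrowed from the Drill up & Drill Down discussion later in this chapter. The labels "domain", "path", and "filename" are assumptions chosen for this illustration, as is the helper name url_segments.

```python
from urllib.parse import urlsplit


def url_segments(url):
    """Split a URL into five labeled pieces (labels are illustrative)."""
    parts = urlsplit(url)
    # Separate the folder portion of the path from the file name.
    directory, _, filename = parts.path.rpartition("/")
    return {
        "protocol": parts.scheme,      # text before "://"
        "domain": parts.hostname,      # e.g. www.sample.com
        "path": directory + "/",       # folders leading to the file
        "filename": filename,          # the html document itself
        "anchor": parts.fragment,      # bookmark after the "#"
    }


if __name__ == "__main__":
    sample = "http://www.sample.com/conferences/2004/sessions.html#Intro"
    for label, value in url_segments(sample).items():
        print(f"{label}: {value}")
```

The bookmark is never sent to the web server. The browser uses it only to decide where to scroll once the page has loaded.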
Whois
It is not uncommon for a claimant who has access to
the Internet to have his or her own website. Increasingly, Internet
Service Providers (ISPs) offer free website hosting as part of internet
service plans. Many times individuals will include an “About Me”
section which generally provides personal details and photos.
Investigators can find useful information about functionality and daily
activities in these areas. One
claimant even had color photos of his surgery posted on his own
website! Claimants operating businesses may have a website for the
business as well. Identifying unreported income or functionality
that is inconsistent with the stated limitations associated with a
particular diagnosis is key information during a disability claim fraud
investigation. If a website suspected to belong to the claimant is
encountered, it is important to determine who the true owner of the
website is.
Now that we know what the components of a URL are, how do we find out
who owns a website? Sometimes,
the domain owner seems obvious. Everyone knows that the
microsoft.com domain belongs to
Microsoft Corporation. However, making assumptions here is
dangerous. For example, many unsuspecting users have visited the
whitehouse.com domain looking for
information about the President but found a website featuring adult
content instead.
Fortunately, there is a way to verify the owner of a website.
Whois (pronounced "who is") utilities
provide information about the registered owners of domain names within Top Level Domains (TLDs) such as .com.
There are many whois services and
users are encouraged to try several when attempting to verify registration
information. As an example, searching for the domain
smith.com with
Network Solutions Whois returns:
Registrant:
Smith International, Inc. (SMITH20-DOM)
16740 Hardy Street
Houston, TX 77032
US
Domain Name: SMITH.COM
Administrative Contact:
Wheatley, Sherry (SW704) swheatley@SMITH-INTL.COM
Smith International
16740 Hardy Street, Mailbox #26
Houston, TX 77205-0068
US
7132335362 fax: 7132335237
Technical Contact:
Handley, Rick (RH1966) rhandley@SMITH.COM
Smith International
PO Box 60068
Houston, TX 77205-0068
US
713-233-5101 fax: 713-233-5237
Record expires on 06-Dec-2006.
Record created on 07-Dec-1998.
Database last updated on 5-Nov-2003 20:59:16 EST.
Domain servers in listed order:
NS.SMITH-INTL.COM 206.229.216.2
NS3-AUTH.SPRINTLINK.NET 144.228.255.10
NS2-AUTH.SPRINTLINK.NET 144.228.254.10
NS1-AUTH.SPRINTLINK.NET 206.228.179.10
In this example, Smith International, Inc. is the
registered domain owner and contact information is also provided. An administrative contact is listed. Note that the
administrative contact person is from the same company. This may
not always be the case. Sometimes, a company or organization may
outsource the administration of their website to another agency.
However, this information is usually representative of the true owner
and person responsible for the website content. This is key
information for the investigator. Also take note that, in this
case, an email address is provided.
A technical contact is also listed. This person or
group is generally responsible for the actual hosting of the website and
may often be an ISP. Frequently, this person or agency may not be
responsible for the content of the site in any way but rather is only
responsible for hosting the site on a web server.
Dates of creation and registration expiration along
with domain server names are
noted near the bottom of the record.
Other popular whois services include
AllWhois and
FasterWhois.
Most of these search the common TLDs like .com,
.net, and .edu.
Some domains are restricted and require their own whois search.
Visit the ICANN
website for more information on how to search restricted domains.
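For readers who want to see what a whois utility does behind the scenes, here is a minimal sketch in Python. It assumes the plain-text WHOIS protocol on TCP port 43 and a record laid out like the smith.com example above. The function names whois_query and record_dates are hypothetical, the caller must supply the server name, and records returned by other services may be formatted differently.

```python
import re
import socket


def whois_query(domain, server, port=43, timeout=10):
    """Send one query (domain name plus CRLF) and read until the server closes."""
    with socket.create_connection((server, port), timeout=timeout) as conn:
        conn.sendall(domain.encode("ascii") + b"\r\n")
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def record_dates(record_text):
    """Pull the created and expires dates from a record like the one above."""
    dates = {}
    for label in ("created", "expires"):
        # Matches lines such as "Record created on 07-Dec-1998."
        pattern = rf"Record {label} on (\d{{1,2}}-[A-Za-z]{{3}}-\d{{4}})"
        match = re.search(pattern, record_text)
        if match:
            dates[label] = match.group(1)
    return dates


if __name__ == "__main__":
    sample = "Record expires on 06-Dec-2006.\nRecord created on 07-Dec-1998.\n"
    print(record_dates(sample))
```

Calling whois_query("smith.com", server) needs a live network connection and the host name of a whois server for the TLD in question. The parsing helper works on any saved copy of a record, which makes it easy to log creation and expiration dates in a case file.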
Hidden HTML
Also, it may be useful to check out the html code
for a webpage. Sometimes web developers insert comments which are
generally hidden from view unless the actual code is displayed. These
comments are generally used for development purposes – notes to other
developers, references to a particular programming method, and so on.
However, they may contain references to the author. To view the html
code for the webpage currently being viewed in a browser:
Internet Explorer 5.x: click View → Source
Netscape Navigator 4.x: click View → Page Source
To identify comments in html code, look for the
Less Than and Exclamation Point characters followed
by two dashes. For example, <!-- Hello --> is
an html comment. The <!-- sequence denotes the beginning
of the comment and the --> sequence denotes the end. The text in between
is hidden from webpage viewers but is visible in the
actual html code. A comment might look like:
<!-- Sample Webpage -->
<!-- Created by Christy Johnson -->
<!-- December 5, 2001 -->
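When a page is long, scanning the source by eye is tedious. The sketch below shows one possible way to list every comment in a saved html file using Python's standard html.parser module. The names CommentCollector and extract_comments are made up for this illustration.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Collects the text of every <!-- ... --> comment it encounters."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        # data is the text between "<!--" and "-->"
        self.comments.append(data.strip())


def extract_comments(html_text):
    collector = CommentCollector()
    collector.feed(html_text)
    collector.close()
    return collector.comments


if __name__ == "__main__":
    page = (
        "<html><!-- Sample Webpage -->\n"
        "<!-- Created by Christy Johnson -->\n"
        "<body><p>Welcome</p><!-- December 5, 2001 --></body></html>"
    )
    for comment in extract_comments(page):
        print(comment)
```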
File Properties
Another simple method of getting some historical
information about a webpage is simply checking the properties of the html document that
is being viewed. If a user is looking at a webpage in a browser, the
page properties can be viewed by:
Internet Explorer 5.x: click File → Properties
Netscape Navigator 4.x: click View → Page Info
The information contained in the properties box
includes the date the file was created and last modified. These dates
may be useful in determining when a person was actively involved in
editing or changing a website and also when to go back and view a
historical copy of a website. However, be careful when using this
information as it may not be accurate (Barker, 2003, Evaluating).
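The modified date a browser reports is typically based on the Last-Modified header that the web server sends with the page, when the server sends one at all. As a hedged sketch under that assumption, the Python below requests only the headers for a page and converts the header value to a date. The function names are hypothetical, and the same caution about accuracy applies because a server can report any date it likes.

```python
from email.utils import parsedate_to_datetime
from urllib.request import Request, urlopen


def fetch_last_modified(url, timeout=10):
    """Ask the server for headers only; return the raw Last-Modified value or None."""
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=timeout) as response:
        return response.headers.get("Last-Modified")


def parse_last_modified(header_value):
    """Convert an HTTP date such as 'Wed, 05 Dec 2001 14:30:00 GMT' to a datetime."""
    return parsedate_to_datetime(header_value)


if __name__ == "__main__":
    # Parsing a saved header value does not require a network connection.
    print(parse_last_modified("Wed, 05 Dec 2001 14:30:00 GMT").isoformat())
```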
Other Techniques
There are some less official methods for checking
the identity of a website or webpage owner. Certain sections of a website
are likely to list information about the author. Obviously,
investigators should look for “About Me” or “Resume” pages. Also be
on the lookout for identification information on the bottom of pages.
Credit lines such as “Website created by,” “designed by,” “maintained by,” or “hosted by”
could be useful in identifying the person who maintains or updates the
site. Be sure to make note of any dates available. Other key
pages to look for include Guestbook, Comments, Feedback, and Contact.
These pages often have listings of comments submitted by other website
visitors. In these comments visitors may reference the claimant. For
example, “Great site Christy. Thanks for emailing me with those
fantastic pictures!” gives the investigator a good clue that Christy is
likely the person who is actually operating the website. Guestbooks
usually list the email addresses or contact information of the people
who post messages, giving investigators a list of folks who may have
interacted with the claimant. However, beware of relying heavily
on this information as it is easily manipulated or falsified and may not
be accurate.
Web Logs, commonly referred to as “Blogs,” are online publicly
accessible journals which may describe detailed information about a
person.
The Internet is ever-changing. Unfortunately for
investigators who often look into the past, changes to websites occur
frequently and what was posted yesterday may not be available tomorrow
(Cohen, 2003, Conducting).
Anyone who has encountered the familiar
HTTP 404 Page Not Found! error knows how frustrating this is. Fortunately, there are ways to view the web historically.
Drill up & Drill Down
Since websites are always changing and search
engine indexes are not immediately updated, sometimes the pages we
expect to see are no longer available. Occasionally, this can be
due to a site reorganization. In such cases, the same material may
still be available but the page address or URL has changed. There
are two ways to attempt to find the desired information: drilling up and
drilling down. As mentioned in the
Anatomy of a URL section above, the slashes in a URL indicate a
subfolder and are part of the path of an html file. If that file
has moved, searchers can drill up a level. To do this, simply
truncate the text to the right of the last slash in the URL (Barker,
2003, Evaluating). If
the address was:
http://www.sample.com/conferences/2004/sessions.html#Intro
then try truncating the last section, leaving only
the following:
http://www.sample.com/conferences/2004/
If this also gives an error, continue the process
until arriving at a webpage that works.
http://www.sample.com/conferences/
http://www.sample.com/
Notice that in this example, we drilled up all the
way to the homepage to find a working page. As an alternative to
the drill-up technique, users can drill down by going directly to the
homepage and looking for new links to the information desired. In
this case, there might be a link to "Conferences" on the homepage that
would bring the user to the information that is desired.
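The drill-up steps described above can be sketched in a few lines of Python. The function name drill_up is made up for this example. It only produces the list of addresses to try, in order, and leaves the actual checking to the investigator.

```python
from urllib.parse import urlsplit, urlunsplit


def drill_up(url):
    """Return the successively shorter addresses to try, ending at the homepage."""
    parts = urlsplit(url)
    path = parts.path
    candidates = []
    while path not in ("", "/"):
        # Drop any trailing slash so each pass removes one full segment,
        # then cut everything to the right of the last remaining slash.
        path = path.rstrip("/")
        path = path[: path.rfind("/") + 1]
        # The bookmark and any query text are discarded along the way.
        candidates.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return candidates


if __name__ == "__main__":
    start = "http://www.sample.com/conferences/2004/sessions.html#Intro"
    for address in drill_up(start):
        print(address)
```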
Cached Pages
Like other search engines, Google creates an index
of sites that users are allowed to search. But unlike most other
search engines, Google also generates a cached snapshot of the webpages
in the index. That is, Google takes a picture of the page as it
looks when it is indexed (Google, 2003, Cached). This snapshot is saved in the index and
made available to users. The cached snapshot is refreshed
as frequently as the index is updated.
Therefore, the cached version of the page is usually not very old. To use
this feature, conduct a query using Google. When the results are
listed, look for a "Cached" link under each result.
Clicking this link will take you to Google's
cached version of the page that is saved in the index. An extra
hint: searching for a URL with Google provides some additional options.
Try it out. For example, search for
www.microsoft.com.
Internet Archive Wayback Machine
Another service that provides historical views of
webpages is the Internet Archive Wayback Machine.
By entering a URL in the search box provided, the user is provided with a
directory of stored copies of the webpage by date. Clicking a date
brings up the stored version. Sometimes graphics may be unavailable but
often the page is shown intact. This tremendous resource catalogues
some webpages back to 1996. While it might be an interesting
novelty to see what Yahoo.com looked like back in 1996, the Internet Archive can be used as
a powerful investigative tool to see what a website looked like days, months, or
years ago. It may also be helpful to note the progression of changes
that occur on a particular website over a period of time. For example, check a claimant’s
personal website for what was posted immediately before and after the
disabling event.
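Stored copies can also be reached by typing an address directly. The sketch below assumes the Wayback Machine's address pattern of web.archive.org/web/ followed by a date stamp and the original URL, along with the Internet Archive's availability lookup address. It only builds the addresses and does not contact the service. The function names are made up, and the date 20031105 is an arbitrary example.

```python
from urllib.parse import urlencode


def wayback_snapshot_url(page_url, timestamp):
    """Address of the stored copy closest to timestamp (YYYYMMDD or YYYYMMDDhhmmss)."""
    return f"https://web.archive.org/web/{timestamp}/{page_url}"


def wayback_availability_query(page_url, timestamp):
    """Address that asks whether a copy exists near the given date."""
    query = urlencode({"url": page_url, "timestamp": timestamp})
    return "https://archive.org/wayback/available?" + query


if __name__ == "__main__":
    # Example: look for a copy of a site from around a date of interest.
    print(wayback_snapshot_url("http://www.sample.com/", "20031105"))
    print(wayback_availability_query("www.sample.com", "20031105"))
```

Building one address for a date shortly before the disabling event and another for a date shortly after gives a quick before-and-after comparison, provided the archive holds copies from that period.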
Proceed to Chapter 4: Searching the Web