Harley Hahn's
Internet Advisor


Chapter 4...

Internet Addresses

On the Internet, every computer, every person and every resource has its own address. One of the basic skills I want to teach you is how to understand and use these addresses. For example, here is the address of a particular Web page:

http://www.harley.com/25-things/index.html

By the time you finish this chapter, you will understand each part of this address, as well as the most important ideas related to Internet addresses.

Hostnames and Top-Level Domains

Every computer on the Internet has its own, unique name. Because Internet computers are sometimes referred to as HOSTS, these unique names are called HOSTNAMES. Here are some typical examples:

www.harley.com
ftp.microsoft.com
architecture.mit.edu
ucsd.edu
www.senate.gov
eff.org
mail.pacbell.net
www.dofa.gov.au
www.austemb.org.cn
www.culture.fr
www.royal.gov.uk
www.rcmp-grc.gc.ca
www.cs.ait.ac.th

Notice that each hostname has two or more parts separated by periods. When you say such a name out loud, you pronounce the period as "dot". For example, www.harley.com is pronounced "w-w-w dot Harley dot com".

Hostnames are part of a system called DNS, the domain name system (which we will discuss later in the chapter). DNS is the system that allows us to give a unique name to every computer on the Internet.

The rightmost part of the name is called the top-level domain (for a reason I will explain later), and tells us general information about the host. For instance, within the name architecture.mit.edu, the top-level domain (edu) tells us that this computer is managed by an educational institution (in this case, MIT).

When you type on a computer keyboard, small letters are called LOWERCASE and capital letters are called UPPERCASE.

With hostnames, you can use either lower- or uppercase letters or a mixture. For example, the following hostnames are all equivalent:

www.harley.com
WWW.HARLEY.COM
Www.Harley.Com
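
Here is a minimal sketch, in Python, of how a program might treat such names as equivalent. The function name same_hostname is made up for this illustration; the only idea it relies on is that letter case does not matter in a hostname.

```python
def same_hostname(a: str, b: str) -> bool:
    """Return True if two hostnames differ only in letter case."""
    return a.lower() == b.lower()


names = ["www.harley.com", "WWW.HARLEY.COM", "Www.Harley.Com"]

# Compare every spelling against the first one.
print(all(same_hostname(names[0], name) for name in names))
```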

— hint —

On the Internet, the custom is to use lowercase — especially among people who know what they are doing — and that is what you will see almost all the time. Although you can use uppercase when you type a hostname, I personally think lowercase looks a lot nicer, once you get used to it.

(As always, I encourage you to think like me.)

There are two types of top-level domains: ORGANIZATIONAL DOMAINS and GEOGRAPHICAL DOMAINS. Organizational domains describe a category, while geographical domains indicate a particular country. For example, the organizational domain edu is for educational institutions, while the geographical domain au indicates the country of Australia.

Figure 4-1 shows the organizational domains. Figure 4-2 shows some of the geographical domains. There are actually a great many such top-level domains — one for every country on the Net — and for reference I have put the complete list in Appendix A.

As a general rule, organizational domains are used inside the U.S., while the geographical domains are used in other countries. You will, however, see many exceptions.

Figure 4-1: Organizational top-level domains

Domain    Description
aero      air-transport industry
biz       miscellaneous [businesses]
com       miscellaneous [commercial]
coop      cooperative organizations
edu       United States universities (educational)
gov       United States federal government
info      miscellaneous [information]
int       international organizations
mil       United States military
museum    museums
name      miscellaneous [individuals]
net       miscellaneous [network providers]
org       miscellaneous [organizations]

Figure 4-2: Examples: geographical top-level domains

Domain    Country/Region
at        Austria
au        Australia
be        Belgium
ca        Canada
ch        Switzerland (Confoederatio Helvetica)
cn        China
de        Germany (Deutschland)
dk        Denmark
es        Spain (España)
fr        France
gr        Greece
ie        Republic of Ireland
it        Italy
jp        Japan
nz        New Zealand
th        Thailand
uk        United Kingdom
us        United States

— hint —

When you see a hostname, look at the top-level domain (the rightmost part of the name). If the top-level domain has three or more letters, it is an organizational domain, and you can look up the meaning in Figure 4-1.

If the top-level domain has two letters, it is a geographical domain, probably representing a country. If you don't recognize the abbreviation, look it up in Figure 4-2 or in the master list in Appendix A.
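
The rule of thumb in this hint can be sketched in a few lines of Python. The function names below are made up for the illustration, and the code simply applies the length test described above; it does not consult any official list of domains.

```python
def top_level_domain(hostname: str) -> str:
    """Return the rightmost part of a hostname, in lowercase."""
    return hostname.lower().rsplit(".", 1)[-1]


def domain_type(hostname: str) -> str:
    """Apply the hint: two letters means geographical, otherwise organizational."""
    tld = top_level_domain(hostname)
    return "geographical" if len(tld) == 2 else "organizational"


for name in ["architecture.mit.edu", "www.dofa.gov.au", "www.culture.fr"]:
    print(name, top_level_domain(name), domain_type(name))
```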

Why Are There Two Types of Top-Level Domains?

In its formative years, the Internet was confined to the United States, and there were only a handful of top-level domains: edu, com, gov, mil, org, net and int. When the Internet expanded to other countries, the geographical domains were added so each country could have its own top-level domain.

The original intention was that, eventually, all Internet hostnames would use geographical domains, and indeed, the United States does have a us domain. However, by the time this domain was introduced, the organizational domains had been used for so long that few people in the U.S. were willing to change.

For this reason, you will see two main types of top-level domains: the organizational domains, used mostly in the U.S., and the geographical domains, used everywhere else. (This should not surprise you. After all, the United States is the only country in the world that does not use the metric system.)

The U.S. geographical domain (us) is used by many schools and local governments. For example, the address of the Web site for Austin Community College in Texas is www.austin.cc.tx.us, and the address of the Web site for the city of San Francisco, California, is www.ci.sf.ca.us.

— hint —

Many people prefer to avoid the us top-level domain because the names are too complicated.

Exceptions to the Rules

How nice (but how boring) life would be if everyone always cooperated. Although the guidelines for using top-level domains are clear, not everybody follows the rules.

To be sure, some of the top-level domains are used consistently. Within the United States, for instance, gov is used only by the federal government, mil is used only by the military, and edu (educational) is used only by universities. In addition, int is used only by international organizations. Similarly, outside the U.S., the two-letter geographical top-level domains (au, ca, jp, uk, and so on) are almost always used appropriately.

However, com, net and org are a different story. The com designation was supposed to be for commercial organizations, net was supposed to be for network providers, and org was supposed to be for organizations that do not fit into any other category, such as nonprofit organizations. Nevertheless, there are many hostnames that use com, net or org that do not meet the criteria. The reasons are threefold.

First, in the mid-1990s, the Internet expanded so fast as to surprise almost everyone. Many people and organizations in the U.S. wanted their own unique hostnames, but there were not enough good names to go around. At the same time, the organization that administered the com, net and org top-level domains did not enforce the guidelines, so people pretty much did as they pleased.

So what do you think happened when the network administrator at the Acme Company tried to register the name acme.com, and he found that the name was already taken? He simply registered acme.net or acme.org instead. (I explain how to register a name in Chapter 16.)

Moreover, as the organizational top-level domain names became popular, many people wanted their own personalized hostnames, so they registered names like alan.com and harley.com. Other people who found these names already taken registered variations like alan.net, alan.org, harley.net, harley.org, and so on.

Finally, to make matters even more confusing, people and organizations outside the U.S. started to use these same top-level domains as well (even though they should have used their own geographical top-level domains).

Through it all, the registrations for com, net and org were accepted without anyone checking whether they were being used according to the guidelines. As a result, there are many hostnames that end in com that are not used by commercial entities, and there are hostnames that end in net or org that are not used by network providers or organizations.

Is this bad? Maybe yes and maybe no. How you feel about the situation depends on how much you like things to be orderly and well organized. One thing, however, that everyone can agree on is that the Internet needs more names, especially in the United States. As I explained earlier, there is a us top-level domain, but few people want to use it because the hostnames are so complicated.

The problem with the com, net and org domains arose because of the huge demand for top-level domains. To help alleviate the problem, two new domains were added in 2001: info, for anyone who wants to use it, and biz, which is supposed to be used by businesses. In 2002, several other domains were added: aero (air-transport industry), coop (cooperative organizations), museum (museums) and name (individuals).

Countries That Sell Access to Their Top-Level Domains

The two-letter geographical domains were set up so that each country would have its own top-level domain: ca for Canada, fr for France, us for the United States, and so on. As a result, every country in the world has a code. (See Appendix A for the complete list.)

Most of the time, a computer using a geographical domain is actually in that particular country. However, this is not always the case. Strictly speaking, a geographical domain is controlled by its country, which can use it for any computers it wants. For example, the French embassy for Canada is in Ottawa. However, the embassy's Web site uses an address with the fr top-level domain:

http://www.amba-ottawa.fr/

(The French word for "embassy" is ambassade.)

As I explained above, many of the preferred domain names have already been taken. As a result, some countries use their geographical domains to sell customized names in the global marketplace.

For example, the geographical domain for the country of Tonga is to. Some time ago, someone noticed that this domain (which wasn't being used) looks the same as the English word "to". With the cooperation of the government of Tonga, a business was set up to sell personalized domain names, ending in to, to people all over the world. Thus, you will see Web addresses (which are perfectly legitimate), such as:

http://come.to/something-or-other

http://listen.to/something-or-other

http://welcome.to/something-or-other

The Kingdom of Tonga, by the way, is a country of 270 square miles (700 square kilometers), comprising 170 volcanic and coral islands in the South Pacific. Tonga has about 100,000 people, very few of whom actually use the Internet. The to domain business is actually run by a company in the U.S.

When the geographical domain name system was originally set up, the idea was that the top-level domain would identify a particular country. Clearly, when a country licenses the use of its top-level domain to anyone in the world, it runs counter to the spirit of the rules. Moreover, it is confusing, as the top-level domain loses its meaning.

Still, it is being done, and you should be aware of the practice. I have used the to domain as an example, but it is not the only one. Other countries, especially small countries, are licensing the use of their domains. For example, the country of Moldova (a small Eastern European country, northeast of Romania) happens to have the top-level domain md. The government of Moldova has entered into a business arrangement to license md domain names to doctors.

— hint —

If you see a two-letter top-level domain you don't recognize, look it up in the master list in Appendix A.

For example, I received an advertisement asking me to pay money for a name in the cc domain. Would this be a wise thing to do? The advertisement intimated that cc was a brand new Internet domain, and that I had better reserve a name before all the good ones were gone.

However, by checking the list, I was able to confirm that the cc domain is actually supposed to be used by the Cocos Islands (a group of islands, southwest of Sumatra, in the eastern Indian Ocean).

Understanding Hostnames

The secret to understanding hostnames is to look at each part of the name, reading from right to left. In general, the rightmost two parts of the name will tell you which organization manages that particular computer. If a name has more than two parts, the extra parts may provide even more information.

Consider the name ucsd.edu. Reading from right to left, we see that this name represents a computer at a university (edu) which is the University of California at San Diego (ucsd).

Now look at the name architecture.mit.edu. The edu tells us this is also a computer at a university; the mit designation tells us that the university is MIT; and the third part, architecture, shows us that the computer is managed by the architecture department.

Let's take a look at one last example, www.royal.gov.uk. The rightmost part of the name (uk) tells us that this computer is in the United Kingdom. The next part of the name (gov) shows us that the computer is a government computer. The third part (royal) indicates that this computer has something to do with the British royal family.

But what about the leftmost part of the name, the www?

When the Web was first developed, it was called the World Wide Web. As the World Wide Web evolved, the name was shortened, but the old name left a legacy: it has become customary to use the designation www to indicate a computer that acts as a Web server. Thus, www.royal.gov.uk is the name of the computer that hosts the Web site of the British royal family.

As you use the Net, you will see many other hostnames that begin with www. All of these are names of computers that host Web sites. Informally, we often refer to such addresses as if they were the actual Web sites. For example, we might say that www.ibm.com is IBM's Web site, www.microsoft.com is Microsoft's Web site, and www.harley.com is my Web site.

You will also see similar patterns with other types of resources. For instance, there are many anonymous ftp servers whose hostnames begin with ftp. (Anonymous ftp is a system that allows people to download programs and other files for free.) In addition, there are also many mail servers (which provide email services) that have names beginning with mail.

Here are two examples: the computer named ftp.microsoft.com is Microsoft's anonymous ftp server, while the computer mail.pacbell.net is the mail server for an Internet service provider named Pacific Bell Internet.

There are no rules forcing people to use special hostnames, and there are some Web sites whose names do not begin with www. However, when you do see www (or ftp or mail) at the beginning of a hostname, you will know what it means.

Before we leave this section, let's look at one last hostname, www.cs.ait.ac.th. Reading from right to left, we start with the top-level domain th. This is the geographical top-level domain for Thailand.

The next part of the name is ac, a designation commonly used to indicate a university (ac stands for "academic community"). We then see ait, an abbreviation for the Asian Institute of Technology; cs, an abbreviation for "computer science"; and www, which indicates the computer hosts a Web site.

Thus, the hostname www.cs.ait.ac.th represents the Web site of the computer science department at the Asian Institute of Technology in Thailand.
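
Reading a name from right to left is easy to sketch in Python. The function name parts_right_to_left is made up for this illustration; it only splits the name at the periods and reverses the order, leaving the interpretation of each part to you.

```python
def parts_right_to_left(hostname: str) -> list[str]:
    """Split a hostname at the periods and list the parts from right to left."""
    return list(reversed(hostname.lower().split(".")))


# The parts come out in the order we read them: th, ac, ait, cs, www.
for part in parts_right_to_left("www.cs.ait.ac.th"):
    print(part)
```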

Now, do I expect you to be able to figure out such names instantly? No, of course not. Unless you are already familiar with a country, you probably won't recognize its top-level domain or the names of its national organizations. However, I do want you to appreciate that many hostnames are constructed to make sense, at least to the people who make up the names.

The organizational top-level domains — particularly edu, com and gov — are used widely within the U.S. In other countries, you will see a number of variations. In particular, you will see ac (academic community), co (commercial) and gov (government), followed by a two-letter geographical top-level domain.

For example, in Thailand, the Asian Institute of Technology uses ait.ac.th, while in England, Oxford University uses ox.ac.uk. Similarly, the Times newspaper in England uses the-times.co.uk, while the Bank of Tokyo-Mitsubishi in Japan uses btm.co.jp, and in Australia, the Department of the Treasury uses the name treasury.gov.au.

— hint —

In general, each country can use any pattern of domain names within its own top-level domain. However, as you gain experience on the Net, you will learn to recognize a number of common patterns that are followed around the world.

Domains

So far, we have talked about hostnames — the names used for computers on the Internet. I now want to discuss the more subtle concept of domains and how all the hostnames on the Internet are organized.

A DOMAIN is a set of hostnames that have the rightmost part of their names in common. For instance, all the hostnames that end in edu belong to the edu domain. Some examples are:

architecture.mit.edu
www.ucsd.edu
www.med.harvard.edu

Similarly, all the hostnames that end in uk belong to the uk domain. For example:

www.royal.gov.uk
ox.ac.uk
the-times.co.uk

(Notice, by the way, that a hyphen is a legitimate character within a hostname.)

The most general domains, such as edu and uk, are called TOP-LEVEL DOMAINS, because if you draw a diagram of hostnames, showing the most general names above all the others, these domains would be at the top.

As I explained earlier, there are two types of top-level domains: organizational domains and geographical domains (see Figure 4-1 and Figure 4-2).

When hostnames have the two rightmost parts of their names in common, we say that they belong to the same SECOND-LEVEL DOMAIN. For instance, the following four hostnames all belong to the second-level domain pacbell.net:

pacbell.net
news.pacbell.net
mail.pacbell.net
www.pacbell.net

Similarly, these hostnames all belong to the second-level domain gov.au:

treasury.gov.au
hcourt.gov.au
health.gov.au

In such cases, we say that the more specific domain is a SUB-DOMAIN of the more general domain. Thus, pacbell.net is a sub-domain of net, and gov.au is a sub-domain of au.

Similarly, the hostname www.royal.gov.uk belongs to the royal.gov.uk domain, which is a sub-domain of gov.uk, which is itself a sub-domain of uk.

Thus, all the hostnames in the Internet are organized into one large hierarchical system based on domain names. This system is called DNS, the DOMAIN NAME SYSTEM.
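
To make the idea of domain membership concrete, here is a small Python sketch. The function names belongs_to and domain_level are made up for the illustration; the test simply checks whether the rightmost parts of a name match the domain, which is how domains are defined above.

```python
def belongs_to(hostname: str, domain: str) -> bool:
    """Return True if the rightmost parts of hostname match domain."""
    hostname, domain = hostname.lower(), domain.lower()
    # Requiring the leading period stops "notpacbell.net" matching "pacbell.net".
    return hostname == domain or hostname.endswith("." + domain)


def domain_level(domain: str) -> int:
    """Return 1 for a top-level domain, 2 for a second-level domain, and so on."""
    return len(domain.split("."))


print(belongs_to("mail.pacbell.net", "pacbell.net"))
print(belongs_to("www.royal.gov.uk", "gov.uk"))
print(domain_level("gov.au"))
```

The same function also expresses the sub-domain relationship: belongs_to("gov.au", "au") is true because gov.au is a sub-domain of au.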

IP Addresses

In a minute, we'll talk about DNS in more detail, but first I want to introduce you to the idea of IP addresses.

As you know, hostnames are the unique names we use to identify computers on the Internet. Hostnames are convenient because they are simple to use, and (once you get used to them) they are easy for human beings to remember.

However, the real work on the Internet is done by computer programs, not by human beings, and it is easier for programs to deal with numbers, not names. For this reason, every computer on the Internet is given a unique number, and, internally, the Net uses these numbers, not hostnames, to identify specific computers. (These numbers are similar, in spirit anyway, to the personal identification numbers — such as the U.S. Social Security number — that most countries issue to their citizens.)

In Chapter 1, I explained that the glue that holds the Internet together is a family of protocols (technical specifications) called TCP/IP. The most important protocol is IP (Internet Protocol). For this reason, the numbers used to identify Internet computers are called IP ADDRESSES or IP NUMBERS.

As we discussed in Chapter 1, we make use of the Internet by using client programs to request information from servers. For instance, to access the Web, we use a Web client, called a browser, to contact various Web servers on our behalf. What you need to understand is that, the way the Internet is set up, your browser must have an IP address in order to contact a server.

Of course, when we tell a client program the name of a computer, we use a hostname. This means that, before the client can carry out our request, it must translate that hostname into the corresponding IP address.

Let's say, for example, you want to visit my Web site. The address is www.harley.com. In order to contact my Web server, your browser must find out the IP address that corresponds to this address. To do so, the browser calls upon DNS to translate the hostname into an IP address. In this case, the IP address happens to be 209.221.208.10, and once the browser has this information, it is a simple matter to connect to computer number 209.221.208.10 and request data from the Web site.
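
If you are curious, you can ask for this translation yourself. The Python sketch below uses the standard socket.gethostbyname function, which hands the name to your computer's resolver. The wrapper name lookup is made up for the illustration.

```python
import socket


def lookup(hostname: str) -> str:
    """Ask the system resolver (which uses DNS) for an IPv4 address."""
    return socket.gethostbyname(hostname)


# A name that is already an IP address is returned unchanged,
# so this line works even without an Internet connection.
print(lookup("127.0.0.1"))
```

Calling lookup("www.harley.com") while connected to the Net would return whatever address DNS reports at that moment, which may not be the address printed in this chapter.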

— hint —

When we talk about the "address" of a computer, we can mean two different things:

  • To a person, the address of a computer is its hostname.
  • To a program, the address of a computer is its IP address.

All IP addresses have the same structure: four numbers separated by periods. The numbers can range from 0 to 255, but the actual details need not concern us, because, after all, IP addresses are for programs and computers, not for people.

For fun, however, here are some of the hostnames I mentioned earlier in the chapter, along with their IP addresses:

Figure 4-3: Sample Hostnames, Corresponding IP Numbers

Hostname                  IP Number
www.harley.com            209.221.208.10
ftp.microsoft.com         207.46.133.140
architecture.mit.edu      18.113.0.177
ucsd.edu                  132.239.1.1
www.senate.gov            156.33.195.33
eff.org                   204.253.162.2
mail.pacbell.net          64.164.98.8
www.dofa.gov.au           147.211.50.107
www.austemb.org.cn        211.99.187.99
www.culture.fr            143.126.211.220
www.royal.gov.uk          193.32.29.66
www.cs.ait.ac.th          192.41.170.129
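
The structure described above (four numbers, each from 0 to 255, separated by periods) is simple enough for a program to check. Here is a minimal Python sketch using the standard ipaddress module; the function names are made up for the illustration.

```python
import ipaddress


def is_ip_address(text: str) -> bool:
    """Return True if text is four numbers from 0 to 255 separated by periods."""
    try:
        ipaddress.IPv4Address(text)
        return True
    except ipaddress.AddressValueError:
        return False


def four_numbers(text: str) -> list[int]:
    """Return the four numbers that make up an IP address."""
    return list(ipaddress.IPv4Address(text).packed)


print(is_ip_address("209.221.208.10"))
print(is_ip_address("209.221.208.256"))
print(four_numbers("209.221.208.10"))
```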

DNS

The main job of DNS — the domain name system — is to translate hostnames into IP addresses. Everything is done behind the scenes, but I do want to take a moment to explain a bit about how it works, because it is so cool I know you will appreciate it.

Suppose someone gave you the task of keeping track of all the hostnames in the world and their corresponding IP addresses. How would you do it? One way would be to maintain a master table of all the hostnames and IP addresses. Whenever you had a specific inquiry, you could look up the hostname in the table and read off the appropriate IP address.

This is more or less the scheme that was used in the early days of the Internet, but as the years passed and the Net grew large, the hostname table grew as well and eventually became much too big. Moreover, new hostnames were being added continually, and it became very difficult to keep the table up to date. The solution was to distribute the responsibility around the Net.

As I explained in Chapter 1, the Internet is constructed such that every organization manages its own part of the Net. For this reason, it makes sense for each organization to manage its own hostnames as well. In particular, specific domains are used by organizations, who are then responsible for maintaining all the hostnames and IP addresses within those domains. For example, IBM has ibm.com, Harvard University has harvard.edu, the Australian government has gov.au, and so on. Each organization must arrange for at least two computers, called NAME SERVERS, to provide addressing information for all the hostnames in its domain. (Two servers are used in case one of them goes down temporarily.)

Because DNS distributes the management of the various domains, there is no need for anyone to maintain a gigantic master list of every computer on the Net. Whenever a program needs to translate a hostname into an IP address, all the program has to do is find the name server that handles that particular domain.

But out of all the name servers on the Net, how can a program find the one it needs? The solution is to start with the top-level domain and work its way down.

At various places around the Net, there are a number of special computers called ROOT NAME SERVERS. Each root name server maintains a list of the name servers that handle top-level domains (such as com, edu, au and uk). Each of these name servers maintains a list of other name servers that handle the second-level domains, and so on.

When a program needs the IP address for a hostname, the program starts at a root name server and works its way from one server to the next until it finds the one that has the required IP address.

Here is an example. Say that a program needs to find the IP address of the hostname www.ibm.com. The program starts by contacting a root name server to get the IP address of the com name server. The program then contacts the com name server to get the IP address of the ibm.com name server. Finally, the program contacts the ibm.com name server to get the IP address of www.ibm.com.
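
The walk from one name server to the next can be imitated with a toy model in Python. Everything in this sketch is invented for the illustration: the server labels, the table layout and the final IP address (192.0.2.10 is a placeholder, not IBM's real address). Real name servers exchange network messages; here each one is just a small lookup table.

```python
# Each pretend name server maps a name either to another server
# (a "referral") or to a final IP address (an "answer").
SERVERS = {
    "root": {"com": ("referral", "com-server")},
    "com-server": {"ibm.com": ("referral", "ibm-server")},
    "ibm-server": {"www.ibm.com": ("answer", "192.0.2.10")},  # placeholder
}


def resolve(hostname: str) -> tuple[str, list[str]]:
    """Return the IP address and the list of servers consulted along the way."""
    labels = hostname.lower().split(".")
    server = "root"
    consulted = [server]
    # Ask about "com", then "ibm.com", then "www.ibm.com".
    for i in range(len(labels) - 1, -1, -1):
        name = ".".join(labels[i:])
        entry = SERVERS[server].get(name)
        if entry is None:
            raise LookupError(f"{server} knows nothing about {name}")
        kind, value = entry
        if kind == "answer":
            return value, consulted
        server = value
        consulted.append(server)
    raise LookupError(f"no address found for {hostname}")


address, consulted = resolve("www.ibm.com")
print(address, consulted)
```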

In this way, using DNS, it is simple for any program to find out the IP address of any computer on the Net, even though no one maintains a master list. DNS works well because, as the Internet grows, DNS grows as well. For example, when the IBM network administrators create new hostnames (using new IP addresses), all they need to do is update the information in the ibm.com name servers.

Is DNS cool, or what?

How DNS Works for You

Your access to DNS is provided by your ISP (Internet service provider). All ISPs maintain DNS SERVERS for the use of their customers. When one of your client programs needs to find the IP address of a particular computer, the program sends a request to your ISP's DNS server. The DNS server does whatever is necessary to find the IP address, which it then sends back to your program.

Let's take an example.

You decide to visit my Web site, so you tell your Web browser to go to www.harley.com (the hostname for my site). In order to get the appropriate IP address, your browser sends the hostname to your ISP's DNS server.

The DNS server contacts a root name server and gets the IP address of the com name server. The DNS server then contacts the com name server and gets the IP address of the harley.com name server. Finally, the DNS server contacts the harley.com name server and gets the IP address of www.harley.com. In this case, the IP address happens to be 209.221.204.136. This information is then sent to your browser, which can now use the IP address to contact the Web server directly.

DNS is a good system, but it does involve a certain amount of overhead. To save time, each server at every level keeps a list of the most recently requested names and addresses. If a subsequent request comes in for the same information, the server is able to respond right away.

For example, once your DNS server has gone to the trouble of finding the IP address for www.harley.com, the information is kept for a certain period of time (usually somewhere between a half hour and a day), in case another program requests the same address.
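
Here is a rough Python sketch of this kind of short-term memory. The class name CachingResolver, the one-hour default and the way the clock is passed in are all assumptions made for the illustration; real DNS servers decide how long to keep each answer in their own way.

```python
import time


class CachingResolver:
    """Remember recent answers so repeated requests are answered right away."""

    def __init__(self, lookup, keep_seconds=3600.0, clock=time.monotonic):
        self._lookup = lookup        # function that does the slow, real lookup
        self._keep = keep_seconds    # how long an answer stays in the list
        self._clock = clock
        self._recent = {}            # hostname -> (ip_address, expiry_time)

    def resolve(self, hostname: str) -> str:
        hostname = hostname.lower()
        now = self._clock()
        remembered = self._recent.get(hostname)
        if remembered is not None and remembered[1] > now:
            return remembered[0]     # still fresh: no lookup needed
        address = self._lookup(hostname)
        self._recent[hostname] = (address, now + self._keep)
        return address


def slow_lookup(hostname: str) -> str:
    """Stand-in for a real DNS lookup; the address is a placeholder."""
    print("looking up", hostname)
    return "192.0.2.1"


resolver = CachingResolver(slow_lookup)
resolver.resolve("www.harley.com")   # performs the lookup
resolver.resolve("www.harley.com")   # answered from the list of recent names
```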

As you use the Web, you will sometimes notice a delay after you type the address of a Web site. There are several reasons for this delay.

First, your browser must send the hostname to the DNS server and wait for a reply. Once the IP address is sent back, your browser then uses that address to contact the Web server. You then have to wait for data to be sent from the Web server to your computer.

During these delays, you will see informative messages, showing the progress. With the Internet Explorer browser, you will see a series of messages like:

Connecting to site 209.221.204.136
Web site found. Waiting for reply.

Now when you see these messages, you will know what they mean.

Mail Addresses

The most important service on the Internet is electronic mail. To send someone mail, you need to know his MAIL ADDRESS. Similarly, if someone knows your mail address, he can send you mail.

Once you understand hostnames, mail addresses are easy. They all follow the same pattern: a name, followed by an @ character (the "at" sign), followed by a hostname:

name@hostname

Here are some examples:

billg@microsoft.com
charles@royal.gov.uk
pope@vatican.va
president@whitehouse.gov
abuse@aol.com

When you say a mail address out loud, the @ character is pronounced "at". For example, the address billg@microsoft.com is pronounced "Bill G at Microsoft dot com". The address charles@royal.gov.uk is pronounced "Charles at royal dot gov dot U.K.".

Mail addresses are not case sensitive. Thus, the following three addresses are equivalent. However, we generally use all lowercase letters (as in the first address) because it looks nicer:

president@whitehouse.gov
President@Whitehouse.Gov
PRESIDENT@WHITEHOUSE.GOV
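
The name@hostname pattern is simple to take apart in a program. The Python sketch below uses made-up function names, and the comparison follows this chapter's statement that mail addresses are not case sensitive.

```python
def split_mail_address(address: str) -> tuple[str, str]:
    """Split a mail address into its name and hostname at the @ character."""
    name, at_sign, hostname = address.rpartition("@")
    if not at_sign or not name or not hostname:
        raise ValueError(f"not in name@hostname form: {address!r}")
    return name, hostname


def same_mail_address(a: str, b: str) -> bool:
    """Compare two mail addresses while ignoring letter case."""
    return a.lower() == b.lower()


print(split_mail_address("charles@royal.gov.uk"))
print(same_mail_address("president@whitehouse.gov", "PRESIDENT@WHITEHOUSE.GOV"))
```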

Today, electronic mail is used widely, and mail addresses are ubiquitous. (For instance, some phone companies list email addresses in their telephone books.) For this reason, it is common to hear people use the word ADDRESS to refer to an email address.

If you are talking to someone on the Net, and he uses the word "address", you can assume he means an email address. If the person wants to refer to a regular street address, he will usually make it clear by context. This idea is illustrated in the following real-life example:

A conversation in a Web chat room...

UNKNOWN PERSON: Excuse me, do you use Windows on your computer?

YOU: Yes, I do. Why do you ask?

UNKNOWN PERSON: Well, I am Bill Gates, and I am interested in what you think of Windows and other fine Microsoft products. If I give you my personal address, would you mind sending me email with your comments?

YOU: Not at all, Bill, I would be glad to help.

UNKNOWN PERSON: Thank you, that would be great. To show my appreciation, if you give me your postal address, I would be pleased to send you a box of money.

YOU: Sorry, Bill, but I make it a point to never give out personal information on the Net. I hope you understand.

UNKNOWN PERSON: Of course I do, and I admire your good judgment. Well, bye for now.

Where does your personal email address come from? In most cases, you get your address from the company that supplies you with mail service. If you use an ISP for mail service, they will assign you a name to use. For example, if your name is Benjamin Dover and your ISP is the Undependable Internet Company, your email address might be bendover@undependable.com.

Sometimes you can get your ISP to assign you a special name if it is not being used by someone else. So if everyone calls you Benjy, you might ask for the address benjy@undependable.com. The main rule everyone has to follow is that no two people can have the exact same address.

Most addresses are used by a person. However, there are some addresses that are used for special services. Companies often use names like sales, support or feedback to provide mailboxes for the general public. You might, for example, see the Undependable Internet Company set up the address support@undependable.com for customers who need a place to send a message when they have a problem.

Almost all mail systems have a standard address called postmaster to which you can send messages pertaining to the mail service. If you are having trouble sending mail to someone at, say, IBM, you can send a message to postmaster@ibm.com.

— hint —

Many ISPs (Internet service providers) set up a mailbox with the name abuse to field complaints about problems caused by the ISP's customers. For instance, AOL uses the address abuse@aol.com.

If someone sends you threatening messages or bothers you with unsolicited advertising (spam), send a note to the abuse mailbox at that person's ISP. If the ISP does not have an abuse mailbox (and your mail comes back with an error message), send the complaint to postmaster.

Most ISPs take such complaints seriously.


URLs

At the beginning of this chapter, I said that every computer, every person and every resource on the Internet has its own address. We have already talked about two types of addresses: hostnames (for computers) and mail addresses (for people). We will now talk about the addresses we use to identify the vast number of resources available on the Net.

The most popular resources on the Net are the many millions of Web pages stored on Web servers all over the world. Anyone with an Internet connection and a browser (Web client program) can access these pages. Of course, in order to fetch a Web page for you, your browser needs to know where to find the page. To describe the location of Web pages, we use a special type of address called a URL or UNIFORM RESOURCE LOCATOR. The name URL is pronounced as three separate letters, "U-R-L".

When we use a URL to specify the address of a particular resource, we say that the URL POINTS to that resource. For example, here is the URL that points to the main page of my Web site (don't worry about the details just yet):

http://www.harley.com/

Web sites can consist of many separate Web pages, and, strictly speaking, a URL points only to a single page. However, when a URL points to the main page of a Web site, we often say, informally, that the URL points to the site as a whole. Thus, I might say that the URL above points to the Harley Hahn Web site.

URLs can be used to point to all types of resources, not just Web pages. For this reason, URLs were designed to be as general as possible. As you use the Net, you will see two slightly different formats:

scheme://hostname/description
scheme:description

Here are some examples:

http://www.harley.com/

http://www.harley.com/25-things/index.html

http://www.ibm.com/

mailto:billg@microsoft.com

news:rec.pets.cats.anecdotes

ftp://ftp.microsoft.com/Products/msmq/demos.zip

The SCHEME (short for "addressing scheme") identifies the type of resource. Figure 4-4 shows the most commonly used schemes. As you use the Web, most of the URLs you will encounter will point to Web pages and, hence, will use the http scheme. (The name stands for Hypertext Transfer Protocol, the protocol used to transfer Web page data.) However, you will also see mail addresses (mailto), Usenet newsgroups (news) and anonymous ftp files (ftp).

Figure 4-4: The most common schemes used within URLs

Scheme    Meaning
http      Web page (hypertext)
mailto    Mail address
news      Usenet newsgroup
ftp       File accessed via ftp
file      File on your computer

Although there are many other schemes, the ones I talked about are the most common. The only other scheme you are likely to see is file, and you will only see it if you use your browser to display a file that is stored on your own computer.

After the scheme, the next part of a URL is the hostname. This is the name of the computer on which the resource resides. For instance, the following URL points to a Web page on the computer named www.harley.com:

http://www.harley.com/25-things/index.html

Here is another example. This URL points to a file that is available via anonymous ftp from the computer named ftp.microsoft.com:

ftp://ftp.microsoft.com/Products/msmq/demos.zip

If you look at the list of examples above, you will notice that some types of URLs do not need hostnames. These URLs point to resources that, by their nature, do not reside on a specific computer.

An example of this format is mailto, the type of URL that specifies a mail address. If you see a Web page with a mailto URL, and you click on it, your browser will start your mail program and send it the specified address. This makes it easy for you to send a message to that particular address. For example, the following URL will set up a message to be sent to the person whose email address is billg@microsoft.com:

mailto:billg@microsoft.com

Aside from mailto, there is another common type of URL that does not require a hostname. This type of URL specifies the name of a Usenet newsgroup (discussion group). For instance:

news:rec.pets.cats.anecdotes

(By the way, this newsgroup is the one to which people send stories about cats.)

To read the articles in a newsgroup, you use a client program called a newsreader to access a news server. There are many news servers on the Net.

Before you can access Usenet, you need to tell your newsreader the hostname of the news server you will be using. Then, whenever you encounter a URL with a newsgroup name, your newsreader knows which computer to contact. Most people use a news server maintained by their ISP (Internet service provider) as a service to their customers.

For this reason, a URL that points to a newsgroup does not contain a hostname. The URL contains the name of the group, but it is up to your browser to know the location of your news server.

So let's summarize what we have discussed so far.

URLs are designed to describe a variety of resources. There are two common variations of URLs:

scheme://hostname/description
scheme:description

The scheme identifies the type of resource described by the URL. The hostname, if it is included, specifies the location of the computer that contains the resource. The last part of the URL, the description, contains whatever other information is needed to find the exact resource.

With a mailto URL, the description is a mail address; with a news URL, the description is the name of a Usenet newsgroup. However, with an http or ftp URL, the description must show exactly where a particular file resides. Such descriptions can be complicated, so we will discuss them in more detail, one step at a time.
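The summary above can be sketched in a few lines of Python using urlsplit from the standard urllib.parse module. The function name describe_url and the dictionary keys are made up for illustration. The sketch treats whatever follows the hostname (or the scheme, when there is no hostname) as the description.

```python
from urllib.parse import urlsplit


def describe_url(url):
    """Break a URL into the three parts discussed above."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        # hostname is None for URLs such as mailto: and news: that have no host
        "hostname": parts.hostname,
        # drop the slash that separates the hostname from the description
        "description": parts.path.lstrip("/"),
    }


for example in (
    "http://www.harley.com/25-things/index.html",
    "mailto:billg@microsoft.com",
    "news:rec.pets.cats.anecdotes",
    "ftp://ftp.microsoft.com/Products/msmq/demos.zip",
):
    print(describe_url(example))
```

Notice that the mailto and news examples come back with no hostname at all, which matches the second of the two URL formats.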


File Names and Extensions

A computer FILE is a collection of data stored under a specific name. Files can hold any type of data that can be stored on a computer: text, numbers, pictures, sounds, video, and so on. In particular, the Web pages we view with our browsers are all stored in files on some computer or another. Thus, when we use a URL to point to a Web page, we need to specify not only the hostname of the computer, but the name of the exact file we want to look at.

Different computer systems have different rules for how files can be named. I won't go into all the variations, but I do want to mention one common characteristic: file names usually have two parts. The first part is chosen to describe the contents of the file in some way, the second part indicates the type of data stored in the file, and the two parts are separated by a period. Here are two examples:

invoice.doc
sales-order.doc

By looking at the first part of these file names, we can guess that the first file holds an invoice and the second file holds a sales order. The second part of a file name is called an EXTENSION. In this case, both files have the same extension, doc, which indicates that the files are documents. (It is common for word processor programs, such as Microsoft Word, to save files with an extension of doc.)

On the Internet, there are a number of extensions that you will see a lot. The most common is html. This indicates a file that contains hypertext — that is, a Web page. Now take a look at the following URL:

http://www.harley.com/25-things/index.html

Notice it ends with the name of a file, index.html. This tells you that the URL points to a Web page (although I still haven't explained all the details).

There are many different file extensions, and Figure 4-5 shows the ones you are most likely to see on the Internet. For now, don't worry about each type of file and what it means. We will discuss the various types of files as we encounter them throughout the book. I just want you to see a list of the most common extensions all in one place.

Figure 4-5: The most common file extensions used on the Internet

Extension   Pronunciation        Meaning
html        "h-t-m-l"            Web page
htm         "h-t-m"              Web page
asp         "a-s-p"              Web page generated in special way
gif         "giff"; "jiff"       Picture stored in GIF format
jpg         "jay-peg"            Picture stored in JPEG format
txt         "t-x-t"; "text"      Plain text
zip         "zip"                Compressed collection of files
exe         "e-x-e"; "exy"       Executable program
wav         "wave"               Sound/music file
mp3         "m-p-3"; "em-peg"    Music file
mid         "midi"               Music file
mov         "move"               Video (movie) file

When we talk about a file name, we pronounce the period as "dot". The pronunciations of the extensions vary and are shown in Figure 4-5. As an example, the name index.html is pronounced "index dot h-t-m-l", while the name cat.jpg is pronounced "cat dot jay-peg".
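To see the two parts of a file name in action, here is a small Python sketch. It uses splitext from the standard posixpath module to separate the extension, then looks the extension up in a table copied from a few rows of Figure 4-5. The names MEANINGS and file_type are made up for illustration, and the table is deliberately incomplete.

```python
import posixpath

# A few rows copied from Figure 4-5; this is not a complete list.
MEANINGS = {
    "html": "Web page",
    "htm": "Web page",
    "gif": "Picture stored in GIF format",
    "jpg": "Picture stored in JPEG format",
    "txt": "Plain text",
    "zip": "Compressed collection of files",
}


def file_type(file_name):
    """Split a file name at its last period and look up the extension."""
    first_part, extension = posixpath.splitext(file_name)
    extension = extension.lstrip(".")  # splitext keeps the period; remove it
    return first_part, extension, MEANINGS.get(extension, "unknown")


print(file_type("index.html"))
print(file_type("cat.jpg"))
print(file_type("sales-order.doc"))
```

The last example reports "unknown" only because doc is not in this small table, not because there is anything wrong with the file name.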

What's in a Name?

html
htm
asp


Web pages are written using a set of specifications called Hypertext Markup Language or HTML (which we will discuss in Chapter 15). For this reason, Web pages are commonly stored in files with the extension of html.

Some computer systems do not allow more than three characters in a file extension, and, on such systems, Web pages are given the extension htm.

Files with an asp extension contain hypertext, just like html files, but are generated by a special system called Active Server Pages. (Hence the extension asp.) However, when you use your browser to look at an asp page, it works the same as a regular html page.


Directories and Subdirectories

Computers have so many files that we need a way to organize them. We do so by grouping files into DIRECTORIES. Each directory is given a name and can hold any number of files. For example, you might use a directory called documents to hold your word processing documents. It is easy to create and delete directories, so, on your own computer, you can modify the storage arrangement as you see fit.

Directories can contain not only files, but other directories. When a directory contains another directory, we call the first one a PARENT DIRECTORY and the second one a SUBDIRECTORY. A directory can have as many subdirectories as you need. The power of this system is that it allows you to create a hierarchy of directories to reflect a particular way of organizing files. Moreover, as your needs change, it is a simple matter to modify the directory structure and to move files from one directory to another.

Here is an example. You create a directory named documents to hold your word processing documents. This works fine for a few weeks, but then you notice that you have too many files in one directory, so you decide to reorganize. Within the documents directory, you create three subdirectories: letters, writing and miscellaneous. You then move each file in the documents directory to one of the three new subdirectories.
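The reorganization just described can also be done by a program. The following Python sketch uses the standard pathlib and tempfile modules and works inside a temporary scratch directory, so it will not touch your real files. The function name reorganize and the three file names are made up for illustration.

```python
import tempfile
from pathlib import Path


def reorganize(documents, plan):
    """Move each file in `documents` into the subdirectory named in `plan`."""
    for file_name, subdirectory in plan.items():
        target = documents / subdirectory
        target.mkdir(exist_ok=True)  # create the subdirectory if needed
        (documents / file_name).rename(target / file_name)


with tempfile.TemporaryDirectory() as scratch:
    documents = Path(scratch) / "documents"
    documents.mkdir()
    # Hypothetical files and the subdirectory each one should end up in.
    plan = {
        "aunt-mary.doc": "letters",
        "chapter-4.doc": "writing",
        "todo.doc": "miscellaneous",
    }
    for file_name in plan:
        (documents / file_name).touch()
    reorganize(documents, plan)
    for path in sorted(documents.rglob("*.doc")):
        print(path.relative_to(documents).as_posix())
```

After the move, each file is listed under its new subdirectory, and the documents directory itself holds only the three subdirectories.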

— hint —

With Windows, the program you use to work with files and directories is called WINDOWS EXPLORER.

On some computer systems, it is customary to refer to directories as FOLDERS. This idea arose in the early days of PCs, when it was thought that personal computers were too intimidating for many people and everything should be "user friendly". The idea was that, by referring to directories as "folders", people would see the analogy between directories that contain computer files and folders in a filing cabinet.

Personally, I think the whole idea was pretty lame, and, to this day, computer companies (especially Microsoft and Apple) still continue to underestimate the intelligence of the average consumer. Thus, you will see both terms being used. When you do, just remember that a folder is the same thing as a directory, and a subfolder is the same as a subdirectory.

In general, people who like and understand computers use the word "directory", not the word "folder".

— hint —

If you are a "directory" person, do not marry a "folder" person.


Pathnames

Now that we have discussed file names and directories, we can fill in the last part of the URL puzzle. You will remember that one type of URL uses the following general format:

scheme://hostname/description

The description is nothing more than a list of directories and a file name. Take a look at one of our previous examples:

http://www.harley.com/25-things/index.html

In this case, 25-things is the name of a directory. Thus, the description refers to a file named index.html within a directory named 25-things.

Here is another example:

ftp://ftp.microsoft.com/Products/msmq/demos.zip

In this URL, there are two directory names, Products and msmq. Thus, the description refers to a directory named Products. Within this directory lies a subdirectory named msmq, and within this subdirectory is a file named demos.zip. Or, to say it another way, the demos.zip file is in a directory named msmq that itself is in a directory named Products.

As you use the Web, you will see a lot of URLs that have a series of directories ending with the name of a file. Such specifications are called PATHNAMES or, more simply, PATHS. Within a pathname, we use a / (slash) character to separate the various directory and file names.

Once you understand pathnames, it's easy to make sense out of most URLs if you remember the format:

scheme://hostname/description

The scheme shows you the type of resource, the hostname tells you the name of the computer, and the description contains a pathname.
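As a quick illustration, here is a Python sketch that pulls the pathname out of a URL and separates the directory names from the file name. It relies on urlsplit from the standard urllib.parse module. The function name split_pathname is made up, and the sketch assumes that the last name in the pathname is a file unless the pathname ends with a slash.

```python
from urllib.parse import urlsplit


def split_pathname(url):
    """Return (list of directory names, file name) for the pathname in a URL."""
    pathname = urlsplit(url).path.lstrip("/")
    # Everything before the last slash is a directory; what follows is the file.
    *directories, file_name = pathname.split("/")
    return directories, file_name


print(split_pathname("http://www.harley.com/25-things/index.html"))
print(split_pathname("ftp://ftp.microsoft.com/Products/msmq/demos.zip"))
```

An empty file name in the result means the URL names a directory rather than a particular file.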

— technical hint —

When you write URLs, you use / (slash) characters to separate the directory names. For example:

http://www.harley.com/25-things/index.html

When you are working within Windows, however, you use \ (backslash) characters in pathnames. For instance, here is a typical Windows pathname that points to a file named quikview.exe:

C:\Windows\System\Viewers\quikview.exe

In this example, the file resides in the Viewers directory, which is a subdirectory of the System directory, which is a subdirectory of the Windows directory on the C: disk.

Don't be confused. Within Windows, we use backslashes in a pathname, but in a URL we always use slashes.
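If you ever need to handle both kinds of pathname in a program, Python's standard pathlib module understands the Windows rules even when it is running on another system. The sketch below takes the Windows pathname from the example and shows its pieces. The function name windows_parts is made up for illustration.

```python
from pathlib import PureWindowsPath


def windows_parts(pathname):
    """Return the disk and the list of names in a Windows pathname."""
    path = PureWindowsPath(pathname)
    # parts[0] is the disk plus the leading backslash, so skip it here
    return path.drive, list(path.parts[1:])


example = r"C:\Windows\System\Viewers\quikview.exe"
print(windows_parts(example))
# The same pathname written with slashes instead of backslashes:
print(PureWindowsPath(example).as_posix())
```

The second line of output shows the pathname with slashes, the form you would expect to see inside a URL.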

What's in a Name?

/   slash: for URL pathnames
\   backslash: for Windows pathnames


Why do we use \ (backslash) characters in Windows pathnames, when using a / (slash) would be so much easier? It is a historical accident.

Windows is based on an old operating system called DOS, which uses commands that are typed by hand. The first DOS, version 1.0, was released in August 1981 along with the first IBM PC. Within DOS 1.0, the / character was used to indicate an option when you typed a command. For example, the dir command displays the names of all the files in a directory. The dir /w command displays the names using a "wide" format (more than one name per line).

At that time, PCs did not have hard disks, only floppy disks, and the amount of storage available on a floppy disk was small (just 160K). Since such a disk could not hold many files, DOS 1.0 did not support subdirectories, which meant there was no need for pathnames. Thus, it didn't matter that the / character was not available for pathnames.

In March 1983, IBM released the PC XT computer, the first PC to have a hard disk. The hard disk stored 10MB of data, which meant it was possible to store literally hundreds of files on a single disk. Thus, for the first time, there was a need for directories on a PC. A brand new DOS, version 2.0, which supported directories, was released along with the PC XT. Now there were pathnames.

Since the / character was used for command options, the designers of DOS 2.0 were reluctant to make PC users change their habits. For this reason, the \ character was used for pathnames. Through the years, DOS was improved and expanded, and at every step of the way, IBM and Microsoft opted for backward compatibility.

You might ask, why do we use a / character within Web addresses (URLs)? The answer is simple. The Web was originally designed on computers that used the Unix operating system, and from the very beginning, Unix has always used a / character for pathnames.


URL Abbreviations

Since URLs often contain a pathname, they can be quite long, and it's nice to be able to abbreviate whenever possible. To meet this need, there are conventions followed by Web servers that allow us to shorten URLs under certain conditions.

First, most Web servers follow a rule that says if the URL specifies a directory but no file name, the server will automatically look for a file with a specific name. Some servers look for a file named index.html, while others look for a file named default.html or default.asp. That means that, if you see a URL that ends in index.html, default.html or default.asp, you can usually leave off the file name and the URL will still work. Consider this URL from my own Web site:

http://www.harley.com/25-things/index.html

When you see a URL like this, you can guess that it might be okay to leave out the index.html, and most of the time (but not always) you will be right. With our example, it will work just fine if you use:

http://www.harley.com/25-things/

Similarly, either of the following URLs will point to the main Web page at my Web site:

http://www.harley.com/
http://www.harley.com/index.html

Another convention that lets us abbreviate has to do with the browser. If you ever type a URL that does not have a scheme at the beginning, your browser will assume the URL points to a Web page and insert http:// for you.

In addition, if you leave out a final slash (/) character, the browser will usually insert the slash for you. For example, if you want to visit my Web site, either of these URLs will work:

www.harley.com
http://www.harley.com/

How does a Web server know if a pathname ends with a directory or a file name?

The Web server assumes that if the pathname ends with a slash (/), it indicates a directory. Thus, if you type a URL that does not have a file name at the end, you should be sure to include the slash. For example, use:

http://www.harley.com/
http://www.kingfeatures.com/comics/

not:

http://www.harley.com
http://www.kingfeatures.com/comics

Most of the time such URLs will work, even without the slash, but it is proper to include it, and, as one of my readers, I know you take pride in being proper at all times.
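The abbreviation conventions in this section are simple enough to sketch in Python. The two functions below are made up for illustration: expand_url imitates what a browser does when you leave out the scheme or the slash after the hostname, and shorten_url drops a trailing default file name. Both assume the input is a Web address, and the list of default names contains only the three mentioned above. Real browsers and servers are more elaborate, and a shortened URL will usually, but not always, still work.

```python
DEFAULT_NAMES = ("index.html", "default.html", "default.asp")


def expand_url(text):
    """Fill in a missing http:// and a missing slash after the hostname."""
    if "://" not in text:
        text = "http://" + text
    scheme, rest = text.split("://", 1)
    if "/" not in rest:  # only a hostname was given
        rest += "/"
    return f"{scheme}://{rest}"


def shorten_url(url):
    """Drop a trailing default file name, leaving the directory and its slash."""
    for name in DEFAULT_NAMES:
        if url.endswith("/" + name):
            return url[: -len(name)]
    return url


print(expand_url("www.harley.com"))
print(shorten_url("http://www.harley.com/25-things/index.html"))
```

Notice that expand_url leaves a URL such as http://www.kingfeatures.com/comics alone. From the text of the URL, there is no way to tell whether comics is a directory or a file. Only the Web server knows, which is why it is proper to type the final slash yourself.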


Case Sensitive Pathnames

There are just a few more ideas I want to explain before we finish our discussion of pathnames. However, in order to do so, I need to talk about operating systems for a moment.

An operating system is the master control program that runs a computer. Most PCs use a version of the Windows operating system: Windows XP, Windows Me, Windows 98 or Windows 95. For more powerful computers — especially those providing database and networking services — there are special versions: Windows 2000, as well as the older Windows NT.

Aside from Windows, there is a completely different family of operating systems called Unix. There are many types of Unix — with different names — that run on all types of computers, not just PCs.

The reason I mention this is that most of the Web servers on the Net run either Windows 2000/NT or Unix, and the two systems use slightly different rules for pathnames.

First, with Windows, you can write pathnames using either lower- or uppercase letters. For example, within a URL, the following pathnames are all equivalent (as long as the server is running some type of Windows):

products/msmq/demos.zip
Products/msmq/demos.zip
Products/Msmq/Demos.Zip
PRODUCTS/MSMQ/DEMOS.ZIP

Unix always distinguishes between lower- and uppercase. For instance, in Unix, the name demos.zip is considered completely different from the name Demos.zip. We describe such names as being CASE SENSITIVE.

Recall our example:

ftp://ftp.microsoft.com/Products/msmq/demos.zip

If the computer to which this URL points is a Windows computer, you could type Products with a lowercase p and the URL would still work. However, if the server were a Unix machine, the lowercase version would not work.

In this case, the server does run Windows (which you might guess from looking at the hostname), so the following URL will work just fine:

ftp://ftp.microsoft.com/products/msmq/demos.zip

On the other hand, consider this URL from my Web site:

http://www.harley.com/get-rich/

This Web page resides on a Unix server where pathnames are case sensitive, and the last part of the pathname, get-rich, must be in all lowercase letters. If you change these letters to uppercase, the URL will not work:

http://www.harley.com/GET-RICH/

— hint —

If you type a URL that doesn't work, check to see if you might have accidentally typed some uppercase letters in lowercase, or vice versa.
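The difference between the two sets of rules can be expressed as a short Python sketch. The function name same_pathname and its case_sensitive parameter are made up for illustration. The sketch models only the upper- and lowercase rule described in this section, not everything a real server does when it looks for a file.

```python
def same_pathname(first, second, case_sensitive):
    """Decide whether two pathnames refer to the same file on a server."""
    if case_sensitive:  # Unix rules: every letter must match exactly
        return first == second
    # Windows rules: upper- and lowercase letters are treated as the same
    return first.lower() == second.lower()


# On a Windows server, these two pathnames are equivalent.
print(same_pathname("Products/msmq/demos.zip", "products/msmq/demos.zip", case_sensitive=False))
# On a Unix server, they are not.
print(same_pathname("get-rich/", "GET-RICH/", case_sensitive=True))
```

Since you usually cannot tell which type of server you are talking to, the safe habit is to type a pathname exactly as you found it.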


What a ~ (Tilde) Means in a Pathname

You will sometimes see a ~ character within a pathname. For instance:

http://www.psych.ucsb.edu/~kopeikin/

This character is called a TILDE (pronounced "til-duh"). In the United States, the standard PC keyboard has the tilde in the top left-hand corner, above the Tab key. To type a tilde, you need to hold down the Shift key. (By the way, the character you get if you don't hold down the Shift key is called a backquote [`]. The backquote is rarely used.) In other countries, the location of the tilde will vary. In the U.K., for example, the tilde is just to the left of the Enter key.

The tilde character is used primarily in Unix systems. Unix is designed to support many users at the same time, and each user is given a HOME DIRECTORY in which to store his files. A Unix user can create whatever files and subdirectories he wants in his home directory.

Within Unix, a tilde combined with a name refers to the home directory of a particular person. For example, if you had an account on a Unix computer under the name harley, your home directory would be known as ~harley.

Thus, when you see a URL that has a directory name beginning with a tilde, it means the Web site resides on a Unix system, and the directory is someone's home directory. In the example above, the URL points to the Web site of Hal Kopeikin, a member of the Department of Psychology at the University of California at Santa Barbara.
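Here is a minimal Python sketch that spots a home directory in a URL. The function name home_directory_owner is made up for illustration. The sketch simply looks for the first name in the pathname that begins with a tilde, and it assumes that such a name always marks someone's home directory, as described above.

```python
from urllib.parse import urlsplit


def home_directory_owner(url):
    """Return the user name from a ~name directory in the URL, or None."""
    for name in urlsplit(url).path.split("/"):
        if name.startswith("~") and len(name) > 1:
            return name[1:]  # drop the tilde, keep the user name
    return None


print(home_directory_owner("http://www.psych.ucsb.edu/~kopeikin/"))
print(home_directory_owner("http://www.harley.com/25-things/index.html"))
```

The second example has no tilde in its pathname, so the function reports that no home directory was found.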


Putting It All Together

At the beginning of the chapter, I promised that by the time you finished this chapter, you would be able to understand addresses like this one:

http://www.harley.com/25-things/index.html

Notice how easy it is to understand such addresses once you understand the pattern. Just look for the general format:

scheme://hostname/description

Then identify the various parts. In this case:

  • The scheme, http, tells you the URL points to a Web page.
  • The hostname, www.harley.com, is the address of the Web server.
  • The description, 25-things/index.html, is the pathname of the Web page.

In other words, this URL is the address of a Web page named index.html in the 25-things directory on the www.harley.com Web server.

If only all of life were so easy.
