Perl Notes(III) -- Introduction To Berkeley Sockets

3 Introduction To Berkeley Sockets

3.1 Basic Concepts

3.1.1 Binary versus Text-Oriented Protocols

Before they can exchange information across the network, hosts have a fundamental choice to make. They can exchange data either in binary form or as human-readable text. The choice has far-reaching ramifications.

To understand this, consider exchanging the number 1984. To exchange it as text, one host sends the other the string 1984, which, in the common ASCII character set, corresponds to the four hexadecimal bytes 0x31 0x39 0x38 0x34. These four bytes will be transferred in order across the network, and (provided the other host also speaks ASCII) will appear at the other end as "1984".

However, 1984 can also be treated as a number, in which case it can fit into the two-byte integer represented in hexadecimal as 0x7C0. If this number is already stored in the local host as a number, it seems sensible to transfer it across the network in its native two-byte form rather than convert it into its four-byte text representation, transfer it, and convert it back into a two-byte number at the other end. Not only does this save some computation, but it uses only half as much network capacity.

Unfortunately, there's a hitch. Different computer architectures have different ways of storing integers and floating point numbers. Some machines use two-byte integers, others four-byte integers, and still others use eight-byte integers. This is called word size. Furthermore, computer architectures have two different conventions for storing integers in memory. In some systems, called big-endian architectures, the most significant part of the integer is stored in the first byte of a two-byte integer. On such systems, reading from low to high, 1984 is represented in memory as the two bytes:

0x07    0xC0
low  -> high

On little-endian architectures, this convention is reversed, and 1984 is stored in the opposite orientation:

0xC0    0x07
low  -> high

These architectures are a matter of convention, and neither has a significant advantage over the other. The problem comes when transferring such data across the network, because this byte pair has to be transferred serially as two bytes. Data in memory is sent across the network from low to high, so for big-endian machines the number 1984 will be transferred as 0x07 0xC0, while for little-endian machines the numbers will be sent in the reverse order. As long as the machine at the other end has the same native word size and byte order, these bytes will be correctly interpreted as 1984 when they arrive. However, if the recipient uses a different byte order, then the two bytes will be interpreted in the wrong order, yielding hexadecimal 0xC007, or decimal 49,159. Even worse, if the recipient interprets these bytes as the top half of a four-byte integer, it will end up as 0xC0070000, or 3,221,684,224. Someone's anniversary party is going to be very late.

Because of the potential for such binary chaos, text-based protocols are the norm on the Internet. All the common protocols convert numeric information into text prior to transferring them, even though this can result in more data being transferred across the net. Some protocols even convert data that doesn't have a sensible text representation, such as audio files, into a form that uses the ASCII character set, because this is generally easier to work with. By the same token, a great many protocols are line-oriented, meaning that they accept commands and transmit data in the form of discrete lines, each terminated by a commonly agreed-upon newline sequence.

A few protocols, however, are binary. Examples include Sun's Remote Procedure Call (RPC) system, and the Napster peer-to-peer file exchange protocol. Such protocols have to be exceptionally careful to represent binary data in a common format. For integer numbers, there is a commonly recognized network format. In network format, a "short" integer is represented in two big-endian bytes, while a "long" integer is represented with four big-endian bytes. Perl's pack() and unpack () functions provide the ability to convert numbers into network format and back again.

Floating point numbers and more complicated things like data structures have no commonly accepted network representation. When exchanging binary data, each protocol has to work out its own way of representing such data in a platform-neutral fashion.

3.1.2 Berkeley Sockets

Berkeley sockets are part of an application programming interface (API) that specifies the data structures and function calls that interact with the operating system's network subsystem. Berkeley sockets are part of an API, not a specific protocol, which defines how the programmer interacts with an idealized network.

3.2 The Anatomy of a Socket

A socket is an endpoint for communications, a portal to the outside world that we can use to send outgoing messages to other processes, and to receive incoming traffic from processes interested in sending messages to us.

To create a socket, we need to provide the system with a minimum of three pieces of information.

3.2.1 The Socket's Domain

The domain defines the family of networking protocols and addressing schemes that the socket will support. The domain is selected from a small number of integer constants defined by the operating system and exported by Perl's Socket module. There are only two common domains

Table 3.1. Common Socket Domains

Constant	Description
`AF_INET`	The Internet protocols
`AF_UNIX`	Networking within a single host

In addition to these domains, there are many others including AF_APPLETALK, AF_IPX, and AF_X25, each corresponding to a particular addressing scheme. AF_INET6, corresponding to the long addresses of TCP/IP version 6, will become important in the future, but is not yet supported by Perl. The AF_ prefix stands for "address family." In addition, there is a series of "protocol family" constants starting with the PF_ prefix.

3.2.2 The Socket's Type

The socket type identifies the basic properties of socket communications.

Table 3.2. Constants Exported by Socket

Constant	Description
`SOCK_STREAM`	A continuous stream of data
`SOCK_DGRAM`	Individual packets of data
`SOCK_RAW`	Access to internal protocols and interfaces

Perl fully supports the SOCK_STREAM and SOCK_DGRAM socket types. SOCK_RAW is supported through an add-on module named Net::Raw.

3.2.3 The Socket's Protocol

Like the domain and socket type, the protocol is a small integer. However, the protocol numbers are not available as constants, but instead must be looked up at run time using the Perl getprotobyname() function.

Table 3.3. Some Socket Protocols

Protocol	Description
`tcp`	Transmission Control Protocol for stream sockets
`udp`	User Datagram Protocol for datagram sockets
`icmp`	Internet Control Message Protocol
`raw`	Creates IP packets manually

The TCP and UDP protocols are supported directly by the Perl sockets API. You can get access to the ICMP and raw protocols via the Net::ICMP and Net::Raw third-party modules.

The allowed combinations of socket domain, type, and protocol are few. SOCK_STREAM goes with TCP, and SOCK_DGRAM goes with UDP. Also notice that the AF_UNIX address family doesn't use a named protocol, but a pseudoprotocol named PF_UNSPEC (for "unspecified").

Table 3.4. Allowed Combinations of Socket Type and Protocol in the INET and UNIX Domains

Domain Type Protocol

AF_INET SOCK_STREAM tcp

AF_INET SOCK_DGRAM udp

AF_UNIX SOCK_STREAM PF_UNSPEC

AF_UNIX SOCK_DGRAM PF_UNSPEC

Domain	Type	Protocol
`AF_INET`	`SOCK_STREAM`	tcp
`AF_INET`	`SOCK_DGRAM`	udp
`AF_UNIX`	`SOCK_STREAM`	`PF_UNSPEC`
`AF_UNIX`	`SOCK_DGRAM`	`PF_UNSPEC`

3.2.4 Datagram Sockets

Datagram-type sockets provide for the transmission of connectionless, unreliable, unsequenced messages. The UDP is the chief datagram-style protocol used by the Internet protocol family.

As the diagram in Figure 3.2 shows, datagram services resemble the postal system. Like a letter or a telegram, each datagram in the system carries its destination address, its return address, and a certain amount of data. The Internet protocols will make the best effort to get the datagram delivered to its destination.

Figure 3.2. Datagram sockets provide connectionless, unreliable, unsequenced transmission of message

There is no long-term relationship between the sending socket and the recipient socket: A client can send a datagram off to one server, then immediately turn around and send a datagram to another server using the same socket. But the connectionless nature of UDP comes at a price. Like certain countries' postal systems, it is very possible for a datagram to get "lost in the mail." A client cannot know whether a server has received its message until it receives an acknowledgment in reply. Even then, it can't know for sure that a message was lost, because the server might have received the original message and the acknowledgment got lost!

Datagrams are neither synchronized nor flow controlled. If you send a set of datagrams out in a particular order, they might not arrive in that order. Because of the vagaries of the Internet, the first datagram may go by one route, and the second one may take a different path. If the second route is faster than the first one, the two datagrams may arrive in the opposite order from which they were sent. It is also possible for a datagram to get duplicated in transit, resulting in the same message being received twice.

Because of the connectionless nature of datagrams, there is no flow control between the sender and the recipient. If the sender transmits datagrams faster than the recipient can process them, the recipient has no way to signal the sender to slow down, and will eventually start to discard packets.

Although a datagram's delivery is not reliable, its contents are. Modern implementations of UDP provide each datagram with a checksum that ensures that its data portion is not corrupted in transit.

3.2.5 Stream Sockets

The other major paradigm is stream sockets, implemented in the Internet domain as the TCP protocol. Stream sockets provide sequenced, reliable bidirectional communications via byte-oriented streams. As depicted in Figure 3.3, stream sockets resemble a telephone conversation. Clients connect to servers using their address, the two exchange data for a period of time, and then one of the pair breaks off the connection.

Figure 3.3. Stream sockets provide sequenced, reliable, bidirectional communications

Reading and writing to stream sockets is a lot like reading and writing to a file. There are no arbitrary size limits or record boundaries, although you can impose a record-oriented structure on the stream if you like. Because stream sockets are sequenced and reliable, you can write a series of bytes into a socket secure in the knowledge that they will emerge at the other end in the correct order, provided that they emerge at all ("reliable" does not mean immune to network errors).

TCP also implements flow control. Unlike UDP, where the danger of filling the data-receiving buffer is very real, TCP automatically signals the sending host to suspend transmission temporarily when the reading host is falling behind, and to resume sending data when the reading host is again ready. This flow control happens behind the scenes and is ordinarily invisible.

Although it looks and acts like a continuous byte stream, the TCP protocol is actually implemented on top of a datagram-style service, in this case the low-level IP protocol. IP packets are just as unreliable as UDP datagrams, so behind the scenes TCP is responsible for keeping track of packet sequence numbers, acknowledging received packets, and retransmitting lost packets.

3.2.6 Datagram versus Stream Sockets

With all its reliability problems, you might wonder why anyone uses UDP. The answer is that most client/server programs on the Internet use TCP stream sockets instead. In most cases, TCP is the right solution for you, too.

There are some circumstances, however, in which UDP might be a better choice. For example, time servers use UDP datagrams to transmit the time of day to clients who use the information for clock synchronization. If a datagram disappears in transit, it's neither necessary nor desirable to retransmit it because by the time it arrives it will no longer be relevant.

UDP is also preferred when the interaction between one host and the other is very short. The length of time to set up and take down a TCP connection is about eightfold greater than the exchange of a single byte of data via UDP (for details, see [Stevens 1996]). If relatively small amounts of data are being exchanged, the TCP setup time will dominate performance. Even after a TCP connection is established, each transmitted byte consumes more bandwidth than UDP because of the additional overhead for ensuring reliability.

Another common scenario occurs when a host must send the same data to many places; for example, it wants to transmit a video stream to multiple viewers. The overhead to set up and manage a large number of TCP connections can quickly exhaust operating system resources, because a different socket must be used for each connection. In contrast, sending a series of UDP datagrams is much more sparing of resources. The same socket can be reused to send datagrams to many hosts.

Whereas TCP is always a one-to-one connection, UDP also allows one-to-many and many-to-many transmissions. At one end of the spectrum, you can address a UDP datagram to the "broadcast address," broadcasting a message to all listening hosts on the local area network. At the other end of the spectrum, you can target a message to a predefined group of hosts using the "multicast" facility of modern IP implementations. These advanced features are covered in Chapters 20 and 21.

The Internet's DNS is a common example of a UDP-based service. It is responsible for translating hostnames into IP addresses, and vice versa, using a loose-knit network of DNS servers. If a client does not get a response from a DNS server, it just retransmits its request. The overhead of an occasional lost datagram outweighs the overhead of setting up a new TCP connection for each request. Other common examples of UDP services include Sun's Network File System (NFS) and the Trivial File Transfer Protocol (TFTP). The latter is used by diskless workstations during boot in order to load their operating system over the network. UDP was originally chosen for this purpose because its implementation is relatively small. Therefore, UDP fit more easily into the limited ROM space available to workstations at the time the protocol was designed.

3.3 Socket Addressing

For the UNIX domain, which can be used only between two processes on the same host machine, addresses are simply paths on the host's filesystem, such as /usr/tmp/log. For the Internet domain, each socket address has three parts: the IP address, the port, and the protocol.

3.3.1 IP Addresses

Many of Perl's networking calls require you to work with IP addresses in the form of packed binary strings. IP addresses can be converted manually to binary format and back again using pack() and unpack() with a template of "C4" (four unsigned characters). For example, here's how to convert 18.157.0.125 into its packed form and then reverse the process:

($a,$b,$c,$d)      = split //./, '18.157.0.125';
$packed_ip_address = pack 'C4',$a,$b,$c,$d;
($a,$b,$c,$d)      = unpack 'C4',$packed_ip_address;
$dotted_ip_address = join '.', $a,$b,$c,$d;

Most hosts have two addresses, the "loopback" address 127.0.0.1 (often known by its symbolic name "localhost") and its public Internet address. The loopback address is associated with a device that loops transmissions back onto itself, allowing a client on the host to make an outgoing connection to a server running on the same host. Although this sounds a bit pointless, it is a powerful technique for application development, because it means that you can develop and test software on the local machine without access to the network.

The public Internet address is associated with the host's network interface card, such as an Ethernet card. The address is either assigned to the host by the network administrator or, in systems with dynamic host addressing, by a Boot Protocol (BOOTP) or Dynamic Host Configuration Protocol (DHCP) server. If a host has multiple network interfaces installed, each one can have a distinct IP address. It's also possible for a single interface to be configured to use several addresses.

3.3.2 Reserved IP Addresses, Subnets, and Netmasks

In order for a packet of information to travel from one location to another across the Internet, it must hop across a series of physical networks. For example, a packet leaving your desktop computer must travel across your LAN (local area network) to a modem or router, then across your Internet service provider's (ISP) regional network, then across a backbone to another ISP's regional network, and finally to its destination machine.

Network routers keep track of how the networks interconnect, and are responsible for determining the most efficient route to get a packet from point A to point B. However, if IP addresses were allocated ad hoc, this task would not be feasible because each router would have to maintain a map showing the locations of all IP addresses. Instead, IP addresses are allocated in contiguous chunks for use in organizational and regional networks.

For example, my employer, the Cold Spring Harbor Laboratory (CSHL), owns the block of IP addresses that range from 143.48.0.0 through 143.48.255.255 (this is a so-called class B address). When a backbone router sees a packet addressed to an IP address in this range, it needs only to determine how to get the packet into CSHL's network. It is then the responsibility of CSHL's routers to get the packet to its destination. In practice, CSHL and other large organizations split their allocated address ranges into several subnets and use routers to interconnect the parts.

A computer that is sending out an IP packet must determine whether the destination machine is directly reachable (e.g., over the Ethernet) or whether the packet must be directed to a router that interconnects the local network to more distant locations. The basic decision is whether the packet is part of the local network or part of a distant network.

To make this decision possible, IP addresses are arbitrarily split into a host part and a network part. For example, in CSHL's network, the split occurs after the second byte: the network part is 143.48. and the host part is the rest. So 143.48.0.0 is the first address in CSHL's network, and 143.48.255.255 is the last.

To describe where the network/host split occurs for routing purposes, networks use a netmask, which is a bitmask with 1s in the positions of the network part of the IP address. Like the IP address itself, the netmask is usually written in dotted-quad form. Continuing with our example, CSHL has a netmask of 255.255.0.0, which, when written in binary, is 11111111,11111111,00000000,00000000.

Historically, IP networks were divided into three classes on the basis of their netmasks (Table 3.5). Class A networks have a netmask of 255.0.0.0 and approximately 16 million hosts. Class B networks have a netmask of 255.255.0.0 and some 65,000 hosts, and class C networks use the netmask 255.255.255.0 and support 254 hosts (as we will see, the first and last host numbers in a network range are unavailable for use as a normal host address).

Table 3.5. Address Classes and Their Netmasks


Class	Netmask	Example Address	Network Park	Host Part
A	255.0.0.0	120.155.32.5	120.	155.32.5
B	255.255.0.0	128.157.32.5	128.157.	32.5
C	255.255.255.0	192.66.12.56	192.66.12.	56

As the Internet has become more crowded, however, networks have had to be split up in more flexible ways. It's common now to see netmasks that don't end at byte boundaries. For example, the netmask 255.255.255.128 (binary 11111111,11111111,11111111,10000000) splits the last byte in half, creating a set of 126-host networks. The modern Internet routes packets based on this more flexible scheme, called Classless Inter-Domain Routing (CIDR). CIDR uses a concise convention to describe networks in which the network address is followed by a slash and an integer containing the number of 1s in the mask. For example, CSHL's network is described by the CIDR address 143.48.0.0/16. CIDR is described in detail in RFCs 1517 through 1520, and in the FAQs listed in Appendix D.

Figuring out the network and broadcast addresses can be confusing when you work with netmasks that do not end at byte boundaries. The Net::Netmask module, available on CPAN, provides facilities for calculating these values in an intuitive way. You'll also find a short module that I wrote, Net::NetmaskLite, in Appendix A. You might want to peruse this code in order to learn the relationships among the network address, broadcast address, and netmask.

The first and last addresses in a subnet have special significance and cannot be used as ordinary host addresses. The first address, sometimes known as the all-zeroes address, is reserved for use in routing tables to denote the network as a whole (network address). The last address in the range, known as the all-ones address, is reserved for use as the broadcast address. IP packets sent to this address will be received by all hosts on the subnet. For example, for the network 192.18.4.x (a class C address or 192.18.4.0/24 in CIDR format), the network address is 192.18.4.0 and the broadcast address is 192.18.4.255.

In addition, several IP address ranges have been set aside for special purposes (Table 3.6). The class A network 10.x.x.x, the 16 class B networks 172.16.x.x through 172.31.x.x, and the 255 class C addresses 192.168.0.x through 192.168.255.x are reserved for use as internal networks. An organization may use any of these networks internally, but must not connect the network directly to the Internet. The 192.168.x.x networks are used frequently in testing, or placed behind firewall systems that translate all the internal network addresses into a single public IP address. The network addresses 224.x.x.x through 239.x.x.x are reserved for multicasting applications, and everything above 240.x.x.x is reserved for future expansion.

Table 3.6. Reserved IP Addresses


Address	Description
127.0.0.x	Loopback interface
10.x.x.x	Private class A address
172.16.x.x–172.32.x.x	Private class B addresses
192.168.0.x–172.168.255.x	Private class C addresses

Finally, IP address 127.0.0.x is reserved for use as the loopback network. Anything sent to an address in this range is received by the local host.

3.3.3 Network Ports

Once a message reaches its destination IP address, there's still the matter of finding the correct program to deliver it to. It's common for a host to be running multiple network servers, and it would be impractical, not to say confusing, to deliver the same message to them all. That's where the port number comes in. The port number part of the socket address is an unsigned 16-bit number ranging from 1 to 65535. In addition to its IP address, each active socket on a host is identified by a unique port number; this allows messages to be delivered unambiguously to the correct program. When a program creates a socket, it may ask the operating system to associate a port with the socket. If the port is not being used, the operating system will grant this request, and will refuse other programs access to the port until the port is no longer in use. If the program doesn't specifically request a port, one will be assigned to it from the pool of unused port numbers.

There are actually two sets of port numbers, one for use by TCP sockets, and the other for use by UDP-based programs. It is perfectly all right for two programs to be using the same port number provided that one is using it for TCP and the other for UDP.

Not all port numbers are created equal. The ports in the range 0 through 1023 are reserved for the use of "well-known" services, which are assigned and maintained by ICANN, the Internet Corporation for Assigned Names and Numbers. For example, TCP port 80 is reserved for use for the HTTP used by Web servers, TCP port 25 is used for the SMTP used by e-mail transport agents, and UDP port 53 is used for the domain name service (DNS). Because these ports are well known, you can be pretty certain that a Web server running on a remote machine will be listening on port 80. On UNIX systems, only the root user (i.e., the superuser) is allowed to create a socket using a reserved port. This is partly to prevent unprivileged users on the system inadvertently running code that will interfere with the operations of the host's network services.

Most services are either TCP- or UDP-based, but some can communicate with both protocols. In the interest of future compatibility, ICANN usually reserves both the UDP and TCP ports for each service. However, there are many exceptions to this rule. For example, TCP port 514 is used on UNIX systems for remote shell (login) services, while UDP port 514 is used for the system logging daemon.

In some versions of UNIX, the high-numbered ports in the range 49152 through 65535 are reserved by the operating system for use as "ephemeral" ports to be assigned automatically to outgoing TCP/IP connections when a port number hasn't been explicitly requested. The remaining ports, those in the range 1024 through 49151, are free for use in your own applications, provided that some other service has not already claimed them. It is a good idea to check the ports in use on your machine by using one of the network tools introduced later in this chapter (Network Analysis Tools) before claiming one.

3.3.4 The `sockaddr_in` Structure

A socket address is the combination of the host address and the port, packed together in a binary structure called a sockaddr_in. This corresponds to a C structure of the same name that is used internally to call the system networking routines. (By analogy, UNIX domain sockets use a packed structure called a sockaddr_un.) Functions provided by the standard Perl Socket module allow you to create and manipulate sockaddr_in structures easily:

$packed_address = inet_aton($dotted_quad)

Given an IP address in dotted-quad form, this function packs it into binary form suitable for use by sockaddr_in(). The function will also operate on symbolic hostnames. If the hostname cannot be looked up, it returns undef.

$dotted_quad = inet_ntoa($packed_address)

This function takes a packed IP address and converts it into human-readable dotted-quad form. It does not attempt to translate IP addresses into hostnames. You can achieve this effect by using gethostbyaddr(), discussed later.

$socket_addr = sockaddr_in($port,$address)

($port,$address) = sockaddr_in($socket_addr)

When called in a scalar context, sockaddr_in() takes a port number and a binary IP address and packs them together into a socket address, suitable for use by socket(). When called in a list context, sockaddr_in() does the opposite, translating a socket address into the port and IP address. The IP address must still be passed through inet_ntoa() to obtain a human-readable string.

$socket_addr = pack_sockaddr_in($port,$address)

($port,$address) = unpack_sockaddr_in($socket_addr)

If you don't like the confusing behavior of sockaddr_in(), you can use these two functions to pack and unpack socket addresses in a context-insensitive manner.

In some references, you'll see a socket's address referred to as its "name." Don't let this confuse you. A socket's address and its name are one and the same.

3.4