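Apple Store's Mistake

Today I read on Apple4us that Apple would announce new products tonight and that the Apple Store website was being updated and could no longer display product information properly, so I opened it for a casual look. I expected nothing much, but what I saw genuinely surprised me: Apple had attached the wrong label to Taiwan and, in effect, carved up Chinese territory!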

 

Much as I have long admired Apple, I could not blur right and wrong on a matter of principle, so I sent Apple the following email:

Dear Sir/Ms. ,
I found something confusing in the webpage http://store.apple.com/(following "You can contact our telesales team at the following numbers:").
The logo of Taiwan is not appropriate: you add two Chinese characters "日本" above "Taiwan", which will confuse many people. Taiwan is a part of China, so you should add "中国" instead.
I sincerely remind you about this negligence and hope you can change the logo as soon as possible.
Thank you in advance!
Best wishes!
Let's wait and see what Apple does.

 

Selecting Words on a Web Page and Calling Google Translate

1. translate.html

 

<html>
  <head>
    <!-- Loader that provides NYTD.require, used by altClickToSearch.js -->
    <script src="http://graphics8.nytimes.com/js/common.js" type="text/javascript"></script>
    <script type="text/javascript" src="file:///C:/Documents%20and%20Settings/bonny/My%20Documents/GoogleApps/FreeTranslator/altClickToSearch.js"></script>
    <script type="text/javascript" src="http://www.google.com/jsapi"></script>
    <link href="http://www.google.com/uds/modules/elements/transliteration/api.css"
      type="text/css" rel="stylesheet"/>
    <script type="text/javascript">
    // Load version 1 of the Google AJAX Language API.
    google.load("language", "1");

    // Detect the language of the "text" element, translate it to English,
    // and write the result into the "translation" element.
    function initialize() {
      var text = document.getElementById("text").innerHTML;
      google.language.detect(text, function(result) {
        if (!result.error && result.language) {
          google.language.translate(text, result.language, "en",
                                    function(result) {
            var translated = document.getElementById("translation");
            if (result.translation) {
              translated.innerHTML = result.translation;
            }
          });
        }
      });
    }
    google.setOnLoadCallback(initialize);
    </script>
  </head>
  <body>
    <div id="text">你好,见到你很高兴!</div>
    <div id="translation"></div>
    <!-- Sample English text for trying out select-to-translate -->
    <p>Detection of Unicode font rendering support
Browsers and operating systems may or may not have the support for rendering particular Unicode fonts. You can detect whether the user using your webpage has support for rendering the fonts of a given language in a readable way or not using the font rendering support detection API. Please note that this works correctly for Unicode fonts alone. If your webpage renders using non-Unicode fonts, this function will not be useful for you. If font rendering support is not present for a given language, there are several solutions available to fix this - please read this article for more information.</p>
    <textarea id="transliterateTextarea" style="width:600px;height:200px"></textarea>
  </body>
</html>

 

 

2. altClickToSearch.js

 

NYTD.require("http://graphics8.nytimes.com/js/app/lib/prototype/1.6.0.2/prototype.js", function(){NYTD.WordReference.initialize();});
NYTD.WordReference = (function(){
  
  var selection, selectionText, selectionButton, newRange;
  
  function handleClick(event) {
    if (selectionButton){
      cleanUp();
    }
    
    selection = getSelection();
    selectionText = selection && selection.toString();
    if (selectionText) {
      window.setTimeout(insertButton, 0);
      event.stop();
    }
  }
  
  function getSelection() {
    return Try.these(
      function() { return window.getSelection() },
      function() { return document.getSelection() },
      function() { 
        var selection = document.selection && document.selection.createRange();
        selection.toString = function() { return this.text };
        return selection;
      }
    ) || false;
  }
  
  function insertButton() {
    
    selectionButton = new Element(
        'span', {
          'className':'nytd_selection_button',
          'id':'nytd_selection_button',
          'title':'Lookup Word',
          // The style string is built by concatenation; a string literal cannot span lines.
          'style': 'margin:-20px 0 0 -20px; position:absolute; ' +
                   'background:url(http://graphics8.nytimes.com/images/global/word_reference/ref_bubble.png);' +
                   'width:25px;height:29px;cursor:pointer;_background-image: none;' +
                   'filter: progid:DXImageTransform.Microsoft.AlphaImageLoader(src="http://graphics8.nytimes.com/images/global/word_reference/ref_bubble.png", sizingMethod="image");'
        }
    );
      
    if (Prototype.Browser.IE) {
      var tmp = new Element('div');
      tmp.appendChild(selectionButton);
      newRange = selection.duplicate();
      newRange.setEndPoint( "StartToEnd", selection);
      newRange.pasteHTML(tmp.innerHTML);
      selectionButton = $('nytd_selection_button');
    }
    else {
      var range = selection.getRangeAt(0);
      newRange = document.createRange();
      newRange.setStart(selection.focusNode, range.endOffset);
      newRange.insertNode(selectionButton);
    }
    
    Element.observe(selectionButton, 'mouseup', exportSelection, true);
    
  }
  
  function cleanUp() {
    selection = null;
    selectionButton.stopObserving('mouseup', exportSelection);
    newRange && newRange.pasteHTML && newRange.pasteHTML('');
    newRange = null;
    selectionButton.remove();
    selectionButton = null;
    selectionText = '';
  }
  
  function exportSelection(event) {
    // Disabled alternative: open Google Dictionary in a popup window.
    // var url = 'http://www.google.com/dictionary?langpair=en|zh-CN&q=' + encodeURIComponent(selectionText) + '&hl=zh-CN&aq=f';
    // var newwin = window.open(url,'answersdotcom','height=450,width=820,location=false,menubar=false,toolbar=false,status=false,resizable, scrollbars');
    // if (newwin) newwin.focus();
    // event.stop();

    // Translate the selected text from English to Simplified Chinese and show it in an alert.
    google.language.translate(selectionText, "en", "zh-CN", function(result) {
      if (!result.error) {
        alert(encodeURIComponent(selectionText) + ":" + result.translation);
      }
    });
  }
  
  return {
    initialize: function() {
      document.observe('mouseup', handleClick, false);
    }
  };
  
})();

 

Google Is Getting Old: Can It Still Stay Up?

 


 

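With the financial storm bearing down, I fully expected that Google would not escape unscathed. What surprised me was that Google, too, is laying off staff, and what truly astonished me is how sharply the quality of its services has dropped: many commonly used services have been taken offline (Google Notebook, for example), outages keep coming (Google Apps, Gmail!), Android has been all thunder and little rain (even Android's founder has left), Chrome is developing slowly...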

 

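I count myself a loyal Google fan: I have long used Google Docs, Gmail, Google Maps, Google Earth, Google Notebook, Chrome, Google Finance, Google Reader, iGoogle, Picasa, YouTube... I also planned to look into Google Apps, and I downloaded the Android SDK and wrote a hello world program, but did not continue for lack of time. Google Search, of course, goes without saying.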

 

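In the past, every new service Google launched made people sit up and take notice, and even today Google still dominates the headlines. For a "Baby Giant" founded barely ten years ago and listed barely five years ago, the praise and halo bestowed by the media and users have been lavish indeed. Google has lived up to expectations, too, launching one service after another. But as the services multiplied, people began to wonder: what exactly is Google trying to do? A jumble of assorted online services with no fundamental connection between them inevitably leaves people confused.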

 

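Google sensed its users' confusion and, in early 2008, rolled out the concept of "cloud computing". Cloud computing had of course existed before that, and the first company to launch a commercial cloud platform was Amazon, not Google. Still, Google's cloud computing does explain its family of services to some degree, though on reflection the fit is somewhat forced. Some services are hard to file under cloud computing, desktop software such as Picasa for instance. On the other hand, lumping search engines, webmail, and the like under cloud computing is rather broad: would every company that offers webmail then count as a cloud provider? And would cloud computing then involve little real technology? I was drawn to "grid computing" early on and have high hopes for cloud computing, but I am skeptical of some of Google's definitions. The industry will not reach consensus on "cloud computing" in the short term; for now it is still at the stage of concept hype. After all, much of Google's rapid growth came through acquisitions, and Picasa was not originally a Google product. My overall impression of Google's cloud computing is that it is rather rough. Yet I firmly believe that, among all the companies that keep talking about cloud computing, Google is the one with the strongest technical capability.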

 

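Because Google's strength lies in the network, its instinct is to put every service in the "cloud" and shake off the constraints of the "client" as far as possible. Microsoft is the opposite: its cloud philosophy is to delay the migration of applications to the cloud for as long as it can and so prolong the life of the client (Windows). A cloud without a client is like a kite with a broken string; users cannot reach it. Google also wants to work its way around Microsoft's client, so last September it launched the Chrome browser, intending its own browser to stand in for the operating system so that every cloud application can run in Chrome. Chrome's design itself borrows ideas from operating systems, which makes the intent fairly plain. Google is shrewd here: the goal of Chrome is not really to pour effort into the browser wars. Even if Chrome fails or ends up with a very low market share (which is almost certain; even Firefox, excellent as it is, has only about 20% after years of hard work), it will intensify browser competition, push browsers forward, and thereby speed the spread of web applications.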

 

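Yet this strategically important product has been somewhat disappointing. Chrome's strengths need no elaboration, but its weaknesses are just as plain. Compatibility with Google's own products is poor: Google already has many applications, yet some of them sometimes do not work in Chrome (what I find least tolerable is that, half a year on, the Google edition of Kingsoft PowerWord (谷歌金山词霸) still does not support on-screen word capture in Chrome). Page rendering is ugly: many pages that look fine in IE (pages written to the HTML standard) look poor in Chrome, with text of the same size rendered in varying sizes and weights. The address bar is not smart enough: Chrome has its own advantage (integrated search), but it is clearly less intelligent than Firefox's. Bookmarks are hard to find: this is partly a matter of habit, and I do admire Google's clean, spare style, but Google did not put bookmarks anywhere prominent, and unless you open a new tab they are hard to locate, which I really cannot understand. In addition, the Linux version of Chrome has yet to be released, which is also frustrating.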

 

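Speaking of Google's "client", I should also say a few words about Android. Android was not originally a Google product either: Google acquired Android Inc. in 2005 and continued developing the Android mobile operating system on that basis. Android could have been an epoch-making mobile operating system, but its timing was unlucky and the iPhone stole the limelight. Last year Google also launched the Android Market, modeled on Apple's App Store; although its terms are far more generous than Apple's, it still holds fewer than 1,000 applications. Perhaps because a major breakthrough in phones looks hard, Google has shifted Android's focus to netbooks (according to a former head of Google Android who has left the company), and someone has reportedly ported Android to the EeePC with little effort. But netbooks have plenty of competitors of their own: Moblin, Ubuntu, Windows XP/Windows 7... Despite a rocky start, Google has decided to make a big push in mobile.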

 

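Google's recent performance increasingly shows signs of old age. First came the falling share price, cuts to benefits, and staff departures; then some products were taken offline; then the year-end bonus was cancelled (everyone received a G1 instead); right after that every web page was mistakenly flagged as malicious; and accompanying it all was outage after outage... The bad news keeps coming. Clearly Google, in the middle of the financial crisis, has entered a critical stage, beset by difficulties inside and out. The once vigorous "Don't be evil" Google has grown from a fledgling into maturity, and every problem that the Microsofts and Yahoos have faced, along with some they never faced, now lies before it.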

 

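Twitter has been riding high lately, and some predict it will replace Google because its search for real-time news is better than Google's. Last night I tried Twitter. Signing up takes only a few steps, yet I wasted 40 minutes without managing to sign up successfully. First the CAPTCHA appeared as an image that I could not decipher however long I stared, and the audio version was even less intelligible. After several new CAPTCHAs I finally made one out and completed the registration, only to be told when I actually tried to log in that my username and password were wrong! My God! Perhaps I lack the insight, but for now I cannot see what would enable Twitter to defeat Google. Today the real threat to Google still comes from Google itself...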

 

[Note]: Comments are welcome; please credit the source when reposting...

 

A Tale of a Mix-Up



 

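For the past two months I have been busy with a Feature (our company breaks a large project into many small functions, each called a Feature). It is also the first real Feature of my working life and, all in all, it has been tiring but happy.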

 

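The company is a telecom equipment vendor. From the day I joined last July I heard about, and gradually came to feel, how complex its products are, for objective historical reasons and for subjective human ones. Only by working on this feature did I truly appreciate what "vast as the sea" means and what "a drop in the ocean" is. I actually started in another team and was moved to my current team just as the real project work was about to begin. What we build is simulation software for the company's wireless products. Americans had worked on it for more than a decade; last year part of the work began moving to China, and I became a member of the newly formed team here.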

 

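Although the company has more than a hundred years of history and laid the foundations for many of the computing and communications technologies in common use today, its tradition is that teaching by example outweighs teaching by word, and word of mouth outweighs documentation, while the documentation itself is scattered in pieces. When our American colleagues trained us, the question we asked most was "Where is the documentation?", and the answer they gave most was "Reading the code". The company is not what it once was, but it still has plenty of brilliant people; you can tell from the code, because I often cannot understand it.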

 

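Product code is only one small aspect. A telecom vendor has many more important things to care about. I will not discuss sales, management, operations, and other areas I do not know (I have nothing to say about them), but for software alone (I do not know much about hardware either), system architecture, code management, configuration tools, and so on are all indispensable. These are used constantly at work, and with many hands over many years, all sorts of problems inevitably arise.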

 

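On a first feature, the hardest part is not implementing it but learning a whole set of tools and processes. Even the feature itself is mostly not especially creative or technically deep. What should be the closest possible simulation of an existing product degenerates, for various reasons, into a compromise between reality and simulation, and our job is to find a crack between the iron walls and survive in it. Building simulation software for a product sometimes means cutting straight across a bend and sometimes means taking the winding path; the variations are endlessly intriguing.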

 

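Because the simulator is used only inside the company, it does not need the stability and reliability demanded of telecom products; it only has to be good enough for testing. If something goes wrong, you restart the software and carry on. But precisely because the quality bar is low, problems are plentiful, and every day many users email us for help. At the same time we must keep developing new features to support product testing, so everyone is under considerable pressure. We newcomers each count for about half a person, while our American colleagues, the least experienced of whom has been at it for ten years, have watched colleagues around them laid off in wave after wave and carry an ever heavier load. In the rush, mistakes are inevitable.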

 

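A couple of days ago, while testing my Feature, I ran into a problem: the Product built from the product code together with our own code kept failing in use. After a day of reading logs and repeated testing I finally found the cause: the product side had changed the definition of a struct by adding a member, and our code had not been updated to match. I emailed my finding to my Mentor in the US. That evening, while I was working overtime at home, the mentor asked me for details over MOC (Microsoft Office Communicator, an enterprise IM client that is frankly mediocre), and I told him my conclusion. He was skeptical at first, so I demonstrated it to him. He finally accepted that the problem existed but still tried to find a cause he was more comfortable with. After two hours we established that the problem was indeed not ours but had been introduced by the product team. The mentor checked the file's revision history, found that someone outside our team had changed it, and suggested I ask that person whether he had modified the source file. It was 3:50 a.m. when I went to bed that night. During the next day I emailed that person but never got a reply.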

 

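On the second evening the mentor and I tried to track down the person in charge on the product team. With so many teams in the company, each relatively independent, this took real effort. Once we had a name, the mentor told me I could contact him. It was already late, so I said I would email him the next day, but the mentor offered to send an email right away to confirm. Since that product is also run from the US, a reply came quickly: that person was no longer responsible for the product, and we should contact someone else. We then contacted that someone else, who said he was a manager and that technical questions should go to one of his engineers. The mentor added the product owner on MOC. By prior agreement, the mentor and I teamed up to talk the owner into changing the position of the new struct member and moving it to the end, which would affect us least. We pitched hard, but he insisted that it was too late to change and that our simulator should serve the product, not the other way round. No luck. So the mentor asked me to start changing our code, the sooner the better. It was past 4:00 a.m. when I went to bed that night.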

 

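On the third day I slept through the morning and went to work in the afternoon. I had barely arrived when someone phoned me about that struct. It turned out that he had already made a private change to our code in the previous release. The change for the previous release had been submitted; the one for this release had not, because it had to wait until the product code was submitted. But the product code had since been submitted without their knowing, so the change could only go in with the next load. No wonder no user had reported the problem in a month, while I hit it in the latest release. Everything suddenly became clear: the problem was no longer ours to fix. I immediately emailed the mentor with the good news.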

 

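That evening the mentor replied, much relieved. He remarked that whoever submitted that code should have asked someone on our team to act as inspector, and asked whether he had. I replied that he probably had not, or we would have known about the change. To my surprise the mentor soon wrote again to say he now remembered: that person had indeed asked our team for an inspection, and the inspector was none other than him, my mentor! I was speechless and could only reply "OMG!^_^".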

 

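This mix-up, a case of looking for the donkey while riding it, cost us a good deal of time, but looking back it is rather funny, so late as it is, I made myself write it down (in fact my body clock is now in sync with Greenwich Mean Time; with a bit more practice I probably will not need to adjust for jet lag when I go to the US ^_^).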

Perl Notes(III) -- Introduction To Berkeley Sockets


3 Introduction To Berkeley Sockets

3.1 Basic Concepts

3.1.1 Binary versus Text-Oriented Protocols

Before they can exchange information across the network, hosts have a fundamental choice to make. They can exchange data either in binary form or as human-readable text. The choice has far-reaching ramifications.

To understand this, consider exchanging the number 1984. To exchange it as text, one host sends the other the string 1984, which, in the common ASCII character set, corresponds to the four hexadecimal bytes 0x31 0x39 0x38 0x34. These four bytes will be transferred in order across the network, and (provided the other host also speaks ASCII) will appear at the other end as "1984".
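As a quick check of the byte values quoted above, here is a minimal sketch that dumps the text form of 1984 as hexadecimal. It uses only Perl's built-in unpack(); the variable names are illustrative.

```perl
use strict;
use warnings;

# Text form: the four ASCII characters '1', '9', '8', '4'.
my $text = '1984';

# The 'H*' template renders every byte of the string as two hex digits.
my $hex = unpack 'H*', $text;

printf "%d bytes on the wire: %s\n", length $text, $hex;
```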

However, 1984 can also be treated as a number, in which case it can fit into the two-byte integer represented in hexadecimal as 0x07C0. If this number is already stored in the local host as a number, it seems sensible to transfer it across the network in its native two-byte form rather than convert it into its four-byte text representation, transfer it, and convert it back into a two-byte number at the other end. Not only does this save some computation, but it uses only half as much network capacity.

Unfortunately, there's a hitch. Different computer architectures have different ways of storing integers and floating point numbers. Some machines use two-byte integers, others four-byte integers, and still others use eight-byte integers. This is called word size. Furthermore, computer architectures have two different conventions for storing integers in memory. In some systems, called big-endian architectures, the most significant part of the integer is stored in the first byte of a two-byte integer. On such systems, reading from low to high, 1984 is represented in memory as the two bytes:

0x07    0xC0
low  -> high

On little-endian architectures, this convention is reversed, and 1984 is stored in the opposite orientation:

0xC0    0x07
low  -> high

These architectures are a matter of convention, and neither has a significant advantage over the other. The problem comes when transferring such data across the network, because this byte pair has to be transferred serially as two bytes. Data in memory is sent across the network from low to high, so for big-endian machines the number 1984 will be transferred as 0x07 0xC0, while for little-endian machines the numbers will be sent in the reverse order. As long as the machine at the other end has the same native word size and byte order, these bytes will be correctly interpreted as 1984 when they arrive. However, if the recipient uses a different byte order, then the two bytes will be interpreted in the wrong order, yielding hexadecimal 0xC007, or decimal 49,159. Even worse, if the recipient interprets these bytes as the top half of a four-byte integer, it will end up as 0xC0070000, or 3,221,684,224. Someone's anniversary party is going to be very late.

Because of the potential for such binary chaos, text-based protocols are the norm on the Internet. All the common protocols convert numeric information into text prior to transferring them, even though this can result in more data being transferred across the net. Some protocols even convert data that doesn't have a sensible text representation, such as audio files, into a form that uses the ASCII character set, because this is generally easier to work with. By the same token, a great many protocols are line-oriented, meaning that they accept commands and transmit data in the form of discrete lines, each terminated by a commonly agreed-upon newline sequence.

A few protocols, however, are binary. Examples include Sun's Remote Procedure Call (RPC) system, and the Napster peer-to-peer file exchange protocol. Such protocols have to be exceptionally careful to represent binary data in a common format. For integer numbers, there is a commonly recognized network format. In network format, a "short" integer is represented in two big-endian bytes, while a "long" integer is represented with four big-endian bytes. Perl's pack() and unpack() functions provide the ability to convert numbers into network format and back again.
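The byte-order discussion can be reproduced with pack() and unpack(). The sketch below is illustrative only: it packs 1984 in network (big-endian) and little-endian order, then deliberately misreads the little-endian bytes to show the two wrong values mentioned earlier. The templates 'n', 'v', and 'N' are standard pack() templates; the variable names are made up for the example.

```perl
use strict;
use warnings;

my $number = 1984;

# 'n' packs a 16-bit unsigned integer in network (big-endian) order;
# 'v' packs the same integer in little-endian order.
my $network = pack 'n', $number;    # bytes 0x07 0xC0
my $little  = pack 'v', $number;    # bytes 0xC0 0x07

printf "network order: %s\n", unpack 'H*', $network;
printf "little-endian: %s\n", unpack 'H*', $little;

# Misreading little-endian bytes as a big-endian short gives 0xC007.
my $misread16 = unpack 'n', $little;

# Misreading them as the top half of a big-endian long gives 0xC0070000.
my $misread32 = unpack 'N', $little . "\x00\x00";

# Reading network-order bytes with the matching template recovers 1984.
my $roundtrip = unpack 'n', $network;

print "misread as short: $misread16\n";
print "misread as long:  $misread32\n";
print "round trip:       $roundtrip\n";
```

Sticking to 'n' and 'N' on both ends is what keeps a binary protocol independent of the hosts' native byte order.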

Floating point numbers and more complicated things like data structures have no commonly accepted network representation. When exchanging binary data, each protocol has to work out its own way of representing such data in a platform-neutral fashion.

3.1.2 Berkeley Sockets

Berkeley sockets are part of an application programming interface (API) that specifies the data structures and function calls that interact with the operating system's network subsystem. They are not a specific protocol: the API defines how the programmer interacts with an idealized network.

3.2 The Anatomy of a Socket

A socket is an endpoint for communications, a portal to the outside world that we can use to send outgoing messages to other processes, and to receive incoming traffic from processes interested in sending messages to us.

To create a socket, we need to provide the system with a minimum of three pieces of information.

3.2.1 The Socket's Domain

The domain defines the family of networking protocols and addressing schemes that the socket will support. The domain is selected from a small number of integer constants defined by the operating system and exported by Perl's Socket module. There are only two common domains (Table 3.1).

Table 3.1. Common Socket Domains

Constant    Description
AF_INET     The Internet protocols
AF_UNIX     Networking within a single host

In addition to these domains, there are many others, including AF_APPLETALK, AF_IPX, and AF_X25, each corresponding to a particular addressing scheme. AF_INET6, corresponding to the long addresses of TCP/IP version 6, was not yet supported by Perl when the text these notes follow was written; current versions of the Socket module do export it. The AF_ prefix stands for "address family." In addition, there is a series of "protocol family" constants starting with the PF_ prefix.

3.2.2 The Socket's Type

The socket type identifies the basic properties of socket communications.

Table 3.2. Constants Exported by Socket

Constant       Description
SOCK_STREAM    A continuous stream of data
SOCK_DGRAM     Individual packets of data
SOCK_RAW       Access to internal protocols and interfaces

Perl fully supports the SOCK_STREAM and SOCK_DGRAM socket types. SOCK_RAW is supported through an add-on module named Net::Raw.

3.2.3 The Socket's Protocol

Like the domain and socket type, the protocol is a small integer. However, the protocol numbers are not available as constants, but instead must be looked up at run time using the Perl getprotobyname() function.

Table 3.3. Some Socket Protocols

Protocol    Description
tcp         Transmission Control Protocol for stream sockets
udp         User Datagram Protocol for datagram sockets
icmp        Internet Control Message Protocol
raw         Creates IP packets manually

The TCP and UDP protocols are supported directly by the Perl sockets API. You can get access to the ICMP and raw protocols via the Net::ICMP and Net::Raw third-party modules.

The allowed combinations of socket domain, type, and protocol are few. SOCK_STREAM goes with TCP, and SOCK_DGRAM goes with UDP. Also notice that the AF_UNIX address family doesn't use a named protocol, but a pseudoprotocol named PF_UNSPEC (for "unspecified").

Table 3.4. Allowed Combinations of Socket Type and Protocol in the INET and UNIX Domains

Domain     Type           Protocol
AF_INET    SOCK_STREAM    tcp
AF_INET    SOCK_DGRAM     udp
AF_UNIX    SOCK_STREAM    PF_UNSPEC
AF_UNIX    SOCK_DGRAM     PF_UNSPEC
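The three pieces of information (domain, type, protocol) map directly onto the arguments of Perl's built-in socket() call. The following is a minimal sketch, assuming a system whose protocol database knows "tcp" and "udp"; the handle names are illustrative.

```perl
use strict;
use warnings;
use Socket qw(AF_INET AF_UNIX SOCK_STREAM SOCK_DGRAM PF_UNSPEC);

# Protocol numbers are looked up by name at run time.
# In scalar context getprotobyname() returns the protocol number.
my $tcp = getprotobyname('tcp');
my $udp = getprotobyname('udp');
defined $tcp && defined $udp or die "protocol lookup failed";

# One socket for each of three allowed combinations from Table 3.4.
socket(my $tcp_sock,  AF_INET, SOCK_STREAM, $tcp)      or die "tcp socket: $!";
socket(my $udp_sock,  AF_INET, SOCK_DGRAM,  $udp)      or die "udp socket: $!";
socket(my $unix_sock, AF_UNIX, SOCK_STREAM, PF_UNSPEC) or die "unix socket: $!";

print "tcp is protocol $tcp, udp is protocol $udp\n";
```

At this point the sockets exist but are not yet bound or connected to any address.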

3.2.4 Datagram Sockets

Datagram-type sockets provide for the transmission of connectionless, unreliable, unsequenced messages. UDP is the chief datagram-style protocol used by the Internet protocol family.

As the diagram in Figure 3.2 shows, datagram services resemble the postal system. Like a letter or a telegram, each datagram in the system carries its destination address, its return address, and a certain amount of data. The Internet protocols will make the best effort to get the datagram delivered to its destination.

Figure 3.2. Datagram sockets provide connectionless, unreliable, unsequenced transmission of messages

There is no long-term relationship between the sending socket and the recipient socket: A client can send a datagram off to one server, then immediately turn around and send a datagram to another server using the same socket. But the connectionless nature of UDP comes at a price. Like certain countries' postal systems, it is very possible for a datagram to get "lost in the mail." A client cannot know whether a server has received its message until it receives an acknowledgment in reply. And if no acknowledgment arrives, it still can't know for sure that the message was lost, because the server might have received the original message and the acknowledgment got lost!

Datagrams are neither synchronized nor flow controlled. If you send a set of datagrams out in a particular order, they might not arrive in that order. Because of the vagaries of the Internet, the first datagram may go by one route, and the second one may take a different path. If the second route is faster than the first one, the two datagrams may arrive in the opposite order from which they were sent. It is also possible for a datagram to get duplicated in transit, resulting in the same message being received twice.

Because of the connectionless nature of datagrams, there is no flow control between the sender and the recipient. If the sender transmits datagrams faster than the recipient can process them, the recipient has no way to signal the sender to slow down, and will eventually start to discard packets.

Although a datagram's delivery is not reliable, its contents are. Modern implementations of UDP provide each datagram with a checksum that ensures that its data portion is not corrupted in transit.
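To make the datagram model concrete, here is a small sketch that sends two UDP datagrams between two sockets in the same process over the loopback address. It assumes nothing beyond the standard Socket module; the port is chosen by the operating system, and the message strings are arbitrary. Over loopback the datagrams normally arrive intact and in order, which a real network does not guarantee.

```perl
use strict;
use warnings;
use Socket qw(AF_INET SOCK_DGRAM INADDR_LOOPBACK pack_sockaddr_in unpack_sockaddr_in);

my $udp = getprotobyname('udp');

# Receiver: bind to the loopback address and let the system pick a free port.
socket(my $receiver, AF_INET, SOCK_DGRAM, $udp) or die "socket: $!";
bind($receiver, pack_sockaddr_in(0, INADDR_LOOPBACK)) or die "bind: $!";
my ($port) = unpack_sockaddr_in(getsockname($receiver));

# Sender: no connection is set up; every datagram carries its own destination.
socket(my $sender, AF_INET, SOCK_DGRAM, $udp) or die "socket: $!";
my $dest = pack_sockaddr_in($port, INADDR_LOOPBACK);
for my $message ('first', 'second') {
    defined send($sender, $message, 0, $dest) or die "send: $!";
}

# Each recv() call returns at most one datagram; messages are never merged.
my @received;
for (1 .. 2) {
    my $buffer = '';
    defined recv($receiver, $buffer, 1024, 0) or die "recv: $!";
    push @received, $buffer;
}
print "got: @received\n";
```

Note that the sender never learns whether the datagrams arrived; any acknowledgment would have to be another datagram sent back by the receiver.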

3.2.5 Stream Sockets

The other major paradigm is stream sockets, implemented in the Internet domain as the TCP protocol. Stream sockets provide sequenced, reliable bidirectional communications via byte-oriented streams. As depicted in Figure 3.3, stream sockets resemble a telephone conversation. Clients connect to servers using their address, the two exchange data for a period of time, and then one of the pair breaks off the connection.

Figure 3.3. Stream sockets provide sequenced, reliable, bidirectional communications

Reading and writing to stream sockets is a lot like reading and writing to a file. There are no arbitrary size limits or record boundaries, although you can impose a record-oriented structure on the stream if you like. Because stream sockets are sequenced and reliable, you can write a series of bytes into a socket secure in the knowledge that they will emerge at the other end in the correct order, provided that they emerge at all ("reliable" does not mean immune to network errors).

TCP also implements flow control. Unlike UDP, where the danger of filling the data-receiving buffer is very real, TCP automatically signals the sending host to suspend transmission temporarily when the reading host is falling behind, and to resume sending data when the reading host is again ready. This flow control happens behind the scenes and is ordinarily invisible.

Although it looks and acts like a continuous byte stream, the TCP protocol is actually implemented on top of a datagram-style service, in this case the low-level IP protocol. IP packets are just as unreliable as UDP datagrams, so behind the scenes TCP is responsible for keeping track of packet sequence numbers, acknowledging received packets, and retransmitting lost packets.
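The stream behaviour can be seen in a single process by connecting a TCP client to a listening socket on the loopback address. This is only a sketch under that assumption (no fork, system-chosen port, illustrative names): the client writes two separate chunks, and the server reads them back as one undivided byte stream.

```perl
use strict;
use warnings;
use Socket qw(AF_INET SOCK_STREAM INADDR_LOOPBACK pack_sockaddr_in unpack_sockaddr_in);

my $tcp = getprotobyname('tcp');

# Server side: listen on a system-chosen loopback port.
socket(my $listener, AF_INET, SOCK_STREAM, $tcp) or die "socket: $!";
bind($listener, pack_sockaddr_in(0, INADDR_LOOPBACK)) or die "bind: $!";
listen($listener, 1) or die "listen: $!";
my ($port) = unpack_sockaddr_in(getsockname($listener));

# Client side: connect, then accept the queued connection on the server side.
socket(my $client, AF_INET, SOCK_STREAM, $tcp) or die "socket: $!";
connect($client, pack_sockaddr_in($port, INADDR_LOOPBACK)) or die "connect: $!";
accept(my $server, $listener) or die "accept: $!";

# Two separate writes; TCP keeps the byte order but not the write boundaries.
defined syswrite($client, 'abc') or die "write: $!";
defined syswrite($client, 'def') or die "write: $!";
close($client);

# Read until end-of-stream (sysread returns 0 once the client has closed).
my $data = '';
while (sysread($server, my $chunk, 1024)) {
    $data .= $chunk;
}
print "received stream: $data\n";
```

How many sysread() calls it takes to collect the six bytes is up to the system, which is why any record structure has to be imposed by the application.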

3.2.6 Datagram versus Stream Sockets

With all its reliability problems, you might wonder why anyone uses UDP. The answer is that most client/server programs on the Internet use TCP stream sockets instead. In most cases, TCP is the right solution for you, too.

There are some circumstances, however, in which UDP might be a better choice. For example, time servers use UDP datagrams to transmit the time of day to clients who use the information for clock synchronization. If a datagram disappears in transit, it's neither necessary nor desirable to retransmit it because by the time it arrives it will no longer be relevant.

UDP is also preferred when the interaction between one host and the other is very short. The length of time to set up and take down a TCP connection is about eightfold greater than the exchange of a single byte of data via UDP (for details, see [Stevens 1996]). If relatively small amounts of data are being exchanged, the TCP setup time will dominate performance. Even after a TCP connection is established, each transmitted byte consumes more bandwidth than UDP because of the additional overhead for ensuring reliability.

Another common scenario occurs when a host must send the same data to many places; for example, it wants to transmit a video stream to multiple viewers. The overhead to set up and manage a large number of TCP connections can quickly exhaust operating system resources, because a different socket must be used for each connection. In contrast, sending a series of UDP datagrams is much more sparing of resources. The same socket can be reused to send datagrams to many hosts.

Whereas TCP is always a one-to-one connection, UDP also allows one-to-many and many-to-many transmissions. At one end of the spectrum, you can address a UDP datagram to the "broadcast address," broadcasting a message to all listening hosts on the local area network. At the other end of the spectrum, you can target a message to a predefined group of hosts using the "multicast" facility of modern IP implementations. These advanced features are covered in Chapters 20 and 21.

The Internet's DNS is a common example of a UDP-based service. It is responsible for translating hostnames into IP addresses, and vice versa, using a loose-knit network of DNS servers. If a client does not get a response from a DNS server, it just retransmits its request. The cost of an occasional lost datagram is smaller than the overhead of setting up a new TCP connection for each request. Other common examples of UDP services include Sun's Network File System (NFS) and the Trivial File Transfer Protocol (TFTP). The latter is used by diskless workstations during boot in order to load their operating system over the network. UDP was originally chosen for this purpose because its implementation is relatively small. Therefore, UDP fit more easily into the limited ROM space available to workstations at the time the protocol was designed.

3.3 Socket Addressing

For the UNIX domain, which can be used only between two processes on the same host machine, addresses are simply paths on the host's filesystem, such as /usr/tmp/log. For the Internet domain, each socket address has three parts: the IP address, the port, and the protocol.

3.3.1 IP Addresses

Many of Perl's networking calls require you to work with IP addresses in the form of packed binary strings. IP addresses can be converted manually to binary format and back again using pack() and unpack() with a template of "C4" (four unsigned characters). For example, here's how to convert 18.157.0.125 into its packed form and then reverse the process:

($a,$b,$c,$d)      = split /\./, '18.157.0.125';
$packed_ip_address = pack 'C4',$a,$b,$c,$d;
($a,$b,$c,$d)      = unpack 'C4',$packed_ip_address;
$dotted_ip_address = join '.', $a,$b,$c,$d;
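The Socket module also provides inet_aton() and inet_ntoa(), which perform the same conversion. The sketch below, with illustrative variable names, checks that they agree with the manual pack 'C4' approach.

```perl
use strict;
use warnings;
use Socket qw(inet_aton inet_ntoa);

my $dotted = '18.157.0.125';

# inet_aton() turns a dotted-quad string into a 4-byte packed address.
my $packed = inet_aton($dotted);

# The manual equivalent: split on literal dots and pack four unsigned bytes.
my $manual = pack 'C4', split /\./, $dotted;

# inet_ntoa() reverses the conversion.
my $restored = inet_ntoa($packed);

printf "packed length: %d bytes, hex: %s\n", length $packed, unpack 'H*', $packed;
print  "same as manual packing: ", ($packed eq $manual ? "yes" : "no"), "\n";
print  "restored: $restored\n";
```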

Most hosts have two addresses, the "loopback" address 127.0.0.1 (often known by its symbolic name "localhost") and its public Internet address. The loopback address is associated with a device that loops transmissions back onto itself, allowing a client on the host to make an outgoing connection to a server running on the same host. Although this sounds a bit pointless, it is a powerful technique for application development, because it means that you can develop and test software on the local machine without access to the network.

The public Internet address is associated with the host's network interface card, such as an Ethernet card. The address is either assigned to the host by the network administrator or, in systems with dynamic host addressing, by a Boot Protocol (BOOTP) or Dynamic Host Configuration Protocol (DHCP) server. If a host has multiple network interfaces installed, each one can have a distinct IP address. It's also possible for a single interface to be configured to use several addresses.

3.3.2 Reserved IP Addresses, Subnets, and Netmasks

In order for a packet of information to travel from one location to another across the Internet, it must hop across a series of physical networks. For example, a packet leaving your desktop computer must travel across your LAN (local area network) to a modem or router, then across your Internet service provider's (ISP) regional network, then across a backbone to another ISP's regional network, and finally to its destination machine.

Network routers keep track of how the networks interconnect, and are responsible for determining the most efficient route to get a packet from point A to point B. However, if IP addresses were allocated ad hoc, this task would not be feasible because each router would have to maintain a map showing the locations of all IP addresses. Instead, IP addresses are allocated in contiguous chunks for use in organizational and regional networks.

For example, my employer, the Cold Spring Harbor Laboratory (CSHL), owns the block of IP addresses that range from 143.48.0.0 through 143.48.255.255 (this is a so-called class B address). When a backbone router sees a packet addressed to an IP address in this range, it needs only to determine how to get the packet into CSHL's network. It is then the responsibility of CSHL's routers to get the packet to its destination. In practice, CSHL and other large organizations split their allocated address ranges into several subnets and use routers to interconnect the parts.

A computer that is sending out an IP packet must determine whether the destination machine is directly reachable (e.g., over the Ethernet) or whether the packet must be directed to a router that interconnects the local network to more distant locations. The basic decision is whether the packet is part of the local network or part of a distant network.

To make this decision possible, IP addresses are arbitrarily split into a host part and a network part. For example, in CSHL's network, the split occurs after the second byte: the network part is 143.48. and the host part is the rest. So 143.48.0.0 is the first address in CSHL's network, and 143.48.255.255 is the last.

To describe where the network/host split occurs for routing purposes, networks use a netmask, which is a bitmask with 1s in the positions of the network part of the IP address. Like the IP address itself, the netmask is usually written in dotted-quad form. Continuing with our example, CSHL has a netmask of 255.255.0.0, which, when written in binary, is 11111111,11111111,00000000,00000000.
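The local-versus-distant decision amounts to masking both addresses with the netmask and comparing the network parts. Here is a minimal sketch of that comparison; ip_to_int() and same_network() are hypothetical helper names, and the host addresses are made-up examples inside and outside the 143.48.x.x range.

```perl
use strict;
use warnings;
use Socket qw(inet_aton);

# Convert a dotted-quad string to a 32-bit integer (hypothetical helper).
sub ip_to_int {
    my ($dotted) = @_;
    return unpack 'N', inet_aton($dotted);
}

# True when both addresses share the same network part under the netmask.
sub same_network {
    my ($first, $second, $netmask) = @_;
    my $mask = ip_to_int($netmask);
    return (ip_to_int($first) & $mask) == (ip_to_int($second) & $mask);
}

my $local   = same_network('143.48.1.10', '143.48.200.7', '255.255.0.0');
my $distant = same_network('143.48.1.10', '18.157.0.125', '255.255.0.0');

print "143.48.200.7 is ", ($local   ? "local" : "distant"), "\n";
print "18.157.0.125 is ", ($distant ? "local" : "distant"), "\n";
```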

Historically, IP networks were divided into three classes on the basis of their netmasks (Table 3.5). Class A networks have a netmask of 255.0.0.0 and approximately 16 million hosts. Class B networks have a netmask of 255.255.0.0 and some 65,000 hosts, and class C networks use the netmask 255.255.255.0 and support 254 hosts (as we will see, the first and last host numbers in a network range are unavailable for use as a normal host address).

Table 3.5. Address Classes and Their Netmasks

Class    Netmask          Example Address    Network Part    Host Part
A        255.0.0.0        120.155.32.5       120.            155.32.5
B        255.255.0.0      128.157.32.5       128.157.        32.5
C        255.255.255.0    192.66.12.56       192.66.12.      56

As the Internet has become more crowded, however, networks have had to be split up in more flexible ways. It's common now to see netmasks that don't end at byte boundaries. For example, the netmask 255.255.255.128 (binary 11111111,11111111,11111111,10000000) splits the last byte in half, creating a set of 126-host networks. The modern Internet routes packets based on this more flexible scheme, called Classless Inter-Domain Routing (CIDR). CIDR uses a concise convention to describe networks in which the network address is followed by a slash and an integer containing the number of 1s in the mask. For example, CSHL's network is described by the CIDR address 143.48.0.0/16. CIDR is described in detail in RFCs 1517 through 1520, and in the FAQs listed in Appendix D.
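The relationship between a CIDR prefix length, the dotted-quad netmask, and the number of usable hosts can be sketched in a few lines. The helper names prefix_to_netmask() and usable_hosts() are hypothetical, and the host count assumes prefixes of 30 bits or fewer, where the first and last addresses are set aside.

```perl
use strict;
use warnings;

# Build the dotted-quad netmask that has $bits leading 1s (hypothetical helper).
sub prefix_to_netmask {
    my ($bits) = @_;
    my $mask = $bits == 0 ? 0 : (0xFFFFFFFF << (32 - $bits)) & 0xFFFFFFFF;
    return join '.', unpack 'C4', pack 'N', $mask;
}

# Addresses in the range minus the two reserved ones (valid for $bits <= 30).
sub usable_hosts {
    my ($bits) = @_;
    return 2 ** (32 - $bits) - 2;
}

for my $bits (16, 24, 25) {
    printf "/%d -> netmask %s, %d usable hosts\n",
        $bits, prefix_to_netmask($bits), usable_hosts($bits);
}
```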

Figuring out the network and broadcast addresses can be confusing when you work with netmasks that do not end at byte boundaries. The Net::Netmask module, available on CPAN, provides facilities for calculating these values in an intuitive way. You'll also find a short module that I wrote, Net::NetmaskLite, in Appendix A. You might want to peruse this code in order to learn the relationships among the network address, broadcast address, and netmask.

The first and last addresses in a subnet have special significance and cannot be used as ordinary host addresses. The first address, sometimes known as the all-zeroes address, is reserved for use in routing tables to denote the network as a whole (network address). The last address in the range, known as the all-ones address, is reserved for use as the broadcast address. IP packets sent to this address will be received by all hosts on the subnet. For example, for the network 192.18.4.x (a class C address or 192.18.4.0/24 in CIDR format), the network address is 192.18.4.0 and the broadcast address is 192.18.4.255. 
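The relationship among the three values can be worked out with bitwise operations. The sketch below is only an illustration, not the Net::Netmask or Net::NetmaskLite code; the helper name network_and_broadcast() and the host address 192.18.4.77 are assumptions made for the example.

```perl
use strict;
use warnings;
use Socket qw(inet_aton inet_ntoa);

# Hypothetical helper: return (network, broadcast) for a dotted-quad
# address and netmask, treating both as 32-bit integers.
sub network_and_broadcast {
    my ($addr, $mask) = @_;
    my $addr_n = unpack('N', inet_aton($addr));
    my $mask_n = unpack('N', inet_aton($mask));
    my $network   = $addr_n & $mask_n;                      # host bits all 0
    my $broadcast = $network | (~$mask_n & 0xFFFFFFFF);     # host bits all 1
    return map { inet_ntoa(pack('N', $_)) } ($network, $broadcast);
}

my ($net, $bcast) = network_and_broadcast('192.18.4.77', '255.255.255.0');
print "network $net, broadcast $bcast\n";
# prints "network 192.18.4.0, broadcast 192.18.4.255"
```

The same function handles masks that do not end at a byte boundary, such as 255.255.255.128.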

In addition, several IP address ranges have been set aside for special purposes (Table 3.6). The class A network 10.x.x.x, the 16 class B networks 172.16.x.x through 172.31.x.x, and the 256 class C networks 192.168.0.x through 192.168.255.x are reserved for use as internal networks. An organization may use any of these networks internally, but must not connect the network directly to the Internet. The 192.168.x.x networks are used frequently in testing, or placed behind firewall systems that translate all the internal network addresses into a single public IP address. The network addresses 224.x.x.x through 239.x.x.x are reserved for multicasting applications, and everything from 240.x.x.x upward is reserved for future expansion.

Table 3.6. Reserved IP Addresses

Address                      Description
127.0.0.x                    Loopback interface
10.x.x.x                     Private class A address
172.16.x.x–172.31.x.x        Private class B addresses
192.168.0.x–192.168.255.x    Private class C addresses

Finally, IP address 127.0.0.x is reserved for use as the loopback network. Anything sent to an address in this range is received by the local host.
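A program can test an address against the ranges in Table 3.6 with a few comparisons. This is a minimal sketch; the helper name reserved_range() and its return strings are invented here, and it treats the whole 127.x.x.x block as loopback, which is an assumption broader than the 127.0.0.x entry in the table.

```perl
use strict;
use warnings;

# Hypothetical helper: classify a dotted-quad address against the
# reserved ranges described above.
sub reserved_range {
    my @o = split /\./, shift;
    return 'loopback'  if $o[0] == 127;    # assumption: all of 127.x.x.x
    return 'private'   if $o[0] == 10;
    return 'private'   if $o[0] == 172 && $o[1] >= 16 && $o[1] <= 31;
    return 'private'   if $o[0] == 192 && $o[1] == 168;
    return 'multicast' if $o[0] >= 224 && $o[0] <= 239;
    return 'reserved'  if $o[0] >= 240;
    return 'public';
}

print "$_ is ", reserved_range($_), "\n"
    for qw(10.1.2.3 172.20.0.1 192.168.1.1 127.0.0.1 143.48.7.1);
```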

3.3.3 Network Ports

Once a message reaches its destination IP address, there's still the matter of finding the correct program to deliver it to. It's common for a host to be running multiple network servers, and it would be impractical, not to say confusing, to deliver the same message to them all. That's where the port number comes in. The port number part of the socket address is an unsigned 16-bit number ranging from 1 to 65535. In addition to its IP address, each active socket on a host is identified by a unique port number; this allows messages to be delivered unambiguously to the correct program. When a program creates a socket, it may ask the operating system to associate a port with the socket. If the port is not being used, the operating system will grant this request, and will refuse other programs access to the port until the port is no longer in use. If the program doesn't specifically request a port, one will be assigned to it from the pool of unused port numbers.

There are actually two sets of port numbers, one for use by TCP sockets, and the other for use by UDP-based programs. It is perfectly all right for two programs to be using the same port number provided that one is using it for TCP and the other for UDP.

Not all port numbers are created equal. The ports in the range 0 through 1023 are reserved for the use of "well-known" services, which are assigned and maintained by ICANN, the Internet Corporation for Assigned Names and Numbers. For example, TCP port 80 is reserved for use for the HTTP used by Web servers, TCP port 25 is used for the SMTP used by e-mail transport agents, and UDP port 53 is used for the domain name service (DNS). Because these ports are well known, you can be pretty certain that a Web server running on a remote machine will be listening on port 80. On UNIX systems, only the root user (i.e., the superuser) is allowed to create a socket using a reserved port. This is partly to prevent unprivileged users on the system inadvertently running code that will interfere with the operations of the host's network services.

Most services are either TCP- or UDP-based, but some can communicate with both protocols. In the interest of future compatibility, ICANN usually reserves both the UDP and TCP ports for each service. However, there are many exceptions to this rule. For example, TCP port 514 is used on UNIX systems for remote shell (login) services, while UDP port 514 is used for the system logging daemon.

In some versions of UNIX, the high-numbered ports in the range 49152 through 65535 are reserved by the operating system for use as "ephemeral" ports to be assigned automatically to outgoing TCP/IP connections when a port number hasn't been explicitly requested. The remaining ports, those in the range 1024 through 49151, are free for use in your own applications, provided that some other service has not already claimed them. It is a good idea to check the ports in use on your machine by using one of the network tools introduced later in this chapter (Network Analysis Tools) before claiming one.
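The three ranges can be summarized in a small lookup. This is only a sketch: the helper name port_range() and its labels are invented for the example, and the ephemeral boundary is the one quoted above for some versions of UNIX, so it may differ on your system.

```perl
use strict;
use warnings;

# Hypothetical helper: name the range a port number falls into,
# using the boundaries given in the text.
sub port_range {
    my $port = shift;
    return 'invalid'    if $port !~ /^\d+$/ || $port > 65535;
    return 'well-known' if $port <= 1023;     # needs root on UNIX
    return 'free'       if $port <= 49151;    # available to your applications
    return 'ephemeral';                       # handed out automatically
}

print "port $_ is ", port_range($_), "\n" for (80, 8080, 50000);
```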

3.3.4 The sockaddr_in Structure

A socket address is the combination of the host address and the port, packed together in a binary structure called a sockaddr_in. This corresponds to a C structure of the same name that is used internally to call the system networking routines. (By analogy, UNIX domain sockets use a packed structure called a sockaddr_un.) Functions provided by the standard Perl Socket module allow you to create and manipulate sockaddr_in structures easily:

$packed_address = inet_aton($dotted_quad)

Given an IP address in dotted-quad form, this function packs it into binary form suitable for use by sockaddr_in(). The function will also operate on symbolic hostnames. If the hostname cannot be looked up, it returns undef.

$dotted_quad = inet_ntoa($packed_address)

This function takes a packed IP address and converts it into human-readable dotted-quad form. It does not attempt to translate IP addresses into hostnames. You can achieve this effect by using gethostbyaddr(), discussed later.

$socket_addr = sockaddr_in($port,$address)

($port,$address) = sockaddr_in($socket_addr)

When called in a scalar context, sockaddr_in() takes a port number and a binary IP address and packs them together into a socket address, suitable for use by socket calls such as bind() and connect(). When called in a list context, sockaddr_in() does the opposite, translating a socket address into the port and IP address. The IP address must still be passed through inet_ntoa() to obtain a human-readable string.

$socket_addr = pack_sockaddr_in($port,$address)

($port,$address) = unpack_sockaddr_in($socket_addr)

If you don't like the confusing behavior of sockaddr_in(), you can use these two functions to pack and unpack socket addresses in a context-insensitive manner.
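Here is a short round trip through these functions, as a sketch. The address 127.0.0.1 and port 80 are arbitrary values chosen for the example.

```perl
use strict;
use warnings;
use Socket qw(inet_aton inet_ntoa pack_sockaddr_in unpack_sockaddr_in);

# Pack a dotted-quad address and a port into a socket address.
my $packed_ip   = inet_aton('127.0.0.1');
my $socket_addr = pack_sockaddr_in(80, $packed_ip);

# Unpack it again and convert the address back to dotted-quad form.
my ($port, $address) = unpack_sockaddr_in($socket_addr);
my $dotted_quad = inet_ntoa($address);

print "$dotted_quad:$port\n";    # prints "127.0.0.1:80"
```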

In some references, you'll see a socket's address referred to as its "name." Don't let this confuse you. A socket's address and its name are one and the same.

3.4 

Perl Notes(II)

 Part II    Network Programming With Perl

1 Input/Output Basics

1.1 Filehandles

Filehandles are the foundation of networked applications. In this section we review the ins and outs of filehandles. Even if you're an experienced Perl programmer, you might want to scan this section to refresh your memory on some of the more obscure aspects of Perl I/O.


1.1.1 Standard Filehandles

A filehandle connects a Perl script to the outside world. Reading from a filehandle brings in outside data, and writing to one exports data. Depending on how it was created, a filehandle may be connected to a disk file, to a hardware device such as a serial port, to a local process such as a command-line window in a windowing system, or to a remote process such as a network server. It's also possible for a filehandle to be connected to a "bit bucket" device that just sucks up data and ignores it.

A filehandle is any valid Perl identifier that consists of uppercase and lowercase letters, digits, and the underscore character. Unlike other variables, a filehandle does not have a distinctive prefix (like "$"). So to make them distinct, Perl programmers often represent them in all capital letters, or caps.

When a Perl script starts, exactly three filehandles are open by default: STDOUT, STDIN, and STDERR. The STDOUT filehandle, for "standard output," is the default filehandle for output. Data sent to this filehandle appears on the user's preferred output device, usually the command-line window from which the script was launched. STDIN, for "standard input," is the default input filehandle. Data read from this filehandle is taken from the user's preferred input device, usually the keyboard. STDERR ("standard error") is used for error messages, diagnostics, debugging, and other such incidental output. By default STDERR uses the same output device as STDOUT, but this can be changed at the user's discretion. The reason that there are separate filehandles for normal and abnormal output is so that the user can divert them independently; for example, to send normal output to a file and error output to the screen.

This code fragment will read a line of input from STDIN, remove the terminating end-of-line character with the chomp() function, and echo it to standard output:

$input = <STDIN>;
chomp($input);
print STDOUT "If I heard you correctly, you said: $input\n";

By taking advantage of the fact that STDIN and STDOUT are the defaults for many I/O operations, and by combining chomp() with the input operation, the same code could be written more succinctly like this:

chomp($input = <>);
print "If I heard you correctly, you said: $input\n";

We review the <> and print() functions in the next section. Similarly, STDERR is the default destination for the warn() and die() functions.

The user can change the attachment of the three standard filehandles before launching the script. On UNIX and Windows systems, this is done using the redirect metacharacters "<" and ">". For example, given a script named muncher.pl this command will change the script's standard input so that it comes from the file data.txt, and its standard output so that processed data ends up in crunched.txt:

% muncher.pl <data.txt >crunched.txt
						

Standard error isn't changed, so diagnostic messages (e.g., from the built-in warn() and die() functions) appear on the screen.

On Macintosh systems, users can change the source of the three standard filehandles by selecting filenames from a dialog box within the MacPerl development environment.


1.1.2 Input and Output Operations

Perl gives you the option of reading from a filehandle one line at a time, suitable for text files, or reading from it in chunks of arbitrary length, suitable for binary byte streams like image files.

For input, the <> operator is used to read from a filehandle in a line-oriented fashion, and read() or sysread() to read in a byte-stream fashion. For output, print() and syswrite() are used for both text and binary data (you decide whether to make the output line-oriented by printing newlines).

$line = <FILEHANDLE>

@lines = <FILEHANDLE>

$line = <>

@lines = <>


The <> ("angle bracket") operator is sensitive to the context in which it is called. If it is used to assign to a scalar variable, a so-called scalar context, it reads a line of text from the indicated filehandle, returning the data along with its terminating end-of-line character. After reading the last line of the filehandle, <> will return undef, signaling the end-of-file (EOF) condition.

When <> is assigned to an array or used in another place where Perl ordinarily expects a list, it reads all lines from the filehandle through to EOF, returning them as one (potentially gigantic) list. This is called a list context.
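The difference between the two contexts is easy to see on a small input. The sketch below assumes Perl 5.8 or later so that an in-memory filehandle can stand in for a real file; the lexical handle $fh is used for convenience and behaves like the bareword filehandles in the text.

```perl
use strict;
use warnings;

# An in-memory filehandle stands in for a real file.
my $text = "first\nsecond\nthird\n";
open(my $fh, '<', \$text) or die "Can't open in-memory file: $!";

my $line  = <$fh>;    # scalar context: one line, newline still attached
my @lines = <$fh>;    # list context: all remaining lines through EOF

print "scalar context read: $line";
print "list context read ", scalar(@lines), " more lines\n";
close($fh);
```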

If called in a "void context" (i.e., without being assigned to a variable), <> copies a line into the $_ global variable. This is commonly seen in while() loops, and often combined with pattern matches and other operations that use $_ implicitly:

while (<>) {
   print "Found a gnu\n" if /GNU/i;
}

The <FILEHANDLE> form of this function explicitly gives the filehandle to read from. However, the <> form is "magical." If the script was called with a set of file names as command-line arguments, <> will attempt to open() each argument in turn and will then return lines from them as if they were concatenated into one large pseudofile.

If no files are given on the command line, or if a single file named "-" is given, then <> reads from standard input and is equivalent to <STDIN>. See the perlfunc POD documentation for an explanation of how this works (pod perlfunc, as explained in the Preface).

$bytes = read (FILEHANDLE,$buffer,$length [,$offset])

$bytes = sysread (FILEHANDLE,$buffer,$length [,$offset])

The read() and sysread() functions read data of arbitrary length from the indicated filehandle. Up to $length bytes of data will be read, and placed in the $buffer scalar variable. Both functions return the number of bytes actually read, numeric 0 on the end of file, or undef on an error.

This code fragment will attempt to read 50 bytes of data from STDIN, placing the information in $buffer, and assigning the number of bytes read to $bytes:

my $buffer;
$bytes = read (STDIN,$buffer,50);

By default, the read data will be placed at the beginning of $buffer, overwriting whatever was already there. You can change this behavior by providing the optional numeric $offset argument, to specify that read data should be written into the variable starting at the specified position.

The main difference between read() and sysread() is that read() uses standard I/O buffering, and sysread() does not. This means that read() will not return until either it can fetch the exact number of bytes requested or it hits the end of file. The sysread() function, in contrast, can return partial reads. It is guaranteed to return at least 1 byte, but if it cannot immediately read the number of bytes requested from the filehandle, it will return what it can. This behavior is discussed in more detail later in the Buffering and Blocking section.
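Because sysread() can return fewer bytes than requested, code that needs a fixed amount of data usually loops. The following is a sketch of one way to do that; the helper name read_exactly() is invented for the example, and a pipe is used only because it is a convenient handle for unbuffered reads.

```perl
use strict;
use warnings;

# Hypothetical helper: keep calling sysread() until $length bytes have
# arrived or EOF is reached. Returns the data, or undef on a read error.
sub read_exactly {
    my ($fh, $length) = @_;
    my $buffer = '';
    while (length($buffer) < $length) {
        # The offset argument appends new data to the end of $buffer.
        my $bytes = sysread($fh, $buffer, $length - length($buffer), length($buffer));
        return unless defined $bytes;    # I/O error; $! holds the reason
        last if $bytes == 0;             # end of file
    }
    return $buffer;
}

pipe(my $reader, my $writer) or die "Can't create pipe: $!";
syswrite($writer, "hello, world");
close($writer);

my $got = read_exactly($reader, 5);
print "read: $got\n";    # prints "read: hello"
```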

$result = print FILEHANDLE $data1,$data2,$data3...

$result = print $data1,$data2,$data3...

The print() function prints a list of data items to a filehandle. In the first form, the filehandle is given explicitly. Notice that there is no comma between the filehandle name and the first data item. In the second form, print() uses the current default filehandle, usually STDOUT. The default filehandle can be changed using the one-argument form of select() (discussed below). If no data arguments are provided, then print() prints the contents of $_.

If output was successful, print() returns a true value. Otherwise it returns false and leaves an error message in the variable named $!.

Perl is a parentheses-optional language. Although I prefer using parentheses around function arguments, most Perl scripts drop them with print(), and this book follows that convention as well.

$result = printf $format,$data1,$data2,$data3...

The printf() function is a formatted print. The indicated data items are formatted and printed according to the $format format string. The formatting language is quite rich, and is explained in detail in Perl's POD documentation for the related sprintf() (string formatting) function.

$bytes = syswrite (FILEHANDLE,$data [,$length [,$offset]])

The syswrite() function is an alternative way to write to a filehandle that gives you more control over the process. Its arguments are a filehandle and a scalar value (a variable or string literal). It writes the data to the filehandle, and returns the number of bytes successfully written.

By default, syswrite() attempts to write the entire contents of $data, beginning at the start of the string. You can alter this behavior by providing an optional $length and $offset, in which case syswrite() will write $length bytes beginning at the position specified by $offset.
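On pipes and sockets syswrite() may write fewer bytes than it was given, so the return value and the $offset argument are often used together. This is a minimal sketch under that assumption; the helper name write_all() is invented for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: write all of $data, advancing the offset after
# each partial write. Returns the byte count, or undef on error.
sub write_all {
    my ($fh, $data) = @_;
    my $offset = 0;
    while ($offset < length($data)) {
        my $written = syswrite($fh, $data, length($data) - $offset, $offset);
        return unless defined $written;    # $! holds the reason
        $offset += $written;
    }
    return $offset;
}

pipe(my $reader, my $writer) or die "Can't create pipe: $!";
my $count = write_all($writer, "abcdef");
close($writer);

my $echo = do { local $/; <$reader> };    # read everything back
print "wrote $count bytes: $echo\n";      # prints "wrote 6 bytes: abcdef"
```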

Aside from familiarity, the main difference between print() and syswrite() is that the former uses standard I/O buffering, while the latter does not. We discuss this later in the Buffering and Blocking section.

Don't confuse syswrite() with Perl's unfortunately named write() function. The latter is part of Perl's report formatting package, which we won't discuss further.

$previous = select(FILEHANDLE)

The select() function changes the default output filehandle used by print(). It takes the name of the filehandle to set as the default, and returns the name of the previous default. There is also a version of select() that takes four arguments, which is used for I/O multiplexing. We introduce the four-argument version in Chapter 8.
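A typical use of the one-argument form is to switch the default temporarily and then put it back. The sketch below assumes Perl 5.8 or later, since it captures the output in a scalar through an in-memory filehandle; the variable names are chosen for the example.

```perl
use strict;
use warnings;

# Capture output in a scalar via an in-memory filehandle.
my $captured = '';
open(my $log, '>', \$captured) or die "Can't open in-memory file: $!";

my $previous = select($log);    # $log becomes the default for print()
print "this goes to the log\n";
select($previous);              # restore the earlier default (usually STDOUT)
close($log);

print "captured: $captured";
```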

When reading data as a byte stream with read() or sysread(), a common idiom is to pass length($buffer) as the offset into the buffer. This will make read() append the new data to the end of data that was already in the buffer. For example:

my $buffer;
while (1) {
  $bytes = read (STDIN,$buffer,50,length($buffer));
  last unless $bytes > 0;
}


1.1.3 Detecting the End of File

The end-of-file condition occurs when there's no more data to be read from a file or device. When reading from files this happens at the literal end of the file, but the EOF condition applies as well when reading from other devices. When reading from the terminal (command-line window), for example, EOF occurs when the user presses a special key: control-D on UNIX, control-Z on Windows/DOS, and command-. on Macintosh. When reading from a network-attached socket, EOF occurs when the remote machine closes its end of the connection.

The EOF condition is signaled differently depending on whether you are reading from the filehandle one line at a time or as a byte stream. For byte-stream operations with read() or sysread(), EOF is indicated when the function returns numeric 0. Other I/O errors return undef and set $! to the appropriate error message. To distinguish between an error and a normal end of file, you can test the return value with defined():

while (1) {
  my $bytes = read(STDIN,$buffer,100);
  die "read error" unless defined ($bytes);
  last unless $bytes > 0;
}

In contrast, the <> operator doesn't distinguish between EOF and abnormal conditions, and returns undef in either case. To distinguish them, you can set $! to 0 before performing a series of reads, and check whether it is nonzero afterward:

$! = 0;
while (defined(my $line = <STDIN>)) {
   $data .= $line;
}
die "Abnormal read error: $!" if $!;

When you are using <> inside the conditional of a while() loop, as shown in the most recent code fragment, you can dispense with the explicit defined() test. This makes the loop easier on the eyes:

while (my $line = <STDIN>) {
   $data .= $line;
}

This will work even if the line consists of a single 0 or an empty string, which Perl would ordinarily treat as false. Outside while() loops, be careful to use defined() to test the returned value for EOF.

Finally, there is the eof() function, which explicitly tests a filehandle for the EOF condition:

$eof = eof(FILEHANDLE)


The eof() function returns true if the next read on FILEHANDLE will return an EOF. Called without arguments or parentheses, as in eof, the function tests the last filehandle read from.

When using while(<>) to read from the command-line arguments as a single pseudofile, eof() has "magical"—or at least confusing—properties. Called with empty parentheses, as in eof(), the function returns true at the end of the very last file. Called without parentheses or arguments, as in eof, the function returns true at the end of each of the individual files on the command line. See the Perl POD documentation for examples of the circumstances in which this behavior is useful.

In practice, you do not have to use eof() except in very special circumstances, and a reliance on it is often a signal that something is amiss in the structure of your program.


1.1.4 Anarchy at the End of the Line

When performing line-oriented I/O, you have to watch for different interpretations of the end-of-line character. No two operating system designers can seem to agree on how lines should end in text files. On UNIX systems, lines end with the linefeed character (LF, octal \012 in the ASCII table); on Macintosh systems, they end with the carriage return character (CR, octal \015); and the Windows/DOS designers decided to end each line of text with two characters, a carriage return/linefeed pair (CRLF, or octal \015\012). Most line-oriented network servers also use CRLF to terminate lines.

This leads to endless confusion when moving text files between machines. Fortunately, Perl provides a way to examine and change the end-of-line character. The global variable $/ contains the current character, or sequence of characters, used to signal the end of line. By default, it is set to \012 on UNIX systems, \015 on Macintoshes, and \015\012 on Windows and DOS systems.

The line-oriented <> input function will read from the specified handle until it encounters the end-of-line character(s) contained in $/, and return the line of text with the end-of-line sequence still attached. The chomp() function looks for the end-of-line sequence at the end of a text string and removes it, respecting the current value of $/.

The string escape \n is the logical newline character, and means different things on different platforms. For example, \n is equivalent to \012 on UNIX systems, and to \015 on Macintoshes. (On Windows systems, \n is usually \012, but see the later discussion of DOS text mode.) In a similar vein, \r is the logical carriage return character, which also varies from system to system.

When communicating with a line-oriented network server that uses CRLF to terminate lines, it won't be portable to set $/ to \r\n. Use the explicit string \015\012 instead. To make this less obscure, the Socket and IO::Socket modules, which we discuss in great detail later, have an option to export globals named $CRLF and CRLF() that return the correct values.
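As an illustration, the sketch below reads CRLF-terminated lines by localizing $/ to the value exported by Socket's :crlf tag. The simulated server response and the in-memory filehandle (Perl 5.8 or later) are assumptions made so the example runs without a network connection.

```perl
use strict;
use warnings;
use Socket qw(:crlf);    # exports $CRLF, CRLF() and friends

# Simulated server response; a real program would read from a socket.
my $response = "220 ready${CRLF}250 ok${CRLF}";
open(my $fh, '<', \$response) or die "Can't open in-memory file: $!";

my @lines;
{
    local $/ = $CRLF;             # end of line is CR LF on every platform
    while (my $line = <$fh>) {
        chomp($line);             # chomp() removes the current $/ sequence
        push @lines, $line;
    }
}
close($fh);
print "$_\n" for @lines;
```

Using local confines the change to the enclosing block, so other code that reads lines still sees the platform default.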

There is an additional complication when performing line-oriented I/O on Microsoft Windows and DOS machines. For historical reasons, Windows/DOS distinguishes between filehandles in "text mode" and those in "binary mode." In binary mode, what you see is exactly what you get. When you print to a binary filehandle, the data is output exactly as you specified. Similarly, read operations return the data exactly as it was stored in the file.

In text mode, however, the standard I/O library automatically translates LF into CRLF pairs on the way out, and CRLF pairs into LF on the way in. The virtue of this is that it makes text operations on Windows and UNIX Perls look the same: from the programmer's point of view, the DOS text files end in a single \n character, just as they do in UNIX. The problem arises when reading or writing binary files, such as images or indexed databases, and the files become mysteriously corrupted on input or output. This is due to the default line-end translation. Should this happen to you, you should turn off character translation by calling binmode() on the filehandle.

binmode (FILEHANDLE [,$discipline])


The binmode() function turns on binary mode for a filehandle, disabling character translation. It should be called after the filehandle is opened, but before doing any I/O with it. The single-argument form turns on binary mode. The two-argument form, available only with Perl 5.6 or higher, allows you to turn binary mode on by providing :raw as the value of $discipline, or restore the default text mode using :crlf as the value.

binmode() only has an effect on systems like Windows and VMS, where the end-of-line sequence is more than one character. On UNIX and Macintosh systems, it has no effect.
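The sketch below shows the usual placement of binmode(): immediately after the open and before any I/O. The byte string and the temporary file from File::Temp are assumptions chosen for the example; on UNIX the calls are harmless no-ops, as noted above.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Bytes that text-mode translation could alter on Windows: LF and CR LF.
my $bytes = "\x00\x0A\x0D\x0A\xFF";

my ($out, $path) = tempfile(UNLINK => 1);
binmode($out);                      # call before any I/O on the handle
print $out $bytes;
close($out) or die "Can't close $path: $!";

open(my $in, '<', $path) or die "Can't open $path: $!";
binmode($in);
my $copy = do { local $/; <$in> };  # slurp the whole file
close($in);

print "round trip ", ($copy eq $bytes ? "preserved" : "changed"), " the data\n";
```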

Another way to avoid confusion over text and binary mode is to use the sysread() and syswrite() functions, which bypass the character translation routines in the standard I/O library.

A whole bevy of special global variables control other aspects of line-oriented I/O, such as whether to append an end-of-line character automatically to data output with the print() statement, and whether multiple data values should be separated by a delimiter. See Appendix B for a brief summary.


1.1.5 Opening and Closing Files

In addition to the three standard filehandles, Perl allows you to open any number of additional filehandles. To open a file for reading or writing, use the built-in Perl function open(). If successful, open() gives you a filehandle to use for the read and/or write operations themselves. Once you are finished with the filehandle, call close() to close it. This code fragment illustrates how to open the file message.txt for writing, write two lines of text to it, and close it:

open (FH,">message.txt") or die "Can't open file: $!";
print FH "This is the first line.\n";
print FH "And this is the second.\n";
close (FH) or die "Can't close file: $!";

We call open() with two arguments: a filehandle name and the name of the file we wish to open. The filehandle name is any valid Perl identifier consisting of any combination of uppercase and lowercase letters, digits, and the underscore character. To make them distinct, most Perl programmers choose all uppercase letters for filehandles. The " > " symbol in front of the filename tells Perl to overwrite the file's contents if it already exists, or to create the file if it doesn't. The file will then be opened for writing.

If open() succeeds, it returns a true value. Otherwise, it returns false, causing Perl to evaluate the expression to the right of the or operator. This expression simply dies with an error message, using Perl's $! global variable to retrieve the last system error message encountered.

We call print() twice to write some text to the filehandle. The first argument to print() is the filehandle, and the second and subsequent arguments are strings to write to the filehandle. Again, notice that there is no comma between the filehandle and the strings to print. Whatever is printed to a filehandle shows up in its corresponding file. If the filehandle argument to print() is omitted, it defaults to STDOUT.

After we have finished printing, we call close() to close the filehandle. close() returns a true value if the filehandle was closed uneventfully, or false if some untoward event, such as a disk filling up, occurred. We check this result code using the same type of or test we used earlier.

Let's look at open() and close() in more detail.

$success = open(FILEHANDLE,$path)

$success = open(FILEHANDLE,$mode,$path)


The open() call opens the file given in $path, associating it with a designated FILEHANDLE. There are both two- and three-argument versions of open(). In the three-argument version, which is available in Perl versions 5.6 and higher, a $mode argument specifies how the file is to be opened. $mode is a one- or two-character string chosen to be reminiscent of the I/O redirection operators in the UNIX and DOS shells. Choices are shown here.

Mode  Description
<     Open file for reading
>     Truncate file to zero length and open for writing
>>    Open file for appending, do not truncate
+>    Truncate file and then open for read/write
+<    Open file for read/write, do not truncate

We can open the file named darkstar.txt for reading and associate it with the filehandle DARKFH like this:

open(DARKFH,'<','darkstar.txt');

In the two-argument form of open(), the mode is appended directly to the filename, as in:

open(DARKFH,'<darkstar.txt');

For readability, you can put any amount of whitespace between the mode symbol and the filename; it will be ignored. If you leave out the mode symbol, the file will be opened for reading. Hence the above examples are all equivalent to this:

open(DARKFH,'darkstar.txt');

If successful, open() will return a true value. Otherwise it returns false. In the latter case, the $! global will contain a human-readable message indicating the cause of the error.

$success = close(FH);


The close() function closes a previously opened file, returning true if successful, or false otherwise. In the case of an error, the error message can again be found in $!.

When your program exits, any filehandles that are still open will be closed automatically.

The three-argument form of open() is used only rarely. However, it has the virtue of not scanning the filename for special characters the way that the two-argument form does. This lets you open files whose names contain leading or trailing whitespace, ">" characters, and other weird and arbitrary data. The filename "-" is special. When opened for reading, it tells Perl to open standard input. When opened for writing, it tells Perl to open standard output.
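A file name that ends in a space shows the difference between the two forms. This is a sketch under stated assumptions: the name "notes.txt " and the temporary directory from File::Temp are invented for the example, and it is meant for a UNIX-like file system that permits trailing spaces in names.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);
my $path = "$dir/notes.txt ";        # hypothetical name ending in a space

# Three-argument form: the name is used exactly as given.
open(my $out, '>', $path) or die "Can't create '$path': $!";
print $out "hello\n";
close($out) or die "Can't close '$path': $!";

open(my $in, '<', $path) or die "Can't open '$path': $!";
my $line = <$in>;
close($in);
print "three-argument open read: $line";

# Two-argument form: the trailing space is stripped, so Perl looks for
# "notes.txt" without the space, which does not exist here.
my $two_arg_ok = open(my $fh, $path);
print "two-argument open ", ($two_arg_ok ? "succeeded" : "failed"), "\n";
```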

If you call open() on a filehandle that is already open, it will be automatically closed and then reopened on the file that you specify. Among other things, this call can be used to reopen one of the three standard filehandles on the file of your choice, changing the default source or destination of the <>, print(), and warn() functions. We will see an example of this shortly.

As with the print() function, many programmers drop the parentheses around open() and close(). For example, this is the most common idiom for opening a file:

open DARKSTAR,"darkstar.txt" or die "Couldn't open darkstar.txt: $!"

I don't like this style much because it leads to visual ambiguity (does the or associate with the string "darkstar.txt" or with the open() function?). However, I do use this style with close(), print(), and return() because of their ubiquity.

The two-argument form of open() has a lot of magic associated with it (too much magic, some would say). The full list of magic behavior can be found in the perlfunc and perlopentut POD documentation. However, one trick is worth noting because we use it in later chapters. You can duplicate a filehandle by using it as the second argument to open() with the sequence >& or <& prepended to the beginning. >& duplicates filehandles used for writing, and <& duplicates those used for reading:

open (OUTCOPY,">&STDOUT");
open (INCOPY,"<&STDIN");

This example creates a new filehandle named OUTCOPY that is attached to the same device as STDOUT. You can now write to OUTCOPY and it will have the same effect as writing to STDOUT. This is useful when you want to replace one or more of the three standard filehandles temporarily, and restore them later. For example, this code fragment will temporarily reopen STDOUT onto a file, invoke the system date command (using the system() function, which we discuss in more detail in Chapter 2), and then restore the previous value of STDOUT. When date runs, its standard output is opened on the file, and its output appears there rather than in the command window:

#!/usr/bin/perl
# file: redirect.pl


            
          

How do I add a directory to my include path (@INC) at runtime?

Here are the suggested ways of modifying your include path, including environment variables, run-time switches, and in-code statements:

  • the PERLLIB environment variable
    	$ export PERLLIB=/path/to/my/dir
    	$ perl program.pl
  • the PERL5LIB environment variable
    	$ export PERL5LIB=/path/to/my/dir
    	$ perl program.pl
  • the perl -Idir command line flag
    	$ perl -I/path/to/my/dir program.pl
  • the use lib pragma:
    	use lib "$ENV{HOME}/myown_perllib";

The last is particularly useful because it knows about machine dependent architectures. The lib.pm pragmatic module was first included with the 5.002 release of Perl.
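To confirm what the pragma did, a script can print the front of @INC. This is a small sketch; /path/to/my/dir is the same placeholder directory used in the examples above and does not need to exist for the demonstration.

```perl
use strict;
use warnings;
use lib '/path/to/my/dir';    # placeholder directory

# use lib runs at compile time and puts the directory at the front of
# @INC, so modules found there take precedence over system-wide ones.
print "first search directory: $INC[0]\n";
print "all search directories:\n";
print "  $_\n" for @INC;
```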

Ways to Pass Variables from the Shell to a Perl Script

Method 1:

In the shell, use export to place a variable in the environment; in Perl, read it through the special hash %ENV.
For example:
--- shell box---
$ /bin/ksh
# export x=Foo
# perl -e 'print $ENV{"x"}'
-------------


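Method 2:
Like C, Perl has an array, @ARGV, that stores the command-line arguments, so each one can be handled individually. Unlike C, $ARGV[0] is the first argument, not the program name.
    $var = $ARGV[0]; # first argument
    $numargs = @ARGV; # number of arguments
  In Perl, the <> operator is in effect an implicit reference to the array @ARGV. It works as follows:
1. The first time the Perl interpreter sees <>, it opens the file named by $ARGV[0].
2. It performs shift(@ARGV), which moves the elements of @ARGV forward by one, so the array has one fewer element.
3. The <> operator reads all the lines of the file opened in step 1.
4. When the file is exhausted, the interpreter goes back to step 1 and repeats.
  Example:
    @ARGV = ("myfile1", "myfile2"); # normally filled in from the command line
    while ($line = <>) {
    print ($line);
    } 
  This prints the contents of myfile1 and myfile2.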
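
A self-contained sketch of the same idea, which can be run without preparing any files first. It creates myfile1 and myfile2 itself; the variables @lines and $left are assumptions added to make the effect of <> on @ARGV visible.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Create two small input files so the sketch is self-contained.
for my $name ("myfile1", "myfile2") {
    open(my $fh, ">", $name) or die "Can't write $name: $!";
    print $fh "line from $name\n";
    close($fh);
}

@ARGV = ("myfile1", "myfile2");   # normally set from the command line
my $numargs = @ARGV;              # array in scalar context: the count

my @lines;
while (my $line = <>) {
    # The scalar $ARGV holds the name of the file currently being read.
    push @lines, "$ARGV: $line";
}
print @lines;

# <> shifted each name off @ARGV as it opened the file.
my $left = @ARGV;
print "arguments before: $numargs, after: $left\n";

unlink "myfile1", "myfile2";      # clean up the sample files
```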

Special variables in Perl

1. $&, $` and $' are used in pattern matching


$&   holds the text that was matched
$`   holds everything before the match
$'   holds everything after the match

For example:
#!/usr/bin/perl -w
if("Hello good  there,neigbor hello" =~ /\S(\w+),/)
{
        print "That actually matched '$&'.\n";
        print $`."\n";
        print $'."\n";
}

The output is:

That actually matched 'there,'.
Hello good  
neigbor hello

----------------------------------------------


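Another commonly used variable: @_
@_ is private to the subroutine◆; if there is a global @_, it is saved before the subroutine is called, and its earlier value is restored when the subroutine returns◆. This means that when you pass arguments to a subroutine you need not worry about affecting the value of @_ in other subroutines of the program. Nested subroutine calls behave the same way. Even when a subroutine calls itself recursively, each call gets a new @_, so every call sees its own argument list.

◆ Unless the subroutine is called with a leading & and no parentheses (or arguments) after it, in which case @_ is taken from the caller's context. This is usually not a good idea, but it is occasionally useful.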
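
A short sketch of both behaviours. All subroutine and variable names here (count_args, countdown, with_parens, with_ampersand) are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

sub count_args { return scalar @_; }

sub countdown {
    my ($n) = @_;
    return "" if $n == 0;
    my $rest = countdown($n - 1);   # the inner call gets its own @_
    return "$_[0] $rest";           # our @_ is intact: $_[0] is still $n
}

sub with_parens    { return count_args(); }   # new, empty @_
sub with_ampersand { return &count_args; }    # shares the caller's @_

my $sequence  = countdown(3);
my $fresh     = with_parens(1, 2, 3);
my $inherited = with_ampersand(1, 2, 3);

print "countdown: $sequence\n";
print "with parentheses: $fresh argument(s)\n";
print "with & and no parentheses: $inherited argument(s)\n";
```

The recursive call does not disturb the outer @_, while the &count_args form sees the three arguments that were passed to its caller.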

2. Perl - $_ and @_

Perl's a great language for special variables - variables that are set up without the programmer having to intervene and providing information ranging from the number of lines read from the current input file ($.) through the current process ID ($$) and the operating system ($^O). Other special variables affect how certain operations are performed ($| controlling output buffering / flushing, for example), or are fundamental in the operation of certain facilities - no more so than $_ and @_.

Let's clear up a misconception. $_ and @_ are different variables. In Perl, you can have a list and a scalar of the same name, and they refer to unrelated pieces of memory.

$_ is known as the "default input and pattern matching space". In other words, if you read in from a file handle at the top of a while loop, or run a foreach loop and don't name a loop variable, $_ is set up for you. Then any regular expression matches, chops (and lcs and many more) without a parameter, and even prints assume you want to work on $_. Thus:
while ($line = <FH>) {
  if ($line =~ /Perl/) {
    print FHO $line;
    }
  print uc $line;
  }

Shortens to:
while (<FH>) {
  /Perl/ and
    print FHO ;
  print uc;
  }
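
The same equivalence can be tried without any file handles. This is only a sketch; the list @words and the arrays @explicit and @implicit are made-up names for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @words = ("Perl rocks", "python", "I like Perl");   # sample data

# Explicit loop variable.
my @explicit;
foreach my $word (@words) {
    next unless $word =~ /Perl/;
    push @explicit, uc $word;
}

# No loop variable: foreach sets $_, and both the match and uc default to it.
my @implicit;
foreach (@words) {
    next unless /Perl/;
    push @implicit, uc;
}

print "$_\n" for @implicit;
```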

@_ is the list of incoming parameters to a sub. So if you write a sub, you refer to the first parameter in it as $_[0], the second parameter as $_[1] and so on. And you can refer to $#_ as the index number of the last parameter:
sub demo {
  print "Called with ",$#_+1," params\n";
  print "First param was $_[0]\n";
  }

Note that the English module adds in the ability to refer to the special variables by other longer, but easier to remember, names such as @ARG for @_ and $PID for $$. But use English; can have a detrimental performance effect if you're matching regular expressions against long incoming strings. 
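
A minimal sketch of the longer names. It loads the module with the -no_match_vars import tag, which on older perls avoids the regular expression slowdown mentioned above; the sub first_param and the variables $first and $same_pid are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use English qw(-no_match_vars);

# @ARG is the long name for @_, so $ARG[0] is the first parameter.
sub first_param { return $ARG[0]; }

my $first    = first_param("camel", "llama");
my $same_pid = ($PID == $$) ? 1 : 0;    # $PID is the long name for $$

print "first parameter: $first\n";
print "process $PID running on $OSNAME\n";   # $OSNAME is $^O
```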

Perl Notes(I)

Part I    Programming Perl

1 Perl Data Types

1.1 Funny Characters

Type        Character  Example    Is a name for
Scalar      $          $cents     An individual value (number or string)
Array       @          @large     A list of values, keyed by number
Hash        %          %interest  A group of values, keyed by string
Subroutine  &          &how       A callable chunk of Perl code
Typeglob    *          *struck    Everything named struck

1.2 Singularities

Strings and numbers are singular pieces of data, while lists of strings or numbers are plural. Scalar variables can be assigned any form of scalar value: integers, floating-point numbers, strings, and even esoteric things like references to other variables, or to objects.

As in the Unix shell, you can use different quoting mechanisms to make different kinds of values. Double quotation marks (double quotes) do variable interpolation and backslash interpolation (such as turning \n into a newline) while single quotes suppress interpolation. And backquotes (the ones leaning to the left``) will execute an external program and return the output of the program, so you can capture it as a single string containing all the lines of output.

$answer = 42; # an integer
$pi = 3.14159265; # a "real" number
$avocados = 6.02e23; # scientific notation
$pet = "Camel"; # string
$sign = "I love my $pet"; # string with interpolation
$cost = 'It costs $100'; # string without interpolation
$thence = $whence; # another variable's value
$salsa = $moles * $avocados; # a gastrochemical expression
$exit = system("vi $file"); # numeric status of a command
$cwd = `pwd`; # string output from a command

And while we haven't covered fancy values yet, we should point out that scalars may also hold references to other data structures, including subroutines and objects.

$ary = \@myarray; # reference to a named array
$hsh = \%myhash; # reference to a named hash
$sub = \&mysub; # reference to a named subroutine
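
To show what such references are good for, here is a short sketch that takes references with the backslash operator and then follows them with the arrow operator. The sample data and the variables $second, $value and $result are assumptions added for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @myarray = (10, 20, 30);
my %myhash  = (camel => "dromedary");
sub mysub { return "called with @_"; }

my $ary = \@myarray;   # reference to a named array
my $hsh = \%myhash;    # reference to a named hash
my $sub = \&mysub;     # reference to a named subroutine

my $second = $ary->[1];        # element access through the reference
my $value  = $hsh->{camel};    # hash lookup through the reference
my $result = $sub->("hi");     # call through the reference

push @$ary, 40;                # modifies @myarray itself, not a copy

print "$second $value '$result' ", scalar(@myarray), "\n";
```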