NAME

Jcode - Japanese Charset Handler

SYNOPSIS

 use Jcode;
 # 
 # traditional
 Jcode::convert(\$str, $ocode, $icode, "z");
 # or OOP!
 print Jcode->new($str)->h2z->tr($from, $to)->utf8;

DESCRIPTION

A Japanese version of this document is available as Jcode::Nihongo.

Jcode.pm supports both an object-oriented and a traditional (procedural) approach. With the object approach, you can write:

  $iso_2022_jp = Jcode->new($str)->h2z->jis;

Which is more elegant than:

  $iso_2022_jp = $str;
  &jcode::convert(\$iso_2022_jp, 'jis', &jcode::getcode(\$str), "z");

For those unfamiliar with objects, Jcode.pm still supports getcode() and convert().

If the perl version is 5.8.1 or later, Jcode acts as a wrapper to Encode, the standard charset handler module for Perl 5.8 or later.

Methods

Methods mentioned here all return a Jcode object unless otherwise noted.

Constructors

$j = Jcode->new($str [, $icode])
Creates Jcode object $j from $str. The input code is detected automatically unless you explicitly set $icode. For available charsets, see getcode below. For perl 5.8.1 or better, $icode can be any encoding name that Encode understands.

  $j = Jcode->new($european, 'iso-latin1');
When the object is stringified, it returns the EUC-converted string, so you can "print $j" instead of "print $j->euc".
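As a minimal sketch of the constructor and the stringification rule described above (assuming Jcode is installed; the sample bytes are an illustrative input, not part of the original document):

```perl
use strict;
use warnings;
use Jcode;

# Illustrative input: three kanji ("nihongo") as raw UTF-8 bytes.
my $utf8_bytes = "\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e";

# Name the input encoding explicitly instead of relying on auto-detection.
my $j = Jcode->new($utf8_bytes, 'utf8');

my $euc         = $j->euc;   # explicit EUC-JP conversion
my $stringified = "$j";      # stringification, documented to yield the EUC string

# Print the EUC-JP bytes as hex so the output is terminal-safe.
printf "%s\n", unpack("H*", $euc);
print $stringified eq $euc
    ? "stringified form matches ->euc\n"
    : "stringified form differs from ->euc\n";
```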
Passing Reference
Instead of a scalar value, you can pass a reference, as in Jcode->new(\$str). This saves a little time, in exchange for the value of $str being converted. (In a way, $str is now tied to the Jcode object.)
$j->set($str [, $icode])
Sets $j's internal string to $str. Handy when you use a Jcode object repeatedly (saves the time and memory needed to create a new object).
 # converts mailbox to SJIS format
 my $jconv = new Jcode;
 $/ = 00;
 while(<>){
     print $jconv->set(\$_)->mime_decode->sjis;
 }
$j->append($str [, $icode])
Appends $str to $j's internal string.
$j = jcode($str [, $icode])
Shortcut for Jcode->new(), so you can write things like jcode($str)->sjis.

Encoded Strings

In general, you can retrieve the encoded string as $j->encoded, where "encoded" is the name of the encoding.

$sjis = jcode($str)->sjis
$euc = $j->euc
$jis = $j->jis
$sjis = $j->sjis
$ucs2 = $j->ucs2
$utf8 = $j->utf8
What you code is what you get :)
$iso_2022_jp = $j->iso_2022_jp
Same as $j->h2z->jis. Hankaku kana are forcibly converted to Zenkaku.
For perl 5.8.1 and better, you can also use any encoding names and aliases that Encode supports. For example:
  $european = $j->iso_latin1; # replace '-' with '_' for names.
FYI: Encode::Encoder uses a similar trick.
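The accessor pattern can be sketched as follows. This assumes Jcode is installed; the input bytes are an illustrative EUC-JP sample, and the output is shown as hex so that it does not depend on the terminal encoding.

```perl
use strict;
use warnings;
use Jcode;

# Illustrative input: three kanji ("nihongo") in EUC-JP bytes.
my $euc_bytes = "\xc6\xfc\xcb\xdc\xb8\xec";
my $j = jcode($euc_bytes, 'euc');

my $sjis = $j->sjis;   # Shift_JIS bytes
my $utf8 = $j->utf8;   # UTF-8 bytes
my $jis  = $j->jis;    # JIS (escape-sequence based, 7-bit)

# Dump each encoded form as hex.
printf "%-5s %s\n", $_->[0], unpack("H*", $_->[1])
    for ([sjis => $sjis], [utf8 => $utf8], [jis => $jis]);
```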
$j->fallback($fallback)
For perl 5.8.1 or better, Jcode stores the internal string in UTF-8. Any character that does not map to ->encoding is replaced with a '?', which is the Encode standard.
  my $unistr = "\x{262f}"; # YIN YANG
  my $j = jcode($unistr);  # $j->euc is '?'
You can change this behavior by specifying a fallback, as in Encode. The values are the same as Encode's. Jcode::FB_PERLQQ, Jcode::FB_XMLCREF, and Jcode::FB_HTMLCREF are aliased to those of Encode for convenience.
  print $j->fallback(Jcode::FB_PERLQQ)->euc;   # '\x{262f}'
  print $j->fallback(Jcode::FB_XMLCREF)->euc;  # '&#x262f;'
  print $j->fallback(Jcode::FB_HTMLCREF)->euc; # '&#9775;'
The global variable $Jcode::FALLBACK stores the default fallback, so you can override it by assigning a value.
  $Jcode::FALLBACK = Jcode::FB_PERLQQ; # set default fallback scheme
$j->jfold([$width, $newline_str])
Folds lines in the Jcode string every $width columns (default: 72), where $width is counted in halfwidth characters; fullwidth characters count as two. Lines are folded with the newline string specified by $newline_str (default: \n). Rudimentary kinsoku support is now available for Perl 5.8.1 and better.
$length = $j->jlength()
Returns the character length properly, rather than the byte length.
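A small sketch of the difference between byte length and character length, followed by a fold at a narrow width. It assumes Jcode is installed and that the folding and length methods are named jfold and jlength; the sample bytes and the width of 4 are illustrative choices.

```perl
use strict;
use warnings;
use Jcode;

# Illustrative input: three fullwidth characters in EUC-JP (six bytes).
my $euc_bytes = "\xc6\xfc\xcb\xdc\xb8\xec";
my $j = jcode($euc_bytes, 'euc');

my $bytes = length $euc_bytes;   # byte length of the raw input
my $chars = $j->jlength;         # character length as reported by Jcode

print "bytes=$bytes chars=$chars\n";

# Fold at a width of 4 halfwidth columns; each fullwidth character counts as two.
my @lines = $j->jfold(4);
print scalar(@lines), " line(s) after folding\n";
```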

Methods that use MIME::Base64

To use the methods below, you need MIME::Base64. To install it, simply run:

   perl -MCPAN -e 'CPAN::Shell->install("MIME::Base64")'

If your perl is 5.6 or better, there is no need since MIME::Base64 is bundled.

$mime_header = $j->mime_encode([$lf, $bpl])
Converts $str to a MIME-Header as documented in RFC1522. When $lf is specified, it uses $lf to fold lines (default: \n). When $bpl is specified, it uses $bpl as the number of bytes per line (default: 76; this number must be smaller than 76). For Perl 5.8.1 or better, you can also encode a MIME Header as:

  $mime_header = $j->MIME_Header;
In that case the resulting $mime_header is MIME-B-encoded UTF-8, whereas $j->mime_encode() returns MIME-B-encoded ISO-2022-JP. Most modern MUAs support both.
$j->mime_decode;
Decodes the MIME-Header in the Jcode object. For perl 5.8.1 or better, you can also do the same with:
  Jcode->new($str, 'MIME-Header')
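The encode and decode directions can be sketched together as a round trip. This assumes Jcode and MIME::Base64 are installed; the subject text is an illustrative UTF-8 sample, and the exact encoded header is printed rather than assumed.

```perl
use strict;
use warnings;
use Jcode;

# Illustrative header text: three kanji as raw UTF-8 bytes.
my $utf8_bytes = "\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e";

# Encode as a MIME-B header; the text above documents ISO-2022-JP as the charset.
my $header = jcode($utf8_bytes, 'utf8')->mime_encode;
print "$header\n";

# Decode the header again and compare with the original bytes.
my $decoded = Jcode->new($header)->mime_decode->utf8;
print $decoded eq $utf8_bytes ? "round trip ok\n" : "round trip differs\n";
```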

Hankaku vs. Zenkaku

$j->h2z([$keep_dakuten])
Converts X201 kana (Hankaku) to X208 kana (Zenkaku). When $keep_dakuten is set, it leaves dakuten as is (that is, ka + dakuten is left as is instead of being converted to ga). You can retrieve the number of matches via $j->nmatch.
$j->z2h
Converts X208 kana (Zenkaku) to X201 kana (Hankaku). You can retrieve the number of matches via $j->nmatch.
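A minimal sketch of both directions, assuming Jcode is installed. The input is the "ka + dakuten" case mentioned above, written as illustrative EUC-JP bytes; results are printed as hex.

```perl
use strict;
use warnings;
use Jcode;

# Halfwidth KA followed by halfwidth dakuten, in EUC-JP (illustrative input).
my $hankaku = "\x8e\xb6\x8e\xde";

# Hankaku to Zenkaku: by default the pair is merged into a single fullwidth GA.
my $zenkaku = Jcode->new($hankaku, 'euc')->h2z->euc;

# Zenkaku back to Hankaku.
my $back = Jcode->new($zenkaku, 'euc')->z2h->euc;

printf "h2z: %s\n", unpack("H*", $zenkaku);
printf "z2h: %s\n", unpack("H*", $back);
```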

Regexp emulators

To use ->m() and ->s(), you need perl 5.8.1 or better.
$j->tr($from, $to, [$opt])
Applies tr/$from/$to/ on the Jcode object, where $from and $to are EUC-JP strings. On perl 5.8.1 or better, $from and $to can also be flagged UTF-8 strings. If $opt is set, tr/$from/$to/$opt is applied. $opt must be 'c', 'd' or a combination thereof. You can retrieve the number of matches via $j->nmatch.
The following methods are available only for perl 5.8.1 or better.
$j->s($pattern, $replace, [$opt])
Applies s/$pattern/$replace/$opt. $pattern and $replace must be in EUC-JP or flagged UTF-8. $opt takes the same values as the regexp options; see perlre for those. Like $j->tr(), $j->s() returns the object itself, so you can chain the operations as follows:

  $j->tr("a-z", "A-Z")->s("foo", "bar");
$j->m($pattern, [$opt])
Applies m/$pattern/$opt. Note that this method DOES NOT RETURN AN OBJECT, so you can't chain it like $j->s().
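The difference between the chainable methods and ->m() can be sketched as below. This assumes Jcode on perl 5.8.1 or better; the ASCII sample string is only an illustration, and ->m() is used in boolean context because it returns a match result rather than an object.

```perl
use strict;
use warnings;
use Jcode;

# Illustrative ASCII-only input, so the result is easy to read.
my $j = jcode("foo fighters");

# s() returns the object, so it could be chained further.
$j->s("foo", "bar");

# m() returns the match result, not the object: use it as a test.
my $matched = $j->m("fighters") ? 1 : 0;
my $missing = $j->m("foo")      ? 1 : 0;

# tr() also returns the object, so ->euc can follow directly.
my $upper = $j->tr("a-z", "A-Z")->euc;

print "matched=$matched missing=$missing result=$upper\n";
```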

Instance Variables

If you need to access instance variables of a Jcode object, use the access methods below instead of accessing them directly (that's what OOP is all about).

FYI, Jcode uses a ref to an array instead of a ref to a hash (the common way) to optimize speed. (Actually you don't have to know this as long as you use the access methods; once again, that's OOP.)

$j->r_str
Reference to the EUC-coded String.
$j->icode
Input character code in the most recent operation.
$j->nmatch
Number of matches (used in $j->tr, etc.)

Subroutines

($code, [$nmatch]) = getcode($str)
Returns the character code of $str. Return codes are as follows:
 ascii   Ascii (Contains no Japanese Code)
 binary  Binary (Not Text File)
 euc     EUC-JP
 sjis    SHIFT_JIS
 jis     JIS (ISO-2022-JP)
 ucs2    UCS2 (Raw Unicode)
 utf8    UTF8
When array context is used instead of scalar, it also returns how many character codes are found. As mentioned above, $str can be \$str instead.
jcode.pl Users: This function is 100% upward-compatible with jcode::getcode(). Well, almost:
 * When its return value is an array, the order is the opposite;
   jcode::getcode() returns $nmatch first.
 * jcode::getcode() returns 'undef' when the number of EUC characters
   is equal to that of SJIS.  Jcode::getcode() returns EUC.  For
   Jcode.pm there are no in-betweens.
Jcode::convert($str, [$ocode, $icode, $opt])
Converts $str to the char code specified by $ocode. When $icode is also specified, it assumes $icode for the input string instead of the one detected by getcode(). As mentioned above, $str can be \$str instead.
jcode.pl Users: This function is 100% upward-compatible with jcode::convert()!
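The two subroutines can be sketched together as follows, assuming Jcode is installed. The sample bytes are an illustrative EUC-JP string; the conversion is done in place through a reference, as in the SYNOPSIS.

```perl
use strict;
use warnings;
use Jcode;

# Illustrative input: three kanji ("nihongo") in EUC-JP bytes.
my $euc_bytes = "\xc6\xfc\xcb\xdc\xb8\xec";

# Scalar context: only the code name is returned.
my $code = getcode($euc_bytes);

# List context: the code name comes first, followed by the match count.
my ($code_again, $nmatch) = getcode($euc_bytes);

# Traditional interface: convert a copy in place, passing it by reference.
my $str = $euc_bytes;
Jcode::convert(\$str, 'sjis', 'euc');

print "detected: $code\n";
printf "sjis: %s\n", unpack("H*", $str);
```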

BUGS

For perl 5.8.1 or later, Jcode acts as a wrapper to Encode, meaning Jcode is subject to bugs therein.

ACKNOWLEDGEMENTS

This package owes a lot in motivation, design, and code, to the jcode.pl for Perl4 by Kazumasa Utashiro <utashiro@iij.ad.jp>.

Hiroki Ohzaki <ohzaki@iod.ricoh.co.jp> has helped me polish regexp from the very first stage of development.

JEncode by makamaka@donzoko.net has inspired me to integrate Encode to Jcode. He has also contributed Japanese POD.

And folks at Jcode Mailing list <jcode5@ring.gr.jp>. Without them, I couldn't have coded this far.

SEE ALSO

Encode

Jcode::Nihongo

<http://www.iana.org/assignments/character-sets>

COPYRIGHT

Copyright 1999-2005 Dan Kogai <dankogai@dan.co.jp>

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.