man MDN::UTF8 () - Perl extension for libmdn utf8 module.

NAME

MDN::UTF8 - Perl extension for libmdn utf8 module.

SYNOPSIS

  use MDN::UTF8;
  $length = MDN::UTF8->mblen($utf8_string);
  @ucs4_characters = MDN::UTF8->unpack($utf8_string);
  $utf8_string = MDN::UTF8->pack(@ucs4_characters);
  die if (!MDN::UTF8->isvalid($utf8_string));

DESCRIPTION

MDN::UTF8 provides a Perl interface to the UTF-8 utility module of the MDN library (a C library for handling multilingual domain names) in the mDNkit.

CLASS METHODS

Although this module does not provide an object interface, all of its functions should be called as class methods, for consistency with the other modules in MDN::.

        MDN::UTF8->mblen($string);      # OK
        MDN::UTF8::mblen($string);      # NG
mblen($utf8_string)
Returns the length (in bytes) of the first character of $utf8_string. If the character is not a valid UTF-8 character, this method returns 0.
getwc($utf8_string)
Inspects the first character of $utf8_string and returns the result as a list with two elements. The first element is the integer code value of the character in UCS-4, and the second is the length (in bytes) of the character in UTF-8.
        ($wc, $length) = MDN::UTF8->getwc($string);
The value of the second element is the same as the one returned by mblen(). If the character is not a valid UTF-8 character, this method returns an empty list. Note that it also returns an empty list for an empty UTF-8 string.
unpack($utf8_string)
Unpacks $utf8_string into a list of UCS-4 characters and returns their integer code values as a list. An empty list is returned if $utf8_string contains an invalid character or if $utf8_string is empty.
pack(@ucs4_characters)
Packs a list of UCS-4 characters into a UTF-8 string and returns the string. This is the reverse of the unpack method above. If @ucs4_characters contains an invalid UCS-4 character, it returns undef.
isvalid($utf8_string)
Checks whether $utf8_string is a valid UTF-8 encoded string. Returns 1 if it is valid, 0 otherwise.
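
The return conventions described above (a two-element list from getwc, an empty list for invalid or empty input) can be illustrated with a small pure-Perl stand-in. This is only a sketch: sketch_getwc and sketch_unpack are hypothetical names written for this example, they are not part of MDN::UTF8 or libmdn, and they assume the 1- to 4-byte UTF-8 forms without checking for overlong encodings or surrogates.

```perl
use strict;
use warnings;

# Hypothetical stand-in mirroring the documented getwc() contract:
# returns (code value, byte length) for the first character of a
# UTF-8 byte string, or an empty list if it is empty or malformed.
sub sketch_getwc {
    my ($bytes) = @_;
    return () if !defined $bytes || length($bytes) == 0;

    my $lead = ord(substr($bytes, 0, 1));
    my ($len, $wc);
    if    ($lead < 0x80)           { ($len, $wc) = (1, $lead) }
    elsif (($lead & 0xE0) == 0xC0) { ($len, $wc) = (2, $lead & 0x1F) }
    elsif (($lead & 0xF0) == 0xE0) { ($len, $wc) = (3, $lead & 0x0F) }
    elsif (($lead & 0xF8) == 0xF0) { ($len, $wc) = (4, $lead & 0x07) }
    else                           { return () }   # invalid lead byte

    return () if length($bytes) < $len;             # truncated sequence
    for my $i (1 .. $len - 1) {
        my $cont = ord(substr($bytes, $i, 1));
        return () if ($cont & 0xC0) != 0x80;        # not a continuation byte
        $wc = ($wc << 6) | ($cont & 0x3F);
    }
    return ($wc, $len);
}

# Hypothetical stand-in mirroring the documented unpack() contract:
# walks the string one character at a time and returns an empty
# list as soon as an invalid character is found.
sub sketch_unpack {
    my ($bytes) = @_;
    my @codes;
    while (length $bytes) {
        my ($wc, $len) = sketch_getwc($bytes);
        return () unless defined $len;
        push @codes, $wc;
        $bytes = substr($bytes, $len);
    }
    return @codes;
}

my @codes = sketch_unpack("A\xE3\x81\x8B");   # "A" followed by U+304B
printf "U+%04X\n", $_ for @codes;              # prints U+0041, then U+304B
```

Because an empty list is returned both for invalid input and for an empty string, a caller that needs to tell the two apart has to test for the empty string separately.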

ISSUE OF HANDLING UNICODE CHARACTERS

Beginning with version 5.6, Perl supports Unicode characters, but the implementation is incomplete and highly experimental.

Perl provides `character' and `byte' semantics. Under character semantics, a Unicode character is treated as a single character even if it occupies two or more bytes. Under byte semantics, a Unicode character is treated as a sequence of bytes.

Some Perl operators change their behavior according to the semantics in effect, and Perl decides whether an operator uses character or byte semantics based on whether its input is byte data or character data. For example, a string literal that contains \x{304B} (Unicode character U+304B) is recognized as character data.
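
The difference between character data and byte data can be seen in a short sketch. It assumes a Perl recent enough to provide the built-in utf8::encode function (5.8.1 or later, so newer than the 5.6 release discussed here); the variable names are illustrative only.

```perl
use strict;
use warnings;

my $chars = "\x{304B}";        # character data: one Unicode character
my $bytes = "\xE3\x81\x8B";    # byte data: the UTF-8 encoding of U+304B

my $n_chars = length($chars);  # counted as a single character
my $n_bytes = length($bytes);  # counted as three separate bytes

# utf8::encode converts character data to UTF-8 bytes in place.
my $encoded = $chars;
utf8::encode($encoded);

print "chars=$n_chars bytes=$n_bytes same=",
      ($encoded eq $bytes ? 1 : 0), "\n";   # prints chars=1 bytes=3 same=1
```

The same operator, length, gives different answers for the same logical character depending on which kind of data it is handed.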

This also affects the MDN modules, which deal with UTF-8 strings. Unless you have a particular reason to use character semantics, or if you are not familiar with them, we recommend using the bytes pragma:

  use bytes;

That forces byte semantics for the rest of the lexical scope in which it appears, which is the whole file if you put it at the top. See perlunicode and perlbytes for more details about this issue.
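
As a minimal sketch of what the pragma changes, the following compares length on the same string with and without use bytes in effect. The pragma is lexically scoped, so here it is confined to a bare block; the variable names are illustrative only.

```perl
use strict;
use warnings;

my $ka = "\x{304B}";            # created under character semantics

my $char_len = length($ka);     # character semantics: one character

my $byte_len;
{
    use bytes;                  # byte semantics inside this block only
    $byte_len = length($ka);    # counts the bytes of the UTF-8 form
}

print "characters: $char_len, bytes: $byte_len\n";   # prints characters: 1, bytes: 3
```

Placing use bytes at the top of a file, as recommended above, applies the same effect to the rest of that file.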

SEE ALSO

MDN library specification, perlunicode, perlbytes