man WWW (Fonctions bibliothèques) - World Wide Web Package

NAME

WWW - World Wide Web Package

SYNOPSIS

extract_description( FILE )
extract_meta( FILE, NAME )
hyperlink( LIST )

DESCRIPTION

This package provides a utility functions for the World Wide Web to extract descriptions of or meta information from files, and hyperlink text.

SUBROUTINES

The following Perl subroutines are defined and available:

CWextract_description( FILE )
Extracts a description from an HTML or plain text file given by the FILE name; FILE should be an absolute path. The first CW$description::chars (default: 2048) characters are read. If the file ends in one of the extensions CWhtm, CWhtml, or CWshtml, it is presumed to be an HTML file; if the file ends in CWtxt, it is presumed to be a plain text file. Other extensions are not recognized and no description is returned for them.
For HTML files, first, if a CW<META NAME="description" CONTENT="...CW"> or a CW<META NAME="DC.description" CONTENT="...CW"> (Dublin Core) element is found, then the words specified as the value of the CWCONTENT attribute is returned as the description.
Otherwise, all HTML comments, text between CW<SCRIPT>, CW<STYLE>, and CW<TITLE> tags, and all other HTML tags are stripped. If CW<AREA ... CWALT="...CW"> or CW<IMG ... CWALT="..."> elements are found, then the words specified as the value of the CWALT attributes are extracted.
Finally, for either HTML or plain text files, at most CW$description::words (default: 50) are returned.
CWextract_meta( FILE, NAME )
Extracts the value of the CWCONTENT attribute from a CWMETA element having the given CWNAME attribute from an HTML file given by the FILE name; FILE should be an absolute path. The file must end in one of the extensions CWhtm, CWhtml, or CWshtml to be considered an HTML file. The first CW$description::chars (default: 2048) characters are read. The characters are cached between consecutive calls using the same filename.
CWhyperlink( LIST )
Adds hyperlinks to strings: that is strings that contain substrings that are valid URLs (according to RFC 1630) have the appropriate HTML tags ``wrapped'' around them so that they will be selectable when displayed in a browser. The CWftp, CWgopher, CWhttp, CWhttps, CWmailto, CWnews, CWtelnet, and CWwais URLs are recognized. Example: Read all about it at http://www.usatoday.com/

becomes:

Read all about it at <A HREF="http://www.usatoday.com/">http://www.usatoday.com/</A>

SEE ALSO

perl(1)

Tim Berners-Lee. ``Universal Resource Identifiers in WWW,'' Request for Comments 1630, Network Working Group of the Internet Engineering Task Force, June 1994.

Tim Berners-Lee, Larry Masinter, and Mark McCahill. ``Uniform Resource Locators (URL),'' Request for Comments 1738, Network Working Group, 1994.

Dave Raggett, Arnaud Le Hors, and Ian Jacobs. ``Notes on helping search engines index your Web site,'' HTML 4.0 Specification, Appendix B: Performance, Implementation, and Design Notes, World Wide Web Consortium, April 1998.

--. ``Objects, Images, and Applets: How to specify alternate text,'' HTML 4.0 Specification, 13.8, World Wide Web Consortium, April 1998.

Dublin Core Directorate. ``The Dublin Core: A Simple Content Description Model for Electronic Resources.''

Larry Wall, et al. Programming Perl, 3rd ed., O'Reilly & Associates, Inc., Sebastopol, CA, 2000.

AUTHOR

Paul J. Lucas <pauljlucas@mac.com>