man estindex (Commands) - manage an inverted index

NAME

estindex - manage an inverted index

SYNOPSIS

estindex register [-list file] [-force] [-relax] [-wmax num] [-tsuf sufs] [-hsuf sufs] [-msuf sufs] [-mn] [-xsuf sufs type cmd] [-xtype type cmd] [-xt] [-xm] [-iz] [-ipre pres] [-isiz size] [-enc code] [-pt code] [-ft code] [-tattr attrs] [-rich] [-plute] name [dir]

estindex relate [-list file] [-force] [-relax] [-ni] name [prefix]

estindex purge [-list file] [-force] [-relax] name [prefix]

estindex optimize [-relax] [-small] name

estindex inform name

estindex merge [-relax] [-rich] [-plute] name elems...

estindex pree [-h] [-m] [-x type cmd] [-xt] [-xm] [-enc code] [-pt code] [-ft code] [-tattr attrs] [-wl] [file]

estindex version

DESCRIPTION

This manual page briefly documents the estindex command.

This command consists of sub commands for different purposes. The name of the sub command is given as the first argument after `estindex'.

If `*' is specified as a suffix rule, any file matches it. The name of an encoding should be specified as a formal name registered with IANA.

When an outer command is called as a filter, the first argument specifies the name of the input file, the second argument specifies the name of the output file, and the environment variable `ESTORIG' specifies the name of the original file. If the name of the outer command begins with `@', the dynamic linking library whose name is the rest of the string after the `@' is linked, and the function named `estfilter' is called.

estautoreg is a program that constructs inverted indexes and merges them.
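The filter calling convention described above can be sketched as a small shell function. This is only an illustration: the name `txt_to_html` and the choice to wrap the text in a `<pre>` element are assumptions, not part of Estraier.

```shell
#!/bin/sh
# Hypothetical outer-command filter following the documented convention:
#   $1 = input file, $2 = output file, $ESTORIG = name of the original file.
# The function name and the HTML layout are illustrative assumptions.
txt_to_html() {
  in=$1
  out=$2
  # Fall back to the input file name if ESTORIG is not set.
  title=${ESTORIG:-$in}
  {
    printf '<html><head><title>%s</title></head><body><pre>\n' "$title"
    # Escape the characters that are special in HTML.
    sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g' "$in"
    printf '</pre></body></html>\n'
  } > "$out"
}
```

Saved as a standalone script that ends with `txt_to_html "$1" "$2"`, something like this could be given as the cmd argument of the options that take an outer command.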

SUBCOMMAND AND OPTIONS

Each sub command returns 0 if it finishes successfully, or 1 if any error has occurred. If the environment variable `ESTDBGFD' is set, debug information is output to the specified file descriptor. To abort a running command, send it one of the signals SIGINT (Control-C), SIGQUIT (Control-\), or SIGTERM. The inverted index is then closed normally and the command finishes. Any other means of forced termination may destroy the inverted index.
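The exit status and the `ESTDBGFD' variable can be combined in a small wrapper. The following is a sketch under assumptions: `ESTDBGFD' is taken to hold a file descriptor number, and the names `run_with_debug` and `ESTINDEX` are hypothetical helpers introduced here so the wrapper can be tried with a stub command.

```shell
#!/bin/sh
# ESTINDEX is a hypothetical override; it defaults to the real command.
ESTINDEX=${ESTINDEX:-estindex}

# Run one sub command with debug output sent to file descriptor 3,
# which is redirected to the log file named by the first argument.
# Assumption: ESTDBGFD holds the descriptor number.
run_with_debug() {
  log=$1
  shift
  ESTDBGFD=3 "$ESTINDEX" "$@" 3>"$log"
  status=$?
  if [ "$status" -ne 0 ]; then
    echo "estindex $1 failed with status $status; see $log" >&2
  fi
  return "$status"
}
```

For example, `run_with_debug debug.log inform casket` would run the inform sub command and leave any debug output in debug.log.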

A summary of options is included below. For a complete description, see /usr/share/doc/estraier/spex.html.

estindex register

The sub command register is used in order to construct or update an inverted index.

name specifies the name of an inverted index.

dir specifies a directory which contains files to register. If it is omitted, the current directory is used. The specified directory is scanned recursively and symbolic links are followed.

If the option -list is specified, the file specified by file is read, and the files whose paths are written on each line of it are registered. If file is `-', the standard input is read. This option is useful in combination with the UNIX `find' command. If a line contains a tab character, the string after the tab is treated as the value of the attribute `realuri' of the registered document.
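A list in the format described above can be produced with `find'. The following sketch is illustrative: the helper name `make_list` and the base URL are assumptions, and paths containing newlines are not handled.

```shell
#!/bin/sh
# Print one "path<TAB>uri" line for every HTML file under a directory.
# base_url is a hypothetical public URL that corresponds to the directory.
make_list() {
  dir=$1
  base_url=$2
  find "$dir" -type f -name '*.html' | sort | while IFS= read -r path; do
    # Strip the directory prefix to get the path relative to the site root.
    rel=${path#"$dir"/}
    printf '%s\t%s\n' "$path" "$base_url/$rel"
  done
}

# Possible use, with `-' making estindex read the list from standard input:
#   make_list /home/mikio/public_html http://example.com | estindex register -list - casket
```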

If the option -force is specified, files already registered in the inverted index and not modified are also registered again.

If the option -relax is specified, the process sleeps periodically to reduce the load on the system.

If the option -wmax is specified, only the number of words specified by num is recorded as information for generating summaries. By default, all words are recorded. This option is useful for reducing the size of the inverted index and improving search response.

The option -tsuf specifies suffixes of files to be handled as plain text. sufs specifies a list of suffixes separated with a comma. By default, it is the same as `-tsuf .txt,.asc'.

The option -hsuf specifies suffixes of files to be handled as HTML. sufs specifies a list of suffixes separated with a comma. By default, it is the same as `-hsuf .html,.htm'.

The option -msuf specifies suffixes of files to be handled as MIME. sufs specifies a list of suffixes separated with a comma. By default, it is the same as `-msuf .eml,.mht'.

If the option -mn is specified, attributes of the content body of MIME take priority over the attributes of the document.

The option -xsuf specifies suffixes of files to be processed by an arbitrary outer command. sufs specifies a list of suffixes separated with a comma. type specifies a media type. cmd specifies a command to convert the original data to HTML.

The option -xtype specifies a media type to be processed by an arbitrary outer command. type specifies a media type. cmd specifies a command to convert the original data to HTML. This option is used in combination with the `estfind' command.

If the option -xt is specified, output of the outer command is treated as plain text.

If the option -xm is specified, output of the outer command is treated as MIME.

If the option -iz is specified, empty documents are not registered.

The option -ipre specifies prefixes of files to be ignored. pres specifies a list of prefixes separated with a comma.

If the option -isiz is specified, files whose size is larger than the specified size are ignored. size specifies the size in bytes.

The option -enc specifies the encoding of the registered files with code. By default, the encoding of each file is detected automatically from the extracted text.

If the option -pt is specified, the title of each registered document is overwritten with its path on the local file system. The encoding of the file system is specified with code.

If the option -ft is specified, the title of each registered document is overwritten with its file name on the local file system. The encoding of the file system is specified with code.

The option -tattr specifies attributes to be merged to the text and to be treated as search words. attrs specifies a list of attribute names separated with a comma.

If the option -rich is specified, RAM and disk are used liberally, for large sites (more than 100 thousand documents).

If the option -plute is specified, RAM and disk are used liberally, for large sites (more than 500 thousand documents).

When a document that is already registered in the inverted index is processed again, it is registered only if its last modified time is newer than its registration time in the inverted index; otherwise it is ignored.
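The re-registration rule, together with the effect of -force, can be modeled as a small predicate. This is a sketch of the documented behavior only, not the way estindex implements it; the function name and the representation of times as seconds since the epoch are assumptions.

```shell
#!/bin/sh
# Decide whether a file would be registered under the rule described above.
#   $1 = last modified time of the file (seconds since the epoch)
#   $2 = registration time in the index, or empty if not yet registered
#   $3 = "yes" to model the -force option (defaults to "no")
should_register() {
  mtime=$1
  reg_time=$2
  force=${3:-no}
  # -force registers unmodified files again.
  [ "$force" = yes ] && return 0
  # A file that is not in the index yet is always registered.
  [ -z "$reg_time" ] && return 0
  # Otherwise only a file modified after its registration is registered.
  [ "$mtime" -gt "$reg_time" ]
}
```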

estindex relate

The sub command relate is used in order to add score information for relational document search.

name specifies the name of an inverted index.

prefix specifies a prefix of the URI of target documents. If it is omitted, all documents are related.

If the option -list is specified, the file specified by file is read, and the files whose paths are written on each line of it are processed. If file is `-', the standard input is read.

If the option -force is specified, score information of all target documents is registered regardless of whether it is already registered.

If the option -relax is specified, the process sleeps periodically to reduce the load on the system.

If the option -ni is specified, TF-IDF is disabled. By default, it is enabled.

If you do not need relational document search, you do not have to perform this sub command.

estindex purge

The sub command purge is used in order to reflect deleted files in an inverted index.

name specifies the name of an inverted index.

prefix specifies a prefix of the URI of target documents. If it is omitted, all documents are checked.

If the option -list is specified, the file specified by file is read, and the files whose paths are written on each line of it are processed. If file is `-', the standard input is read.

If the option -force is specified, all target documents are removed from the inverted index regardless of whether the files exist.

If the option -relax is specified, the process sleeps periodically to reduce the load on the system.

estindex optimize

The sub command optimize is used in order to delete useless information that arises from updating an inverted index.

name specifies the name of an inverted index.

If the option -relax is specified, the process sleeps periodically to reduce the load on the system.

If the option -small is specified, optimization that favors size reduction is performed.

estindex inform

The sub command inform is used in order to get information about an inverted index.

name specifies the name of an inverted index.

estindex merge

The sub command merge is used in order to merge multiple inverted indexes.

name specifies the name of an inverted index.

elems specifies the names of element inverted indexes.

If the option -relax is specified, the process sleeps periodically to reduce the load on the system.

If the option -rich is specified, RAM and disk are used liberally, for large sites (more than 100 thousand documents).

If the option -plute is specified, RAM and disk are used liberally, for large sites (more than 500 thousand documents).

estindex pree

The sub command pree is used in order to test text extraction and word breaking.

file specifies the name of a file to read. If it is omitted, the standard input is read.

If the option -h is specified, the input is handled as HTML. The default is plain text.

If the option -m is specified, the input is handled as e-mail. The default is plain text.

The option -x is used for the input to be processed by an arbitrary outer command. type specifies a media type. cmd specifies a command to convert the original data to HTML.

If the option -xt is specified, output of the outer command is treated as plain text.

If the option -xm is specified, output of the outer command is treated as MIME.

The option -enc specifies the encoding of the registered files with code. By default, the encoding of each file is detected automatically from the extracted text.

If the option -pt is specified, the title of each registered document is overwritten with its path on the local file system. The encoding of the file system is specified with code.

If the option -ft is specified, the title of each registered document is overwritten with its file name on the local file system. The encoding of the file system is specified with code.

The option -tattr specifies attributes to be merged to the text and to be treated as search words. attrs specifies a list of attribute names separated with a comma.

If the option -wl is specified, only the split words in normalized form are output.

estindex version

The sub command `version' is used in order to show the version information of Estraier.

EXAMPLES

To enable full-text search, you should construct an inverted index beforehand. For example, if your web contents are under `/home/mikio/public_html' and CGI scripts are available there, perform the following steps.

cd /home/mikio/public_html

estindex register casket

estindex relate casket

Then, all plain text, HTML, and MIME files are registered in an inverted index named casket.

When your site is updated, perform the following steps.

cd /home/mikio/public_html

estindex purge casket

estindex register casket

estindex optimize casket

estindex relate casket

Then, deleted files are reflected in the inverted index, and new or modified files are reflected in it as well.
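Because each sub command returns a non-zero status on error, the update steps above can be chained so that a failure stops the sequence. The following is a minimal sketch; the names `update_index` and `ESTINDEX` are hypothetical and exist only so the function can be tried with a stub command.

```shell
#!/bin/sh
# ESTINDEX is a hypothetical override; it defaults to the real command.
ESTINDEX=${ESTINDEX:-estindex}

# Run the update steps in the documented order and stop at the first failure.
# Run it from the directory that holds the documents, as in the steps above.
update_index() {
  name=$1
  for sub in purge register optimize relate; do
    "$ESTINDEX" "$sub" "$name" || {
      echo "estindex $sub $name failed" >&2
      return 1
    }
  done
}
```

A periodic job could then run `cd /home/mikio/public_html && update_index casket`.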

SEE ALSO

AUTHOR

estraier was written by Mikio Hirabayashi <mikio at users.sourceforge.net>.

This manual page was written by Fumitoshi UKAI <ukai@debian.or.jp>, for the Debian project (but may be used by others).