man indexer.conf (Formats) - configuration file for indexer

NAME

indexer.conf - configuration file for indexer

DESCRIPTION

This is the configuration file for indexer(1). The configuration file consists of commands and their arguments. All commands are case-insensitive. Use # to comment out lines.

VARIABLES

Global parameters

These commands should be used only once and take global effect for the whole configuration file.

DBType type
Database type. Currently supported values are mysql, pgsql, msql, solid, mssql, oracle, ibase, sqlite. The value does not matter when native libraries are used, but ODBC users must specify one of the supported values. If your database type is not supported, use unknown instead.

DBHost host
SQL host name (not required for ODBC). Default: localhost

DBName mnogosearch
SQL database name or ODBC DSN. Default: mnogosearch

DBUser foo
Username used to connect to the database. Default: no user

DBPass bar
Password used to connect to the database. Default: no password

DBMode single/multi/crc/crc-multi
SQL database word storage mode. Does not apply to the built-in database. When single is specified, all words are stored in the same table. multi means that words are stored in different tables depending on word length. multi mode is usually faster, but it requires more tables in the database. In crc mode, mnoGoSearch stores 32-bit integer word IDs calculated by the CRC32 algorithm instead of the words themselves. crc mode requires less disk space and is faster than single and multi modes. crc-multi mode shares its storage structure with crc mode, but stores words in different tables depending on word length, like multi mode. Default: single

LocalCharset charset
Defines the charset of the local file system. It is required if you are using 8-bit characters and is not applicable to 7-bit characters. This command is to be used once and takes global effect for the whole configuration file. Example: LocalCharset windows-1250

CrossWords yes|no
Enables building of the CrossWords index. Crosswords are the words used in a link to the present page. Default: no

StopWordFile filename
Specifies the file that contains the list of stopwords to load. You may give either an absolute file name or a file name relative to the mnoGoSearch /etc directory. You may use several StopWordFile commands.

MinWordLength characters
MaxWordLength characters
With these commands you can change the default length range of words stored in the database. By default mnoGoSearch stores words that are longer than 1 and shorter than 32 characters. Example: MaxWordLength 35

MaxDocSize bytes
Specifies the maximum size, in bytes, of a document that can be indexed. Default: 1048576 (1 MB). This command takes global effect for the whole config file.

HTTPHeader header
Adds a custom HTTP header to the indexer HTTP request. Do not use the "If-modified-since" and "Accept-Charset" headers, since they are composed by indexer itself. "User-Agent: mnoGoSearch/version" is sent too, although you may override it. The command has global effect for the whole configuration file.

ServerTable table_name
This command works only with an SQL database and is not applicable to the built-in database mode. Loads servers with all their parameters from the table table_name. For an example of the table structure, refer to the file create/mysql/server.txt. You may use several arguments with this command:
    ServerTable my_servers1 my_servers2 my_servers3
or just a single argument:
    ServerTable server

DeleteNoServer yes|no
Specifies whether to delete URLs that have no corresponding Server command. Default: yes

VarDir /path/to/my/var/dir
Specifies a custom path to the directory where indexer stores data when used with the built-in database and in cache mode. By default the /var directory of the mnoGoSearch installation is used.
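
As a rough illustration, the global parameters described above might be combined as follows. This is only a sketch: the host, database name, credentials and stopword file name are placeholder values chosen for the example, not settings shipped with mnoGoSearch.

```
# Global settings; each command is used once (all values are hypothetical)
DBType mysql
DBHost localhost
DBName mnogosearch
DBUser foo
DBPass bar
# Store CRC32 word IDs, split into tables by word length
DBMode crc-multi
LocalCharset windows-1250
# Path relative to the mnoGoSearch /etc directory (file name is an assumption)
StopWordFile stopwords/english.sl
MinWordLength 2
MaxWordLength 35
MaxDocSize 1048576
```

VarDir is left out here because it only matters for the built-in database and cache mode.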

URL Control Configuration

Allow [Match|NoMatch] [NoCase|Case] [String|Regex] <arg> [<arg> ...]
Use this command to allow URLs that match (or, with NoMatch, do not match) the given argument. The first three optional parameters describe the type of comparison. Default values are Match, NoCase, String. Use NoCase or Case to choose case-insensitive or case-sensitive comparison. Use Regex to choose regular expression comparison. Use String to choose string comparison with wildcards. Wildcards are * for any number of characters and ? for one character. Because * and ? have special meaning in the String match type, use Regex to describe documents with ? and * signs in the URL. String match is much faster than Regex; use String where possible. You may use several arguments for one Allow command and use this command any number of times. It takes global effect for the config file. Note that mnoGoSearch automatically adds one Allow regex .* command after reading the config file. That command means that everything that is not disallowed is allowed.

Disallow [Match|NoMatch] [Case|NoCase] [String|Regex] <arg> [<arg> ...]
Use this command to disallow indexing of documents whose URLs match the given argument. The meaning of the first three optional parameters is exactly the same as with the Allow command. You can use several arguments for one Disallow command. Takes global effect for the config file.
Example:
    #Exclude cgi-bin and non-parsed-headers
    Disallow /cgi-bin/ \.cgi /nph
    #Exclude some known extensions
    Disallow \.b$ \.sh$ \.md5$
    Disallow \.arj$ \.tar$ \.zip$ \.tgz$ \.gz$
    Disallow \.lha$ \.lzh$ \.tar\.Z$ \.rar$ \.zoo$
    Disallow \.gif$ \.jpg$ \.jpeg$ \.bmp$ \.tiff$
    Disallow \.vdo$ \.mpeg$ \.mpe$ \.mpg$ \.avi$ \.movie$
    Disallow \.mid$ \.mp3$ \.rm$ \.ram$ \.wav$ \.aiff$ \.ra$
    Disallow \.vrml$ \.wrl$
    Disallow \.exe$ \.cab$ \.dll$ \.bin$ \.class$
    Disallow \.tex$ \.texi$ \.xls$ \.doc$ \.texinfo$
    Disallow \.rtf$ \.pdf$ \.cdf$ \.ps$
    Disallow \.ai$ \.eps$ \.ppt$ \.hqx$
    Disallow \.cpt$ \.bms$ \.oda$ \.tcl$
    Disallow \.rpm$
    #Exclude Apache directory list in different sort order
    Disallow \?D=A$ \?D=A$ \?D=D$ \?M=A$ \?M=D$ \?N=A$ \?N=D$ \?S=A$ \?S=D$
    #Exclude ./. and ./.. from Apache and Squid directory list
    Disallow /[.]{1,2} /\%2e /\%2f

CheckOnly regexp [regexp [...]]
Indexer will use the HEAD HTTP method instead of GET for URLs that match regexp. This means that the file is only checked and is not downloaded. Useful for zip, exe, arj and similar files. You can use several arguments for one CheckOnly command. You can use this command any number of times, but not more than MAXFILTER (defined in indexer.h). Takes global effect for the config file.
Examples:
    #Use HEAD method for some known non-text extensions:
    CheckOnly \.b$ \.sh$ \.md5$
    CheckOnly \.arj$ \.tar$ \.zip$ \.tgz$ \.gz$
    CheckOnly \.lha$ \.lzh$ \.tar\.Z$ \.rar$ \.zoo$
    CheckOnly \.gif$ \.jpg$ \.jpeg$ \.bmp$ \.tiff$
    CheckOnly \.vdo$ \.mpeg$ \.mpe$ \.mpg$ \.avi$ \.movie$
    CheckOnly \.mid$ \.mp3$ \.rm$ \.ram$ \.wav$ \.aiff$
    CheckOnly \.vrml$ \.wrl$
    CheckOnly \.exe$ \.cab$ \.dll$ \.bin$ \.class$
    CheckOnly \.tex$ \.texi$ \.xls$ \.doc$ \.texinfo$
    CheckOnly \.rtf$ \.pdf$ \.cdf$ \.ps$
    CheckOnly \.ai$ \.eps$ \.ppt$ \.hqx$
    CheckOnly \.cpt$ \.bms$ \.oda$ \.tcl$
    CheckOnly \.rpm$

HrefOnly regexp [regexp [...]]
Indexer scans HTML documents that match regexp as it would scan any other URL, except that it does not index the contents. It adds any URLs it finds in the HTML document to the database. Useful when indexing mailing list archives with big index pages that contain mostly URLs. You can use several arguments for one HrefOnly command. You can use this command any number of times, but not more than MAXFILTER (defined in indexer.h). Takes global effect for the config file.
Examples:
    #Scan these files for href tags only, but do not index their contents.
    HrefOnly mail.*\.html$ thr.*\.html$
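
The optional match-type parameters of Allow and Disallow can also be written out explicitly. The fragment below is a sketch based only on the syntax described in this section. The patterns themselves are illustrative assumptions, and the first line assumes that a String pattern is compared against the whole URL.

```
# String comparison with wildcards (the documented defaults, written out);
# intended to cover URLs ending in .zip or .exe
Disallow Match NoCase String *.zip *.exe

# Case-sensitive regular expression comparison
Disallow Match Case Regex \.PDF$

# NoMatch inverts the test: disallow every URL that is neither
# a directory nor an .htm/.html page
Disallow NoMatch NoCase Regex \/$|\.htm$|\.html$
```

The alternatives in the last line are joined into a single regular expression rather than given as separate arguments, so that the inverted test is applied once to the whole pattern.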

MIME types and external parsers

UseRemoteContentType yes|no
Specifies whether indexer should get the content type from the HTTP server headers (yes) or from its AddType settings (no). If set to no and indexer cannot determine the content type from its AddType settings, ...

SyslogFacility facility
Useful only if indexer is compiled with syslog support and you do not like the default. The argument is the same as used in the syslog.conf file (for example: local7, daemon). For the list of possible facilities see syslog.conf(5). Takes global effect and should be used only once. Default: depends on compilation.

LogdAddr host[:port]
Use cachelogd at the given host and port, if specified. Required for cache mode only. Default: localhost, port 7000

FollowOutside yes|no
Allows or disallows indexer to walk outside the current server. Should be used carefully (see the MaxHops command). Default: no

Period seconds
Reindex period in seconds; 604800 = 1 week. May be used before every Server command and takes effect till the end of the config file or till the next Period command.

Tag number
Use this parameter for your own purposes, for example to combine several servers into one group. May be used multiple times before every Server command and takes effect till the end of the config file or till the next Tag command.

MaxHops number
Maximum distance, in "mouse clicks", from the start URL given in the Server command. May be used multiple times before every Server command and takes effect till the end of the config file or till the next MaxHops command. Default: 256

MaxNetErrors number
Maximum number of network errors for each server. If there are too many network errors on some server (server is down, host unreachable, etc.), indexer will make no more than number attempts to connect to this server. May be used multiple times before the Server command and takes effect till the end of the config file or till the next MaxNetErrors command. Default: 16

TitleWeight number
Weight of the words in <title>...</title>. Can be set multiple times before the Server command and takes effect till the end of the config file or till the next TitleWeight command. Default: 2

BodyWeight number
Weight of the words in the <body>...</body> of HTML documents and in the contents of text/plain documents. Can be set multiple times before the Server command and takes effect till the end of the config file or till the next BodyWeight command. Default: 1

DescWeight number
Weight of the words in <META NAME="Description" Content="...">. Can be set multiple times before the Server command and takes effect till the end of the config file or till the next DescWeight command. Default: 2

KeywordWeight number
Weight of the words in <META NAME="Keywords" Content="...">. Can be set multiple times before the Server command and takes effect till the end of the config file or till the next KeywordWeight command. Default: 2

UrlWeight number
Weight of the words in the URL of the documents. Can be set multiple times before the Server command and takes effect till the end of the config file or till the next UrlWeight command. Default: 0

DeleteBad yes|no
Controls whether indexer deletes bad (not found, forbidden, etc.) URLs from the database. Useful if you want to check the 'integrity' of your server(s): if you set it to no, "bad" URLs will remain in the database. Can be set multiple times before the Server command and takes effect till the end of the config file or till the next DeleteBad command. Default: yes

Robots yes|no
Allows or disallows the use of robots.txt and <META NAME="robots"> exclusions. Useful if you want to check the 'integrity' of your server(s). Can be set multiple times before the Server command and takes effect till the end of the config file or till the next Robots command. Default: yes
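
Because most of the commands above stay in effect until the next command of the same name, their position relative to the Server commands matters. The following fragment is a minimal sketch of that scoping rule. The numeric values are arbitrary, and the URLs are taken from the Server examples elsewhere on this page.

```
# These settings apply to every Server command that follows,
# until the same command appears again
Period 604800
Tag 1
MaxHops 4
TitleWeight 2
BodyWeight 1
Server http://www.yoursite.com/

# Override some settings for the next group of servers;
# MaxHops, TitleWeight and BodyWeight keep the values set above
Period 86400
Tag 2
Robots no
Server http://localhost/
```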

Section <string> <number>
Here <string> is the section name and <number> is the section ID, between 0 and 255. Use 0 for sections you do not want to index. It is better to use different section IDs for different parts of a document. At search time you will then be able to give a different weight to each part, or even to exclude some sections.
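
A possible use of the Section command is sketched below. The section names body, title and url are assumptions made for this illustration. Only the form Section <string> <number> and the meaning of ID 0 come from the description above.

```
# Distinct IDs let each part be weighted separately at search time
# (section names are hypothetical)
Section body 1
Section title 2
# ID 0: do not index this section
Section url 0
```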

Index yes|no
Set to no to prevent indexer from storing words in the database. Useful if you want to check the 'integrity' of your server(s). Can be set multiple times before the Server command and takes effect till the end of the config file or till the next Index command. Note: instead of Index no you can use the alternate form NoIndex. Default: yes

Follow yes|no
Allows or disallows indexer to store <a href="..."> links in the database. Can be set multiple times before the Server command and takes effect till the end of the config file or till the next Follow command. Note: instead of Follow no you can use the alternate form NoFollow. Default: yes

MaxDocSize size
Limits the maximum document size. size is in bytes. If a document is larger than size, indexer parses only the first size bytes of it. Default: 1048576 (1 megabyte)

Mime <from_mime> <to_mime>[;charset] ["command line [$1]"]
Adds support for parsing documents with MIME types other than text/plain and text/html. This can be done with an external parser (which must produce plain text or HTML output) or simply by substituting the MIME type so that indexer can handle the document directly. <from_mime> and <to_mime> are standard MIME types. <to_mime> should be either text/plain or text/html, because these are the only types that indexer understands. The external parser is assumed to write its results to stdout (if it does not, you have to write a little script that sends the results to stdout). The optional charset parameter is used to change the charset if needed. The command line parameter is optional; if it is omitted, the command only changes the MIME type. The command line may also contain a $1 parameter, which stands for a temporary file name. Some parsers cannot read from stdin, so indexer creates a temporary file for the parser and passes its name in place of $1.

CharSet charset
Useful for 8-bit character sets. WWW servers send data in different character sets. charset is the default character set of the servers in the following Server command(s). May be used before every Server command and takes effect till the end of the config file or till the next CharSet command. Indexer currently supports the following character sets: Cyrillic koi8-r, cp1251, cp866, iso8859-5, x-mac-cyrillic; Arabic cp1256; Western iso-8859-1; Central Europe iso-8859-2 and cp1250. This parameter is the default character set for "bad" servers that do not send charset information in the header (just "Content-type: text/html" instead of, for example, "Content-type: text/html; charset=koi8-r") and do not send charset information in META tags.
Examples:
    CharSet koi8-r
    CharSet windows-1250
    CharSet ISO-8859-1

ForceIISCharset1251 yes/no
This option is useful for users dealing with Cyrillic content and broken (or misconfigured?) Microsoft IIS web servers, which tend to report the charset incorrectly. This is a really dirty hack: if this option is turned on, all servers that report themselves as 'Microsoft' or 'IIS' are assumed to have content in the Windows-1251 code page. This command should be used only once in the configuration file and takes global effect. Default: no

AuthBasic login:passwd
Use basic HTTP authorization. Can be set before every Server command and takes effect only for the next Server command.
Examples:
    AuthBasic somebody:something
If you have password-protected directories but the rest of the server is open, use:
    AuthBasic login1:passwd1
    Server http://my.server.com/my/secure/directory1/
    AuthBasic login2:passwd2
    Server http://my.server.com/my/secure/directory2/
    Server http://my.server.com/

ProxyAuthBasic login:passwd
Use HTTP proxy basic authorization. Can be used before every Server command and takes effect only for the next Server command. It must also come before the Proxy command.
Example:
    ProxyAuthBasic somebody:smth

Proxy your.proxy.host[:port]
Connect via a proxy rather than directly. You can index FTP servers (only) when using a proxy. If the port is not specified, it is set to the default value of 3128 (Squid). If the proxy host is not specified, a direct connection is made. Can be set before every Server command and takes effect till the end of the config file or till the next Proxy command.
Examples:
    Proxy atoll.anywhere.com - proxy on atoll.anywhere.com, port 3128
    Proxy lota.anywhere.com:8090 - proxy on lota.anywhere.com, port 8090
    Proxy - turn off proxy usage (direct connection)

Server URL
This is the main configuration command. Use it to add the start URL of a server to be indexed. You may use many Server commands in the same indexer.conf file.
Examples:
    Server http://localhost/
    Server http://www.yoursite.com/
    Server http://www.yoursite.com/~yourname/
    Server ftp://ftp.yourdomain.com/pub/
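
The Mime command described earlier in this section has no example of its own, so a hedged sketch follows. It assumes that the external programs catdoc and pdftotext are installed and write their converted text to stdout. The text/x-log type in the last line is a made-up source type used only to show MIME type substitution.

```
# External parser for Word documents; $1 is replaced by the name of
# the temporary file that indexer creates (assumes catdoc is installed)
Mime application/msword text/plain "catdoc $1"

# External parser for PDF; the trailing "-" asks pdftotext to write
# the text to stdout (assumes pdftotext is installed)
Mime application/pdf text/plain "pdftotext $1 -"

# No command line: only substitute the MIME type (hypothetical source type)
Mime text/x-log text/plain
```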

EXAMPLE

This is a minimal sample indexer config file:

    DBHost localhost
    DBName udmsearch
    DBUser foo
    DBPass bar
    Server http://localhost/
    Disallow /cgi-bin/ \.cgi /nph
    Disallow \.b$ \.sh$ \.md5$
    Disallow \.arj$ \.tar$ \.zip$ \.tgz$ \.gz$
    Disallow \.lha$ \.lzh$ \.tar\.Z$ \.rar$ \.zoo$
    Disallow \.gif$ \.jpg$ \.jpeg$ \.bmp$ \.tiff$
    Disallow \.vdo$ \.mpeg$ \.mpe$ \.mpg$ \.avi$ \.movie$
    Disallow \.mid$ \.mp3$ \.rm$ \.ram$ \.wav$ \.aiff$ \.ra$
    Disallow \.vrml$ \.wrl$
    Disallow \.exe$ \.cab$ \.dll$ \.bin$ \.class$
    Disallow \.tex$ \.texi$ \.xls$ \.doc$ \.texinfo$
    Disallow \.rtf$ \.pdf$ \.cdf$ \.ps$
    Disallow \.ai$ \.eps$ \.ppt$ \.hqx$
    Disallow \.cpt$ \.bms$ \.oda$ \.tcl$
    Disallow \.rpm$
    Disallow \?D=A$ \?D=A$ \?D=D$ \?M=A$ \?M=D$ \?N=A$ \?N=D$ \?S=A$ \?S=D$
    Disallow /[.]{1,2} /\%2e /\%2f

SEE ALSO