man rainbow (Commands) - document classification front-end to libbow

NAME

rainbow - document classification front-end to libbow

SYNOPSIS

rainbow [OPTION...] [ARG...]

DESCRIPTION

Rainbow is a C program that performs document classification using one of several different methods, including naive Bayes, TFIDF/Rocchio, K-nearest neighbor, Maximum Entropy, Support Vector Machines, Fuhr's Probabilistic Indexing, and a simple-minded form of shrinkage with naive Bayes.

Rainbow is a standalone program that does document classification. Here are some examples:

rainbow -i ./training/positive ./training/negative

Using the text files found under the directories `./training/positive' and `./training/negative', tokenize them, build word vectors, and write the resulting data structures to disk.

rainbow --query=./testing/254

Tokenize the text document `./testing/254', and classify it, producing output like:

/home/mccallum/training/positive 0.72 /home/mccallum/training/negative 0.28

rainbow --test-set=0.5 -t 5

Perform 5 trials, each consisting of a new random test/train split, and output the classifications of the test documents.
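
For example, a complete train-and-test run that keeps its model in an explicit data directory might look like the following (the directory name `~/model' and the split parameters are illustrative):

rainbow -d ~/model -i ./training/positive ./training/negative
rainbow -d ~/model --test-set=0.4 -t 3

The first command tokenizes and saves the model; the second randomly tags 40% of the indexed documents as a test set and reports classifications for 3 such splits.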

OPTIONS

Testing documents that are specified on the command line:
-x, --test-files
In the same format as `-t', output classifications of the documents in the directory ARG. The ARG directory must have the same subdirectory names as the ARGs specified when --index'ing.
-X, --test-files-loo
Same as --test-files, but evaluate the files as if they were part of the training data, performing leave-one-out cross-validation. This only works with classification methods that support leave-one-out evaluation.
Splitting options:
--ignore-set=SOURCE
How to select the ignored documents. Same format as --test-set. Default is `0'.
--set-files-use-basename[=N]
When using files to specify doc types, compare only the last N components of the doc's pathname; that is, use the filename and the last N-1 directory names. If N is not specified, it defaults to 1.
--test-set=SOURCE
How to select the testing documents. A number between 0 and 1 inclusive with a decimal point indicates a random fraction of all documents; the number of documents selected from each class is determined by attempting to match the proportions of the non-ignore documents. A number with no decimal point indicates the number of documents to select at random. Alternatively, a suffix of `pc' indicates the number of documents to tag per class. The suffix `t' on a number or proportion indicates to tag documents from the pool of training documents rather than the untagged documents. `remaining' selects all documents that remain untagged at the end. Anything else is interpreted as a filename listing the documents to select. Default is `0.0'. (Examples appear at the end of this subsection.)
--train-set=SOURCE
How to select the training documents. Same format as --test-set. Default is `remaining'.
--unlabeled-set=SOURCE
How to select the unlabeled documents. Same format as --test-set. Default is `0'.
--validation-set=SOURCE
How to select the validation documents. Same format as --test-set. Default is `0'.
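As an illustration of the SOURCE syntax shared by the options above (all values below are arbitrary):

--test-set=0.3           tag a random 30% of all documents
--test-set=100           tag 100 documents selected at random
--test-set=20pc          tag 20 documents per class
--test-set=remaining     tag every document still untagged
--test-set=doclist.txt   tag the documents listed in the (hypothetical) file doclist.txt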
For building data structures from text files:
-i, --index
Tokenize training documents found under directories ARG... (where each ARG directory contains documents of a different class), build token-document matrix, and save it to disk.
--index-lines=FILENAME
Read documents' contents from the filename argument, one per line. The first two space-delimited words on each line are the document name and class name, respectively. (See the example at the end of this subsection.)
--index-matrix=FORMAT
Read document/word statistics from a file in the format produced by --print-matrix=FORMAT. See --print-matrix for details about FORMAT.
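For instance, given a hypothetical file `docs.txt' in which, per the description above, the first two space-delimited words of each line are the document name and class name (the remainder of each line presumably being the document's contents):

doc1 positive a thoroughly enjoyable and well acted film
doc2 negative dull plot and wooden acting throughout

rainbow --index-lines=docs.txt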
For doing document classification using the token-document matrix built with -i:
--forking-query-server=PORTNUM
Same as `--query-server', except allow multiple clients at once by forking for each client.
--print-doc-length
When printing the classification scores for each test document, at the end also print the number of words in the document. This only works with the --test option.
-q, --query[=FILE]
Tokenize input from stdin [or FILE], then print classification scores.
--query-server=PORTNUM
Run rainbow in server mode, listening on socket number PORTNUM. You can try it by executing this command, then in a different shell window on the same machine typing `telnet localhost PORTNUM'. (See the example at the end of this subsection.)
-r, --repeat
Prompt for repeated queries.
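For example, to serve a previously indexed model and then query it by hand (the data directory `~/model' and port number are illustrative):

rainbow -d ~/model --query-server=1821
telnet localhost 1821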
Rainbow-specific vocabulary options:
--hide-vocab-in-file=FILE
Hide from the vocabulary all words read as space-separated strings from FILE. Note that regular lexing is not done on these strings.
--hide-vocab-indices-in-file=FILE
Hide from the vocabulary all words read as space-separated word integer indices from FILE.
--use-vocab-in-file=FILE
Limit vocabulary to just those words read as space-separated strings from FILE. Note that regular lexing is not done on these strings.
Testing documents that were indexed with `-i':
-t, --test=N
Perform N test/train splits of the indexed documents, and output classifications of all test documents each time. The parameters of the test/train splits are determined by the option `--test-set' and its siblings.
--test-on-training=N
Like `--test', but instead of classifying the held-out test documents, classify the training data in leave-one-out fashion. Perform N trials.
Diagnostics:
--build-and-save
Build a class model and save it to disk. This option is unstable.
-B, --print-matrix[=FORMAT]
Print the word/document count matrix in an awk- or perl-accessible format. FORMAT is specified by one letter from each of the following groups:

print all vocab or just words in document: a=all OR s=sparse

print counts as binary or integer: b=binary OR i=integer

print word as: n=integer index OR w=string OR e=empty OR c=combination

The default is the last in each list. (An example appears at the end of this subsection.)
-F, --print-word-foilgain=CLASSNAME
Print the word/foilgain vector for CLASSNAME. See Mitchell's Machine Learning textbook for a description of foilgain.
-I, --print-word-infogain=N
Print the N words with the highest information gain.
--print-doc-names[=TAG]
Print the filenames of documents contained in the model. If the optional TAG argument is given, print only the documents that have the specified tag, where TAG might be `train', `test', etc.
--print-log-odds-ratio[=N]
For each class, print the N words with the highest log odds ratio score. Default is N=10.
--print-word-counts=WORD
Print the number of times WORD occurs in each class.
--print-word-pair-infogain=N
Print the N word-pairs which, when co-occurring in a document, have the highest information gain. (Unfinished; ignores N.)
--print-word-probabilities=CLASS
Print P(w|CLASS), the probability in class CLASS of each word in the vocabulary.
--test-from-saved
Classify using the class model saved to disk. This option is unstable.
--use-saved-classifier
Don't ever re-train the classifier. Use whatever class barrel was saved to disk. This option is designed for use with --query-server.
-W, --print-word-weights=CLASSNAME
Print the word/weight vector for CLASSNAME, sorted with high weights first. The meaning of `weight' is undefined.
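As an example of the --print-matrix FORMAT letters described above, the following prints the matrix sparsely, with integer counts and words as strings (the data directory is illustrative):

rainbow -d ~/model --print-matrix=siw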
Probabilistic Indexing options, --method=prind:
-G, --prind-no-foilgain-weight-scaling
Don't have PrInd scale its weights by Quinlan's FoilGain.
-N, --prind-no-score-normalization
Don't have PrInd normalize its class scores to sum to one.
--prind-non-uniform-priors
Make PrInd use non-uniform class priors.
General options
--annotations=FILE
The sarray file containing annotations for the files in the index.
-b, --no-backspaces
Don't use backspace when verbosifying progress (good for use in emacs).
-d, --data-dir=DIR
Set the directory in which to read/write word-vector data (default=~/.<program_name>).
--random-seed=NUM
The non-negative integer to use for seeding the random number generator.
--score-precision=NUM
The number of decimal digits to print when displaying document scores.
-v, --verbosity=LEVEL
Set the amount of info printed while running (0=silent, 1=quiet, 2=show-progress, ... 5=max).
Lexing options
--append-stoplist-file=FILE
Add words in FILE to the stoplist.
--exclude-filename=FILENAME
When scanning directories for text files, skip files with name matching FILENAME.
-g, --gram-size=N
Create tokens for all 1-grams,... N-grams.
-h, --skip-header
Avoid lexing news/mail headers by scanning forward until two newlines.
--istext-avoid-uuencode
Check for uuencoded blocks before saying that the file is text, and say no if there are many lines of the same length.
--lex-pipe-command=SHELLCMD
Pipe files through this shell command before lexing them.
--max-num-words-per-document=N
Only tokenize the first N words in each document.
--no-stemming
Do not modify lexed words with a stemming function. (usually the default, depending on lexer)
--replace-stoplist-file=FILE
Empty the default stoplist, and add space-delimited words from FILE.
-s, --no-stoplist
Do not toss lexed words that appear in the stoplist.
--shortest-word=LENGTH
Toss lexed words that are shorter than LENGTH. Default is usually 2.
-S, --use-stemming
Modify lexed words with the `Porter' stemming function.
--use-stoplist
Toss lexed words that appear in the stoplist. (usually the default SMART stoplist, depending on lexer)
--use-unknown-word
When used in conjunction with -O or -D, capture all words with occurrence counts below the threshold as the `<unknown>' token.
--xxx-words-only
Only tokenize words with `xxx' in them.
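For example, an illustrative indexing run that skips mail headers, tokenizes 1-grams and 2-grams, and applies Porter stemming (directory names illustrative):

rainbow -d ~/model -i -h -g 2 -S ./training/positive ./training/negative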
Mutually exclusive choice of lexers
--flex-mail
Use a mail-specific flex lexer
--flex-tagged
Use a tagged flex lexer
-H, --skip-html
Skip HTML tokens when lexing.
--lex-alphanum
Use a special lexer that includes digits in tokens, delimiting tokens only by non-alphanumeric characters.
--lex-infix-string=ARG
Use only the characters after ARG in each word for stoplisting and stemming. If a word does not contain ARG, the entire word is used.
--lex-suffixing
Use a special lexer that adds suffixes depending on Email-style headers.
--lex-white
Use a special lexer that delimits tokens by whitespace only, and does not change the contents of the token at all---no downcasing, no stemming, no stoplist, nothing. Ideal for use with an externally-written lexer interfaced to rainbow with --lex-pipe-command.
Feature-selection options
-D, --prune-vocab-by-doc-count=N
Remove words that occur in N or fewer documents.
-O, --prune-vocab-by-occur-count=N
Remove words that occur less than N times.
-T, --prune-vocab-by-infogain=N
Remove all but the top N words by selecting words with highest information gain.
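For instance, one plausible indexing run that removes words occurring fewer than 3 times and keeps only the top 2000 words by information gain (thresholds and directory names illustrative):

rainbow -d ~/model -i -O 3 -T 2000 ./training/positive ./training/negative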
Weight-vector setting/scoring method options
--binary-word-counts
Instead of using integer occurrence counts of words to set weights, use binary absence/presence.
--event-document-then-word-document-length=NUM
Set the normalized length of documents when --event-model=document-then-word.
--event-model=EVENTNAME
Set what objects will be considered the `events' of the probabilistic model. EVENTNAME can be one of: word, document, document-then-word.
Default is `word'.
--infogain-event-model=EVENTNAME
Set what objects will be considered the `events' when information gain is calculated. EVENTNAME can be one of: word, document, document-then-word.
Default is `document'.
-m, --method=METHOD
Set the word weight-setting method; METHOD may be one of: active, em, emsimple, kl, knn, maxent, naivebayes, nbshrinkage, nbsimple, prind, tfidf_words, tfidf_log_words, tfidf_log_occur, tfidf, svm. The default is `naivebayes'.
--print-word-scores
During scoring, print the contribution of each word to each class.
--smoothing-dirichlet-filename=FILE
The file containing the alphas for Dirichlet smoothing.
--smoothing-dirichlet-weight=NUM
The weighting factor by which to multiply the alphas for Dirichlet smoothing.
--smoothing-goodturing-k=NUM
Smooth word probabilities for words that occur NUM or fewer times. The default is 7.
--smoothing-method=METHOD
Set the method for smoothing word probabilities to avoid zeros; METHOD may be one of: goodturing, laplace, mestimate, wittenbell.
--uniform-class-priors
When setting weights, calculating infogain, and scoring, use equal prior probabilities on classes.
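For example, to evaluate naive Bayes with the document event model on a random 30% test split (all parameters illustrative, assuming a model indexed under `~/model'):

rainbow -d ~/model -m naivebayes --event-model=document --test-set=0.3 -t 1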
Support Vector Machine options, --method=svm:
--svm-active-learning=
Use active learning to query the labels and incrementally (by arg_size) build the barrels.
--svm-active-learning-baseline=
Incrementally add documents to the training set at random.
--svm-al-transduce
Do transduction over the unlabeled data during active learning.
--svm-al_init_tsetsize=
Number of random documents to start with in active learning.
--svm-bsize=
Maximum size of the subproblems to construct.
--svm-cache-size=
Number of kernel evaluations to cache.
--svm-cost=
Cost to bound the Lagrange multipliers by (default 1000).
--svm-df-counts=
Set df_counts (0=occurrences, 1=words).
--svm-epsilon_a=
Tolerance for the bounds of the Lagrange multipliers (default 0.0001).
--svm-kernel=
Type of kernel to use (0=linear, 1=polynomial, 2=Gaussian, 3=sigmoid, 4=fisher kernel).
--svm-quick-scoring
Turn quick scoring on.
--svm-remove-misclassified=
Remove all of the misclassified examples and retrain (default 0=none; 1=bound, 2=wrong).
--svm-rseed=
The random seed to use in the test-in-train splits.
--svm-start-at=
Which model should be the first generated.
--svm-suppress-score-matrix
Do not print the scores of each test document at each AL iteration.
--svm-test-in-train
Do active learning testing inside of the training; a hack to avoid making the code ten times more complicated.
--svm-tf-transform=
The tf transform to apply (0=raw, 1=log, ...).
--svm-trans-cost=
Value to assign to C* (default 200).
--svm-trans-hyp-refresh=
How often the hyperplane should be recomputed during transduction; only applies to SMO (default 40).
--svm-trans-nobias
Do not use a bias when marking unlabeled documents. Use a threshold of 0 to determine labels, instead of a threshold chosen to mark a certain number of documents for each class.
--svm-trans-npos=
Number of unlabeled documents to label as positive (default: proportional to the number of labeled positive docs).
--svm-trans-smart-vals=
Use the previous problem's values as a starting point for the next (default true).
--svm-transduce-class=
Override the default class(es) (int) to do transduction with (default bow_doc_unlabeled).
--svm-use-smo=
Whether to use SMO; default 1 (use SMO). PR_LOQO support is not compiled in.
--svm-vote=
Type of voting to use (0=singular, 1=pairwise; default 0).
--svm-weight=
Type of function to use to set the weights of the documents' words (0=raw_frequency, 1=tfidf, 2=infogain).
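For example, an illustrative SVM run with a linear kernel on a random 30% test split (all parameters illustrative, assuming a model indexed under `~/model'):

rainbow -d ~/model -m svm --svm-kernel=0 --test-set=0.3 -t 1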
Naive Bayes options, --method=naivebayes:
--naivebayes-binary-scoring
When using naivebayes, use hacky scoring to get good Precision-Recall curves.
--naivebayes-m-est-m=M
When using `m'-estimates for smoothing in NaiveBayes, use M as the value for `m'. The default is the size of the vocabulary.
--naivebayes-normalize-log
When using naivebayes, return -1/log(P(C|d)), normalized to sum to one, instead of P(C|d). This results in values that are not so close to zero and one.
Maximum Entropy options, --method=maxent:
--maxent-constraint-docs=TYPE
The documents to use for setting the constraints. The default is train. The other choice is trainandunlabeled.
--maxent-gaussian-prior
Add a Gaussian prior to each word/class feature constraint.
--maxent-gaussian-prior-no-zero-constraints
When using a Gaussian prior, do not enforce constraints that have no training data.
--maxent-halt-by-accuracy=TYPE
When running maxent, halt iterations using the accuracy of documents. TYPE is the type of documents to test. See `--em-halt-using-perplexity' for choices for TYPE.
--maxent-halt-by-logprob=TYPE
When running maxent, halt iterations using the logprob of documents. TYPE is the type of documents to test. See `--em-halt-using-perplexity' for choices for TYPE.
--maxent-iteration-docs=TYPE
The types of documents to use for maxent iterations. The default is train. See `--em-halt-using-perplexity' for choices for TYPE.
--maxent-iterations=NUM
The number of iterative scaling iterations to perform. The default is 40.
--maxent-keep-features-by-mi=NUM
The number of top words by mutual information per class to use as features. Zero implies no pruning and is the default.
--maxent-logprob-constraints
Set constraints to be the log prob of the word.
--maxent-print-accuracy=TYPE
When running maximum entropy, print the accuracy of documents at each round. TYPE is the type of document to measure accuracy on. See `--em-halt-using-perplexity' for choices for TYPE.
--maxent-prior-variance=NUM
The variance to use for the Gaussian prior. The default is 0.01.
--maxent-prune-features-by-count=NUM
Prune the word/class feature set, keeping only those features that have at least NUM occurrences in the training set.
--maxent-scoring-hack
Use smoothed naive Bayes probability for zero-occurrence word/class pairs during scoring.
--maxent-smooth-counts
Add 1 to the count of each word/class pair when calculating the constraint values.
--maxent-vary-prior-by-count
Multiply log(1 + N(w,c)) times variance when using a Gaussian prior.
--maxent-vary-prior-by-count-linearly
Multiply N(w,c) times variance when using a Gaussian prior.
K-nearest neighbor options, --method=knn:
--knn-k=K
Number of neighbours to use for nearest neighbour. Defaults to 30.
--knn-weighting=xxx.xxx
Weighting scheme to use, coded like SMART. Defaults to nnn.nnn. The first three chars describe how the model documents are weighted, the second three describe how the test document is weighted. The codes for each position are described in knn.c. Classification consists of summing the scores per class for the k nearest neighbour documents and sorting.
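For example, an illustrative k-nearest-neighbour run using 10 neighbours (all parameters illustrative, assuming a model indexed under `~/model'):

rainbow -d ~/model -m knn --knn-k=10 --test-set=0.3 -t 1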
EMSIMPLE options:
--emsimple-no-init
Use this option when using emsimple as the secondary method for genem.
--emsimple-num-iterations=NUM
Number of EM iterations to run when building model.
--emsimple-print-accuracy=TYPE
When running emsimple, print the accuracy of documents at each EM round. Type can be validation, train, or test.
EM options:
--em-anneal
Use Deterministic annealing EM.
--em-anneal-normalizer
When running EM, do deterministic annealing-ish stuff with the unlabeled normalizer.
--em-binary
Do special tricks for the binary case.
--em-binary-neg-classname=CLASS
Specify the name of the negative class if building a binary classifier.
--em-binary-pos-classname=CLASS
Specify the name of the positive class if building a binary classifier.
--em-compare-to-nb
When building an EM class barrel, show doc stats for the naivebayes barrel equivalent. Only use in conjunction with --test.
--em-crossentropy
Use crossentropy instead of naivebayes for scoring.
--em-halt-using-accuracy=TYPE
When running EM, halt when accuracy plateaus. TYPE is the type of document to measure accuracy on. Choices are `validation', `train', `test', `unlabeled', `trainandunlabeled', and `trainandunlabeledloo'.
--em-halt-using-perplexity=TYPE
When running EM, halt when perplexity plateaus. TYPE is the type of document to measure perplexity on. Choices are `validation', `train', `test', `unlabeled', `trainandunlabeled', and `trainandunlabeledloo'.
--em-labeled-for-start-only
Use the labeled documents to set the starting point for EM, but ignore them during the iterations.
--em-multi-hump-init=METHOD
When initializing mixture components, how to assign component probs to documents. Default is `spread'. The other choice is `spiked'.
--em-multi-hump-neg=NUM
Use NUM center negative classes. Only use in the binary case. Must be using scoring method nb_score.
--em-num-iterations=NUM
Number of EM iterations to run when building model.
--em-perturb-starting-point=TYPE
Instead of starting EM with P(w|c) from the labeled training data, start from values that are randomly sampled from the multinomial specified by the labeled training data. TYPE specifies what distribution to use for the perturbation; choices are `gaussian', `dirichlet', and `none'. Default is `none'.
--em-print-accuracy=TYPE
When running EM, print the accuracy of documents at each round. TYPE is the type of document to measure accuracy on. See `--em-halt-using-perplexity' for choices for TYPE.
--em-print-perplexity=TYPE
When running EM, print the perplexity of documents at each round. TYPE is the type of document to measure perplexity on. See `--em-halt-using-perplexity' for choices for TYPE.
--em-print-top-words
Print the top 10 words per class for each EM iteration.
--em-save-probs
On each EM iteration, save all P(C|w) to a file.
--em-set-vocab-from-unlabeled
Remove words from the vocabulary not used in the unlabeled data.
--em-stat-method=STAT
The method to convert scores to probabilities. The default is `nb_score'.
--em-temp-reduce=NUM
Temperature reduction factor for deterministic annealing. Default is 0.9.
--em-temperature=NUM
Initial temperature for deterministic annealing. Default is 200.
--em-unlabeled-normalizer=NUM
Number of unlabeled docs it takes to equal a labeled doc. Defaults to one.
--em-unlabeled-start=TYPE
When initializing the EM starting point, how the unlabeled docs contribute. Default is `zero'. Other choices are `prior', `random', and `even'.
Active Learning options:
--active-add-per-round=NUM
Specify the number of documents to label each round. The default is 4.
--active-beta=NUM
Increase spread of document densities.
--active-binary-pos=CLASS
The name of the positive class for binary classification. Required for relevance sampling.
--active-committee-size=NUM
The number of committee members to use with QBC. Default is 1.
--active-final-em
Finish with a full round of EM.
--active-no-final-em
Finish without a full round of EM.
--active-num-rounds=NUM
The number of active learning rounds to perform. The default is 10.
--active-perturb-after-em
Perturb after running EM to create committee members.
--active-pr-print-stat-summary
Print the precision recall curves used for score to probability remapping.
--active-pr-window-size=NUM
Set the window size for precision-recall score to probability remapping. The default is 20.
--active-print-committee-matrices
Print the confusion matrix for each committee member at each round.
--active-qbc-low-kl
Select documents with the lowest kl-divergence instead of the highest.
--active-remap-scores-pr
Remap scores with sneaky precision-recall tricks.
--active-secondary-method=METHOD
The underlying method for active learning to use. The default is 'naivebayes'.
--active-selection-method=METHOD
Specify the selection method for picking unlabeled docs. One of uncertainty, relevance, qbc, random. The default is 'uncertainty'.
--active-stream-epsilon=NUM
The rate factor for selecting documents in stream sampling.
--active-test-stats
Generate output for test docs every n rounds.
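For example, an illustrative active-learning run using naive Bayes as the underlying method and 5 rounds of labeling (all parameters illustrative, assuming a model indexed under `~/model'):

rainbow -d ~/model -m active --active-secondary-method=naivebayes --active-num-rounds=5 --test-set=0.3 -t 1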
-?, --help
Give this help list
--usage
Give a short usage message
-V, --version
Print program version

Mandatory or optional arguments to long options are also mandatory or optional for any corresponding short options.

REPORTING BUGS

Please report bugs related to this program to Andrew McCallum <mccallum@cs.cmu.edu>. If the bugs are related to the Debian package send bugs to submit@bugs.debian.org.

SEE ALSO

arrow(1), archer(1), crossbow(1).

The full documentation for rainbow will be provided as a Texinfo manual. If the info and rainbow programs are properly installed at your site, the command

info rainbow

should give you access to the complete manual.

You can also find documentation and updates for libbow at http://www.cs.cmu.edu/~mccallum/bow