man crossbow (Commands) - a front-end with hierarchical clustering and deterministic annealing

NAME

crossbow - a front-end with hierarchical clustering and deterministic annealing

SYNOPSIS

crossbow [OPTION...] [ARG...]

DESCRIPTION

Crossbow is a document clustering front-end to libbow. This brief manpage was written for the Debian GNU/Linux distribution because the main package does not provide one.

Note that crossbow is not a supported program.

OPTIONS

For building data structures from text files:
--build-hier-from-dir
When indexing a single directory, use the directory structure to build a class hierarchy
-c, --cluster
cluster the documents, and write the results to disk
--classify
Split the data into train/test, and classify the test data, outputting results in rainbow format
--classify-files=DIRNAME
Classify documents in DIRNAME, outputting `filename classname' pairs on each line.
--cluster-output-dir=DIR
After clustering is finished, write the cluster to directory DIR
-i, --index
tokenize training documents found under ARG..., build weight vectors, and save them to disk
--index-multiclass-list=FILE
Index the files listed in FILE. Each line of FILE should contain a filename followed by a list of classnames to which that file belongs.
--print-doc-names[=TAG]
Print the filenames of documents contained in the model. If the optional TAG argument is given, print only the documents that have the specified tag.
--print-matrix
Print the word/document count matrix in an awk- or perl-accessible format. Format is sparse and includes the words and the counts.
--print-word-probabilities=FILEPREFIX
Print the word probability distribution in each leaf to files named FILEPREFIX-classname
--query-server=PORTNUM
Run crossbow in server mode, listening on socket number PORTNUM. You can try it by executing this command, then in a different shell window on the same machine typing `telnet localhost PORTNUM'.
--use-vocab-in-file=FILENAME
Limit vocabulary to just those words read as space-separated strings from FILENAME.
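
The list file expected by --index-multiclass-list can be pictured with a short Python sketch. It assumes that fields are separated by whitespace and that filenames contain no spaces; the description above does not state the separator, and the function name and sample paths are hypothetical. It is not crossbow's own parser.

```python
def parse_multiclass_list(text):
    """Map each filename to the class names that follow it on its line.

    Assumption: fields are whitespace-separated, filename first.
    """
    entries = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        filename, classnames = fields[0], fields[1:]
        entries[filename] = classnames
    return entries


# Hypothetical list file: one document per line, then its classes.
sample = "docs/a.txt sports news\ndocs/b.txt politics\n"
parsed = parse_multiclass_list(sample)
```

Here `parsed` maps each path to its list of classes, so a document may belong to several classes at once.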
Splitting options:
--ignore-set=SOURCE
How to select the ignored documents. Same format as --test-set. Default is `0'.
--set-files-use-basename[=N]
When using files to specify doc types, compare only the last N components of the doc's pathname. That is, use the filename and the last N-1 directory names. If N is not specified, it defaults to 1.
--test-set=SOURCE
How to select the testing documents. A number between 0 and 1 inclusive with a decimal point indicates a random fraction of all documents. The number of documents selected from each class is determined by attempting to match the proportions of the non-ignored documents. A number with no decimal point indicates the number of documents to select randomly. Alternatively, a suffix of `pc' indicates the number of documents per class to tag. The suffix `t' for a number or proportion indicates that documents should be tagged from the pool of training documents, not the untagged documents. `remaining' selects all documents that remain untagged at the end. Anything else is interpreted as a filename listing documents to select. Default is `0.0'.
--train-set=SOURCE
How to select the training documents. Same format as --test-set. Default is `remaining'.
--unlabeled-set=SOURCE
How to select the unlabeled documents. Same format as --test-set. Default is `0'.
--validation-set=SOURCE
How to select the validation documents. Same format as --test-set. Default is `0'.
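
The SOURCE format shared by the splitting options can be read as a small grammar. The Python sketch below classifies a SOURCE string according to the --test-set description. It is an illustration, not crossbow's parser: the function name, the returned dictionary, and the assumed suffix order (`pc' before `t') are hypothetical.

```python
import re


def parse_source(spec):
    """Classify a SOURCE string following the --test-set description."""
    if spec == "remaining":
        return {"kind": "remaining"}

    # A number with a decimal point in [0, 1]: a random fraction.
    m = re.fullmatch(r"(\d+\.\d*|\.\d+)(t?)", spec)
    if m and 0.0 <= float(m.group(1)) <= 1.0:
        return {"kind": "fraction",
                "value": float(m.group(1)),
                "from_training": m.group(2) == "t"}

    # A number with no decimal point: a document count,
    # per class when followed by the `pc' suffix.
    m = re.fullmatch(r"(\d+)(pc)?(t?)", spec)
    if m:
        return {"kind": "count",
                "value": int(m.group(1)),
                "per_class": m.group(2) == "pc",
                "from_training": m.group(3) == "t"}

    # Anything else is a filename listing the documents to select.
    return {"kind": "file", "path": spec}
```

Under this reading, `0.3' is a fraction, `10pc' is ten documents per class, and `0' (the default of several options) is a count of zero documents.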
Hierarchical EM Clustering options:
--hem-branching-factor=NUM
Number of clusters to create. Default is 2.
--hem-deterministic-horizontal
In the horizontal E-step for a document, set to zero the membership probabilities of all leaves, except the one matching the document's filename
--hem-garbage-collection
Add extra /Misc/ children to every internal node of the hierarchy, and keep their local word distributions flat
--hem-incremental-labeling
Instead of using all unlabeled documents in the M-step, use only the labeled documents, and incrementally label those unlabeled documents that are most confidently classified in the E-step
--hem-lambdas-from-validation=NUM
Instead of setting the lambdas from the labeled/unlabeled data (possibly with LOO), set them using held-out validation data. 0<NUM<1 is the fraction of unlabeled documents just before EM training of the classifier begins. Default is 0, which leaves this option off.
--hem-max-num-iterations=NUM
Do no more iterations of EM than this.
--hem-maximum-depth=NUM
The hierarchy depth beyond which it will not split. Default is 6.
--hem-no-loo
Do not use leave-one-out evaluation during the E-step.
--hem-no-shrinkage
Use only the clusters at the leaves; do not do anything with the hierarchy.
--hem-no-vertical-word-movement
Use EM just to set the vertical priors, not to set the vertical word distribution; i.e. do not do `full-EM'.
--hem-pseudo-labeled
After using the labels to set the starting point for EM, change all training documents to unlabeled, so that they can have their class labels re-assigned by EM. Useful for imperfectly labeled training data.
--hem-restricted-horizontal
In the horizontal E-step for a document, set to zero the membership probabilities of all leaves whose names are not found in the document's filename
--hem-split-kl-threshold=NUM
KL divergence value at which tree leaves will be split. Default is 0.2
--hem-temperature-decay=NUM
Temperature decay factor. Default is 0.9.
--hem-temperature-end=NUM
The final value of T. Default is 1.
--hem-temperature-start=NUM
The initial value of T.
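
The three temperature options suggest a geometric cooling schedule. The Python sketch below assumes that T starts at --hem-temperature-start, is multiplied by --hem-temperature-decay after each step, and stops at --hem-temperature-end. This update rule is an assumption based on the option names, not a confirmed description of crossbow's annealing loop. The starting value has no documented default here, so it is a required argument.

```python
def temperature_schedule(start, end=1.0, decay=0.9):
    """Return the temperatures visited, from start down to end.

    Assumed rule: T <- T * decay, clamped at end.
    """
    if not 0.0 < decay < 1.0:
        raise ValueError("decay must lie strictly between 0 and 1")
    if start < end:
        raise ValueError("start must not be below end")
    temps = []
    t = start
    while t > end:
        temps.append(t)
        t *= decay
    temps.append(end)  # always finish exactly at the final temperature
    return temps


schedule = temperature_schedule(start=2.0)
```

With the documented defaults for the decay (0.9) and the final temperature (1), a smaller decay factor cools faster and visits fewer temperatures.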
General options
--annotations=FILE
The sarray file containing annotations for the files in the index
-b, --no-backspaces
Don't use backspace when verbosifying progress (good for use in emacs)
-d, --data-dir=DIR
Set the directory in which to read/write word-vector data (default=~/.<program_name>).
--random-seed=NUM
The non-negative integer to use for seeding the random number generator
--score-precision=NUM
The number of decimal digits to print when displaying document scores
-v, --verbosity=LEVEL
Set amount of info printed while running (0=silent, 1=quiet, 2=show-progress,...5=max)
Lexing options
--append-stoplist-file=FILE
Add words in FILE to the stoplist.
--exclude-filename=FILENAME
When scanning directories for text files, skip files with name matching FILENAME.
-g, --gram-size=N
Create tokens for all 1-grams,... N-grams.
-h, --skip-header
Avoid lexing news/mail headers by scanning forward until two newlines.
--istext-avoid-uuencode
Check for uuencoded blocks before saying that the file is text, and say no if there are many lines of the same length.
--lex-pipe-command=SHELLCMD
Pipe files through this shell command before lexing them.
--max-num-words-per-document=N
Only tokenize the first N words in each document.
--no-stemming
Do not modify lexed words with a stemming function. (usually the default, depending on lexer)
--replace-stoplist-file=FILE
Empty the default stoplist, and add space-delimited words from FILE.
-s, --no-stoplist
Do not toss lexed words that appear in the stoplist.
--shortest-word=LENGTH
Toss lexed words that are shorter than LENGTH. Default is usually 2.
-S, --use-stemming
Modify lexed words with the `Porter' stemming function.
--use-stoplist
Toss lexed words that appear in the stoplist. (usually the default SMART stoplist, depending on lexer)
--use-unknown-word
When used in conjunction with -O or -D, captures all words with occurrence counts below threshold as the `<unknown>' token
--xxx-words-only
Only tokenize words with `xxx' in them
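
To make the --gram-size description concrete, the Python sketch below lists every 1-gram through N-gram of a word sequence. How crossbow joins the words of a multi-word token is not stated above, so the `_' separator and the function name are hypothetical.

```python
def ngram_tokens(words, n, sep="_"):
    """Return all 1-grams up to n-grams of words, shortest first.

    Assumption: multi-word tokens are joined with sep.
    """
    tokens = []
    for size in range(1, n + 1):
        for start in range(len(words) - size + 1):
            tokens.append(sep.join(words[start:start + size]))
    return tokens


tokens = ngram_tokens(["the", "quick", "fox"], 2)
```

For a three-word input and N=2 this gives the three single words followed by the two adjacent pairs, which shows how quickly the vocabulary grows with N.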
Mutually exclusive choice of lexers
--flex-mail
Use a mail-specific flex lexer
--flex-tagged
Use a tagged flex lexer
-H, --skip-html
Skip HTML tokens when lexing.
--lex-alphanum
Use a special lexer that includes digits in tokens, delimiting tokens only by non-alphanumeric characters.
--lex-infix-string=ARG
Use only the characters after ARG in each word for stoplisting and stemming. If a word does not contain ARG, the entire word is used.
--lex-suffixing
Use a special lexer that adds suffixes depending on Email-style headers.
--lex-white
Use a special lexer that delimits tokens by whitespace only, and does not change the contents of the token at all: no downcasing, no stemming, no stoplist, nothing. Ideal for use with an externally-written lexer interfaced to rainbow with --lex-pipe-command.
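
The rule described for --lex-infix-string can be sketched in a few lines of Python. The sketch assumes that the first occurrence of ARG is the one that counts, which the description does not specify; the function name is hypothetical.

```python
def infix_tail(word, infix):
    """Return the part of word after infix, or the whole word if absent.

    Assumption: only the first occurrence of infix is considered.
    """
    head, found, tail = word.partition(infix)
    return tail if found else word


with_infix = infix_tail("subject:meeting", ":")
without_infix = infix_tail("meeting", ":")
```

Both calls yield the same string, so a word would be stoplisted and stemmed identically whether or not it carries the prefix.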
Feature-selection options
-D, --prune-vocab-by-doc-count=N
Remove words that occur in N or fewer documents.
-O, --prune-vocab-by-occur-count=N
Remove words that occur fewer than N times.
-T, --prune-vocab-by-infogain=N
Remove all but the top N words by selecting words with highest information gain.
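
The two count-based pruning options use different boundaries: -D removes words found in N or fewer documents, while -O removes words occurring fewer than N times. The Python sketch below follows those two descriptions literally; the function names and the toy corpus are hypothetical, and it does not reproduce crossbow's internal data structures.

```python
from collections import Counter


def keep_by_doc_count(docs, n):
    """Keep words that appear in more than n documents (-D removes <= n)."""
    doc_counts = Counter(word for doc in docs for word in set(doc))
    return {word for word, count in doc_counts.items() if count > n}


def keep_by_occur_count(docs, n):
    """Keep words that occur at least n times overall (-O removes < n)."""
    occurrences = Counter(word for doc in docs for word in doc)
    return {word for word, count in occurrences.items() if count >= n}


# Toy corpus of already-tokenized documents.
docs = [["a", "b", "a"], ["a", "c"], ["b", "a"]]
kept_by_docs = keep_by_doc_count(docs, 2)
kept_by_occurrences = keep_by_occur_count(docs, 2)
```

In the toy corpus, `b' appears twice in two documents, so it survives -O 2 but not -D 2.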
Weight-vector setting/scoring method options
--binary-word-counts
Instead of using integer occurrence counts of words to set weights, use binary absence/presence.
--event-document-then-word-document-length=NUM
Set the normalized length of documents when --event-model=document-then-word
--event-model=EVENTNAME
Set what objects will be considered the `events' of the probabilistic model. EVENTNAME can be one of: word, document, document-then-word. Default is `word'.
--infogain-event-model=EVENTNAME
Set what objects will be considered the `events' when information gain is calculated. EVENTNAME can be one of: word, document, document-then-word. Default is `document'.
-m, --method=METHOD
Set the word weight-setting method; METHOD may be one of: fienberg-classify, hem-classify, hem-cluster, multiclass, default=naivebayes.
--print-word-scores
During scoring, print the contribution of each word to each class.
--smoothing-dirichlet-filename=FILE
The file containing the alphas for the dirichlet smoothing.
--smoothing-dirichlet-weight=NUM
The weighting factor by which to multiply the alphas for dirichlet smoothing.
--smoothing-goodturing-k=NUM
Smooth word probabilities for words that occur NUM or fewer times. The default is 7.
--smoothing-method=METHOD
Set the method for smoothing word probabilities to avoid zeros; METHOD may be one of: goodturing, laplace, mestimate, wittenbell
--uniform-class-priors
When setting weights, calculating infogain and scoring, use equal prior probabilities on classes.
-?, --help
Give this help list
--usage
Give a short usage message
-V, --version
Print program version

Mandatory or optional arguments to long options are also mandatory or optional for any corresponding short options.
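
The purpose of --smoothing-method, avoiding zero word probabilities, can be illustrated with the textbook form of Laplace (add-one) smoothing. The Python sketch below is that textbook formula only; crossbow's `laplace' method may differ in detail, and the function name and sample counts are hypothetical.

```python
def laplace_probabilities(counts, vocabulary):
    """Add-one smoothing: P(w) = (count(w) + 1) / (total + |V|)."""
    total = sum(counts.get(word, 0) for word in vocabulary)
    denominator = total + len(vocabulary)
    return {word: (counts.get(word, 0) + 1) / denominator
            for word in vocabulary}


# The word "c" was never observed but still receives a non-zero probability.
probabilities = laplace_probabilities({"a": 3, "b": 1}, ["a", "b", "c"])
```

The smoothed values still sum to one over the vocabulary, which is why the vocabulary size appears in the denominator.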

REPORTING BUGS

Please report bugs related to this program to Andrew McCallum <mccallum@cs.cmu.edu>. If the bugs are related to the Debian package, send them to submit@bugs.debian.org.

SEE ALSO

arrow(1), archer(1), rainbow(1). The full documentation for crossbow will be provided as a Texinfo manual. If the info and crossbow programs are properly installed at your site, the command

info crossbow

should give you access to the complete manual.

You can also find documentation and updates for libbow at http://www.cs.cmu.edu/~mccallum/bow