NAME
crossbow - a front-end with hierarchical clustering and deterministic annealing
SYNOPSIS
crossbow [OPTION...] [ARG...]
DESCRIPTION
Crossbow is a document-clustering front-end to libbow. This brief man page was written for the Debian GNU/Linux distribution because the original package does not include one.
Note that crossbow is not a supported program.
OPTIONS
- For building data structures from text files:
- --build-hier-from-dir
- When indexing a single directory, use the directory structure to build a class hierarchy
- -c, --cluster
- cluster the documents, and write the results to disk
- --classify
- Split the data into train/test, and classify the test data, outputting results in rainbow format
- --classify-files=DIRNAME
- Classify documents in DIRNAME, outputting one `filename classname' pair per line.
- --cluster-output-dir=DIR
- After clustering is finished, write the clusters to directory DIR
- -i, --index
- tokenize training documents found under ARG..., build weight vectors, and save them to disk
- --index-multiclass-list=FILE
- Index the files listed in FILE. Each line of FILE should contain a filename followed by a list of classnames to which that file belongs.
- --print-doc-names[=TAG]
- Print the filenames of documents contained in the model. If the optional TAG argument is given, print only the documents that have the specified tag.
- --print-matrix
- Print the word/document count matrix in an awk- or perl-accessible format. Format is sparse and includes the words and the counts.
- --print-word-probabilities=FILEPREFIX
- Print the word probability distribution in each leaf to files named FILEPREFIX-classname
- --query-server=PORTNUM
- Run crossbow in server mode, listening on socket number PORTNUM. You can try it by executing this command, then, in a different shell window on the same machine, typing `telnet localhost PORTNUM'.
- --use-vocab-in-file=FILENAME
- Limit vocabulary to just those words read as space-separated strings from FILENAME.
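- For example (the model directory and corpus paths here are hypothetical, and the option combination is only a sketch), a typical session indexes a directory tree whose subdirectories name the classes, clusters it, and then answers queries over a socket:
- crossbow -d ~/model --build-hier-from-dir -i ~/corpus
- crossbow -d ~/model --cluster
- crossbow -d ~/model --query-server=3838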
- Splitting options:
- --ignore-set=SOURCE
- How to select the ignored documents. Same format as --test-set. Default is `0'.
- --set-files-use-basename[=N]
- When using files to specify doc types, compare only the last N components of the doc's pathname; that is, use the filename and the last N-1 directory names. If N is not specified, it defaults to 1.
- --test-set=SOURCE
- How to select the testing documents. A number between 0 and 1 inclusive, written with a decimal point, indicates a random fraction of all documents; the number of documents selected from each class attempts to match the class proportions among the non-ignored documents. A number with no decimal point indicates the number of documents to select at random. Alternatively, a suffix of `pc' indicates a per-class number of documents to tag. The suffix `t' on a number or proportion indicates that documents should be tagged from the pool of training documents, not the untagged documents. `remaining' selects all documents that remain untagged at the end. Anything else is interpreted as the name of a file listing documents to select. Default is `0.0'.
- --train-set=SOURCE
- How to select the training documents. Same format as --test-set. Default is `remaining'.
- --unlabeled-set=SOURCE
- How to select the unlabeled documents. Same format as --test-set. Default is `0'.
- --validation-set=SOURCE
- How to select the validation documents. Same format as --test-set. Default is `0'.
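- For example (model directory hypothetical), these options can hold out a random 20% of the documents for testing and train on the rest:
- crossbow -d ~/model --classify --test-set=0.2 --train-set=remaining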
- Hierarchical EM Clustering options:
- --hem-branching-factor=NUM
- Number of clusters to create. Default is 2.
- --hem-deterministic-horizontal
- In the horizontal E-step for a document, set to zero the membership probabilities of all leaves, except the one matching the document's filename
- --hem-garbage-collection
- Add extra /Misc/ children to every internal node of the hierarchy, and keep their local word distributions flat
- --hem-incremental-labeling
- Instead of using all unlabeled documents in the M-step, use only the labeled documents, and incrementally label those unlabeled documents that are most confidently classified in the E-step
- --hem-lambdas-from-validation=NUM
- Instead of setting the lambdas from the labeled/unlabeled data (possibly with LOO), set them using held-out validation data. 0<NUM<1 is the fraction of unlabeled documents held out for validation just before EM training of the classifier begins. Default is 0, which leaves this option off.
- --hem-max-num-iterations=NUM
- Do no more iterations of EM than this.
- --hem-maximum-depth=NUM
- The hierarchy depth beyond which it will not split. Default is 6.
- --hem-no-loo
- Do not use leave-one-out evaluation during the E-step.
- --hem-no-shrinkage
- Use only the clusters at the leaves; do not do anything with the hierarchy.
- --hem-no-vertical-word-movement
- Use EM just to set the vertical priors, not to set the vertical word distribution; i.e. do not do `full-EM'.
- --hem-pseudo-labeled
- After using the labels to set the starting point for EM, change all training documents to unlabeled, so that they can have their class labels re-assigned by EM. Useful for imperfectly labeled training data.
- --hem-restricted-horizontal
- In the horizontal E-step for a document, set to zero the membership probabilities of all leaves whose names are not found in the document's filename
- --hem-split-kl-threshold=NUM
- KL divergence value at which tree leaves will be split. Default is 0.2
- --hem-temperature-decay=NUM
- Temperature decay factor. Default is 0.9.
- --hem-temperature-end=NUM
- The final value of T. Default is 1.
- --hem-temperature-start=NUM
- The initial value of T.
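- A sketch of a run combining these options (the model directory and the starting temperature of 100 are arbitrary illustrations): build a hierarchy with three children per node and at most four levels, annealing the temperature from 100 down to 1:
- crossbow -d ~/model -m hem-cluster --cluster --hem-branching-factor=3 --hem-maximum-depth=4 --hem-temperature-start=100 --hem-temperature-end=1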
- General options
- --annotations=FILE
- The sarray file containing annotations for the files in the index
- -b, --no-backspaces
- Don't use backspace when verbosifying progress (good for use in emacs)
- -d, --data-dir=DIR
- Set the directory in which to read/write word-vector data (default=~/.<program_name>).
- --random-seed=NUM
- The non-negative integer to use for seeding the random number generator
- --score-precision=NUM
- The number of decimal digits to print when displaying document scores
- -v, --verbosity=LEVEL
- Set amount of info printed while running; (0=silent, 1=quiet, 2=show-progress, ..., 5=max)
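- For instance (paths hypothetical), to keep the model data in a non-default directory and show more progress output while indexing:
- crossbow -d /tmp/bowdata -v 3 -i ~/corpus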
- Lexing options
- --append-stoplist-file=FILE
- Add words in FILE to the stoplist.
- --exclude-filename=FILENAME
- When scanning directories for text files, skip files with name matching FILENAME.
- -g, --gram-size=N
- Create tokens for all 1-grams,... N-grams.
- -h, --skip-header
- Avoid lexing news/mail headers by scanning forward until two newlines.
- --istext-avoid-uuencode
- Check for uuencoded blocks before saying that the file is text, and say no if there are many lines of the same length.
- --lex-pipe-command=SHELLCMD
- Pipe files through this shell command before lexing them.
- --max-num-words-per-document=N
- Only tokenize the first N words in each document.
- --no-stemming
- Do not modify lexed words with a stemming function. (usually the default, depending on lexer)
- --replace-stoplist-file=FILE
- Empty the default stoplist, and add space-delimited words from FILE.
- -s, --no-stoplist
- Do not toss lexed words that appear in the stoplist.
- --shortest-word=LENGTH
- Toss lexed words that are shorter than LENGTH. Default is usually 2.
- -S, --use-stemming
- Modify lexed words with the `Porter' stemming function.
- --use-stoplist
- Toss lexed words that appear in the stoplist. (usually the default SMART stoplist, depending on lexer)
- --use-unknown-word
- When used in conjunction with -O or -D, captures all words with occurrence counts below threshold as the `<unknown>' token
- --xxx-words-only
- Only tokenize words with `xxx' in them
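- For instance (mail directory hypothetical), to index mail messages while skipping their headers, tokenizing 1-grams and 2-grams, and tossing words shorter than three characters:
- crossbow -d ~/model -i --skip-header --gram-size=2 --shortest-word=3 ~/mail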
- Mutually exclusive choice of lexers
- --flex-mail
- Use a mail-specific flex lexer
- --flex-tagged
- Use a tagged flex lexer
- -H, --skip-html
- Skip HTML tokens when lexing.
- --lex-alphanum
- Use a special lexer that includes digits in tokens, delimiting tokens only by non-alphanumeric characters.
- --lex-infix-string=ARG
- Use only the characters after ARG in each word for stoplisting and stemming. If a word does not contain ARG, the entire word is used.
- --lex-suffixing
- Use a special lexer that adds suffixes depending on Email-style headers.
- --lex-white
- Use a special lexer that delimits tokens by whitespace only, and does not change the contents of the token at all: no downcasing, no stemming, no stoplist, nothing. Ideal for use with an externally-written lexer interfaced to crossbow with --lex-pipe-command.
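- As an example of pairing a lexer choice with an external tokenizer (the `mytokenizer' command is hypothetical), pipe each file through it and accept its whitespace-separated output verbatim:
- crossbow -d ~/model -i --lex-white --lex-pipe-command='mytokenizer' ~/corpus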
- Feature-selection options
- -D, --prune-vocab-by-doc-count=N
- Remove words that occur in N or fewer documents.
- -O, --prune-vocab-by-occur-count=N
- Remove words that occur fewer than N times.
- -T, --prune-vocab-by-infogain=N
- Remove all but the top N words by selecting words with highest information gain.
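- These can be combined at indexing time; for example (paths hypothetical), to drop words appearing in two or fewer documents and then keep only the 10000 remaining words with highest information gain:
- crossbow -d ~/model -i -D 2 -T 10000 ~/corpus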
- Weight-vector setting/scoring method options
- --binary-word-counts
- Instead of using integer occurrence counts of words to set weights, use binary absence/presence.
- --event-document-then-word-document-length=NUM
- Set the normalized length of documents when --event-model=document-then-word
- --event-model=EVENTNAME
- Set what objects will be considered the `events' of the probabilistic model. EVENTNAME can be one of: word, document, document-then-word.
- Default is `word'.
- --infogain-event-model=EVENTNAME
- Set what objects will be considered the `events' when information gain is calculated. EVENTNAME can be one of: word, document, document-then-word.
- Default is `document'.
- -m, --method=METHOD
- Set the word weight-setting method; METHOD may be one of: fienberg-classify, hem-classify, hem-cluster, multiclass, naivebayes. Default is `naivebayes'.
- --print-word-scores
- During scoring, print the contribution of each word to each class.
- --smoothing-dirichlet-filename=FILE
- The file containing the alphas for the Dirichlet smoothing.
- --smoothing-dirichlet-weight=NUM
- The weighting factor by which to multiply the alphas for Dirichlet smoothing.
- --smoothing-goodturing-k=NUM
- Smooth word probabilities for words that occur NUM or fewer times. The default is 7.
- --smoothing-method=METHOD
- Set the method for smoothing word probabilities to avoid zeros; METHOD may be one of: goodturing, laplace, mestimate, wittenbell
- --uniform-class-priors
- When setting weights, calculating infogain, and scoring, use equal prior probabilities on classes.
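- For example (model directory hypothetical), to classify with the default naive Bayes method, Laplace smoothing, and uniform class priors:
- crossbow -d ~/model --classify -m naivebayes --smoothing-method=laplace --uniform-class-priors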
- -?, --help
- Give this help list
- --usage
- Give a short usage message
- -V, --version
- Print program version
Mandatory or optional arguments to long options are also mandatory or optional for any corresponding short options.
REPORTING BUGS
Please report bugs related to this program to Andrew McCallum <mccallum@cs.cmu.edu>. If the bugs are related to the Debian package, send them to submit@bugs.debian.org.
SEE ALSO
arrow(1), archer(1), rainbow(1). The full documentation for crossbow will be provided as a Texinfo manual. If the info and crossbow programs are properly installed at your site, the command
- info crossbow
should give you access to the complete manual.
You can also find documentation and updates for libbow at http://www.cs.cmu.edu/~mccallum/bow