Class: Corpus | Voyant Tools Help

new Corpus(id)
`voyantjs/src/corpus.js`, `line 32`

Create a new Corpus using the specified Corpus ID

Parameters:

Name	Type	Description
`id`	string	The Corpus ID

Methods

analysis(config) → {Promise.<Object>} voyantjs/src/corpus.js, line 1335

Performs one of several dimension reduction statistical analysis techniques.

For more details see the scatterplot tutorial.

Parameters:

Name	Type	Description
`config`	Object	an Object specifying parameters (see the properties below)

Properties

Name	Type	Description
`type`	string	The type of analysis technique to use: 'ca', 'pca', 'tsne', 'docsim'
`start`	number	The zero-based start of the list
`limit`	number	A limit to the number of items to return at a time
`dimensions`	number	The number of dimensions to render, either 2 or 3.
`bins`	number	The number of bins to separate a document into.
`clusters`	number	The number of clusters within which to group words.
`perplexity`	number	The TSNE perplexity value.
`iterations`	number	The TSNE iterations value.
`comparisonType`	string	The value to use for comparing terms. Options are: 'raw', 'relative', and 'tfidf'.
`target`	string	The term to set as the target. This will filter results to terms that are near the target.
`term`	string	Used in combination with "target" as a white list of terms to keep.
`query`	string	A term query (see search tutorial)
`stopList`	string	A list of stopwords to include (see stopwords tutorial)

Returns:

Promise.<Object>
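
As a minimal sketch, the documented values for `type`, `dimensions` and `comparisonType` can be checked before a config object is passed to `analysis`. The helper name `checkAnalysisConfig` is hypothetical and is not part of voyantjs; it only encodes the values listed in the table above.

```javascript
// Allowed values copied from the parameter table above.
const ANALYSIS_TYPES = ["ca", "pca", "tsne", "docsim"];
const COMPARISON_TYPES = ["raw", "relative", "tfidf"];

// Hypothetical helper (not part of voyantjs): returns a list of problems
// found in a config object; an empty list means nothing was flagged.
function checkAnalysisConfig(config) {
  const problems = [];
  if (config.type !== undefined && !ANALYSIS_TYPES.includes(config.type)) {
    problems.push(`type must be one of: ${ANALYSIS_TYPES.join(", ")}`);
  }
  if (config.dimensions !== undefined && ![2, 3].includes(config.dimensions)) {
    problems.push("dimensions must be 2 or 3");
  }
  if (
    config.comparisonType !== undefined &&
    !COMPARISON_TYPES.includes(config.comparisonType)
  ) {
    problems.push(`comparisonType must be one of: ${COMPARISON_TYPES.join(", ")}`);
  }
  return problems;
}

const analysisConfig = { type: "pca", dimensions: 2, comparisonType: "tfidf", limit: 50 };
console.log(checkAnalysisConfig(analysisConfig)); // empty array: nothing flagged
```

A config that passes this check could then be handed to `analysis(config)` on a loaded corpus.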

collocates(config) → {Promise.<Array>} voyantjs/src/corpus.js, line 951

Returns an array of collocates (either document or corpus collocates, depending on the specified mode). Collocates are terms which appear more frequently in proximity to keywords across the corpus or document.

The mode is set to "documents" when any of the following is true:

the mode parameter is set to "documents"
a docIndex parameter is set
a docId parameter is set
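
The rule above can be sketched as a small function. This is an illustration of the documented behaviour, not the library's own code: `resolveMode` is a hypothetical name, and "set" is assumed here to mean neither `undefined` nor `null` (so a `docIndex` of `0` counts as set).

```javascript
// Hypothetical helper (not part of voyantjs): applies the documented rule
// for choosing between corpus mode and documents mode.
function resolveMode(config = {}) {
  // Assumption: a parameter is "set" when it is neither undefined nor null.
  const isSet = (value) => value !== undefined && value !== null;
  if (config.mode === "documents" || isSet(config.docIndex) || isSet(config.docId)) {
    return "documents";
  }
  return "corpus";
}

console.log(resolveMode({ limit: 5 }));           // "corpus"
console.log(resolveMode({ docIndex: 0 }));        // "documents"
console.log(resolveMode({ mode: "documents" }));  // "documents"
```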

The following is an example of a Corpus Collocate (corpus mode):

{
	"term": "love",
	"rawFreq": 568,
	"contextTerm": "mr",
	"contextTermRawFreq": 24
}

The following is an example of a Document Collocate (documents mode):

{
		"docIndex": 4,
		"keyword": "love",
		"keywordContextRawFrequency": 124,
		"term": "fanny",
		"termContextRawFrequency": 8,
		"termContextRelativeFrequency": 0.021680217,
		"termDocumentRawFrequency": 816,
		"termDocumentRelativeFrequency": 0.0050853477,
		"termContextDocumentRelativeFrequencyDifference": 0.01659487
}

The following config parameters are valid in both modes:

start: the zero-based start index of the list (for paging)
limit: the maximum number of terms to provide per request
query: a term query (see search tutorial)
stopList: a list of stopwords to include (see stopwords tutorial)
collocatesWhitelist: collocates will be limited to this list
context: the size of the context (the number of words on each side of the keyword)
dir: sort direction, ASCending or DESCending

The following are specific to corpus mode:

sort: the order of the terms, one of the following: RAWFREQ, TERM, CONTEXTTERM, CONTEXTTERMRAWFREQ

The following are specific to documents mode:

sort: the order of the terms, one of the following: TERM, REL, RAW, DOCREL, DOCRAW, CONTEXTDOCRELDIFF
docIndex: the zero-based index of the documents to include (use commas to separate multiple values)
docId: the document IDs to include (use commas to separate multiple values)

An example:

// show top 5 collocate terms
loadCorpus("austen").collocates({stopList: 'auto', limit: 5}).then(terms => terms.map(term => term.term))

Parameters:

Name	Type	Description
`config`	Object	an Object specifying parameters (see list above)

Properties

Name	Type	Description
`start`	number	the zero-based start index of the list (for paging)
`limit`	number	the maximum number of terms to provide per request
`query`	string	a term query (see search tutorial)
`stopList`	string	a list of stopwords to include (see stopwords tutorial)
`collocatesWhitelist`	string	collocates will be limited to this list
`context`	number	the size of the context (the number of words on each side of the keyword)
`dir`	string	sort direction, `ASC`ending or `DESC`ending

Returns:

Promise.<Array> -

a Promise for an Array of Terms

contexts(config) → {Promise.<Array>} voyantjs/src/corpus.js, line 866

Returns an array of Objects that contain keywords in contexts (KWICs).

An individual KWIC Object looks something like this:

{
		"docIndex": 0,
		"query": "love",
		"term": "love",
		"position": 0,
		"left": "FREINDSHIP AND OTHER EARLY WORKS",
		"middle": "Love",
		"right": " And Friendship And Other Early"
}

The following are valid in the config parameter:

start: the zero-based start index of the list (for paging)
limit: the maximum number of terms to provide per request
query: a term query (see search tutorial)
sort: the order of the contexts: TERM, DOCINDEX, POSITION, LEFT, RIGHT
dir: sort direction, **ASC**ending or **DESC**ending
perDocLimit: the limit parameter applies to the total number of terms returned; this parameter lets you specify a limit per document
stripTags: for the left, middle and right values, one of the following: ALL, BLOCKSONLY (tries to maintain blocks for line formatting), NONE (default)
context: the size of the context (the number of words on each side of the keyword)
docIndex: the zero-based index of the documents to include (use commas to separate multiple values)
docId: the document IDs to include (use commas to separate multiple values)
overlapStrategy: determines how to handle cases where there's overlap between KWICs, such as "to be or not to be" when the keyword is "be"; here are the options:
- none: never mind the overlap, keep all words
  - {left: "to", middle: "be", right: "or not to be"}
  - {left: "to be or not to", middle: "be", right: ""}
- first: priority goes to the first occurrence (some may be dropped)
  - {left: "to", middle: "be", right: "or not to be"}
- merge: balance the words between overlapping occurrences
  - {left: "to", middle: "be", right: "or"}
  - {left: "not to", middle: "be", right: ""}

An example:

// load the first 10 contexts (KWICs) for the query "love"
loadCorpus("austen").contexts({query: "love", limit: 10})

Parameters:

Name	Type	Description
`config`	Object	an Object specifying parameters (see above)

Properties

Name	Type	Description
`start`	number	the zero-based start index of the list (for paging)
`limit`	number	the maximum number of terms to provide per request
`query`	string	a term query (see search tutorial)
`sort`	string	the order of the contexts: `TERM, DOCINDEX, POSITION, LEFT, RIGHT`
`dir`	string	sort direction, `ASC`ending or `DESC`ending
`perDocLimit`	number	the `limit` parameter applies to the total number of terms returned; this parameter lets you specify a limit per document
`stripTags`	string	for the `left`, `middle` and `right` values, one of the following: `ALL`, `BLOCKSONLY` (tries to maintain blocks for line formatting), `NONE` (default)
`context`	number	the size of the context (the number of words on each side of the keyword)
`docIndex`	number	the zero-based index of the documents to include (use commas to separate multiple values)
`docId`	string	the document IDs to include (use commas to separate multiple values)
`overlapStrategy`	string	determines how to handle cases where there's overlap between KWICs, such as "to be or not to be" when the keyword is "be"

Returns:

Promise.<Array> -

a Promise for an Array of KWIC Objects
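
Once the KWIC objects have been retrieved, they can be laid out as aligned concordance lines. The following is a minimal sketch that works on plain objects with the `left`, `middle` and `right` properties shown above; `formatKwic` is a hypothetical helper, not a voyantjs function.

```javascript
// Hypothetical helper: renders one KWIC object as a fixed-width line,
// right-aligning the left context so that keywords line up in a column.
function formatKwic(kwic, width = 30) {
  const left = kwic.left.trim().slice(-width).padStart(width);
  const right = kwic.right.trim().slice(0, width);
  return `${left} [${kwic.middle}] ${right}`;
}

// Sample object copied from the example above.
const sampleKwic = {
  docIndex: 0,
  query: "love",
  term: "love",
  position: 0,
  left: "FREINDSHIP AND OTHER EARLY WORKS",
  middle: "Love",
  right: " And Friendship And Other Early"
};

console.log(formatKwic(sampleKwic, 20));
```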

correlations(config) → {Promise.<Array>} voyantjs/src/corpus.js, line 1136

Returns an array of correlations (either document or corpus correlations, depending on the specified mode).

The mode is set to "documents" when any of the following is true:

the mode parameter is set to "documents"
a docIndex parameter is set
a docId parameter is set

The following is an example of a Corpus correlation (corpus mode):

{
	"source": {
		"term": "mrs",
		"inDocumentsCount": 8,
		"rawFreq": 2531,
		"relativePeakedness": 0.46444246,
		"relativeSkewness": -0.44197384
	},
	"target": {
		"term": "love",
		"inDocumentsCount": 8,
		"rawFreq": 568,
		"relativePeakedness": 5.763066,
		"relativeSkewness": 2.2536576
	},
	"correlation": -0.44287738,
	"significance": 0.08580014
}

The following is an example of a Document correlation (documents mode), without positions requested:

{
	"source": {
		"term": "confide",
		"rawFreq": 3,
		"relativeFreq": 89.3948,
		"zscore": -0.10560975,
		"zscoreRatio": -0.7541012,
		"tfidf": 1.1168874E-5,
		"totalTermsCount": 33559,
		"docIndex": 0,
		"docId": "8a61d5d851a69c03c6ba9cc446713574"
	},
	"target": {
		"term": "love",
		"rawFreq": 54,
		"relativeFreq": 1609.1063,
		"zscore": 53.830048,
		"zscoreRatio": -707.44696,
		"tfidf": 0.0,
		"totalTermsCount": 33559,
		"docIndex": 0,
		"docId": "8a61d5d851a69c03c6ba9cc446713574"
	},
	"correlation": 0.93527687,
	"significance": 7.0970666E-5
}

The following config parameters are valid in both modes:

start: the zero-based start index of the list (for paging)
limit: the maximum number of terms to provide per request
termsOnly: a very compact data view of the correlations
sort: the order of the terms, one of the following: CORRELATION, CORRELATIONABS
dir: sort direction, **ASC**ending or **DESC**ending

The following is specific to corpus mode:

minInDocumentsCountRatio: the minimum coverage (as a percentage between 0 and 100) of the term, amongst all the documents

The following are specific to documents mode:

docIndex: the zero-based index of the documents to include (use commas to separate multiple values)
docId: the document IDs to include (use commas to separate multiple values)

An example:

// load the first 10 correlations for the query "love"
loadCorpus("austen").correlations({query: "love", limit: 10})

Parameters:

Name	Type	Description
`config`	Object	an Object specifying parameters (see above)

Properties

Name	Type	Description
`start`	number	the zero-based start index of the list (for paging)
`limit`	number	the maximum number of terms to provide per request
`minInDocumentsCountRatio`	number	the minimum coverage (as a percentage between 0 and 100) of the term, amongst all the documents
`termsOnly`	boolean	a very compact data view of the correlations
`sort`	string	the order of the terms, one of the following: `CORRELATION`, `CORRELATIONABS`
`dir`	string	sort direction, `ASC`ending or `DESC`ending

Returns:

Promise.<Array> -

a Promise for an Array of correlation Objects
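
The `CORRELATIONABS` sort can also be approximated on the client once results are in hand. The sketch below reorders an array of correlation objects by absolute correlation; `byAbsoluteCorrelation` is a hypothetical helper, and the sample array is trimmed from the two example objects above (mixed here purely for illustration).

```javascript
// Hypothetical helper: returns a new array ordered by the absolute value
// of the correlation, strongest first. The input array is left untouched.
function byAbsoluteCorrelation(correlations) {
  return [...correlations].sort(
    (a, b) => Math.abs(b.correlation) - Math.abs(a.correlation)
  );
}

// Values trimmed from the example objects above.
const sampleCorrelations = [
  { source: { term: "mrs" }, target: { term: "love" }, correlation: -0.44287738 },
  { source: { term: "confide" }, target: { term: "love" }, correlation: 0.93527687 }
];

const labels = byAbsoluteCorrelation(sampleCorrelations).map(
  (c) => `${c.source.term} ~ ${c.target.term}: ${c.correlation.toFixed(2)}`
);
console.log(labels);
```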

documents(config) → {Promise.<Array>} voyantjs/src/corpus.js, line 511

Returns an array of documents metadata for the corpus.

The following are valid in the config parameter:

start: the zero-based start of the list
limit: a limit to the number of items to return at a time
docIndex: a zero-based list of documents (first document is zero, etc.); multiple documents can be separated by a comma
docId: a set of document IDs; multiple documents can be separated by a comma
query: one or more term queries for the title, author or full-text
sort: one of the following sort orders: INDEX, TITLE, AUTHOR, TOKENSCOUNTLEXICAL, TYPESCOUNTLEXICAL, TYPETOKENRATIOLEXICAL, PUBDATE
dir: sort direction, **ASC**ending or **DESC**ending

Parameters:

Name	Type	Description
`config`	Object	an Object specifying parameters (see list above)

Properties

Name	Type	Description
`start`	number	the zero-based start of the list
`limit`	number	a limit to the number of items to return at a time
`docIndex`	number	a zero-based list of documents (first document is zero, etc.); multiple documents can be separated by a comma
`docId`	string	a set of document IDs; multiple documents can be separated by a comma
`query`	string	one or more term queries for the title, author or full-text
`sort`	string	one of the following sort orders: `INDEX`, `TITLE`, `AUTHOR`, `TOKENSCOUNTLEXICAL`, `TYPESCOUNTLEXICAL`, `TYPETOKENRATIOLEXICAL`, `PUBDATE`
`dir`	string	sort direction, `ASC`ending or `DESC`ending

Returns:

Promise.<Array> -

a Promise for an Array of documents metadata
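
Because multiple `docIndex` or `docId` values are passed as a comma-separated list, it can be convenient to build that value from an array. This is a minimal sketch; `toListParam` is a hypothetical helper name.

```javascript
// Hypothetical helper: joins an array of indices or IDs into the
// comma-separated form described for docIndex and docId.
function toListParam(values) {
  return values.map(String).join(",");
}

const documentsConfig = {
  docIndex: toListParam([0, 2, 5]),
  sort: "TITLE",
  dir: "ASC"
};

console.log(documentsConfig.docIndex); // "0,2,5"
```

The resulting object has the shape expected by the `config` parameter described above.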

entities(config) → {Promise.<Array>} voyantjs/src/corpus.js, line 1234

Returns an array of entities.

The config object as parameter can contain the following:

docIndex: document index to restrict to (can be comma-separated list)
annotator: the annotator to use: 'stanford' or 'nssi' or 'spacy'

Parameters:

Name	Type	Description
`config`	Object	an Object specifying parameters (see the properties below)

Properties

Name	Type	Description
`docIndex`	number \| string	document index to restrict to (can be comma-separated list)
`annotator`	string	the annotator to use: 'stanford' or 'nssi' or 'spacy'

Returns:

Promise.<Array>

async filterByCategory(categories, categoryNameopt) → {Promise.<Object>} voyantjs/src/corpus.js, line 1281

Given a Categories instance or ID, returns an object mapping category names to corpus terms. The results can be limited to specific category names by providing one or more of them.

Parameters:

Name	Type	Attributes	Description
`categories`	String \| Spyral.Categories		A categories ID or a Spyral.Categories instance.
`categoryName`	String \| Array.<String>	<optional>	One or more names of categories within the instance.

Returns:

Promise.<Object>

id() → {Promise.<string>} voyantjs/src/corpus.js, line 330

Returns the ID of the corpus.

Returns:

Promise.<string> -

a Promise for the string ID of the corpus

lemmas(config) → {Promise.<Array>} voyantjs/src/corpus.js, line 1163

Get lemmas. This is the equivalent of calling: this.tokens({ withPosLemmas: true, noOthers: true })

Parameters:

Name	Type	Description
`config`	Object	an Object specifying parameters (see the tokens method)

Returns:

Promise.<Array> -

a Promise for an Array of lemma Objects

metadata(config) → {Promise.<object>} voyantjs/src/corpus.js, line 415

Returns the metadata object (of the corpus or document, depending on which mode is used).

The following is an example of the object returned for the metadata of the Jane Austen corpus:

{
	"id": "b50407fd1cbbecec4315a8fc411bad3c",
	"alias": "austen",
	"title": "",
	"subTitle": "",
	"documentsCount": 8,
	"createdTime": 1582429585984,
	"createdDate": "2020-02-22T22:46:25.984-0500",
	"lexicalTokensCount": 781763,
	"lexicalTypesCount": 15368,
	"noPasswordAccess": "NORMAL",
	"languageCodes": [
		"en"
	]
}

The following is an example of what is returned as metadata for the first document:

[
	{
	"id": "ddac6b12c3f4261013c63d04e8d21b45",
	"extra.X-Parsed-By": "org.apache.tika.parser.DefaultParser",
	"tokensCount-lexical": "33559",
	"lastTokenStartOffset-lexical": "259750",
	"parent_modified": "1548457455000",
	"typesCount-lexical": "4235",
	"typesCountMean-lexical": "7.924203",
	"lastTokenPositionIndex-lexical": "33558",
	"index": "0",
	"language": "en",
	"sentencesCount": "1302",
	"source": "stream",
	"typesCountStdDev-lexical": "46.626404",
	"title": "1790 Love And Freindship",
	"parent_queryParameters": "VOYANT_BUILD=M16&textarea-1015-inputEl=Type+in+one+or+more+URLs+on+separate+lines+or+paste+in+a+full+text.&VOYANT_REMOTE_ID=199.229.249.196&accessIP=199.229.249.196&VOYANT_VERSION=2.4&palette=default&suppressTools=false",
	"extra.Content-Type": "text/plain; charset=windows-1252",
	"parentType": "expansion",
	"extra.Content-Encoding": "windows-1252",
	"parent_source": "file",
	"parent_id": "ae47e3a72cd3cad51e196e8a41e21aec",
	"modified": "1432861756000",
	"location": "1790 Love And Freindship.txt",
	"parent_title": "Austen",
	"parent_location": "Austen.zip"
	}
]

In Corpus mode there's no reason to specify arguments. In documents mode you can request specific documents in the config object:

start: the zero-based start of the list
limit: a limit to the number of items to return at a time
docIndex: a zero-based list of documents (first document is zero, etc.); multiple documents can be separated by a comma
docId: a set of document IDs; multiple documents can be separated by a comma
query: one or more term queries for the title, author or full-text
sort: one of the following sort orders: INDEX, TITLE, AUTHOR, TOKENSCOUNTLEXICAL, TYPESCOUNTLEXICAL, TYPETOKENRATIOLEXICAL, PUBDATE
dir: sort direction, **ASC**ending or **DESC**ending

An example:

// this would show the number 8 (the size of the corpus)
loadCorpus("austen").metadata().then(metadata => metadata.documentsCount)

Parameters:

Name	Type	Description
`config`	Object	an Object specifying parameters (see list above)

Returns:

Promise.<object> -

a Promise for an Object containing metadata
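
Note that in the document metadata example above the counts are strings (for instance `"tokensCount-lexical": "33559"`), so they need converting before any arithmetic. The sketch below computes a type/token ratio from such an object; `typeTokenRatio` is a hypothetical helper, and the sample keeps only the fields it needs.

```javascript
// Hypothetical helper: converts the string counts found in document
// metadata to numbers and returns types divided by tokens.
function typeTokenRatio(docMetadata) {
  const types = Number(docMetadata["typesCount-lexical"]);
  const tokens = Number(docMetadata["tokensCount-lexical"]);
  return tokens > 0 ? types / tokens : NaN;
}

// Fields copied from the document metadata example above.
const firstDocument = {
  title: "1790 Love And Freindship",
  "tokensCount-lexical": "33559",
  "typesCount-lexical": "4235"
};

console.log(typeTokenRatio(firstDocument).toFixed(4));
```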

phrases(config) → {Promise.<Array>} voyantjs/src/corpus.js, line 1030

Returns an array of phrases or n-grams (either document or corpus phrases, depending on the specified mode).

The mode is set to "documents" when any of the following is true:

the mode parameter is set to "documents"
a docIndex parameter is set
a docId parameter is set

The following is an example of a Corpus phrase (corpus mode), without distributions requested:

{
	"term": "love with",
	"rawFreq": 103,
	"length": 2
}

The following is an example of a Document phrase (documents mode), without positions requested:

{
	"term": "love with",
	"rawFreq": 31,
	"length": 2,
	"docIndex": 5
}

The following config parameters are valid in both modes:

start: the zero-based start index of the list (for paging)
limit: the maximum number of terms to provide per request
minLength: the minimum length of the phrase
maxLength: the maximum length of the phrase
minRawFreq: the minimum raw frequency of the phrase
sort: the order of the terms, one of the following: RAWFREQ, TERM, LENGTH
dir: sort direction, **ASC**ending or **DESC**ending
overlapFilter: it happens that phrases contain other phrases and we need a strategy for handling overlap:
- NONE: never mind the overlap, keep all phrases
- LENGTHFIRST: priority goes to the longest phrases
- RAWFREQFIRST: priority goes to the highest frequency phrases
- POSITIONFIRST: priority goes to the first phrases

The following are specific to documents mode:

docIndex: the zero-based index of the documents to include (use commas to separate multiple values)
docId: the document IDs to include (use commas to separate multiple values)

An example:

// load the first 10 phrases for the query "love"
loadCorpus("austen").phrases({query: "love", limit: 10})

Parameters:

Name	Type	Description
`config`	Object	an Object specifying parameters (see above)

Properties

Name	Type	Description
`start`	number	the zero-based start index of the list (for paging)
`limit`	number	the maximum number of terms to provide per request
`minLength`	number	the minimum length of the phrase
`maxLength`	number	the maximum length of the phrase
`minRawFreq`	number	the minimum raw frequency of the phrase
`sort`	string	the order of the terms, one of the following: `RAWFREQ, TERM, LENGTH`
`dir`	string	sort direction, `ASC`ending or `DESC`ending
`overlapFilter`	string	it happens that phrases contain other phrases and we need a strategy for handling overlap

Returns:

Promise.<Array> -

a Promise for an Array of phrase Objects
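
Phrase objects carry a `length` property, which makes it easy to bucket results by n-gram size after retrieval. This is a minimal sketch on plain objects; `groupByLength` is a hypothetical helper, and the second sample phrase is invented for illustration.

```javascript
// Hypothetical helper: groups phrase terms by their length property.
function groupByLength(phrases) {
  const groups = {};
  for (const phrase of phrases) {
    if (!groups[phrase.length]) {
      groups[phrase.length] = [];
    }
    groups[phrase.length].push(phrase.term);
  }
  return groups;
}

const samplePhrases = [
  { term: "love with", rawFreq: 103, length: 2 },   // from the example above
  { term: "in love with", rawFreq: 40, length: 3 }  // hypothetical values
];

console.log(groupByLength(samplePhrases));
```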

summary() → {Promise.<string>} voyantjs/src/corpus.js, line 441

Returns a brief summary of the corpus that includes essential metadata (documents count, terms count, etc.)

An example:

loadCorpus("austen").summary();

Returns:

Promise.<string> -

a Promise for a string containing a brief summary of the corpus metadata

terms(config) → {Promise.<Array>} voyantjs/src/corpus.js, line 682

Returns an array of terms (either CorpusTerms or DocumentTerms, depending on the specified mode). These terms are actually types, so information about each type is collected (as opposed to the tokens method which is for every occurrence in document order).

The mode is set to "documents" when any of the following is true:

the mode parameter is set to "documents"
a docIndex parameter is set
a docId parameter is set

The following is an example of a Corpus Term (corpus mode):

{
	"term": "the",
	"inDocumentsCount": 8,
	"rawFreq": 28292,
	"relativeFreq": 0.036189996,
	"comparisonRelativeFreqDifference": 0
}

The following is an example of a Document Term (documents mode):

{
	"term": "the",
	"rawFreq": 1333,
	"relativeFreq": 39721.086,
	"zscore": 28.419,
	"zscoreRatio": -373.4891,
	"tfidf": 0.0,
	"totalTermsCount": 33559,
	"docIndex": 0,
	"docId": "8a61d5d851a69c03c6ba9cc446713574"
}
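
The two `relativeFreq` values above are on different scales. Judging only from the sample numbers (this is an observation, not a documented definition), the corpus-mode value is consistent with `rawFreq` divided by the corpus `lexicalTokensCount` (781763 in the corpus metadata example earlier on this page), while the documents-mode value is consistent with `rawFreq` divided by `totalTermsCount`, scaled to occurrences per million:

```javascript
// Corpus mode: raw frequency of "the" over the corpus lexical token count.
const corpusRelative = 28292 / 781763;

// Documents mode: raw frequency over the document's total terms,
// expressed per million terms (an inference from the sample values).
const documentRelative = (1333 / 33559) * 1e6;

console.log(corpusRelative.toFixed(6));   // compare with 0.036189996 in the sample
console.log(documentRelative.toFixed(1)); // compare with 39721.086 in the sample
```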

The following config parameters are valid in both modes:

start: the zero-based start index of the list (for paging)
limit: the maximum number of terms to provide per request
minRawFreq: the minimum raw frequency of terms
query: a term query (see search tutorial)
stopList: a list of stopwords to include (see stopwords tutorial)
withDistributions: a true value shows distribution across the corpus (corpus mode) or across the document (documents mode)
whiteList: a keyword list – terms will be limited to this list
tokenType: the token type to use, by default lexical (other possible values might be title and author)
dir: sort direction, **ASC**ending or **DESC**ending

The following are specific to corpus mode:

bins: by default there are as many bins as there are documents (for distribution values); this can be modified
corpusComparison: you can provide the ID of a corpus for comparison of frequency values
inDocumentsCountOnly: if you don't need term frequencies but only frequency per document set this to true
sort: the order of the terms, one of the following: INDOCUMENTSCOUNT, RAWFREQ, TERM, RELATIVEPEAKEDNESS, RELATIVESKEWNESS, COMPARISONRELATIVEFREQDIFFERENCE

The following are specific to documents mode:

bins: by default the document is divided into 10 equal bins (for distribution values); this can be modified
sort: the order of the terms, one of the following: RAWFREQ, RELATIVEFREQ, TERM, TFIDF, ZSCORE
perDocLimit: the limit parameter applies to the total number of terms returned; this parameter lets you specify a limit per document
docIndex: the zero-based index of the documents to include (use commas to separate multiple values)
docId: the document IDs to include (use commas to separate multiple values)

An example:

// show top 5 terms
loadCorpus("austen").terms({stopList: 'auto', limit: 5}).then(terms => terms.map(term => term.term))

// show top term for each document
loadCorpus("austen").terms({stopList: 'auto', perDocLimit: 1, mode: 'documents'}).then(terms => terms.map(term => term.term))

Parameters:

Name	Type	Description
`config`	Object	an Object specifying parameters (see list above)

Properties

Name	Type	Description
`start`	number	the zero-based start index of the list (for paging)
`limit`	number	the maximum number of terms to provide per request
`minRawFreq`	number	the minimum raw frequency of terms
`query`	string	a term query (see search tutorial)
`stopList`	string	a list of stopwords to include (see stopwords tutorial)
`withDistributions`	boolean	a true value shows distribution across the corpus (corpus mode) or across the document (documents mode)
`whiteList`	string	a keyword list – terms will be limited to this list
`tokenType`	string	the token type to use, by default `lexical` (other possible values might be `title` and `author`)
`dir`	string	sort direction, `ASC`ending or `DESC`ending

Returns:

Promise.<Array> -

a Promise for an Array of Terms
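
The `start` and `limit` parameters support paging through long lists. As a sketch, the helper below generates one config object per page; `pageConfigs` is a hypothetical name, and the total number of items is assumed to be known to the caller (how to obtain it is not covered here).

```javascript
// Hypothetical helper: builds a list of config objects, one per page,
// each combining the shared options with its own start and limit.
function pageConfigs(total, pageSize, base = {}) {
  if (pageSize <= 0) {
    throw new RangeError("pageSize must be greater than zero");
  }
  const configs = [];
  for (let start = 0; start < total; start += pageSize) {
    configs.push({ ...base, start, limit: Math.min(pageSize, total - start) });
  }
  return configs;
}

// 250 items in pages of 100: starts at 0, 100 and 200.
console.log(pageConfigs(250, 100, { stopList: "auto" }));
```

Each generated object could then be passed in turn as the `config` argument of a method that accepts `start` and `limit`.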

text(config) → {Promise.<string>} voyantjs/src/corpus.js, line 549

Returns the text of the entire corpus.

Texts are concatenated together with two new lines and three dashes (\n\n---\n\n)

The following are valid in the config parameter:

noMarkup: strips away the markup
compactSpace: strips away superfluous spaces and multiple new lines
limit: a limit to the number of characters (per text)
format: text for plain text, any other value for the simplified Voyant markup

An example:

// fetch 1000 characters from each text in the corpus into a single string
loadCorpus("austen").text({limit:1000})

Parameters:

Name	Type	Description
`config`	Object	an Object specifying parameters (see list above)

Properties

Name	Type	Description
`noMarkup`	boolean	strips away the markup
`compactSpace`	boolean	strips away superfluous spaces and multiple new lines
`limit`	number	a limit to the number of characters (per text)
`format`	string	`text` for plain text, any other value for the simplified Voyant markup

Returns:

Promise.<string> -

a Promise for a string of the corpus
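
Since the texts are joined with the separator described above, the single string can be split back into per-document pieces. This is a minimal sketch with made-up text; `splitCorpusText` is a hypothetical helper, and a document that itself contains the separator sequence would be split incorrectly.

```javascript
// The separator described above: two new lines, three dashes, two new lines.
const SEPARATOR = "\n\n---\n\n";

// Hypothetical helper: splits the concatenated corpus text into documents.
function splitCorpusText(text) {
  return text.split(SEPARATOR);
}

// Made-up stand-in for the string that text() resolves to.
const combined = ["First text.", "Second text.", "Third text."].join(SEPARATOR);

console.log(splitCorpusText(combined).length); // 3
```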

texts(config) → {Promise.<Array>} voyantjs/src/corpus.js, line 584

Returns an array of texts from the entire corpus.

The following are valid in the config parameter:

noMarkup: strips away the markup
compactSpace: strips away superfluous spaces and multiple new lines
limit: a limit to the number of characters (per text)
format: text for plain text, any other value for the simplified Voyant markup

An example:

// fetch 1000 characters from each text in the corpus into an Array
loadCorpus("austen").texts({limit:1000})

Parameters:

Name	Type	Description
`config`	Object	an Object specifying parameters (see list above)

Properties

Name	Type	Description
`noMarkup`	boolean	strips away the markup
`compactSpace`	boolean	strips away superfluous spaces and multiple new lines
`limit`	number	a limit to the number of characters (per text)
`format`	string	`text` for plain text, any other value for the simplified Voyant markup

Returns:

Promise.<Array> -

a Promise for an Array of texts from the corpus

titles(config) → {Promise.<Array>} voyantjs/src/corpus.js, line 483

Returns an array of document titles for the corpus.

The following are valid in the config parameter:

start: the zero-based start of the list
limit: a limit to the number of items to return at a time
docIndex: a zero-based list of documents (first document is zero, etc.); multiple documents can be separated by a comma
docId: a set of document IDs; multiple documents can be separated by a comma
query: one or more term queries for the title, author or full-text
sort: one of the following sort orders: INDEX, TITLE, AUTHOR, TOKENSCOUNTLEXICAL, TYPESCOUNTLEXICAL, TYPETOKENRATIOLEXICAL, PUBDATE
dir: sort direction, **ASC**ending or **DESC**ending

An example:

loadCorpus("austen").titles().then(titles => "The last work is: "+titles[titles.length-1])

Parameters:

Name	Type	Description
`config`	Object	an Object specifying parameters (see list above)

Properties

Name	Type	Description
`start`	number	the zero-based start of the list
`limit`	number	a limit to the number of items to return at a time
`docIndex`	number	a zero-based list of documents (first document is zero, etc.); multiple documents can be separated by a comma
`docId`	string	a set of document IDs; multiple documents can be separated by a comma
`query`	string	one or more term queries for the title, author or full-text
`sort`	string	one of the following sort orders: `INDEX`, `TITLE`, `AUTHOR`, `TOKENSCOUNTLEXICAL`, `TYPESCOUNTLEXICAL`, `TYPETOKENRATIOLEXICAL`, `PUBDATE`
`dir`	string	sort direction, `ASC`ending or `DESC`ending

Returns:

Promise.<Array> -

a Promise for an Array of document titles

toString() voyantjs/src/corpus.js, line 1508

An alias for summary.

tokens(config) → {Promise.<Array>} voyantjs/src/corpus.js, line 745

Returns an array of document tokens.

The promise returns an array of document token objects. A document token object can look something like this:

	{
		"docId": "8a61d5d851a69c03c6ba9cc446713574",
		"docIndex": 0,
		"term": "LOVE",
		"tokenType": "lexical",
		"rawFreq": 54,
		"position": 0,
		"startOffset": 3,
		"endOffset": 7
	}

The following are valid in the config parameter:

start: the zero-based start index of the list (for paging)
limit: the maximum number of terms to provide per request
stopList: a list of stopwords to include (see stopwords tutorial)
whiteList: a keyword list – terms will be limited to this list
perDocLimit: the limit parameter applies to the total number of terms returned; this parameter lets you specify a limit per document
noOthers: only include lexical forms, no other tokens
stripTags: one of the following: ALL, BLOCKSONLY, NONE (BLOCKSONLY tries to maintain blocks for line formatting)
withPosLemmas: include part-of-speech and lemma information when available (reliability of this may vary by instance)
docIndex: the zero-based index of the documents to include (use commas to separate multiple values)
docId: the document IDs to include (use commas to separate multiple values)

An example:

// load the first 20 tokens (don't include tags, spaces, etc.)
loadCorpus("austen").tokens({limit: 20, noOthers: true})

Parameters:

Name	Type	Description
`config`	Object	an Object specifying parameters (see above)

Properties

Name	Type	Description
`start`	number	the zero-based start index of the list (for paging)
`limit`	number	the maximum number of terms to provide per request
`stopList`	string	a list of stopwords to include (see stopwords tutorial)
`whiteList`	string	a keyword list – terms will be limited to this list
`perDocLimit`	number	the `limit` parameter applies to the total number of terms returned; this parameter lets you specify a limit per document
`noOthers`	boolean	only include lexical forms, no other tokens
`stripTags`	string	one of the following: `ALL`, `BLOCKSONLY`, `NONE` (`BLOCKSONLY` tries to maintain blocks for line formatting)
`withPosLemmas`	boolean	include part-of-speech and lemma information when available (reliability of this may vary by instance)
`docIndex`	number	the zero-based index of the documents to include (use commas to separate multiple values)
`docId`	string	the document IDs to include (use commas to separate multiple values)

Returns:

Promise.<Array> -

a Promise for an Array of document tokens
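
When `noOthers` is not used, the result may include non-lexical tokens alongside words. The sketch below keeps only tokens whose `tokenType` is `"lexical"` and returns their terms; `lexicalTerms` is a hypothetical helper, and the `"other"` token type in the sample is an invented placeholder for any non-lexical token.

```javascript
// Hypothetical helper: extracts the terms of lexical tokens, in order.
function lexicalTerms(tokens) {
  return tokens
    .filter((token) => token.tokenType === "lexical")
    .map((token) => token.term);
}

const sampleTokens = [
  { docIndex: 0, term: "LOVE", tokenType: "lexical", position: 0 }, // from the example above
  { docIndex: 0, term: " ", tokenType: "other" },                   // hypothetical non-lexical token
  { docIndex: 0, term: "AND", tokenType: "lexical", position: 1 }   // hypothetical values
];

console.log(lexicalTerms(sampleTokens)); // ["LOVE", "AND"]
```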

tool(tool, config) → {Promise.<string>} voyantjs/src/corpus.js, line 1392

Returns an HTML snippet that causes the specified Voyant tool (or tools) to appear.

In its simplest form we can simply call the named tool:

loadCorpus("austen").tool("Cirrus");

Each tool supports some options (summarized below), which can be passed in a config object:

loadCorpus("austen").tool("Trends", {query: "love"});

There are also parameters (width, height, style, float) that apply to the actual tool window:

loadCorpus("austen").tool("Trends", {query: "love", style: "width: 500px; height: 500px"});

It's also possible to have several tools appear at once, though they won't be connected by events (clicking in a window won't modify the other windows):

loadCorpus("austen").tool("Cirrus", "Trends");

One easy way to get connected tools is to use the CustomSet tool and experiment with the layout:

loadCorpus("austen").tool("CustomSet", {tableLayout: "Cirrus,Trends", style: "width:800px; height: 500px"});

See the list of corpus tools for available tools and options.

Parameters:

Name	Type	Description
`tool`	string	The tool to display
`config`	Object	The config object for the tool

Returns:

Promise.<string>

async topics(config) → {Promise.<Object>} voyantjs/src/corpus.js, line 1214

Performs topic modelling using latent Dirichlet allocation (LDA). Returns an object that has two primary properties:

topics: an array of topics (words organized into bunches of a specified size)
topicDocuments: an array of documents and their topic weights

Each topic in the topics array is an object with the following properties:

words: an array of the actual words that form the topic. Each word has the same properties as the topic, as well as a "word" property that contains the text content.
tokens
documentEntropy
wordLength
coherence
uniformDist
corpusDist
effNumWords
tokenDocDiff
rank1Docs
allocationRatio
allocationCount
exclusivity

Each document in the topicDocuments array is an object with the following properties:

docId: the document ID
weights: an array of numbers corresponding to the weight of each topic in this document

The config object as parameter can contain the following:

topics: the number of topics to get (default is 10)
termsPerTopic: the number of terms for each topic (default is 10)
iterations: the number of iterations to do, more iterations = more accurate (default is 100)
perDocLimit: the token limit per document, starting at the beginning of the document
seed: specify a particular seed to use for random number generation
stopList: a list of stopwords to include

Parameters:

Name Type Description

config

Object

(see above)

Properties

Name	Type	Description
`topics`	number	the number of topics to get (default is 10)
`termsPerTopic`	number	the number of terms for each topic (default is 10)
`iterations`	number	the number of iterations to do, more iterations = more accurate (default is 100)
`perDocLimit`	number	specify a token limit per document, starting at the beginning of the document
`seed`	number	specify a particular seed to use for random number generation
`stopList`	string	a list of stopwords to include (see stopwords tutorial)

Returns:

Promise.<Object>
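
Because each entry in topicDocuments carries one weight per topic, the strongest topic for a document is the index of its largest weight. A minimal sketch follows. Here dominantTopics is a hypothetical helper, not part of Spyral, and the sample object is invented data that only mimics the shape described above (real results also carry the other properties listed for each topic).

```javascript
// Hypothetical helper (not part of Spyral): for each document, find the
// index of the topic with the largest weight and list that topic's words.
function dominantTopics(result) {
  return result.topicDocuments.map((doc) => {
    // indexOf returns the first index that holds the largest weight
    const topicIndex = doc.weights.indexOf(Math.max(...doc.weights));
    const words = result.topics[topicIndex].words.map((w) => w.word);
    return { docId: doc.docId, topicIndex: topicIndex, words: words };
  });
}

// Invented sample data in the shape described above (not real output).
const sample = {
  topics: [
    { words: [{ word: "love" }, { word: "heart" }] },
    { words: [{ word: "house" }, { word: "room" }] }
  ],
  topicDocuments: [
    { docId: "doc-a", weights: [0.8, 0.2] },
    { docId: "doc-b", weights: [0.3, 0.7] }
  ]
};

console.log(dominantTopics(sample));
```

In a notebook the same helper could be applied to a real result, for example loadCorpus("austen").topics({topics: 5}).then(dominantTopics).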

words(config) → {Promise.<Array>} voyantjs/src/corpus.js, line 791

Returns an array of words from the corpus.

The array of words is in document order, much like tokens.

The following are valid in the config parameter:

start: the zero-based start index of the list (for paging)
limit: the maximum number of terms to provide per request
stopList: a list of stopwords to include (see stopwords tutorial)
whiteList: a keyword list – terms will be limited to this list
perDocLimit: the limit parameter applies to the total number of terms returned; this parameter allows you to specify a limit value per document
docIndex: the zero-based index of the documents to include (use commas to separate multiple values)
docId: the document IDs to include (use commas to separate multiple values)

An example:

// load the first 20 words in the corpus
loadCorpus("austen").words({limit: 20})

Parameters:

Name Type Description

config

Object

an Object specifying parameters (see above)

Properties

Name	Type	Description
`start`	number	the zero-based start index of the list (for paging)
`limit`	number	the maximum number of terms to provide per request
`stopList`	string	a list of stopwords to include (see stopwords tutorial)
`whiteList`	string	a keyword list – terms will be limited to this list
`perDocLimit`	number	the `limit` parameter applies to the total number of terms returned; this parameter allows you to specify a limit value per document
`docIndex`	number	the zero-based index of the documents to include (use commas to separate multiple values)
`docId`	string	the document IDs to include (use commas to separate multiple values)

Returns:

Promise.<Array> -

a Promise for an Array of words
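
The docIndex and docId options accept several values separated by commas, and start and limit together provide paging. The sketch below shows one way to assemble such a config. Here wordsConfig is a hypothetical helper, not a Spyral function, and the page size of 20 is an arbitrary default.

```javascript
// Hypothetical helper (not part of Spyral): build a config object for
// words() from a page number and an optional list of document indexes.
function wordsConfig({ page = 0, pageSize = 20, docIndexes = [] } = {}) {
  const config = {
    start: page * pageSize, // zero-based start index of this page
    limit: pageSize         // maximum number of words in this page
  };
  if (docIndexes.length > 0) {
    // multiple values are passed as one comma-separated string
    config.docIndex = docIndexes.join(",");
  }
  return config;
}

// Third page (page 2) of words from the first and fourth documents.
console.log(wordsConfig({ page: 2, docIndexes: [0, 3] }));
```

The result could be handed to the method directly, as in loadCorpus("austen").words(wordsConfig({page: 2})).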

static load(config, api) → {Promise.<Corpus>} voyantjs/src/corpus.js, line 1527

Load a Corpus using the provided config and api

Parameters:

Name	Type	Description
`config`	Spyral.Corpus~CorpusConfig	the Corpus config
`api`	Object	any additional API values

Returns:

Promise.<Corpus>

Type Definitions

CorpusConfig voyantjs/src/corpus.js, line 34

The Corpus config

Properties:

Name	Type	Description
`corpus`	String	The ID of a previously created corpus. A corpus ID can be used to try to retrieve a corpus that has been previously created. Typically the corpus ID is used as a first string argument, with an optional second argument for other parameters (especially those to recreate the corpus if needed). `loadCorpus("goldbug"); loadCorpus("goldbug", { // if corpus ID "goldbug" isn't found, use the input input: "https://gist.githubusercontent.com/sgsinclair/84c9da05e9e142af30779cc91440e8c1/raw/goldbug.txt", inputRemoveUntil: 'THE GOLD-BUG', inputRemoveFrom: 'FOUR BEASTS IN ONE' });`
`input`	String \| Array.<String>	Input sources for the corpus. The input sources can be either normal text or URLs (starting with `http`). Typically input sources are specified as a string or an array in the first argument, with an optional second argument for other parameters. loadCorpus("Hello Voyant!"); // one document with this string loadCorpus(["Hello Voyant!", "How are you?"]); // two documents with these strings loadCorpus("http://hermeneuti.ca/"); // one document from URL loadCorpus(["http://hermeneuti.ca/", "https://en.wikipedia.org/wiki/Voyant_Tools"]); // two documents from URLs loadCorpus(["Hello Voyant!", "http://hermeneuti.ca/"]); // two documents, one from string and one from URL loadCorpus("https://gist.githubusercontent.com/sgsinclair/84c9da05e9e142af30779cc91440e8c1/raw/goldbug.txt", { inputRemoveUntil: 'THE GOLD-BUG', inputRemoveFrom: 'FOUR BEASTS IN ONE' }); // use a corpus ID but also specify an input source if the corpus can't be found loadCorpus("goldbug", { input: "https://gist.githubusercontent.com/sgsinclair/84c9da05e9e142af30779cc91440e8c1/raw/goldbug.txt", inputRemoveUntil: 'THE GOLD-BUG', inputRemoveFrom: 'FOUR BEASTS IN ONE' });
`inputFormat`	String	The input format of the corpus (the default is to auto-detect). The auto-detect format is usually reliable and inputFormat should only be used if the default behaviour isn't desired. Most of the relevant values are used for XML documents: DTOC: Dynamic Table of Contexts XML format; HTML: Hypertext Markup Language; RSS: Really Simple Syndication XML format; TEI: Text Encoding Initiative XML format; TEICORPUS: Text Encoding Initiative Corpus XML format; TEXT: plain text; XML: treat the document as XML (sometimes overriding auto-detect of XML vocabularies like RSS and TEI). Other formats include PDF, MSWORD, XLSX, RTF, ODT, and ZIP (but again, these rarely need to be specified).
`tableDocuments`	String	Determine what is a document in a table (the entire table, by row, by column); only used for table-based documents. Possible values are: undefined or blank (default): the entire table is one document; rows: each row of the table is a separate document; columns: each column of the table is a separate document. See also Creating a Corpus with Tables.
`tableContent`	String	Determine how to extract body content from the table; only used for table-based documents. Columns are referred to by numbers, the first is column 1 (not 0). You can specify separate columns by using a comma or you can combine the contents of columns/cells by using a plus sign. Some examples: 1: use column 1; 1,2: use columns 1 and 2 separately; 1+2,3: combine columns 1 and 2 and use column 3 separately. See also Creating a Corpus with Tables.
`tableAuthor`	String	Determine how to extract the author from each document; only used for table-based documents. Columns are referred to by numbers, the first is column 1 (not 0). You can specify separate columns by using a comma or you can combine the contents of columns/cells by using a plus sign. Some examples: 1: use column 1; 1,2: use columns 1 and 2 separately; 1+2,3: combine columns 1 and 2 and use column 3 separately. See also Creating a Corpus with Tables.
`tableTitle`	String	Determine how to extract the title from each document; only used for table-based documents. Columns are referred to by numbers, the first is column 1 (not 0). You can specify separate columns by using a comma or you can combine the contents of columns/cells by using a plus sign. Some examples: 1: use column 1; 1,2: use columns 1 and 2 separately; 1+2,3: combine columns 1 and 2 and use column 3 separately. See also Creating a Corpus with Tables.
`tableGroupBy`	String	Specify a column (or columns) by which to group documents; only used for table-based documents, in rows mode. Columns are referred to by numbers, the first is column 1 (not 0). You can specify separate columns by using a comma or you can combine the contents of columns/cells by using a plus sign. Some examples: 1: use column 1; 1,2: use columns 1 and 2 separately; 1+2,3: combine columns 1 and 2 and use column 3 separately. See also Creating a Corpus with Tables.
`tableNoHeadersRow`	String	Determine if the table has a first row of headers; only used for table-based documents. Provide a value of "true" if there is no header row, otherwise leave it blank or undefined (default). See also Creating a Corpus with Tables.
`tokenization`	String	The tokenization strategy to use. This should usually be undefined, unless specific behaviour is required. These are the valid values: undefined or blank: use the default tokenization (which uses Unicode rules for word segmentation); wordBoundaries: use any Unicode character word boundaries for tokenization; whitespace: tokenize by whitespace only (punctuation and other characters will be kept with words). See also Creating a Corpus Tokenization.
`xmlContentXpath`	String	The XPath expression that defines the location of document content (the body); only used for XML-based documents. `loadCorpus("<doc><head>Hello world!</head><body>This is Voyant!</body></doc>", { xmlContentXpath: "//body" }); // document would be: "This is Voyant!"` See also Creating a Corpus with XML.
`xmlTitleXpath`	String	The XPath expression that defines the location of each document's title; only used for XML-based documents. `loadCorpus("<doc><title>Hello world!</title><body>This is Voyant!</body></doc>", { xmlTitleXpath: "//title" }); // title would be: "Hello world!"` See also Creating a Corpus with XML.
`xmlAuthorXpath`	String	The XPath expression that defines the location of each document's author; only used for XML-based documents. `loadCorpus("<doc><author>Stéfan Sinclair</author><body>This is Voyant!</body></doc>", { xmlAuthorXpath: "//author" }); // author would be: "Stéfan Sinclair"` See also Creating a Corpus with XML.
`xmlPubPlaceXpath`	String	The XPath expression that defines the location of each document's publication place; only used for XML-based documents. `loadCorpus("<doc><pubPlace>Montreal</pubPlace><body>This is Voyant!</body></doc>", { xmlPubPlaceXpath: "//pubPlace" }); // publication place would be: "Montreal"` See also Creating a Corpus with XML.
`xmlPublisherXpath`	String	The XPath expression that defines the location of each document's publisher; only used for XML-based documents. `loadCorpus("<doc><publisher>The Owl</publisher><body>This is Voyant!</body></doc>", { xmlPublisherXpath: "//publisher" }); // publisher would be: "The Owl"` See also Creating a Corpus with XML.
`xmlKeywordXpath`	String	The XPath expression that defines the location of each document's keywords; only used for XML-based documents. `loadCorpus("<doc><keyword>text analysis</keyword><body>This is Voyant!</body></doc>", { xmlKeywordXpath: "//keyword" }); // keyword would be: "text analysis"` See also Creating a Corpus with XML.
`xmlCollectionXpath`	String	The XPath expression that defines the location of each document's collection name; only used for XML-based documents. `loadCorpus("<doc><collection>documentation</collection><body>This is Voyant!</body></doc>", { xmlCollectionXpath: "//collection" }); // collection would be: "documentation"` See also Creating a Corpus with XML.
`xmlDocumentsXpath`	String	The XPath expression that defines the location of each document; only used for XML-based documents. See also Creating a Corpus with XML.
`xmlGroupByXpath`	String	The XPath expression by which to group multiple documents; only used for XML-based documents. `loadCorpus("<doc><sp s='Juliet'>Hello!</sp><sp s='Romeo'>Hi!</sp><sp s='Juliet'>Bye!</sp></doc>", { xmlDocumentsXpath: '//sp', xmlGroupByXpath: "//@s" }); // two docs: "Hello! Bye!" (Juliet) and "Hi!" (Romeo)` See also Creating a Corpus with XML.
`xmlExtraMetadataXpath`	String	A value that defines the location of other metadata; only used for XML-based documents. `loadCorpus("<doc><tool>Voyant</tool><phase>1</phase><body>This is Voyant!</body></doc>", { xmlExtraMetadataXpath: "tool=//tool\nphase=//phase" }); // tool would be "Voyant" and phase would be "1"` Note that `xmlExtraMetadataXpath` is a bit different from the other XPath expressions in that it's possible to define multiple values (each on its own line) in the form of name=xpath. See also Creating a Corpus with XML.
`xmlExtractorTemplate`	String	Pass the XML document through the XSL template located at the specified URL before extraction (only used for XML-based documents). This is an advanced parameter that allows you to define a URL of an XSL template that can be called before text extraction (in other words, the other XML-based parameters apply after this template has been processed).
`inputRemoveUntil`	String	Omit text up until the start of the matching regular expression (this is ignored in XML-based documents). `loadCorpus("Hello world! This is Voyant!", { inputRemoveUntil: "This" }); // document would be: "This is Voyant!"` See also Creating a Corpus with Text.
`inputRemoveUntilAfter`	String	Omit text up until the end of the matching regular expression (this is ignored in XML-based documents). `loadCorpus("Hello world! This is Voyant!", { inputRemoveUntilAfter: "world!" }); // document would be: "This is Voyant!"` See also Creating a Corpus with Text.
`inputRemoveFrom`	String	Omit text from the start of the matching regular expression (this is ignored in XML-based documents). `loadCorpus("Hello world! This is Voyant!", { inputRemoveFrom: "This" }); // document would be: "Hello world!"` See also Creating a Corpus with Text.
`inputRemoveFromAfter`	String	Omit text from the end of the matching regular expression (this is ignored in XML-based documents). `loadCorpus("Hello world! This is Voyant!", { inputRemoveFromAfter: "This" }); // document would be: "Hello world! This"` See also Creating a Corpus with Text.
`subTitle`	String	A sub-title for the corpus. This is currently not used, except in the Dynamic Table of Contexts skin. Still, it may be worth specifying a subtitle for later use.
`title`	String	A title for the corpus. This is currently not used, except in the Dynamic Table of Contexts skin. Still, it may be worth specifying a title for later use.
`curatorTsv`	String	A simple TSV of paths and labels for the DToC interface (this isn't typically used outside of the specialized DToC context). The DToC skin allows curation of XML tags and attributes in order to constrain the entries shown in the interface or to provide friendlier labels. This assumes plain text Unicode input with one definition per line, where the simple XPath expression is separated by a tab from a label. `p paragraph ref[@target*="religion"] religion` For more information see the DToC documentation on Curating Tags.
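
Two of the options above use small string formats: the table options take column numbers joined with commas (kept separate) and plus signs (combined), and xmlExtraMetadataXpath takes name=xpath pairs, one per line. The following is a minimal sketch of building both. Here columnSpec and extraMetadataXpath are hypothetical helpers, not part of Spyral.

```javascript
// Hypothetical helper: each inner array is a group of 1-based column
// numbers to combine with "+"; groups are kept separate with ",".
function columnSpec(groups) {
  return groups.map((group) => group.join("+")).join(",");
}

// Hypothetical helper: one name=xpath definition per line.
function extraMetadataXpath(fields) {
  return Object.entries(fields)
    .map(([name, xpath]) => name + "=" + xpath)
    .join("\n");
}

// Config for a table where each row is a document, columns 1 and 2 are
// combined into the body text, and column 3 is used separately.
const tableConfig = {
  tableDocuments: "rows",
  tableContent: columnSpec([[1, 2], [3]])
};

// Config for an XML document with two extra metadata fields.
const xmlConfig = {
  xmlContentXpath: "//body",
  xmlExtraMetadataXpath: extraMetadataXpath({ tool: "//tool", phase: "//phase" })
};

console.log(tableConfig.tableContent);
console.log(xmlConfig.xmlExtraMetadataXpath);
```

Either object could then be supplied as the second argument to loadCorpus, alongside the input given as the first argument.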

