It may come with annotations such as part-of-speech tags, morphological analysis, discourse structure, and so forth.
As we saw in the IOB tagging technique (7.), it is possible to represent higher-level constituents using tags on individual words.
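To make the idea concrete, here is a minimal sketch of how word-level IOB tags can be grouped back into higher-level constituents. The helper name `iob_to_chunks` and the example sentence are our own illustrative choices, not part of any library; we assume tags of the form `B-X`, `I-X`, and `O`.

```python
def iob_to_chunks(tokens):
    """Group (word, iob_tag) pairs into (label, words) chunks.

    Hypothetical helper for illustration. Tags are assumed to look like
    "B-NP" (begin), "I-NP" (inside), or "O" (outside any chunk).
    """
    chunks = []
    current_label, current_words = None, []
    for word, tag in tokens:
        if tag.startswith("I-") and current_label == tag[2:]:
            # Continue the chunk that is already open.
            current_words.append(word)
            continue
        # Any other tag closes the open chunk, if there is one.
        if current_label is not None:
            chunks.append((current_label, current_words))
            current_label, current_words = None, []
        if tag != "O":
            # "B-X" (or a stray "I-X") opens a new chunk of type X.
            current_label, current_words = tag[2:], [word]
    if current_label is not None:
        chunks.append((current_label, current_words))
    return chunks


sentence = [("We", "B-NP"), ("saw", "O"), ("the", "B-NP"),
            ("yellow", "I-NP"), ("dog", "I-NP")]
chunks = iob_to_chunks(sentence)
print(chunks)
```

Each chunk is recovered purely from the per-word tags, which is why a flat sequence of tagged tokens is enough to encode this kind of constituent structure.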
Two sentences, read by all speakers, were designed to bring out dialect variation. The remaining sentences were chosen to be phonetically rich, involving all phones (sounds) and a comprehensive range of diphones (phone bigrams).
Additionally, the design strikes a balance between multiple speakers saying the same sentence in order to permit comparison across speakers, and having a large range of sentences covered by the corpus to get maximal coverage of diphones.
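Since a diphone is just a phone bigram, diphone coverage can be measured by counting adjacent pairs of phones. The sketch below assumes each utterance is already available as a list of phone symbols; the two short phone sequences and the names `diphones` and `utterances` are made up for illustration and are not taken from the corpus.

```python
from collections import Counter


def diphones(phones):
    """Return the list of adjacent phone pairs (phone bigrams)."""
    return list(zip(phones, phones[1:]))


# Illustrative phone sequences only, not real corpus transcriptions.
utterances = {
    "utt1": ["sh", "iy", "hv", "ae", "d"],
    "utt2": ["d", "ae", "sh", "iy"],
}

# Count how often each diphone occurs across all utterances.
counts = Counter(d for phones in utterances.values() for d in diphones(phones))

print(len(counts), "distinct diphones")
print(counts.most_common(1))
```

A corpus designer could compare the set of observed diphones against the set of all possible phone pairs to see which combinations still lack coverage.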
These are organized into a tree structure, shown schematically in 1.2.
At the top level there is a split between training and testing sets, which gives away its intended use for developing and evaluating statistical models.
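The following sketch shows one way to recover such a tree from file paths. Only the top-level training/testing split is taken from the description above; the deeper levels (dialect region, then speaker, then utterance file), the path strings, and the function name `group_by_split` are assumptions made for illustration.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Illustrative paths; the layout split/region/speaker/file is an assumption.
paths = [
    "train/dr1/fvmh0/sa1.wav",
    "train/dr1/fvmh0/sx206.wav",
    "train/dr2/faem0/sa1.wav",
    "test/dr1/faks0/sa1.wav",
]


def group_by_split(paths):
    """Build {split: {(region, speaker): [utterance ids]}} from paths."""
    tree = defaultdict(lambda: defaultdict(list))
    for p in paths:
        split, region, speaker, filename = PurePosixPath(p).parts
        # Drop the file extension to get the utterance id.
        tree[split][(region, speaker)].append(PurePosixPath(filename).stem)
    return tree


tree = group_by_split(paths)
train_speakers = sorted(tree["train"])
print(train_speakers)
print(tree["test"][("dr1", "faks0")])
```

Keeping the split at the top of the tree means a model can be trained on everything under one branch and evaluated on the other without any risk of mixing the two.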
Finally, TIMIT includes demographic data about the speakers, permitting fine-grained study of vocal, social, and gender characteristics.
It was designed to provide data for the acquisition of acoustic-phonetic knowledge and to support the development and evaluation of automatic speech recognition systems.
Moreover, even at a given level there may be different labeling schemes or even disagreement amongst annotators, such that we want to represent multiple versions.
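One simple way to represent multiple versions is to key each annotation by both its level and its annotator, so that competing labelings sit side by side. The sketch below is a hypothetical in-memory layout; the annotator names, tags, and the function `disagreements` are invented for illustration.

```python
tokens = ["We", "saw", "the", "yellow", "dog"]

# Each (layer, annotator) pair holds one complete labeling of the tokens.
# The annotators and their tags are invented for this example.
annotations = {
    ("pos", "annotator_a"): ["PRP", "VBD", "DT", "JJ", "NN"],
    ("pos", "annotator_b"): ["PRP", "VBD", "DT", "NN", "NN"],
}


def disagreements(tokens, annotations, layer):
    """List (index, token, labels) wherever annotators differ on a layer."""
    versions = {who: tags for (lyr, who), tags in annotations.items()
                if lyr == layer}
    result = []
    for i, token in enumerate(tokens):
        labels = {who: tags[i] for who, tags in versions.items()}
        if len(set(labels.values())) > 1:
            result.append((i, token, labels))
    return result


conflicts = disagreements(tokens, annotations, "pos")
print(conflicts)
```

Because no version overwrites another, the disagreement itself stays available as data, for example for measuring inter-annotator agreement.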
A second property of TIMIT is its balance across multiple dimensions of variation, for coverage of dialect regions and diphones.
Structured collections of annotated linguistic data are essential in most areas of NLP; however, we still face many obstacles in using them.
The goal of this chapter is to answer the following questions: Along the way, we will study the design of existing corpora, the typical workflow for creating a corpus, and the lifecycle of a corpus.