Documentation of Dolmen
Overview
Dolmen is a free software toolbox for linguistics research. It offers a user-friendly interface to manage, annotate and use language corpora. It is particularly well suited for dealing with time-aligned data. The main features are:
- Project management: organize files into projects and manage versions.
- Extensible metadata: files can be annotated with tags, which allow you to sort and organize your data.
- Inter-operable: Dolmen can read Wavesurfer and Praat TextGrid files, and open TextGrid and sound files directly in Praat.
- Powerful search engine: build and save complex queries; search patterns across tiers.
- Standard-based: Dolmen files are encoded in XML and Unicode.
Dolmen runs on all major platforms (Windows, Mac OS X and GNU/Linux) and is freely available under the terms of the GNU General Public License (GPL).
You can read this document sequentially or jump to a specific topic directly:
- The main window
- Managing projects
- Managing metadata
- Annotation files
- Executing queries
- Regular expressions
- Editing preferences
The main window
On start up, Dolmen always displays the main window, which is split into 3 areas:
- the file browser (on the left): it is used to display the files in your current project.
- the application tabs (at the bottom): a number of fixed tabs used to interact with the application.
- the viewer (which occupies the remainder of the window): it stores a number of ‘views’ (for instance, the result of a query). Views can be added, removed and moved around. They work very much like tabs on modern web browsers.
Managing projects
Overview
Dolmen was designed to deal with large corpora of linguistic data. Therefore, it does not operate on single files but on projects, which are collections
of files. When it starts up, Dolmen opens a default empty project. The project structure is displayed in the file browser, in the left part of the main window.
You can add files to the current project via the command Add file(s) to project..., available from the context menu in the file browser or from the File menu.
To select several files, keep the ctrl (or command on Mac) key pressed and click on each file you want to open. If you want to import a large number of files at once, simply place them in one folder (directory) and use the command Add content of folder to project... This will recursively import all files and sub-folders from the folder you chose into the project.
Currently, Dolmen supports 2 types of files:
- sound files: Dolmen supports many sound formats including the popular WAV, AIFF and FLAC. You can see all supported formats by clicking Supported audio formats from the help menu. Note that Dolmen does not yet support the lossy formats OGG Vorbis and MP3.
- annotations: these are time-aligned text files, which are generally Praat TextGrid files or WaveSurfer label files.
To group files into a folder, select them and use the command Move files to folder...: you are then prompted for a name. Upon validating the name (by pressing Enter or clicking the ok button), the folder is created at the bottom of the file browser, with the selected files in it. You can drag and drop it wherever you want within the file browser and create as many sub-folders as you wish. The hierarchical structure of your project does not affect or depend on the location of the files on your hard drive; it is stored in the project file directly and is meant to help you organize your files within the application.
To save your project, use the command Save project...: a project file ends with the extension .dmpr (Dolmen project), which will be added automatically if you omit it. You can also open an existing project using the command Open project... The most recently opened projects are available in the File menu under Recent projects. The last project that was opened during the previous session can be re-opened with the shortcut ctrl+shift+o.
If you wish to make your project portable (e.g. you want to be able to exchange it with colleagues), it is recommended that you place all your project files inside one folder and that you place the project file inside that folder.
Managing metadata
Dolmen does not make any assumptions about the semantics associated with your project. Instead, it offers a simple yet powerful mechanism to add metadata
to it, namely “tags”. In its simplest form, a tag is a label that is attached to a file. In Dolmen, tags are furthermore grouped into categories.
To add a new tag to a file (or a set of files), select the file(s) you want to tag and right-click on the selection.
In the context menu, click on Add tag... > Create new tag... This will open the tag editor.
The editor will let you input a category and a value. For convenience, Dolmen provides a "Default" category, which might be the only category you need for small projects. For bigger projects, however, you will probably want to design your own categories to better organize your data. For instance, if you have 12 speakers with 3 tasks each, you may create a "Speaker" category with a different label for each speaker, and a "Task" category with 3 labels corresponding to each task. You could create extra categories for sex, age group, etc.
Note for PFC users: if the name of your file follows the PFC conventions, Dolmen will automatically create tags for the categories "Speaker", "Survey" and "Task", based on the file names.
Once you have created a new tag, it is available in the Add tag... sub-menu. Each category is represented as a menu containing all the labels that are available for this category. Clicking on a label will add the tag to the selected file(s) (if it is not already present). To remove a tag, simply use the Remove tag... command from the context menu, which works in a similar way to Add tag... To remove a tag permanently from all files, use Remove tag... > Remove tag from project...
When you hover the mouse cursor over a file in the file browser, its metadata are displayed in a tool tip. Tags are displayed in the format "Category : Label". Tags from the default category are simply displayed as "Label", without any category. The value of a tag is generally treated as text, but Dolmen also supports numeric values. Suppose for instance that you need a category for the age of your subjects or for a rating scale; in this case, you could create a category "Age" (or "Rating") and use only numeric values. If all values for a given category are numeric, Dolmen will understand that they must be treated as numbers and will present them differently in the tag box: instead of showing all values, it will display a value field and will let you choose a mathematical operator ("=", ">", etc.). All files supported by Dolmen can be tagged and you can add as many tags as you wish to a single file.
Besides tags, Dolmen offers a "Description" field which lets you enter any text that you may want to associate with a file. To edit it, double-click on the file in the file browser: this will open the file in the viewer, where the description is the first (and currently only) field. If you modify it, remember to save your project for the modifications to be written to disk.
A file's metadata can be viewed by hovering over it in the file browser or by opening it in a view.
Managing bookmarks
Dolmen lets you bookmark search results so that you can retrieve them easily later on, for instance when you are writing a paper and need to discuss specific cases. A bookmark keeps track of the matched text and of the location of the match in the sound file.
To bookmark a search result, simply right-click on it in the query view, and click on Bookmark search result...
A new dialog will pop up which will let you assign a title to your bookmark and (optionally) add some notes.
To view your bookmarks, click on the combobox selector in the top-left corner of the window (above the file browser) and select Bookmarks. Your bookmarks will be displayed in the file browser, instead of the project's files. You can view the metadata associated with a bookmark by hovering over it with the mouse cursor. Double-clicking on a bookmark opens the annotation at the location that was bookmarked.
Annotations
An annotation is a time-aligned text file, typically the transcription and/or labeling of a sound file. The format used in Dolmen is inspired by Praat’s TextGrid, with a number of extensions. The key differences are:
- Annotations support (user-defined) metadata
- an annotation can be bound to a sound file
- Praat's tier intervals are called 'spans' in Dolmen
- Dolmen's items (spans and points) can be mixed within a tier.
- Dolmen's items can have connections to/from any other item in an annotation (they are treated as vertices of a graph).
Dolmen can work with TextGrid files as well as with its own annotation format (DMF). However, it is not possible to store all the information that an annotation can store into a TextGrid file. Dolmen works around this limitation by creating a separate file in which it writes the TextGrid's metadata. By default, this file is stored along with the TextGrid, but it is possible to store it elsewhere if you do not wish to "pollute" your file system with metadata files (see preference editor).
This all happens behind the scenes: users can thus tag all their annotations (whether they are DMF or TextGrid files) without having to bother about which format they are stored in. However, the metadata cannot be read by Praat.
As a convenience, when an annotation file is loaded, Dolmen will automatically try to find a sound file that matches the annotation's name with the extension .wav, .flac or .aiff (in this order). If the sound file exists, the annotation will be bound to it even if it is not part of the project.
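The lookup order described above can be sketched in a few lines of Python. This is only an illustration, not Dolmen's actual code: the function name `find_matching_sound` is hypothetical, and the sketch assumes that the sound file sits in the same folder as the annotation, which the text does not state explicitly.

```python
from pathlib import Path
from typing import Optional

# Extensions are tried in the order given in the text above.
SOUND_EXTENSIONS = (".wav", ".flac", ".aiff")


def find_matching_sound(annotation_path: Path) -> Optional[Path]:
    """Return the first existing sound file that shares the annotation's base name.

    Assumption: the sound file is looked up next to the annotation file.
    """
    for extension in SOUND_EXTENSIONS:
        candidate = annotation_path.with_suffix(extension)
        if candidate.is_file():
            return candidate
    return None  # no sound file found: the annotation stays unbound
```

With both rec.flac and rec.aiff next to rec.TextGrid, this sketch would pick rec.flac, because .flac comes before .aiff in the list.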
Queries
The search window
The main search interface is available through the Search button on the left side of the main window. Clicking the button will open a new window (the 'search window').
The file box in the top left corner allows you to select the type of files to search in. Currently, only annotations are supported.
The Search box in the top right corner allows you to enter some text or a regular expression to search.
Next to the search field, a spin box lets you select the tier you want to search in. The default choice is Any tier, which means that Dolmen will try to find your pattern in all tiers of the selected files. You can also restrict the search to a particular tier; in that case, if a file contains fewer tiers than the tier number you chose (e.g. you try to search in tier 3 of a file that has only 2 tiers), the file is ignored and a warning is written to the output tab. Right below the search field, the "plus" and "minus" buttons let you add and remove search tiers (see cross tier search).
Additionally, you can select a search style for your query: valid options are Regular Expression, UNIX Shell Pattern and Plain Text (see search style). A check box allows you to make your query case-sensitive. When the search is case-sensitive, strings like "foo", "Foo" and "FOO" are all treated as different; when it is case-insensitive (the default), they are treated as one and the same string.
Below the file and search boxes is the tag box: its content is generated on the basis of the tags that the current project contains. Each category is displayed as a group box containing a list of all the labels of this category. You can check or uncheck any label in any category (each category also has an "All labels" button to check/uncheck all labels at once).
The search engine will filter files based on the conditions that you specify in the tag box. Within a category, it uses the boolean operator OR to find the subset
of files that has either label. Across categories, it uses the operator AND to find the intersection of all the subsets defined by each category.
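The OR-within, AND-across logic described above can be sketched as follows. This is a minimal illustration, not Dolmen's implementation: the function name `file_matches` and the dictionary layout (category name mapped to a set of labels) are assumptions made for the example, and a category that imposes no constraint is simply left out of `selected`.

```python
from typing import Dict, Set


def file_matches(file_tags: Dict[str, Set[str]],
                 selected: Dict[str, Set[str]]) -> bool:
    """Return True if a file passes the tag filter.

    Within a category, one shared label is enough (OR);
    every constrained category must be satisfied (AND).
    """
    for category, checked_labels in selected.items():
        labels_on_file = file_tags.get(category, set())
        if not (labels_on_file & checked_labels):
            return False  # this category has no label in common: AND fails
    return True


# Hypothetical project with two categories.
files = {
    "a": {"Speaker": {"S1"}, "Task": {"reading"}},
    "b": {"Speaker": {"S2"}, "Task": {"reading"}},
    "c": {"Speaker": {"S1"}, "Task": {"conversation"}},
}
selected = {"Speaker": {"S1", "S2"}, "Task": {"reading"}}
kept = sorted(name for name, tags in files.items() if file_matches(tags, selected))
print(kept)
```

Here files "a" and "b" are kept: both have one of the checked speakers and the checked task, whereas "c" fails on the "Task" category.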
Once you hit the ok button, the result of your query is presented as a new view in the viewer. You can browse the results with the mouse wheel.
The information box on the right-hand side displays information about the selected token.
If an annotation is bound to a sound file, you can play a match by double-clicking on it or by pressing the space bar (you can also interrupt it by pressing Esc). You can also right-click on the selected match and click on the Play selection action from the match context menu. If you have Praat installed and your annotation is a TextGrid, you can use the Open selection in Praat action from the match context menu. Dolmen will open the TextGrid (and the sound file if the annotation is bound) in Praat and will display the current match. (You need to have Praat already running for this to work.)
Cross-tier search
Dolmen is not limited to single-tier search: it is also able to perform "cross-tier" search, that is, it can retrieve results from a tier depending on conditions met on other tiers. Cross-tier search is useful when you have data that is hierarchically organized across several tiers. Typical examples are prosodic hierarchies or syntactic trees. Currently, the hierarchical organization is read off the annotation based on the items' time alignment. The root item on the base tier must be a span (or interval in Praat). On each subordinate tier, items are considered to be dominated by the root item if they are within its boundaries: for instance, if a root item spans from 10'' to 15'', all items within that range on the dependent tiers will be treated as dependent nodes.
As it is currently implemented, cross-tier search uses the first tier in the search box as the base tier and treats the others as hierarchically dependent on the first one. That is, whenever it finds a match in the base tier, it tries to find a match on each of the other tiers among the items that are within the range of the matching item on the base tier.
To enable cross-tier search, simply add one or several tier(s) in the search box using the "plus" button below the search box. (You can also use the key combination Alt + to add a tier and Alt - to remove the last one.) When cross-tier search is enabled, an additional spin box appears above the first search field, which lets you select the tier whose text you want to display. It can be any tier, and not necessarily one of those you are searching in.
An example will make this clearer: suppose you have 3 tiers in your file: the first one contains spans which denote syllables ("syll"), the second one contains syllabic constituents ("Onset", "Nucleus", "Coda") and the last one individual segments ("p", "a", "t"...). Let's consider a query that looks for "syll" on tier 1, "Coda" on tier 2 and displays text from tier 3. This query will first get all the items that have a "syll" label on the first tier; then, for each of those, it will look for a label "Coda" on tier 2 within the limits of the span on tier 1; for each item that matches both conditions, it will display the concatenated text of the items on tier 3 that are dominated by the matching item on tier 1. Our query will thus print all syllables that end in a coda.
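The syllable example can be mimicked with a small Python sketch. It is a simplified model under stated assumptions, not Dolmen's search engine: the `Item` class, the function names and the data are invented for illustration, tiers are numbered from 0 in the code (tier 1 in the text is index 0), and the displayed text is concatenated without a separator.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Item:
    start: float
    end: float
    text: str


def within(item: Item, root: Item) -> bool:
    """An item is dominated by the root if it lies inside the root's boundaries."""
    return root.start <= item.start and item.end <= root.end


def cross_tier_search(tiers: List[List[Item]],
                      conditions: List[Tuple[int, str]],
                      display_tier: int) -> List[str]:
    """conditions[0] is the base tier; the others are dependent tiers."""
    base_index, base_pattern = conditions[0]
    results = []
    for root in tiers[base_index]:
        if not re.search(base_pattern, root.text):
            continue
        # Every dependent tier needs at least one matching item inside the root.
        if all(any(within(item, root) and re.search(pattern, item.text)
                   for item in tiers[index])
               for index, pattern in conditions[1:]):
            dominated = [item.text for item in tiers[display_tier] if within(item, root)]
            results.append("".join(dominated))
    return results


tiers = [
    [Item(0.0, 0.3, "syll"), Item(0.3, 0.5, "syll")],
    [Item(0.0, 0.1, "Onset"), Item(0.1, 0.2, "Nucleus"), Item(0.2, 0.3, "Coda"),
     Item(0.3, 0.4, "Onset"), Item(0.4, 0.5, "Nucleus")],
    [Item(0.0, 0.1, "p"), Item(0.1, 0.2, "a"), Item(0.2, 0.3, "t"),
     Item(0.3, 0.4, "p"), Item(0.4, 0.5, "a")],
]
matches = cross_tier_search(tiers, [(0, "syll"), (1, "Coda")], display_tier=2)
print(matches)
```

Only the first syllable contains a "Coda" item, so the sketch returns the single result "pat"; the second syllable ("pa") is skipped.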
Exporting queries to a spreadsheet
Dolmen can export query results to a text format (CSV, which historically stands for "comma separated values") that can be read by spreadsheet programs such as Microsoft Excel or LibreOffice Calc. Dolmen uses the CSV file format (rather than, say, XLS) because it is simple, portable and can be read or parsed by a wide variety of programs.
To export the results of a query to a CSV file, right-click on the results and select "Export" > "Save as tab-separated file...". To import the CSV file into Excel, click on Data > From text (from the main menu). In LibreOffice, simply open the CSV file like a regular file: its extension will automatically be recognized. Both programs will open a new dialog that lets you decide how to import the file. In order to load the file successfully, you need to specify the encoding (Unicode or UTF-8), the separator (tabulation character) and the text delimiter (double-quote character). In Excel, if you have numeric codings that start with zero, make sure they are treated as text values when you import them, otherwise the leading zeros will be trimmed off.
CSV files are organized as a table where rows represent query results and columns represent values. The following values are extracted: the file name, the start and end of the item in which the match was found, the left context, the match, the right context, and the project's categories. Each category is represented by a column for all matches: if there are several values associated with a category for a given match (say category Language and values English and French), the values are separated by the character "|" (e.g. English | French). If there are no tags for a category, the field is left empty.
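If you prefer a script to a spreadsheet, the exported file can also be read with Python's standard csv module. The sketch below follows the settings given above (tab separator, double-quote text delimiter, "|" between multiple labels); the helper names and the sample row are invented for illustration, and the sketch makes no assumption about whether the file starts with a header row.

```python
import csv
import io
from typing import Iterable, List


def read_export(handle: Iterable[str]) -> List[List[str]]:
    """Read tab-separated rows; every field stays a string, so leading zeros survive."""
    return list(csv.reader(handle, delimiter="\t", quotechar='"'))


def split_labels(field: str) -> List[str]:
    """Split a category field such as 'English | French'; an empty field gives []."""
    return [label.strip() for label in field.split("|") if label.strip()]


# Invented sample row: file, start, end, left context, match, right context,
# then one column per category (the last one is empty).
sample = 'file1.TextGrid\t1.20\t2.45\tleft\t"0412"\tright\tEnglish | French\t\n'
rows = read_export(io.StringIO(sample))
print(rows[0][4], split_labels(rows[0][6]))

# For a real file, something like the following would be used instead:
#   with open("results.csv", newline="", encoding="utf-8") as handle:
#       rows = read_export(handle)
```

Because the csv module never converts fields to numbers, a coding such as 0412 keeps its leading zero without any extra step.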
The 'Queries' tab
The Queries tab (located at the bottom of the main window) stores all the queries you run during a session. Double-clicking on a query will focus the query in the viewer or re-open it if you closed it.
To save a query, right-click on it to trigger the context menu and click on Save query...: a standard file dialog will open for you to save your query. Queries bear the extension .dmq (Dolmen Query); the extension is added automatically if you omit it. Note that a query is bound to a particular project: you will not be able to open a DMQ file which doesn’t match the current project.
The query syntax is currently undocumented (it is loosely based on SQL but is specific to Dolmen). Normal users need not know anything about it.
PFC and PAC users
When using the "PFC" and "PAC" modes (see application mode), a number of facilities are available to search for schwa and liaison codings.
PFC
If you select tier 2 in the search box, the search field is made “aware” of the fact that you are looking for schwa codings and offers the following conveniences:
- you can input a star * to replace any digit in a coding. Thus, the pattern *422 will return all codings in word-final position between two consonants, whether schwa is realized or not.
- the pattern **** will return all codings.
- the symbol % denotes any character but 'e'. This is particularly useful when studying the correlation between spelling and pronunciation, for instance to see whether there is a significant difference between e-ending and consonant-ending words in the realization of schwa.
Similarly, if you look for a pattern in the third tier, the following shortcuts are available:
- * represents any digit (** returns all liaison codings)
- C represents any liaison consonant.
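One way to picture these shortcuts is as a translation into an ordinary regular expression. The following Python sketch is purely illustrative: how Dolmen handles these symbols internally is not described here, the function name is hypothetical, and the C shortcut is left out because the set of liaison consonants is not specified in this document.

```python
import re


def coding_pattern_to_regex(pattern: str) -> str:
    """Translate the '*' and '%' shortcuts into standard regex syntax.

    '*' stands for any digit and '%' for any character but 'e';
    every other character is matched literally.
    """
    parts = []
    for character in pattern:
        if character == "*":
            parts.append(r"\d")
        elif character == "%":
            parts.append("[^e]")
        else:
            parts.append(re.escape(character))
    return "".join(parts)


regex = coding_pattern_to_regex("*422")
print(regex)
print(bool(re.fullmatch(regex, "0422")), bool(re.fullmatch(regex, "1422")))
```

Under this reading, *422 becomes \d422 and therefore accepts both 0422 and 1422, while **** accepts any sequence of four digits.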
PAC
If you look for a pattern in tier 2 (r-liaison), you can use a star * to replace any digit in the coding. The pattern *** will return all liaison codings.
Search and regular expressions
Dolmen's search engine can make use of regular expressions (sometimes called 'regexp'), a special syntax that lets users find text patterns. Regular expressions are very powerful, but their syntax can be cumbersome and they can sometimes be tricky to use properly.
Here we give an overview of regular expressions as they are implemented in Dolmen. In what follows, a 'string' is defined as an arbitrary sequence of characters and is shown in italics (e.g. the string xyz). A 'pattern' is a sequence of characters that conforms to the syntax of regular expressions and is shown in bold face (e.g. the pattern ^.*$). A pattern 'matches' a string if there is a substring in the string that corresponds to the pattern. The match is underlined (e.g. xyz).
Basics
Regular expressions always try to match a pattern from left to right; in their simplest form, they match a sequence of (non-special) characters. For instance, the pattern the matches the first occurrence of ‘the’ in the string the cat is chasing the mouse. Note that search can be made case-sensitive (it is case-insensitive by default) by ticking the corresponding box in the search window. In Dolmen (as currently implemented), regular expressions are ‘non-greedy’, which means they will match the smallest segment of text that corresponds to the pattern (but see caveats).
The most common symbols are:
- . : match any character
- ^ : match the beginning of a string
- $ : match the end of a string
- \b : match a word boundary
- \s : match a white space character
- [xyz] : match either of the characters 'x', 'y' or 'z'
- [^xyz] : match any character but 'x', 'y' or 'z'
- [a-z] : match any character in the range from 'a' to 'z'
- \d : match a digit character (equivalent to [0-9])
- \w : match a word character (including digits and '_' underscore)
- E? : match 0 or 1 occurrences of the expression E
- E* : match 0 or more occurrences of the expression E
- E+ : match 1 or more occurrences of the expression E
- E{n} : match exactly n occurrences of the expression E
- E{n,m} : match between n and m occurrences of the expression E
- E{n,} : match at least n occurrences of the expression E
- E{,m} : match at most m occurrences of the expression E (and possibly 0)
In this context, an expression must be understood as either a character (e.g. o{2,} matches the string food) or a sequence of characters enclosed by parentheses (e.g. (do){2} matches the string fais dodo).
Another useful character is '|', which is used to combine expressions (logical OR). For example, the pattern ^(John|Mary) matches the strings John kissed Mary and Mary was kissed by John.
All the characters that are used as part of the syntax of regular expressions ('{', ')', '\', etc.) are treated as special characters by the search engine. As such, if you need to match one of those characters in a string (e.g. the parentheses in the string and (he) she...), you need to escape it with a backslash. For instance, the pattern \(he\) matches the string and (he), I mean she...
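The examples above can be tried outside Dolmen as well. The sketch below uses Python's re module, which is not the engine used by Dolmen but behaves the same way for these basic constructs; the helper name `first_match` is only for illustration.

```python
import re
from typing import Optional


def first_match(pattern: str, string: str) -> Optional[str]:
    """Return the text of the first match, or None if the pattern does not match."""
    match = re.search(pattern, string)
    return match.group(0) if match else None


examples = [
    (r"o{2,}", "food"),                           # repetition of a single character
    (r"(do){2}", "fais dodo"),                    # repetition of a parenthesized sequence
    (r"^(John|Mary)", "Mary was kissed by John"), # alternation anchored at the start
    (r"\(he\)", "and (he), I mean she..."),       # escaped parentheses
]
for pattern, string in examples:
    print(pattern, "->", first_match(pattern, string))
```

The four patterns match "oo", "dodo", "Mary" and "(he)" respectively.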
Extensions
To make things easier, Dolmen recognizes a number of additional symbols which can be useful to linguists (but are not part of the syntax of regular expressions).
- # : a word boundary (equivalent to \b)
- #* : match a (possibly empty) prefix (equivalent to \b\w*)
- *# : match a (possibly empty) suffix (equivalent to \w*\b)
These symbols make it convenient to look for derived forms. For instance, the pattern #*happ[yi]*# could be used to match forms like happy, happier, unhappy, happiness, unhappiness, etc. Note however that these symbols cannot be used in PFC and PAC mode in the codings tiers.
Additionally, Dolmen defines a few useful variables. Search variables always start with ‘%’ and are capitalized:
- %LINE : match a non-empty line (equivalent to ^.+$)
- %WORD : match a non-empty word (equivalent to \b\w+\b)
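Since each of these symbols is defined by an equivalent standard expression, the expansion can be sketched as a simple textual substitution. This is a naive illustration rather than Dolmen's actual implementation: the function name `expand_extensions` is hypothetical, and the sketch ignores corner cases such as a literal '#' or an already escaped symbol.

```python
import re


def expand_extensions(pattern: str) -> str:
    """Replace Dolmen's extra symbols with the equivalents listed above."""
    pattern = pattern.replace("%LINE", r"^.+$").replace("%WORD", r"\b\w+\b")
    pattern = pattern.replace("#*", r"\b\w*")   # possibly empty prefix
    pattern = pattern.replace("*#", r"\w*\b")   # possibly empty suffix
    return pattern.replace("#", r"\b")          # remaining plain word boundaries


expanded = expand_extensions("#*happ[yi]*#")
print(expanded)
print(re.findall(expanded, "happy unhappiness sad happier"))
```

The pattern #*happ[yi]*# expands to \b\w*happ[yi]\w*\b, which finds happy, unhappiness and happier in the sample string but not sad.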
Booby traps
Regular expressions can sometimes be difficult to use, and may not always do what you think they should be doing. Here are some examples:
- regular expressions only match characters, they have no "understanding" of linguistic structure. As such, the notion 'word' must be understood in a broad sense: indeed, a string like AR0303BD is a perfectly valid word as far as the regular expression engine is concerned, even though it might be a speaker identifier in your conventions. This is something you may have to take into consideration, for instance if you want to count words in a corpus.
- Regular expressions are 'non-greedy', but that doesn’t mean that they are 'minimal'. Let’s consider the pattern I.*know matched against the string I don’t, I don’t know. Because search is non-greedy, we might expect that it will match the substring I don’t know, but this is not the case: it matches the whole string I don’t, I don’t know. The reason for this is that the regular expression engine parses a string from left to right and returns the first match without computing all the logical possibilities (for performance reasons). In this case, it matches the first character of the string and continues until the end. 'Minimal' search is not supported by standard regular expression engines and is currently not implemented in Dolmen.
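The same behaviour can be observed with Python's re module (again, not the engine used by Dolmen, but one that follows the same left-to-right strategy). In Python, non-greedy matching is requested explicitly with .*? whereas .* is greedy; the helper name `first_match` is only for illustration.

```python
import re


def first_match(pattern: str, string: str) -> str:
    """Return the text of the first match found when scanning from the left."""
    return re.search(pattern, string).group(0)


text = "I don't, I don't know"
greedy = first_match(r"I.*know", text)       # greedy quantifier
non_greedy = first_match(r"I.*?know", text)  # non-greedy quantifier
print(greedy)
print(non_greedy)
```

Both calls return the whole string: the match starts at the first "I" that can lead to a match, and non-greediness only shortens the match from that starting point, so the shorter substring starting at the second "I" is never considered.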
Learn more
If you would like to learn more about the regular expression engine used in Dolmen, you should have a look at the documentation of Qt's regular expression engine.
The following website also contains lots of information about regular expressions: www.regular-expressions.info.
Another very good resource about regular expressions in general is: Jeffrey E.F. Friedl (1997) Mastering regular expressions, O’Reilly.
The preference editor
The preference editor is available in the main menu under Edit > Preferences (or Dolmen > Preferences on Mac). The editor contains 2 tabs, which allow you to adjust your settings to your system and/or to your liking.
The 'General' tab
The application data folder is the folder where Dolmen stores its own files, especially metadata files when they are not stored with the TextGrids.
If you need to change this value, click on the choose... button and navigate to the folder that you want to use.
The match context window is the size (in characters) of the context on each side of your match in a query view. Adjust it according to the size of your screen. Note that choosing a longer window will somewhat slow down the search, depending on your hardware and on the size of your corpus. By default, this option is set to 30 characters.
The default search style lets you decide how a pattern is to be searched by default. Options are:
- Regular Expressions: this is the most powerful mode. It lets you use Perl-like regular expressions.
- UNIX Shell Pattern: easier than regular expressions but much more limited.
- Plain Text: no special characters
The last option lets you decide whether to store the metadata of non-native files (e.g. TextGrids) along with the files. If you select Yes (the default case), the metadata file will be stored in the same folder as the file. If you select No, it will be stored in the Application data folder.
The 'Advanced' tab
The Praat path option, as you would expect, lets you adjust the path to Praat. This is particularly useful if it is located in a non-standard place (especially on Windows and Linux).
The TextGrid encoding option lets you choose the encoding in which TextGrid files are read. By default, Praat uses the ASCII encoding for files that do not use any special characters (e.g. a label file written in English) and UTF-16 for files that do contain special characters (e.g. a transcription in French). In Praat, you can check (and modify) this setting under Praat > Preferences > Text writing preferences... By default, Dolmen assumes that files are encoded in UTF-16, but you can change it to UTF-8 or ASCII if it is not the case. Older formats like Windows ISO 8859-15 and Mac Roman are not supported.
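If you are unsure which encoding a given TextGrid uses, a short script can help you check before changing the preference. The Python sketch below is not part of Dolmen and the function name is hypothetical; it assumes that UTF-16 files start with a byte-order mark (BOM), and otherwise falls back to UTF-8, which also covers plain ASCII files.

```python
import codecs
from typing import Tuple


def read_textgrid_text(path: str) -> Tuple[str, str]:
    """Return (encoding, text) for a TextGrid file.

    Assumption: a UTF-16 file begins with a byte-order mark;
    anything else is decoded as UTF-8 (ASCII is a subset of UTF-8).
    """
    with open(path, "rb") as handle:
        raw = handle.read()
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec reads the BOM to pick the right byte order.
        return "UTF-16", raw.decode("utf-16")
    return "UTF-8", raw.decode("utf-8")
```

A file in one of the unsupported older encodings would typically raise a UnicodeDecodeError in this sketch, which is a hint that it needs to be converted first.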
The last option is the Application mode. Most users will use the Default mode. However, users working within the projects “Phonologie du français contemporain” (PFC) and “Phonologie de l’anglais contemporain” (PAC) should select the PFC and PAC mode respectively.
These special modes enable a number of features which are specific to those projects (and which were previously available in the PFC/PAC toolbox, which Dolmen supersedes).