Help - Formulating Search Criteria

In search mode, the entry you make in a cell to define the data to search for (your 'search criteria') is automatically interpreted as a regular expression (regex), unless it is preceded by an equals sign (=). The equals sign indicates that the data in the cell must exactly match the string of characters that follows it.

A regular expression describes a pattern of text, against which the sample data are matched. If the data fit the pattern, the match is true and the sample to which that piece of data belongs is included in the set of returned samples when the search is complete.

Regular expressions are a powerful way of searching text data. Suppose you are looking for a word which you know has variants, but are not sure what they are and/or not interested in separating them out. You need to describe a pattern specific enough that it will capture the word you want (and no others), but also general enough so it will not exclude any of the variants of that word. Regular expressions let you do that - and much more. In-depth descriptions and tutorials are widely available on the web (e.g. http://www.regular-expressions.info/).

RMS shorthands

RMS extends standard regex syntax by three simple conventions which are shortcuts for expressing the concepts "Does not match this regular expression", "Is exactly equal to this string" and "Is not exactly equal to this string".

=     is exactly equal to the following string

If you start your search string with the equal sign =, it will look for an exact match and an exact match only: rather than "contains this pattern", it will read: "is equal to this string". =raklo will find raklo only, not e.g. rakloro.

!=    is not exactly equal to the following string

If you start your search string with an exclamation mark followed by an equals sign, !=, the search will return all samples whose data do not exactly equal the string that follows (though they may contain it). !=raklo will find samples that have rakloro, čhavo, etc., but none that have exactly raklo.

!     does not match the following regular expression

If you start a regular expression with the exclamation mark !, the search will return all the samples which do not contain the pattern described by the regex. !raklo will find only samples that contain neither raklo itself nor any entry that includes it, such as rakloro or o raklo.
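
The three shorthands can be sketched in a few lines of Python. This is only an illustration of the rules described above, not the RMS implementation: the function name matches_criterion is hypothetical, and Python's re module stands in for whatever regex engine RMS actually uses.

```python
import re

def matches_criterion(criterion: str, value: str) -> bool:
    """Hypothetical helper: apply an RMS-style search criterion to one cell value."""
    # "!=" must be tested before "!" and "=", because it starts with "!".
    if criterion.startswith("!="):
        return value != criterion[2:]          # not exactly equal to the string
    if criterion.startswith("="):
        return value == criterion[1:]          # exactly equal to the string
    if criterion.startswith("!"):
        return re.search(criterion[1:], value) is None   # does not contain the pattern
    return re.search(criterion, value) is not None       # contains the pattern

print(matches_criterion("=raklo", "rakloro"))   # False
print(matches_criterion("!=raklo", "rakloro"))  # True
print(matches_criterion("!raklo", "o raklo"))   # False
```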

Regular Expressions

A regular expression is made up of some combination of normal characters and special characters.

An example of a regular expression containing no special characters might be ek. As a search criterion applied to the cell for "one" in the numerals section, ek will return all the samples where the data in that cell match the pattern ek. This will include samples where the entry for "one" is ek. It will also include samples where the entry for "one" is jek or ekh. But it will exclude samples that have ik or je, which do not contain the pattern.
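
To try the ek example outside RMS, a short Python sketch can filter the entries mentioned above. It assumes that Python's re.search behaves like the RMS "contains this pattern" search.

```python
import re

# Entries for "one" taken from the paragraph above.
entries = ["ek", "jek", "ekh", "ik", "je"]

# re.search succeeds if the pattern occurs anywhere in the entry.
hits = [entry for entry in entries if re.search("ek", entry)]
print(hits)  # ['ek', 'jek', 'ekh']
```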

Special Characters and Shorthands

Special characters are characters that have a special, non-literal meaning in a regular expression. There are eleven: ( [ . ? * + \ ^ $ | ).

The dot . for example represents any one character. If what you are looking for is an actual, literal dot, or full stop, you need to precede it with another special character, the backslash \, like so: \.. The backslash is also used to turn a following normal character into a special one. Sequences like \n or \t represent non-printable characters like a linefeed or a tab; \d or \w are shorthands to describe frequently-used character classes.
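
The difference between the special dot and the escaped, literal dot can be checked with a small Python sketch. The two sample strings are made up for the illustration and are not RMS data.

```python
import re

samples = ["a.b", "axb"]  # made-up strings, not RMS data

# Unescaped dot: any one character may stand between a and b.
any_char = [s for s in samples if re.search(r"a.b", s)]
# Escaped dot: only a literal full stop may stand between a and b.
literal_dot = [s for s in samples if re.search(r"a\.b", s)]

print(any_char)     # ['a.b', 'axb']
print(literal_dot)  # ['a.b']
```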

The special characters and shorthands you are most likely to need in a search of the RMS, including a couple of RMS-specific ones, are listed below, along with their meanings. This list is presented on screen in search mode.

\         escape character
[ ]       one character out of this character class
[^ ]      one character not out of this character class
\w        one word character (alpha-numeric characters and underscore)
.         any one character (including whitespace)
?         preceding character zero or one time
*         preceding character zero or more times
+         preceding character one or more times
{ }       preceding character a specified number of times
^         start of string
$         end of string
\b        word boundary
[[:<:]]   start of word
[[:>:]]   end of word
|         alternation (OR)
( )       grouping
!         does not match the following regular expression
=         is exactly equal to the following string
!=        is not exactly equal to the following string

Usage

Character classes

Character classes let you match one out of several characters. Simply place the characters you want to match between square brackets.

[ ]       one character out of this character class
[^ ]      one character not out of this character class

To find all the samples which have a long vowel in a particular cell, you might use [āēīōū]: the characters inside the square brackets make up a character class, in this case, the five long vowels in Romani. It doesn't matter what order the characters in a class appear in.

To find all the samples which do not have a long vowel in a particular cell, you would use the negated character class [^āēīōū].

A search asks whether any one of the characters in the class is a match. b[ij]av matches biav and bjav. It does not match bijav. bia[^v] matches biaw or biaf, but not biav, and not bia either: bia lacks the fourth character, anything other than v, that the pattern requires.
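
The two character-class examples can be reproduced with a few lines of Python. This is a sketch that assumes Python's re module treats character classes the same way as the RMS search; the helper name select is hypothetical.

```python
import re

def select(pattern, entries):
    """Hypothetical helper: keep the entries that contain the pattern."""
    return [entry for entry in entries if re.search(pattern, entry)]

# Exactly one of i or j must stand between b and av.
print(select(r"b[ij]av", ["biav", "bjav", "bijav"]))        # ['biav', 'bjav']

# A fourth character is required, and it must not be v.
print(select(r"bia[^v]", ["biaw", "biaf", "biav", "bia"]))  # ['biaw', 'biaf']
```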

Inside a character class, special characters, with the exception of ] \ ^ -, do not need to be escaped with the backslash. The complete set of punctuation marks is described by [,;:.!?].

Some predefined character classes are:

\w        one word character (alpha-numeric characters and underscore)
.         any one character (including all whitespace except linebreak)

Use \w to find samples that contain at least one word character in this cell - this excludes all samples which do not have any data in the cell. (You could also use . if you don't mind running the risk of including samples that have just some spurious spaces or tab characters in the cell.)

Because much of the data in the RMS actually represents sound, we have two extra character classes that might come in handy:

\v        one vowel character
\c        one consonant character

\v includes all the Romani short vowel [aeiouəy] and long vowel symbols [āēīōū]. It also includes accented vowel symbols [áéíóú] and [àèìòù], as well as a selection of vowel symbols which are sometimes found in borrowed items: [äëöüåæøœ].

\c includes symbols for the Romani sonorants [mnlłrřjw], stops [pbtdkg], non-sibilant fricatives [fvðθxɣh], and sibilant fricatives and affricates [szśźšžcč]. The transcription used in RMS means that some sounds we tend to think of as consonants in their own right - affricates (except "c" and "č"), aspirates, geminates - are represented by two characters in sequence. Such sequences are not part of the \c character class.
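
Outside RMS, \v and \c do not have these meanings (in Python's re module, for instance, \v is a vertical tab and \c is rejected). The sketch below spells the two classes out as ordinary character classes built from the symbol lists above. The function expand is hypothetical and deliberately naive: it is a plain text substitution and does not handle \v or \c inside square brackets or after an escaped backslash.

```python
import re

# Symbol lists copied from the descriptions of \v and \c above.
VOWELS = "aeiouəy" + "āēīōū" + "áéíóú" + "àèìòù" + "äëöüåæøœ"
CONSONANTS = "mnlłrřjw" + "pbtdkg" + "fvðθxɣh" + "szśźšžcč"

def expand(pattern: str) -> str:
    """Hypothetical helper: rewrite RMS-style \\v and \\c as explicit classes."""
    pattern = pattern.replace(r"\v", "[" + VOWELS + "]")
    return pattern.replace(r"\c", "[" + CONSONANTS + "]")

# raklo is consonant-vowel-consonant-consonant-vowel.
print(bool(re.search(expand(r"^\c\v\c\c\v$"), "raklo")))  # True
# The aspirate čh is written as two consonant characters.
print(bool(re.search(expand(r"^\c\c\v"), "čhavo")))       # True
print(bool(re.search(expand(r"^\c\c"), "raklo")))         # False
```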

Quantifiers

Three of the special characters determine how often a preceding character (or, more typically, character class) should be matched.

?         preceding character zero or one time

Use the question mark ? when you want to make the preceding character optional. For example, dž? matches d and dž. The same goes for a character class: d[jzž]? finds e.g. avdzi, avdžin, adjin.
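
A short Python sketch shows what the optional quantifier consumes. Note that in a "contains" search d[jzž]? finds every entry that has a d at all; the .group() call below only shows how much of the entry the pattern takes up. Python's re module is assumed to behave like the RMS search here.

```python
import re

optional = re.compile(r"dž?")
# fullmatch requires the whole string to fit the pattern.
print(bool(optional.fullmatch("d")))   # True
print(bool(optional.fullmatch("dž")))  # True

# The optional class member is consumed when it is present.
matched = [re.search(r"d[jzž]?", word).group() for word in ["avdzi", "adjin"]]
print(matched)  # ['dz', 'dj']
```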

You would not expect a sequence of more than two of the same character to occur in a row in Romani. That being the case, you can afford to be a little less specific.

+         preceding character one or more times
*         preceding character zero or more times

The plus sign + indicates that the preceding character occurs an indefinite number of times but at least once. šun+el would find šunel, šunnel, šunnnnel etc.

The asterisk * indicates that the preceding character occurs an indefinite number of times or even not at all: sav*o matches savo, savvo, savvvvvo etc., and sao.

Both the plus sign and the asterisk tend to be most useful following character classes or more complex groupings. With k\w+k you can find words that have two ks separated from each other by at least one word character \w+; for example kakava or akavka but not jakka. k\w*k includes jakka.
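
The contrast between + and * in the k\w+k and k\w*k examples can be verified with a small Python sketch, again assuming that Python's re module matches the way the RMS search does. The pattern sav+o is not from the text above; it is added only to contrast with sav*o.

```python
import re

words = ["kakava", "akavka", "jakka"]

# + needs at least one word character between the two ks.
plus_hits = [w for w in words if re.search(r"k\w+k", w)]
# * also accepts two ks with nothing in between.
star_hits = [w for w in words if re.search(r"k\w*k", w)]

print(plus_hits)  # ['kakava', 'akavka']
print(star_hits)  # ['kakava', 'akavka', 'jakka']

print(bool(re.fullmatch(r"sav*o", "sao")))  # True: v may be absent
print(bool(re.fullmatch(r"sav+o", "sao")))  # False: at least one v is required
```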

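{ }       preceding character a specified number of times

You can specify exactly how many times to match a character (or class, or group) with a number written inside curly brackets. \w{2} matches a run of exactly two word characters, and \w{3,5} matches a run of between three and five word characters. Because a search looks for the pattern anywhere in the entry, longer runs also contain a match; add word boundaries or anchors (see Anchors), as in \b\w{2}\b, to restrict the search to words of exactly that length.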
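
The following Python sketch illustrates counted repetition. It uses re.fullmatch to test the length of a whole entry and re.search to show that an unanchored pattern also succeeds inside a longer word. The helper name has_length is hypothetical, and Python's re module is assumed to be a fair stand-in for the RMS engine.

```python
import re

def has_length(pattern: str, word: str) -> bool:
    """Hypothetical helper: True if the whole word fits the pattern."""
    return re.fullmatch(pattern, word) is not None

print(has_length(r"\w{2}", "ek"))         # True
print(has_length(r"\w{2}", "jek"))        # False
print(has_length(r"\w{3,5}", "raklo"))    # True
print(has_length(r"\w{3,5}", "rakloro"))  # False

# Unanchored, the pattern finds two word characters inside a longer word.
print(re.search(r"\w{2}", "rakloro").group())  # ra
```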

Anchors

An anchor refers to a position, rather than a character.

\b        word boundary
\B        not word boundary

Suppose you want to search for all the samples where a particular example phrase contains the 1PL NOM pronoun ame. ame by itself, of course, will also return phrases that contain some OBL form like amenge, not to mention kames, amerika, ... By specifying the word boundary ame\b, you eliminate any rogue endings. Specify the word boundary at the beginning too, \bame\b, and you make sure to find only the exact word form ame and not jame or the like.
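
The effect of adding word boundaries step by step can be seen in a short Python sketch that uses the forms mentioned above. It assumes that \b in Python's re module marks the same positions as \b in the RMS search.

```python
import re

forms = ["ame", "amenge", "kames", "amerika", "jame"]

def select(pattern):
    """Keep the forms that contain the pattern."""
    return [form for form in forms if re.search(pattern, form)]

print(select(r"ame"))      # all five forms
print(select(r"ame\b"))    # ['ame', 'jame']
print(select(r"\bame\b"))  # ['ame']
```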

A lot of cells in the RMS are intended for single-word entries, so that the terms word and string become synonymous. You can always use the word boundary anchor \b to refer to the start and end of the entry, even if it is just the one word; or you can use the string start and end anchors:

^         start of string
$         end of string

ni finds samples that have ni in the word for "nobody": nikon, nijek, khonik, etc. Now suppose you only want samples where the word for "nobody" begins with ni: ^ni ensures that the expression ni is matched only at the start of the string, thus excluding something like khonik. Note how ^ni will also exclude something like konik ni, which \bni would have picked up (the second ni in konik ni follows a word boundary).
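
A Python sketch makes the difference between the bare pattern, the string anchor and the word boundary concrete, using the entries discussed above. As before, Python's re module is only assumed to mirror the RMS behaviour.

```python
import re

entries = ["nikon", "nijek", "khonik", "konik ni"]

contains = [e for e in entries if re.search(r"ni", e)]
string_start = [e for e in entries if re.search(r"^ni", e)]
word_start = [e for e in entries if re.search(r"\bni", e)]

print(contains)      # ['nikon', 'nijek', 'khonik', 'konik ni']
print(string_start)  # ['nikon', 'nijek']
print(word_start)    # ['nikon', 'nijek', 'konik ni']
```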

\b[vj]?[oe]j[.?!\s]*$ finds samples where some phrase has a variant of the 3SG.F pronoun "she" (such as oj, voj, jej, ...) in phrase-final position, disregarding potential sentence-final punctuation marks and whitespace. kaj si voj ? for example is a match, whereas kaj jej isy? is not.
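
The phrase-final pattern can be tested against the two example phrases in Python. This is a sketch under the assumption that \s inside a character class and the $ anchor work in RMS as they do in Python's re module.

```python
import re

# Optional v or j, then o or e, then j, then only punctuation or whitespace up to the end.
phrase_final = re.compile(r"\b[vj]?[oe]j[.?!\s]*$")

print(bool(phrase_final.search("kaj si voj ?")))  # True
print(bool(phrase_final.search("kaj jej isy?")))  # False: jej is not phrase-final
```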

Alternation

|         alternation (OR)

The pipe symbol | allows you to search for two or more very different patterns in one go. For example, RMS groups possessive pronouns into five variant types: full, regular, syncopated, r-final, and minimal. If you are looking for all the samples that have either full or syncopated or r-final variants, you could do three separate searches, save the results of each and then combine them with logical OR. But since you are searching in the same cell, the regex alternation syntax is the much easier alternative, in this case: full|syncopated|r-final.

In fact, since in this particular example we know exactly what the variant types can be - one of the five above-mentioned - we could be a lot less specific in our regex and reduce it to a bare f|s (or indeed [fs]): f matches both full and r-final (and none of the others), and s adds syncopated (and none of the others).
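
That the three spellings select the same variant types can be checked with a few lines of Python, assuming the cell holds exactly one of the five type names listed above.

```python
import re

variant_types = ["full", "regular", "syncopated", "r-final", "minimal"]

def select(pattern):
    """Keep the variant types that contain the pattern."""
    return [t for t in variant_types if re.search(pattern, t)]

print(select(r"full|syncopated|r-final"))  # ['full', 'syncopated', 'r-final']
print(select(r"f|s"))                      # ['full', 'syncopated', 'r-final']
print(select(r"[fs]"))                     # ['full', 'syncopated', 'r-final']
```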

Grouping

You can group characters by using the round brackets ( ), much in the fashion of math formulas.

( )       grouping

In de(ve)?l, the entire two-character sequence ve is made optional by surrounding it with round brackets before following it with the zero-or-once quantifier ?. The expression finds strings containing del or devel. The quantifier's scope extends to the entire pattern inside the round brackets, so devl is not a match.
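
The scope of the quantifier is easiest to see by comparing the grouped pattern with an ungrouped one. The pattern deve?l is not from the text above; it is added here only for contrast, and Python's re module is assumed to group the same way the RMS search does.

```python
import re

words = ["del", "devel", "devl"]

# Grouped: the whole sequence ve is optional.
grouped = [w for w in words if re.search(r"de(ve)?l", w)]
# Ungrouped: only the second e is optional, the v is required.
ungrouped = [w for w in words if re.search(r"deve?l", w)]

print(grouped)    # ['del', 'devel']
print(ungrouped)  # ['devel', 'devl']
```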

Anything can appear inside round brackets. \btu\b(.+ \w+)*s\b (in effect \btu .+ \w+s\b) finds any phrase where there is at least one word between the word tu and a word ending in s. tu na sanas is a match; tu keres is not. By contrast, \btu\b.+s\b finds both.
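
The grouped pattern and the simpler one can be compared on the two example phrases in Python. The patterns are written with a single backslash before each b, and Python's re module is assumed to behave like the RMS search.

```python
import re

phrases = ["tu na sanas", "tu keres"]

# Requires at least one word between tu and the word ending in s.
with_gap = [p for p in phrases if re.search(r"\btu\b(.+ \w+)*s\b", p)]
# Only requires some later word ending in s.
any_later = [p for p in phrases if re.search(r"\btu\b.+s\b", p)]

print(with_gap)   # ['tu na sanas']
print(any_later)  # ['tu na sanas', 'tu keres']
```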