Kratylos is a facility that lets researchers upload lexical and corpus datasets from FieldWorks, Praat, ELAN, and other software and then browse the results. All user data must be cited as specified in the search results. To cite the program, please use the following:

Finkel, Raphael, and Daniel Kaufman. Kratylos. Computer software. Kratylos: Unified Linguistic Corpora from Diverse Data Sources. Version 2.0. University of Kentucky and Endangered Language Alliance, 1 June 2016. Web www.kratylos.org.

The creators can be contacted through the following links:
Raphael Finkel at the University of Kentucky and Daniel Kaufman at the Endangered Language Alliance. Kratylos is sponsored by grant #1500753 from the National Science Foundation under the DEL program.


Cats Claw

Computer-Assisted Technology Service
Computational Linguist's Automated Workbench

History

The Kratylos project started in 2012 as a collaboration between Daniel Kaufman at the Endangered Language Alliance (ELA, New York) and Raphael Finkel at the University of Kentucky (UKY, Lexington). Kratylos is intended as a means to present dictionary and text files with associated media from a variety of formats, including those created by FLEx, Praat, ELAN and Toolbox. Users can search over any combination of available language data as well as add their own data to the project. Ultimately, users will also be able to create their own depositories by installing Kratylos on their own websites.

Since mid-2015, Kratylos has been supported by the NSF under grant #1500753. This grant has supported software development (Raphael Finkel, with a research assistant, Jiho Noh) and language fieldwork (Daniel Kaufman, with research assistants Ahmed Shamim, Lluvia Camacho-Cervantes and Daniel Barry).

Data organization

Data are organized in two levels: language and collection. The language is recorded in Roman lower-case letters (without spaces or punctuation). The collection name is a single word, typically representing a text title, a version number, or an informant name. Together, a particular language/collection is called a project. Each project is associated with the researcher who has uploaded its data, called the project maintainer. The maintainer may upload new files at any time; files with the same name as old files overwrite the old files. Uploaded data may be accessed (searched and displayed), but not modified.

A project contains both data and metadata. The data comprise multiple uploaded data files, which can be of multiple datatypes, including Fieldworks Explorer LIFT and TEXT, Praat TextGrid, ELAN EAF, and Toolbox. Kratylos digests these files into an internal representation suitable for query and display. The metadata include the "official" language name (if any), the public/private choice, any collaborators, the provenance (free-form text, typically indicating the names of the researcher and consultant(s), and the place and date of acquisition), the researcher(s) responsible for collecting the data (the creator), and, if applicable, a URL pointing to the original data. Kratylos uses the metadata to determine access rights and to construct citations for query results.

The maintainer can decide whether the project is to be public or private. Public projects are accessible by anybody. Private projects are only accessible by the maintainer and by collaborators to whom the maintainer has given permission.

Kratylos segments each project into entries. An entry is typically a single lexical item (for a lexicon, such as FLEx DICT), a small text (EAF), timing interval (Praat), or phrase (FLEx TEXT).

Users

Unregistered users and registered users who are not currently logged in are considered anonymous. Anonymous users may access all public projects. Registered users who are logged in may access public projects and private projects they maintain or for which they are listed as collaborators.
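The access rules above can be summarized in a few lines of code. This is only a sketch of the rules as stated here, not the Kratylos implementation: the Project class, its field names, and the can_access function are hypothetical, and users are identified by email address because Kratylos treats that as the unique personal identifier.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Project:
    """Hypothetical record holding the metadata that governs access."""
    language: str
    collection: str
    maintainer: str                      # email address of the maintainer
    public: bool = True
    collaborators: Set[str] = field(default_factory=set)


def can_access(project: Project, user_email: Optional[str]) -> bool:
    """Return True if the user may search and display the project.

    user_email is None for anonymous users (unregistered or not logged in).
    """
    if project.public:
        return True
    if user_email is None:
        return False
    return user_email == project.maintainer or user_email in project.collaborators


private = Project("tagalog", "stories", "maintainer@example.org", public=False,
                  collaborators={"helper@example.org"})
print(can_access(private, None))                  # anonymous user
print(can_access(private, "helper@example.org"))  # listed collaborator
```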

Querying data

To query within a project or set of projects, first go to the Project tab at the top of the page. This tab displays information about all the languages you can access; the list depends on your logged-in status. If you click on a language, Kratylos shows the titles of the projects for that language that share metadata such as maintainer, data type, and accessibility. You can select (and later unselect) all those projects, or you can scroll through the projects and select those that interest you. Once you have selected projects of interest, click at the top to move to the query page.

You can also click a project's information icon to load a new window or tab with specific information about that project. If you have the necessary privilege, you can then modify metadata, such as provenance, researcher, and data website, for this project alone or for all related projects. You can also add and remove collaborators (they are allowed to modify metadata but not upload new data) and switch visibility for the project (and related projects, if you want) between public and private.

You can also go directly to a project in "continuous" mode, which lets you read the project's entries from the start.

Once Kratylos displays the query page, you can type in a query. The query can include any Unicode character. For convenience, if the selected projects contain data with non-ASCII characters, Kratylos displays a keyboard icon that you can click to bring up a keyboard specialized to those characters. Kratylos converts queries into Unicode Normalization Form D (Canonical Decomposition), so you can use precomposed non-ASCII characters if you like.
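To illustrate why the normalization matters, here is a small sketch using Python's standard unicodedata module (Kratylos itself is written in Perl; the function name to_nfd is ours). A precomposed character and its decomposed equivalent are different strings until both are brought into Form D.

```python
import unicodedata


def to_nfd(text: str) -> str:
    """Return text in Unicode Normalization Form D (canonical decomposition)."""
    return unicodedata.normalize("NFD", text)


precomposed = "ma\u00f1ana"   # n-with-tilde as a single code point
decomposed = "man\u0303ana"   # plain n followed by a combining tilde

# The raw strings differ, but they agree once both are in Form D.
print(precomposed == decomposed)
print(to_nfd(precomposed) == to_nfd(decomposed))
```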

When you submit a query (by clicking the submit button), Kratylos displays the first n entries in the selected projects that match the query; n is typically 5, but you can select a different count. A query has one of these forms:

  • A string query matches anywhere within a field. Adjacent fields in the data are considered separated by a single space, so a string query such as the man matches not only that exact string within one field but also a field ending in the followed by a field beginning with man, such as mannerism.
  • A word query is like a string query, but it matches only full words, delimited by spaces, punctuation, or the boundary of a field.
  • A pattern query uses Perl regular-expression (regex) syntax. Pattern queries are not converted to Unicode Normalization Form D. Regex patterns can be quite complex and difficult to debug. Kratylos provides a query builder for complex patterns that search for two targets while avoiding specific intervening and adjacent elements.
  • A multi-tier query is composed of nested units. A unit has the form <tierName content>. The content can be a Perl pattern, a nested tier, or empty. To see the relevant tier names, you can switch to Outline format. If a tier is nested within another tier, you must include the outer tier as well. For instance, if the outline of tiers looks like this:
    				Group
    					text
    					basicMorpheme
    					gloss
    				
    you can specify a gloss of foo by this unit: <Group <gloss foo>>. If you put a * after the tier name, Kratylos interprets it as "any subsequent instance of this tier". Use an empty content to force Kratylos to skip an instance of the tier. Here is a complex example based on the structure of flextext tiers:
    	<Segnum 16><Word* <Morpheme <Citation <CF feta>>><Morpheme <Morph -re>>>
    This multi-tier search pattern looks for an element with Segnum matching the pattern 16, any Word with a first Morpheme with first Citation with first CF matching the pattern feta, followed directly by a Morpheme with first Morph matching the pattern -re.
  • A boolean query is composed of individual patterns separated by operators AND, OR, and NOT (which can be written as && || and !).
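The difference between the first three query forms can be sketched as follows. This is an illustration under stated assumptions, not Kratylos code: the function names are hypothetical, an entry is modeled as a plain list of field strings, and Python's re module stands in for Perl regular expressions, whose syntax differs in some details.

```python
import re


def join_fields(fields):
    """Adjacent fields are treated as if separated by a single space."""
    return " ".join(fields)


def string_match(query, fields):
    """String query: the text may occur anywhere, even across a field boundary."""
    return query in join_fields(fields)


def word_match(query, fields):
    """Word query: like a string query, but the match must not touch word characters."""
    pattern = r"(?<!\w)" + re.escape(query) + r"(?!\w)"
    return re.search(pattern, join_fields(fields)) is not None


def pattern_match(regex, fields):
    """Pattern query: the query is itself a regular expression."""
    return re.search(regex, join_fields(fields)) is not None


entry = ["the", "mannerism"]
print(string_match("the man", entry))   # crosses the field boundary
print(word_match("the man", entry))     # "man" is only part of a word here
print(pattern_match(r"man+er", entry))
```

Boolean queries would combine the results of such individual matches with AND, OR, and NOT.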

You may choose to apply the query ignoring accent marks, so a word manana would match an entry containing mañana.
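One common way to ignore accent marks, shown here as a sketch rather than as the method Kratylos uses, is to decompose both strings and drop the combining characters before comparing. The function names are hypothetical.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose text and remove combining marks such as tildes and acute accents."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def matches_ignoring_accents(query: str, field: str) -> bool:
    """String match in which accent marks on either side are disregarded."""
    return strip_accents(query) in strip_accents(field)


print(strip_accents("ma\u00f1ana"))
print(matches_ignoring_accents("manana", "hasta ma\u00f1ana"))
```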

You may specify a field-specific filter for string, word, pattern, and boolean queries; each option in the menu is a tier name followed by the datatype to which it applies.

Instead of seeing full, formatted entries, you can ask for a simple summary of the results, showing only the number of entries that match the query and the total number of matches (a single entry can match multiple times).

Query results

After executing a query, Kratylos displays the query details, which you may modify to submit a new query, and a list of selected projects and datatypes in which Kratylos has found matches, each with entries that match the query.

Each project with matches has a header indicating the language, the project, the datatype, and the provenance of that project. If there are multiple matching projects, the header also contains a symbol that leads to a menu allowing you to adjust the positioning of multiple projects in the result. The header also displays a button you can click to obtain a citation for the project in BibTeX, APA (American Psychological Association), or simple URL style. The browser copies the citation into the selection buffer so you can paste it into documents.

Kratylos emphasizes the part of each entry that matches the query by applying a yellow background, although it can't do that for matches that cross field boundaries.

Kratylos initially displays entries using linear format, in which each entry is formatted according to a standard for its datatype. Some datatypes, such as EAF and Toolbox, have project-specific templates. When you place the cursor over a field of the display, Kratylos displays the name of the field. You may click on any field in the result, calling up a menu:

  • Hide tier: Stop showing this tier in the query results. This choice is remembered across searches.
  • Show all tiers: Show all the tiers in the data, even ones that are normally hidden. This choice is remembered across searches.
  • Restore default visibility: Hide tiers that are ordinarily hidden, and show tiers that are ordinarily shown. This choice is remembered across searches.
  • Query this value: Submit a new word query based on the content of this field.

If the entry displays an audio symbol, then Kratylos has an associated media file that you can play by clicking on that symbol. If the media associated with an entry is a segment of a longer media file, Kratylos shows a control after playing the segment so you can play it again or play earlier or later portions. If the media comprise an entire file, Kratylos does not show the control; you can simply click on the symbol to replay the file.

If you want to see query results in an outline format, change the mode for an entry by clicking on the button on the right.

Some data represent a narration. Users who wish to see subsequent or previous entries in a narration can click on the button and choose continuous mode, which begins by showing the entry in which you click the button and then allows you to move forward (more results) or backward (earlier results).

Each query result has an export button, which allows you to generate a representation of the entry as LaTeX source (either for the expex package or the linguex package), as a PNG image, or as unformatted text. The LaTeX and unformatted-text formats copy text to the selection buffer, from which you can paste it into documents; the PNG format is delivered as a downloaded image. The LaTeX expex and linguex outputs are intended to be placed in a file that has this preamble:

\documentclass{article}
\usepackage{url}
\usepackage[usenames]{xcolor}
\usepackage{fontspec}
\setmainfont{FreeSans}
\usepackage{expex} % or \usepackage{linguex}
\begin{document}
			

Because entries often use non-Latin alphabets, it is best to process the resulting LaTeX file with XeLaTeX.

Maintainers may add an annotation to any entry; the annotation can consist of text, images, audio, and video. Users may view the annotation by clicking either the annotation button or the menu button, which presents a menu including "hide/show annotation".

Logged-in users may submit feedback to a project maintainer by clicking the menu button and then selecting "provide feedback". Kratylos displays a form that names the project and the query and prompts for a message. When you send the message, Kratylos emails it to the maintainer and includes a PNG image of the entry.

Profile manipulation

Logged-in users may see their profile by clicking on their email address at the top right of any page. They may edit their personal information: name, affiliation, country, website, and any other information (free text) they wish to share. They cannot change their email address, because Kratylos uses that information as a unique personal identifier.

Uploading data

To upload data, click on the Upload tab at the top of the page. Registered users can create new projects (for which they become the maintainer), upload data to those projects, and establish metadata for the projects. If a project contains multiple data files of the same datatype, they should follow identical structure. For example, if there are multiple Toolbox files, they should use the same tags. Similarly, if there are multiple ELAN EAF files, they should have the same tier names. Otherwise, the maintainer should introduce separate projects for the different formats. However, a single project may contain multiple datatypes, such as FLEx DICT and FLEx TEXT.

The uploader splits FLEx text files into multiple projects, one for each title. These projects are considered related and share the same metadata. Lists of projects group related projects together, allowing users to search all or some of them. Maintainers can modify the metadata of related projects in a single update, or they can choose to modify metadata on a project-by-project basis.

A user may upload a project in several steps, each time uploading a single file. Once a project has its first file, Kratylos displays its metadata on the upload page, so the maintainer need not re-enter it for further uploads, although the maintainer may modify it. The uploaded file may be a compressed archive (ZIP, gzip/tar, or bzip2/tar) of several files. Kratylos scrutinizes the individual component files of the archive to determine their type. It rejects any files that it cannot identify. File names are significant; if the maintainer submits a file with the same name and datatype as a previous one, the previous one is deleted in favor of the new one.
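The handling of an uploaded archive can be pictured with the sketch below. It is not the Kratylos uploader: the function names are hypothetical, and recognizing datatypes by file extension is a simplifying assumption (the KNOWN_TYPES table is ours, and Toolbox files are omitted because they have no single conventional extension). The real uploader inspects the component files themselves.

```python
import os
import tarfile
import zipfile

# Hypothetical lookup table; extension-based detection is an assumption.
KNOWN_TYPES = {
    ".lift": "FLEx LIFT",
    ".flextext": "FLEx TEXT",
    ".textgrid": "Praat TextGrid",
    ".eaf": "ELAN EAF",
}


def archive_members(path):
    """List the regular files inside a ZIP or tar archive, or the file itself."""
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as zf:
            return [name for name in zf.namelist() if not name.endswith("/")]
    if tarfile.is_tarfile(path):
        # Mode "r" lets tarfile detect gzip or bzip2 compression on its own.
        with tarfile.open(path, "r") as tf:
            return [member.name for member in tf.getmembers() if member.isfile()]
    return [path]


def classify(names):
    """Split file names into recognized files (with datatype) and rejected files."""
    accepted, rejected = {}, []
    for name in names:
        extension = os.path.splitext(name)[1].lower()
        if extension in KNOWN_TYPES:
            accepted[name] = KNOWN_TYPES[extension]
        else:
            rejected.append(name)
    return accepted, rejected
```

A caller would pass the uploaded file to archive_members and then hand the resulting names to classify, reporting the rejected names back to the maintainer.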

Some data files have associated media, either audio or video. The maintainer may upload them in any recognizable format, typically after uploading and viewing the rest of the data. Media files should have names (not including any format-specific extension, such as MP3 or WAV) according to these rules:

  • Fieldworks Explorer TEXT: Title-Segnum. If a project (after the uploader splits it) contains many titles, typically in several languages, use the first one.
  • Fieldworks Explorer LIFT: As specified in the pronunciation media tag (omitting pathname)
  • Praat TextGrid: same name as the TextGrid XML file, up to the first dot, if any (omitting pathname)
  • ELAN EAF: As specified in the EAF XML file (omitting pathname)
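Two of the naming rules above are mechanical enough to express in code. The helper names below are hypothetical, and the sketch covers only the TextGrid and TEXT rules; for LIFT and EAF the expected name comes from inside the data file.

```python
import os


def textgrid_media_stem(textgrid_path: str) -> str:
    """Praat rule: the TextGrid file name up to the first dot, without the pathname."""
    return os.path.basename(textgrid_path).split(".", 1)[0]


def flex_text_media_stem(title: str, segnum: int) -> str:
    """Fieldworks Explorer TEXT rule: Title-Segnum."""
    return f"{title}-{segnum}"


def media_stem(media_path: str) -> str:
    """Name of a media file without pathname or format-specific extension."""
    return os.path.splitext(os.path.basename(media_path))[0]


# A media file belongs to a TextGrid when the two stems agree.
print(media_stem("audio/story1.wav") == textgrid_media_stem("grids/story1.TextGrid"))
```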

Behind the scenes

This is Version 2 of Kratylos; Version 1 was limited solely to Fieldworks Explorer dictionaries and text files. The implementation of Kratylos comprises several scripts written in Perl. The implementers are Raphael Finkel and Jiho Noh. The web server, Apache2, invokes these scripts on a computer running the Linux operating system, using the Common Gateway Interface (CGI). The Perl scripts use many modules archived at CPAN (the Comprehensive Perl Archive Network), including Carp, CGI, CGI::Carp, CGI::Session, Crypt::JWT, Data::Dumper, DataTables, Data::UUID, DBI, Digest::MD5, Digest::SHA, Email::Valid, Encode, Eval::Logic, Fcntl, File::Basename, File::HomeDir, File::Path, File::Spec, HTML::Entities, HTML::Template, IO::Handle, JSON, Log::Log4perl, LWP::UserAgent, MIME::Base64, SendEmail, Storable, Sys::Hostname, Text::Slugify, Unicode::Normalize, and URI::Escape.

Kratylos processes uploaded data in several steps.

  1. Each language has its own directory; Kratylos builds the language directory if needed.
  2. Each project within a language has its own directory; Kratylos builds the project directory if needed.
  3. Within the project directory, Kratylos stores all raw uploaded data in a subdirectory. Maintainers should not treat Kratylos as an archiving facility, because Kratylos does not provide a mechanism to retrieve the raw data.
  4. If necessary, Kratylos converts the uploaded data into its own datatype-specific XML format. For example, the ELAN EAF format, although in XML, is not divided into entries, so the Kratylos uploader reformats it into entries, each of which contains all the relevant tiers (such as headword, part of speech, and gloss) and a reference to the media file.
  5. The Kratylos uploader converts all uploaded media files to Ogg/Vorbis for audio and Ogg/Theora for video. It then stores the converted files in a media subdirectory of the project directory, discarding the original media files. This conversion compresses large media files (Vorbis uses far less space than WAV) and puts them in an Ogg container, which allows for accurate direct access to particular timestamps, unlike some other containers. These formats are free and require no licensing fees.
  6. Kratylos builds a Qddb (Quick and dirty database) directory for each datatype in the project. In it, Kratylos stores all data in Unicode Normalization Form D (Canonical Decomposition) and in a Qddb-specific format. The format is based on a tripartite datatype description called a template, which coordinates (1) the XML fields, described as XPath expressions, (2) the Qddb representation of those fields, which is hierarchical, and (3) the formatting that the linear display should employ for those fields, which involves Cascading Style Sheets (CSS). For instance, part of the template for Fieldworks Explorer LIFT datatype specifies that the XPath lift/entry/lexical-unit/form/@lang should have the Qddb field name HLanguage and should be displayed with a small blue font.
  7. Kratylos stores user data and project metadata in a MySQL database, which has the following tables:
    collaborators
    country_code
    languages
    projects
    users
    users_aboutme
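The template described in step 6 ties together three things: where a value lives in the XML, what the field is called in Qddb, and how it should be displayed. The sketch below models one such rule using the LIFT example given in that step. The TemplateRule class, the CSS string, and the sample XML are made up for illustration, and the XPath attribute step (@lang) is split into an element path plus an attribute name because Python's ElementTree cannot select attributes directly.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass(frozen=True)
class TemplateRule:
    """Hypothetical three-part rule: XML location, Qddb field name, display style."""
    element_path: str   # path below the root <lift> element
    attribute: str      # attribute to read, standing in for the @lang step
    qddb_field: str
    css: str


RULES = [
    TemplateRule("entry/lexical-unit/form", "lang", "HLanguage",
                 "font-size: small; color: blue"),
]

# Made-up sample data, not taken from any Kratylos project.
SAMPLE = """<lift><entry><lexical-unit>
<form lang="tgl"><text>bahay</text></form>
</lexical-unit></entry></lift>"""


def apply_template(xml_text, rules):
    """Return (Qddb field name, value, CSS) triples for every rule that matches."""
    root = ET.fromstring(xml_text)
    rows = []
    for rule in rules:
        for element in root.findall(rule.element_path):
            value = element.get(rule.attribute)
            if value is not None:
                rows.append((rule.qddb_field, value, rule.css))
    return rows


print(apply_template(SAMPLE, RULES))
```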
    					

Kratylos uses Qddb format as a searchable representation to execute queries and format their results. In most cases, it searches by a complete scan of the data, because the databases are small enough to make this method efficient. For word searches, however, Kratylos relies on Qddb's own search facilities, which makes looking up words in large lexicons much faster than a complete scan.
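The trade-off between a complete scan and a word lookup can be illustrated with a small inverted index. This is a generic sketch of the idea, not a description of how Qddb stores or searches its data; all names are hypothetical.

```python
import re
from collections import defaultdict


def build_word_index(entries):
    """Map each lower-cased word to the set of entry numbers that contain it."""
    index = defaultdict(set)
    for entry_id, text in enumerate(entries):
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(entry_id)
    return index


def scan(entries, substring):
    """Complete scan: examine every entry, which suits arbitrary substrings."""
    return [i for i, text in enumerate(entries) if substring in text]


def lookup(index, word):
    """Word search: a single dictionary lookup, independent of the lexicon size."""
    return sorted(index.get(word.lower(), ()))


entries = ["the man", "the mannerism", "a man and a dog"]
index = build_word_index(entries)
print(scan(entries, "man"))
print(lookup(index, "man"))
```

The scan finds the substring in all three entries, while the word lookup returns only the entries in which man stands as a full word.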

The Kratylos web pages contain a significant amount of CSS and JavaScript, some that we have built and some from third-party libraries: Bootstrap for general layout and typography, jQuery to access the components of pages, DataTables to provide lists of projects, Plyr to play media, and Alertify to provide ephemeral feedback. We use W3C online validation to ensure that Kratylos web pages conform to standards.