Kratylos lets researchers upload lexical and corpus datasets from FieldWorks, Praat, ELAN and other software, and then browse the results. All user data must be cited as specified in the search results. To cite the program, please use the following:

Finkel, Raphael, and Daniel Kaufman. Kratylos. Computer software. Kratylos: Unified Linguistic Corpora from Diverse Data Sources. Version 2.0. University of Kentucky and Endangered Language Alliance, 1 June 2016. Web www.kratylos.org.

The creators can be contacted through the following links:
Raphael Finkel at the University of Kentucky and Daniel Kaufman at the Endangered Language Alliance. Kratylos is sponsored by grant #1500753 from the National Science Foundation under the DEL program.

Cats Claw

Computer-Assisted Technology Service
Computational Linguist's Automated Workbench

History

The Kratylos project started in 2012 as a collaboration between Daniel Kaufman at the Endangered Language Alliance (ELA, New York) and Raphael Finkel at the University of Kentucky (UKY, Lexington). Kratylos is intended as a means to present dictionary and text files with associated media from a variety of formats, including those created by FLEx, Praat, ELAN and Toolbox. Users can search over any combination of available language data as well as add their own data to the collection. Ultimately, users will also be able to create their own depositories by installing Kratylos on their own websites.

Since mid-2015, Kratylos has been supported by the NSF under grant #1500753. This grant has supported software development (Raphael Finkel, with a research assistant, Jiho Noh) and language fieldwork (Daniel Kaufman, with research assistants Ahmed Shamim, Lluvia Camacho-Cervantes and Daniel Barry).

Usage instructions

Kratylos presents three modes: upload, query, and profile. Every Kratylos web page has tabs at the top to allow users to switch modes.

Upload mode lets a user provide files for a language so that Kratylos can later display them. It requires that the user be registered and logged in. Query mode lets a user browse the currently available languages, displaying entries based on various kinds of search. Query mode does not require the user to be logged in, but private languages are only available for query to authorized and logged-in users. Profile mode (accessed by clicking on the user's logged-in name) allows a logged-in user to manipulate languages the user owns, changing authorizations and even deleting languages.

Languages have two-part names. The first part is typically the language itself, in Roman lower-case letters. The second part is a distinguishing mark, such as a version number or an informant name. Each language has an owner (the researcher who uploaded the data) and specified access privileges, such as "public" and "private". A single language may contain multiple data sets of different formats, such as FLEx dict and FLEx text, but it may not have multiple sets of the same format. Each language also has a provenance, which typically indicates, as HTML-formatted text, the name of the researcher(s) and the dates of acquisition.

Upload mode

In upload mode, a logged-in user presents a single file, which can be a compressed archive of several files. Several compression techniques are acceptable, including ZIP, gzip/tar, and bzip2/tar. Kratylos scrutinizes the individual component files of the archive to determine their type. It rejects any files that it cannot identify. Most data files are in XML format; Kratylos distinguishes various data sources (FLEx, TextGrid, and EAF) and treats each appropriately. Kratylos subdivides data (utterances, lexical entries, lines from text) into entries for display.
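The archive inspection described above can be illustrated with a minimal sketch. It is written in Python purely for illustration (Kratylos itself is written in Perl), and it is not the Kratylos code. The function names and the ROOT_TO_SOURCE table are hypothetical, and the sketch covers only data files: it recognizes LIFT and EAF files by their XML root element and Praat TextGrid files by their text header, and treats everything else as unidentified.

```python
import io
import tarfile
import zipfile
import xml.etree.ElementTree as ET

# Hypothetical table: XML root element -> data-source label.
ROOT_TO_SOURCE = {
    "lift": "FLEx lift",
    "ANNOTATION_DOCUMENT": "ELAN (EAF)",
}


def iter_members(path):
    """Yield (name, bytes) for every regular file in a ZIP or tar archive."""
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as zf:
            for info in zf.infolist():
                if not info.is_dir():
                    yield info.filename, zf.read(info)
    elif tarfile.is_tarfile(path):
        # Mode "r:*" lets tarfile detect gzip or bzip2 compression itself.
        with tarfile.open(path, "r:*") as tf:
            for member in tf.getmembers():
                if member.isfile():
                    yield member.name, tf.extractfile(member).read()
    else:
        raise ValueError("not a ZIP or tar archive")


def classify_member(data):
    """Guess the source of one component file; None means 'unidentified'."""
    head = data[:200].decode("utf-8", errors="replace")
    if 'Object class = "TextGrid"' in head:
        return "Praat TextGrid"
    try:
        root = ET.fromstring(data)
    except ET.ParseError:
        return None  # not well-formed XML
    return ROOT_TO_SOURCE.get(root.tag)


def scan_upload(path):
    """Split archive members into identified files and rejected files."""
    accepted, rejected = {}, []
    for name, data in iter_members(path):
        source = classify_member(data)
        if source is None:
            rejected.append(name)
        else:
            accepted[name] = source
    return accepted, rejected
```

Calling scan_upload on an archive path returns a dictionary that maps each recognized member to its guessed source, plus a list of the members that would be rejected.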

Entries are further divided into fields, such as "part of speech" or "gloss", multiple instances of which might appear in any entry. Uploaded media files can be in any recognizable audio or video format. The researcher may upload data in several stages if desired, first providing data files and then media files, for example. The researcher may also overwrite existing data; the upload form has a checkbox that indicates whether the researcher wishes to allow overwriting.

Query mode

When a user enters query mode, Kratylos first lists those languages that the user may view based on whether they are public and whether the user is logged in and is authorized to view the language. The user then selects one or more languages by clicking on checkboxes and presents a query. This query can be a string, a word, or a pattern. Kratylos converts strings and words (but not patterns) into Unicode Normalization Form D (Canonical Decomposition). Kratylos matches the query against all fields of all entries for all the chosen languages; it then displays the first n (typically 5, but the user can select a different count) entries that match the query. A string query matches anywhere within a field. A word query matches only full words, delimited by spaces, punctuation, or the boundary of a field. A pattern query uses Perl regular-expression (regex) syntax. At present, queries cannot match across fields, although that facility is planned. The result page that Kratylos shows for a query includes the query details, which the user may modify to submit a new query, along with entries.
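The three query kinds can be sketched as follows. This is an illustrative Python sketch, not the Kratylos implementation, and it rests on several assumptions: an entry is modeled as a plain dictionary from field name to one string stored in NFD, a "word" is a maximal run of letters, marks, and digits, and Python's re module stands in for Perl regular expressions (the two syntaxes differ in places). All function names are hypothetical.

```python
import re
import unicodedata


def nfd(text):
    """Convert text to Unicode Normalization Form D."""
    return unicodedata.normalize("NFD", text)


def split_words(text):
    """Split text into words: runs of letters, combining marks, and digits."""
    words, current = [], []
    for ch in text:
        if unicodedata.category(ch)[0] in "LMN":
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


def field_matches(value, query, kind):
    """Test one field value against a string, word, or pattern query."""
    if kind == "string":
        return nfd(query) in value               # anywhere within the field
    if kind == "word":
        return nfd(query) in split_words(value)  # full words only
    if kind == "pattern":
        return re.search(query, value) is not None  # pattern is not normalized
    raise ValueError("unknown query kind: " + kind)


def search(entries, query, kind, limit=5):
    """Return the first `limit` entries in which any field matches."""
    hits = []
    for entry in entries:
        if any(field_matches(v, query, kind) for v in entry.values()):
            hits.append(entry)
            if len(hits) == limit:
                break
    return hits
```

For example, search(entries, "shop", "word") would return entries that contain "shop" as a full word in some field, while search(entries, "sho", "string") would also accept "shop" and "shore".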

Once Kratylos displays language-specific results, the user may specify a field-specific filter for further queries; each option in the menu is a data type available for that language, such as "text" or "dict", followed by the field name. If a filter is specified, then pattern-style queries apply across fields. For example, the pattern query this .*that would match any entry that has the word "this" followed by "that" at any distance, even in a different field within the entry.
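One way to picture a pattern that applies across fields is to join the fields of an entry into a single string before matching. The Python sketch below makes that assumption; it does not model the filter menu, it is not taken from the Kratylos source, and the function name is hypothetical.

```python
import re


def entry_matches_across_fields(entry, pattern):
    """Match a regex against all field values of an entry, joined by spaces.

    `entry` is assumed to be a dict from field name to string.
    """
    joined = " ".join(entry.values())
    # DOTALL lets "." also span any newlines inside a field value.
    return re.search(pattern, joined, flags=re.DOTALL) is not None
```

With an entry whose text field is "say this" and whose gloss field is "and then that", the pattern this .*that matches the joined string even though neither field matches it alone.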

Usually, Kratylos displays entries using "linear display" format, in which each entry is formatted according to a standard for its type of data. Kratylos uses specific templates for each kind of data: FLEx (dict, lift, text), TextGrid, and ELAN. Some languages have specialty templates. When the user places the cursor over a field of the display, Kratylos displays the name of the field. The user can also click on any field; that field then becomes a new string query.

If the entry displays a speaker symbol 🔊, then Kratylos has an associated sound file that the user can hear by clicking on that symbol. Similarly, the entry can display an eye symbol 👁 to indicate a video file. If the audio or video associated with an entry is a segment of a longer media file, Kratylos shows a control after playing the segment so the user can hear/see it again or play earlier or later portions. If the audio or video is an entire file, Kratylos does not show the control; the user can simply click on the symbol to replay the file.

The results page also has a Hide/Reveal button that leads to a menu of field names; clicking on a field name causes the associated field to disappear or appear on the display.

There are two experimental features in the Kratylos linear display, accessed by green buttons at the end of each entry. The user can get a rendition of the entry either in rich-text format (RTF) or in LaTeX format. These features are in development and are likely to change. The RTF output should be viewable in word processors like LibreOffice. The LaTeX output is intended to be inserted in a file that has this prologue:

\documentclass{article}
\usepackage{url}
\usepackage[usenames]{xcolor}
\usepackage{fontspec}
\setmainfont{FreeSans}
\usepackage{expex}
\setlength{\tabcolsep}{0em}
\begin{document}

Because entries usually use non-Latin alphabets, it is best to process the resulting LaTeX file with XeLaTeX.

Another experimental feature in linear display is to switch from 1-click search to modify mode. In that mode, clicking on a field brings up a form in which the user may enter a modified version of the field. At present, however, this modification is not saved to the underlying database.

The other two query-display formats are "outline display", which presents each entry as an indented sequence of fields, and "KWIC display" (keyword in context), which shows just the field that matches the query along with the same field in several previous entries. KWIC display is useful if the entries comprise a narrative that spans more than one entry.
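The idea behind KWIC display can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the Kratylos code: entries are dictionaries from field name to string, the query is a plain substring, and the function name and the context parameter are hypothetical.

```python
def kwic(entries, query, context=2):
    """For each matching field, collect that field from preceding entries.

    Returns a list of (field name, lines), where `lines` holds the field's
    value in up to `context` previous entries followed by the matching entry.
    """
    results = []
    for i, entry in enumerate(entries):
        for name, value in entry.items():
            if query in value:
                start = max(0, i - context)
                lines = [e[name] for e in entries[start:i + 1] if name in e]
                results.append((name, lines))
    return results
```

Applied to a narrative stored one sentence per entry, each result shows the matching sentence preceded by the sentences that lead up to it.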

Profile mode

In profile mode, logged-in users may modify personal identification and access to the languages they have uploaded. This facility is under active development.

Behind the scenes

The implementation of Kratylos comprises several scripts written in Perl. The authors (as of 2016) are Raphael Finkel and Jiho Noh. The web server, Apache2, invokes these scripts using the Common Gateway Interface (CGI). The Perl scripts use several standard modules: CGI (and submodules Carp, Simple, and Session), HTML::Template, Digest::SHA, and JSON.

Kratylos converts uploaded data, if necessary, into a new XML format. For example, the EAF format, although in XML, is not divided into entries, so Kratylos reformats it into entries, each of which contains all the relevant tiers (such as headword, part of speech, and gloss) and a reference to the media file. It then applies a template to convert the XML into a Qddb (Quick and dirty database) representation. This representation stores all data in Unicode Normalization Form D (Canonical Decomposition). The template is format-specific and coordinates (1) the XML fields, described as XPath expressions, (2) the Qddb representation of those fields, which is hierarchical, and (3) the formatting that the linear display should employ for those fields, which involves Cascading Style Sheets (CSS). For instance, part of the template for FLEx lift format specifies that the XPath lift/entry/lexical-unit/form/@lang should have the Qddb field name HLanguage and should be displayed with a small blue font.
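The role of the template can be illustrated with a small Python sketch. It is a simplification under stated assumptions and not the Kratylos template format: the real Qddb representation is hierarchical, whereas this sketch produces a flat list of fields. Only the HLanguage row comes from the description above; the Headword and Gloss rows, the CSS strings, and the function names are hypothetical. Because Python's ElementTree does not select attributes through a path, each row stores the element path and the attribute name separately.

```python
import html
import unicodedata
import xml.etree.ElementTree as ET

# Each row: (path relative to <entry>, attribute or None, field name, CSS).
TEMPLATE = [
    ("lexical-unit/form", "lang", "HLanguage", "font-size: small; color: blue"),
    ("lexical-unit/form/text", None, "Headword", "font-weight: bold"),  # hypothetical
    ("sense/gloss/text", None, "Gloss", "font-style: italic"),          # hypothetical
]


def extract_entries(xml_text):
    """Turn LIFT-style XML into one list of (field, value, css) per entry."""
    root = ET.fromstring(xml_text)
    entries = []
    for entry in root.findall("entry"):
        fields = []
        for path, attr, name, css in TEMPLATE:
            for node in entry.findall(path):
                raw = node.get(attr) if attr else node.text
                if raw:
                    # Store every value in Normalization Form D.
                    fields.append((name, unicodedata.normalize("NFD", raw), css))
        entries.append(fields)
    return entries


def render_linear(fields):
    """Format one entry as a line of styled HTML spans."""
    return " ".join(
        '<span title="{}" style="{}">{}</span>'.format(
            html.escape(name), html.escape(css), html.escape(value)
        )
        for name, value, css in fields
    )
```

Keeping the path, the field name, and the styling in one row means that a new data format needs only a new table, not new extraction or display code.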

Although Kratylos could use Qddb itself to match queries, the databases are small enough that it can successfully apply a complete search. Some of the most complex parts of the software use the template to convert a matched entry into a displayable form.

The web pages that Kratylos presents to the user rely on the Bootstrap and jQuery libraries for layout and styling. The query results page also contains JavaScript code to convert entries on the fly into RTF and LaTeX.