Kratylos lets researchers upload lexical and corpus datasets from FieldWorks, Praat, ELAN and other software and browse the results. All user data must be cited as specified in the search results. To cite the program, please use the following:

Finkel, Raphael, and Daniel Kaufman. Kratylos. Computer software. Kratylos: Unified Linguistic Corpora from Diverse Data Sources. Version 2.0. University of Kentucky and Endangered Language Alliance, 1 June 2016. Web www.kratylos.org.

The creators can be contacted through the following links:
Raphael Finkel at the University of Kentucky and Daniel Kaufman at the Endangered Language Alliance. Kratylos is sponsored by grant #1500753 from the National Science Foundation under the DEL program.


Cats Claw

Computer-Assisted Technology Service
Computational Linguist's Automated Workbench

History

The Kratylos project started in 2012 as a collaboration between Daniel Kaufman at the Endangered Language Alliance (ELA, New York) and Raphael Finkel at the University of Kentucky (UKY, Lexington). Kratylos is intended as a means to present dictionary and text files with associated media from a variety of formats, including those created by FLEx, Praat, ELAN and Toolbox. Users can search over any combination of available language data as well as add their own data to the project. Ultimately, users will also be able to create their own depositories by installing Kratylos on their own websites.

Since mid-2015, Kratylos has been supported by the NSF under grant #1500753. This grant has supported software development (Raphael Finkel, with a research assistant, Jiho Noh) and language fieldwork (Daniel Kaufman, with research assistants Ahmed Shamim, Lluvia Camacho-Cervantes and Daniel Barry).

Data organization

Data are organized in two levels: language and collection. The language is recorded in Roman lower-case letters (without spaces or punctuation). The collection name is a single word, typically representing a text title, a version number, or an informant name. Together, a particular language/collection is called a project. Each project is associated with the researcher who has uploaded its data, called the project maintainer. The maintainer may upload new files at any time; files with the same name as old files overwrite the old files. Uploaded data may be accessed (searched and displayed), but not modified.
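The naming rules above can be illustrated with a small validation sketch. The regular expressions and the function name project_id are assumptions made for illustration; the text specifies only "Roman lower-case letters" for the language and "a single word" for the collection, and Kratylos itself may check names differently.

```python
import re

# Assumed patterns: the text says only "Roman lower-case letters (without
# spaces or punctuation)" for the language and "a single word" for the collection.
LANGUAGE_RE = re.compile(r"[a-z]+")
COLLECTION_RE = re.compile(r"\S+")


def project_id(language: str, collection: str) -> str:
    """Return a 'language/collection' label, or raise ValueError if a name is invalid."""
    if not LANGUAGE_RE.fullmatch(language):
        raise ValueError(f"language must be lower-case Roman letters: {language!r}")
    if not COLLECTION_RE.fullmatch(collection):
        raise ValueError(f"collection must be a single word: {collection!r}")
    return f"{language}/{collection}"
```

For example, project_id("tagalog", "v2") yields "tagalog/v2", while a language written as "Old Javanese" is rejected because of the capital letters and the space.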

A project contains both data and metadata. The data comprise multiple uploaded data files, which can be of multiple datatypes, including Fieldworks Explorer LIFT and TEXT, Praat TextGrid, ELAN EAF, and Toolbox. Kratylos digests these files into an internal representation suitable for query and display. The metadata include the "official" language name (if any), the public/private choice, any collaborators, the provenance (free-form text, typically indicating the names of the researcher and consultant(s), and the place and date of acquisition), the researcher(s) responsible for collecting the data (the creator), and, if applicable, a URL pointing to the original data. Kratylos uses the metadata to determine access rights and to construct citations for query results.

The maintainer can decide whether the project is to be public or private. Public projects are accessible by anybody. Private projects are only accessible by the maintainer and by collaborators to whom the maintainer has given permission.

Kratylos segments each project into entries. An entry is typically a single lexical item (for a lexicon, such as FLEx DICT), a small text (EAF), a timing interval (Praat), or a phrase (FLEx TEXT).

Users

Unregistered users and registered users who are not currently logged in are considered anonymous. Anonymous users may access all public projects. Registered users who are logged in may access public projects and private projects they maintain or for which they are listed as collaborators.
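The access rules for public and private projects can be summarized as a single check. This is a minimal sketch, not the Kratylos implementation: the Project class, its field names, and the use of an email address (or None for an anonymous user) are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Project:
    """Hypothetical record holding only the access-related metadata."""
    name: str
    maintainer: str                                   # maintainer's email address
    public: bool = False
    collaborators: set = field(default_factory=set)   # emails of collaborators


def can_access(project: Project, user_email: Optional[str]) -> bool:
    """user_email is None for anonymous (unregistered or logged-out) users."""
    if project.public:
        return True          # public projects are open to everybody
    if user_email is None:
        return False         # anonymous users never see private projects
    return user_email == project.maintainer or user_email in project.collaborators
```

Under these assumptions, a private project is visible only when the logged-in user's address is the maintainer's or appears among the collaborators.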

Uploading data

Users upload data from the page reached by clicking the Upload tab at the top of the page. Registered users can create new projects (for which they become the maintainer), upload data to those projects, and establish metadata for the projects. If a project contains multiple data files of the same datatype, they should share an identical structure. For example, if there are multiple Toolbox files, they should use the same tags. Similarly, if there are multiple ELAN EAF files, they should have the same tier names. Otherwise, the maintainer should introduce separate projects for the different formats. However, a single project may contain multiple datatypes, such as FLEx DICT and FLEx TEXT.

The uploader splits FLEx text files into multiple projects, one for each title. These projects are considered related and share the same metadata. Lists of projects group related projects together, allowing users to search all or some of them. Maintainers can modify the metadata of related projects in a single update, or they can choose to modify metadata on a project-by-project basis.

A user may upload a project in several steps, each time uploading a single file. Once a project has its first file, Kratylos displays its metadata on the upload page, so the maintainer need not re-enter it for further uploads, although the maintainer may modify it. The uploaded file may be a compressed archive (ZIP, gzip/tar, or bzip2/tar) of several files. Kratylos scrutinizes the individual component files of the archive to determine their type. It rejects any files that it cannot identify. File names are significant; if the maintainer submits a file with the same name and datatype as a previous one, the previous one is deleted in favor of the new one.
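As an illustration of how the component files of an uploaded archive might be enumerated before each is identified, here is a minimal sketch using Python's standard zipfile and tarfile modules. The function name archive_members is hypothetical, and Kratylos itself (written in Perl) may inspect archives quite differently.

```python
import io
import tarfile
import zipfile


def archive_members(data: bytes):
    """Return the names of regular files in a ZIP or (gzip/bzip2) tar archive.

    Returns None when the bytes are not a recognizable archive, in which
    case the upload would be treated as a single data file.
    """
    buf = io.BytesIO(data)
    if zipfile.is_zipfile(buf):
        buf.seek(0)
        with zipfile.ZipFile(buf) as zf:
            return [info.filename for info in zf.infolist() if not info.is_dir()]
    buf.seek(0)
    try:
        # "r:*" lets tarfile detect gzip or bzip2 compression by itself.
        with tarfile.open(fileobj=buf, mode="r:*") as tf:
            return [member.name for member in tf.getmembers() if member.isfile()]
    except tarfile.TarError:
        return None
```

Each returned name would then be examined individually, and unidentifiable files rejected, as described above.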

Some data files have associated media, either audio or video. The maintainer may upload them in any recognizable format, typically after uploading and viewing the rest of the data. Media files should have names (not including any format-specific extension, such as MP3 or WAV) according to these rules:

  • Fieldworks Explorer TEXT: Title-Segnum. If a project (after the uploader splits it) contains many titles, typically in several languages, use the first one.
  • Fieldworks Explorer LIFT: As specified in the pronunciation media tag (omitting pathname)
  • Praat TextGrid: Same name as the TextGrid XML file, up to the first dot, if any (omitting pathname)
  • ELAN EAF: As specified in the EAF XML file (omitting pathname)
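The Praat rule in the list above (the file name up to the first dot, without the pathname) can be sketched as follows. The helper names are hypothetical, and treating both "/" and "\" as path separators is an assumption.

```python
import re


def strip_path(path: str) -> str:
    """Drop any directory part, accepting both '/' and '\\' separators."""
    return re.split(r"[\\/]", path)[-1]


def praat_media_stem(textgrid_path: str) -> str:
    """Expected media name for a TextGrid: its file name up to the first dot."""
    return strip_path(textgrid_path).split(".", 1)[0]


def media_stem(media_filename: str) -> str:
    """Name of an uploaded media file without path or format extension."""
    name = strip_path(media_filename)
    return name.rsplit(".", 1)[0] if "." in name else name


def media_belongs_to(media_filename: str, textgrid_path: str) -> bool:
    return media_stem(media_filename) == praat_media_stem(textgrid_path)
```

With these helpers, a media file named story.wav would be paired with a TextGrid file named story.v2.TextGrid, because both reduce to "story".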

Querying data

Users query data from the page reached by clicking the Query tab at the top of the page. The query page lists the projects that the user may view, which depends on whether each project is public and whether the user is logged in and authorized to view it. Related languages are grouped together. The user may filter the list by language, datatype, and maintainer. The user selects one or more projects and presents a query. Kratylos converts queries into Unicode Normalization Form D (Canonical Decomposition). Kratylos then displays the first n (typically 5, but the user can select a different count) entries in the selected projects that match the query. A query has one of these forms:

  • A string query matches anywhere within a field. Adjacent fields in the data are considered separated by a single space, so a string query such as "the man" matches that exact string within one field and also matches across two adjacent fields such as "the" and "mannerism".
  • A word query is like a string query, but it matches only full words, delimited by spaces, punctuation, or the boundary of a field.
  • A pattern query uses Perl regular-expression (regex) syntax. Pattern queries are not converted to Unicode Normalization Form D. Regex patterns can be quite complex and difficult to debug. Kratylos provides a query builder for complex patterns that search for two targets while avoiding specific intervening and adjacent elements.
  • A boolean query is composed of individual patterns separated by operators AND, OR, and NOT (which can be written as && || and !).
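The first three query forms in the list above can be sketched roughly as follows. This is an illustration under stated assumptions, not the Kratylos code: an entry is modeled as a list of field strings, Python's re module stands in for Perl's regex engine, and the word-splitting rule (whitespace or Unicode punctuation) is a guess at what "delimited by spaces, punctuation, or the boundary of a field" means in practice.

```python
import re
import unicodedata


def nfd(text: str) -> str:
    """Convert text to Unicode Normalization Form D."""
    return unicodedata.normalize("NFD", text)


def entry_text(fields):
    """Join an entry's fields with single spaces, as described for string queries."""
    return " ".join(nfd(f) for f in fields)


def string_match(query, fields):
    """String query: the normalized query may occur anywhere, even across fields."""
    return nfd(query) in entry_text(fields)


def words(field_text):
    """Split one field into words at whitespace and punctuation."""
    out, current = [], []
    for ch in nfd(field_text):
        if ch.isspace() or unicodedata.category(ch).startswith("P"):
            if current:
                out.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        out.append("".join(current))
    return out


def word_match(query, fields):
    """Word query: the normalized query must equal a full word of some field."""
    q = nfd(query)
    return any(q in words(f) for f in fields)


def pattern_match(regex, fields):
    """Pattern query: the regex is used as given, without normalization."""
    return re.search(regex, entry_text(fields)) is not None
```

For an entry with the two fields "the" and "mannerism", the string query "the man" matches, whereas the word query "man" does not.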

The user may choose to apply the query ignoring accent marks, so that a word query for "manana" would match an entry containing "mañana".
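One common way to ignore accent marks is to decompose the text (Normalization Form D) and then drop the combining characters. The sketch below shows that technique; whether Kratylos uses exactly this method, and which marks it counts as accents, are assumptions.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose to NFD, then remove combining marks such as the tilde in ñ."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def matches_ignoring_accents(query: str, field_text: str) -> bool:
    """Substring match after removing accent marks from both sides."""
    return strip_accents(query) in strip_accents(field_text)
```

Here strip_accents("mañana") returns "manana", so the unaccented query finds the accented entry.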

Instead of seeing full, formatted entries, the user can ask for a simple summary of the results, showing only the number of entries that match the query and the total number of matches (a single entry can match multiple times).
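The two numbers in the summary differ because one entry may contain several matches. A minimal sketch, assuming a plain string query and entries modeled as lists of field strings (the function name summarize is hypothetical):

```python
def summarize(query, entries):
    """Return (number of matching entries, total number of matches).

    Matches are counted without overlap inside the space-joined fields
    of each entry.
    """
    matching_entries = 0
    total_matches = 0
    for fields in entries:
        count = " ".join(fields).count(query)
        if count:
            matching_entries += 1
            total_matches += count
    return matching_entries, total_matches
```

For the entries [["the man", "the mannerism"], ["a dog"], ["man"]] and the query "man", the first entry matches twice and the third once, giving two matching entries and three matches in total.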

Query results

After executing a query, Kratylos displays a result page that shows both the query details, which the user may modify to submit a new query, and a list of selected projects and datatypes in which Kratylos has found matches, each with entries that match the query.

Each project with matches has a header indicating the language, the project, the datatype, and the provenance of that project. It also has a symbol that leads to a menu allowing the user to adjust the positioning of multiple projects in the result. The header also displays a button the user can click to obtain a citation for the project in BibTeX, APA (American Psychological Association), or simple URL style. The browser copies the citation into the selection buffer so the user can paste it into documents.

Kratylos emphasizes the part of each entry that matches the query by applying a yellow background, although it can't do that for matches that cross field boundaries.

Once Kratylos displays language-specific results, the user may specify a field-specific filter for further queries; each option in the menu is a tier name followed by the datatype to which it applies.

Kratylos initially displays entries using linear format, in which each entry is formatted according to a standard for its datatype. Some datatypes, such as EAF and Toolbox, have project-specific templates. When the user places the cursor over a field of the display, Kratylos displays the name of the field. The user may click on any field in the result, calling up a menu:

  • Hide tier: Stop showing this tier in the query results. This choice is remembered across searches.
  • Show all tiers: Show all the tiers in the data, even ones that are normally hidden. This choice is remembered across searches.
  • Restore default visibility: Hide tiers that are ordinarily hidden, and show tiers that are ordinarily shown. This choice is remembered across searches.
  • Query this value: Submit a new word query based on the content of this field.

If the entry displays an audio symbol, then Kratylos has an associated media file that the user can play by clicking on that symbol. If the media associated with an entry is a segment of a longer media file, Kratylos shows a control after playing the segment so the user can play it again or play earlier or later portions. If the media comprise an entire file, Kratylos does not show the control; the user can simply click on the symbol to replay the file.

Users who want to see query results in an outline format can change the mode for an entry by clicking on the button on the right.

Some data represent a narration. Users who wish to see subsequent or previous entries in a narration can click on the button and choose continuous mode, which begins by showing the entry in which the user clicks the button and then allows the user to move forward (more results) or backward (earlier results).

Each query result has an export button, which allows the user to generate a representation of the entry as LaTeX source, as unformatted text, or as a PNG image. The first two export formats copy text to the selection buffer, from which they can be pasted into documents; the image format appears as a downloaded image. The LaTeX format produces text that is intended to be placed in a file that has this preamble:

\documentclass{article}
\usepackage{url}
\usepackage[usenames]{xcolor}
\usepackage{fontspec}
\setmainfont{FreeSans}
\usepackage{expex}
\begin{document}

Because entries often use non-Latin alphabets, it is best to process the resulting LaTeX file with XeLaTeX.

Maintainers may add an annotation to any entry; the annotation can consist of text, images, audio, and video. Users may view the annotation either by clicking on the button or the button, which presents a menu including "hide/show annotation".

Logged-in users may submit feedback to a project maintainer by clicking the menu button and then selecting "provide feedback". Kratylos displays a form that names the project and the query and prompts for a message. When the user sends the message, Kratylos converts it to email to the maintainer and includes a PNG image of the entry.

Project manipulation

Users may click on the Projects tab at the top of the page. The projects page lists all the projects that the user can access, which depends on the user and logged-in status. Clicking on any project brings up its metadata, including dates of creation and modification. Maintainers can add and remove collaborators, change whether the project is public, modify provenance and citation information, and even delete their projects.

Profile manipulation

Logged-in users may see their profile by clicking on their email address at the top right of any page. They may edit their personal information: name, affiliation, country, website, and any other information (free text) they wish to share. They cannot change their email address, because Kratylos uses that information as a unique personal identifier.

Behind the scenes

This is Version 2 of Kratylos; Version 1 was limited solely to Fieldworks Explorer dictionaries and text files. The implementation of Kratylos comprises several scripts written in Perl. The implementers are Raphael Finkel and Jiho Noh. The web server, Apache2, invokes these scripts on a computer running the Linux operating system, using the Common Gateway Interface (CGI). The Perl scripts use many modules archived at CPAN (the Comprehensive Perl Archive Network), including Carp, CGI, CGI::Carp, CGI::Session, Crypt::JWT, Data::Dumper, DataTables, Data::UUID, DBI, Digest::MD5, Digest::SHA, Email::Valid, Encode, Eval::Logic, Fcntl, File::Basename, File::HomeDir, File::Path, File::Spec, HTML::Entities, HTML::Template, IO::Handle, JSON, Log::Log4perl, LWP::UserAgent, MIME::Base64, SendEmail, Storable, Sys::Hostname, Text::Slugify, Unicode::Normalize, and URI::Escape.

Kratylos treats uploaded data in several steps.

  1. Each language has its own directory; Kratylos builds the language directory if needed.
  2. Each project within a language has its own directory; Kratylos builds the project directory if needed.
  3. Within the project directory, Kratylos stores all raw uploaded data in a subdirectory. Maintainers should not treat Kratylos as an archiving facility, because Kratylos does not provide a mechanism to retrieve the raw data.
  4. If necessary, Kratylos converts the uploaded data into its own datatype-specific XML format. For example, the ELAN EAF format, although in XML, is not divided into entries, so the Kratylos uploader reformats it into entries, each of which contains all the relevant tiers (such as headword, part of speech, and gloss) and a reference to the media file.
  5. The Kratylos uploader converts all uploaded media files to Ogg/Vorbis for audio and Ogg/Theora for video. It then stores the converted files in a media subdirectory of the project directory, discarding the original media files. This conversion compresses large media files (Vorbis uses far less space than WAV) and puts them in an Ogg container, which allows for accurate direct access to particular timestamps, unlike some other containers. These formats are free and require no licensing fees.
  6. Kratylos builds a Qddb (Quick and dirty database) directory for each datatype in the project. In it, Kratylos stores all data in Unicode Normalization Form D (Canonical Decomposition) and in a Qddb-specific format. The format is based on a tripartite datatype description called a template, which coordinates (1) the XML fields, described as XPath expressions, (2) the Qddb representation of those fields, which is hierarchical, and (3) the formatting that the linear display should employ for those fields, which involves Cascading Style Sheets (CSS). For instance, part of the template for Fieldworks Explorer LIFT datatype specifies that the XPath lift/entry/lexical-unit/form/@lang should have the Qddb field name HLanguage and should be displayed with a small blue font.
  7. Kratylos stores user data and project metadata in a MySQL database, which has the following tables:
    collaborators
    country_code
    languages
    projects
    users
    users_aboutme
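The three-part template described in step 6 can be pictured as a list of rules, each tying an XPath to a Qddb field name and to display formatting. The sketch below is only an illustration: the TemplateRule class, the CSS string standing in for "a small blue font", the tiny path evaluator, and the sample LIFT fragment are all assumptions, and the real Kratylos template format is not shown in this document.

```python
import html
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class TemplateRule:
    xpath: str        # where the value lives in the XML
    qddb_field: str   # field name in the searchable representation
    css: str          # formatting for the linear display (assumed syntax)


RULES = [
    TemplateRule("lift/entry/lexical-unit/form/@lang", "HLanguage",
                 "font-size: small; color: blue"),
]

# Made-up LIFT fragment used only to exercise the rule above.
SAMPLE = """<lift><entry><lexical-unit>
<form lang="tgl"><text>bahay</text></form>
</lexical-unit></entry></lift>"""


def extract(root, xpath):
    """Evaluate a simple absolute path that may end in /@attribute."""
    steps = xpath.split("/")
    attr = steps.pop()[1:] if steps[-1].startswith("@") else None
    if steps[0] != root.tag:
        return []
    rel = "/".join(steps[1:])
    elements = root.findall(rel) if rel else [root]
    if attr is None:
        return [e.text or "" for e in elements]
    return [e.get(attr) for e in elements if e.get(attr) is not None]


def apply_rules(xml_text, rules=RULES):
    """Return (field name, value, formatted HTML) for every value a rule finds."""
    root = ET.fromstring(xml_text)
    out = []
    for rule in rules:
        for value in extract(root, rule.xpath):
            span = f'<span style="{rule.css}">{html.escape(value)}</span>'
            out.append((rule.qddb_field, value, span))
    return out
```

Applied to the sample fragment, the single rule yields the field HLanguage with the value "tgl", wrapped in a span carrying the assumed style.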
    					

Kratylos uses Qddb format as a searchable representation to execute queries and format their results. In most cases, it answers a query by scanning the data completely, because the databases are small enough to make this method efficient. For word searches, however, Kratylos relies on Qddb itself, which makes searching for words in large lexicons much faster than a complete scan.
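The difference between a complete scan and an indexed word search can be illustrated with a generic inverted index. This sketch does not reproduce Qddb's actual storage or lookup structures, which are not described here; the function names and the whitespace-based word splitting are assumptions.

```python
from collections import defaultdict


def scan(entries, substring):
    """Complete scan: examine every entry, which suits small databases."""
    return [i for i, text in enumerate(entries) if substring in text]


def build_word_index(entries):
    """Map each whitespace-delimited word to the set of entries containing it."""
    index = defaultdict(set)
    for entry_id, text in enumerate(entries):
        for word in text.split():
            index[word].add(entry_id)
    return index


def word_lookup(index, word):
    """Indexed word search: one dictionary lookup instead of reading every entry."""
    return sorted(index.get(word, ()))
```

The index is built once, after which each word query costs a single lookup regardless of how many entries the lexicon holds; the scan, by contrast, reads every entry on every query.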

The Kratylos web pages contain a significant amount of CSS and JavaScript, some that we have built and some from third-party libraries: Bootstrap for general layout and typography, jQuery to access the components of pages, DataTables to provide lists of projects, Plyr to play media, and Alertify to provide ephemeral feedback. We use W3C online validation to ensure that Kratylos web pages conform to standards.