Kratylos is a facility that lets researchers upload lexical and corpus datasets from FieldWorks, Praat, ELAN, Toolbox, Pangloss, and other software and then browse the results. All user data must be cited as specified in the search results. To cite the program, please use the following:

Finkel, Raphael, and Daniel Kaufman. Kratylos. Computer software. Kratylos: Unified Linguistic Corpora from Diverse Data Sources. Version 2.0. University of Kentucky and Endangered Language Alliance, 1 June 2016. Web www.kratylos.org.

The creators can be contacted through the following links:
Raphael Finkel at the University of Kentucky and Daniel Kaufman at the Endangered Language Alliance. Kratylos is sponsored by grant #1500753 from the National Science Foundation under the DEL program.

Cats Claw

Computer-Assisted Technology Service
Computational Linguist's Automated Workbench

History

The Kratylos project started in 2012 as a collaboration between Daniel Kaufman at the Endangered Language Alliance (ELA, New York) and Raphael Finkel at the University of Kentucky (UKY, Lexington). Kratylos is intended as a means to present dictionary and text files with associated media from a variety of formats, including those created by FLEx, Praat, ELAN and Toolbox. Users can search over any combination of available language data as well as add their own data to the project. Ultimately, users will also be able to create their own depositories by installing Kratylos on their own websites.

Since mid-2015, Kratylos has been supported by the NSF under grant #1500753. This grant has supported software development (Raphael Finkel, with a research assistant, Jiho Noh) and language fieldwork (Daniel Kaufman, with research assistants Ahmed Shamim, Lluvia Camacho-Cervantes and Daniel Barry).

Data organization

Data are organized in two levels: language and title. The language name is typically in Roman letters, with an initial capital; spaces are permissible. The title is a word or phrase, typically representing a text title or a version number. Together, a particular language/title is called a project. Each project is associated with the registered user who has uploaded its data, called the project maintainer. The maintainer may upload new files at any time; a new file with the same name and datatype as an existing file in the project overwrites the existing one. The maintainer may specify that the title is public, in which case any user, registered or not, may access it. Alternatively, the maintainer may specify that the title is private, in which case only the maintainer and registered users explicitly named as collaborators may access it. Access means to search and display uploaded data.

Kratylos stores data and metadata for each project. The data comprise multiple uploaded data files, which can be of multiple datatypes, including Fieldworks Explorer LIFT and FLEXTEXT, Praat TextGrid, ELAN EAF, Lacito Pangloss, and Toolbox. Kratylos digests these files into an internal representation suitable for query and display.

The metadata include the language name, the public/private choice, any collaborators, and the provenance. The provenance includes the title's genre (such as song or narration), its topic, the participants (often with a parenthesized comment, such as "transcriber"), the date of recording (if unspecified, Kratylos uses 9999/01/01), the location, the languages used in the recording, the name(s) of the researcher(s), any associated web site (such as an archive-specific site), an associated institution (typically an archive name, such as ELAR, or an institution, such as a university), and a description, which is free text describing the particular title's content. Kratylos uses the metadata to control access and to construct citations for query results.
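
As a rough illustration of the metadata just described, the record for one project could be modeled as below. This is only a sketch in Python: the class names, field names, and the private-by-default choice are assumptions made for the example and do not describe how Kratylos actually stores metadata. Only the placeholder date 9999/01/01 comes from the description above.

```python
from dataclasses import dataclass, field
from typing import List

# Placeholder date that the text says is used when no recording date is given.
UNSPECIFIED_DATE = "9999/01/01"


@dataclass
class Provenance:
    # Field names are hypothetical; they mirror the items listed in the text.
    genre: str = ""
    topic: str = ""
    participants: List[str] = field(default_factory=list)
    recording_date: str = UNSPECIFIED_DATE
    location: str = ""
    languages_used: List[str] = field(default_factory=list)
    researchers: List[str] = field(default_factory=list)
    web_site: str = ""
    institution: str = ""
    description: str = ""


@dataclass
class ProjectMetadata:
    language: str
    title: str
    maintainer: str
    public: bool = False  # assumption: private unless stated otherwise
    collaborators: List[str] = field(default_factory=list)
    provenance: Provenance = field(default_factory=Provenance)


meta = ProjectMetadata(
    language="Example Language",
    title="Story 1",
    maintainer="maintainer@example.org",
)
meta.provenance.participants.append("A. Speaker (transcriber)")
```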

Kratylos subdivides the data in each project into entries. An entry is typically a single lexical item (for a lexicon), a timing interval (Praat, ELAN), or an utterance (FLEXTEXT).

User categories

Unregistered users and registered users who are not currently logged in are considered anonymous. Anonymous users may access all public projects. Registered users who are logged in may access public projects and private projects they maintain or for which they are listed as collaborators.
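
The access rule above can be sketched as a small function. This is a minimal Python illustration, not the actual Kratylos code; the function name and parameters are hypothetical, and an anonymous user is represented by None.

```python
from typing import Optional, Set


def may_access(user: Optional[str], public: bool,
               maintainer: str, collaborators: Set[str]) -> bool:
    """Return True if the user may search and display the project.

    `user` is the email of a logged-in user, or None for an anonymous user.
    """
    if public:
        return True   # public projects are open to everyone
    if user is None:
        return False  # anonymous users never see private projects
    return user == maintainer or user in collaborators


collaborators = {"colleague@example.org"}
print(may_access(None, True, "owner@example.org", collaborators))
print(may_access("colleague@example.org", False, "owner@example.org", collaborators))
```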

Viewing projects

To query within a project or set of projects, first click on Projects at the top of the page. The first Projects page displays a list of the languages that have titles you may access; the list depends on your logged-in status. It also shows a map with a marker for each such language for which it has recorded a location; some languages may have no known location and are therefore missing from the map.

You can click on a map marker, which then bounces, and the list of languages is filtered to only that language. You can also filter the list of languages by name, references (such as the Glottolog or WALS identifier), or institution. Whenever the filtered list has only one language, its associated map marker, if any, bounces.

You can click on any of the references. Kratylos then opens a new browser tab containing an external document pertaining to that reference.

You may also click on any of the buttons. Kratylos then opens a new browser tab listing all the titles for that language. This list includes much of the metadata for each title; you can toggle which metadata it shows by clicking on one of the red or green column names near the top. Green names are currently displayed, and red ones are hidden. You can use the Search box near the top to filter the list based on the contents of any column.

Each title has several buttons. You can click to open a new browser tab showing the first few entries of the title. Alternatively, you can select and unselect titles for conducting searches. If you have selected any titles, you can click on the top of the page to move to the query page. Finally, if you click the button, Kratylos opens a new browser tab showing all the metadata for the title, and if you are the maintainer, it lets you modify the metadata and add/remove collaborators.

Queries

The query page lets you submit searches on the data in the currently selected titles. The search term can include any Unicode character. Kratylos converts all data and queries into Unicode Normalization Form D (Canonical Decomposition), so you can use precomposed non-ASCII characters if you like.
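
To see why normalization makes precomposed characters safe to type, consider this short Python sketch (Kratylos itself is written in Perl; Python's standard unicodedata module is used here only for illustration). A precomposed ñ and the sequence n plus combining tilde are different strings until both are converted to Form D.

```python
import unicodedata


def to_nfd(text: str) -> str:
    """Convert text to Unicode Normalization Form D (canonical decomposition)."""
    return unicodedata.normalize("NFD", text)


precomposed = "ma\u00f1ana"   # ñ as a single code point
decomposed = "man\u0303ana"   # n followed by COMBINING TILDE

# The raw strings differ, but their NFD forms are identical,
# so a query typed either way can match data stored either way.
print(precomposed == decomposed)
print(to_nfd(precomposed) == to_nfd(decomposed))
```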

For convenience, if the selected titles contain data with non-ASCII characters, Kratylos displays a keyboard icon that you can click to bring up a keyboard specialized to those special characters.

If the selected titles contain what look like morphological glosses (ASCII strings in ALL CAPS, possibly with numerals), Kratylos displays another keyboard icon that you can click to bring up a keyboard specialized to those gloss elements.

When you submit a query, Kratylos displays the first n entries (typically 5, but you can select a different count) in the selected projects that match the query, adding more as you scroll if you have only one title selected.

A query has one of these forms:

  • A string query matches anywhere within a field. Adjacent fields in the data are considered separated by a single space, so a string query such as the man matches not only that exact string within one field but also a field ending in the followed by a field beginning with man, such as mannerism.
  • A word query is like a string query, but it matches only full words, delimited by spaces, punctuation, or the boundary of a field.
  • A pattern query uses Perl regular-expression (regex) syntax. A pattern can specify statistics gathering by including capture groups like this: (?<NAME>PATTERN), where NAME can be any word, and PATTERN can be any Perl pattern. Kratylos displays a table showing all matches to the pattern along with a match count, organized by the name.
    Regex patterns can be quite complex and difficult to debug. Kratylos provides two specialty query builders: (1) multi-target patterns that match two targets while avoiding specific intervening and adjacent elements, and (2) gloss patterns that match combinations of gloss elements within a single morpheme or word.
  • A multi-tier query is composed of nested units. A unit has the form <tierName content>. The content can be a Perl pattern, a nested tier, or empty. To see the relevant tier names, you can switch to Outline format. If a tier is nested within another tier, you must include the outer tier as well. For instance, if the outline of tiers looks like this:
    		Group
    			text
    			basicMorpheme
    			gloss
    		
    you can specify a gloss of foo by this unit: <Group <gloss foo>>. If you put a * after the tier name, Kratylos interprets it as "any subsequent instance of this tier". Use an empty content to force Kratylos to skip an instance of the tier. Here is a complex example based on the structure of flextext tiers:
    		<Segnum 16><Word* <Morpheme <Citation <CF feta>>><Morpheme <Morph -re>>>
    This multi-tier search pattern looks for an element with Segnum matching the pattern 16, any Word with a first Morpheme with first Citation with first CF matching the pattern feta, followed directly by a Morpheme with first Morph matching the pattern -re.
  • A boolean query is composed of individual patterns separated by the operators AND, OR, and NOT (which can also be written as &&, ||, and !).
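
The statistics gathering described for pattern queries can be approximated as follows. This is a hedged Python sketch, not the Kratylos implementation: Python spells named groups (?P<NAME>PATTERN) rather than Perl's (?<NAME>PATTERN), so the example first rewrites the Perl spelling. The function names and the sample data are invented for illustration.

```python
import re
from collections import Counter, defaultdict
from typing import Dict, Iterable


def perl_to_python_named_groups(pattern: str) -> str:
    """Rewrite Perl's (?<NAME> as Python's (?P<NAME>.

    Lookbehinds, (?<= and (?<!, are left untouched. This simple rewrite
    does not handle an escaped literal such as \\(?< in the pattern.
    """
    return re.sub(r"\(\?<(?![=!])", "(?P<", pattern)


def capture_statistics(pattern: str, fields: Iterable[str]) -> Dict[str, Counter]:
    """Count every captured value, organized by capture-group name."""
    regex = re.compile(perl_to_python_named_groups(pattern))
    stats: Dict[str, Counter] = defaultdict(Counter)
    for text in fields:
        for match in regex.finditer(text):
            for name, value in match.groupdict().items():
                if value is not None:
                    stats[name][value] += 1
    return dict(stats)


sample_fields = ["walked and talked", "walking", "talked"]
stats = capture_statistics(r"\b(?<STEM>walk|talk)(?<SUFFIX>ed|ing)\b", sample_fields)
for name, counts in stats.items():
    print(name, dict(counts))
```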

You may choose to apply the query ignoring accent marks, so a word manana would match an entry containing mañana.
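
One common way to implement accent-insensitive matching is to decompose both strings and drop the combining marks before comparing. The Python sketch below shows that general technique; whether Kratylos does exactly this is an assumption, and the function names are hypothetical.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose the text (NFD) and drop all combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def matches_ignoring_accents(query: str, field_text: str) -> bool:
    """String match in which accent marks on either side are ignored."""
    return strip_accents(query) in strip_accents(field_text)


print(strip_accents("ma\u00f1ana"))
print(matches_ignoring_accents("manana", "hasta ma\u00f1ana"))
```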

You may specify a field-specific filter for string, word, pattern, and boolean queries; each option in the menu is a tier name followed by the datatype to which it applies.

Instead of seeing full, formatted entries, you can ask for a simple summary of the results, showing only the number of entries that match the query and the total number of matches (a single entry can match multiple times).
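
The two numbers in the summary differ because one entry can contain several matches. A minimal Python sketch, with a hypothetical function name and invented sample entries (each entry is modeled as a list of field strings), makes the distinction concrete.

```python
import re
from typing import Iterable, List, Tuple


def summarize(pattern: str, entries: Iterable[List[str]]) -> Tuple[int, int]:
    """Return (number of matching entries, total number of matches)."""
    regex = re.compile(pattern)
    matching_entries = 0
    total_matches = 0
    for fields in entries:
        # Count non-overlapping matches in every field of the entry.
        count = sum(1 for text in fields for _ in regex.finditer(text))
        if count:
            matching_entries += 1
            total_matches += count
    return matching_entries, total_matches


entries = [["banana", "fruit"], ["apple"], ["ananas", "pineapple"]]
summary = summarize("an", entries)
print(summary)
```

Here two entries match, but the pattern occurs four times in total, because it appears twice in each matching entry.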

Query results

After executing a query, Kratylos displays the query details, which you may modify to submit a new query, and a list of selected projects and datatypes in which Kratylos has found matches, each with entries that match the query.

Each project with matches has a header indicating the language, the project, the datatype, and the provenance of that project. If there are multiple matching projects, the header also contains a symbol that leads to a menu allowing you to adjust the positioning of multiple projects in the result. The header also displays a button you can click to obtain a citation for the project in BibTeX, APA (American Psychological Association), or simple URL style. The browser copies the citation into the selection buffer so you can paste it into documents.

Kratylos emphasizes the part of each entry that matches the query by applying a yellow background, although it can't do that for matches that cross field boundaries.

Kratylos initially displays entries using linear format, in which each entry is formatted according to a standard for its datatype. Some datatypes, such as EAF and Toolbox, have project-specific templates. When you place the cursor over a field of the display, Kratylos displays the name of the field. You may click on any field in the result, calling up a menu:

  • Hide tier: Stop showing this tier in the query results. This choice is remembered across searches.
  • Show all tiers: Show all the tiers in the data, even ones that are normally hidden. This choice is remembered across searches.
  • Restore default visibility: Hide tiers that are ordinarily hidden, and show tiers that are ordinarily shown. This choice is remembered across searches.
  • Query this value: Submit a new word query based on the content of this field.

If the entry displays an audio or video symbol, then Kratylos has an associated media file that you can play by clicking on that symbol. If the media associated with an entry is a segment of a longer media file, Kratylos shows a control after playing the segment so you can play it again or play earlier or later portions. If the media comprise an entire file, Kratylos does not show the control; you can simply click on the symbol to replay the file.

If you want to see query results in an outline format, change the mode for an entry by clicking on the button on the right.

Some data represent a narration. Users who wish to see subsequent or previous entries in a narration can click on the button and choose continuous mode, which begins by showing the entry in which you clicked the button and then allows you to move forward (more results) or backward (earlier results).

Each query result has a button that lets you generate a representation of the entry as LaTeX source (either for the expex package or the linguex package), as a PNG image, or as unformatted text. The first two export formats copy text to the selection buffer, from which they can be pasted into documents; the image format appears as a downloaded image. The LaTeX expex and linguex outputs are intended to be placed in a file that has this preamble:

		\documentclass{article}
		\usepackage{url}
		\usepackage[usenames]{xcolor}
		\usepackage{fontspec}
		\setmainfont{FreeSans}
		\usepackage{expex} % or \usepackage{linguex}
		\begin{document} 

Because entries often use non-Latin alphabets, it is best to process the resulting LaTeX file with XeLaTeX.

Maintainers may add an annotation to any entry; the annotation can consist of text, images, audio, and video. Text annotations are placed in an "Annotation" tier in the data and are searchable. Users may view the other annotations either by clicking on the button or the button, which presents a menu including "hide/show annotation".

Logged-in users may submit feedback to a project maintainer by clicking the menu button and then selecting "provide feedback". Kratylos displays a form that names the project and the query and prompts for a message. When you send the message, Kratylos emails it to the maintainer and includes a PNG image of the entry.

Profile manipulation

Logged-in users may see their profile by clicking on their email address at the top right of any page. They may edit their personal information: name, affiliation, country, website, and any other information (free text) they wish to share. They cannot change their email address, because Kratylos uses that information as a unique personal identifier.

Uploading data

To upload data, click on the Upload tab at the top of the page. Registered users can create new projects (for which they become the maintainer), upload data to those projects, and establish metadata for the projects. If a project contains multiple data files of the same datatype, they should share an identical structure. For example, if there are multiple Toolbox files, they should use the same tags. Similarly, if there are multiple ELAN EAF files, they should have the same tier names. Otherwise, the maintainer should introduce separate titles for the different formats. However, a single project may contain multiple datatypes, such as FLExdict and FLEx text.

The uploader automatically splits FLEx text files into multiple titles. These projects are considered related and share the same metadata. Maintainers can modify the metadata of related projects in a single update, or they can choose to modify metadata on a project-by-project basis.

A user may upload a project in several steps, each time uploading a single file. Once a project has its first file, Kratylos displays its metadata on the upload page, so the maintainer need not re-enter it for further uploads, although the maintainer may modify it. The uploaded file may be a compressed archive (ZIP, gzip/tar, or bzip2/tar) of several files. Kratylos scrutinizes the individual component files of the archive to determine their type. It rejects any files that it cannot identify. File names are significant; if the maintainer submits a file with the same name and datatype as a previous one, the previous one is deleted in favor of the new one.
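
The handling of an uploaded archive can be pictured with the Python sketch below. It is an illustration under stated assumptions, not the Kratylos uploader: the datatype is guessed from the file extension (the text does not say how Kratylos identifies files), only ZIP archives are handled, and the extension table, function names, and in-memory store are all hypothetical. It does show the two rules stated above: unidentified files are rejected, and a file with the same name and datatype replaces the earlier one.

```python
import io
import zipfile
from pathlib import PurePosixPath
from typing import Dict, List, Optional, Tuple

# Hypothetical extension table; real identification may inspect file contents.
DATATYPE_BY_EXTENSION = {
    ".lift": "LIFT",
    ".flextext": "FLEXTEXT",
    ".textgrid": "TextGrid",
    ".eaf": "EAF",
}


def guess_datatype(filename: str) -> Optional[str]:
    return DATATYPE_BY_EXTENSION.get(PurePosixPath(filename).suffix.lower())


def store_archive(archive_bytes: bytes,
                  stored: Dict[Tuple[str, str], bytes]) -> List[str]:
    """Store identifiable members of a ZIP archive; return rejected names."""
    rejected: List[str] = []
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            name = PurePosixPath(info.filename).name  # omit any pathname
            datatype = guess_datatype(name)
            if datatype is None:
                rejected.append(name)
                continue
            # Same (name, datatype) key: the new file replaces the old one.
            stored[(name, datatype)] = archive.read(info)
    return rejected


def make_zip(files: Dict[str, bytes]) -> bytes:
    """Helper for the example: build a ZIP archive in memory."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, data in files.items():
            archive.writestr(name, data)
    return buffer.getvalue()


stored: Dict[Tuple[str, str], bytes] = {}
rejected = store_archive(
    make_zip({"texts/story.eaf": b"version 1", "notes.xyz": b"?"}), stored)
store_archive(make_zip({"story.eaf": b"version 2"}), stored)
```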

Some data files have associated media, either audio or video. The maintainer may upload media in any recognizable format, typically after uploading and viewing the rest of the data. Media files should have names (not including any format-specific extension, such as MP3 or WAV) according to these rules:

  • Fieldworks Explorer LIFT: As specified in the pronunciation media tag (omitting pathname)
  • Fieldworks Explorer FLEXTEXT: Title-Segnum. If a project (after the uploader splits it) contains many titles, typically in several languages, use the first one.
  • Praat: same name as the TextGrid file, up to the first dot, if any (omitting pathname)
  • ELAN: As specified in the EAF XML file (omitting pathname)
  • Transcriber: As specified in the XML file (omitting pathname)
  • Pangloss: As specified in the ID attribute of the sentence (S) tag
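
As an example of applying these naming rules, the Praat rule (same name as the TextGrid file, up to the first dot, omitting the pathname) could be checked with a few lines of Python. The function names are hypothetical, and the sketch covers only the Praat case.

```python
from pathlib import PurePosixPath


def praat_media_stem(textgrid_path: str) -> str:
    """Expected media name for a TextGrid: file name up to the first dot."""
    filename = PurePosixPath(textgrid_path).name  # omit the pathname
    return filename.split(".", 1)[0]


def media_matches_textgrid(media_filename: str, textgrid_path: str) -> bool:
    """Compare a media file name, minus its extension, with the expected name."""
    media_name = PurePosixPath(media_filename).stem  # drop .wav, .mp3, ...
    return media_name == praat_media_stem(textgrid_path)


print(praat_media_stem("recordings/story01.v2.TextGrid"))
print(media_matches_textgrid("story01.wav", "recordings/story01.v2.TextGrid"))
```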

Behind the scenes

This is Version 2 of Kratylos; Version 1 was limited solely to Fieldworks Explorer dictionaries and text files. The implementation of Kratylos comprises several scripts written in Perl. The implementers are Raphael Finkel and Jiho Noh. The web server, Apache2, invokes these scripts on a computer running the Linux operating system, using the Common Gateway Interface (CGI). The Perl scripts use many modules archived at CPAN (the Comprehensive Perl Archive Network), including Carp, CGI, CGI::Carp, CGI::Session, Crypt::JWT, Data::Dumper, DataTables, Data::UUID, DBI, Digest::MD5, Digest::SHA, Email::Valid, Encode, Eval::Logic, Fcntl, File::Basename, File::HomeDir, File::Path, File::Spec, HTML::Entities, HTML::Template, IO::Handle, JSON, Log::Log4perl, LWP::UserAgent, MIME::Base64, SendEmail, Storable, Sys::Hostname, Text::Slugify, Unicode::Normalize, and URI::Escape.

Kratylos treats uploaded data in several steps.

  1. Each language has its own directory; Kratylos builds the language directory if needed, converting the language name into a "sluggified" ASCII-only name.
  2. Each title within a language has its own directory; Kratylos builds the (sluggified) title directory if needed.
  3. Within the title directory, Kratylos stores all raw uploaded data in a subdirectory. Maintainers should not treat Kratylos as an archiving facility, because Kratylos does not provide a mechanism to retrieve the raw data.
  4. If necessary, Kratylos converts the uploaded data into its own datatype-specific XML format. For example, the ELAN EAF format, although in XML, is not divided into entries, so the Kratylos uploader reformats it into entries, each of which contains all the relevant tiers (such as headword, part of speech, and gloss) and a reference to the media file.
  5. Kratylos builds a subtitle file in WebVTT format for a few datatypes, including EAF and Pangloss.
  6. The Kratylos uploader converts all uploaded media files to Ogg/Vorbis for audio and Ogg/Theora or MP4 for video. It then stores the converted files in a media subdirectory of the project directory, discarding the original media files. This conversion compresses large media files (Vorbis uses far less space than WAV) and puts them in an Ogg container, which allows for accurate direct access to particular timestamps, unlike some other containers. These formats are free and require no licensing fees.
  7. Kratylos builds a Qddb (Quick and dirty database) directory for each datatype in the project. In it, Kratylos stores all data in Unicode Normalization Form D (Canonical Decomposition) and in a Qddb-specific format. The format is based on a datatype-specific tripartite description called a template, which coordinates (1) the XML fields, described as XPath expressions, (2) the Qddb representation of those fields, which is hierarchical, and (3) the formatting that the linear display should employ for those fields, which involves Cascading Style Sheets (CSS). For instance, part of the template for Fieldworks Explorer LIFT datatype specifies that the XPath lift/entry/lexical-unit/form/@lang should have the Qddb field name HLanguage and should be displayed with a small blue font.
  8. Kratylos stores user data and project metadata in a MySQL database, which includes the following tables.
    			languages          
    			projects           
    			collaborators      
    			country_code       
    			users              
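
The directory layout in steps 1 and 2 can be sketched in Python as follows. This is only an approximation: Kratylos uses the Perl module Text::Slugify, whose exact rules are not reproduced here, and the lowercasing, the hyphen separator, and the root path are assumptions made for the example.

```python
import re
import unicodedata
from pathlib import PurePosixPath


def slugify(name: str) -> str:
    """Reduce a name to an ASCII-only form usable as a directory name."""
    # Decompose accented letters, then drop everything that is not ASCII.
    ascii_only = (unicodedata.normalize("NFKD", name)
                  .encode("ascii", "ignore")
                  .decode("ascii"))
    # Assumed convention: runs of other characters become one hyphen.
    return re.sub(r"[^A-Za-z0-9]+", "-", ascii_only).strip("-").lower()


def project_directory(root: str, language: str, title: str) -> PurePosixPath:
    """One directory per language, with one subdirectory per title."""
    return PurePosixPath(root) / slugify(language) / slugify(title)


path = project_directory("/data", "Guaran\u00ed", "Story 1 (v2)")
print(path)
```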
    		

Kratylos uses Qddb format as a searchable representation to execute queries and format their results. In most cases, it searches by a complete scan of the data, because the data files are small enough to make this method efficient. For word searches, however, Kratylos relies on Qddb's own search facility, so searching for words in large lexicons is much faster than a complete scan.
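
The difference between a complete scan and an indexed word search can be illustrated in Python. This sketch uses a plain in-memory inverted index; it does not reproduce Qddb's actual file format or lookup method, and the function names and sample entries are invented.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def build_word_index(entries: List[str]) -> Dict[str, Set[int]]:
    """Map each word to the set of entry numbers that contain it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for entry_id, text in enumerate(entries):
        for word in re.findall(r"\w+", text):
            index[word].add(entry_id)
    return index


def scan(entries: List[str], substring: str) -> List[int]:
    """Complete scan: examine every entry for the substring."""
    return [i for i, text in enumerate(entries) if substring in text]


entries = ["the man", "the mannerism", "a woman"]
index = build_word_index(entries)

# A word search is a single dictionary lookup, whatever the lexicon size.
print(sorted(index["man"]))
# A string search must visit every entry and also finds partial words.
print(scan(entries, "man"))
```

The lookup returns only the entry containing man as a full word, while the scan also reports mannerism and woman, which matches the distinction between word queries and string queries.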

The Kratylos web pages contain a significant amount of CSS and JavaScript, some that we have built and some from third-party libraries: Bootstrap for general layout and typography, jQuery to access the components of pages, DataTables to provide lists of projects, Plyr to play media, and Alertify to provide ephemeral feedback. We use W3C online validation to ensure that Kratylos web pages conform to standards.