Kratylos is a facility that lets researchers upload lexical and corpus
datasets from FieldWorks and other software and browse the results. All user
data must be cited as specified in the search results. To cite the
program, please use the following:
Finkel, Raphael, and Daniel Kaufman. Kratylos. Computer
software. Kratylos: Unified Linguistic Corpora from Diverse Data
Sources. Version 2.0. University of Kentucky and Endangered
Language Alliance, 1 June 2016. Web www.kratylos.org.
The creators can be contacted through the following links: Raphael Finkel
at the University of Kentucky and
Daniel Kaufman at the
Endangered Language Alliance. Kratylos is sponsored by grant
#1500753 from the National Science Foundation under the DEL program.
The Kratylos project started in 2012 as a collaboration between
Daniel Kaufman at the Endangered
Language Alliance (ELA, New York) and Raphael Finkel at the
University of Kentucky (UKY, Lexington). Kratylos is intended as a
means to present dictionary and text files with associated media
from a variety of formats, including those created by FLEx, Praat,
ELAN and Toolbox. Users can search over any combination of available
language data as well as add their own data to the project.
Ultimately, users will also be able to create their own depositories
by installing Kratylos on their own websites.
Since mid-2015, Kratylos has been supported by the NSF under grant
#1500753. This grant has supported software development (Raphael
Finkel, with a research assistant, Jiho Noh) and language fieldwork
(Daniel Kaufman, with research assistants Ahmed Shamim, Lluvia
Camacho-Cervantes and Daniel Barry).
Data are organized in two levels: language
and collection. The language is recorded
in Roman lower-case letters (without spaces or punctuation). The
collection name is a single word, typically representing a text
title, a version
number, or an informant name. Together, a particular
language/collection is called a project.
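
The naming rules above can be sketched as a small validator. This is only an illustration: the function name project_id, the "language/collection" string form, and the exact character classes are assumptions, not Kratylos's actual code (which is written in Perl).

```python
import re

# Assumption: "Roman lower-case letters" means ASCII a-z only.
LANGUAGE_RE = re.compile(r"[a-z]+")
# Assumption: a "single word" collection name has no whitespace and no slash.
COLLECTION_RE = re.compile(r"[^\s/]+")


def project_id(language: str, collection: str) -> str:
    """Return a hypothetical 'language/collection' identifier or raise ValueError."""
    if not LANGUAGE_RE.fullmatch(language):
        raise ValueError(f"invalid language name: {language!r}")
    if not COLLECTION_RE.fullmatch(collection):
        raise ValueError(f"invalid collection name: {collection!r}")
    return f"{language}/{collection}"
```

For example, project_id("tagalog", "v2") yields "tagalog/v2", while a language name containing capitals or spaces is rejected.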
Each project is associated with the researcher who has uploaded its
data, called the project maintainer. The
maintainer may upload new files at any time; files with the same
name as old files overwrite the old files. Uploaded data may be
accessed (searched and displayed), but not
A project contains both data and metadata. The data comprise multiple uploaded data files,
which can be of multiple datatypes,
including Fieldworks Explorer LIFT and TEXT, Praat TextGrid, ELAN
EAF, and Toolbox. Kratylos digests these files into an internal
representation suitable for query and display. The metadata include the "official" language name
(if any), the public/private choice, any collaborators, the provenance (free-form text, typically indicating
the names of the researcher and consultant(s) and the place and date of
acquisition), the researcher(s) responsible for collecting the data
(the creator), and, if applicable, a URL
pointing to the original data. Kratylos uses the metadata to
determine access rights and to construct citations for
The maintainer can decide whether the project is to be public or
private. Public projects are accessible by
anybody. Private projects are only
accessible by the maintainer and by collaborators to whom the maintainer has given
Kratylos segments each project into entries. An entry is typically a single
lexical item (for a lexicon, such as FLEx DICT), a small text (EAF),
a timing interval (Praat), or a phrase (FLEx TEXT).
Unregistered users and registered users who are not currently logged
in are considered anonymous. Anonymous
users may access all public projects. Registered users who are
logged in may access public projects and private projects they
maintain or for which they are listed as collaborators.
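
The access rule described above can be expressed as a short check. The sketch below is a minimal model, not the Kratylos implementation: the Project class and can_access function are hypothetical names, and users are identified by email address, with None standing for an anonymous user.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Project:
    """Hypothetical stand-in for a project's access-related metadata."""
    name: str
    maintainer: str                      # email address of the maintainer
    public: bool = False
    collaborators: Set[str] = field(default_factory=set)


def can_access(project: Project, user_email: Optional[str]) -> bool:
    """Return True if the user may search and display the project.

    user_email is None for anonymous users (unregistered or not logged in).
    """
    if project.public:
        return True                      # public projects are open to anybody
    if user_email is None:
        return False                     # anonymous users see only public projects
    return user_email == project.maintainer or user_email in project.collaborators
```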
Users upload data from the page reached by clicking the Upload tab
at the top of the page. Registered users
can create new projects (for which they become the maintainer),
upload data to those projects, and establish metadata for the
projects. If a project contains multiple data files of the same
datatype, they should follow identical structure. For example, if
there are multiple Toolbox files, they should use the same tags.
Similarly, if there are multiple ELAN EAF files, they should have
the same tier names. Otherwise, the maintainer should introduce
separate projects for the different formats. However, a single
project may contain multiple datatypes, such as FLExdict and FLEx
The uploader splits FLEx text files into multiple projects, one for
each title. These projects are considered related and share the same metadata.
Lists of projects group related projects together, allowing users to
search all or some of them. Maintainers can modify the metadata of
related projects in a single update, or they can choose to modify
metadata on a project-by-project basis.
A user may upload a project in several steps, each time uploading a
single file. Once a
project has its first file, Kratylos displays its metadata on the
upload page, so the maintainer need not re-enter it for further
uploads, although the maintainer may modify it. The uploaded file
may be a compressed archive (ZIP, gzip/tar, or bzip2/tar) of several
files. Kratylos scrutinizes the individual component files of the
archive to determine their type. It rejects any files that it
cannot identify. File names are significant; if the maintainer
submits a file with the same name and datatype as a previous one,
the previous one is deleted in favor of the new one.
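
As a rough illustration of the archive handling and overwrite rule above, the following sketch unpacks an uploaded archive in memory, rejects unidentifiable members, and replaces any stored file that has the same name and datatype. It rests on assumptions: the datatype is guessed from the file extension alone (how Kratylos actually scrutinizes files is not described here), and the names EXTENSION_TO_DATATYPE, archive_members, and accept_upload are invented for the example.

```python
import io
import tarfile
import zipfile
from pathlib import PurePosixPath

# Assumption: datatype is guessed from the extension only; this table is illustrative.
EXTENSION_TO_DATATYPE = {".lift": "LIFT", ".textgrid": "TextGrid", ".eaf": "EAF"}


def archive_members(data: bytes) -> dict:
    """Return {member name: content} for a ZIP, gzip/tar, or bzip2/tar archive."""
    buffer = io.BytesIO(data)
    if zipfile.is_zipfile(buffer):
        buffer.seek(0)
        with zipfile.ZipFile(buffer) as archive:
            return {info.filename: archive.read(info)
                    for info in archive.infolist() if not info.is_dir()}
    buffer.seek(0)
    # Mode "r:*" lets tarfile detect gzip or bzip2 compression by itself.
    with tarfile.open(fileobj=buffer, mode="r:*") as archive:
        return {member.name: archive.extractfile(member).read()
                for member in archive.getmembers() if member.isfile()}


def accept_upload(store: dict, data: bytes) -> list:
    """Add identifiable members to store; return the names of rejected members."""
    rejected = []
    for name, content in archive_members(data).items():
        base = PurePosixPath(name).name          # file names matter, paths do not
        datatype = EXTENSION_TO_DATATYPE.get(PurePosixPath(base).suffix.lower())
        if datatype is None:
            rejected.append(base)                # unidentifiable file: reject
        else:
            store[(base, datatype)] = content    # same name and datatype: overwrite
    return rejected
```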
Some data files have associated media, either audio or video. The
maintainer may upload them in any recognizable format, typically
after uploading and viewing the rest of the data. Media files
should have names (not including any format-specific extension, such
as MP3 or WAV)
according to these rules:
Fieldworks Explorer TEXT: Title-Segnum. If a project
(after the uploader splits it) contains many titles, typically
in several languages, use the first one.
Fieldworks Explorer LIFT: As specified in the pronunciation
media tag (omitting pathname)
Praat TextGrid: same name as the TextGrid XML file, up to
the first dot, if any (omitting pathname)
ELAN EAF: As specified in the EAF XML file (omitting
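
The Praat TextGrid rule in the list above is precise enough to sketch in code. The helper names below are hypothetical, and the sketch assumes that a media file's format-specific extension is its final suffix.

```python
from pathlib import PurePosixPath


def praat_media_stem(textgrid_path: str) -> str:
    """Expected media name: the file name, without path, up to the first dot."""
    return PurePosixPath(textgrid_path).name.split(".", 1)[0]


def media_matches(textgrid_path: str, media_path: str) -> bool:
    """True if the media file's name, minus its extension, fits the TextGrid."""
    media_name = PurePosixPath(media_path).stem   # drop only the final extension
    return media_name == praat_media_stem(textgrid_path)
```

Under these assumptions, a TextGrid file named story.v2.TextGrid expects media named story, so story.wav matches and story.v2.wav does not.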
Users submit queries from the page reached by clicking the Query tab
at the top of the page. The query page
lists those projects that the user may view based on whether they
are public and whether the user is logged in and is authorized to
view the language. Related languages are grouped together. The
user may filter the list by language, datatype, and maintainer. The
user selects one or more projects and
submits a query. Kratylos converts queries into Unicode
Normalization Form D (Canonical Decomposition).
Kratylos then displays the first n (typically 5, but the
user can select a different count) entries in the selected projects
that match the query. A query has one of these forms:
A string query matches anywhere within
a field. Adjacent fields in the data are considered separated
by a single space, so a string query such as "the man" matches both
that exact string within one field and adjacent fields such as
"the" followed by "mannerism".
A word query is like a string query,
but it matches only full words, delimited by spaces,
punctuation, or the boundary of a field.
A pattern query uses Perl
regular-expression (regex) syntax. Pattern queries are not
converted to Unicode Normalization Form D. Regex patterns can
be quite complex and difficult to debug. Kratylos provides a
query builder for complex patterns that search for two targets
while avoiding specific intervening and adjacent elements.
A boolean query is composed of
individual patterns separated by operators AND, OR, and
NOT (which can be written as &&||
The user may choose to apply the query ignoring accent marks, so a
word manana would match an entry containing
Instead of seeing full, formatted entries, the user can ask for a
simple summary of the results, showing only the number of entries
that match the query and the total number of matches (a single entry
can match multiple times).
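
The string, word, and pattern forms, the accent-insensitive option, and the summary counts can be approximated in a few lines. This is a sketch under stated assumptions, not the Kratylos implementation: it uses Python's re module rather than Perl regular expressions, it treats "ignoring accent marks" as dropping combining characters after decomposition, it approximates word delimiters with \w plus the basic combining-diacritics block, and all function names are invented. Boolean queries are omitted.

```python
import re
import unicodedata

# Assumption: a "word character" is \w or a mark from U+0300 to U+036F, so that
# a decomposed accent still counts as part of the word it follows.
WORD_CHAR = r"[\w\u0300-\u036f]"


def nfd(text: str) -> str:
    """Unicode Normalization Form D (Canonical Decomposition)."""
    return unicodedata.normalize("NFD", text)


def strip_accents(text: str) -> str:
    """Decompose, then drop combining marks (assumed meaning of 'ignoring accents')."""
    return "".join(ch for ch in nfd(text) if not unicodedata.combining(ch))


def count_matches(fields, query, kind="string", ignore_accents=False) -> int:
    """Count matches of a query in one entry, given as a list of field strings."""
    prepare = strip_accents if ignore_accents else nfd
    # Adjacent fields are considered separated by a single space.
    text = " ".join(prepare(f) for f in fields)
    if kind == "pattern":
        regex = re.compile(query)            # pattern queries are not normalized
    else:
        literal = re.escape(prepare(query))
        if kind == "word":
            literal = f"(?<!{WORD_CHAR}){literal}(?!{WORD_CHAR})"
        regex = re.compile(literal)
    return sum(1 for _ in regex.finditer(text))


def summarize(entries, query, **options):
    """Return (number of matching entries, total number of matches)."""
    counts = [count_matches(fields, query, **options) for fields in entries]
    return sum(1 for c in counts if c), sum(counts)
```

With entries such as [["the", "mannerism"], ["the man saw the man"]], the string query "the man" matches in both entries, whereas the word query "man" matches only in the second.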
After executing a query, Kratylos displays a result page that shows
both the query details, which the user may modify to submit a new
query, and a list of selected projects and datatypes in which
Kratylos has found matches, each with entries that match the query.
Each project with matches has a header indicating the language, the
project, the datatype, and the provenance of that project.
It also has a symbol that
leads to a menu allowing the user to adjust the positioning of
multiple projects in the result. The header also displays a
button the user can click to
obtain a citation for the project in BibTeX, APA
(American Psychological Association), or simple URL style.
The browser copies the citation into the selection buffer so the
user can paste it into documents.
Kratylos emphasizes the part of each entry that matches the query by
applying a yellow background, although it can't do that for matches
that cross field boundaries.
Once Kratylos displays language-specific results, the user may
specify a field-specific filter for further queries; each option in
the menu is a tier name followed by the datatype to which it
Kratylos initially displays entries using linear format, in which each entry is formatted
according to a standard for its datatype. Some datatypes, such as
EAF and Toolbox, have project-specific templates. When the user
places the cursor over a field of the display, Kratylos displays the
name of the field. The user may click on any field in the result,
calling up a menu:
Hide tier: Stop showing this tier
in the query results. This choice is remembered across
Show all tiers: Show all the
tiers in the data, even ones that are normally hidden. This
choice is remembered across searches.
Restore default visibility: Hide
tiers that are ordinarily hidden, and show tiers that are
ordinarily shown. This choice is remembered across searches.
Query this value: Submit a new
word query based on the content of this field.
If the entry displays an audio symbol, then Kratylos has an associated media
file that the user can play by clicking on that symbol. If the
media associated with an entry is a segment of a longer media file,
Kratylos shows a control after playing the segment so the user can
play it again or play earlier or later portions. If the media
comprise an entire file, Kratylos does not show the control; the
user can simply click on the symbol to replay the file.
Users who want to see query results in an outline format can change the mode for an entry
by clicking on the button on the right.
Some data represent a narration. Users who wish to see subsequent
or previous entries in a narration can click on the button
and choose continuous mode, which begins
by showing the entry in which the user clicks the button and then
allows the user to move forward (more results) or backward (earlier
Each query result has a button, which allows the user to
generate a representation of the entry as LaTeX source, as
unformatted text, or as a PNG image. The first two export formats
copy text to the selection buffer, from which they can be pasted
into documents; the image format appears as a downloaded image. The
LaTeX format produces text that is intended to be placed in a file
that has this preamble:
Because entries often use non-Latin alphabets, it is best to process
the resulting LaTeX file with XeLaTeX.
Maintainers may add an annotation to any entry; the annotation can
consist of text, images, audio, and video. Users may view the
annotation either by clicking on the button or the button, which presents a menu
including "hide/show annotation".
Logged-in users may submit feedback to a project maintainer by
clicking the and then selecting "provide
feedback". Kratylos displays a form that names the project and
the query and prompts for a message. When the user sends the
message, Kratylos converts it to email to the maintainer and
includes a PNG image of the entry.
Users may click on the Projects tab at the top of the
page. The projects page lists all the projects that the user can
access, which depends on the user's identity and login status. Clicking on
any project brings up its metadata, including dates of creation and
modification. Maintainers can add and subtract collaborators,
change whether the project is public, modify provenance and citation
information, and even delete their projects.
Logged-in users may see their profile by clicking on their email
address at the top right of any page. They may edit their personal
information: name, affiliation, country, website, and any other
information (free text) they wish to share. They cannot change
their email address, because Kratylos uses that information as a
unique personal identifier.
Behind the scenes
This is Version 2 of Kratylos; Version 1 was limited solely to
Fieldworks Explorer dictionaries and text files. The implementation
of Kratylos comprises several scripts written in Perl. The
implementers are Raphael Finkel and Jiho Noh. The web server,
Apache2, invokes these scripts on a computer running the Linux
operating system, using the Common Gateway Interface (CGI). The Perl
scripts use many modules archived at CPAN (the Comprehensive Perl
Archive Network), including
Kratylos processes uploaded data in several steps.
Each language has its own directory; Kratylos builds the
language directory if needed.
Each project within a language has its own directory;
Kratylos builds the project directory if needed.
Within the project directory, Kratylos stores all raw uploaded
data in a subdirectory. Maintainers should not treat Kratylos as
an archiving facility, because Kratylos does not provide a
mechanism to retrieve the raw data.
If necessary, Kratylos converts the uploaded data into its own
datatype-specific XML format. For example, the ELAN EAF format,
although in XML, is not divided into entries, so the Kratylos
uploader reformats it into entries, each of which contains all
the relevant tiers (such as headword, part of speech, and gloss)
and a reference to the media file.
The Kratylos uploader converts all uploaded media files to
Ogg/Vorbis for audio and Ogg/Theora for video. It then stores
the converted files in a media subdirectory of the project
directory, discarding the original media files. This conversion
compresses large media files (Vorbis uses far less space than
WAV) and puts them in an Ogg container, which allows for
accurate direct access to particular timestamps, unlike some
other containers. These formats are free and require no
Kratylos builds a Qddb
(Quick and dirty database) directory for each datatype in the
project. In it, Kratylos stores all data in Unicode
Normalization Form D (Canonical Decomposition) and in a
Qddb-specific format. The format is based on a tripartite
datatype description called a template, which coordinates (1) the XML
fields, described as XPath expressions, (2) the Qddb
representation of those fields, which is hierarchical, and (3)
the formatting that the linear display should employ for those
fields, which involves Cascading Style Sheets (CSS). For
instance, part of the template for Fieldworks Explorer LIFT
datatype specifies that the XPath
lift/entry/lexical-unit/form/@lang should have the
Qddb field name HLanguage and should be displayed
with a small blue font.
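
The three-part template idea from the last step can be illustrated with one rule. The sketch below is an assumption-laden model, not Kratylos's Perl code or Qddb's real format: the TEMPLATE structure, the CSS string chosen for "a small blue font", and the extract and apply_template helpers are all hypothetical. It handles only simple paths of the form root/child/.../@attribute, because Python's xml.etree.ElementTree supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical template rule coordinating XPath, Qddb field name, and CSS.
TEMPLATE = [
    {"xpath": "lift/entry/lexical-unit/form/@lang",
     "qddb_field": "HLanguage",
     "css": "font-size: small; color: blue;"},   # assumed rendering of "small blue"
]


def extract(root, xpath):
    """Evaluate a simple path (root/child/.../@attribute) against a parsed tree."""
    element_path, _, attribute = xpath.partition("/@")
    root_tag, _, relative = element_path.partition("/")
    if root.tag != root_tag:
        return []
    elements = root.findall(relative) if relative else [root]
    if attribute:
        return [e.get(attribute) for e in elements if e.get(attribute) is not None]
    return [e.text or "" for e in elements]


def apply_template(root, template):
    """Return one record per extracted value: Qddb field name, value, and CSS."""
    return [{"field": rule["qddb_field"], "value": value, "css": rule["css"]}
            for rule in template
            for value in extract(root, rule["xpath"])]


SAMPLE_LIFT = """<lift><entry><lexical-unit>
  <form lang="tgl"><text>bahay</text></form>
</lexical-unit></entry></lift>"""

records = apply_template(ET.fromstring(SAMPLE_LIFT), TEMPLATE)
```

Here records holds a single HLanguage record whose value is the lang attribute of the sample form element.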
Kratylos stores user data and project metadata in a MySQL
database, which has the following tables.
Kratylos uses Qddb format as a searchable representation to execute
queries and format their results. In most cases, it performs a complete
scan of the data, because the
databases are small enough to make this method efficient.
Kratylos does use Qddb, however, for word searches. Searching for
words in large lexicons is thereby much faster than a complete
The Kratylos web pages contain a significant amount of CSS and
libraries: Bootstrap for general layout and typography, jQuery to
access the components of pages, DataTables to provide lists of
projects, Plyr to play media, and Alertify to provide ephemeral
feedback. We use W3C online validation to ensure that Kratylos web
pages conform to standards.