Kratylos is a facility that lets researchers upload lexical and corpus
datasets from FieldWorks, Praat, ELAN, Toolbox, Pangloss, and other
software, and then browse the results. All user
data must be cited as specified in the search results. To cite the
program, please use the following:
Finkel, Raphael, and Daniel Kaufman. Kratylos. Computer
software. Kratylos: Unified Linguistic Corpora from Diverse Data
Sources. Version 2.0. University of Kentucky and Endangered
Language Alliance, 1 June 2016. Web www.kratylos.org.
The creators can be contacted through the following links:
Raphael Finkel
at the University of Kentucky and
Daniel Kaufman at the
Endangered Language Alliance. Kratylos is sponsored by grant
#1500753 from the National Science Foundation under the DEL program.
Cats Claw: Computer-Assisted Technology Service, Computational Linguist's Automated Workbench
History
The Kratylos project started in 2012 as a collaboration between
Daniel Kaufman at the Endangered
Language Alliance (ELA, New York) and Raphael Finkel at the
University of Kentucky (UKY, Lexington). Kratylos is intended as a
means to present dictionary and text files with associated media from a
variety of formats, including those created by FLEx, Praat, ELAN and
Toolbox. Users can search over any combination of available language
data as well as add their own data to the project. Ultimately, users
will also be able to create their own depositories by installing
Kratylos on their own websites.
Since mid-2015, Kratylos has
been supported by the NSF under grant #1500753. This grant has
supported software development (Raphael Finkel, with a research
assistant, Jiho Noh) and language fieldwork (Daniel Kaufman, with
research assistants Ahmed Shamim, Lluvia Camacho-Cervantes and Daniel
Barry).
Data organization
Data are organized in two levels: language
and title. The language is typically in Roman
letters, with an initial capital. Spaces are permissible. The title is a
word or phrase, typically representing a text title or a version number.
Together, a particular language/title is called a project. Each project is associated with the
registered user who has uploaded its data, called the project maintainer. The maintainer may upload new files at
any time; a new file with the same name and datatype as an earlier file in the project replaces the earlier one.
The maintainer may specify that the title is public, in which case any user, registered or not,
may access it. Alternatively, the maintainer may specify that the title
is private, in which case only the maintainer
and registered users explicitly named as collaborators may access it.
Access means to search and display uploaded data.
Kratylos stores data and metadata for each project. The data comprise multiple uploaded data files, which
can be of multiple datatypes, including
Fieldworks Explorer LIFT and FLEXTEXT, Praat TextGrid, ELAN EAF, Lacito
Pangloss, and Toolbox. Kratylos digests these files into an internal
representation suitable for query and display.
The metadata include the language name, the
public/private choice, any collaborators, and the provenance. The provenance includes the title's genre (such as
song or narration), its topic, the participants (often with a
parenthesized comment, such as "transcriber"), the date of recording (if
unspecified, Kratylos uses 9999/01/01), the location, the languages used
in the recording, the name(s) of the researcher(s), any associated web site
(such as an archive-specific site), an associated institution (typically
an archive name, such as ELAR, or an institution, such as a university),
and a description, which is free text describing the particular title's
content. Kratylos uses the metadata to control access and to construct
citations for query results.
Kratylos subdivides the data in each project into entries. An entry is typically a single lexical item (for a
lexicon), a timing interval (Praat, ELAN),
or an utterance (FLEXTEXT).
User categories
Unregistered users and registered users who are not currently logged
in are considered anonymous. Anonymous users
may access all public projects. Registered users who are logged in may
access public projects and private projects they maintain or for which
they are listed as collaborators.
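The access rules above can be sketched as a single check. This is only an illustration, not the Kratylos implementation: the Project class, its field names, and the can_access function are hypothetical, and users are identified by email address because the profile section describes that as the unique identifier.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Project:
    # Hypothetical record; the field names are assumptions for illustration.
    language: str
    title: str
    maintainer: str                      # email address of the uploader
    public: bool = False
    collaborators: set = field(default_factory=set)


def can_access(project: Project, user_email: Optional[str]) -> bool:
    """Return True if the user may search and display the project.

    user_email is None for anonymous users, that is, unregistered users
    and registered users who are not logged in.
    """
    if project.public:
        return True
    if user_email is None:
        return False
    return user_email == project.maintainer or user_email in project.collaborators
```

For example, a private project is visible only to its maintainer and to the collaborators named in its metadata; everyone, including anonymous users, can see a public one.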
Viewing projects
To query within a project or set of projects, first click on
Projects at the top of the page. The first Projects page displays a
list of the languages that have titles you may access; the list depends
on your logged-in status. It also shows a map with a marker for each
such language for which it has recorded a location; some languages may
have no known location and are therefore missing from the map.
You
can click on a map marker, which then bounces, and the list of languages
is filtered to only that language. You can also filter the list of
languages by name, references (such as the Glottolog or WALS
identifier), or institution. Whenever the filtered list has only one
language, its associated map marker, if any, bounces.
You can click on any of the references. Kratylos then opens a new
browser tab containing an external document pertaining to that
reference.
You may also click on any of the language buttons. Kratylos then opens
a new browser tab listing all the titles for that language. This list
includes much of the metadata for each title; you can toggle which
metadata it shows by clicking on one of the red or green column names
near the top. Green names are currently displayed, and red ones are
hidden. You can use the Search box near the top to filter the list
based on the contents of any column.
Each title has several buttons. One opens a new browser tab
showing the first few entries of the title. Others let you
select and unselect titles for conducting searches. If you have
selected any titles, you can click the control at the top of the
page to move to the query page. Finally, one button opens a new
browser tab showing all the metadata for the title; if you are the
maintainer, it lets you modify the metadata and add or remove
collaborators.
Queries
The query page lets you submit searches on the data in the currently
selected titles. The search term can include any Unicode character.
Kratylos converts all data and queries into
Unicode Normalization Form D (Canonical Decomposition), so you can use
precomposed non-ASCII characters if you like.
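The effect of normalization can be seen in a short Python sketch (Kratylos itself is written in Perl and lists Unicode::Normalize among its modules; the nfd helper below is a hypothetical name used only for illustration). A precomposed character and its decomposed equivalent differ as raw strings but become identical after conversion to Form D.

```python
import unicodedata


def nfd(text: str) -> str:
    """Convert text to Unicode Normalization Form D (canonical decomposition)."""
    return unicodedata.normalize("NFD", text)


precomposed = "ma\u00f1ana"    # the letter n with tilde as a single code point
decomposed = "man\u0303ana"    # plain n followed by a combining tilde

# The raw strings differ, but their normalized forms are identical,
# so a query typed either way finds data stored either way.
assert precomposed != decomposed
assert nfd(precomposed) == nfd(decomposed)
```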
For convenience,
if the selected titles contain data with non-ASCII characters,
Kratylos displays a keyboard icon that you can click to bring up a keyboard
specialized to those special characters.
If the selected titles contain what look like morphological glosses
(ASCII strings in ALL CAPS, possibly with numerals),
Kratylos displays another keyboard icon that you can click to bring up a keyboard
specialized to those gloss elements.
When you submit
a query, Kratylos displays the first n
(typically 5, but you can select a different count) entries in the
selected projects that match the query. If you have only one title
selected, it adds more entries as you scroll.
A query has one of these forms:
A string query matches anywhere
within a field. Adjacent fields in the data are considered separated by
a single space, so a string query such as the man
matches that exact string within one field and also matches a pair of
adjacent fields such as "the" and "mannerism".
A word query is like a string query, but it matches
only full words, delimited by spaces, punctuation, or the boundary of a
field.
A pattern query uses Perl
regular-expression (regex) syntax. A pattern can specify statistics
gathering by including capture groups like this:
(?<NAME>PATTERN), where NAME can be any word,
and PATTERN can be any Perl pattern. Kratylos displays a
table showing all matches to the pattern along with a match count,
organized by the name.
Regex patterns can be quite complex and
difficult to debug. Kratylos provides two specialty query builders:
(1) multi-target
patterns that match two targets while avoiding specific intervening
and adjacent elements, and (2) gloss patterns that match combinations of
gloss elements within a single morpheme or word.
A multi-tier query is composed of nested units. A
unit has the form <tierName content>.
The content can be a Perl pattern, a nested
tier, or empty. To see the relevant tier names, you can switch to
Outline format. If a tier is nested within another tier, you must
include the outer tier as well. For instance, if the outline of tiers
looks like this:
Group
text
basicMorpheme
gloss
you can specify a gloss of foo by this unit:
<Group <gloss foo>>. If you put
a * after the tier name, Kratylos interprets
it as "any subsequent instance of this tier". Use an empty content to
force Kratylos to skip an instance of the tier. Here is a complex
example based on the structure of flextext tiers:
This multi-tier search pattern looks for an element with Segnum matching the
pattern 16, any Word with a first Morpheme
with first Citation with first CF matching the pattern feta, followed directly by a Morpheme with first
Morph matching the pattern -re.
A boolean query is composed of individual
patterns separated by operators AND, OR, and NOT (which can
be written as &&, ||, and !).
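The difference between string, word, and pattern queries can be sketched in Python. This is an illustration under stated assumptions, not Kratylos code: an entry is modeled as a list of field strings, all function names are hypothetical, and Python writes a named capture group as (?P<NAME>PATTERN), whereas the Perl syntax that Kratylos accepts is (?<NAME>PATTERN).

```python
import re
from collections import Counter


def entry_text(fields):
    """Join adjacent fields with a single space, as described for string queries."""
    return " ".join(fields)


def string_query(fields, term):
    """A string query matches anywhere, including across adjacent fields."""
    return term in entry_text(fields)


def word_query(fields, term):
    """A word query matches only full words inside a field.

    The lookarounds require a non-word character or a field boundary
    on both sides of the term.
    """
    pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
    return any(re.search(pattern, f) for f in fields)


def capture_statistics(entries, pattern):
    """Count the text matched by each named capture group, organized by name."""
    regex = re.compile(pattern)
    counts = {name: Counter() for name in regex.groupindex}
    for fields in entries:
        for match in regex.finditer(entry_text(fields)):
            for name, value in match.groupdict().items():
                if value is not None:
                    counts[name][value] += 1
    return counts
```

With the fields "the" and "mannerism", the string query the man succeeds while the word query man does not. A pattern such as \b(?P<prefix>un|re)\w+ yields one table, named prefix, with a count for each distinct prefix found.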
You may choose to apply the query ignoring accent marks, so a word
manana would match an entry containing mañana.
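One common way to ignore accent marks, shown here as a sketch rather than as the method Kratylos uses, is to decompose the text and drop the combining marks before comparing. The function names are hypothetical.

```python
import unicodedata


def strip_accents(text):
    """Decompose to Form D, then drop combining marks such as tildes and accents."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def matches_ignoring_accents(query, field):
    """Substring match after removing accent marks from both sides."""
    return strip_accents(query) in strip_accents(field)
```

Under this scheme the query manana matches a field containing mañana, and a query typed with the accent also matches unaccented data.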
You may specify a field-specific filter for string, word,
pattern, and boolean queries; each option in the menu is a tier name
followed by the datatype to which it applies.
Instead of
seeing full, formatted entries, you can ask for a simple summary of the
results, showing only the number of entries that match the query and the
total number of matches (a single entry can match multiple times).
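The two numbers in the summary differ whenever an entry matches more than once. A minimal sketch, assuming each entry is available as a plain string and using a hypothetical summarize function:

```python
import re


def summarize(entries, pattern):
    """Return (number of entries that match, total number of matches)."""
    regex = re.compile(pattern)
    matching_entries = 0
    total_matches = 0
    for text in entries:
        hits = sum(1 for _ in regex.finditer(text))
        if hits:
            matching_entries += 1
            total_matches += hits
    return matching_entries, total_matches
```

For three entries in which the pattern occurs twice, once, and not at all, the summary reports 2 matching entries and 3 matches.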
Query results
After executing a query, Kratylos displays
the query details, which you may modify to submit a new query, and a
list of selected projects and datatypes in which Kratylos has found
matches, each with entries that match the query.
Each project with
matches has a header indicating the language, the project, the datatype,
and the provenance of that project. If there are multiple matching
projects, the header also contains a symbol that leads to a menu allowing you to adjust the
positioning of multiple projects in the result. The header also
displays a
button you can click to obtain a citation for the project in
BibTeX, APA (American Psychological Association), or simple URL
style. The browser copies the citation into the selection buffer so you
can paste it into documents.
Kratylos emphasizes the part of
each entry that matches the query by applying a yellow background,
although it can't do that for matches that cross field boundaries.
Kratylos initially displays entries using linear format, in which each entry is formatted
according to a standard for its datatype. Some datatypes, such as EAF
and Toolbox, have project-specific templates. When you place the
cursor over a field of the display, Kratylos displays the name of the
field. You may click on any field in the result, calling up a menu:
Hide tier: Stop showing this
tier in the query results. This choice is remembered across searches.
Show all tiers: Show all the tiers
in the data, even ones that are normally hidden. This choice is
remembered across searches.
Restore
default visibility: Hide tiers that are ordinarily hidden, and
show tiers that are ordinarily shown. This choice is remembered across
searches.
Query this value: Submit
a new word query based on the content of this field.
If
the entry displays an audio or video
symbol, then Kratylos has an associated media file that you can play by
clicking on that symbol. If the media associated with an entry is a
segment of a longer media file, Kratylos shows a control after playing
the segment so you can play it again or play earlier or later portions.
If the media comprise an entire file, Kratylos does not show the
control; you can simply click on the symbol to replay the file.
If you want to see query results in an outline format, change the mode for an entry by
clicking on the button on the right.
Some data represent a
narration. Users who wish to see subsequent or previous entries in a
narration can click on the button and choose continuous
mode, which begins by showing the entry in which you clicked the button
and then allows you to move forward (more results) or backward (earlier
results).
Each query result has an export button, which allows you to generate a
representation of the entry as LaTeX source (either for the expex
package or the linguex package), as a PNG image, or as unformatted text.
The first two export formats copy text to the selection buffer, from
which they can be pasted into documents; the image format appears as a
downloaded image. The LaTeX expex and linguex outputs are intended to
be placed in a file that has this preamble:
\documentclass{article}
\usepackage{url}
\usepackage[usenames]{xcolor}
\usepackage{fontspec}
\setmainfont{FreeSans}
\usepackage{expex} % or \usepackage{linguex}
\begin{document}
Because entries often use non-Latin
alphabets, it is best to process the resulting LaTeX file with XeLaTeX.
Maintainers may add an annotation to any entry; the annotation
can consist of text, images, audio, and video. Text annotations are
placed in an "Annotation" tier in the data and are searchable. Users may view the
other annotations either by clicking on the button or the button, which presents a menu
including "hide/show annotation".
Logged-in users may submit
feedback to a project maintainer by clicking the and then selecting "provide
feedback". Kratylos displays a form that names the project and the
query and prompts for a message. When you send the message, Kratylos
converts it to email to the maintainer and includes a PNG image of the
entry.
Profile manipulation
Logged-in users may see their profile
by clicking on their email address at the top right of any page. They
may edit their personal information: name, affiliation, country,
website, and any other information (free text) they wish to share. They
cannot change their email address, because Kratylos uses that
information as a unique personal identifier.
Uploading data
Uploading is accomplished from the page that
a user accesses by clicking on the Upload tab at the top of the page.
Registered users can create new projects (for which they become the
maintainer), upload data to those projects, and establish metadata for
the projects. If a project contains multiple data files of the same
datatype, they should follow identical structure. For example, if there
are multiple Toolbox files, they should use the same tags. Similarly,
if there are multiple ELAN EAF files, they should have the same tier
names. Otherwise, the maintainer should introduce separate titles for
the different formats. However, a single project may contain multiple
datatypes, such as FLExdict and FLEx text.
The uploader automatically splits FLEx text files into multiple
titles. These projects are considered related
and share the same metadata. Maintainers can modify the metadata of
related projects in a single update, or they can choose to modify
metadata on a project-by-project basis.
A user may upload a project in several steps, each time uploading a
single file. Once a project has its first file, Kratylos displays its
metadata on the upload page, so the maintainer need not re-enter it for
further uploads, although the maintainer may modify it. The uploaded
file may be a compressed archive (ZIP, gzip/tar, or bzip2/tar) of
several files. Kratylos scrutinizes the individual component files of
the archive to determine their type. It rejects any files that it
cannot identify. File names are significant; if the maintainer submits
a file with the same name and datatype as a previous one, the previous
one is deleted in favor of the new one.
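The reject-and-replace behavior can be sketched as follows. This is a simplification: Kratylos is described as scrutinizing the files themselves, whereas this sketch guesses the datatype from conventional file extensions only. The extension table, the function names, and the in-memory dictionary are all assumptions made for illustration.

```python
import os

# Conventional extensions for some supported datatypes (an assumption;
# the real uploader inspects the files rather than trusting names).
EXTENSION_TO_DATATYPE = {
    ".lift": "LIFT",
    ".flextext": "FLEXTEXT",
    ".textgrid": "TextGrid",
    ".eaf": "EAF",
}


def identify(filename):
    """Return a datatype name, or None if the file cannot be identified."""
    extension = os.path.splitext(filename)[1].lower()
    return EXTENSION_TO_DATATYPE.get(extension)


def upload(project_files, filename, content):
    """Store a file, replacing any earlier file with the same name and datatype.

    Returns False, storing nothing, if the file cannot be identified.
    """
    datatype = identify(filename)
    if datatype is None:
        return False
    project_files[(os.path.basename(filename), datatype)] = content
    return True
```

Keying the store on the pair (file name, datatype) means that a second upload of story.eaf silently supersedes the first, which is why file names are significant.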
Some data files have associated media, either audio or video. The
maintainer may upload media in any recognizable format, typically after
uploading and viewing the rest of the data. Media files should have
names (not including any format-specific extension, such as MP3 or WAV) according to
these rules:
Fieldworks Explorer LIFT: As specified in the pronunciation media
tag (omitting pathname)
Fieldworks Explorer FLEXTEXT: Title-Segnum. If a project (after
the uploader splits it) contains many titles, typically in several
languages, use the first one.
Praat: same name as the TextGrid file, up to the first
dot, if any (omitting pathname)
ELAN: As specified in the EAF XML file (omitting pathname)
Transcriber: As specified in the XML file (omitting
pathname)
Pangloss: As specified in the ID attribute of the sentence (S)
tag
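As an illustration of the Praat rule above, the following sketch derives the expected media name from a TextGrid path and checks a candidate media file against it. The function names are hypothetical; only the rule itself (same name as the TextGrid file, up to the first dot, omitting the pathname) comes from the list above.

```python
import os


def praat_media_basename(textgrid_path):
    """Name of the TextGrid file up to the first dot, without the pathname."""
    name = os.path.basename(textgrid_path)
    return name.split(".", 1)[0]


def media_matches(media_filename, textgrid_path):
    """Compare a media file name, minus its format extension, with the expected name."""
    stem = os.path.splitext(os.path.basename(media_filename))[0]
    return stem == praat_media_basename(textgrid_path)
```

Under this rule a file named story1.final.TextGrid expects media named story1, so story1.wav or story1.mp3 would be associated with it.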
Behind the scenes
This is Version 2 of Kratylos; Version 1 was limited solely to
Fieldworks Explorer dictionaries and text files. The implementation of
Kratylos comprises several scripts written in Perl. The implementers are
Raphael Finkel and Jiho Noh. The web server, Apache2, invokes these
scripts on a computer running the Linux operating system, using the
Common Gateway Interface (CGI). The Perl scripts use many modules
archived at CPAN (the Comprehensive
Perl Archive Network), including
Carp,
CGI,
CGI::Carp,
CGI::Session,
Crypt::JWT,
Data::Dumper,
DataTables,
Data::UUID,
DBI,
Digest::MD5,
Digest::SHA,
Email::Valid,
Encode,
Eval::Logic,
Fcntl,
File::Basename,
File::HomeDir,
File::Path,
File::Spec,
HTML::Entities,
HTML::Template,
IO::Handle,
JSON,
Log::Log4perl,
LWP::UserAgent,
MIME::Base64,
SendEmail,
Storable,
Sys::Hostname,
Text::Slugify,
Unicode::Normalize,
and
URI::Escape.
Kratylos treats uploaded data in
several steps.
Each language has its own directory; Kratylos builds the
language directory if needed, converting the language name into a
"sluggified" ASCII-only name.
Each title within a language has its own directory; Kratylos
builds the (sluggified) title directory if needed.
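A slug of this kind can be produced roughly as follows. Kratylos lists the Perl module Text::Slugify among its dependencies; this Python sketch does not reproduce that module's exact rules. The lowercasing and the hyphen separator in particular are assumptions.

```python
import re
import unicodedata


def slugify(name):
    """Reduce a language name or title to an ASCII-only directory name."""
    # Decompose accented letters so the base letter survives the ASCII filter.
    decomposed = unicodedata.normalize("NFD", name)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Collapse every run of other characters into one hyphen (an assumed convention).
    return re.sub(r"[^A-Za-z0-9]+", "-", ascii_only).strip("-").lower()
```

A project directory would then be the slug of the title inside the slug of the language.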
Within the title directory, Kratylos stores all raw uploaded data
in a subdirectory. Maintainers should not treat Kratylos as an archiving
facility, because Kratylos does not provide a mechanism to retrieve the
raw data.
If necessary, Kratylos converts the uploaded data into its own
datatype-specific XML format. For example, the ELAN EAF format,
although in XML, is not divided into entries, so the Kratylos uploader
reformats it into entries, each of which contains all the relevant tiers
(such as headword, part of speech, and gloss) and a reference to the
media file.
Kratylos builds a subtitle file in WebVTT format for a few
datatypes, including EAF and Pangloss.
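A WebVTT file is plain text: a WEBVTT header followed by cues, each with a start time, an end time, and the cue text. The sketch below shows how time-aligned entries could be written in that format. The representation of an entry as a (start in milliseconds, end in milliseconds, text) tuple and the function names are assumptions, not the Kratylos internals.

```python
def vtt_timestamp(ms):
    """Format a time in milliseconds as a WebVTT timestamp, hh:mm:ss.ttt."""
    hours, rest = divmod(ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{millis:03d}"


def to_webvtt(cues):
    """Build WebVTT text from (start_ms, end_ms, text) tuples."""
    lines = ["WEBVTT", ""]
    for start_ms, end_ms, text in cues:
        lines.append(f"{vtt_timestamp(start_ms)} --> {vtt_timestamp(end_ms)}")
        lines.append(text)
        lines.append("")          # a blank line ends each cue
    return "\n".join(lines)
```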
The Kratylos uploader converts all uploaded media files to Ogg/Vorbis for audio and Ogg/Theora or MP4 for video. It then
stores the converted files in a media subdirectory of the project
directory, discarding the original media files. This conversion
compresses large media files (Vorbis uses far less space than WAV) and
puts them in an Ogg container, which allows for accurate direct access
to particular timestamps, unlike some other containers. These formats
are free and require no licensing fees.
Kratylos builds a Qddb
(Quick and dirty database) directory for each datatype in the project.
In it, Kratylos stores all data in Unicode Normalization Form D
(Canonical Decomposition) and in a Qddb-specific format. The format is
based on a datatype-specific tripartite description called a template, which coordinates (1) the XML fields,
described as XPath expressions, (2) the Qddb representation of those
fields, which is hierarchical, and (3) the formatting that the linear
display should employ for those fields, which involves Cascading Style
Sheets (CSS). For instance, part of the template for Fieldworks
Explorer LIFT datatype specifies that the XPath lift/entry/lexical-unit/form/@lang should have the
Qddb field name HLanguage and should be
displayed with a small blue font.
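The tripartite idea can be sketched in Python with the one template row mentioned above. The data structure, the CSS string, the sample XML, and the extract function are assumptions for illustration; the real templates and the Qddb format are not shown in this document.

```python
import xml.etree.ElementTree as ET

# Each row coordinates (1) an XPath, (2) a Qddb field name, (3) display CSS.
TEMPLATE = [
    ("lift/entry/lexical-unit/form/@lang", "HLanguage",
     "font-size: small; color: blue;"),
]

# Made-up sample in the general shape of a LIFT file.
SAMPLE = """<lift>
  <entry>
    <lexical-unit><form lang="en"><text>house</text></form></lexical-unit>
  </entry>
</lift>"""


def extract(root, template):
    """Return (field name, value, css) triples for every template row that matches."""
    rows = []
    for xpath, field_name, css in template:
        element_path, _, attribute = xpath.partition("/@")
        # ElementTree paths are relative to the root element, so drop its tag.
        prefix = root.tag + "/"
        if element_path.startswith(prefix):
            element_path = element_path[len(prefix):]
        for element in root.findall(element_path):
            value = element.get(attribute) if attribute else (element.text or "")
            if value is not None:
                rows.append((field_name, value, css))
    return rows
```

Calling extract(ET.fromstring(SAMPLE), TEMPLATE) yields a single HLanguage field with the value en and its display style. Keeping the three parts in one row means a new datatype needs only a new template, not new extraction or display code.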
Kratylos stores user data and project metadata in a MySQL database,
which includes the following tables.
languages
projects
collaborators
country_code
users
Kratylos uses Qddb format as a searchable representation to execute
queries and format their results. In most cases, it searches the data
by a complete scan of the data, because the data files are small enough
to make this method efficient. Kratylos does use Qddb, however, for
word searches. Searching for words in large lexicons is therefore much
faster than a complete scan.
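The trade-off between a complete scan and an indexed word search can be illustrated with a generic inverted index. This sketch does not show how Qddb stores or searches data; the function names and the in-memory representation are assumptions.

```python
import re
from collections import defaultdict


def scan(entries, term):
    """Complete scan: test every entry, which costs time proportional to the data size."""
    return {i for i, text in enumerate(entries) if term in text}


def build_word_index(entries):
    """Map each word to the set of entry numbers that contain it."""
    index = defaultdict(set)
    for entry_id, text in enumerate(entries):
        for word in re.findall(r"\w+", text):
            index[word].add(entry_id)
    return index


def word_search(index, word):
    """Indexed lookup: one dictionary access, regardless of the data size."""
    return index.get(word, set())
```

The index is built once per upload and answers whole-word lookups directly, but it cannot answer substring or regular-expression queries, which still need a scan.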
The Kratylos web pages contain a significant amount of CSS and
JavaScript, some that we have built and some from third-party libraries:
Bootstrap for general layout and typography, jQuery to access the
components of pages, DataTables to provide lists of projects, Plyr to
play media, and Alertify to provide ephemeral feedback. We use W3C
online validation to ensure that Kratylos web pages conform to
standards.