Multilingual Metadata

Overview

Digital collections at UT Austin come in a wide variety of languages; as such, records about these materials require varying levels of multilingual support. These may range from simply recording and encoding the language of an object's title to full translation of a record. The following pages detail various resources on how to create and maintain multilingual metadata.


Levels of Multilingual Metadata

Metadata records can contain varying levels of multilingual content, ranging in both number of languages represented and number of fields translated. Even if a record only has a single field in a language other than English, it still contains multilingual metadata, represented in the content and in the metadata describing that language.

Example of a record with non-roman script; only names in a second language

Example of a record fully translated, using roman script

Guidelines

GENERAL GUIDELINES

Languages should be treated as equals; translations should stay as close as possible to the meaning in the record's first language (which may not always be English).

If you are providing multilingual metadata beyond titles and proper nouns, ensure that linguistic diversity is respected. Take the time to evaluate terms from different countries/communities that speak that language.

If detailed translation is desired, consider paying for professional translation.

Full translation of a record is a desirable goal, but it is not always possible for every item. Approaching multilingual metadata as a tiered process can help identify which items to translate.

BY FIELD

For Titles and proper nouns (Contributor, Publisher, Subject Name fields, etc.):

Keep these values in their original language; do not translate them into English.

For non-roman script, a transliterated version of the title can be recorded in a second title field so it can have its own language tag. (Subtitles should be reserved for actual subtitles of the title.)
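As a rough illustration of the two-title approach, the sketch below builds an original-script title and a transliterated title as separate fields, each with its own language and script tag. It assumes MODS-style `titleInfo` elements with `lang` and `script` attributes; the helper name `add_title` and the sample title are hypothetical, and other schemas will model this differently.

```python
import xml.etree.ElementTree as ET


def add_title(parent, text, lang, script):
    """Append a MODS-style titleInfo/title pair tagged with language and script.

    Hypothetical helper: the lang value is an ISO 639-3 code and the script
    value is an ISO 15924 code; adjust both to local practice.
    """
    info = ET.SubElement(parent, "titleInfo", {"lang": lang, "script": script})
    ET.SubElement(info, "title").text = text
    return info


# Illustrative record: one title in the original script, one transliterated.
record = ET.Element("mods")
add_title(record, "Война и мир", lang="rus", script="Cyrl")
add_title(record, "Voĭna i mir", lang="rus", script="Latn")

print(ET.tostring(record, encoding="unicode"))
```

Both titles keep the same language code because transliteration changes the script, not the language; only the script tag differs.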

For Subject (Topic, Geographic Place, Temporal fields), Genre, Form/Medium:

These fields can be an easier place to start translation work, as they can be found in existing controlled vocabularies such as the Getty Vocabularies or VIAF.

Evaluate whether the term adequately conveys the meaning of the original English concept and whether it is used with high frequency in its context (dialect, country, community, or subject domain can all apply).

Apply the same terms consistently across materials.

Critical cataloging best practices still apply; you may need to investigate whether certain terms should be used, or to see if remediated English terms have the same meaning as their counterparts in other languages.

For Descriptions and Rights/Citation information:

These fields, while critical, can be more challenging to translate as they require a high degree of familiarity with the language.

Professional translation services can be one way to make this work more sustainable. If these services are not available, translation by UT Austin Libraries workers should be prioritized for frequently accessed materials or those deemed key by collections curators.

Translation of these fields is especially encouraged if usage statistics show that the collection object is accessed or used by other language communities. For example, some of our collections have translated these fields into Spanish. The origin of these materials, their use in Spanish-speaking communities, and the high use of Spanish on UT Austin's campus have demonstrated this need.

Diacritics and Character Encoding

Diacritics are marks or symbols added to a letter to show its phonetic value (for example, "á" or "ñ").

Character encoding is the process of making those letters machine-readable by assigning numerical codes to characters. The majority of UTL's metadata uses UTF-8 character encoding. When creating metadata spreadsheets or XML, ensure that your character encoding is UTF-8. This will allow your diacritics to be stored and displayed correctly.
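To make the idea of "numerical codes" concrete, here is a minimal Python sketch showing how a letter with a diacritic is stored in UTF-8 and how to test whether raw bytes are valid UTF-8. The function names are hypothetical and are not part of any UTL tooling.

```python
def utf8_bytes(text: str) -> bytes:
    """Return the UTF-8 byte sequence for a string."""
    return text.encode("utf-8")


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# "ñ" (U+00F1) takes two bytes in UTF-8.
print(utf8_bytes("ñ"))                       # b'\xc3\xb1'

# The same letter saved as Latin-1 is a single byte that is not valid UTF-8,
# which is one common reason diacritics appear garbled after export.
print(is_valid_utf8("ñ".encode("latin-1")))  # False
print(is_valid_utf8(utf8_bytes("á")))        # True
```

A check like `is_valid_utf8(open(path, "rb").read())` can be run on an exported CSV or XML file before ingest; a passing check means the bytes are well-formed UTF-8, though it cannot prove the characters are the intended ones.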

To check the character encoding in an Excel spreadsheet, navigate to the "Save As" option, then choose "Web Options" in the "Tools" dropdown menu.

UTF-8 encoding menu in Excel (Tools→Web Options)

For XML editors like Oxygen, navigate to Preferences → Encoding to check the encoding of the document (UTF-8 by default).

For Google Sheets, documents are in UTF-8 by default.

Metadata Modeling

Languages used in metadata fields are modeled in different ways, based on the metadata schema. Below are a few examples. For more guidance on where to source Language information for the UTL DAMS, see the "Assets" section of the wiki. Currently, ISO 639-3 codes are advised.

EAD XML example

<profiledesc>
	<creation encodinganalog="500">Text converted and initial EAD tagging provided by Apex Data Services, <date era="ce" calendar="gregorian">July 2001.</date>
	</creation>
	<langusage>Finding aid written in <language langcode="eng" scriptcode="Latn">English.</language>
	</langusage>
</profiledesc>
<langmaterial label="Language:" encodinganalog="546$a"><language encodinganalog="546" langcode="eng" scriptcode="Latn">English</language></langmaterial>

In EAD, the language of the finding aid (recorded inside profiledesc) and the language of the material (recorded inside langmaterial) are represented in separate elements.
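The separation between the two languages can be checked programmatically. The sketch below is one possible approach using Python's standard library: it reads a trimmed, namespace-free stand-in for the example above (not a complete, valid finding aid) and collects the `langcode` and `scriptcode` values found under each container. The function name `language_codes` and the `langusage` wrapper, which follows EAD 2002, are assumptions.

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for the EAD example; not a complete finding aid.
EAD_SNIPPET = """
<ead>
  <profiledesc>
    <langusage>Finding aid written in
      <language langcode="eng" scriptcode="Latn">English.</language>
    </langusage>
  </profiledesc>
  <langmaterial label="Language:">
    <language langcode="eng" scriptcode="Latn">English</language>
  </langmaterial>
</ead>
"""


def language_codes(root, container_tag):
    """Collect (langcode, scriptcode) pairs from language elements under a container."""
    codes = []
    for container in root.iter(container_tag):
        for lang in container.iter("language"):
            code = lang.get("langcode")
            if code:  # skip any language element without a code
                codes.append((code, lang.get("scriptcode")))
    return codes


root = ET.fromstring(EAD_SNIPPET.strip())
print("finding aid:", language_codes(root, "profiledesc"))
print("material:", language_codes(root, "langmaterial"))
```

Because the function only looks for `language` elements that carry a `langcode`, it does not depend on the name of the wrapper element around them.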

MODS XML Example

<language>
	<languageTerm type="text" lang="eng">Western Highland Purepecha</languageTerm>
	<languageTerm type="code" authority="iso639-3" authorityURI="https://iso639-3.sil.org/code_tables/639/data">pua</languageTerm>
</language>

<languageOfCataloging usage="primary">
	<languageTerm type="code" authority="iso639-2b" authorityURI="http://id.loc.gov/vocabulary/iso639-2">eng</languageTerm>
</languageOfCataloging>

<subject lang="spa">
	<topic>acuarelas (obra visual)</topic>
</subject>

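In MODS, the language of the material (language) and the language of the catalog record (languageOfCataloging, which sits inside recordInfo in a full record) are represented in separate elements. Language tags (the lang attribute) are also assigned to individual fields, such as the Spanish subject above, to allow for greater flexibility.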
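A short Python sketch of how these tags might be read back out of a record follows. It assumes a simplified, namespace-free fragment wrapped in a `mods` root; real MODS files normally declare the MODS namespace, which the paths would need to include. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified fragment based on the example above; namespace omitted for brevity.
MODS_SNIPPET = """
<mods>
  <language>
    <languageTerm type="text" lang="eng">Western Highland Purepecha</languageTerm>
    <languageTerm type="code" authority="iso639-3">pua</languageTerm>
  </language>
  <subject lang="spa"><topic>acuarelas (obra visual)</topic></subject>
</mods>
"""


def material_language_codes(root, authority="iso639-3"):
    """Return coded language terms for the material from the given authority."""
    codes = []
    for term in root.findall("./language/languageTerm"):
        if term.get("type") == "code" and term.get("authority") == authority:
            codes.append((term.text or "").strip())
    return codes


def fields_by_language(root):
    """Map each lang attribute value to the top-level fields that carry it."""
    tagged = {}
    for child in root:
        lang = child.get("lang")
        if lang:
            tagged.setdefault(lang, []).append(child.tag)
    return tagged


root = ET.fromstring(MODS_SNIPPET.strip())
print(material_language_codes(root))  # ['pua']
print(fields_by_language(root))       # {'spa': ['subject']}
```

Keeping the two lookups separate mirrors the distinction in the text: one answers "what language is the material in?", the other answers "what language is this field written in?".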

JSON Example

{"typeName":"language","multiple":true,"typeClass":"controlledVocabulary","value":["Spanish, Castilian"]}

In this JSON example, the language of the material is represented as an array containing the single value "Spanish, Castilian". The array structure, together with the "multiple": true key-value pair, allows more than one language to be recorded for the same item.
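As a minimal sketch of how the array supports several languages, the Python below loads the field from the example and appends a second value. The helper name `add_language` and the added value "Portuguese" are illustrative assumptions; in practice the value must match a term in the platform's controlled vocabulary.

```python
import json

FIELD_JSON = (
    '{"typeName":"language","multiple":true,'
    '"typeClass":"controlledVocabulary","value":["Spanish, Castilian"]}'
)


def add_language(field: dict, language: str) -> dict:
    """Return a copy of the field with one more language in its value array."""
    if not field.get("multiple"):
        raise ValueError("field does not accept multiple values")
    updated = dict(field)
    values = list(field.get("value", []))
    if language not in values:  # avoid duplicate entries
        values.append(language)
    updated["value"] = values
    return updated


field = json.loads(FIELD_JSON)
updated = add_language(field, "Portuguese")  # illustrative second language
print(json.dumps(updated, ensure_ascii=False))
```

The original field is left unchanged, so the before and after versions can be compared during review.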

Ethics, or Multilingual Metadata as Accessibility

Providing multilingual metadata aligns with the libraries' mission to support research and to include IDEA (inclusion, diversity, equity, and accessibility) concepts in all aspects of library work by:

Developing new access points to the libraries' collections by using languages our students, faculty, and researchers use
Providing enhanced access to collections that are from locations/communities that speak languages other than English, opening them up to source communities
Addressing UT Austin's status as a Hispanic Serving Institution by focusing on Spanish and Portuguese translation of collections' metadata
Encouraging new methods of research by expanding linkages to multilingual vocabularies and linked data sources