Skyrim Mod:String Table File Format

The UESPWiki – Your source for The Elder Scrolls since 1995

Each plugin/master file that contains an lstring (lookup string) datatype has an accompanying set of string tables in Data\Strings. Each string table file is named after the plugin/master filename, followed by an underscore, the language, and the extension. For example, the English string tables for Skyrim.esm use 'Skyrim_English' as the base filename. There are three files with different extensions (DLSTRINGS, ILSTRINGS, STRINGS). The distinction appears to be that DLSTRINGS contains journal and book entries, ILSTRINGS contains subtitled conversations, and STRINGS contains general strings such as item names. The three files share the same layout, except that STRINGS uses a slightly different string data format.
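
As an illustration of the naming convention, the following Python sketch builds the three expected paths for a given plugin and language. The helper name `string_table_paths` and the `data_dir` parameter are hypothetical, and the upper-case extensions simply mirror the spelling used above.

```python
from pathlib import PureWindowsPath

# The three string table extensions named in the text.
EXTENSIONS = ("DLSTRINGS", "ILSTRINGS", "STRINGS")

def string_table_paths(plugin_filename, language, data_dir="Data"):
    """Return the expected string table paths for a plugin/master file.

    Hypothetical helper: the base name is <plugin>_<language>.
    """
    base = PureWindowsPath(plugin_filename).stem  # "Skyrim.esm" -> "Skyrim"
    strings_dir = PureWindowsPath(data_dir) / "Strings"
    return [str(strings_dir / f"{base}_{language}.{ext}") for ext in EXTENSIONS]

for path in string_table_paths("Skyrim.esm", "English"):
    print(path)
```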

The string files are simple, uncompressed data. Each file begins with an 8-byte header that holds the count of strings and the total size of the string data stored at the end of the file. The header is followed by a series of 8-byte structs, each consisting of a string ID used for reference and an offset to the string, relative to the beginning of the string data.

The string data itself has two formats that differ only slightly. The .STRINGS file has simple null-terminated (C-style) strings, while the .ILSTRINGS and .DLSTRINGS files also have null-terminated strings but precede each string with a uint32 that declares its length.

Header

Type/Size                Info
uint32                   Number of entries in the string table.
uint32                   Size of string data that follows after header and directory.
Directory Entry[count]   Directory (see below).
uint8[dataSize]          Raw data.
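
A minimal Python sketch of reading the header and locating the start of the string data is given below. It assumes little-endian byte order, which this page does not state explicitly, and `read_header` is a hypothetical helper name.

```python
import struct

# Two uint32 values: entry count and string data size.
# Little-endian ("<") is an assumption, not stated in the format description.
HEADER = struct.Struct("<II")
ENTRY_SIZE = 8  # each directory entry is two uint32 values

def read_header(blob):
    """Return (count, data_size, data_start) for a string table file."""
    count, data_size = HEADER.unpack_from(blob, 0)
    # The string data begins right after the header and the directory.
    data_start = HEADER.size + count * ENTRY_SIZE
    if data_start + data_size > len(blob):
        raise ValueError("file is shorter than the header claims")
    return count, data_size, data_start

# Hand-built sample: 2 (zeroed) directory entries and 6 bytes of string data.
sample = struct.pack("<II", 2, 6) + bytes(16) + b"ab\x00cd\x00"
print(read_header(sample))  # (2, 6, 24)
```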

Directory Entry

Directory entries are simple 8-byte structs consisting of two uint32 values: the first is the ID that mod files use to refer to the string, and the second is the offset from the beginning of the string data to the string itself. Entries are not required to be sequential, and while each ID is unique, offsets are not (e.g. two different IDs can point to the same string).

Type/Size   Info
uint32      String ID
uint32      Offset (relative to beginning of data) to the string. These entries are not required to be sequential. See String Data below.
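
The directory can be read into an ID-to-offset mapping, as in the Python sketch below. Little-endian byte order is again an assumption, and `read_directory` is a hypothetical helper. The sample deliberately lists entries out of order and lets two IDs share one offset.

```python
import struct

def read_directory(blob):
    """Return a dict mapping string ID -> offset into the string data."""
    # Header: entry count and data size (little-endian assumed).
    count, _data_size = struct.unpack_from("<II", blob, 0)
    directory = {}
    # Each entry is (string ID, offset relative to the start of the string data).
    for string_id, offset in struct.iter_unpack("<II", blob[8:8 + count * 8]):
        directory[string_id] = offset
    return directory

# Three entries: IDs 1 and 2 share offset 0, ID 7 points at offset 3.
sample = (
    struct.pack("<II", 3, 6)
    + struct.pack("<IIIIII", 7, 3, 1, 0, 2, 0)
    + b"ab\x00cd\x00"
)
print(read_directory(sample))  # {7: 3, 1: 0, 2: 0}
```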

String Data

There are 2 slightly different types of string data, depending on the file extension.

.strings

Null-terminated C-style string.

Type/Size   Info
zstring     Null-terminated string data.
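
Reading a .strings entry amounts to scanning from the directory offset to the next null byte. A small Python sketch, where `read_zstring` is a hypothetical helper and `data` stands for the raw string data block:

```python
def read_zstring(data, offset):
    """Return the bytes of the null-terminated string starting at offset."""
    # bytes.index raises ValueError if no terminator is found.
    end = data.index(b"\x00", offset)
    return data[offset:end]

data = b"Iron Sword\x00Gold\x00"
print(read_zstring(data, 0))   # b'Iron Sword'
print(read_zstring(data, 11))  # b'Gold'
```

The result is left as raw bytes here because the text encoding depends on the localization.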

.dlstrings, .ilstrings

Also a null-terminated C-style string, but preceded by an additional uint32 that specifies its length. The length includes the null terminator.

Type/Size       Info
uint32          Length of following string, including null-terminator.
uint8[length]   Null-terminated string data.
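
A length-prefixed entry can be read as sketched below in Python. The little-endian prefix is an assumption, `read_lstring` is a hypothetical helper, and the consistency check is an illustrative choice rather than documented game behaviour.

```python
import struct

def read_lstring(data, offset):
    """Return the string bytes (without terminator) of a length-prefixed entry."""
    # uint32 length prefix (little-endian assumed); it counts the null terminator.
    (length,) = struct.unpack_from("<I", data, offset)
    start = offset + 4
    raw = data[start:start + length]
    if len(raw) != length or not raw.endswith(b"\x00"):
        raise ValueError("length prefix does not match a null-terminated string")
    return raw[:-1]

text = b"Hello\x00"
data = struct.pack("<I", len(text)) + text
print(read_lstring(data, 0))  # b'Hello'
```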

String Encodings

The string encodings supported by Skyrim are determined by the "fonts_en.swf" file in "Skyrim - Interface.bsa", which varies between languages. The following table gives the known localizations of the "fonts_en.swf" file and their corresponding encodings. All localizations use the same filename, so the "_en" substring is, confusingly, not indicative of the target language. Blank boxes are unknown.

Localization   Primary Encoding   Secondary Encoding
English        UTF-8              Windows-1252
French         UTF-8              Windows-1252
German         UTF-8              Windows-1252
Italian        UTF-8              Windows-1252
Spanish        UTF-8              Windows-1252
Polish         UTF-8              Custom
Czech                             Custom
Russian        UTF-8              Windows-1251
Japanese       UTF-8

The official translations all use the secondary encoding given in the table above, apart from Japanese. Polish and Czech use a custom Windows-1250-based encoding with the following character set (note that original ů and ý characters are not used):

ąáłśźżćščéęťěíůď
ýńňóžőö÷ř?úűü?ţ˙

Skyrim first attempts to interpret a string using its primary encoding. If the string contains byte sequences that are invalid in that encoding, the secondary encoding is used instead. It is unknown what happens if the string also contains invalid bytes when interpreted using its secondary encoding (e.g. by including unused bytes).
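
The fallback described above can be approximated in Python as follows. This is only a sketch of the described behaviour, not the game's actual implementation: `decode_string` is a hypothetical helper, `cp1252` and `cp1251` are Python's codec names for Windows-1252 and Windows-1251, and substituting a replacement character when the secondary encoding also fails is an assumption, since that case is unknown.

```python
def decode_string(raw, primary="utf-8", secondary="cp1252"):
    """Decode raw string bytes, trying the primary encoding first."""
    try:
        return raw.decode(primary)
    except UnicodeDecodeError:
        # Assumption: bytes that are also invalid in the secondary encoding
        # become U+FFFD; what the game does in that case is unknown.
        return raw.decode(secondary, errors="replace")

print(decode_string("café".encode("utf-8")))                    # valid UTF-8
print(decode_string("café".encode("cp1252")))                   # falls back
print(decode_string("Меч".encode("cp1251"), secondary="cp1251"))
```

Short legacy-encoded strings can occasionally happen to form valid UTF-8, in which case a fallback of this kind would not trigger.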

Note that interpretation is done after alias lookup and substitution, so if the string used for an alias is in a different encoding from the string containing the alias, the combined string will not be displayed correctly. Note also that each localization's fonts have incomplete character coverage. For example, the English localization's font cannot display Cyrillic characters even when strings are encoded in UTF-8, nor can it display some of the lesser-used characters available in Windows-1252.

There also appears to be a lack of UTF-8 support in certain circumstances, thus far reported for text in scripts. In these circumstances it appears that the secondary encoding is used, but this issue has not yet been investigated.