VTrain (Vocabulary Trainer) --- Learning resources

The ultimate vocabulary learning software

Alphabets & fonts

All versions of Windows are able and ready to handle Western alphabets, but if you intend to read or type in certain non-Western scripts on your system, you will need to take certain measures. These may include installing fonts, tuning up your Windows system a little bit, or both.

On this page you will find background information about how to handle this topic in Windows.

Contents

Types of scripts
One-byte fonts
Language-specific two-byte fonts
Unicode fonts
More links

Types of scripts

There are languages that use alphabets (e.g. English, Spanish), languages that use alphabets with "context-sensitive" glyphs (e.g. Arabic, Hebrew), languages that use syllabaries or syllable blocks (e.g. Japanese kana, Korean Hangul), and languages that use more complex, largely logographic scripts (e.g. Chinese, Japanese kanji).

For solutions listed by language, see our Fonts index page.

For a survey of the world's writing systems, see:

omniglot.com

One-byte fonts

Introduction

One-byte fonts map up to 256 characters and are usually designed for use with a given script or alphabet, i.e. they are language-specific.

For this reason, the characters are usually mapped in compliance with some standard chart. Even so, non-standard fonts containing offbeat characters are sometimes very useful for language learning. For example, you may need the accented Russian vowels or the vowel-pointed Arabic consonants that dictionaries use.

Technical information about one-byte fonts

The "byte" (abbreviated B) is a unit of data that is 8 "bits" (abbreviated b) long. A single byte can therefore distinguish up to 256 characters (1 byte = 8 bits; 2 to the power of 8 gives 256 different values).

Since most language scripts are alphabet-based, 256 characters are usually enough to represent all symbols of a given language. This is why all fonts in use in the first versions of Windows were single-byte fonts.

Back then, different language-specific "codepages" were developed. A codepage is a chart in which each of 256 entries is allocated a character:

- Characters 0-127 match the old ASCII character set (7 bits) and are common to most codepages. They contain the good old letters A-Z, the digits 0-9, and common punctuation signs.

- The remaining characters, 128-255, are reserved for so-called "special" or "extended" characters, often loosely called "extended ASCII". Such characters are language-specific and differ from one codepage to another.
  In the Western European ANSI codepage, most special characters are accented letters and the like (ä, é, ñ, ø, ...)

Moral: Every one-byte font complies with some codepage, i.e. it is language-specific.
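To see why the codepage matters, the following minimal sketch decodes one and the same byte under several codepages. It is only an illustration: the codec names (`cp1252`, `latin-1`, `cp1251`, `cp437`) are the ones used by Python's standard library, and the byte value 233 is an arbitrary choice.

```python
# One byte value, several codepages: the meaning of bytes 128-255
# depends entirely on which codepage is assumed.
raw = bytes([0xE9])  # a single byte with the value 233

# Codec names as shipped with Python's standard library.
codepages = ["cp1252", "latin-1", "cp1251", "cp437"]

decoded = {name: raw.decode(name) for name in codepages}
for name, char in decoded.items():
    print(f"{name:8} 0xE9 -> {char} (U+{ord(char):04X})")

# Bytes 0-127 are the ASCII range and decode identically in all of them.
ascii_views = {name: b"A".decode(name) for name in codepages}
print(ascii_views)
```

The two Western European codepages agree on an accented "e", while the Cyrillic and OEM codepages turn the same byte into a Cyrillic and a Greek letter respectively.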

Some examples of codepages:

- OEM 437 (US): Western European alphabets.
  OEM codepages are as old as MS-DOS.

- Windows 1252: Western European alphabets.
  Also known as ANSI, almost identical to ISO 8859-1.

- Windows 1251: Cyrillic alphabet (e.g. Russian).

- ISO 8859-1: Western European alphabets.
  Also known as Latin 1, almost identical to Windows 1252.

- Symbol: Greek characters and mathematical symbols.
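How close is "almost identical"? As a rough check, the sketch below compares Windows 1252 and ISO 8859-1 byte by byte using Python's built-in `cp1252` and `latin-1` codecs. Marking unassigned Windows 1252 positions with the replacement character is a choice made for this example, not part of either standard.

```python
# Compare Windows 1252 and ISO 8859-1 byte by byte.
differences = []
for value in range(256):
    raw = bytes([value])
    latin1_char = raw.decode("latin-1")
    # A few Windows 1252 positions are unassigned; errors="replace"
    # turns them into U+FFFD instead of raising an exception.
    cp1252_char = raw.decode("cp1252", errors="replace")
    if latin1_char != cp1252_char:
        differences.append(value)

print(f"{len(differences)} byte values differ, "
      f"from 0x{min(differences):02X} to 0x{max(differences):02X}")

# Example: Windows 1252 places the euro sign at 0x80.
euro = bytes([0x80]).decode("cp1252")
print("0x80 in Windows 1252:", euro)
```

The differences are confined to the range 0x80-0x9F, where ISO 8859-1 has control codes and Windows 1252 has printable characters such as the euro sign and curly quotes.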

More information:

The ISO 8859 character soup shows charts of 8-bit (single-byte) Latin, Cyrillic, Arabic, Greek, Hebrew codepages in use today. See also XenCraft's page and alis.com's page.

Limitations of one-byte fonts

One important disadvantage of one-byte fonts is that multilingual editing is not possible in plain text format. (Remember the gibberish text in much of the spam email we receive today?) To deal with this problem, most modern fonts comply with the Unicode standard.
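The gibberish effect is easy to reproduce. In this small sketch, a Russian word is stored with the Cyrillic codepage and then read back as if it were Western European text; the sample strings are made up for the illustration, and the codec names are those of Python's standard library.

```python
text = "Привет"  # Russian for "hello"

# Written by a program that assumes the Cyrillic codepage...
stored = text.encode("cp1251")

# ...and read by a program that assumes the Western European one.
garbled = stored.decode("cp1252")
print(garbled)

# The bytes themselves are intact: decoding with the right codepage
# recovers the original text.
recovered = stored.decode("cp1251")
print(recovered)

# A single codepage cannot hold mixed-script plain text at all.
mixed = "Привет, señor"
try:
    mixed.encode("cp1251")
    fits_in_cp1251 = True
except UnicodeEncodeError:
    fits_in_cp1251 = False
print("Fits in Windows 1251:", fits_in_cp1251)
```

Plain text carries no record of its codepage, so the reader has to guess, and a wrong guess yields accented Latin letters where Cyrillic was meant.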

In addition, languages such as Chinese, Japanese and Korean (Hanja) have too many characters to fit in a chart with only 256 entries. For this reason, so-called double-byte character sets (DBCS) have been in use for decades.

Language-specific two-byte fonts

Two-byte font systems can map up to 65536 (256 squared) characters. Specifically, so-called DBCS (Double-byte Character Sets) have been developed to make computers able to handle complex scripts such as Chinese, Japanese, and Korean.

To type in a DBCS font, you have to add a CJK input method editor (IME) to your Windows system; to display the text, the matching fonts must be installed as well.
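As an illustration of how such character sets behave in practice, the sketch below encodes one ASCII letter and one CJK character in three legacy East Asian encodings. The choice of `gbk`, `shift_jis` and `euc_kr` (Python codec names) is an assumption made for the example; the page itself does not name specific DBCS encodings.

```python
# One ASCII letter plus one CJK character per encoding.
samples = {
    "gbk": "A中",        # Simplified Chinese
    "shift_jis": "A日",  # Japanese
    "euc_kr": "A한",     # Korean
}

byte_counts = {}
for codec, text in samples.items():
    # Number of bytes each character occupies in this encoding.
    byte_counts[codec] = [len(char.encode(codec)) for char in text]
    print(f"{codec:10} {text!r} -> {byte_counts[codec]} bytes per character")
```

In these encodings the ASCII letter still takes a single byte and only the CJK character takes two, so existing ASCII text remains readable.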

Unicode fonts

Introduction

Unicode fonts can contain thousands of characters, thus potentially covering all languages of the world.

Note for Windows users: Unicode fonts and methods work properly only on Windows 2000 / XP and later. On older operating systems, trying to use non-Western Unicode characters is bound to cause trouble.

Nevertheless, you can still use the good old one-byte fonts in Windows 2000 / XP. In fact, certain one-byte fonts with a non-standard codepage prove very useful for language learning. For example, you may need accented characters to learn Russian.

Technical information about the Unicode standard

Unicode is a character encoding system that aims to cover all languages in the world. In contrast to Double-Byte Character Sets (which can contain up to 65536 characters), the Unicode standard can be extended progressively.

So far, 17 "planes" of 65536 code points each have been defined for Unicode, and most scripts in current use are covered by Plane 0 (the "Basic Multilingual Plane"). Moreover, most Unicode fonts do not map every entry ("code point"), partly to avoid wasting system resources (memory) and partly to limit the design work involved.
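The plane of a character follows directly from its code point number, since each plane spans 65536 consecutive code points. A minimal sketch (the helper name `plane_of` and the sample characters are chosen for this example only):

```python
def plane_of(char: str) -> int:
    """Return the Unicode plane (0-16) that a single character belongs to."""
    return ord(char) >> 16  # each plane holds 0x10000 (65536) code points


# Latin, accented Latin, CJK, an emoji, and a rare CJK ideograph.
samples = ["A", "\u00e9", "\u4e2d", "\U0001F600", "\U00020000"]

planes = {char: plane_of(char) for char in samples}
for char, plane in planes.items():
    print(f"U+{ord(char):04X} -> plane {plane}")

# 17 planes of 65536 code points give the total size of the code space.
total_code_points = 17 * 0x10000
print("Code points available:", total_code_points)
```

The first three samples fall in Plane 0; the emoji and the rare ideograph fall in Planes 1 and 2.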

The characters in a Unicode font are grouped into "subsets" (called "blocks" in the standard: Latin, Greek, Braille, mathematical symbols, etc.), which do not match the codepages of one-byte fonts.

Check out the Free Online Unicode Character Map at Oxford University to view the characters contained in Unicode fonts. See also A Unicode Test Page to see if your browser is Unicode-compliant.

Now, how are Unicode characters actually stored in memory? For this purpose, there are several systems called "encodings". The most important encodings are:

- UCS-2 uses two bytes per character, so it can represent up to 65536 (256 squared) different characters.
  Windows NT4 uses UCS-2 as its native string type; Windows 2000 and XP use UTF-16, which is identical for these 65536 characters and adds four-byte "surrogate pairs" for the rest. If you use legacy one-byte-based software, the operating system will try to convert the characters from ANSI to this format and vice versa.

- UTF-8 stores characters 0-127 using one byte per character and the remaining characters using two to four bytes (early versions of the format allowed up to six).
  Although UTF-8 uses more memory than DBCS systems when storing East Asian characters (typically three bytes instead of two), it has the advantage that legacy software handles it respectably, provided that the text is limited to the ASCII characters 0-127 (accented letters already take two bytes).

- Language-specific encodings match the old codepages used in one-byte fonts. For example, you have ISO 8859-1 (Latin 1), Windows 1252 (ANSI) etc.
  In many web browsers, you can choose an encoding for page display. Many email clients do not offer this option.
  Note that you can view how Unicode characters are mapped by these encodings in the Windows Character Map (Start menu | Programs | Accessories | (System Tools)).
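The storage cost of these encodings can be compared with a short sketch. Note the assumptions: Python ships no separate UCS-2 codec, so `utf-16-le` stands in for it (the two agree for every Plane 0 character), `cp1252` represents the language-specific encodings, and the helper `encoded_length` exists only in this example.

```python
# Characters from different scripts; the last one lies outside Plane 0.
samples = ["A", "é", "й", "中", "\U0001F600"]

# "utf-16-le" stands in for UCS-2 (identical for Plane 0 characters).
encodings = ["utf-8", "utf-16-le", "cp1252"]


def encoded_length(char, encoding):
    """Return the byte count, or None if the encoding lacks the character."""
    try:
        return len(char.encode(encoding))
    except UnicodeEncodeError:
        return None


table = {char: [encoded_length(char, enc) for enc in encodings]
         for char in samples}

print("code point".ljust(12) + "".join(enc.rjust(11) for enc in encodings))
for char, lengths in table.items():
    cells = "".join(("-" if n is None else str(n)).rjust(11) for n in lengths)
    print(f"U+{ord(char):04X}".ljust(12) + cells)
```

A dash in the output means the encoding cannot represent the character at all, which is the case for everything non-Western in a Western European codepage.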

Unfortunately, due to the issues explained below, we must advise you against using Unicode fonts for non-Western scripts. Use a language-specific, single-byte font instead.

Controversies

It has been argued that Unicode will be too small for future East Asian online digital libraries, which will need to support old character sets. Asian historical and personal names use many characters, far more than the 65536 covered by the two-byte Unicode set (the Basic Multilingual Plane). Moreover, Unicode ignores the fact that several different glyphs are in use for many characters, folding the differences into unified glyphs (Unihan glyphs).

On the other hand, the Government of the People's Republic of China has made compatibility with the GB18030 character set, which uses up to four bytes per character, compulsory for software sold in that country.
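For the curious, the variable width of GB18030 can be observed with Python's built-in `gb18030` codec. This is only a sketch; the three sample characters are arbitrary, the last one being a rare ideograph from outside Plane 0.

```python
# An ASCII letter, a common Chinese character, and a rare ideograph.
samples = ["A", "\u4e2d", "\U00020000"]

lengths = [len(char.encode("gb18030")) for char in samples]
for char, length in zip(samples, lengths):
    print(f"U+{ord(char):04X} -> {length} byte(s) in GB18030")

# The encoding is lossless for these characters: decoding restores them.
round_trip = [char.encode("gb18030").decode("gb18030") for char in samples]
print(round_trip == samples)
```

Common characters keep their compact one-byte or two-byte form, and only the rarer ones need the full four bytes.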

More links

For more information about Unicode, see

Unicode.org The worldwide consortium defining the Unicode standard.
Alan Wood's Unicode resources Comments on Unicode-related utilities.

More links

See also:

Korpela's tutorial on character code issues
Wolfgang Kirsch's Fremdsprachige Textverarbeitung in Windows [in German]

Updated: 2017 January 16
Legal notice. Copyright © 1999-2017 by The authors. All rights reserved.
Our homepage is http://www.vtrain.net