http://desktoppub.about.com/cs/placetext/a/plaintext.htm

 

·      Plain text files normally use one of two types of character sets: ASCII or ANSI.

 

·      DOS uses ASCII.

 

·       ANSI is the character set used by Windows.

 

·      Sometimes the term ASCII file is used to describe any plain text file.

 


 

http://www.jimprice.com/jim-asc.htm

 

ASCII - The American Standard Code for Information Interchange is a standard seven-bit code that was proposed by ANSI in 1963, and finalized in 1968.

 

The standard ASCII character set consists of 128 decimal numbers ranging from zero through 127 assigned to letters, numbers, punctuation marks, and the most common special characters. The Extended ASCII Character Set also consists of 128 decimal numbers and ranges from 128 through 255 representing additional special, mathematical, graphic, and foreign characters.
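
As a small illustration of the two ranges described above, the following Python sketch classifies a code value as standard ASCII (0 through 127) or extended (128 through 255). The function name `classify_code` is made up for this example and does not come from the quoted page.

```python
def classify_code(value: int) -> str:
    """Return which half of an 8-bit character set a code value falls in."""
    if 0 <= value <= 127:
        return "standard ASCII"   # fits in seven bits
    if 128 <= value <= 255:
        return "extended"         # needs the eighth bit
    raise ValueError("not an 8-bit code value")


# ord() gives the code of a character, chr() goes the other way.
print(ord("A"), classify_code(ord("A")))
print(chr(97))
print(classify_code(233))
```

Which character an extended value such as 233 stands for depends on the character set in use; only the lower 128 values are fixed by ASCII.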

 

 

 

http://www.ansi.org/

 

The American National Standards Institute (ANSI) has served in its capacity as administrator and coordinator of the United States private sector voluntary standardization system for more than 80 years.

 

Founded in 1918 by five engineering societies and three government agencies, the Institute remains a private, nonprofit membership organization supported by a diverse constituency of private and public sector organizations.

 

http://www.asciitable.com/

 

 

…ASCII was developed a long time ago and now the non-printing characters are rarely used for their original purpose

 

ASCII was actually designed for use with teletypes and so the descriptions are somewhat obscure.

 

Notepad.exe creates ASCII text, or in MS Word you can save a file as 'text only'.

 


Control Codes

The following is a more detailed description of some ASCII characters, often referred to as control codes.

 

 

NUL (null)  
                 
SOH (start of heading) 
      
STX (start of text)  
        
ETX (end of text)
            
EOT (end of transmission) - Not the same as ETB   
 
ENQ (enquiry) 
               
ACK (acknowledge) 
           
BEL (bell) - Caused teletype machines to ring a bell.  Causes a beep in many common terminals and terminal emulation programs.
 
BS  (backspace) - Moves the cursor (or print head) backwards (left) one space.
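
The control codes listed above occupy the first nine code values of ASCII. A minimal Python sketch that prints their decimal and hexadecimal values is given here; the dictionary name `CONTROL_CODES` is an assumption made for this example.

```python
# Code values for the control codes described above (standard ASCII).
CONTROL_CODES = {
    "NUL": 0, "SOH": 1, "STX": 2, "ETX": 3, "EOT": 4,
    "ENQ": 5, "ACK": 6, "BEL": 7, "BS": 8,
}

for name, code in CONTROL_CODES.items():
    # repr() shows an escape sequence instead of emitting the raw character
    print(f"{name:<3} {code:>2}  0x{code:02X}  {chr(code)!r}")

# Python spells BEL and BS with the escapes \a and \b.
assert "\a" == chr(CONTROL_CODES["BEL"])
assert "\b" == chr(CONTROL_CODES["BS"])
```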

 

 

 


http://www.microsoft.com/typography/unicode/cs.htm

 

Windows ANSI

You can think of Windows ANSI as a lower 128, and an upper 128. The lower 128 is identical to ASCII, and the upper 128 is different for each ANSI character set, and is where the various international characters are parked.

 

code page    1250            1251      1252            1253   1254     etc.
upper 128    Eastern Europe  Cyrillic  West Euro ANSI  Greek  Turkish  etc.
lower 128    ASCII           ASCII     ASCII           ASCII  ASCII    etc.
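
To see the "lower 128 / upper 128" split in practice, the sketch below decodes one byte under each of the five code pages named above, using Python's built-in `cp1250` through `cp1254` codecs. The helper name `decode_in_code_pages` is hypothetical.

```python
CODE_PAGES = ("cp1250", "cp1251", "cp1252", "cp1253", "cp1254")


def decode_in_code_pages(byte_value: int) -> dict:
    """Decode a single byte under each Windows code page listed above."""
    raw = bytes([byte_value])
    return {page: raw.decode(page) for page in CODE_PAGES}


# A byte in the lower 128 means the same thing everywhere (ASCII).
print(decode_in_code_pages(0x41))

# A byte in the upper 128 depends on the code page.
print(decode_in_code_pages(0xE0))
```

The first call yields the letter A for every code page, while the second yields a different letter in most of them, which is why a plain text file in one of these character sets cannot be read reliably without knowing which code page it was written in.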

 


Windows ANSI Character Set

 

 

http://www.bpsoft.com/reference/ansi.html

 


ISO Character Sets

ASCII(ISO-646 or ISO-ASCII)

7-bit ASCII uses characters 0-127.

ISO-8859-1(Latin-1) Note: all ISO-8859 sets are 8-bit

ASCII plus the accented characters and other characters needed for most Latin-alphabet Western European languages, including Danish, Dutch, Finnish, French, German, Icelandic, Italian, Norwegian, Portuguese, Spanish, and Swedish.

ISO-8859-2(Latin-2)

ASCII plus the other characters needed to write most Latin-alphabet Central and Eastern European languages, including Czech, English, German, Hungarian, Polish, Romanian, Croatian, Slovak, Slovenian, and Sorbian.

ISO-8859-3(Latin-3)

ASCII plus the accented letters and other characters needed to write Esperanto, Maltese, and Turkish.

ISO-8859-4(Latin-4)

ASCII plus the accented letters and other characters needed to write most Baltic languages, including Estonian, Latvian, Lithuanian, Greenlandic, and Lappish. Now deprecated; use 8859-10 or 8859-13 instead.

ISO-8859-5

ASCII plus the Cyrillic alphabet used for Russian and other languages of the former Soviet Union and other Slavic countries, including Bulgarian, Byelorussian, Macedonian, Serbian, and Ukrainian.

ISO-8859-6

ASCII plus Arabic, not including Farsi and Urdu.

ISO-8859-7

ASCII plus modern Greek. This set does not have the extra letters and accents necessary for ancient and Byzantine Greek.

ISO-8859-8

ASCII plus Hebrew script used for Hebrew and Yiddish.

ISO-8859-9(Latin-5)

Like Latin-1, except six letters used in Icelandic have been replaced with six letters used in Turkish.

ISO-8859-10(Latin-6)

ASCII plus accented letters and other characters needed to write most Baltic languages, including Estonian, Icelandic, Latvian, Lithuanian, Greenlandic, and Lappish.

ISO-8859-11

ASCII plus Thai.

ISO-8859-13(Latin-7)

Similar to Latin-6, except with some question marks.

ISO-8859-14(Latin-8)

ASCII plus Celtic languages, including Gaelic and Welsh.

ISO-8859-15(Latin-9, Latin-0)
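
The note on Latin-5 above (six Icelandic letters replaced by six Turkish ones) can be checked with a short Python sketch that compares two of the 8-bit sets byte by byte. It assumes Python's built-in `iso8859_1` and `iso8859_9` codecs; the function name `differing_bytes` is invented for this example.

```python
def differing_bytes(codec_a: str, codec_b: str) -> list:
    """List the byte values that decode differently under two 8-bit codecs."""
    diffs = []
    for value in range(256):
        raw = bytes([value])
        a = raw.decode(codec_a)
        b = raw.decode(codec_b)
        if a != b:
            diffs.append((value, a, b))
    return diffs


# Latin-1 versus Latin-5: print each position where the two sets disagree.
for value, latin1, latin5 in differing_bytes("iso8859_1", "iso8859_9"):
    print(f"0x{value:02X}: {latin1} -> {latin5}")
```

The same function can be pointed at any other pair of single-byte codecs, provided both define all 256 byte values; a codec with unassigned positions would raise a decoding error in this sketch.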


http://www.unicode.org/

 

 

What is Unicode?

 

Unicode provides a unique number for every character,

no matter what the platform,

no matter what the program,

no matter what the language.
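
A brief Python sketch of the "unique number for every character" idea follows. It prints the Unicode number (code point) of a few characters from different scripts, along with the bytes one common encoding, UTF-8, uses to store them. The helper name `code_point` is an assumption for this example.

```python
def code_point(ch: str) -> str:
    """Format a character's Unicode number in the usual U+XXXX notation."""
    return f"U+{ord(ch):04X}"


for ch in ("A", "é", "Ω", "中"):
    utf8 = ch.encode("utf-8").hex(" ")
    print(f"{ch}  {code_point(ch)}  UTF-8 bytes: {utf8}")
```

The number stays the same on every platform; only the byte representation depends on which encoding is chosen.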

 

 


http://www.linguistics.ucsb.edu/faculty/cumming/WordForLinguists/Fonts.htm

Character Sets

The following are the character sets most commonly found in Windows fonts.

ASCII (32 control codes, 95 characters). This is the set familiar from the dawn of computing; all of these characters correspond to keyboard keys. 32 is the space, 48-57 are numbers, 65-90 are upper-case letters, 97-122 are lower-case letters, and the rest are punctuation and symbols. This set is shared across computing platforms; other sets are operating-system specific (except for Unicode, at least theoretically).

Windows ANSI (224 characters; since Win3.1). This adds the accented characters common in Western European languages (including vowels with acute, grave, circumflex and umlaut accents), plus some other typographical symbols like “curly quotes”, bullets, currency symbols etc. The additional 126 characters above the ASCII set are called “Upper ANSI”.

Extended Windows ANSI (652 characters; since Win95). Adds accented characters for Baltic, Central European, and Turkish, plus Cyrillic and Greek.

Unicode (up to 64,000 characters; 49,194 as of version 3, current at this writing; sort of supported in Windows since Win 95). Intended as a universal character set. Adds characters (broken into “ranges”) for dozens of writing systems, including logographic systems; also several symbol sets, including IPA. The standard is a work in progress. Any given “Unicode” font includes only a subset of the characters. For more information, go to www.unicode.org.
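
The ASCII code ranges given in the first entry of this list can be written out as a small Python sketch, with the space at code 32. The function name `ascii_category` and the category labels are made up for illustration.

```python
def ascii_category(ch: str) -> str:
    """Classify a single character by its ASCII code range."""
    code = ord(ch)
    if code < 32 or code == 127:
        return "control code"        # 0-31, plus DEL at 127
    if code == 32:
        return "space"
    if 48 <= code <= 57:
        return "number"
    if 65 <= code <= 90:
        return "upper-case letter"
    if 97 <= code <= 122:
        return "lower-case letter"
    if code < 127:
        return "punctuation or symbol"
    return "not ASCII"


for ch in ("\t", " ", "7", "Q", "q", "#", "é"):
    print(repr(ch), ord(ch), ascii_category(ch))
```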


http://weber.ucsd.edu/~dkjordan/chin/chineseunicode.html

 

On PCs, Microsoft has designated "Arial Unicode MS" as a typefont continuously enlarged to contain all characters in the latest official version of the Unicode Standard. It's not beautiful, but it is complete if you have the latest version.


 

http://www.w3schools.com/

 

 

HTML

 

 

How does the browser display the pages?

 

*       All Web pages contain instructions for display

*       The browser displays the page by reading these instructions.

*       The most common display instructions are called HTML tags.

*       HTML tags look like this <p>This is a Paragraph</p>.
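
To make "reading these instructions" concrete, here is a minimal Python sketch that uses the standard library's `html.parser` module to split the sample paragraph above into its opening tag, text, and closing tag. The class name `TagLogger` is hypothetical, and a real browser does far more than this.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Record each tag and piece of text in the order it is read."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_data(self, data):
        self.events.append(("text", data))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


parser = TagLogger()
parser.feed("<p>This is a Paragraph</p>")
parser.close()
print(parser.events)
```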

 

------------------------------------------------------------------------

 

Who is making the Web standards?

 

*       The Web standards are not made up by Netscape or Microsoft.

*       The rule-making body of the Web is the W3C.

*       W3C stands for the World Wide Web Consortium.

*       W3C puts together specifications for Web standards.

*       The most essential Web standards are HTML, CSS and XML.

*       The latest HTML standard is XHTML 1.0.

 

 


http://www.w3schools.com/

 

XML

 

 

*       XML stands for EXtensible Markup Language

*       XML is a markup language much like HTML

*       XML was designed to describe data

*       XML tags are not predefined. You must define your own tags

*       XML uses a Document Type Definition (DTD) or an XML Schema to describe the data

*       XML with a DTD or XML Schema is designed to be self-descriptive

 

------------------------------------------------------------------------

 

The main difference between XML and HTML

 

XML was designed to carry data.

 

XML is not a replacement for HTML.

XML and HTML were designed with different goals:

 

XML was designed to describe data and to focus on what data is.

HTML was designed to display data and to focus on how data looks.

 

HTML is about displaying information, while XML is about describing information.

 


XML does not DO anything

 

XML was not designed to DO anything.

 

Maybe it is a little hard to understand, but XML does not DO anything. XML was created to structure, store and to send information.

 

The following example is a note to Tove from Jani, stored as XML:

<note>

<to>Tove</to>

<from>Jani</from>

<heading>Reminder</heading>

<body>Don't forget me this weekend!</body>

</note>

 

The note has a header and a message body. It also has sender and receiver information. But still, this XML document does not DO anything. It is just pure information wrapped in XML tags. Someone must write a piece of software to send, receive or display it.
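
As a sketch of such a piece of software, the Python below reads the note with the standard library's `xml.etree.ElementTree` module and formats it for display. The function name `format_note` and the output layout are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

NOTE_XML = """<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>"""


def format_note(xml_text: str) -> str:
    """Turn the note's XML into a plain-text message for display."""
    note = ET.fromstring(xml_text)
    return (
        f"To: {note.findtext('to')}\n"
        f"From: {note.findtext('from')}\n"
        f"Subject: {note.findtext('heading')}\n\n"
        f"{note.findtext('body')}"
    )


print(format_note(NOTE_XML))
```

The XML itself only carries the data; every decision about how the note looks is made by the program.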

 

menu.xml

 


XML is free and extensible

 

XML tags are not predefined. You must "invent" your own tags.

 

The tags used to mark up HTML documents and the structure of HTML documents are predefined. The author of HTML documents can only use tags that are defined in the HTML standard (like <p>, <h1>, etc.).

 

XML allows the author to define his own tags and his own document structure.

 

The tags in the example above (like <to> and <from>) are not defined in any XML standard. These tags are "invented" by the author of the XML document.
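
Inventing tags needs no special declaration. The Python sketch below builds a small document with `xml.etree.ElementTree`; the tag names `recipe`, `title` and `serves` are hypothetical and chosen only for this example.

```python
import xml.etree.ElementTree as ET


def build_recipe() -> str:
    """Build a tiny XML document using tags invented for this example."""
    recipe = ET.Element("recipe")
    ET.SubElement(recipe, "title").text = "Pancakes"
    ET.SubElement(recipe, "serves").text = "4"
    # encoding="unicode" returns a str rather than bytes
    return ET.tostring(recipe, encoding="unicode")


print(build_recipe())
```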

------------------------------------------------------------------------

 

XML is a complement to HTML

 

XML is not a replacement for HTML.

 

It is important to understand that XML is not a replacement for HTML. In future Web development it is most likely that XML will be used to describe the data, while HTML will be used to format and display the same data.

 

My best description of XML is this: XML is a cross-platform, software and hardware independent tool for transmitting information.

 


http://xml.coverpages.org/sgml.html

 

Standard Generalized Markup Language (SGML)

 

SGML and XML as (Meta-) Markup Languages

 

Both SGML and XML are "meta" languages because they are used for defining markup languages. A markup language defined using SGML or XML has a specific vocabulary (labels for elements and attributes) and a declared syntax (grammar defining the hierarchy and other features).

 

Conceived notionally in the 1960s - 1970s, the Standard Generalized Markup Language (SGML, ISO 8879:1986) gave birth to a profile/subset called the Extensible Markup Language (XML), published as a W3C Recommendation in 1998.

 

Depending upon your perspective and requirements, the differences between SGML and XML are inconsequential or immense. SGML is more customizable (thus flexible and more "powerful") at the expense of being (much) more expensive to implement. For an overview of differences, see James Clark's document "Comparison of SGML and XML"; for other treatments, see references in XML and/versus SGML. As of 2002-07, relatively few enterprise-level projects are started as SGML applications, but many SGML applications implemented before 1999 are still running productively. In some cases, peculiar business requirements favor the use of SGML for certain features that have been eliminated in XML.


 

http://www.microsoft.com/globaldev/reference/winxp/xploclang.mspx

 

Language Support in Windows XP

 

·      Unlike Windows 2000, Windows XP does not need separate language groups to be installed.

 

·      Most of the language files are already included in the core installation of Windows XP.

 

·      The exceptions are the fonts and IMEs for the East Asian languages (CHS, CHT, Korean and Japanese) and the Uniscribe Script Processor (USP) engine required for the shaping and display of complex scripts.


 

Language support files are now grouped into three collections, as follows:

 

Basic Collection (installed on all languages of the OS):

 

*      Western Europe and United States (1)

*      Central Europe (2)

*      Baltic (3)

*      Greek (4)

*      Cyrillic (5)

*      Turkic (6)

 

Complex Script Collection (always installed on Arabic and Hebrew localized OSes):

 

*      Thai (11)

*      Hebrew (12)

*      Arabic (13)

*      Vietnamese (14)

*      Indic (15)

*      Georgian (16)

*      Armenian (17)

 

East Asian Collection (always installed on East Asian localized OSes):

 

*      Japanese (7)

*      Korean (8)

*      Traditional Chinese (9)

*      Simplified Chinese (10)

 

Installing any one of the languages in one collection will automatically install all the other languages within that collection. For example, an answer file specifying Language Group 8 (Korean) will invoke the installation of Language Groups 7, 9 and 10 too.
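
The collection rule can be modelled in a few lines of Python. This is only a sketch of the rule as stated above, not of how Windows setup implements it; the names `COLLECTIONS` and `groups_installed` are invented for the example.

```python
# Language group numbers per collection, as listed above.
COLLECTIONS = {
    "Basic": {1, 2, 3, 4, 5, 6},
    "Complex Script": {11, 12, 13, 14, 15, 16, 17},
    "East Asian": {7, 8, 9, 10},
}


def groups_installed(requested_group: int) -> set:
    """Return every language group pulled in by requesting one group."""
    for members in COLLECTIONS.values():
        if requested_group in members:
            return set(members)   # the whole collection comes along
    raise ValueError(f"unknown language group: {requested_group}")


# Requesting Korean (8) brings in 7, 9 and 10 as well.
print(sorted(groups_installed(8)))
```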


Use Insert Symbol to see full range of characters available in Windows XP.

 

Use “subset” and “Unicode” options to explore the full range.


 

http://deall.ohio-state.edu/chan.9/conc/concordance.htm

 

 

Instructions for Concordancing East Asian E-Texts using Concordance