http://desktoppub.about.com/cs/placetext/a/plaintext.htm
· Plain text files normally use one of two types of character sets: ASCII or ANSI.
· DOS uses ASCII.
· ANSI is the character set used by Windows.
· Sometimes the term ASCII file is used to describe any plain text file.
http://www.jimprice.com/jim-asc.htm
ASCII - The American Standard Code for Information Interchange is a standard
seven-bit code that was proposed by ANSI in 1963, and finalized in 1968. The
standard ASCII character set consists of 128 decimal numbers ranging from zero
through 127 assigned to letters, numbers, punctuation marks, and the most
common special characters. The Extended ASCII Character Set also consists of
128 decimal numbers and ranges from 128 through 255 representing additional
special, mathematical, graphic, and foreign characters.
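As a rough illustration of the 0-127 limit described above, the following Python sketch checks which characters fit in seven bits. The sample characters and the helper name is_ascii are assumptions made for this sketch, not part of the source. Note that Python's ord() returns Unicode code points, which coincide with ASCII only for 0 through 127.

```python
# Sketch: which characters fit in 7-bit ASCII (codes 0-127)?
samples = ["A", "z", "7", "~", "\u00e9", "\u0436"]  # illustrative samples (last two: e-acute, Cyrillic zhe)

def is_ascii(ch: str) -> bool:
    """Return True if the character's code falls in the standard ASCII range."""
    return ord(ch) <= 127

for ch in samples:
    print(repr(ch), ord(ch), "ASCII" if is_ascii(ch) else "outside ASCII")

# Encoding to ASCII fails for any character above 127.
try:
    "caf\u00e9".encode("ascii")
except UnicodeEncodeError as err:
    print("cannot encode:", err.reason)
```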
The American National Standards Institute (ANSI) has served in its capacity as
administrator and coordinator of the United States private sector voluntary
standardization system for more than 80 years. Founded in 1918 by five
engineering societies and three government agencies, the Institute remains a
private, nonprofit membership organization supported by a diverse constituency
of private and public sector organizations.
…ASCII was developed a long time ago and now the non-printing characters are
rarely used for their original purpose.
ASCII was actually designed for use with teletypes and so the descriptions are
somewhat obscure.
Notepad.exe creates ASCII text, or in MS Word you can save a file as 'text only'.
The following is a more detailed description of some ASCII characters, often
referred to as control codes.
NUL (null)
SOH (start of heading)
STX (start of text)
ETX (end of text)
EOT (end of transmission) - Not the same as ETB.
ENQ (enquiry)
ACK (acknowledge)
BEL (bell) - Caused teletype machines to ring a bell. Causes a beep in many common terminals and terminal emulation programs.
BS (backspace) - Moves the cursor (or print head) backwards (left) one space.
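The control codes listed above occupy the first nine ASCII positions, 0 through 8. A minimal Python sketch, assuming only that ordering (the list name CONTROL_CODES is made up for illustration):

```python
# Control codes 0-8, in the order listed above.
CONTROL_CODES = ["NUL", "SOH", "STX", "ETX", "EOT", "ENQ", "ACK", "BEL", "BS"]

for code, name in enumerate(CONTROL_CODES):
    print(f"{code:3d}  0x{code:02X}  {name}")

# Python has escape sequences for two of them: \a (bell) and \b (backspace).
assert ord("\a") == CONTROL_CODES.index("BEL")
assert ord("\b") == CONTROL_CODES.index("BS")
```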
http://www.microsoft.com/typography/unicode/cs.htm
Windows ANSI
You can think of Windows ANSI as a lower 128, and an upper 128. The lower 128 is
identical to ASCII, and the upper 128 is different for each ANSI character set,
and is where the various international characters are parked.
| code page | 1250           | 1251     | 1252      | 1253  | 1254    | etc. |
| upper     | Eastern Europe | Cyrillic | West Euro | Greek | Turkish | etc. |
| lower     | ASCII          | ASCII    | ASCII     | ASCII | ASCII   | etc. |
Windows ANSI Character Set
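To see the lower/upper split in practice, the following Python sketch decodes the same two bytes under each of the code pages named above. The byte values 0x41 and 0xE6 are arbitrary choices for illustration; the codec names are those of Python's standard library.

```python
# Decode the same two bytes under several Windows code pages.
CODE_PAGES = ["cp1250", "cp1251", "cp1252", "cp1253", "cp1254"]

lower = bytes([0x41])  # below 128: shared ASCII range
upper = bytes([0xE6])  # above 128: meaning depends on the code page

decoded = {cp: (lower.decode(cp), upper.decode(cp)) for cp in CODE_PAGES}
for cp, (low_char, high_char) in decoded.items():
    print(cp, repr(low_char), repr(high_char))
```

The lower byte comes out as "A" everywhere, while the upper byte turns into a different letter depending on the code page chosen.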
ISO Character Sets
ASCII(ISO-646 or ISO-ASCII)
7-bit ASCII
uses characters 0-127.
ISO-8859-1(Latin-1) Note: all ISO-8859 sets are 8-bit
ASCII plus the
accented characters and other characters needed for most Latin-alphabet Western
European languages, including Danish, Dutch, Finnish, French, German,
Icelandic, Italian, Norwegian, Portuguese, Spanish, and Swedish.
ISO-8859-2(Latin-2)
ASCII plus the
other characters needed to write most Latin alphabet Central and Eastern
European
languages, including Czech, English, German, Hungarian, Polish, Romanian,
Croatian, Slovak, Slovenian, and Sorbian.
ISO-8859-3(Latin-3)
ASCII plus the
accented letters and other characters needed to write Esperanto, Maltese, and
Turkish.
ISO-8859-4(Latin-4)
ASCII plus the
accented letters and other characters needed to write most Baltic
languages, including Estonian, Latvian, Lithuanian, Greenlandic, and Lappish.
Now deprecated – use 8859-10 or 8859-13 instead.
ISO-8859-5
ASCII plus the
Cyrillic alphabet used for Russian and other languages of the former Soviet
Union and other Slavic countries, including Bulgarian, Byelorussian,
Macedonian, Serbian, and
Ukrainian.
ISO-8859-6
ASCII plus
Arabic, not including Farsi and Urdu.
ISO-8859-7
ASCII plus
modern Greek. This set does not have the extra letters and accents necessary
for ancient and Byzantine Greek.
ISO-8859-8
ASCII plus
Hebrew script used for Hebrew and Yiddish.
ISO-8859-9(Latin-5)
Like Latin-1,
except six letters used in Icelandic have been replaced with six letters used
in Turkish.
ISO-8859-10(Latin-6)
ASCII plus
accented letters and other characters needed to write most Baltic languages,
including
Estonian, Icelandic, Latvian, Lithuanian, Greenlandic, and Lappish.
ISO-8859-11
ASCII plus
Thai.
ISO-8859-13(Latin-7)
Similar to
Latin-6, except with some question marks.
ISO-8859-14(Latin-8)
ASCII plus
Celtic languages, including Gaelic and Welsh.
ISO-8859-15(Latin-9, Latin-0)
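The ISO-8859 sets in the list above share the ASCII range and differ only in the upper half. As one concrete check, this Python sketch compares Latin-1 with Latin-5 byte by byte, using the standard library codecs iso8859_1 and iso8859_9; it should find only the handful of Icelandic/Turkish letter swaps mentioned in the Latin-5 entry.

```python
# Compare ISO-8859-1 (Latin-1) with ISO-8859-9 (Latin-5) byte by byte.
differences = []
for value in range(256):
    raw = bytes([value])
    latin1 = raw.decode("iso8859_1")
    latin5 = raw.decode("iso8859_9")
    if latin1 != latin5:
        differences.append((value, latin1, latin5))

for value, latin1, latin5 in differences:
    print(f"0x{value:02X}: Latin-1 {latin1!r} -> Latin-5 {latin5!r}")
```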
What is Unicode?
Unicode provides a unique number for every character,
no matter what the platform,
no matter what the program,
no matter what the language.
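A small Python sketch of the "unique number" idea: each character has one code point, and that number is separate from how the character is stored as bytes. The three sample characters are chosen here for illustration only.

```python
# Each character has one Unicode number (code point), whatever the platform.
text = "A\u00e9\u4e2d"  # Latin A, e with acute accent, a Chinese character

for ch in text:
    print(f"{ch!r}: U+{ord(ch):04X}")

# The same code points can be serialised as bytes in different encodings.
for encoding in ("utf-8", "utf-16-le"):
    print(encoding, text.encode(encoding).hex(" "))
```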
The following are the character sets most commonly found in Windows fonts.
ASCII (32 control codes, 95 characters). This is the set
familiar from the dawn of computing; all of these characters correspond to
keyboard keys. 32 is the space, 48-57 are numbers, 65-90 are upper-case
letters, 97-122 are lower-case letters, and the rest are punctuation and
symbols. This set is shared across computing platforms; other sets are
operating-system specific (except for Unicode, at least theoretically).
Windows ANSI (224 characters; since Win3.1). This adds the
accented characters common in Western European languages (including vowels with
acute, grave, circumflex and umlaut accents), plus some other typographical
symbols like “curly quotes”, bullets, currency symbols etc. The additional 126
characters above the ASCII set are called “Upper ANSI”.
Extended Windows ANSI (652 characters; since Win95). Adds accented
characters for Baltic, Central European, and Turkish, plus Cyrillic and Greek.
Unicode (up to 64,000 characters; 49,194 at this writing, version 3;
sort of supported in Windows since Win 95). Intended as a universal
character set. Adds characters (broken into “ranges”) for dozens of writing
systems, including logographic systems; also several symbol sets, including
IPA. The standard is a work in progress. Any given “Unicode” font includes only
a subset of the characters. For more information, go to www.unicode.org.
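The ASCII layout described in the list above can be spot-checked with a short Python sketch. It assumes the standard assignment, with the space at code 32; the function name describe is hypothetical.

```python
import string

def describe(code: int) -> str:
    """Classify a 7-bit ASCII code by the range it falls in."""
    ch = chr(code)
    if code < 32 or code == 127:
        return "control code"
    if code == 32:
        return "space"
    if ch in string.digits:
        return "number"
    if ch in string.ascii_uppercase:
        return "upper-case letter"
    if ch in string.ascii_lowercase:
        return "lower-case letter"
    return "punctuation or symbol"

for code in (9, 32, 48, 57, 65, 90, 97, 122, 126):
    print(code, repr(chr(code)), describe(code))
```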
http://weber.ucsd.edu/~dkjordan/chin/chineseunicode.html
On PCs, Microsoft has designated
"Arial Unicode MS" as a typefont continuously enlarged to contain all
characters in the latest official version of the Unicode Standard. It's not
beautiful, but it is complete if you have the latest version.
HTML
How does the browser display the pages?
* All Web pages contain instructions for display
* The browser displays the page by reading these instructions.
* The most common display instructions are called HTML tags.
* HTML tags look like this <p>This is a Paragraph</p>.
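To give a feel for how a program reads such instructions, here is a minimal sketch using Python's standard html.parser module. The class name TagLogger is hypothetical, and a real browser does far more than this; the sketch only shows tags and text being recognised in order.

```python
from html.parser import HTMLParser

class TagLogger(HTMLParser):
    """Hypothetical helper that records tags and text as the parser meets them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("text", data))

parser = TagLogger()
parser.feed("<p>This is a Paragraph</p>")
parser.close()
print(parser.events)
```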
------------------------------------------------------------------------
Who is making the Web standards?
* The Web standards are not made up by Netscape or Microsoft.
* The rule-making body of the Web is the W3C.
* W3C stands for the World Wide Web Consortium.
* W3C puts together specifications for Web standards.
* The most essential Web standards are HTML, CSS and XML.
* The latest HTML standard is XHTML 1.0.
http://www.w3schools.com/
XML
* XML stands for EXtensible Markup Language
* XML is a markup language much like HTML
* XML was designed to describe data
* XML tags are not predefined. You must define your own tags
* XML uses a Document Type Definition (DTD) or an XML Schema to describe the data
* XML with a DTD or XML Schema is designed to be self-descriptive
------------------------------------------------------------------------
The main difference between XML and HTML
XML was designed to carry data.
XML is not a replacement for HTML.
XML and HTML were designed with different goals:
XML was designed to describe data and to focus on what data is.
HTML was designed to display data and to focus on how data looks.
HTML is about displaying information, while XML is about
describing information.
XML does not DO anything
XML was not designed to DO anything.
Maybe it is a little hard to understand, but XML does not DO
anything. XML was created to structure, store and to send information.
The following example is a note to Tove from Jani, stored as XML:
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
The note has a header and a message body. It also has sender and
receiver information. But still, this XML document does not DO anything. It is
just pure information wrapped in XML tags. Someone must write a piece of
software to send, receive or display it.
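As a sketch of the "piece of software" the text calls for, the following Python code reads the note with the standard xml.etree.ElementTree module and formats it for display. The function name display and the output layout are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

NOTE_XML = """<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>"""

note = ET.fromstring(NOTE_XML)

def display(note_element):
    """Hypothetical display routine: the XML does nothing until code reads it."""
    return (
        f"To: {note_element.findtext('to')}\n"
        f"From: {note_element.findtext('from')}\n"
        f"{note_element.findtext('heading')}: {note_element.findtext('body')}"
    )

print(display(note))
```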
XML is free and extensible
XML tags are not predefined. You must "invent" your own tags.
The tags used to mark up HTML documents and the structure of HTML
documents are predefined. The author of HTML documents can only use tags that
are defined in the HTML standard (like <p>, <h1>, etc.).
XML allows the author to define his own tags and his own document
structure.
The tags in the example above (like <to> and <from>)
are not defined in any XML standard. These tags are "invented" by the
author of the XML document.
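A short Python sketch of "inventing" tags, again using xml.etree.ElementTree. The tags <priority> and <due> are made up for this example and are not defined in any standard or in the source.

```python
import xml.etree.ElementTree as ET

# <priority> and <due> are invented for this sketch; no standard defines them.
note = ET.Element("note")
ET.SubElement(note, "to").text = "Tove"
ET.SubElement(note, "priority").text = "low"
ET.SubElement(note, "due").text = "weekend"

xml_text = ET.tostring(note, encoding="unicode")
print(xml_text)

# A generic XML parser reads the result back without knowing the tags in advance.
assert ET.fromstring(xml_text).findtext("priority") == "low"
```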
------------------------------------------------------------------------
XML is a complement to HTML
XML is not a replacement for HTML.
It is important to understand that XML is not a replacement for HTML.
In future Web development it is most likely that XML will be used to describe
the data, while HTML will be used to format and display the same data.
My best description of XML is this: XML is a cross-platform,
software and hardware independent tool for transmitting information.
http://xml.coverpages.org/sgml.html
Standard Generalized Markup Language (SGML)
SGML and XML as (Meta-) Markup Languages
Both SGML and XML are "meta" languages because they are used
for defining markup languages. A markup language defined using SGML or XML has
a specific vocabulary (labels for elements and attributes) and a declared
syntax (grammar defining the hierarchy and other features).
Conceived notionally in the 1960s - 1970s, the Standard
Generalized Markup Language (SGML, ISO 8879:1986) gave birth to a
profile/subset called the Extensible Markup Language (XML), published as a W3C
Recommendation in 1998.
Depending upon your perspective and requirements, the differences
between SGML and XML are inconsequential or immense. SGML is more customizable
(thus flexible and more "powerful") at the expense of being (much)
more expensive to implement. For an overview of differences, see James Clark's
document "Comparison of SGML and XML"; for other treatments, see references in
XML and/versus SGML.
As of 2002-07, relatively few enterprise-level projects are started as SGML
applications, but many SGML applications implemented before 1999 are still
running productively. In some cases, peculiar business requirements favor the
use of SGML for certain features that have been eliminated in XML.
http://www.microsoft.com/globaldev/reference/winxp/xploclang.mspx
Language Support in Windows XP
· Unlike Windows 2000, Windows XP has no need for language groups that are
installed separately.
· Most of the language files are already included in the core installation of
Windows XP.
· The exceptions are the fonts and IMEs for the East Asian languages (CHS, CHT,
Korean and Japanese) and the Uniscribe Script Processor (USP) engine required
for the shaping and display of complex scripts.
Language support files are now grouped into three collections, as
follows:
Basic Collection (installed on all languages of the OS):
* Western Europe and United States (1)
* Central Europe (2)
* Baltic (3)
* Greek (4)
* Cyrillic (5)
* Turkic (6)
Complex Script Collection (always installed on Arabic and Hebrew localized OSes):
* Thai (11)
* Hebrew (12)
* Arabic (13)
* Vietnamese (14)
* Indic (15)
* Georgian (16)
* Armenian (17)
East Asian Collection (always installed on East Asian localized OSes):
* Japanese (7)
* Korean (8)
* Traditional Chinese (9)
* Simplified Chinese (10)
Installing any one of the
languages in one collection will automatically install all the other languages
within that collection. For example, an answer file specifying Language Group 8
(Korean) will invoke the installation of Language Groups 7, 9 and 10 too.
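The rule above can be modelled with a small lookup. The following Python sketch uses the group numbers from the lists in this section; the names COLLECTIONS and groups_installed are hypothetical and do not correspond to any Windows setup API.

```python
# Language group numbers taken from the collection lists in this section.
COLLECTIONS = {
    "Basic": {1, 2, 3, 4, 5, 6},
    "Complex Script": {11, 12, 13, 14, 15, 16, 17},
    "East Asian": {7, 8, 9, 10},
}

def groups_installed(requested: int) -> set:
    """Return every language group pulled in by requesting one group."""
    for groups in COLLECTIONS.values():
        if requested in groups:
            return set(groups)
    raise ValueError(f"unknown language group: {requested}")

print(sorted(groups_installed(8)))  # requesting Korean
```

Requesting group 8 yields the whole East Asian collection (7, 8, 9 and 10), which matches the answer-file example.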
Use Insert Symbol to see full range of characters available in Windows XP.
Use “subset” and “Unicode” options to explore the full
range.
http://deall.ohio-state.edu/chan.9/conc/concordance.htm
Instructions for Concordancing East Asian E-Texts
using