
Encoding standards

Introduction

Encoding standards deal with the representation of data within the computer. All computers store data in binary format, but the same character can be stored differently depending on the encoding standard used.

Some encoding standards, such as EBCDIC and the extended ASCII character sets, are based on an 8-bit byte (or octet), which allows them to represent English letters and some non-English characters, graphics symbols, and mathematical symbols. Standard ASCII itself uses only 7 bits per character.

Other encoding standards, such as Unicode, can use two octets (16 bits) or more per character, allowing them to represent a much larger character set including Arabic, Cyrillic, Greek and Hebrew characters as well as English.

To illustrate how various encoding standards represent the same character differently, we can use the letter A as an example, showing its binary bit pattern.

  • in ASCII format, A is represented by 01000001
  • in EBCDIC format, A is represented by 11000001
  • in 16-bit Unicode format, A is represented by 0000000001000001
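
The bit patterns above can be reproduced with a short Python sketch. It assumes Python's built-in `cp037` codec as a stand-in for EBCDIC (EBCDIC exists in many code pages, and this is only one of them) and `utf-16-be` for the 16-bit Unicode form. The helper name `bit_pattern` is ours, chosen for illustration.

```python
def bit_pattern(text: str, encoding: str) -> str:
    """Return the encoded bytes of text as a string of binary digits."""
    return "".join(f"{byte:08b}" for byte in text.encode(encoding))

# cp037 is one EBCDIC code page (an assumption); utf-16-be is 16-bit
# Unicode written high-order byte first.
for label, codec in [("ASCII", "ascii"), ("EBCDIC", "cp037"), ("Unicode", "utf-16-be")]:
    print(f"{label:8} {bit_pattern('A', codec)}")
```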

This means that, for example, if a file in EBCDIC format is transferred to a computer which expects files in ASCII format, each character will be interpreted as if it were in ASCII, in effect becoming corrupt and losing its meaning.

One answer is to allow for the conversion of files from one format to the other when transferred between computers which use incompatible encoding standards. ODEX Enterprise, for example, recognises EBCDIC files and can convert them before displaying their details in human-readable text. However, if a user tries to view an EBCDIC file using a standard editor, such as Notepad, no conversion will be done and the file will be illegible.
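
As a rough illustration of both the problem and the conversion, the Python sketch below misreads some EBCDIC bytes and then converts them properly. It assumes the `cp037` codec for EBCDIC, and it uses `latin-1` to imitate an editor that displays every byte as it stands (strict ASCII would simply reject bytes above 127). It is not a description of how ODEX Enterprise performs its conversion.

```python
# "HELLO" as it might arrive from an EBCDIC system (cp037 is assumed).
ebcdic_bytes = "HELLO".encode("cp037")

# Reading the bytes with the wrong table gives unrelated characters.
misread = ebcdic_bytes.decode("latin-1")

# Converting first: decode with the sender's encoding, then re-encode
# in the encoding the receiving system expects.
ascii_bytes = ebcdic_bytes.decode("cp037").encode("ascii")

print(ebcdic_bytes.hex())   # c8c5d3d3d6
print(misread)              # accented letters, not HELLO
print(ascii_bytes)          # b'HELLO'
```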

N.B. All the encoding standards represent characters as numbers so that they can be stored as binary values. This means that the binary bit pattern which represents each character also represents a number. For example, the binary bit patterns for the letter A, shown above, also represent the numbers 65, 193 and 65 respectively. To the computer, the character is no different from the number. The difference is only made by the programs which access the data and must define the data as numeric or non-numeric.
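
The point that the program, not the data, decides between number and character can be shown in a few lines of Python. This is only a sketch; `cp037` is again an assumed stand-in for EBCDIC.

```python
raw = bytes([0b01000001])            # the ASCII bit pattern for A

as_number = raw[0]                   # the program treats the byte as numeric
as_character = raw.decode("ascii")   # the program treats the byte as text

# The same letter under the other two encodings, read back as numbers.
ebcdic_number = "A".encode("cp037")[0]
unicode_number = int.from_bytes("A".encode("utf-16-be"), "big")

print(as_number, as_character, ebcdic_number, unicode_number)  # 65 A 193 65
```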

ASCII

ASCII (pronounced “ask-ee”) stands for American Standard Code for Information Interchange.

ASCII is a code for representing characters as numbers inside the computer. The standard ASCII character set uses just 7 bits for each character, which allows each character to be assigned a number from 0 to 127. For example, the ASCII code for uppercase A is 65. There are several larger ASCII-based character sets (often called extended ASCII) that use 8 bits, which gives them 128 additional characters. The extra characters are used to represent non-English characters, graphics symbols, and mathematical symbols, and they differ from one extended set to another.
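
The 7-bit range, and the way the 8-bit extensions diverge above it, can be checked with a small Python sketch. The two extended sets used here, `latin-1` and `cp437`, are our choice of examples and are not named in this document.

```python
# Standard ASCII: every code lies in the range 0..127.
ascii_codes = list("Hello, World!".encode("ascii"))

# A character outside the 7-bit set cannot be encoded as standard ASCII.
try:
    "é".encode("ascii")
    fits_in_ascii = True
except UnicodeEncodeError:
    fits_in_ascii = False

# Two 8-bit extensions agree on 0..127 but assign byte 0xE9 differently.
byte = bytes([0xE9])
latin1_char = byte.decode("latin-1")   # a Western European extended set
cp437_char = byte.decode("cp437")      # the original IBM PC extended set

print(ord("A"), max(ascii_codes), fits_in_ascii, latin1_char, cp437_char)
```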

ASCII, or an encoding that extends it, is used by PCs, Unix machines and Apple Macs.

EBCDIC

EBCDIC (pronounced “eb-sih-dik”) stands for Extended Binary-Coded Decimal Interchange Code. EBCDIC is an IBM code for representing characters as numbers inside the computer and is based on an 8-bit byte.

EBCDIC was developed at a time when one of the main criteria for the character set was its ease of use with punched cards. Even though the days of punched cards are long gone, EBCDIC is still used on IBM mainframes, such as those running MVS, and mid-range systems, such as the AS/400, mainly for backward compatibility.

Unicode

Unicode provides a single unique number for every character, no matter which platform, language or program is being used. What this document calls true Unicode is the 16-bit representation of those numbers (formally UTF-16), in which most characters occupy one 16-bit unit and the less common ones occupy two. This means that if all systems were to adopt Unicode as their encoding standard, there would be no need for conversion. However, most systems have continued to use ASCII or EBCDIC, and Unicode is as yet mainly used by systems where a different language character set is required, such as Chinese or Arabic.

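As well as true Unicode based on 16 bits (2 bytes or octets), Unicode has several other encoding forms, such as UTF7 and UTF8 (formally UTF-7 and UTF-8). UTF8 stores each character in one to four 8-bit bytes, with ASCII characters taking a single byte. UTF7 represents the same characters using only 7-bit ASCII values.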
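
To make the difference between these forms concrete, the following Python sketch encodes three sample characters (our choice) in UTF8, 16-bit Unicode and UTF7, using the codecs that ship with Python.

```python
samples = ["A", "é", "€"]

for ch in samples:
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-be")
    utf7 = ch.encode("utf-7")
    print(f"U+{ord(ch):04X} utf-8={utf8.hex()} utf-16={utf16.hex()} utf-7={utf7!r}")

# UTF8 is variable length: the number of bytes grows with the code point.
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]
print(utf8_lengths)   # [1, 2, 3]
```

An ASCII character such as A is byte-for-byte identical in ASCII and UTF8, which is why plain English text often survives being read with either encoding.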

The first 128 Unicode characters are the same as the ASCII character set. In 16-bit Unicode, when a character is part of the ASCII character set, one byte holds the ASCII value and the other holds binary zero. In some systems, the character representation is held in the first byte while the second byte holds binary zero. This is called Little Endian Unicode (i.e. the bytes in each file character are low order first). In other systems, the representation is vice versa. This is called Big Endian Unicode (i.e. the bytes in each file character are high order first).
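
A minimal Python sketch of the two byte orders for the letter A, using the built-in `utf-16-le` and `utf-16-be` codecs. In the low-order-first form the byte 41 (hexadecimal for 65) comes first; in the high-order-first form the zero byte comes first.

```python
little = "A".encode("utf-16-le")   # low-order byte first
big = "A".encode("utf-16-be")      # high-order byte first

print(little.hex())  # 4100
print(big.hex())     # 0041

# Decoding with the wrong byte order yields a different character entirely.
wrong = little.decode("utf-16-be")
print(f"U+{ord(wrong):04X}")       # U+4100
```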

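Files encoded in Unicode, Big Endian Unicode and UTF8 often begin with a byte order mark, a short sequence in the first few bytes of the file that indicates how the file is encoded. The mark is optional, particularly in UTF8 files, so its absence does not prove that a file uses some other encoding.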
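
Checking for a byte order mark can be sketched in Python with the constants from the standard `codecs` module. The function name `sniff_bom` and the returned labels are ours; the sketch covers only the three formats named above, so it would, for example, mistake a 32-bit little-endian Unicode file for 16-bit Little Endian Unicode. It says nothing about how any DI product performs detection.

```python
import codecs

# Byte order marks for the three formats named above.
BOMS = [
    (codecs.BOM_UTF8, "UTF8"),                  # EF BB BF
    (codecs.BOM_UTF16_LE, "Unicode"),           # FF FE (little endian)
    (codecs.BOM_UTF16_BE, "Big Endian Unicode") # FE FF
]

def sniff_bom(data: bytes):
    """Return the format implied by a leading byte order mark, or None."""
    for mark, name in BOMS:
        if data.startswith(mark):
            return name
    return None  # no mark: the encoding must be determined some other way

print(sniff_bom(codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")))  # Unicode
print(sniff_bom(b"plain ASCII text"))                             # None
```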

Big and Little Endian encoding is only applicable to encodings whose units are larger than one byte (i.e. true Unicode, not UTF8 or single-byte encodings such as ASCII and EBCDIC). On a Windows system, most applications will expect Little Endian encoding. Mainframes will expect Big Endian encoding. Unix systems may use either, depending on the hardware they run on.

Whether a system is Big or Little Endian is inherent in that computer system: it describes the order in which the hardware stores the bytes of any multi-byte value, not just characters. Even if character representation is only based on 8 bits, the system is still referred to as either Big Endian or Little Endian.
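
The byte order of the machine at hand can be inspected from Python. The sketch below uses the standard `sys` and `struct` modules to store the 16-bit number 65 in each order and in the machine's own order.

```python
import struct
import sys

print(sys.byteorder)               # 'little' or 'big', depending on the machine

value = 65
little = struct.pack("<H", value)  # force low-order byte first
big = struct.pack(">H", value)     # force high-order byte first
native = struct.pack("=H", value)  # this machine's own byte order

print(little.hex(), big.hex(), native.hex())
```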

File encoding in DI products

DARWIN 3 and all members of the ODEX family support the use of both ASCII and EBCDIC encoding. ODEX Enterprise supports not only ASCII and EBCDIC but also Unicode and its variations. Uniquely, ODEX Enterprise can recognise which encoding it is dealing with.

In the case of DARWIN, messages received from specific trading partners (i.e. those, such as Ford, whose data is produced on mainframes) are expected in EBCDIC format and are therefore translated into ASCII on receipt. Messages for these trading partners are translated into EBCDIC before they are sent.

ODEX/MVS and ODEX/400 are both based on EBCDIC. All other ODEX members are based on ASCII.

In the case of all ODEX members apart from ODEX Enterprise, the user must configure the details for each trading partner to specify whether files to and from the trading partner are to be translated from and into EBCDIC. In the unlikely event that a trading partner normally using EBCDIC were to send a file in ASCII format, ODEX would in fact apply its EBCDIC-to-ASCII translation to a file that is already in ASCII, resulting in an unintelligible file.
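
The effect of translating a file that is already in ASCII can be imitated in Python. This is a sketch under assumptions: `cp037` stands in for the EBCDIC table, `latin-1` for the 8-bit ASCII-based output, and ODEX's own translation tables, which this document does not show, may differ.

```python
ascii_file = b"HELLO"   # the partner unexpectedly sent ASCII

# The configured EBCDIC-to-ASCII step reads the bytes as EBCDIC anyway.
translated = ascii_file.decode("cp037").encode("latin-1", errors="replace")

print(ascii_file)    # b'HELLO'
print(translated)    # same length, but no longer readable as HELLO
```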

ODEX Enterprise, however, can recognise the encoding of files sent in any of the following formats:

ASCII, Big Endian Unicode, EBCDIC, Unicode, UTF7, UTF8

Using the Workflow Manager, ODEX can be configured to translate files in any of these formats into another of these formats before passing the file on to a system requiring the translated file.
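
The idea of translating between any two of the listed formats can be sketched in Python as a decode step followed by an encode step. The function `transcode` and the `CODECS` mapping are hypothetical names of ours, the codec chosen for each format is an assumption, and none of this reflects how the Workflow Manager is actually configured.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Decode data from the source encoding and re-encode it in the target."""
    text = data.decode(source)
    return text.encode(target)

# Hypothetical mapping from the format names above to Python codec names.
CODECS = {
    "ASCII": "ascii",
    "Big Endian Unicode": "utf-16-be",
    "EBCDIC": "cp037",        # assumption: one EBCDIC code page among many
    "Unicode": "utf-16-le",
    "UTF7": "utf-7",
    "UTF8": "utf-8",
}

incoming = "INVOICE 42".encode(CODECS["EBCDIC"])
outgoing = transcode(incoming, CODECS["EBCDIC"], CODECS["UTF8"])
print(outgoing)   # b'INVOICE 42'
```

Going through decoded text rather than mapping bytes directly means one function covers every pair of formats, at the cost of failing when the source contains a character the target cannot represent.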

Copyright Data Interchange Plc 2010