2013-04-15

CRLF Newlines and Carriage Returns

One important thing to note is that Linux and Windows environments do not mark the end of a line in the same way.

Unix environments use LF (line feed, '\n', 0x0A, 10 in decimal). Windows environments use CR+LF, which is a carriage return ('\r', 0x0D, 13 in decimal) followed by a line feed. In some instances you will find CR+LF on unix as well, but generally, from what I have seen, unix only uses LF.

If you move a file whose lines were delimited in a Windows environment into a unix environment, you will see strange characters represented by ^M at the end of each line. This is because Windows ends each line with CR+LF and the CR was not stripped off when the file was moved. The line feeds are the same on both systems, but the extra CRs cause some incompatibilities and annoyances.

The Unicode standard defines several characters and sequences that are recognized as line terminators (listed further down), which may or may not be recognized properly by some utilities like "less" in unix.

Most modern text editors recognize all standards for new lines, but some programs may face incompatibilities when manipulating the text. To ensure you're not dealing with incompatible formats, it's a good idea to check a file for stray CRs before running unix utilities on it. This can be done with the cat utility, whose -v option prints each CR to stdout as ^M:
cat -v file.txt
or with hexdump, whose -c option prints each CR as \r:
hexdump -c file.txt
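
If you would rather get a number than eyeball the output, the same check can be scripted. This is only a minimal sketch: count_cr_lines is a made-up helper name, and it assumes grep treats the file as plain text (no NUL bytes in it).

```shell
# Hypothetical helper: count the lines of a file that end with a CR.
count_cr_lines() {
    # printf builds a literal CR; the trailing $ anchors it at end of line.
    # grep -c exits non-zero when the count is 0, so "|| true" keeps the
    # helper from looking like a failure in that case.
    grep -c "$(printf '\r')\$" "$1" || true
}
```

Usage would look like count_cr_lines file.txt, and a result of 0 suggests the file has no CR+LF line endings.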

The line terminators recognized by the Unicode standard are:
 LF:    Line Feed, U+000A (10 in decimal)
 VT:    Vertical Tab, U+000B (11 in decimal)
 FF:    Form Feed, U+000C (12 in decimal)
 CR:    Carriage Return, U+000D (13 in decimal)
 CR+LF: CR (U+000D) (13 in decimal) followed by LF (U+000A) (10 in decimal)
 NEL:   Next Line, U+0085 (133 in decimal)
 LS:    Line Separator, U+2028 (8232 in decimal)
 PS:    Paragraph Separator, U+2029 (8233 in decimal)
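
To see why some of these terminators go unnoticed by unix tools, here is a small sketch. count_lf_lines is a hypothetical helper name; it relies on wc -l counting only LF bytes, and it assumes the text is UTF-8 encoded, where U+2028 is stored as the bytes E2 80 A8.

```shell
# Hypothetical helper: count LF-terminated lines read from stdin.
count_lf_lines() {
    # wc -l counts LF bytes (0x0A) only; tr strips the padding that
    # some wc implementations put in front of the number.
    wc -l | tr -d ' '
}

# U+2028 LINE SEPARATOR in UTF-8 is E2 80 A8 (octal 342 200 250).
printf 'one\342\200\250two\n' | count_lf_lines     # prints 1

# VT (octal 013) and FF (octal 014) are not counted as line ends either.
printf 'one\013two\014three\n' | count_lf_lines    # prints 1
```

Both inputs contain extra Unicode line terminators, yet a tool that only looks for LF sees a single line.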

Converting DOS/Windows format to unix by removing CR
tr -d '\r' < inputfile > outputfile

If the files only have CR newlines, you can convert each CR to LF
tr '\r' '\n' < inputfile > outputfile
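
Going the other way, from unix to DOS/Windows format, cannot be done with tr alone, since tr maps one character to one character and cannot insert a CR in front of each LF. One possible sketch uses awk instead; to_dos is a hypothetical helper name, not a standard utility.

```shell
# Hypothetical helper: rewrite every line ending as CR+LF.
to_dos() {
    # sub() drops a CR that is already there, so CR+LF input is not doubled.
    # printf then writes the line followed by CR+LF.
    # Note: awk also terminates a final line that had no newline at all.
    awk '{ sub(/\r$/, ""); printf "%s\r\n", $0 }' "$@"
}
```

It can be used the same way as the tr commands above, for example to_dos < inputfile > outputfile.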

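Generally it is accepted that a newline ('\n') corresponds to the following bytes, written in octal, on each OS
Unix     : \n = \012
Macintosh: \n = \015     (classic Mac OS; Mac OS X and later use \012)
Windows  : \n = \012     if handled as ASCII (text mode, where CR+LF is translated to LF on read)
Windows  : \n = \015\012 if handled as binary (the bytes as stored on disk)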
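
The conventions above can be turned into a rough check of which one a file follows. This is a sketch under assumptions: detect_eol and the labels it prints (unix, dos, mac, mixed, none) are names made up for illustration, and it only counts CR and LF bytes without looking at the other Unicode terminators.

```shell
# Hypothetical helper: guess the newline convention of a file.
detect_eol() {
    # tr -cd keeps only the named byte; wc -c counts what is left.
    cr=$(tr -cd '\r' < "$1" | wc -c | tr -d ' ')
    lf=$(tr -cd '\n' < "$1" | wc -c | tr -d ' ')
    # Lines whose last character is CR, i.e. CR directly before the LF.
    crlf=$(grep -c "$(printf '\r')\$" "$1" || true)

    if [ "$cr" -eq 0 ] && [ "$lf" -eq 0 ]; then
        echo "none"      # no line terminators found
    elif [ "$cr" -eq 0 ]; then
        echo "unix"      # LF only (\012)
    elif [ "$lf" -eq 0 ]; then
        echo "mac"       # CR only (\015)
    elif [ "$cr" -eq "$lf" ] && [ "$crlf" -eq "$lf" ]; then
        echo "dos"       # every LF is preceded by a CR (\015\012)
    else
        echo "mixed"     # more than one convention in the same file
    fi
}
```

For example, detect_eol file.txt on a file saved by a Windows editor would be expected to print dos, which is a hint to run it through the tr command above first.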

References:
http://www.onlamp.com/pub/a/onlamp/2006/08/17/understanding-newlines.html?page=3
http://en.wikipedia.org/wiki/Newline
http://www.perlmonks.org/index.pl?node_id=68687
http://en.wikipedia.org/wiki/List_of_Unicode_characters
