Subject (Author)
* Handling DOS (Windows) text files in Unix (Linux) - a nifty extension lib. (Kenny McCormack)
`* Re: Handling DOS (Windows) text files in Unix (Linux) - a nifty extension lib. (Mack The Knife)
 +- Re: Handling DOS (Windows) text files in Unix (Linux) - a nifty (Ed Morton)
 `* Re: Handling DOS (Windows) text files in Unix (Linux) - a nifty (Janis Papanagnou)
  `* Re: Handling DOS (Windows) text files in Unix (Linux) - a nifty (Ed Morton)
   +- Re: Handling DOS (Windows) text files in Unix (Linux) - a nifty (J Naman)
   `- Re: Handling DOS (Windows) text files in Unix (Linux) - a nifty (Janis Papanagnou)

Handling DOS (Windows) text files in Unix (Linux) - a nifty extension lib.

<se3m1s$7qsa$1@news.xmission.com>

https://www.rocksolidbbs.com/devel/article-flat.php?id=830&group=comp.lang.awk#830
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!xmission!nnrp.xmission!.POSTED.shell.xmission.com!not-for-mail
From: gazelle@shell.xmission.com (Kenny McCormack)
Newsgroups: comp.lang.awk
Subject: Handling DOS (Windows) text files in Unix (Linux) - a nifty extension lib.
Date: Sat, 31 Jul 2021 14:17:32 -0000 (UTC)
Organization: The official candy of the new Millennium
Message-ID: <se3m1s$7qsa$1@news.xmission.com>
Injection-Date: Sat, 31 Jul 2021 14:17:32 -0000 (UTC)
Injection-Info: news.xmission.com; posting-host="shell.xmission.com:166.70.8.4";
logging-data="256906"; mail-complaints-to="abuse@xmission.com"
X-Newsreader: trn 4.0-test77 (Sep 1, 2010)
Originator: gazelle@shell.xmission.com (Kenny McCormack)
 by: Kenny McCormack - Sat, 31 Jul 2021 14:17 UTC

A frequent task of mine is to write GAWK scripts that process files created
by Windows users. Naturally, I prefer to work/develop on Linux, while they
are creating/editing the files on Windows. I access their files via a
Samba share.

This all works well, except that I periodically get bit by the fact that
DOS files have an extra character as the last character of the line (as
seen by a program running on Linux). One ends up writing AWK code to deal
with this, but it would be nice to not have to do so.

To this end, I have written a GAWK extension that removes the CRs from the
file. Source code is below. But first a few notes:

1) This was developed on Linux, and is written for version 1 of the GAWK API.
This is the version that the GAWK executables on most of my Linux
systems use. However, you can also compile it with Cygwin to run on a
Windows machine, using the Cygwin GAWK.EXE. Unfortunately, Cygwin
keeps GAWK up to date on their platform, so it is running GAWK 5.1 and
this uses API version 3. Luckily, this requires only one source code
change. Here is the diff between the Windows version and the Linux
version of the source code:

77c76
< { NULL, NULL, 0, 0, awk_false, NULL }
---
> { NULL, NULL, 0 }

1a) Also, if compiling on Windows, you should change the output filename
in the compile command from RemoveCRs.so to RemoveCRs.dll. Then you won't
have to specify a filename extension when loading the DLL into GAWK.EXE.

2) I think you could do this more simply just by setting RS to something
that allows for the CRs, but I prefer a more global approach. I dislike
mucking up the AWK source with this sort of thing. I also dislike changing
any of the builtin "control" variables (FS, RS, etc) if I can avoid it.

3) I believe that frequent c.l.a. poster Kaz has made some modifications to
the Cygwin DLLs to do this same sort of thing. That is, of course, another
option.

4) Finally, note that for completeness, you should do the corresponding
conversion on output, that is, convert LF to CRLF. But my experience is
that this isn't really needed, since most (but not all!) Windows programs
nowadays handle Unix-style line endings just fine. Notepad is, of course,
a noted exception.
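
A minimal sketch of that output-side conversion, assuming the text to be
written is already in one in-memory buffer. The name lf_to_crlf is made up
for illustration; it is not part of the GAWK API or of RemoveCRs.c, and
wiring it into an output wrapper is left out here.

```c
#include <stddef.h>

/* Hypothetical helper: copy n bytes from in to out, writing CRLF for
 * every bare LF. out must have room for 2 * n bytes.
 * Returns the number of bytes written to out. */
size_t lf_to_crlf(const char *in, size_t n, char *out)
{
    size_t w = 0;

    for (size_t i = 0; i < n; i++) {
        /* add a CR only when the LF is not already preceded by one */
        if (in[i] == '\n' && (i == 0 || in[i - 1] != '\r'))
            out[w++] = '\r';
        out[w++] = in[i];
    }
    return w;
}
```

Note that this only looks inside one buffer; if the data is written in
chunks, a CR at the end of one chunk and an LF at the start of the next
would need extra state to be recognized as a pair.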

--- Cut Here ---
/*
* RemoveCRs.c - A fixer for dealing with DOS formatted text files
* Compile command:
gcc -std=gnu99 -shared -s -W -Wall -Werror -fPIC -o RemoveCRs.so RemoveCRs.c
*/

#include <stdio.h>
#include <stddef.h>
#include <string.h>
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <alloca.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <signal.h>
#include <stdarg.h>

#include "gawkapi.h"

static const gawk_api_t *api; /* for convenience macros to work */
static awk_ext_id_t *ext_id;
static const char *ext_version = "RemoveCRs extension: version 1.0";
static awk_bool_t init_RemoveCRs(void);
static awk_bool_t (*init_func)(void) = init_RemoveCRs;

int plugin_is_GPL_compatible;

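static ssize_t RemoveCRs_read(int fd, void *buf, size_t nbytes)
{ static char *buffer;
    char *p = buf;
    ssize_t n;

    /* realloc() is kept out of assert() so it still runs under -DNDEBUG */
    char *tmp = realloc(buffer, nbytes);
    if (tmp == NULL)
        return -1;
    buffer = tmp;
    do { /* retry if a chunk held only CRs: returning 0 must mean real EOF */
        n = read(fd, buffer, nbytes);
        if (n < 0)
            return -1; /* report read errors instead of faking EOF */
        for (ssize_t i = 0; i < n; i++)
            if (buffer[i] != '\r')
                *p++ = buffer[i];
    } while (n > 0 && p == (char *) buf);
    return p - (char *) buf;
}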

/* RemoveCRs_can_take_file --- return true if we want the file */
static awk_bool_t
RemoveCRs_can_take_file(const awk_input_buf_t *iobuf)
{ return iobuf->fd != INVALID_HANDLE;
}

/* RemoveCRs_take_control_of --- set up input parser. */
static awk_bool_t
RemoveCRs_take_control_of(awk_input_buf_t *iobuf)
{ iobuf->read_func = RemoveCRs_read;
return awk_true;
}

static awk_input_parser_t RemoveCRs_parser = {
"RemoveCRs",
RemoveCRs_can_take_file,
RemoveCRs_take_control_of,
NULL
};

/* init_RemoveCRs --- set things up */

static awk_bool_t
init_RemoveCRs()
{ register_input_parser(&RemoveCRs_parser);
return awk_true;
}

static awk_ext_func_t func_table[] = {
{ NULL, NULL, 0 }
};

/* define the dl_load function using the boilerplate macro */

dl_load_func(func_table, RemoveCRs, "")
--- Cut Here ---

--
Faced with the choice between changing one's mind and proving that there is
no need to do so, almost everyone gets busy on the proof.

- John Kenneth Galbraith -

Re: Handling DOS (Windows) text files in Unix (Linux) - a nifty extension lib.

<se6asv$1rvb$1@gioia.aioe.org>

https://www.rocksolidbbs.com/devel/article-flat.php?id=831&group=comp.lang.awk#831
Path: i2pn2.org!i2pn.org!aioe.org!eXsOhAt/ESFYEBbn44DnpA.user.46.165.242.75.POSTED!not-for-mail
From: mack@the-knife.org (Mack The Knife)
Newsgroups: comp.lang.awk
Subject: Re: Handling DOS (Windows) text files in Unix (Linux) - a nifty extension lib.
Date: Sun, 1 Aug 2021 14:25:35 -0000 (UTC)
Organization: Knives, Incorporated
Message-ID: <se6asv$1rvb$1@gioia.aioe.org>
References: <se3m1s$7qsa$1@news.xmission.com>
Injection-Info: gioia.aioe.org; logging-data="61419"; posting-host="eXsOhAt/ESFYEBbn44DnpA.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
Originator: aharon@aharon-ThinkPad-E580.(none) (Aharon Robbins)
X-Newsreader: trn 4.0-test77 (Sep 1, 2010)
X-Notice: Filtered by postfilter v. 0.9.2
 by: Mack The Knife - Sun, 1 Aug 2021 14:25 UTC

This is a lot of work to do what

BEGIN { RS = "\r?\n" }

would do. Even simpler would be to put

tr -d '\r'

as a stage in your pipeline before calling gawk.

Mack

In article <se3m1s$7qsa$1@news.xmission.com>,
Kenny McCormack <gazelle@shell.xmission.com> wrote:
>A frequent task of mine is to write GAWK scripts that process files created
>by Windows users. Naturally, I prefer to work/develop on Linux, while they
>are creating/editing the files on Windows. I access their files via a
>Samba share.
>
>This all works well, except that I periodically get bit by the fact that
>DOS files have an extra character as the last character of the line (as
>seen by a program running on Linux). One ends up writing AWK code to deal
>with this, but it would be nice to not have to do so.
>
>To this end, I have written a GAWK extension that removes the CRs from the
>file. Source code is below.

Re: Handling DOS (Windows) text files in Unix (Linux) - a nifty extension lib.

<se6dos$dg0$1@dont-email.me>

https://www.rocksolidbbs.com/devel/article-flat.php?id=832&group=comp.lang.awk#832
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: mortonspam@gmail.com (Ed Morton)
Newsgroups: comp.lang.awk
Subject: Re: Handling DOS (Windows) text files in Unix (Linux) - a nifty
extension lib.
Date: Sun, 1 Aug 2021 10:14:35 -0500
Organization: A noiseless patient Spider
Lines: 93
Message-ID: <se6dos$dg0$1@dont-email.me>
References: <se3m1s$7qsa$1@news.xmission.com> <se6asv$1rvb$1@gioia.aioe.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Sun, 1 Aug 2021 15:14:36 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="f8abd3c42cf0080b81f685dde29a1ffb";
logging-data="13824"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+PLtWqBQbJAtjQjuDeaGFC"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.12.0
Cancel-Lock: sha1:rhS7Zk3OXYgfW1GXVnJKg7cfcvM=
In-Reply-To: <se6asv$1rvb$1@gioia.aioe.org>
X-Antivirus-Status: Clean
Content-Language: en-US
X-Antivirus: Avast (VPS 210801-4, 8/1/2021), Outbound message
 by: Ed Morton - Sun, 1 Aug 2021 15:14 UTC

On 8/1/2021 9:25 AM, Mack The Knife wrote:
> This is a lot of work to do what
>
> BEGIN { RS = "\r?\n" }
>
> would do. Even simpler would be to put
>
> tr -d '\r'
>
> as a stage in your pipeline before calling gawk.

Those would both break files that use `\n` within quoted fields and
`\r\n` record endings such as you'd get in CSVs exported from Excel when
there are cells with newlines within the spreadsheet, e.g. (assume `\r`
and `\n` are literal):

Right:

$ printf '"foo\nbar"\r\n' | awk 'BEGIN{RS="\r\n"; FS=","} {print NR, $1}'
1 "foo
bar"

Wrong:

$ printf '"foo\nbar"\r\n' | awk 'BEGIN{RS="\r?\n"; FS=","} {print NR, $1}'
1 "foo
2 bar"

The `tr` would also break files that include `\r`s mid-record:

Right:

$ printf '"foo\rbar"\r\n' | awk 'BEGIN{RS="\r\n"; FS=","} {print NR, $1}' | cat -Ev
1 "foo^Mbar"$

Wrong:

$ printf '"foo\rbar"\r\n' | tr -d '\r' | awk 'BEGIN{RS="\r\n"; FS=","} {print NR, $1}' | cat -Ev
1 "foobar"$

You can't robustly tell by reading a file whether it uses DOS line endings
or not. For example, is this:

$ printf 'foo\nbar\r\n' | cat -Ev
foo$
bar^M$

1 line using DOS line endings where no line contains `\r`:

$ printf 'foo\nbar\r\n' | awk -v RS='\r\n' '{print NR, $0}' | cat -Ev
1 foo$
bar$

or 2 lines using Unix line endings where the 2nd line ends in `\r`?

$ printf 'foo\nbar\r\n' | awk -v RS='\n' '{print NR, $0}' | cat -Ev
1 foo$
2 bar^M$

It's impossible to tell from the data, you need to KNOW the format in
advance to be able to parse it correctly.

So, just figure out what your record endings are expected to be (`\r\n`
or `\n`) in advance, based on some criterion that doesn't involve reading
the file (e.g. wherever you're getting the input from), and then use the
appropriate RS to parse it robustly.
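
To make the ambiguity concrete outside of awk, here is a rough C sketch
that counts the records in an in-memory buffer for a given literal
separator. count_records is a made-up name, not anything from gawk, and
unlike RS it treats the separator as a plain string rather than a regex.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count the records in buf[0..n) when rs is the
 * literal record separator. A trailing chunk with no separator after
 * it counts as one more record. Returns 0 for an empty separator. */
size_t count_records(const char *buf, size_t n, const char *rs)
{
    size_t rslen = strlen(rs);
    size_t count = 0, i = 0, start = 0;

    if (rslen == 0)
        return 0; /* an empty separator would never advance */
    while (i + rslen <= n) {
        if (memcmp(buf + i, rs, rslen) == 0) {
            count++;
            i += rslen;
            start = i; /* next record begins after the separator */
        } else {
            i++;
        }
    }
    if (start < n)
        count++; /* unterminated final record */
    return count;
}
```

For the 9 bytes of 'foo\nbar\r\n' this gives 1 with "\r\n" and 2 with
"\n", the same split as the two awk runs above. Nothing in the bytes
themselves says which answer is the intended one.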

Regards,

Ed.


Re: Handling DOS (Windows) text files in Unix (Linux) - a nifty extension lib.

<se6ig7$9pi$1@news-1.m-online.net>

https://www.rocksolidbbs.com/devel/article-flat.php?id=834&group=comp.lang.awk#834
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!4.us.feeder.erje.net!3.eu.feeder.erje.net!feeder.erje.net!news.in-chemnitz.de!news2.arglkargh.de!news.karotte.org!news.space.net!news.m-online.net!.POSTED!not-for-mail
From: janis_papanagnou@hotmail.com (Janis Papanagnou)
Newsgroups: comp.lang.awk
Subject: Re: Handling DOS (Windows) text files in Unix (Linux) - a nifty
extension lib.
Date: Sun, 1 Aug 2021 18:35:19 +0200
Organization: (posted via) M-net Telekommunikations GmbH
Lines: 39
Message-ID: <se6ig7$9pi$1@news-1.m-online.net>
References: <se3m1s$7qsa$1@news.xmission.com> <se6asv$1rvb$1@gioia.aioe.org>
NNTP-Posting-Host: 2001:a61:241e:cc01:a824:5569:723f:e9df
Mime-Version: 1.0
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: 7bit
X-Trace: news-1.m-online.net 1627835719 10034 2001:a61:241e:cc01:a824:5569:723f:e9df (1 Aug 2021 16:35:19 GMT)
X-Complaints-To: news@news-1.m-online.net
NNTP-Posting-Date: Sun, 1 Aug 2021 16:35:19 +0000 (UTC)
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101
Thunderbird/45.8.0
X-Enigmail-Draft-Status: N1110
In-Reply-To: <se6asv$1rvb$1@gioia.aioe.org>
 by: Janis Papanagnou - Sun, 1 Aug 2021 16:35 UTC

On 01.08.2021 16:25, Mack The Knife wrote:
> This is a lot of work to do what
>
> BEGIN { RS = "\r?\n" }
>
> would do. [...]

That's what I'd also do. It's simple and covers the DOS and the
Unix/(new-)Mac case. In a text file I'd expect CR and/or NL as
line termination character, and all other payload data should
be plain text (with the exception of the TAB control character).

The OP's 'C'-code seems to just remove all CRs from anywhere in
the file, so I also don't see Ed's (IMO non-text) Excel-export
case as comparable with the original question in the thread.

Your RS approach considers even the NL context which the C-code
does not.
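
For comparison, a minimal sketch of what a context-aware variant of the
C filter could look like: it drops a CR only when an LF follows directly.
The function name is invented for illustration, and it assumes the whole
text is in one buffer; a real read hook would also have to carry a
trailing CR over to the next chunk.

```c
#include <stddef.h>

/* Hypothetical helper: remove each CR that is immediately followed by
 * an LF, in place. Any other CR is kept. Returns the new length. */
size_t strip_cr_before_lf(char *buf, size_t n)
{
    size_t w = 0;

    for (size_t i = 0; i < n; i++) {
        if (buf[i] == '\r' && i + 1 < n && buf[i + 1] == '\n')
            continue; /* skip the CR of a CR LF pair */
        buf[w++] = buf[i];
    }
    return w;
}
```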

But to be honest, I may as well just be missing the OP's point.
In which cases it may make sense to use a separate module with a
lot of C-code that needs to be (pre-)compiled and some specific
mechanism to load it isn't obvious to me. (The "global approach"
argument, beyond being personal taste, isn't very clear either.)

And the mentioned reverse conversion is as simply done in plain
Awk as the input conversion is.

Janis

N.B.: A case where I had written C-code for a similar task was
for an environment where several people wrote and extended text
files from different OSes; the files contained every mixture of
CR, LF, and CR LF, some had no final line ending, and so on; a
real mess. The program checked files and/or fixed them for
any of the three common formats: (old-)Mac, Unix/(new-)Mac, and
DOS, as desired. - Maybe such a more universal function would be
useful for the GNU Awk extension library?
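
A rough sketch of the core of such a normalizer, here only for the
"convert everything to Unix" direction. This is not the program
mentioned above; the name normalize_to_lf is made up, the input is
assumed to be one in-memory buffer, and a missing final line ending is
deliberately left alone because the buffer cannot grow in place.

```c
#include <stddef.h>

/* Hypothetical helper: rewrite buf[0..n) in place so that CR LF, a
 * lone CR, and a lone LF all become a single LF. Returns the new
 * length, which is never larger than n. */
size_t normalize_to_lf(char *buf, size_t n)
{
    size_t w = 0;

    for (size_t i = 0; i < n; i++) {
        if (buf[i] == '\r') {
            if (i + 1 < n && buf[i + 1] == '\n')
                i++;         /* CR LF: consume the LF as well */
            buf[w++] = '\n'; /* lone CR or CR LF becomes one LF */
        } else {
            buf[w++] = buf[i];
        }
    }
    return w;
}
```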

Re: Handling DOS (Windows) text files in Unix (Linux) - a nifty extension lib.

<se6pda$vic$1@dont-email.me>

https://www.rocksolidbbs.com/devel/article-flat.php?id=836&group=comp.lang.awk#836
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: mortonspam@gmail.com (Ed Morton)
Newsgroups: comp.lang.awk
Subject: Re: Handling DOS (Windows) text files in Unix (Linux) - a nifty
extension lib.
Date: Sun, 1 Aug 2021 13:33:13 -0500
Organization: A noiseless patient Spider
Lines: 68
Message-ID: <se6pda$vic$1@dont-email.me>
References: <se3m1s$7qsa$1@news.xmission.com> <se6asv$1rvb$1@gioia.aioe.org>
<se6ig7$9pi$1@news-1.m-online.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Sun, 1 Aug 2021 18:33:14 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="f8abd3c42cf0080b81f685dde29a1ffb";
logging-data="32332"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19Q2hf4ofvsBDY/qPo8BkyD"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.12.0
Cancel-Lock: sha1:1P26jhg8Xtyo0PDMRU8uOcGH5UM=
In-Reply-To: <se6ig7$9pi$1@news-1.m-online.net>
X-Antivirus-Status: Clean
Content-Language: en-US
X-Antivirus: Avast (VPS 210801-4, 8/1/2021), Outbound message
 by: Ed Morton - Sun, 1 Aug 2021 18:33 UTC

On 8/1/2021 11:35 AM, Janis Papanagnou wrote:
> On 01.08.2021 16:25, Mack The Knife wrote:
>> This is a lot of work to do what
>>
>> BEGIN { RS = "\r?\n" }
>>
>> would do. [...]
>
> That's what I'd also do. It's simple and covers the DOS and the
> Unix/(new-)Mac case. In a text file I'd expect CR and/or NL as
> line termination character, and all other payload data should
> be plain text (with the exception of the TAB control character).

Here's the POSIX definition of a text file:

https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_403
"Text File = A file that contains characters organized into zero or more
lines. The lines do not contain NUL characters and none can exceed
{LINE_MAX} bytes in length, including the <newline> character."

https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_206
"Line = A sequence of zero or more non- <newline> characters plus a
terminating <newline> character."

So a file that has CR as the line termination character is not a valid
text file per POSIX but a file that has NL as the line termination
character and contains CRs (be they immediately before the NLs or not)
is a valid text file. Not sure why you mention the tab character as it's
no different from X or # or any other character.
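
As a rough illustration of those two definitions, here is a small C
check written for this post. The function name is invented, the line
limit is passed in as a parameter instead of being read from the
system's {LINE_MAX}, and it tests bytes rather than decoded characters,
so treat it as a sketch, not a conformance test.

```c
#include <stddef.h>

/* Hypothetical helper: return 1 if buf[0..n) fits the quoted POSIX
 * text-file definition, else 0. Rules applied: no NUL bytes, no line
 * longer than line_max bytes including its newline, and every line
 * ends in a newline. An empty buffer (zero lines) passes. */
int is_posix_text(const char *buf, size_t n, size_t line_max)
{
    size_t line_len = 0;

    for (size_t i = 0; i < n; i++) {
        if (buf[i] == '\0')
            return 0;       /* lines must not contain NUL */
        line_len++;         /* the length includes the newline */
        if (line_len > line_max)
            return 0;
        if (buf[i] == '\n')
            line_len = 0;   /* line complete, start the next one */
    }
    return line_len == 0;   /* a leftover partial line is not a line */
}
```

With a limit of 2048, 'foo\rbar\r\n' passes (CR is just another
character) while 'foo\r' fails, since it has no terminating newline.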

>
> The OP's 'C'-code seems to just remove all CRs from anywhere in
> the file, so I also don't see Ed's (IMO non-text) Excel-export
> case as comparable with the original question in the thread.

The examples I posted were just plain text per the POSIX definition
above. I frequently have to write awk scripts to deal with CSVs exported
from Excel with lines that end in CRNL and can contain NL in quoted
fields. I don't know if any of those files have had CRs as I certainly
haven't assumed they can't be present or otherwise special-cased them.

Ed.


Re: Handling DOS (Windows) text files in Unix (Linux) - a nifty extension lib.

<c532055c-e4e3-44b2-9541-f7219c6a542en@googlegroups.com>

https://www.rocksolidbbs.com/devel/article-flat.php?id=838&group=comp.lang.awk#838
X-Received: by 2002:a05:620a:14b7:: with SMTP id x23mr12854013qkj.387.1627849788983;
Sun, 01 Aug 2021 13:29:48 -0700 (PDT)
X-Received: by 2002:a25:14c2:: with SMTP id 185mr305389ybu.374.1627849788767;
Sun, 01 Aug 2021 13:29:48 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.awk
Date: Sun, 1 Aug 2021 13:29:48 -0700 (PDT)
In-Reply-To: <se6pda$vic$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=96.255.253.97; posting-account=BcR7vAoAAABY9YgIIYIhD68t7wwjMvJW
NNTP-Posting-Host: 96.255.253.97
References: <se3m1s$7qsa$1@news.xmission.com> <se6asv$1rvb$1@gioia.aioe.org>
<se6ig7$9pi$1@news-1.m-online.net> <se6pda$vic$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <c532055c-e4e3-44b2-9541-f7219c6a542en@googlegroups.com>
Subject: Re: Handling DOS (Windows) text files in Unix (Linux) - a nifty
extension lib.
From: jnaman2@gmail.com (J Naman)
Injection-Date: Sun, 01 Aug 2021 20:29:48 +0000
Content-Type: text/plain; charset="UTF-8"
 by: J Naman - Sun, 1 Aug 2021 20:29 UTC

I receive downloaded CSV files (from financial web sites, e.g. Fidelity.com) with terminating ASCII form feeds, so I use:
BEGIN { RS = @/[\r\n\f]/; } # I assume "[\n\f\r]" is portable; "\n|\f|\r" works too
Doesn't cover Ed's Excel special cases.

Re: Handling DOS (Windows) text files in Unix (Linux) - a nifty extension lib.

<se8p6o$toc$1@news-1.m-online.net>

https://www.rocksolidbbs.com/devel/article-flat.php?id=840&group=comp.lang.awk#840
Path: i2pn2.org!i2pn.org!news.swapon.de!news.in-chemnitz.de!news2.arglkargh.de!news.karotte.org!news.space.net!news.m-online.net!.POSTED!not-for-mail
From: janis_papanagnou@hotmail.com (Janis Papanagnou)
Newsgroups: comp.lang.awk
Subject: Re: Handling DOS (Windows) text files in Unix (Linux) - a nifty
extension lib.
Date: Mon, 2 Aug 2021 14:42:00 +0200
Organization: (posted via) M-net Telekommunikations GmbH
Lines: 55
Message-ID: <se8p6o$toc$1@news-1.m-online.net>
References: <se3m1s$7qsa$1@news.xmission.com> <se6asv$1rvb$1@gioia.aioe.org>
<se6ig7$9pi$1@news-1.m-online.net> <se6pda$vic$1@dont-email.me>
NNTP-Posting-Host: 2001:a61:241e:cc01:e93b:85e7:d84c:3c9b
Mime-Version: 1.0
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: 7bit
X-Trace: news-1.m-online.net 1627908120 30476 2001:a61:241e:cc01:e93b:85e7:d84c:3c9b (2 Aug 2021 12:42:00 GMT)
X-Complaints-To: news@news-1.m-online.net
NNTP-Posting-Date: Mon, 2 Aug 2021 12:42:00 +0000 (UTC)
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101
Thunderbird/45.8.0
X-Enigmail-Draft-Status: N1110
In-Reply-To: <se6pda$vic$1@dont-email.me>
 by: Janis Papanagnou - Mon, 2 Aug 2021 12:42 UTC

On 01.08.2021 20:33, Ed Morton wrote:
>>
>> [...] In a text file I'd expect CR and/or NL as
>> line termination character, and all other payload data should
>> be plain text (with the exception of the TAB control character).
>
> Here's the POSIX definition of a text file:
>
> https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_403
>
> "Text File = A file that contains characters organized into zero or more
> lines. The lines do not contain NUL characters and none can exceed
> {LINE_MAX} bytes in length, including the <newline> character."
>
> https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_206
>
> "Line = A sequence of zero or more non- <newline> characters plus a
> terminating <newline> character."
>
> So a file that has CR as the line termination character is not a valid
> text file per POSIX but a file that has NL as the line termination
> character and contains CRs (be they immediately before the NLs or not)
> is a valid text file. Not sure why you mention the tab character as it's
> no different from X or # or any other character.

Last question first: I mentioned the Tab because, besides the Blank, it
is a common white-space character for separating text in text files.
I mentioned it separately because it is a control character (in many
ways); one of them is that it is interpreted, whereas an 'X' or '#' is
just displayed as the readable character it is.

The English Wikipedia (-> "ASCII") says about the control codes:
"codes originally intended not to represent printable information,
but rather to control devices".

I had searched in the past for a general definition of a text file
but don't recall having found one. The German Wikipedia page shows
a couple of relevant aspects. It says that a text file contains
representable (~printable) characters that can be organized by control
characters such as those for line breaks and page breaks. A characteristic
is that it is readable without specific tools, through simple text editors.

Excel's use of NL/CR for line structuring and substructures
is no different from using ASCII FS, GS, RS, or US; each is a control
code that needs interpretation. NL/CR/CR-NL, by contrast, is at least
uniquely defined for the same purpose on the respective platforms.

A text file definition that considers STX, ACK, DC1, NAK, SYN, ETB,
etc. in a file to still constitute a "text file" is not well suited
for the purposes where I had been talking about such files during my
long IT time. (There may be contexts where it makes sense, or maybe
not.)

Janis
