Hacker News
Internet Standard #80: ASCII Format for Network Interchange (1969) (ietf.org)
56 points by 1vuio0pswjnm7 on Aug 6, 2023 | 39 comments


Although I use HTML every day, most personal documentation I save is not in HTML format. It is in ASCII.

I have some text files that can grow quite large, and one trick I use is to keep a less history file to help me navigate. For example, I can save searches and marks in a history file. The format is simple.

   .less-history-file
   .search
   "term1
   "term2
   .mark
    m a 1 46248 1.fifo
    m b 1 11509 1.fifo
Then I use a small shell script to read the text file. Something like

   #!/bin/sh

   # 1.fifo is assumed to be a named pipe created beforehand (e.g. mkfifo 1.fifo);
   # decompress into it in the background...
   (zstd -dc /path/file.txt.zst > 1.fifo &)
   # ...and read it with less, restoring the saved searches and marks
   LESSHISTSIZE=999999999999999999999999 \
   LESSHISTFILE=/path/.lesshst.file.txt.zst \
   exec less -G --save-marks --no-histdups 1.fifo
Another approach might be to use an editor with macros, like ex/vi, as the pager.

Sometimes I see people converting text files into HTML, e.g., RFCs or manpages, but in these documents there are often no "hyperlinks", so the HTML at best only adds enhanced appearance and possibly the ability to jump around within the document, at the cost of making them significantly larger.

Instead of using markup, storing navigation information in the file itself, and using an HTML reader, this approach stores that information in a small, separate, associated text file.


Some clarification:

ASCII is not a format, it's an encoding. You are referring to "plain text" vs. HTML. For the rest of what you've written, this feels like a million times more complicated than using markdown.


"You are referring to "plain text" vs. HTML."

The term "ASCII format" in the submission title comes from the IETF Internet Standard, not me. Contact the author Vint Cerf or the IETF with objections. Not much I can do to change an Internet Standard.

In this comment, I never used the term "ASCII format". I used the term "HTML format".

What is "markdown". Word play on the term "markup".


You are misunderstanding:

1. ASCII simply defines a byte encoding to represent characters. E.g. UTF-8 is another character encoding. HTML files are often written in ASCII.

2. Markdown is a very common standard for representing minimal formatting in a plain text file: https://en.m.wikipedia.org/wiki/Markdown


Thanks, I understand now.

For personal documentation, I prefer files of MIME type "text/plain" rather than "text/html". For text encoding, I prefer 7-bit ASCII. I do most reading in textmode. I generally dislike use of UTF-8. I dislike use of ANSI escape sequences to enhance text. These are my preferences, not necessarily anyone else's.

Markdown has extra garbage in the file. No different than tags in HTML. RFC 7763 describes markdown as "lightweight markup".

In the comment I am describing keeping the extra garbage in a separate file. I have thought, though, about storing the contents of the less history file at the top of the document, sort of like a table of contents, then extracting it to a less history file when I want to read the document with less.
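
A minimal sketch of that idea, assuming a hypothetical layout in which the document starts with the history block and a line containing only "%%" separates it from the body:

   #!/bin/sh
   # hypothetical layout: embedded less-history block, a "%%" separator line,
   # then the actual document text
   doc=/path/file.txt
   sed -n '1,/^%%$/p' "$doc" | sed '$d' > /path/.lesshst.extracted
   sed '1,/^%%$/d' "$doc" > /path/body.txt
   LESSHISTFILE=/path/.lesshst.extracted exec less -G --save-marks /path/body.txt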


How do you guarantee that file is ascii? I would think it rapidly would become non-ascii for most users, even for English speakers. You can’t write café in ascii, for example, or use curly quotes, or write a pound, euro, or yen sign.
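
For instance, the é in café takes two bytes in UTF-8, each with the high-order bit set, so it can't appear in a pure 7-bit ASCII file; a quick check (assuming a UTF-8 locale and shell):

   $ printf 'café' | od -An -tx1
    63 61 66 c3 a9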


"How do you guarantee that the file is ascii."

I use an HTML reader set to 7-bit ASCII. If necessary I replace or eliminate non-ASCII characters I do not want. I often use flex or tr for this task.
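
For example, one way to do this with tr, keeping only tab, LF, CR, and printable ASCII (file names are placeholders):

   LC_ALL=C tr -cd '\11\12\15\40-\176' < input.txt > ascii-only.txt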

In the situation described in the comment, I'm not sharing these files with anyone else. I'm reading these files myself. As such, other users' preferences are irrelevant.

The submission is about ASCII for "network interchange". But the comment replied to here is not necessarily about ASCII for network interchange. It's about a method used by yours truly for reading large ASCII files.


s/characters/bytes/


I'm not the parent, but I personally can't think of a single instance where I would write "café" instead of "cafe". I can't imagine needing to use a pound, euro, or yen sign either. I'd probably just write pound, euro, or yen if I needed to talk about those currencies.

It's a pain to use any of those characters on every device I use, so I don't see how any of them could get into the file if I were to adopt this.


Most people write only the characters on their QWERTY keyboards.

cafe, resume, fiance, etc


But are you gonna generate 100% of such files using only your typing? What about Input Method Editors or copy-pasting?


Of course the term "Internet" wouldn't be used for another decade (and deployed even later), but the rfc-editor's index is more recent than that.

There are some implications of RFC 20 right in the abstract that are obscure today.

> use of standard 7-bit ASCII embedded in an 8 bit byte whose high order bit is always 0.

This was very important because character sets had still not been standardized back then, though newer non-IBM systems tended to choose ASCII by default. Even then, since byte and word lengths varied, multiple character sets were common even on a single machine (e.g. both ASCII and SIXBIT), so specifying the 8-bit byte (an IBMism, IIRC) was necessary. 36-bit words were quite popular in research machines, and back in '69 I think all the arpanet hosts were 36-bit PDP-10s, which supported bytes of width 1-36 bits. You can still see remnants of these machines in some older protocols.

As a consequence, the FTP protocol had a "binary mode" and a "text mode" because you can't safely transfer binary data from machines with incompatible word sizes and endianness. Text mode guaranteed a stream of seven bit characters in the correct order. If you specified the wrong transfer mode you usually ended up with gibberish.

> SRI uses "." (ASCII X'2E' or 2/14) as the end-of-line character, where as UCLA uses X'OD' or 0/13 (carriage return).

You can see that this issue also goes way back.

So if you think things were better before the confusion of utf-8 vs other representations, or were better before Windows code pages, well, they weren't.


> the FTP protocol had a "binary mode" and a "text mode" because you can't safely transfer binary data from machines with incompatible word sizes and endianness

I'm not sure I follow you here. The TYPE modes are ASCII, IMAGE, EBCDIC and Local byte size; in current FTP clients, the "binary mode" is Image and the "text mode" is ASCII.

The IMAGE mode is the one that actually transfers files as they really are, without any changes. The other three types are used for character conversion, with an additional parameter to specify Telnet or ASA conversion, and even a change of byte size in the case of L.

In the end, the "text mode" is the one that will tend to corrupt your file (thanks to the CR/LF, LF/CR, LF discrepancy between the three major desktop OSes).

Always choose TYPE I (Image) by default!
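
For reference, the corresponding raw commands from RFC 959 look roughly like this:

   TYPE I      image ("binary"): bytes pass through unchanged
   TYPE A N    ASCII Non-print: line endings become CRLF on the wire
   TYPE E N    EBCDIC Non-print
   TYPE L 8    local type with a byte size of 8 bits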


Thanks for the reminder — it’s been decades since I thought about FTP, much less used it.

While we’re at it, let’s not forget that it was originally also the transport for network mail!


I think SMTP would be a more relevant example.


Does SMTP have a mode other than text?


SMTP generally assumes you only have 7-bit transfer. You have to negotiate a protocol extension (8BITMIME) to send bytes with a leading 1.

https://datatracker.ietf.org/doc/html/rfc6152
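
A minimal sketch of that negotiation per RFC 6152 (hostnames and addresses are made up):

   C: EHLO client.example.com
   S: 250-mail.example.org Hello
   S: 250 8BITMIME
   C: MAIL FROM:<alice@example.com> BODY=8BITMIME
   S: 250 OK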


> 36-bit PDP-10s, which supported bytes of width 1-36 bits.

I thought they only supported (some) elements of bit sizes that divided 36, but http://www.hakmem.org/pdp-10.html#Byte taught me:

“In the PDP-10 a "byte" is some number of contiguous bits within one word. A byte pointer is a quantity (which occupies a whole word) which describes the location of a byte. There are three parts to the description of a byte: the word (i.e., address) in which the byte occurs, the position of the byte within the word, and the length of the byte”
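
From memory (so treat the exact field widths as an assumption rather than gospel), that byte pointer word is laid out roughly as:

   | P (6 bits) | S (6 bits) | unused | I | X (4 bits) | Y (18 bits) |

where P is the byte's position from the right edge of the word, S is its size in bits, and I/X/Y are the usual indirect/index/address fields.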


https://www.sensitiveresearch.com/Archive/CharCodeHist/#NOTE...

> From Eric Fischer: You may also be interested to know that RFC 20 […] is, aside from minor typos and the opening paragraph, essentially a word-for-word copy of the X3.4-1968 standard, presumably without the knowledge or consent of ANSI.


I really like the formatting of the RFCs, both the text and HTML ones.

Does anyone know what tool or formatting machinery is used for this?

I'd like to keep personal notes in this format.


The IETF has some tools for this. A quick search revealed https://github.com/ietf-tools/ietf-at

See also https://www.rfc-editor.org/rse/format-faq/



Please refer to 'rendering and converting' at https://authors.ietf.org/choosing-a-format-and-tools


It seems as though ASCII was completed in 1963 according to http://edition.cnn.com/TECH/computing/9907/06/1963.idg/, but this Internet Standard (née RFC), I suppose, is saying to use it for "networks".


Not exactly. The first edition from 1963 only specified a subset of modern ASCII.

https://www.sensitiveresearch.com/Archive/CharCodeHist/X3.4-...

Columns 6 and 7 (now the lowercase range) were mostly unassigned.

Obviously, lowercase characters didn't exist in 1963, they were invented later and that's why they added them later. The version from 1968 is basically what we consider to be ASCII. /s

From what I remember, I think this was for compatibility with old teletype(?) hardware, but those reservations went out of the window pretty fast.

https://en.wikipedia.org/wiki/ASCII#History


https://www.sensitiveresearch.com/Archive/CharCodeHist/X3.4-... has the full 1963 standard (12 jpg’s)

I find it interesting to read the section on considerations.

They, for example, mention that COBOL is supported but ALGOL isn't, and explain how the standard could be extended for use for "European alphabets" (they could use the 5 values after Z and the one before A for additional letters) or "base 12 numeric digits" (to support pre-decimalization British monetary values).


> how the standard could be extended for use for “European alphabets” (they could use the 5 values after Z and the one before A for additional letters)

https://en.wikipedia.org/wiki/ASCII#7-bit_codes

They not only could, they actually did. Now you know why C has trigraphs.


> Obviously, lowercase characters didn't exist in 1963, they were invented later and that's why they added them later.

Lowercase letters were invented in 1967 and later backported to cursive writing for use in hospitals to thwart Chinese medical espionage.


ASCII's final release being late is why IBM S/360 and derived mainframes use EBCDIC: IBM planned on replacing it with ASCII, but the standard wasn't completed and firm yet when they needed it.


this is one of the early rfcs where it would be most desirable to see a scan of the original document



this is wonderful, thank you

an unexpected thing about this scan is that it seems to literally be a scan of the ansi standard for ascii

like on page 4 it says 'usas x3.4-1968, revision of x3.4-1967' and the page headers say 'x3.4, usa standard code for information interchange'

it seems clear that the only part by vint cerf is the first page

so the internet was founded on piracy and copyright infringement from the very beginning


The IETF also hijacked the OID 1.3.6.1

ISO and the ITU agreed a system of OIDs (Object Identifiers). Such identifiers are needed in various systems which want some way to uniquely identify arbitrary things in a hierarchy, such as the fields in the X.509 certificates for your HTTPS servers, nodes in your SNMP system, and DICOM medical image data.

To mint new OIDs you need a parent OID from which you can just branch off children. So, for example, if you owned 5.6.7.8 then you're entitled to make 5.6.7.8.9 but also 5.6.7.8.1, 5.6.7.8.2, 5.6.7.8.4261 and so on, and they in turn can have children of their own; maybe 5.6.7.8.4261.8302048.1.384823.5 ends up existing. But how can you get such an arc in the first place? ISO and ITU issued themselves the initial numbers 0, 1 and 2, and they in turn issued some prominent international entities their own arcs, e.g. the US Department of Defense.

The IETF understandably wanted to issue OIDs, but in the early days it wasn't obvious who should issue an arc to them. So, some authors wrote an RFC which just "presumes" the US Department of Defense (owner of the 1.3.6 arc) will issue 1.3.6.1 to the IETF. No such formal issuance occurred, but too late: there's an IETF RFC which says the DoD is going to give the IETF the 1.3.6.1 arc, so everybody writing RFCs which need OIDs just uses ones from the 1.3.6.1 hierarchy. A successful namespace hijacking is about numbers: if most people who have an opinion think you own 1.3.6.1 then you do; if not, you don't.
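
To make that concrete, a few well-known arcs that ended up under the presumed 1.3.6.1 (iso.org.dod.internet) arc:

   1.3.6.1.2.1      mgmt.mib-2 (the standard SNMP MIB)
   1.3.6.1.4.1      private.enterprise (IANA Private Enterprise Numbers)
   1.3.6.1.5.5.7    security.mechanisms.pkix (X.509/PKIX extensions)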

Even if today the US DoD actually said "No, we decided you can't have it", that wouldn't be effective; way more people care what the IETF thinks about this.


probably it would be better to use 48 bits of a sha-256 of a specification document to identify a namespace within which your oids are defined, then serial numbers plus a pointer to that namespace identifier for the individual oids. in the usual case where you have less than 15 oid namespaces in an snmp packet or whatever the fuck, the namespace identifier in a particular oid eats just four bits

48 bits is short enough that intentional collisions are eminently feasible but long enough that unintentional collisions are vanishingly unlikely, and as you point out, the whole way the space works depends on everyone trying to avoid collisions anyway
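
a minimal sketch of deriving such a 48-bit namespace identifier with standard tools (the spec file name is made up):

   #!/bin/sh
   # 48 bits = the first 12 hex digits of the sha-256 of the spec document
   ns=$(sha256sum my-namespace-spec.txt | cut -c1-12)
   echo "namespace id: $ns"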



> it seems to literally be a scan of the ansi standard for ascii

More than "seems". Page 1 says it is ("copies from USAS X3,4-1968").


This also sets the standard for the names of characters. (Parentheses are round.) [Brackets are square.] {And braces are 'curly.'} There are no 'round braces', and to say 'curly braces' is redundant.


I also use <angle brackets> to differentiate [square brackets].




