Introduction#

What is Plain Text?#

For a video talk about plain text please visit Dylan Beattie: Plain Text @ NDC Copenhagen 2022.

Plain text designates data that represents only readable characters, without any graphical representation, styling, or other objects such as images. It also includes whitespace characters that affect the simple arrangement of text, such as spaces, line breaks, and tab characters.

Plain text files differ significantly from the well-known “Word” files, where style information is embedded as binary, non-human-readable objects. Plain text files contain only text, without its graphical representation or other objects (formatting such as different fonts, font sizes, bold or italics, images, etc.). In principle, they are similar to texts written on a typewriter.

Plain text is therefore the most basic way of representing textual information in a digital system: the “lowest common denominator”, if you will. Since the introduction of the Unicode encoding UTF-8, it can be processed seamlessly by any modern UTF-8-enabled system (which by now means almost any system).
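To make this concrete, here is a minimal Python sketch showing that plain text is nothing more than characters stored as encoded bytes. The sample string and the variable names are assumptions chosen for illustration only.

```python
# A plain text string with a non-ASCII character and a line break.
text = "Plain text: naïve\n"

# Encoding turns the characters into the bytes that end up in a file.
data = text.encode("utf-8")

# Decoding the same bytes as UTF-8 restores the identical characters.
restored = data.decode("utf-8")

# The byte count can exceed the character count, because UTF-8 uses
# more than one byte for characters outside the ASCII range.
print(len(text), len(data))
print(restored == text)
```

Any system that decodes the bytes as UTF-8 gets the same characters back, which is what makes plain text so portable.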

Note

Plain text is a (technically) simple representation of textual and other forms of content.

Markup adds Meaning to an otherwise boring Plain Text#

Text doesn’t exist in a vacuum. While plain text is just a way to store characters, the meaning of these characters has to be established by some kind of convention. Depending on your requirements, you might want to add some context to your text files:

  • If you are a programmer you may want to write a computer program.

  • If you are a writer, you want to define headings, emphasize passages, etc.

  • If you want to manage data you want this data to be formalized so that another human being or another computer can recognize the actual meaning of your data.

While plain text only contains a certain set of human readable characters, extra meaning can be added to the text while still maintaining the principal properties and characteristics of a plain text. This can be a programming language or some kind of defined markup language.

A markup language specifies the structure and formatting of a document. It is basically a set of rules defining how meta information is added to the content to facilitate automated processing and proper formatting. The term “markup” evolved from the “marking up” of paper manuscripts with red ink to give formatting instructions for creating production-ready materials.

When typewriters were widely used, a similar technique was widely adopted: to emphasize a word, you simply added an asterisk (*) at its beginning and end (like This is *emphasized*!). A markup language takes this to the next level by defining strictly and unambiguously how markup has to be done, thus enabling automated processing. The more extensive the definitions, the more features, such as hyperlinks or links to image files, can be added to an otherwise plain text.
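The asterisk convention can be turned into a strict rule that a program can apply. The following Python sketch assumes one hypothetical rule (text wrapped in single asterisks, with no whitespace directly after the opening asterisk, is emphasized); real markup languages such as Markdown define their rules in far more detail.

```python
import re

# Hypothetical rule: "*text*" marks emphasis. The opening asterisk must be
# followed directly by a non-space character, so "2 * 3 * 4" is not markup.
EMPHASIS = re.compile(r"\*([^*\s][^*]*)\*")

def find_emphasis(line: str) -> list:
    """Return every span that is marked up as emphasized in the line."""
    return EMPHASIS.findall(line)

print(find_emphasis("This is *emphasized*!"))
print(find_emphasis("2 * 3 * 4"))
```

Because the rule is unambiguous, a program can tell markup apart from an ordinary asterisk without human judgment.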

It has to be emphasized that the markup does not constitute the formatting of the text in the end product. It is merely a placeholder. The actual formatting has to be carried out in an extra step that takes into account the possibilities and requirements of the end product, be it a web page or a printed work. This work is done by a compiler that has layout and style definitions for the various formats.

At first glance this extra step seems cumbersome. However, this division between meaning and formatting enables the creation of different target products (print, electronic, online, …) from the very same source material. Furthermore, changes in style or layout can easily be accomplished by changing the style definitions of the compiler without ever touching the actual source material.

Markup

A Markup language defines how to add semantics and features to an otherwise unformatted plain text by simply using ordinary characters.

One of the most popular markup languages is Markdown which exists in a variety of dialects.

A Compiler with Style Definitions translates Plain Text to the End Product#

Plain text allows the focus to be placed on the actual text content. As mentioned before, you can add semantics (emphasis, headings) and features (links to images, hyperlinks, …) to plain text by using a markup language.

In order to put the text into a presentation ready form in an appropriate layout, you need

  • a compiler that translates the plain text to the desired format and

  • style definitions that tell the compiler how an item (like a heading or emphasized text) should actually look (which font, size, shape, color, …).

The advantage of the separation of content, meaning and layout is obvious:

  • The layout can be changed centrally.

  • Style definitions may vary depending on the desired output type or format. For example, one font may be suitable for printing, but a different one for a website.

  • The compiler can support different output formats, e.g. PDF for printing or HTML for use as a web page. The same text can therefore be used for different output formats without change.
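The separation of content and style can be sketched in a few lines of Python. This is a toy illustration, not how Sphinx or any real toolchain works: the `STYLES` table, the `compile_text` function, and the two markup rules (`# ` for a heading, `*text*` for emphasis) are all assumptions made for this example.

```python
import re

# Hypothetical style definitions: one template per element and output format.
STYLES = {
    "html": {
        "heading": "<h1>{}</h1>",
        "emphasis": "<em>{}</em>",
        "paragraph": "<p>{}</p>",
    },
    "latex": {
        "heading": "\\section{{{}}}",
        "emphasis": "\\emph{{{}}}",
        "paragraph": "{}",
    },
}

EMPHASIS = re.compile(r"\*([^*\s][^*]*)\*")

def compile_text(source: str, target: str) -> str:
    """Translate marked-up plain text using the styles of the target format."""
    style = STYLES[target]
    output = []
    for line in source.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Replace *text* with the emphasis template of the target format.
        line = EMPHASIS.sub(lambda m: style["emphasis"].format(m.group(1)), line)
        if line.startswith("# "):
            output.append(style["heading"].format(line[2:]))
        else:
            output.append(style["paragraph"].format(line))
    return "\n".join(output)

source = "# Title\nThis is *emphasized*!"
print(compile_text(source, "html"))
print(compile_text(source, "latex"))
```

The source text is never modified: changing a template in `STYLES` changes the look of every document, and adding a new entry adds a new output format.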

digraph {

"Plain Text File"
        [
                shape = note,
                color=crimson,
                style=filled,
                fillcolor=white,
                fontcolor=crimson,
                fontname="Latin Modern Mono",
                label="Plain Text File",
        ];
"Style"
        [
                shape = component,
                color=fuchsia,
                style=filled,
                fillcolor=white,
                fontcolor=fuchsia,
                fontname="Lexend",
                label="Style Definition",
        ];
"Compiler"
        [
                shape = doublecircle,
                color=grey,
                style=filled,
                fillcolor=black,
                fontcolor=white,
                fontname="Lexend",
                label="Interpreter",
        ];



"File.pdf"
        [
                shape = note,
                color=black,
                style=filled,
                fillcolor=white,
                fontcolor=black,
                fontname="Lexend",
                label="PDF File",
        ];

"File.html"
        [
                shape = note,
                color=blue,
                style=filled,
                fillcolor=white,
                fontcolor=blue,
                fontname="Lexend",
                label="Webpage",
        ];
"File.epub"
        [
                shape = note,
                color=darkgreen,
                style=filled,
                fillcolor=white,
                fontcolor=darkgreen,
                fontname="Lexend",
                label="E-Book",
        ];

rankdir = TB;

subgraph {
        "Plain Text File" -> "Compiler" -> "File.pdf";
        "Style" -> "Compiler" [color=fuchsia];
        "Compiler" -> "File.html";
        "Compiler" -> "File.epub";

        {rank = same; "Compiler" ; "Style";}
        }
}

Fig. 74 An interpreter translates the plain text file into different output files and formats#

As mentioned before, one of the most popular markup languages for textual content is Markdown, which exists in a variety of dialects. One of these dialects is MyST Markdown, which is featured and supported in the Sphinx toolchain, a powerful text processing system for print and online publishing.

Plain Text can contain extensive Data Collections: XML and CSV#

While common, the representation of textual content is by far not the only domain of plain text. Special markup languages designed for structuring and storing data enable us to store data in a human-readable format. One of the most common formats is XML, the Extensible Markup Language. The main purpose of XML is to store arbitrary data in a standardized, structured format so that it can be easily processed. However, since XML is as flexible as it is extensible, allowing you to specify and structure any data fields you can think of, an XML file has to be accompanied by some sort of definition or standard documenting what these data fields actually mean in order to be useful.
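As a small illustration, the following Python sketch reads a tiny XML document with the standard library module `xml.etree.ElementTree`. The element and attribute names (`measurements`, `reading`, `sensor`, `unit`) and the values are invented for this example; a program can only interpret them because the code author knows what they are supposed to mean.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document; names and values are made up for illustration.
xml_text = """<measurements>
  <reading sensor="a" unit="celsius">21.5</reading>
  <reading sensor="b" unit="celsius">19.0</reading>
</measurements>"""

root = ET.fromstring(xml_text)

# Map each sensor name to its numeric reading.
readings = {r.get("sensor"): float(r.text) for r in root.findall("reading")}
print(readings)
```

The parser recovers the structure without help, but the knowledge that `reading` holds a temperature has to come from an accompanying definition.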

Another well-known format is CSV, which stands for Comma-Separated Values. With CSV you can export the values of a spreadsheet, with the columns separated by commas (or another delimiter) and the rows by line breaks. It is simple, quick, and supported by almost every software package designed to process tabular data, making it a popular choice when you have to exchange tabular data between different programs.
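Reading and writing CSV can be sketched with Python's standard `csv` module. The column names and values below are made-up sample data, and the semicolon is just one example of an alternative delimiter.

```python
import csv
import io

# Hypothetical table: the first line holds the column names.
csv_text = "name,quantity\napples,3\npears,5\n"

# Read each row into a dictionary keyed by column name.
rows = list(csv.DictReader(io.StringIO(csv_text)))
total = sum(int(row["quantity"]) for row in rows)
print(total)

# Write the same data again, this time separated by semicolons.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=";", lineterminator="\n")
writer.writerow(["name", "quantity"])
for row in rows:
    writer.writerow([row["name"], row["quantity"]])
semicolon_text = buffer.getvalue()
print(semicolon_text)
```

Note that every value is read as text; converting `quantity` to a number is left to the program that consumes the file.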