Monday, January 7, 2008

from .txt to text

As an online news producer, I have often found myself using software tools that are poorly suited to working with journalistic content. It’s becoming somewhat cliché in the industry to say that all content management systems are bad, and I believe I have an idea why that is.

The problem is that there are a number of words that are shared between the humanistic disciplines and computer science—words which are deceptive in that they appear to be simple and straightforward in meaning, but which in fact have evolved different, if overlapping, meanings in each of those two traditions. I refer to words like “graphic” and “music” but especially to “text.”

To the writer, the word “text” can mean any number of things—a handwritten note, an article from a journal, a book. A text may include illustrations, photographs, snippets from other languages, unusual formatting choices, whatever is necessary to express the intent of the author. These are not add-ons to the content, but integral to it.

In contrast, to the computer scientist, “text” is what a text editor edits. The word refers to a binary representation of a typewritten text. There is a limited character palette, as each character must be assigned a number. These are then arranged in sequence, allowing for display only in rigid rows, left-to-right, top-to-bottom (some latitude is allowed for non-Latin alphabets, but that is a relatively recent development), with no other formatting allowed.
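To make the programmer's view concrete, here is a minimal Python sketch of "text" as a sequence of numbered characters. The sample word and the variable names are my own choices for illustration, and UTF-8 is just one common encoding, not the only one.

```python
# "Text" as a programmer sees it: a sequence of characters,
# each of which is assigned a number (its Unicode code point).
text = "Text"

# Look up the number behind each character, in order.
code_points = [ord(ch) for ch in text]
print(code_points)  # [84, 101, 120, 116]

# Encoding turns the sequence into the bytes actually stored on disk.
# UTF-8 is assumed here purely as an example encoding.
encoded = text.encode("utf-8")
print(list(encoded))  # [84, 101, 120, 116]
```

Nothing in that sequence says anything about typeface, emphasis, or layout. It is only characters in order.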

Any additional presentational features of the text (hard line breaks, character formatting, kerning and leading, mathematical typesetting, included images) must be added through one of several conventions that embed metadata in the flow of the text. Examples include LaTeX, HTML, and the Microsoft Word document format. The text then splits into code and WYSIWYG views, and your opinion as to which of these constitutes the “real” text is likely to be a function of whether you work as a writer or a computer programmer. The people who use the software want to manipulate the WYSIWYG text, but the people who create the software are concerned with manipulating the code text.
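The split between the two views can be sketched in a few lines of Python. This is only an illustration using HTML and the standard library's `html.parser` module; the `VisibleText` class and the sample sentence are hypothetical names of mine, and a real editor does far more than strip tags.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collects only the character data, ignoring the markup around it."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text found between tags.
        self.parts.append(data)


# The "code text": characters with formatting metadata embedded in the flow.
code_text = "A <em>text</em> may include <b>formatting</b>."

parser = VisibleText()
parser.feed(code_text)
parser.close()

# The words a reader of the WYSIWYG view would see (emphasis not shown here).
visible_text = "".join(parser.parts)
print(visible_text)  # A text may include formatting.
```

The programmer manipulates `code_text`, while the writer thinks in terms of something closer to `visible_text` plus its appearance.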

The theory behind WYSIWYG word processing is that the distinction should make no difference, since the two forms are simply different expressions of the same structure. In reality, however, the code text represents the structure directly, and the WYSIWYG text is a translation. The situation is better with well-crafted software, but writing texts that differ markedly in structure from typewritten or “plain” texts is often impossible or extremely difficult without learning how to work directly in the code.

I’m not sure this situation can ever be fixed as long as WYSIWYG texts are coded as “plain” texts with embedded metadata. The writer will require a new model of textual representation that deals with characters and formatting information at the same level of abstraction.
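I don't have a finished design for such a model, but one hypothetical reading of the idea can be sketched in Python: each character carries its own formatting, so there is no separate layer of markup in the flow of the text. The `StyledChar` class, its `bold` and `italic` fields, and the `styled` helper are all invented here for illustration, not an existing format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StyledChar:
    """One character together with its formatting (hypothetical model)."""
    char: str
    bold: bool = False
    italic: bool = False


def styled(text, **style):
    """Build a run of characters that all share the same formatting."""
    return [StyledChar(ch, **style) for ch in text]


# A document is just a sequence of styled characters; no tags are embedded.
document = styled("plain ") + styled("bold", bold=True)

# The words and the formatting are both read from the same objects.
plain = "".join(sc.char for sc in document)
bold_part = "".join(sc.char for sc in document if sc.bold)
print(plain)      # plain bold
print(bold_part)  # bold
```

Storing formatting on every character is wasteful, and a practical system would likely group runs of identical styling, but the sketch shows characters and formatting living at the same level of abstraction.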
