Where is the Fonttable XML file

Practical tips Word: A format in all openness?

(Montero Pineda, Manuel, Herkert, Steffen, Klevenz, Tobias, Kutscherauer, Nico: Practical tips Word: A format in all openness?, in: technical communication, issue 1, 2009).

Today's word processing programs are getting closer and closer to the functionality of a DTP application. Including graphics and tables, handling fonts, structuring in columns and using text fields: functions that were not originally part of word processing have now become part of common office solutions. The more extensive Office applications become, the more complex their file formats become. More and more software providers are therefore recognizing the advantages of an XML interface and relying on XML-based file formats.

Microsoft Word is the clear market leader in word processing. For years, the Word DOC format was the standard. Since Microsoft Word 2003, Word documents can also be saved as WordprocessingML, or “WordML” for short, in addition to the proprietary format. WordML is an application of XML, like XHTML, DocBook or DITA. It is open and can be converted into other XML formats using XSLT transformations. While WordML is only one possible storage format in Office 2003, where DOC remains the default, Microsoft uses a new format for Word 2007: DOCX, a ZIP-compressed collection of XML files based on Office Open XML (OOXML). In March 2008, the OOXML format specification was approved as an ISO standard. Its publication was delayed by appeals, which have since been dismissed.

Languages of OOXML

As the name suggests, Office Open XML does not only include the markup language for Microsoft Word. The format also integrates the other formats of the Office suite at this level. The main components of OOXML are:

  • WordprocessingML, a slightly revised version of the XML application developed for Word 2003, which covers all of Word's word processing and text design functions.
  • SpreadsheetML, which is used to integrate Excel tables.
  • PresentationML, which covers the functions of PowerPoint.

The file extensions of Excel and PowerPoint likewise gain the obligatory "x", which refers to XML as a component: XLSX and PPTX.
Word can also display mathematical equations and vector drawings. The Mathematical Markup Language recommended by the W3C, "MathML", had already established itself as the standard for mathematical equations. However, Microsoft uses its own XML language in OOXML: Office MathML (OMML). Because the two languages are similar, partial compatibility between MathML and OMML can be achieved using XSLT transformations. DrawingML is used for vector graphics instead of the well-established SVG language standardized by the W3C.

Structure of OOXML using Word as an example

As mentioned, a DOCX document is a file compressed as a ZIP container. If a DOCX file is decompressed, the user receives a fixed folder structure. In this structure all necessary information of the Word document is stored as a kind of collection of XML files and non-XML files. The structure of the ZIP container is specified by the Open Packaging Conventions (OPC) - part of the OOXML standardization.

Fig.: After unpacking it becomes clear what is in a DOCX document.
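
The unpacking step described above can also be reproduced programmatically. The following is a minimal sketch in Python, assuming only the standard library; the function name list_package_members and the in-memory sample archive are illustrative stand-ins, not part of the original article or of any Office tooling.

```python
import io
import zipfile


def list_package_members(docx):
    """Group the member names of a DOCX package by top-level folder.

    `docx` may be a file path or a binary file-like object.
    The key "" collects files at the top level of the archive.
    """
    groups = {}
    with zipfile.ZipFile(docx) as package:
        for name in package.namelist():
            folder = name.split("/", 1)[0] if "/" in name else ""
            groups.setdefault(folder, []).append(name)
    return groups


# Hypothetical stand-in for a real .docx file: a tiny archive built in memory.
sample = io.BytesIO()
with zipfile.ZipFile(sample, "w") as archive:
    for member in ("[Content_Types].xml", "_rels/.rels",
                   "docProps/app.xml", "docProps/core.xml",
                   "word/document.xml", "word/fontTable.xml"):
        archive.writestr(member, "<placeholder/>")

for folder, names in list_package_members(sample).items():
    print(folder or "(top level)", names)
```

With a real document, the path of the DOCX file can be passed instead of the in-memory sample.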

The XML files in the ZIP container can be divided into "Parts" and "Items". Parts are files that contain parts of the content of a document, while items provide metadata about these files. Items are in turn divided into content type items and relationship items. Relationship items determine how the individual parts are put together to form a document. Content type items, on the other hand, regulate the presentation of the content, for example formatting instructions or page format. Every document has a main part. It can usually be found in the DOCX ZIP container in the document.xml file in the "word" folder. This file contains all the content that the Word document holds as continuous text. The following content is saved in other files that are created in the "word" directory:

  • Bullets and numbering that were created automatically or with the corresponding function. Only the list markers, not the contents of the lists, are stored in an external XML file: the symbols of a bulleted list, for example, are stored in the file "numbering.xml".
  • The static content of footers and headers is saved in the files "footerN.xml" and "headerN.xml". "N" stands for an integer that indicates the position of the object within the document. Since footers and headers can, but need not, contain the same content on every page, there may be several of these parts.
  • The footnotes and endnotes inserted via the corresponding function are collected in "footnotes.xml" or "endnotes.xml".
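
The main part and the numbered header and footer parts described above can be read with a few lines of code. This is a minimal sketch, assuming Python's standard library and the WordprocessingML namespace used by Word 2007 documents; the function names and the in-memory sample are hypothetical, and the text extraction deliberately ignores tabs, line breaks and fields.

```python
import io
import re
import zipfile
import xml.etree.ElementTree as ET

# Namespace of WordprocessingML elements such as w:p (paragraph) and w:t (text).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def paragraphs(docx, part="word/document.xml"):
    """Return the plain text of every w:p paragraph in the given part."""
    with zipfile.ZipFile(docx) as package:
        root = ET.fromstring(package.read(part))
    result = []
    for p in root.iter(f"{{{W_NS}}}p"):
        result.append("".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t")))
    return result


def header_footer_parts(docx):
    """Return the headerN.xml and footerN.xml parts in the "word" folder."""
    pattern = re.compile(r"word/(header|footer)\d+\.xml")
    with zipfile.ZipFile(docx) as package:
        return sorted(n for n in package.namelist() if pattern.fullmatch(n))


# Hypothetical sample package built in memory (not a complete, valid DOCX).
DOCUMENT = (
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    "<w:p><w:r><w:t>Hello </w:t></w:r><w:r><w:t>Word</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>"
    "</w:body></w:document>"
)
sample = io.BytesIO()
with zipfile.ZipFile(sample, "w") as archive:
    archive.writestr("word/document.xml", DOCUMENT)
    archive.writestr("word/header1.xml", f'<w:hdr xmlns:w="{W_NS}"/>')
    archive.writestr("word/footer1.xml", f'<w:ftr xmlns:w="{W_NS}"/>')
    archive.writestr("word/footnotes.xml", f'<w:footnotes xmlns:w="{W_NS}"/>')

print(paragraphs(sample))
print(header_footer_parts(sample))
```

The same paragraphs function can be pointed at another part, for example part="word/footnotes.xml", because those parts use the same paragraph and text elements.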

Separate subfolders are created for embedded charts, for "SmartArt" diagrams and for the table of contents:

  • The data required for charts can be found under "charts/chartN.xml". Excel tables are included in the “embeddings” subfolder.
  • The “diagrams” subfolder contains not only the content of the SmartArt diagrams, in “dataN.xml”, but also the necessary meta information as content type items: “colorsN.xml”, “layoutN.xml” and “quickStyleN.xml”.

The table of contents is saved as a separate sub-document in the "glossary" folder.
The items fontTable.xml, settings.xml, styles.xml and webSettings.xml are present in every OOXML document.
The general document properties are saved in settings.xml, a list of the fonts used in fontTable.xml, and the various formatting instructions applied to the content in styles.xml. The webSettings.xml file is reserved for any special instructions for web display. In addition to these items, the file [Content_Types].xml is located at the top level of the ZIP container by default; it describes the content of the complete archive.
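
To illustrate where the font list and the archive description live, the following sketch reads word/fontTable.xml and [Content_Types].xml from a package. It assumes Python's standard library, the w:font element with its w:name attribute, and the Default and Override entries of the content types file; the function names and the sample data are hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
CT_NS = "http://schemas.openxmlformats.org/package/2006/content-types"


def font_names(docx):
    """Return the font names listed in word/fontTable.xml."""
    with zipfile.ZipFile(docx) as package:
        root = ET.fromstring(package.read("word/fontTable.xml"))
    return [font.get(f"{{{W_NS}}}name") for font in root.iter(f"{{{W_NS}}}font")]


def content_types(docx):
    """Map part names (Override) and file extensions (Default) to content types."""
    with zipfile.ZipFile(docx) as package:
        root = ET.fromstring(package.read("[Content_Types].xml"))
    types = {}
    for entry in root:
        key = entry.get("PartName") or entry.get("Extension")
        types[key] = entry.get("ContentType")
    return types


# Hypothetical sample package built in memory (not a complete, valid DOCX).
FONT_TABLE = (
    f'<w:fonts xmlns:w="{W_NS}">'
    '<w:font w:name="Calibri"/><w:font w:name="Times New Roman"/>'
    "</w:fonts>"
)
CONTENT_TYPES = (
    f'<Types xmlns="{CT_NS}">'
    '<Default Extension="xml" ContentType="application/xml"/>'
    '<Override PartName="/word/fontTable.xml" ContentType="application/'
    'vnd.openxmlformats-officedocument.wordprocessingml.fontTable+xml"/>'
    "</Types>"
)
sample = io.BytesIO()
with zipfile.ZipFile(sample, "w") as archive:
    archive.writestr("[Content_Types].xml", CONTENT_TYPES)
    archive.writestr("word/fontTable.xml", FONT_TABLE)

print(font_names(sample))
print(content_types(sample))
```

Reading the files directly from the ZIP container avoids unpacking the document to disk first.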
The files app.xml and core.xml are located in the docProps folder and contain document properties relating to the context. Background information for the application, in this case Microsoft Word, such as the author, number of pages, number of words, number of characters or version of the application, is stored in app.xml. The date of the last save and the date of creation are recorded in core.xml.
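
The document properties mentioned above can be read in the same way. The sketch below assumes Python's standard library and the Dublin Core terms namespace used for the dates in core.xml; the function names and the sample values are hypothetical and serve only as an illustration.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# The dates in docProps/core.xml use the Dublin Core terms namespace.
NS = {"dcterms": "http://purl.org/dc/terms/"}
CP_NS = "http://schemas.openxmlformats.org/package/2006/metadata/core-properties"
APP_NS = ("http://schemas.openxmlformats.org/officeDocument/2006/"
          "extended-properties")


def core_dates(docx):
    """Return the creation and last-modification dates from core.xml."""
    with zipfile.ZipFile(docx) as package:
        root = ET.fromstring(package.read("docProps/core.xml"))
    return {
        "created": root.findtext("dcterms:created", namespaces=NS),
        "modified": root.findtext("dcterms:modified", namespaces=NS),
    }


def app_properties(docx):
    """Return the child elements of app.xml as a {local name: text} dict."""
    with zipfile.ZipFile(docx) as package:
        root = ET.fromstring(package.read("docProps/app.xml"))
    return {child.tag.split("}")[-1]: child.text for child in root}


# Hypothetical sample package built in memory (not a complete, valid DOCX).
CORE = (
    f'<cp:coreProperties xmlns:cp="{CP_NS}" xmlns:dcterms="{NS["dcterms"]}">'
    "<dcterms:created>2009-01-15T09:00:00Z</dcterms:created>"
    "<dcterms:modified>2009-02-01T17:30:00Z</dcterms:modified>"
    "</cp:coreProperties>"
)
APP = (
    f'<Properties xmlns="{APP_NS}">'
    "<Application>Microsoft Office Word</Application>"
    "<Pages>2</Pages><Words>350</Words>"
    "</Properties>"
)
sample = io.BytesIO()
with zipfile.ZipFile(sample, "w") as archive:
    archive.writestr("docProps/core.xml", CORE)
    archive.writestr("docProps/app.xml", APP)

print(core_dates(sample))
print(app_properties(sample))
```

All values are returned as strings; converting the dates or counts into other types is left to the caller.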

ODF - the competition

Before Office Open XML was nominated as an ISO standard, there was already a corresponding standard that dealt with document formats for office applications. As early as 2006, the OASIS Open Document Format for Office Applications, "OpenDocument" or "ODF", was approved as an ISO standard. In contrast to OOXML, which is tailored to the properties of Microsoft Word, ODF is based on the file format of the free application OpenOffice. Thanks to this head start, ODF receives support not only from supporters of the open source movement, but also from IT companies such as IBM, Oracle, Sun Microsystems and Google as well as various authorities.

Criticism of OOXML

The main points of criticism are the scope of the OOXML specification, around 6,000 pages, and the incompatibility with established standards such as MathML or SVG. The enormous scope of the specification prevents full support of the OOXML standard by competing products. Even MS Office will not fully support the standard until the next version; MS Office 2007 only uses it as a basis.
Likewise, the use of Office MathML instead of MathML and of DrawingML instead of SVG makes it more difficult for other programs to support OOXML. To allow complete conversion to standards such as ODF or MathML, new converters would have to be developed. However, this contradicts the basic idea of a generally applicable standard, which should itself build on existing standards. In addition, the new version of WordML in no way solves the structural problems of this XML language.
But OOXML does not only have to hold its own against competing formats. Every innovation in the software is associated with a lengthy conversion process so that previous versions can be supported. Since old Office versions do not support DOCX files without add-ons, there are compatibility problems, and a new converter has to be developed. The DOCX format must therefore first assert itself internally against the widespread DOC format.
In view of Microsoft's previous policy, the current development can be seen as a breakthrough. However, the corporation cannot or does not want to get rid of the aftertaste of monopoly.





Copyright © 2009 tcworld GmbH
You can print out the online version for your private use.
Otherwise, this article from the specialist journal "Technische kommunikation" (Issue 1, 2009) is subject to the same provisions as the print edition: The work, including all of its parts, is protected by copyright. All rights reserved, including duplication, translation, microfilming as well as storage and processing in electronic systems.

tcworld GmbH, Rotebühlstrasse 64, 70178 Stuttgart, www.tekom.de