Saturday, 2 June 2012

OpenDocument Format (ODF) - a look into ODT

The OpenDocument Format (ODF) is an XML-based file format for word processing, spreadsheet and presentation documents, and a widely used open alternative to proprietary document formats. It targets office applications and reuses established standards such as HTML, SVG, XML and MathML, among others.

The most commonly used filename extensions are:

Word Processing - .odt, .fodt
Spreadsheets - .ods, .fods
Presentations - .odp, .fodp
Databases - .odb
Graphics - .odg, .fodg
Formulae - .odf

ODF supports two types of document representation:

1. Single XML document (flat XML, uncompressed) - the whole document is stored in one XML file. This form is not widely used; the usual file extensions are .xml, .fodt, .fods, etc.

2. ZIP compressed archive - a collection of several sub-documents within a package, each of which stores part of the complete document. The filename extensions used are .odt, .ods, .odp, etc.

The package is a standard ZIP file with a different filename extension and a defined structure of sub-documents. Each sub-document has its own document root and stores one particular aspect of the document. This separates the content, styles, metadata and application settings into different XML files.

.odt: this is the most common file extension for text documents. An .odt file is a ZIP container, so it actually holds multiple files rather than a single XML file. The zipped set of files and directories includes XML files (content.xml, meta.xml, settings.xml, styles.xml), the mimetype file, directories (META-INF, Thumbnails, Objects), etc.
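Since the package is an ordinary ZIP file, it can be inspected with any ZIP library. Below is a minimal Python sketch using only the standard library. The helper names (describe_package, build_demo_package) are made up for this post, and the demo package is an incomplete stand-in built in memory with placeholder XML, not a valid document.

```python
import io
import zipfile


def describe_package(source):
    """Return (mimetype, sorted entry names) for an ODF package.

    `source` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as package:
        mimetype = package.read("mimetype").decode("ascii").strip()
        return mimetype, sorted(package.namelist())


def build_demo_package():
    """Build a tiny, incomplete stand-in for an .odt file in memory."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        # By convention the mimetype entry comes first and is stored
        # uncompressed (ZipInfo defaults to ZIP_STORED).
        package.writestr(
            zipfile.ZipInfo("mimetype"),
            "application/vnd.oasis.opendocument.text",
        )
        # Placeholder bodies: real files contain full XML documents.
        for name in ("content.xml", "meta.xml", "settings.xml", "styles.xml"):
            package.writestr(name, "<placeholder/>")
        package.writestr("META-INF/manifest.xml", "<placeholder/>")
    buffer.seek(0)
    return buffer


mimetype, names = describe_package(build_demo_package())
print(mimetype)
for name in names:
    print(name)
```

To look at a real document, pass its path instead, for example describe_package("report.odt"), where report.odt is whatever file you have at hand.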

Its capabilities include:

Content: OpenDocument’s text content format supports both typical and advanced capabilities. content.xml is the most important file and carries the actual content of the document (except for binary data, such as images).

Metadata: There are predefined metadata elements, plus the option to define custom metadata, to store data about the document. Some predefined metadata fields are Generator, Title, Description and Subject. meta.xml contains the file metadata.

Style & Formatting: styles.xml contains the style information. OpenDocument makes heavy use of styles for formatting and layout. Style types include paragraph styles, page styles, character styles, frame styles, etc.

Objects: This is where the math in ODT comes in. The OpenDocument format can contain two types of objects: a) those that have an OpenDocument representation (formulas in MathML, charts, etc.) and b) those that don’t have an XML representation (these only have a binary representation).

The above shows how the OpenDocument format provides a strong separation between the content, layout and metadata.
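To make the metadata part concrete, here is a rough sketch of reading the predefined and custom fields out of a meta.xml document with Python's xml.etree.ElementTree. The sample XML is hand-written for illustration (the generator string and the "Reviewer" field are invented), and read_metadata is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Namespace URIs used by ODF metadata documents.
NS = {
    "office": "urn:oasis:names:tc:opendocument:xmlns:office:1.0",
    "meta": "urn:oasis:names:tc:opendocument:xmlns:meta:1.0",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hand-written sample; a real meta.xml carries more fields.
SAMPLE_META_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<office:document-meta
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:meta="urn:oasis:names:tc:opendocument:xmlns:meta:1.0"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <office:meta>
    <meta:generator>ExampleOffice/1.0</meta:generator>
    <dc:title>A look into ODT</dc:title>
    <dc:subject>OpenDocument</dc:subject>
    <meta:user-defined meta:name="Reviewer">Someone</meta:user-defined>
  </office:meta>
</office:document-meta>
"""


def read_metadata(meta_xml):
    """Return (predefined, custom) metadata dictionaries."""
    root = ET.fromstring(meta_xml)
    meta = root.find("office:meta", NS)
    predefined = {
        "generator": meta.findtext("meta:generator", namespaces=NS),
        "title": meta.findtext("dc:title", namespaces=NS),
        "description": meta.findtext("dc:description", namespaces=NS),
        "subject": meta.findtext("dc:subject", namespaces=NS),
    }
    # Custom fields are meta:user-defined elements keyed by meta:name.
    name_attr = "{%s}name" % NS["meta"]
    custom = {
        element.get(name_attr): element.text
        for element in meta.findall("meta:user-defined", NS)
    }
    return predefined, custom


predefined, custom = read_metadata(SAMPLE_META_XML)
print(predefined)
print(custom)
```

Fields that are absent from the document (Description in this sample) simply come back as None.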

Some other files and directories include settings.xml (which contains settings such as the zoom factor or the cursor position), mimetype (which contains the one-line MIME type of the document), the Thumbnails folder (which contains thumbnail.png, a representation of the first page), the META-INF folder (which contains manifest.xml, holding information about the files in the package, such as a list of all the files and their media types) and several other folders that hold word processing preference data.
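As an illustration of what the manifest holds, the sketch below maps each package entry to its media type. The sample manifest is hand-written and shortened, and media_types is a hypothetical helper name; the element and attribute names (manifest:file-entry, manifest:full-path, manifest:media-type) are the ones used by ODF manifests.

```python
import xml.etree.ElementTree as ET

MANIFEST_NS = "urn:oasis:names:tc:opendocument:xmlns:manifest:1.0"

# Hand-written, shortened sample of META-INF/manifest.xml.
SAMPLE_MANIFEST_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<manifest:manifest
    xmlns:manifest="urn:oasis:names:tc:opendocument:xmlns:manifest:1.0">
  <manifest:file-entry manifest:full-path="/"
      manifest:media-type="application/vnd.oasis.opendocument.text"/>
  <manifest:file-entry manifest:full-path="content.xml"
      manifest:media-type="text/xml"/>
  <manifest:file-entry manifest:full-path="Object 1/"
      manifest:media-type="application/vnd.oasis.opendocument.formula"/>
  <manifest:file-entry manifest:full-path="Thumbnails/thumbnail.png"
      manifest:media-type="image/png"/>
</manifest:manifest>
"""


def media_types(manifest_xml):
    """Map each entry's full path to its declared media type."""
    root = ET.fromstring(manifest_xml)
    path_attr = "{%s}full-path" % MANIFEST_NS
    type_attr = "{%s}media-type" % MANIFEST_NS
    return {
        entry.get(path_attr): entry.get(type_attr)
        for entry in root.iter("{%s}file-entry" % MANIFEST_NS)
    }


entries = media_types(SAMPLE_MANIFEST_XML)
for path, media_type in entries.items():
    print(path, "->", media_type)
```

In this sample the "Object 1/" entry is declared as a formula, which is one hint for locating embedded math.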

In essence, to import the math from ODT we need to identify which objects are MathML objects and then convert them to iTeX so that they can be edited inside AbiWord.
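The identification step could look roughly like the Python sketch below. It is not the AbiWord importer, only an illustration, and it rests on an assumption about layout: that each embedded object sits in its own sub-directory with its own content.xml (for example "Object 1/content.xml"), and that a formula's content.xml has a MathML math element as its root. The helper names are hypothetical and the demo package is built in memory with minimal placeholder XML. The conversion to iTeX is not shown.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Fully qualified tag of a MathML root element.
MATHML_ROOT = "{http://www.w3.org/1998/Math/MathML}math"


def find_mathml_objects(source):
    """Return {object directory: raw MathML bytes} for embedded formulas."""
    found = {}
    with zipfile.ZipFile(source) as package:
        for name in package.namelist():
            # Assumption: embedded objects are stored as "<dir>/content.xml".
            if not name.endswith("/content.xml"):
                continue
            data = package.read(name)
            try:
                root = ET.fromstring(data)
            except ET.ParseError:
                continue  # not well-formed XML, so not a MathML object
            if root.tag == MATHML_ROOT:
                found[name.rsplit("/", 1)[0]] = data
    return found


def build_demo_package():
    """In-memory stand-in with one formula and one non-formula object."""
    office_ns = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    placeholder = '<office:document-content xmlns:office="%s"/>' % office_ns
    formula = (
        '<math xmlns="http://www.w3.org/1998/Math/MathML">'
        "<mi>x</mi><mo>+</mo><mn>1</mn></math>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("content.xml", placeholder)
        package.writestr("Object 1/content.xml", formula)
        package.writestr("Object 2/content.xml", placeholder)
    buffer.seek(0)
    return buffer


math_objects = find_mathml_objects(build_demo_package())
for directory in math_objects:
    print("MathML object found in:", directory)
```

Checking the root element of each object is one way to do it; cross-checking against the media types listed in META-INF/manifest.xml would be another.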
