Sunday, 3 June 2012

XML & its use in AbiWord

Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a format that is both human- and machine-readable. It is a markup language much like HTML but with a completely different goal: HTML was designed to display data, with a focus on how the data looks, whereas XML is about transporting and storing data, with a focus on what the data is. Unlike HTML, XML has no predefined tags and is designed to be self-descriptive. Since users define their own tags, the permitted structure can be described in a Document Type Definition (DTD), which is optional and, when present, is declared near the top of the file.

Some common constructs that appear in XML:

XML Declaration : <?xml version="1.0" encoding="UTF-8"?> is the XML declaration. It is not required in XML 1.0, but it identifies the document as XML and indicates the XML version and the character encoding.

Character : Any XML document is a string of characters and almost every legal Unicode character can appear in it. All XML processors must be able to read entities in both the UTF-8 & UTF-16 encodings.

Markup and Content : The characters in an XML document are divided into Markup and Content, which are distinguished by simple syntactic rules. Strings that constitute Markup either begin with the character < and end with >, or begin with & and end with ; and all other strings of characters are Content.

Tags : A tag is a markup construct that begins with < and ends with >. It can be a start-tag (<block>), an end-tag (</block>) or an empty-element tag (<line-break />). The document component that starts with a start-tag and ends with the matching end-tag, or consists only of an empty-element tag, is called an Element. Whatever lies between the start-tag and the end-tag is the Element’s content, which may contain child elements as well.

Attributes : Another markup construct, consisting of a name/value pair placed inside a start-tag or an empty-element tag, for example style="Normal" in <p style="Normal">.

The processor analyzes the markup and passes the structured information to an application. This processor is often called an XML parser. Many word processing programs use XML as their native document format, e.g. our very own AbiWord (.abw documents are XML).
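To make the constructs above concrete, here is a minimal sketch that hands a tiny document to an XML parser and looks at what comes back. It uses Python's standard xml.etree.ElementTree module; the <note> document and its lang attribute are made up for illustration and are not part of any AbiWord format.

```python
import xml.etree.ElementTree as ET

# A made-up document that uses a declaration, an attribute, an entity
# reference (&amp;), an ordinary element and an empty-element tag.
doc = b"""<?xml version="1.0" encoding="UTF-8"?>
<note lang="en">
  <block>Hello &amp; welcome</block>
  <line-break />
</note>"""

# The parser analyzes the markup and returns the structured information.
root = ET.fromstring(doc)

print(root.tag)     # name of the root element
print(root.attrib)  # its attributes as name/value pairs
for child in root:
    # Element content; the empty-element tag has no text at all.
    print(child.tag, repr(child.text))
```

Note that the parser has already resolved the &amp; reference, so the application sees a plain & in the content of <block>.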
 
XML in AbiWord : AbiWord uses a straightforward XML document format in which appearance and layout are specified in CSS-like attributes but only as a starting point. The entire XML source of a document (sample.abw) created in AbiWord that contains the text “AbiWord Rocks !” looks like :

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE abiword PUBLIC "-//ABISOURCE//DTD AWML 1.0 Strict//EN" "http://www.abisource.com/awml.dtd">
<abiword template="false" xmlns:ct="http://www.abisource.com/changetracking.dtd" xmlns:fo="http://www.w3.org/1999/XSL/Format" xmlns:math="http://www.w3.org/1998/Math/MathML" xid-max="2" xmlns:dc="http://purl.org/dc/elements/1.1/" styles="unlocked" fileformat="1.0" xmlns:svg="http://www.w3.org/2000/svg" xmlns:awml="http://www.abisource.com/awml.dtd" xmlns="http://www.abisource.com/awml.dtd" xmlns:xlink="http://www.w3.org/1999/xlink" version="0.99.2" xml:space="preserve" props="dom-dir:ltr; document-footnote-restart-section:0; document-endnote-type:numeric; document-endnote-place-enddoc:1; document-endnote-initial:1; lang:en-US; document-endnote-restart-section:0; document-footnote-restart-page:0; document-footnote-type:numeric; document-footnote-initial:1; document-endnote-place-endsection:0">
<!-- ======================================================================== -->
<!-- This file is an AbiWord document.                                        -->
<!-- AbiWord is a free, Open Source word processor.                           -->
<!-- More information about AbiWord is available at http://www.abisource.com/ -->
<!-- You should not edit this file by hand.                                   -->
<!-- ======================================================================== -->

<metadata>
<m key="abiword.generator">AbiWord</m>
<m key="dc.creator">Prashant</m>
<m key="dc.format">application/x-abiword</m>
</metadata>
<rdf>
</rdf>
<history version="1" edit-time="14" last-saved="1338761497" uid="0e329dea-adc9-11e1-9005-9b3eee35aa57">
<version id="1" started="1338761497" uid="16910e40-adc9-11e1-9005-9b3eee35aa57" auto="0" top-xid="2"/>
</history>
<styles>
<s type="P" name="Normal" followedby="Current Settings" props="font-family:Times New Roman; margin-top:0pt; color:000000; margin-left:0pt; text-position:normal; widows:2; font-style:normal; text-indent:0in; font-variant:normal; font-weight:normal; margin-right:0pt; font-size:12pt; text-decoration:none; margin-bottom:0pt; line-height:1.0; bgcolor:transparent; text-align:left; font-stretch:normal"/>
</styles>
<pagesize pagetype="Letter" orientation="portrait" width="8.500000" height="11.000000" units="in" page-scale="1.000000"/>
<section xid="1" props="page-margin-footer:0.5in; page-margin-header:0.5in">
<p style="Normal" xid="2"><c>AbiWord Rocks !</c></p>
</section>
</abiword>
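
Because the file is plain XML, any XML parser can read it. The following is a rough sketch, not AbiWord's own importer: it uses Python's xml.etree.ElementTree on a trimmed copy of the sample (only the default namespace, one metadata entry and the paragraph are kept) and pulls out the metadata and the paragraph text. The helper names paragraph_texts and metadata are hypothetical.

```python
import xml.etree.ElementTree as ET

# Default namespace declared on the <abiword> root in the sample.
AWML_NS = "http://www.abisource.com/awml.dtd"
ns = {"awml": AWML_NS}

# Trimmed copy of sample.abw; most attributes and elements are left out.
abw = b"""<?xml version="1.0" encoding="UTF-8"?>
<abiword xmlns="http://www.abisource.com/awml.dtd" version="0.99.2">
<metadata>
<m key="dc.creator">Prashant</m>
</metadata>
<section xid="1">
<p style="Normal" xid="2"><c>AbiWord Rocks !</c></p>
</section>
</abiword>"""

root = ET.fromstring(abw)

def paragraph_texts(root):
    """Return the text of every <p>, joining the text of its child runs."""
    return ["".join(p.itertext()) for p in root.iterfind(".//awml:p", ns)]

def metadata(root):
    """Return the <m key="..."> entries as a dictionary."""
    return {m.get("key"): m.text
            for m in root.iterfind("awml:metadata/awml:m", ns)}

print(paragraph_texts(root))
print(metadata(root))
```

The element names have to be qualified with the namespace because the root declares a default xmlns; the attributes (key, style, xid) carry no prefix and can be read by their plain names.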

The inherent readability of XML makes interchange and format specification considerably easier. Apart from AbiWord's own format, the OpenDocument Format mentioned in the previous post is also XML-based, and Microsoft Office now uses OOXML (a zipped, XML-based file format) as its default format.

Saturday, 2 June 2012

OpenDocument Format (ODF) - a look into ODT

The OpenDocument Format is an XML-based file format for word processing, spreadsheet and presentation documents, and a widely used open alternative to proprietary document formats. It reuses established standards such as HTML, SVG, XML, MathML etc.

The most commonly used filename extensions are :

Word Processing - .odt, .fodt
Spreadsheets - .ods, .fods
Presentations - .odp, .fodp
Databases - .odb
Graphics - .odg, .fodg
Formulae - .odf

ODF supports 2 types of document representation:

1. Single XML document / Flat XML / Uncompressed XML - This form is not widely used; the most common file extensions are .xml, .fodt, .fods, etc.

2. ZIP compressed archive - A collection of several sub-documents within a package, each of which stores part of the complete document. Filename extensions used are .odt, .ods, .odp, etc.

The package is a standard ZIP file with a different filename extension and a defined structure of sub-documents. Each sub-document has its own document root and stores a particular aspect of the document. This way the content, styles, metadata & application settings are separated into different XML files.

.odt : This is the most common file extension used for text documents and is a ZIP file container with an .odt extension. It contains multiple files instead of just one XML file. The zipped set of files and directories includes XML files (content.xml, meta.xml, settings.xml, styles.xml), the mimetype file, directories (META-INF, Thumbnails, Objects) etc.
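Since the package is an ordinary ZIP file, it can be opened with any ZIP library. The sketch below uses Python's zipfile module. So that it runs without a real document at hand, it first builds a tiny stand-in package in memory; the placeholder file bodies are made up and the result is not a valid .odt, only something with the same layout. The function names are hypothetical.

```python
import io
import zipfile

def build_demo_package():
    """Build an in-memory ZIP laid out like an .odt (placeholder content only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        # ODF packages keep the mimetype entry first and uncompressed
        # (ZIP_STORED is the default for zipfile.ZipFile).
        z.writestr("mimetype", "application/vnd.oasis.opendocument.text")
        for name in ("content.xml", "meta.xml", "settings.xml",
                     "styles.xml", "META-INF/manifest.xml"):
            z.writestr(name, "<placeholder/>")
    return buf.getvalue()

def list_package(data):
    """Return (entry names in stored order, content of the mimetype entry)."""
    with zipfile.ZipFile(io.BytesIO(data)) as z:
        return z.namelist(), z.read("mimetype").decode("ascii")

names, mimetype = list_package(build_demo_package())
for name in names:
    print(name)
print(mimetype)
```

For a real file, the same list_package logic applies after reading the bytes from disk, e.g. open("sample.odt", "rb").read(), where sample.odt is whatever document you have available.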

Its capabilities include :

Content: OpenDocument’s text content format supports both typical and advanced capabilities. content.xml is the most important file and carries the actual content of the document (except for binary data, such as images)

Metadata: There are predefined metadata elements as well as an option to define custom metadata, i.e. data about the data. Some predefined metadata fields are Generator, Title, Description, Subject, etc. (meta.xml contains the file metadata.)

Style & Formatting : styles.xml contains the style information. OpenDocument makes heavy use of styles for formatting and layout. Style types include Paragraph Styles, Page Styles, Character Styles, Frame Styles etc.

Objects : This is where the Math in ODT comes in. The OpenDocument format can contain 2 types of objects : a) those that have an OpenDocument representation (formulas as MathML, charts etc.) and b) those that don’t have an XML representation (these only have a binary representation).

The above shows how the OpenDocument format provides a strong separation between the content, layout and metadata.

Some other files and directories include settings.xml (which contains settings such as the zoom factor or the cursor position), mimetype (contains the one-line MIME type of the document), the Thumbnails folder (contains thumbnail.png, a preview of the first page), the META-INF folder (which contains manifest.xml, with information about the files in the package such as the list of all files and their media types) and several other folders that hold word processing preference data.

In essence, to import the math from ODT, we need to identify which objects are MathML objects and then convert them to iTeX so that they can be edited inside AbiWord.
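The first half of that plan, finding the MathML objects, can be sketched as follows. This is only an illustration in Python, not the AbiWord importer. It assumes that an embedded formula shows up in META-INF/manifest.xml as a directory entry (here "Object 1/") with the media type application/vnd.oasis.opendocument.formula, and that the directory holds a content.xml whose root is a MathML <math> element. The demo package, the object name and the function names are made up, and the conversion to iTeX is not shown.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

MANIFEST_NS = "urn:oasis:names:tc:opendocument:xmlns:manifest:1.0"
MATHML_NS = "http://www.w3.org/1998/Math/MathML"
FORMULA_TYPE = "application/vnd.oasis.opendocument.formula"

# Made-up manifest with one embedded formula object (assumed layout).
MANIFEST = f"""<?xml version="1.0" encoding="UTF-8"?>
<manifest:manifest xmlns:manifest="{MANIFEST_NS}">
 <manifest:file-entry manifest:full-path="/" manifest:media-type="application/vnd.oasis.opendocument.text"/>
 <manifest:file-entry manifest:full-path="content.xml" manifest:media-type="text/xml"/>
 <manifest:file-entry manifest:full-path="Object 1/" manifest:media-type="{FORMULA_TYPE}"/>
 <manifest:file-entry manifest:full-path="Object 1/content.xml" manifest:media-type="text/xml"/>
</manifest:manifest>"""

FORMULA = f'<math xmlns="{MATHML_NS}"><mi>x</mi></math>'

def build_demo_package():
    """Build an in-memory stand-in for an .odt that embeds one formula."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        z.writestr("mimetype", "application/vnd.oasis.opendocument.text")
        z.writestr("META-INF/manifest.xml", MANIFEST)
        z.writestr("content.xml", "<placeholder/>")
        z.writestr("Object 1/content.xml", FORMULA)
    return buf.getvalue()

def find_mathml_objects(data):
    """Return {object directory: MathML source} for formula objects."""
    found = {}
    with zipfile.ZipFile(io.BytesIO(data)) as z:
        manifest = ET.fromstring(z.read("META-INF/manifest.xml"))
        for entry in manifest.iter(f"{{{MANIFEST_NS}}}file-entry"):
            path = entry.get(f"{{{MANIFEST_NS}}}full-path")
            media_type = entry.get(f"{{{MANIFEST_NS}}}media-type")
            if media_type != FORMULA_TYPE or not path.endswith("/"):
                continue
            xml_bytes = z.read(path + "content.xml")
            # Keep the object only if its root element really is MathML.
            if ET.fromstring(xml_bytes).tag == f"{{{MATHML_NS}}}math":
                found[path] = xml_bytes.decode("utf-8")
    return found

objects = find_mathml_objects(build_demo_package())
for path, mathml in objects.items():
    print(path, mathml)
```

Checking the root element in addition to the manifest media type is a deliberate belt-and-braces choice here, so that a mislabelled object is skipped instead of being handed to the MathML converter.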