Thursday, 8 November 2012

1st Look: Master's Project

Post GSoC, I've been pretty busy with the current academic semester and more importantly my master's project, which I'm introducing for the first time here:

ProjectX: Mathematical Software for Accurate Surface Representation

Objective: Construction of high-order parameterizations of surfaces from point clouds using Fourier Continuation Analysis

Major Features:
a. GUI-based 3D point cloud manipulation (selection, deletion, rotation, translation, scaling, etc.)
b. XML based project file structure (.prox)
c. Data Import (point clouds, patches, surfaces etc.)
d. Command Line Interface
e. CAD import
f. Generation of high-quality surfaces

Major Applications:
a. Producing surfaces suitable for use in high-order numerical methods, including PDE solvers, integral equation solvers etc.
b. Engineering & scientific projects involving high-performance computing, in areas like medical imaging, geophysical prospecting & defense technologies.

Technologies Used:
a. Language: C++
b. Framework: Qt
c. 3D Graphics: OpenGL

First Look:

ProjectX: view of a project with airplane data imported


It's obviously a work in progress. I've already implemented a lot of features in the software and a lot more are to come. I've also implemented the actual surface generation process in Octave, and will be implementing the same in the software in a much more detailed and extensively parameterized way.

It's a highly application-rich project and a complex one to implement at that. Although there is no end to a project like this, I plan to get it to a high level of capability by the next semester.

Thursday, 16 August 2012

GSoC in AbiWord : Final Report

And here comes the end of this season of Google Summer of Code. It has been simply awesome participating in the program and, moreover, being a part of the awesome AbiWord community.

About the Project: The OpenDocument Math filters and MathML to LaTeX conversion were the major focus of my project before the Midterm, whereas post Midterm I mainly took up the Math import and export filters for the DOCX (OOXML) format. Here is a brief summary of the work done post Midterm :

1. Implemented OMML to MathML & MathML to OMML converters, as AbiWord and Word use MathML and OMML for storing the Math respectively.

2. Created the Math element and Math ListenerState in the OpenXML plugin to implement the import & export of Math.

3. Completely implemented the import/export of Math from/to docx. AbiWord can now read and edit the Math exported by MS Word and similarly MS Word can read and edit the Math exported from AbiWord.

4. With this we now have full Math support for both odt and docx formats \m/

5. Fixed various Windows build specific bugs like the build errors in the OpenDocument and MathView plugins, making the Windows build completely error free so that the next development release can go out.

6. Sorted out various Windows Installer language issues, making the Windows Installer much better in terms of localization.

7. Squashed a few bugs listed as 3.0 blockers, and am working on others as well.

I'm again quite happy with all the work and the learning that has come out of it. It has been a truly wonderful Summer and I plan to continue to work on AbiWord and contribute as much as possible.

I would like to take this opportunity to thank Google for this wonderful program and initiative. I'd also like to thank my awesome mentor Jean Brefort and the other AbiWord developers Marc, Hub, Martin, Chris & Pradeeban for being so helpful and supportive. It's been a pleasure working on AbiWord and I hope to continue to do so.

Sunday, 8 July 2012

GSoC in AbiWord : Midterm Report

Time truly flies & it's Midterm already ! It has been a truly awesome experience working on AbiWord till now and I hope it continues the same way.

The progress made and the work done till now is as follows:

1. Implemented the MathML to LaTeX Converter for the MathML import either directly from MathML/XML or from ODT. It will also be useful for the Math import from DOCX which is going to be the next step.

2. Now we can edit MathML inside AbiWord \m/, which means we can edit the Math Equations from ODT inside AbiWord, as well as those from the MathML/XML files.

3. Fixed Math Object Import (MathML Import) for OpenDocument, which was broken.

4. Fixed errors in Equation Insert from MathML for Windows.

5. Fixed libxslt related issues in the LaTeX plugin, which used to cause a loss of functionality of the plugin especially in Windows.

6. Removed quite a few Windows build related errors and also implemented the Perl script for automatic conversion of PO files to Strings for Windows.
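To give a feel for what the MathML to LaTeX conversion in item 1 involves, here is a toy sketch in Python. It is not the converter used in AbiWord; the function name mathml_to_latex is hypothetical and it only understands a handful of presentation elements (mi, mn, mo, mrow, mfrac, msup, msqrt).

```python
import xml.etree.ElementTree as ET


def mathml_to_latex(node):
    """Toy converter (hypothetical helper, not AbiWord's implementation)."""
    tag = node.tag.split("}")[-1]  # drop a namespace prefix if present
    kids = [mathml_to_latex(child) for child in node]
    if tag in ("mi", "mn", "mo"):
        # Token elements: their text is the symbol itself.
        return (node.text or "").strip()
    if tag in ("math", "mrow"):
        # Horizontal grouping: just place the parts side by side.
        return " ".join(kids)
    if tag == "mfrac":
        return r"\frac{%s}{%s}" % (kids[0], kids[1])
    if tag == "msup":
        return "%s^{%s}" % (kids[0], kids[1])
    if tag == "msqrt":
        return r"\sqrt{%s}" % " ".join(kids)
    raise ValueError("unsupported element: " + tag)


SAMPLE = (
    "<math><mfrac>"
    "<mrow><mi>x</mi><mo>+</mo><mn>1</mn></mrow>"
    "<msup><mi>y</mi><mn>2</mn></msup>"
    "</mfrac></math>"
)

latex = mathml_to_latex(ET.fromstring(SAMPLE))
print(latex)
```

A real converter has to cover far more elements, entities and operator spacing; the recursion over the element tree is the only part this sketch is meant to show.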

I'm happy with the work done till now and the fact that all the milestones planned till the Mid-Term have been achieved. It has been wonderful working with the AbiWord Community who have been very helpful, especially my Awesome Mentor Jean Brefort.

The next step now is to implement the Math import/export from/to docx.

Sunday, 3 June 2012

XML & its use in AbiWord

Extensible Markup Language (XML) is a markup language which defines a set of rules for encoding documents in a format that is both human and machine readable. It is a markup language much like HTML but with a completely different goal: HTML was designed to display data with a focus on how data looks, whereas XML is about transporting and storing data with a focus on what the data is. Unlike HTML, XML has no predefined tags and is designed to be self-descriptive. Since it allows users to define their own tags, a Document Type Definition (DTD) can be declared near the top of the file to specify which elements and attributes are allowed.

Some common constructs that appear in XML:

XML Declaration : <?xml version="1.0" encoding="UTF-8"?> is the XML declaration. It is not required, but it identifies the document as XML and indicates the XML version and the character encoding.

Character : Any XML document is a string of characters and almost every legal Unicode character can appear in it. All XML processors must be able to read entities in both the UTF-8 & UTF-16 encodings.

Markup and Content : The characters in an XML document are divided into Markup and Content, which are distinguished by simple syntactic rules. Strings which constitute Markup either begin with the character < and end with >, or begin with & and end with ; and the strings of characters which are not Markup are the Content.

Tags are the markup constructs which begin with < and end with > (a tag can be a start-tag <block>, an end-tag </block> or an empty-element tag <line-break />). A document component which starts with a start-tag and ends with the matching end-tag, or consists only of an empty-element tag, is called an Element. The characters between the start-tag and the end-tag are the Element’s content, which might contain child elements as well.

Attributes : another markup construct, consisting of a name/value pair placed inside a start-tag or an empty-element tag.

The processor analyzes the markup and passes the structured information to an application. This processor is often called an XML parser. Many word processing programs use XML as their native document format, e.g. our very own AbiWord (.abw documents are XML).
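As a small illustration of a parser handing structured information to an application, here is a sketch using Python's standard xml.etree.ElementTree module. The document string and its tag names (note, to, body, line-break) are made up for this example.

```python
import xml.etree.ElementTree as ET

# A made-up document: one attribute, two elements with content,
# and one empty-element tag.
DOC = (
    '<note lang="en">'
    "<to>AbiWord</to>"
    "<body>Hello <b>XML</b></body>"
    "<line-break />"
    "</note>"
)

root = ET.fromstring(DOC)

# The parser exposes tags, attributes and content as plain Python objects.
print(root.tag, root.attrib)
for child in root:
    print(child.tag, repr(child.text))
```

The empty-element tag shows up as an element whose text is None, while the attribute arrives as an entry in the attrib dictionary.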
 
XML in AbiWord : AbiWord uses a straightforward XML document format in which appearance and layout are specified in CSS-like attributes but only as a starting point. The entire XML source of a document (sample.abw) created in AbiWord which contains the text “AbiWord Rocks !” looks like :

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE abiword PUBLIC "-//ABISOURCE//DTD AWML 1.0 Strict//EN" "http://www.abisource.com/awml.dtd">
<abiword template="false" xmlns:ct="http://www.abisource.com/changetracking.dtd" xmlns:fo="http://www.w3.org/1999/XSL/Format" xmlns:math="http://www.w3.org/1998/Math/MathML" xid-max="2" xmlns:dc="http://purl.org/dc/elements/1.1/" styles="unlocked" fileformat="1.0" xmlns:svg="http://www.w3.org/2000/svg" xmlns:awml="http://www.abisource.com/awml.dtd" xmlns="http://www.abisource.com/awml.dtd" xmlns:xlink="http://www.w3.org/1999/xlink" version="0.99.2" xml:space="preserve" props="dom-dir:ltr; document-footnote-restart-section:0; document-endnote-type:numeric; document-endnote-place-enddoc:1; document-endnote-initial:1; lang:en-US; document-endnote-restart-section:0; document-footnote-restart-page:0; document-footnote-type:numeric; document-footnote-initial:1; document-endnote-place-endsection:0">
<!-- ======================================================================== -->
<!-- This file is an AbiWord document.                                        -->
<!-- AbiWord is a free, Open Source word processor.                           -->
<!-- More information about AbiWord is available at http://www.abisource.com/ -->
<!-- You should not edit this file by hand.                                   -->
<!-- ======================================================================== -->

<metadata>
<m key="abiword.generator">AbiWord</m>
<m key="dc.creator">Prashant</m>
<m key="dc.format">application/x-abiword</m>
</metadata>
<rdf>
</rdf>
<history version="1" edit-time="14" last-saved="1338761497" uid="0e329dea-adc9-11e1-9005-9b3eee35aa57">
<version id="1" started="1338761497" uid="16910e40-adc9-11e1-9005-9b3eee35aa57" auto="0" top-xid="2"/>
</history>
<styles>
<s type="P" name="Normal" followedby="Current Settings" props="font-family:Times New Roman; margin-top:0pt; color:000000; margin-left:0pt; text-position:normal; widows:2; font-style:normal; text-indent:0in; font-variant:normal; font-weight:normal; margin-right:0pt; font-size:12pt; text-decoration:none; margin-bottom:0pt; line-height:1.0; bgcolor:transparent; text-align:left; font-stretch:normal"/>
</styles>
<pagesize pagetype="Letter" orientation="portrait" width="8.500000" height="11.000000" units="in" page-scale="1.000000"/>
<section xid="1" props="page-margin-footer:0.5in; page-margin-header:0.5in">
<p style="Normal" xid="2"><c>AbiWord Rocks !</c></p>
</section>
</abiword>
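
To show how an application might pull the text back out of such a file, here is a rough sketch with Python's xml.etree.ElementTree. The ABW string is a trimmed-down copy of the sample above (most attributes and the metadata are left out), and the helper name paragraphs is hypothetical. The one thing to watch is the default namespace declared on the abiword element, which has to be used when searching for elements.

```python
import xml.etree.ElementTree as ET

# Default namespace declared on the <abiword> root element in the sample.
AWML = "http://www.abisource.com/awml.dtd"

# Trimmed-down version of sample.abw, kept only as far as needed here.
ABW = (
    '<abiword xmlns="http://www.abisource.com/awml.dtd" version="0.99.2">'
    '<section xid="1">'
    '<p style="Normal" xid="2"><c>AbiWord Rocks !</c></p>'
    "</section>"
    "</abiword>"
)


def paragraphs(abw_xml):
    """Return the text of every <p> inside a <section> (hypothetical helper)."""
    root = ET.fromstring(abw_xml)
    ns = {"a": AWML}
    return ["".join(p.itertext()) for p in root.findall(".//a:section/a:p", ns)]


texts = paragraphs(ABW)
print(texts)
```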

The inherent readability of XML makes interchange and format specification much easier. Apart from AbiWord's own format, the OpenDocument Format mentioned in the previous post is XML-based, and Microsoft Office now uses OOXML (a zipped, XML-based file format) as its default format.

Saturday, 2 June 2012

OpenDocument Format (ODF) - a look into ODT

The OpenDocument Format is an XML-based file format for word processing, spreadsheet and presentation documents, and a widely used open alternative to proprietary document formats. It reuses established standards like HTML, SVG, XML, MathML etc.

The most commonly used filename extensions are :

Word Processing - .odt, .fodt
Spreadsheets - .ods, .fods
Presentations - .odp, .fodp
Databases - .odb
Graphics - .odg, .fodg
Formulae - .odf

ODF supports 2 types of document representation:

1. Single XML document / Flat XML / Uncompressed XML - this form is not widely used; its most common file extensions are .xml, .fodt, .fods, etc.

2. ZIP compressed archive - as a collection of several sub documents within a package, each of which stores part of the complete document. Filename extensions used are .odt, .ods, .odp, etc.

The package is a standard ZIP file with a different filename extension and a defined structure of sub-documents, each of which has its own document root and stores a particular aspect of the document. This way it separates the content, styles, metadata & application settings into different XML files.

.odt : this is the most common file extension used for text documents and is a ZIP container with an .odt extension. It actually contains multiple files instead of just one XML file. The zipped set of files and directories includes XML files (content.xml, meta.xml, settings.xml, styles.xml), mimetype, directories (META-INF, Thumbnails, Objects) etc.

Its capabilities include :

Content: OpenDocument’s text content format supports both typical and advanced capabilities. content.xml is the most important file and carries the actual content of the document (except for binary data, such as images)

Metadata: There are predefined metadata elements as well as an option to custom define metadata to store the data about the data. Some predefined metadata fields are Generator, Title, Description, Subject, etc. (meta.xml contains the file metadata)

Style & Formatting : styles.xml contains the style information. OpenDocument makes heavy use of styles for formatting and layout. Style types include Paragraph Styles, Page Styles, Character Styles, Frame Styles etc.

Objects : This is where the Math in ODT comes in. The OpenDocument format can contain 2 types of objects : a) those that have an OpenDocument representation (Formulas - MathML, charts etc.) and b) those that don’t have an XML representation (these only have a binary representation)

The above shows how the OpenDocument format provides a strong separation between the content, layout and metadata.

Some other files and directories include settings.xml (which contains settings such as the zoom factor or the cursor position), mimetype (contains the one-line mimetype of the document), the Thumbnails folder (contains thumbnail.png, which is a representation of the first page), the META-INF folder (which contains manifest.xml, which has information about the files in the package such as the list of all the files, their media types etc.) and several other folders to hold word processing preference data.
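
Since an .odt file is just a ZIP package, its layout can be inspected with any ZIP library. The sketch below uses Python's zipfile module; to stay self-contained it first builds a tiny stand-in package in memory with placeholder contents (make_sample_odt and describe are hypothetical helper names), so it is not a valid document, only a valid container.

```python
import io
import zipfile

ODT_MIMETYPE = "application/vnd.oasis.opendocument.text"


def make_sample_odt():
    """Build a stand-in .odt package in memory (placeholder contents only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as package:
        # The mimetype file is written first, before the XML sub-documents.
        package.writestr("mimetype", ODT_MIMETYPE)
        for name in ("content.xml", "styles.xml", "meta.xml",
                     "settings.xml", "META-INF/manifest.xml"):
            package.writestr(name, "<placeholder/>")
    return buf.getvalue()


def describe(odt_bytes):
    """Return the package mimetype and the list of member names."""
    with zipfile.ZipFile(io.BytesIO(odt_bytes)) as package:
        return package.read("mimetype").decode("ascii"), package.namelist()


mimetype, names = describe(make_sample_odt())
print(mimetype)
print(names)
```

With a real document, passing the bytes of the .odt file to describe would list the same kind of members described above.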

In essence, to import the math from ODT, we need to identify which objects are MathML objects and then convert them to iTeX so that they can be edited inside AbiWord.
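
One possible way to do the identification step is to read META-INF/manifest.xml and pick the entries whose media type marks them as formula objects. The sketch below assumes that the producer labels embedded formulas with the media type application/vnd.oasis.opendocument.formula and uses a hand-written manifest in an in-memory package; the helper names are hypothetical and this is not how the AbiWord importer is written.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

MANIFEST_NS = "urn:oasis:names:tc:opendocument:xmlns:manifest:1.0"
# Assumption: embedded formula objects carry this media type in the manifest.
FORMULA_TYPE = "application/vnd.oasis.opendocument.formula"

SAMPLE_MANIFEST = (
    '<manifest:manifest xmlns:manifest="%s">'
    '<manifest:file-entry manifest:full-path="/" '
    'manifest:media-type="application/vnd.oasis.opendocument.text"/>'
    '<manifest:file-entry manifest:full-path="content.xml" '
    'manifest:media-type="text/xml"/>'
    '<manifest:file-entry manifest:full-path="Object 1/" '
    'manifest:media-type="application/vnd.oasis.opendocument.formula"/>'
    '<manifest:file-entry manifest:full-path="Object 1/content.xml" '
    'manifest:media-type="text/xml"/>'
    "</manifest:manifest>"
) % MANIFEST_NS


def build_sample():
    """In-memory package containing only the hand-written manifest."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as package:
        package.writestr("META-INF/manifest.xml", SAMPLE_MANIFEST)
    return buf.getvalue()


def formula_objects(odt_bytes):
    """Return the package paths of entries flagged as formula objects."""
    with zipfile.ZipFile(io.BytesIO(odt_bytes)) as package:
        root = ET.fromstring(package.read("META-INF/manifest.xml"))
    path_attr = "{%s}full-path" % MANIFEST_NS
    type_attr = "{%s}media-type" % MANIFEST_NS
    return [entry.get(path_attr)
            for entry in root.iter("{%s}file-entry" % MANIFEST_NS)
            if entry.get(type_attr) == FORMULA_TYPE]


found = formula_objects(build_sample())
print(found)
```

The MathML itself would then be read from the content.xml inside each directory returned here.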

Sunday, 27 May 2012

Math in LaTeX


LaTeX is a widely used document markup language and a document preparation system for high-quality typesetting. Based on Donald E. Knuth’s TeX typesetting system, it is used for producing scientific and mathematical documents of high typographic quality. However, it is quite different from word processors such as MS Word or LibreOffice, which use the WYSIWYG approach.

Getting to the basics of it, every LaTeX document must contain the following 3 components (everything else being optional):

\documentclass{article}
\begin{document}
\end{document}

Here the 1st line tells LaTeX the type of document (article, report, book, letter), and the body of the document must occur between \begin{document} & \end{document} commands. Any text after \end{document} is ignored. 

A few commonly used commands: \pagestyle (controls page numbering and headings), \title, \author, \date, \section (creating separate sections), \tableofcontents etc.

Math Mode: LaTeX uses a special math mode to display mathematics, as LaTeX typesets math notations differently than the normal text. Special environments have been declared for this purpose, 3 commonly used environments in math mode :

1. math environment : text formulae are displayed inline (within the body of the text) [ TeX shorthand $....$ ]
2. displaymath environment – displayed formulae are separate from the main text [ TeX shorthand $$....$$ ]
3. equation environment
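
For comparison, the same formula can be written in each of the three environments. This is a minimal sketch; the formula itself is just a placeholder.

```latex
\documentclass{article}
\begin{document}

% 1. math environment: the formula stays inside the line of text
The identity \begin{math} a^2 + b^2 = c^2 \end{math} is set inline.

% 2. displaymath environment: the formula gets its own line, no number
\begin{displaymath}
a^2 + b^2 = c^2
\end{displaymath}

% 3. equation environment: like displaymath, but numbered automatically
\begin{equation}
a^2 + b^2 = c^2
\end{equation}

\end{document}
```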

An example:

\documentclass{article}
\begin{document}
$$ \frac{d}{dx}\left( \int\limits_{0}^{x} f(u)\,du\right)=f(x)$$
\end{document}

will produce :
 

Math symbols : the symbols in a math formula fall into different classes : Ord (simple/ordinary), Op (prefix operator), Bin (binary operator), Rel (relation/comparison), Open (left/opening delimiter), Close (right/closing delimiter), Pun (postfix/punctuation). The main groups of symbols are:

Simple symbols: Latin letters, Arabic numerals (0-9) and Greek letters. Examples of Greek letters in LaTeX are \Gamma, \alpha, \beta etc.
Other alphabetic symbols: \complement, \partial, \daleth etc
Misc. Simple Symbols: \#, \&,  \angle, \infty, \exists, \forall etc.
Binary Operator Symbols: *, +, -, \cdot, \div, \pm, etc.
Relational Symbols: <, =, >, \approx, \gg,  \ll, \prec etc.
Relation Symbols (arrows): \leftarrow, \Leftrightarrow, \rightarrow, \curvearrowleft, \curvearrowright etc.
Relation Symbols (Misc): \parallel, \backepsilon, \because, \in, \mid, \nparallel, \sqsubset, \subset etc.
Cumulative Operators: \int, \oint, \prod, \sum, \bigcap, \bigcup, \bigsqcup, \bigvee, etc.
Punctuation: . / | , ; \colon : ! ?
Pairing delimiters: (, ), [, ], \lbrace, \rbrace, \langle, \rangle, \lceil, \rceil, \lfloor, \rfloor etc.
Non-Pairing Extensible Symbols: \backslash, /, \vert etc.
Extensible vertical arrows: \uparrow, \Uparrow, \downarrow etc.
Accents: \bar{x}, \vec{x}, \dot{x}, \hat{x}, \acute{x}, etc.
Named Operators: \cos, \cot, \det, \lg, \lim, \ln, \log, \inf, \dim, \max, \min, \sin, \sup, \tan etc.

Another Example:

\documentclass{article}
\begin{document}
$$ \left(1+x\right)^n = 1 + nx + \frac{n\left(n-1\right)}{2!}x^2 + \frac{n\left(n-1\right)\left(n-2\right)}{3!}x^3 + \frac{n\left(n-1\right)\left(n-2\right)\left(n-3\right)}{4!}x^4 + \ldots $$
\end{document}

will produce:





Saturday, 19 May 2012

iTeX vs LaTeX

LaTeX is a widely used document markup language and a document preparation system for the TeX typesetting program, whereas iTeX can be seen as a pared-down LaTeX. The differences arise from the gap between the way we write a research paper (long and technical) and the way we put stuff on the web (short and snappy). Essentially iTeX is a pure converter, whereas LaTeX is a mixture of a converter and a renderer (technically LaTeX is a set of rules for converting the input to TeX, which is then rendered by TeX).

iTeX is very similar to standard LaTeX, with a few differences that follow from the fact that iTeX produces MathML.

There are quite a few differences between iTeX and TeX :

1. In iTeX $abc$ would be a single token which when converted to MathML would be <mi>abc</mi>

However $a b c$ would be three tokens which when converted to MathML will be <mi>a</mi><mi>b</mi><mi>c</mi>

but it is important to note that TeX considers both of the above to be the same.

2. Numbers: $10^20$ will be 10^(20) in iTeX whereas it will be 10^(2)0 in LaTeX, so it is always safer to use curly brackets, as in $10^{20}$, to get the same result in both.

3. Whitespace : $a \textrm{ and } b$ will be rendered as "a and b" in LaTeX whereas in iTeX it will be "aandb". The reason is that mtext elements in MathML don’t keep leading and trailing whitespace.

4. As MathML doesn’t know the difference between unary operators and binary relations it is inconvenient for iTeX to do so.

5. iTeX doesn’t parse math if it includes non-ASCII characters

6. It is possible to insert MathML markup inside iTeX equations making “<” and “>” pretty significant. \lt and \gt are used to get less-than and greater-than signs.
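
The tokenization differences in points 1 and 2 can be mimicked with a toy tokenizer in Python. This is only an illustration of the idea, not the lexer of either program: both function names are hypothetical and backslash commands are not handled at all.

```python
import re


def itex_like_tokens(src):
    # Toy rule: a run of letters or a run of digits forms a single token.
    return re.findall(r"[A-Za-z]+|\d+|\S", src)


def tex_like_tokens(src):
    # Toy rule: every letter and every digit is a token of its own.
    return re.findall(r"[A-Za-z]|\d|\S", src)


for source in ("abc", "a b c", "10^20"):
    print(source, itex_like_tokens(source), tex_like_tokens(source))
```

Under the first rule the superscript in 10^20 is followed by the single token 20, while under the second it is followed only by 2, which is why writing 10^{20} is the safe choice.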

A much more detailed look into LaTeX will follow up in the next post.

References:

1. http://www.latex-project.org/guides/

2. http://golem.ph.utexas.edu/~distler/blog/itex2MML.html

Wednesday, 9 May 2012

A look into MathML


Mathematical Markup Language is an application of Extensible Markup Language (XML) for describing mathematical notation and capturing both its structure and content. The main aim of MathML is to integrate math with the World Wide Web. Essentially MathML is for math what HTML is for text. 

As mentioned before, MathML deals with both the structure and the content of a mathematical notation. The structure part is called Presentation MathML and, as the name suggests, it deals with the display of the notation, equation or formula, whereas the content part is called Content MathML and focuses on the semantics.
An example of Presentation MathML for the quadratic formula x = (-b ± √(b^2 - 4ac)) / 2a is :

 <math>
  <mrow>
    <mi>x</mi>
    <mo>=</mo>
    <mfrac>
      <mrow>
        <mrow>
          <mo>-</mo>
          <mi>b</mi>
        </mrow>
        <mo>
          &#xB1;<!--PLUS-MINUS SIGN-->
        </mo>
        <msqrt>
          <mrow>
            <msup>
              <mi>b</mi>
              <mn>2</mn>
            </msup>
            <mo>-</mo>
            <mrow>
              <mn>4</mn>
              <mo>
                &#x2062;<!--INVISIBLE TIMES-->
              </mo>
              <mi>a</mi>
              <mo>
                &#x2062;<!--INVISIBLE TIMES-->
              </mo>
              <mi>c</mi>
            </mrow>
          </mrow>
        </msqrt>
      </mrow>
      <mrow>
        <mn>2</mn>
        <mo>
          &#x2062;<!--INVISIBLE TIMES-->
        </mo>
        <mi>a</mi>
      </mrow>
    </mfrac>
  </mrow>
</math>

As seen above, every valid MathML expression is wrapped in outer <math> tags, which mark each instance of MathML markup within a document.

The presentation elements fall into 2 classes – Token Elements (symbols, numbers, names etc.) and Layout Schemata (which build expressions out of parts and have only elements as their content). Here we are using various token elements such as mi – identifier, mo – operator, mn – number, and general layout schemata elements such as mrow (groups any number of sub-expressions horizontally), mfrac (fraction of 2 sub-expressions), msqrt (square root) and msup (superscript).
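
The split between the two classes can be checked mechanically. The sketch below walks a small Presentation MathML fragment (the markup for b^2 - 4ac, written without the invisible-times operators for brevity) and counts token elements versus layout schemata. The two sets only list the elements mentioned in this post, so they are far from the full MathML vocabulary.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Only the elements discussed in this post; MathML defines many more.
TOKEN_ELEMENTS = {"mi", "mn", "mo", "mtext"}
LAYOUT_ELEMENTS = {"mrow", "mfrac", "msqrt", "msup"}

SAMPLE = (
    "<math><mrow>"
    "<msup><mi>b</mi><mn>2</mn></msup>"
    "<mo>-</mo>"
    "<mn>4</mn><mi>a</mi><mi>c</mi>"
    "</mrow></math>"
)


def classify(xml_text):
    """Count token elements and layout schemata in a MathML fragment."""
    counts = Counter()
    for node in ET.fromstring(xml_text).iter():
        if node.tag in TOKEN_ELEMENTS:
            counts["token"] += 1
        elif node.tag in LAYOUT_ELEMENTS:
            counts["layout"] += 1
    return counts


counts = classify(SAMPLE)
print(dict(counts))
```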

Also, above we can see that b^2 is written using a superscript and that two letters written side by side stand for two variables multiplied together (joined by an invisible-times operator). This shows that presentation markup just holds the structure, and we need content markup to put meaning into the formula.

Content MathML for the same formula would be:

<math>
  <apply>
    <eq/>
    <ci>x</ci>
    <apply>
      <divide/>
      <apply>
        <plus/>
        <apply>
          <minus/>
          <ci>b</ci>
        </apply>
        <apply>
          <root/>
          <apply>
            <minus/>
            <apply>
              <power/>
              <ci>b</ci>
              <cn>2</cn>
            </apply>
            <apply>
              <times/>
              <cn>4</cn>
              <ci>a</ci>
              <ci>c</ci>
            </apply>
          </apply>
        </apply>
      </apply>
      <apply>
        <times/>
        <cn>2</cn>
        <ci>a</ci>
      </apply>
    </apply>
  </apply>
</math>

Content MathML represents mathematical objects as expression trees (i.e. applying an operator to sub-objects). Hence, the terminal nodes represent basic math objects such as numbers, variables etc. and the internal nodes represent mathematical constructions or function applications.

The token elements ci (variables) and cn (numbers) and the predefined function elements eq, divide, plus, minus, root, power and times are used here. And as we can see above, the apply element groups a function with its arguments syntactically.

About thirty-eight of the MathML tags describe abstract notational structures, while about another one hundred and seventy provide a way of unambiguously specifying the intended meaning of an expression.
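
Because Content MathML is an expression tree, it can be evaluated with a short recursive function. The Python sketch below does this for the right-hand side of the formula above (the part under divide), for given values of a, b and c. It is a minimal illustration that supports only the operators appearing in this example, and the name evaluate is hypothetical.

```python
import math
import xml.etree.ElementTree as ET

# Right-hand side of the example above: (-b + sqrt(b^2 - 4ac)) / (2a)
RHS = """
<apply><divide/>
  <apply><plus/>
    <apply><minus/><ci>b</ci></apply>
    <apply><root/>
      <apply><minus/>
        <apply><power/><ci>b</ci><cn>2</cn></apply>
        <apply><times/><cn>4</cn><ci>a</ci><ci>c</ci></apply>
      </apply>
    </apply>
  </apply>
  <apply><times/><cn>2</cn><ci>a</ci></apply>
</apply>
"""


def evaluate(node, env):
    """Evaluate a Content MathML subtree; env maps variable names to values."""
    if node.tag == "ci":                      # terminal node: variable
        return env[node.text.strip()]
    if node.tag == "cn":                      # terminal node: number
        return float(node.text)
    if node.tag != "apply":
        raise ValueError("unsupported element: " + node.tag)
    operator, *arguments = list(node)         # first child names the function
    values = [evaluate(arg, env) for arg in arguments]
    if operator.tag == "plus":
        return sum(values)
    if operator.tag == "times":
        return math.prod(values)
    if operator.tag == "minus":               # unary or binary minus
        return -values[0] if len(values) == 1 else values[0] - values[1]
    if operator.tag == "divide":
        return values[0] / values[1]
    if operator.tag == "power":
        return values[0] ** values[1]
    if operator.tag == "root":                # square root only in this sketch
        return math.sqrt(values[0])
    raise ValueError("unsupported operator: " + operator.tag)


# x^2 - 3x + 2 = 0 has the roots 1 and 2; the plus branch gives the larger one.
result = evaluate(ET.fromstring(RHS), {"a": 1, "b": -3, "c": 2})
print(result)
```

Nothing comparable is possible with the presentation markup alone, since there the same tree only says where the symbols are drawn.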


Monday, 23 April 2012

and here it comes... GSoC 2012 - Accepted \m/\m/\m/

After a long wait and many anxious moments, here it comes... I got selected for GSoC 2012. I'll be working for AbiWord, the supercool cross-platform open source word processor, under the mentorship of Jean Brefort. And my project is to "Implement and Improve the import and export of math from/to odt, doc & docx formats".

A total of 6 students were selected for AbiWord this year :



After I get done with my end semester exams by the 2nd of May, I plan to get into the action with full energy and not only complete my project but contribute as much as possible and, in the process, learn as much as I can.

Looking forward to a Summer full of learning, fun, excitement and a lot of code :)

Friday, 13 April 2012

Let the Fun begin !

This blog is aimed at keeping track of my open source ventures. I've always been awed by the concept of FOSS (free and open source software), but never actually got my own hands dirty. But now with the summer holidays and my ever-growing passion for programming, I'm in and I'm in for good.

I'm starting out by working on AbiWord, an awesome open source cross-platform word processor. I've been aware of its existence and have actually seen people use it on many low-config PCs (those which couldn't handle the heavy requirements of MS Office), but I never really contributed to it. In the process of applying for GSoC 2012, I'm looking into it quite deeply. I've worked on bugs & created a few patches, essentially getting a flavor of the code.

Truth be told, I'm totally impressed by how things work in the AbiWord community, with so many people from different continents working in collaboration. Such is their dedication that, even with full-time jobs and families, they spend a lot of time hacking on AbiWord, all voluntarily. That, I think, is the beauty of open source. I'm loving it, and I think this is something I'm going to sink right into.

The Judgement Day (23rd April) is the day the GSoC results come out (eagerly waiting for it :)). I've applied for the project of improving the math import/export in AbiWord, with the center of attraction being the MathML to iTeX converter, as AbiWord uses iTeX as its math composition language. For instance, AbiWord can currently import the MathML from odt but we can't edit it inside AbiWord.

Even though getting selected will be a great honor and responsibility, I plan on dropping all my other internship options (foreign interns - I'm sorry !) and doing this no matter what, and on using this blog to keep track of my work.