XSLT Expert needed

Closed - This job posting has been filled and work has been completed.

Job Description

I need an XSLT expert to help with some transformation work.

Attached you'll find these documents:

1. inparsed.xml:

This is a document that was exported from MS Word 2010 as "website, filtered" and then cleaned up to be a valid XML document.

2. HtmlToXml.xslt:

An XSLT style sheet that transforms this document into our internal XML.

3. HtmlToXml.xml:

The current XML output using the provided style sheet.

4. HtmlToXml.xsd:

A generated schema describing the result format.

The XSLT is not doing what we want. Because of the odd structure of the HTML exported from Word, it's hard to get a robust solution for a variety of input documents.

The style sheet already creates the correct structure, turning the flat HTML into a true hierarchy. It works well (but not perfectly) for tables.

Tasks:

A) Handle Listings (Word styles named "ListingText")

In the MS Word document we have listings, each of which starts with a caption that has the MS Word style "Listungunterschrift". The listing's content is a sequence of paragraphs with the paragraph-level style "ListingText". The difficulty is that the "ListingText" paragraphs appear neither as children nor as direct siblings of the caption. If one goes up the hierarchy, then down to the next "ListingText", and collects all children, one gets all listings at once. Instead, the selection is supposed to stop at the first item that is not a "ListingText" paragraph.

Desired result:
<Element Type="Listing" Name="Text in Listinguberschrift">
Content with line breaks and preserved spaces
</Element>
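
The stop-at-the-first-different-item rule in task A can be illustrated outside XSLT. The following is a minimal Python sketch of the grouping logic only, not the requested style sheet. It assumes the Word style name is available as a class attribute on each <p> (an assumption; inparsed.xml may carry it differently), and the names collect_listings and SAMPLE are hypothetical. It walks all <p> elements in document order, so it does not matter how deeply each one is nested.

```python
import xml.etree.ElementTree as ET

CAPTION_STYLE = "Listungunterschrift"  # style name as written in the posting
TEXT_STYLE = "ListingText"


def collect_listings(root):
    """Return one (caption, body) pair per listing.

    Assumes each <p> carries its Word style name in a class attribute.
    """
    listings = []
    current = None  # [caption, lines] while a listing is open
    for p in root.iter("p"):  # document order, regardless of nesting
        style = p.get("class")
        text = "".join(p.itertext())
        if style == CAPTION_STYLE:
            current = [text, []]
            listings.append(current)
        elif style == TEXT_STYLE and current is not None:
            current[1].append(text)  # leading spaces are kept as-is
        else:
            current = None  # the first different item closes the listing
    return [(caption, "\n".join(lines)) for caption, lines in listings]


# Hypothetical input: the ListingText paragraphs are neither children
# nor direct siblings of their caption.
SAMPLE = """<body>
<div><p class="Listungunterschrift">Listing 1</p></div>
<div><div><p class="ListingText">a = 1</p></div>
<p class="ListingText">  b = 2</p></div>
<p class="Normal">prose</p>
<p class="ListingText">stray line</p>
<p class="Listungunterschrift">Listing 2</p>
<p class="ListingText">c = 3</p>
</body>"""

result = collect_listings(ET.fromstring(SAMPLE))
```

In this sketch a "ListingText" paragraph that does not follow a caption (the stray line above) is ignored; whether that is the desired behaviour is a design decision for the real style sheet.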

B) Merge Text paragraphs
You see a lot of <Element Type="Text">... in the result file. They originate from the many <p> tags. We want the content of a sequence of <p> elements that are not processed by other templates to end up in one <Element Type="Text">. So a Text element shall be interrupted by Image, Listing, and Table only. The <p> elements shall be preserved and copied to the output, stripping all elements inside them except those handled by other templates.
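
The merging rule in task B amounts to grouping adjacent items of the same kind. Below is a minimal Python sketch of that idea, assuming the input has already been classified into (kind, content) pairs in document order; this representation and the name merge_text_runs are hypothetical. In XSLT 2.0 the comparable construct is xsl:for-each-group with the group-adjacent attribute.

```python
from itertools import groupby


def merge_text_runs(items):
    """Merge adjacent paragraph items into single Text elements.

    items: (kind, content) pairs in document order, where kind is "p"
    for a plain paragraph or "Image", "Listing" or "Table" for an
    element handled by another template. content is assumed to be
    already-escaped markup.
    """
    merged = []
    for kind, group in groupby(items, key=lambda item: item[0]):
        group = list(group)
        if kind == "p":
            # Keep the <p> wrappers, as the posting asks.
            body = "".join(f"<p>{content}</p>" for _, content in group)
            merged.append(("Text", body))
        else:
            merged.extend(group)  # Image, Listing, Table pass through
    return merged


items = [
    ("p", "first"), ("p", "second"),
    ("Table", "t1"),
    ("p", "third"),
    ("Image", "i1"), ("Listing", "l1"),
    ("p", "fourth"), ("p", "fifth"),
]
merged = merge_text_runs(items)
```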

C) Handle images properly
This is a "nice to have" task. We're not sure that's possible in XSLT 2. If not, just skip it. Otherwise, provide a solution for this:
The image tags reference image files on disk. We want to have the image file read from disk, and copied as Base64 encoded content into the body of the <Element Type="image"> elements.
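
Standard XSLT 2.0 can read text files via unparsed-text() but has no built-in way to read binary files, so task C probably needs an extension function or a post-processing step. The Python sketch below only shows what such a step would do; it is not a .NET implementation. The attribute name Source and the function name embed_image are hypothetical, since the posting does not say where the file path is stored.

```python
import base64
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def embed_image(element, base_dir):
    """Put the referenced file's Base64-encoded bytes into the element body.

    Assumes the element stores a path relative to base_dir in a "Source"
    attribute (hypothetical attribute name).
    """
    path = Path(base_dir) / element.get("Source")
    element.text = base64.b64encode(path.read_bytes()).decode("ascii")
    return element


# Usage with a throwaway file standing in for a real image.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "figure.png").write_bytes(b"\x89PNG\r\n")
    image = ET.Element("Element", {"Type": "image", "Source": "figure.png"})
    embed_image(image, tmp)
    encoded = image.text
```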

D) Sections and Numbering
Section headings often start with a number. We want to get rid of the numbering, i.e. remove sequences such as "1.2.1.". The same applies to the captions of Images and Tables.
For instance in the output document you'll find this:
<Element Type="Text">
<strong>Tabelle </strong><strong>4</strong><strong>.</strong><strong>1</strong><strong>  </strong>Schichten des ISO/OSI-Referenzmodells</Element>
<Element Type="Table" Name="Tabelle 4.1  Schichten des ISO/OSI-Referenzmodells">

The complete <Element Type="Text"> element is not necessary here, as the information is already in the <Element Type="Table"> element. We want to get rid of this.

The caption prefix generated by Word, "Tabelle 4.1 ", shall be removed as well.

We assume that it makes sense to provide the word "Tabelle" through a parameter, and the formatting of the numbering as well, because different language versions of Word produce different naming (e.g. in English it would appear as "Table 4-1 ").
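
The parameterised removal described in task D can be sketched with regular expressions. This is a minimal Python illustration under assumptions: the function names and the parameters label and separator are hypothetical, and the patterns cover only the number formats quoted in this posting. XSLT 2.0 offers a comparable replace() function that takes a regular expression.

```python
import re


def strip_caption_prefix(caption, label="Tabelle", separator="."):
    """Remove a leading '<label> <n><separator><n> ' prefix from a caption.

    label and separator play the role of the parameters suggested above,
    e.g. ("Tabelle", ".") for German Word, ("Table", "-") for English Word.
    """
    pattern = (
        r"^\s*" + re.escape(label) + r"\s+"
        r"\d+(?:" + re.escape(separator) + r"\d+)*"
        r"\s*"
    )
    return re.sub(pattern, "", caption, count=1)


def strip_section_number(heading):
    """Remove a leading section number such as '1.2.1.' from a heading."""
    return re.sub(r"^\s*\d+(?:\.\d+)*\.?\s+", "", heading, count=1)


german = strip_caption_prefix("Tabelle 4.1  Schichten des ISO/OSI-Referenzmodells")
english = strip_caption_prefix("Table 4-1 Layers", label="Table", separator="-")
section = strip_section_number("1.2.1. Introduction")
```

One caveat of this sketch: strip_section_number would also remove a number that is genuinely part of a heading, so the real solution may need to rely on Word's heading styles rather than on text patterns alone.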

--
Background information:

The intention of the tool is to provide a clean, robust, fast, and extensible way to transform MS Word documents into our internal, special format (similar to HtmlToXml.xsd). The intermediate step using HTML is just an experiment; a more direct way from WordML, if there is one, would be appropriate as well. The more we can achieve in XSLT, the easier the implementation. But to save time and effort we could extend this with custom functions as well.

The environment it will finally run in is .NET/C# 4.5.
