Colin's Journal

Colin's Journal: A place for thoughts about politics, software, and daily life.

October 21st, 2003

Writing web pages in OpenOffice

Writing web page content in OpenOffice is a lot easier than writing pages in a text editor, even though I’ve been using HTMLText rather than raw HTML. The PubTal OpenOffice plugin (available here) works well enough that I could convert my remaining pages over to using PubTal.

I’ve been avoiding moving the last of my archived content over to PubTal because it’s stuff that I don’t really care about any more. With OpenOffice I could just drag and drop existing pages out of my web browser, and then clean up a few things like the relative links. The main benefit of having done this is that all of the pages on my site now validate, and they are all produced using the same template.

I haven’t decided yet whether to convert other pages from HTMLText to OpenOffice, but it’s tempting for ease of maintenance.

October 18th, 2003

A change in direction

I am having to reconsider the use of AbiWord as an editor for web page content. The reason is not a flaw in the idea itself, but rather the quality of the AbiWord software. Even the stable version (2.0) has some significant bugs that make it untrustworthy for handling important content.

The two most serious problems I’ve hit are:

  • AbiWord sometimes generates invalid XML.
  • AbiWord occasionally writes files that it then cannot read.

I will continue to maintain and distribute the AbiWord plugin in the hope that future versions of the software will address these fatal defects, but I am now going to look at alternative editors.

The most promising is OpenOffice. The software is well maintained by a large team, is regarded as being of high quality, and the file format is very well documented. My initial impression of the file format is that it will be easier to handle than the AbiWord format turned out to be.

The biggest drawback to attempting an OpenOffice PubTal plugin is the huge number of features that OpenOffice has. Most of these features will not translate well into a web page, and so will have to be ignored by the plugin.

October 16th, 2003

First outing of the AbiWord Plugin

I’ve made available the first version of my AbiWord content plugin. This is an experimental release which has undergone light testing. It requires PubTal 2.0: Download AbiWord-Content Plugin.

Features currently supported include:

  • Heading 1, 2 and 3.
  • Text styles (bold, italic, etc).
  • Bullet and Number lists.
  • Margin-left offsets.
  • Hyper-links and anchors.
  • Endnotes and Footnotes (which become Endnotes).
  • Tables.
  • PlainText (which is wrapped in <pre><code>)

Things that I haven’t been able to get working yet:

  • Images. There’s nothing particularly sensible I can do with these because the source file isn’t preserved by AbiWord.
  • Different kinds of list (diamond, etc). A bug in AbiWord stops good XML being produced for these in the version I’m using.
  • Numbered Headings and Sections. The same XML bug is stopping support for these as well.
  • Page headers and footers. I’m not sure what should be done with these, especially as they can change from page to page and section to section.
  • Font name, colour, etc. I could add support for these but I don’t know whether it’s a good idea. In the web world changing font should really be done through a site level CSS style sheet rather than on a per-page basis. I might make this an option in a future version.

If you download and use this plugin please email me and let me know whether it works for you. I’m using AbiWord 1.99.5, and aside from the unsupported stuff, it seems 100% reliable so far.

October 13th, 2003

Writing content for web pages

Last week I wrote a short article on the importance of a template based solution for web page maintenance, and the sort of innovations that could be made to ease template design. At the end of that article I noted two other problems with web publication tools today: markup of the content, and the handling of non-journal style pages. This article addresses some thoughts I’ve had on the first of these two problems.

The most popular template based systems today are those provided by blogging software. They allow an author to enter new web content either using a web browser (thin client) or a small application (fat client). When the author decides to include a link, make some text bold, or apply some other markup, the most common solution is to have them enter HTML codes.

Alternatives to entering the HTML manually include utilising IE specific enhanced textarea widgets, using a different markup language such as Textile, or providing buttons that automatically insert the HTML tags.

The markup in which an author’s content is written, and the markup in which it is published, must be treated separately, even if they happen to be the same. The reason for this is that an evolving web also means evolving markup for web pages, for example the transition between HTML4 and XHTML1. When an author of a site chooses to move their pages from HTML to XHTML the software they use needs to be able to rebuild old pages using XHTML.

For software to be able to perform transformations of markup from one language to another it needs to be able to parse the original markup perfectly. If the original markup is HTML this poses two problems: writing and parsing correct HTML programmatically is fairly difficult, and if users enter markup by hand then there will be errors in it. The inability to convert cleanly to a new publishing markup language is a major defect in all of the blogging tools today that store and accept content using HTML markup. It is a hole that can be coded out of, but never in a 100% satisfactory way.
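As a rough illustration of keeping authoring markup and publishing markup separate, the sketch below stores content once as strictly parsed XML and serialises it as either HTML or XHTML on demand. It uses Python's standard xml.etree.ElementTree module; the function name publish and the sample content are assumptions made for this example, not part of any blogging tool.

```python
import xml.etree.ElementTree as ET

# Stored content: well-formed XML, parsed strictly. A malformed
# document raises ET.ParseError instead of being guessed at.
STORED = "<p>First line<br/>second line</p>"


def publish(source, target):
    """Serialise stored content for the chosen publishing markup."""
    root = ET.fromstring(source)
    if target == "html":
        # HTML serialisation writes void elements without a close tag.
        return ET.tostring(root, encoding="unicode", method="html")
    if target == "xhtml":
        # XML serialisation writes empty elements in self-closed form.
        return ET.tostring(root, encoding="unicode", method="xml")
    raise ValueError(f"unknown target markup: {target}")


print(publish(STORED, "html"))
print(publish(STORED, "xhtml"))
```

Because the stored form parses cleanly, moving a site from one publishing markup to the other is a matter of re-running the serialiser over the same source.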

The solution to this problem requires a combination of three things:

  1. The adoption of a strict, easy to parse, format for authoring content.
  2. The rejection of any content which does not adhere completely to this strict format.
  3. The use of tools, rather than users, to generate this strict format.

Of these three items, the critical piece missing today is the third: a GUI tool that allows the markup of web content in a strict, easy to parse, format. The bare minimum that such a tool should be able to support includes: links, text decoration (bold, italic, etc), lists (bullet and numeric), and images. There are lots of other types of markup which would be very useful (e.g. tables), but for most web content this limited list would suffice. Today many weblog authors have tools and knowledge so limited that they don’t use even the most basic markup in their content. A GUI application supporting these features, and whose output is in a strict format, would be enough to bring painless, sustainable content authoring to a much wider audience.
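The first two requirements (a strict format, and outright rejection of anything that does not fit it) can be sketched in a few lines of Python. The element names, the whitelist, and the ContentRejected exception below are all hypothetical, chosen only to cover the bare-minimum markup listed above; this is not the format of any existing tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical whitelist: element name -> attributes it may carry.
ALLOWED = {
    "content": set(),
    "p": set(),
    "a": {"href"},
    "b": set(),
    "i": set(),
    "ul": set(),
    "ol": set(),
    "li": set(),
    "img": {"src", "alt"},
}


class ContentRejected(Exception):
    """Raised when content does not adhere completely to the format."""


def validate(source):
    """Return the parsed tree, or raise ContentRejected."""
    try:
        root = ET.fromstring(source)
    except ET.ParseError as err:
        raise ContentRejected(f"not well-formed: {err}") from err
    for element in root.iter():
        if element.tag not in ALLOWED:
            raise ContentRejected(f"unknown element <{element.tag}>")
        extra = set(element.attrib) - ALLOWED[element.tag]
        if extra:
            raise ContentRejected(
                f"<{element.tag}> has unknown attributes {sorted(extra)}")
    return root
```

The validator never tries to repair bad input. That is deliberate: content that passes can later be transformed mechanically, and content that fails is sent back to the tool that produced it.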

Writing such a tool, while not technically difficult, does take time and effort. I hope to one day soon find an open source tool to do this. In the meantime however I have a partial solution: AbiWord.

AbiWord is an open source word processor. A word processor isn’t really the best choice of tool for editing web content, simply because it has too many features that are not needed or do not apply to the web. For example AbiWord supports Mail Merge, multiple document sections with different headers and footers on the pages, and other such features that are needed for document creation, but not for editing web content.

Despite these drawbacks the use of AbiWord does bring some significant advantages:

  • It is a full GUI: the user has no opportunity or need to edit markup themselves.
  • It has many useful features such as spell check as you type.
  • The output format is in XML, making it fairly easy to convert into HTML/XHTML or any other markup language.

To see whether or not this can work I’ve written a plugin for PubTal which takes AbiWord documents, converts them to HTML markup, and then publishes them using PubTal templates. There is still much testing to be done, but it now handles: headings, text decoration (bold, italic, underline, strikeout, overline, superscript, subscript), pre-formatted text, hyperlinks, bookmarks (anchors), bullet lists, numeric lists, footnotes/endnotes, and tables.
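To give a feel for why an XML source format makes this kind of conversion tractable, here is a minimal sketch in Python. The sample document is a simplified stand-in loosely modelled on a word-processor file (paragraphs with a style attribute, inline spans with a props attribute); it is an assumption for illustration, not the real AbiWord schema, and the code is not the plugin's actual implementation.

```python
import xml.etree.ElementTree as ET
from html import escape

# Simplified, assumed input shape; real files will differ.
SAMPLE = """<abiword><section>
<p style="Heading 1">A title</p>
<p>Some <c props="font-weight:bold">bold</c> and plain text.</p>
</section></abiword>"""

HEADINGS = {"Heading 1": "h1", "Heading 2": "h2", "Heading 3": "h3"}


def convert_span(span):
    """Convert one inline span, plus the text that follows it."""
    text = escape(span.text or "")
    props = span.get("props", "")
    if "font-weight:bold" in props:
        text = f"<b>{text}</b>"
    if "font-style:italic" in props:
        text = f"<i>{text}</i>"
    return text + escape(span.tail or "")


def convert_paragraph(para):
    """Map a paragraph to a heading or <p> based on its style."""
    tag = HEADINGS.get(para.get("style", ""), "p")
    body = escape(para.text or "")
    body += "".join(convert_span(child) for child in para)
    return f"<{tag}>{body}</{tag}>"


def convert(source):
    root = ET.fromstring(source)
    return "\n".join(convert_paragraph(p) for p in root.iter("p"))


print(convert(SAMPLE))
```

Lists, tables and footnotes would need more bookkeeping, but the shape stays the same: walk a well-formed tree and emit the target markup.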

The biggest missing feature is the ability to include images in the content. The problem here is that AbiWord doesn’t record the original location of the image file – it just places the binary content (encoded using base64) into the XML file. I can probably live with that restriction for most pages, at least until I can find a better solution.
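One possible workaround, sketched here purely as an idea and not something the plugin does, would be to pull the embedded base64 data back out of the XML and write it to a file of its own. The element and attribute names in the sample are assumptions that follow the description above, not a verified schema.

```python
import base64
import xml.etree.ElementTree as ET

# Assumed layout: each embedded image sits in a <d> element carrying
# a name and a base64 flag. Illustrative only.
PAYLOAD = base64.b64encode(b"stand-in image bytes").decode("ascii")
SAMPLE = f"""<abiword>
  <data>
    <d name="image_0" mime-type="image/png" base64="yes">
{PAYLOAD}
    </d>
  </data>
</abiword>"""


def extract_images(source):
    """Return {name: raw bytes} for every base64-encoded data item."""
    root = ET.fromstring(source)
    images = {}
    for item in root.iter("d"):
        if item.get("base64") == "yes":
            # b64decode discards the surrounding whitespace by default.
            raw = base64.b64decode(item.text or "")
            images[item.get("name", "unnamed")] = raw
    return images


for name, raw in extract_images(SAMPLE).items():
    print(name, len(raw), "bytes")
```

The original filename is still lost, so the publishing step would have to invent one (for example from the item name and MIME type).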

October 13th, 2003

Spam in blogs

Spam in blog comments was always inevitable because it brings two benefits to spammers:

  1. It gets lots of people seeing their message (in the same way as Usenet spam).
  2. It spams search engines into giving their website a higher ranking.

As is clear from the discussion on Making Light, it is a losing battle to try to block comment spammers based on their IP addresses.

I’m currently thinking that there are two likely approaches to blocking this kind of spam that might stand a chance. The first approach is to show an image of a random letter in a hard to OCR font, and then ask the user to enter the letter (or series of letters) into the form with their comment. This is used on several large sites today, but I don’t know how effective it actually is.

The second approach would be to apply statistical filtering to comments in the same way as it is used for email. This approach has been very successful in reducing email spam getting into in-boxes as can be seen by the technique’s continued roll-out. It seems like an easy enough extension to apply this kind of filtering to comments in weblogs.
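As a minimal sketch of what statistical filtering of comments could look like, here is a tiny naive Bayes classifier in Python. The class name, the tokeniser, and the four training comments are all made up for the example; a real filter would need far more training data and careful tuning.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lower-case word tokens; a deliberately crude tokeniser."""
    return re.findall(r"[a-z0-9']+", text.lower())


class CommentFilter:
    """Naive Bayes over word counts, with add-one smoothing."""

    def __init__(self):
        self.words = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    def train(self, text, label):
        self.words[label].update(tokens(text))
        self.docs[label] += 1

    def spam_probability(self, text):
        if not self.docs["spam"] or not self.docs["ham"]:
            raise ValueError("train on both spam and ham first")
        vocab = len(set(self.words["spam"]) | set(self.words["ham"]))
        total_docs = self.docs["spam"] + self.docs["ham"]
        scores = {}
        for label in ("spam", "ham"):
            total = sum(self.words[label].values())
            score = math.log(self.docs[label] / total_docs)
            for word in tokens(text):
                count = self.words[label][word]
                score += math.log((count + 1) / (total + vocab + 1))
            scores[label] = score
        # Turn the two log scores into a probability of spam,
        # clamping the exponent to avoid overflow on long comments.
        diff = min(scores["ham"] - scores["spam"], 700.0)
        return 1.0 / (1.0 + math.exp(diff))


comment_filter = CommentFilter()
comment_filter.train("cheap pills buy now", "spam")
comment_filter.train("buy cheap watches online now", "spam")
comment_filter.train("great post about templates", "ham")
comment_filter.train("I agree with your point about XHTML", "ham")

print(comment_filter.spam_probability("buy cheap pills"))
print(comment_filter.spam_probability("nice point about templates"))
```

A weblog tool could hold comments scoring above some threshold for moderation rather than deleting them outright, so that mistakes by the filter remain recoverable.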

I’m sure we’ll hear a lot more about weblog comment spam as time goes on.

October 9th, 2003

Goings on

While it doesn’t strike me as strange that Germany bans heavy lorries from its roads on Sundays, it does seem strange that the government is working hard to maintain this ban. Germany is struggling economically and the government has accepted that it needs to reform labour markets. Yet when presented with a politically easy opportunity to remove an obstacle to growth such as this, it fights to keep it.

In other (rather older) news there’s tax competition at work in Denmark, where tax on alcohol has been reduced significantly. This is in an attempt to reduce the amount of booze bought in the rest of the EU and (legally) imported. We can hold out hope that similar pressure will eventually cap the tax we see on alcohol in the UK as well. (In an ideologically inconsistent fashion I don’t care how high tax on cigarettes gets!)

We could also do with some price competition here in Ontario, where the government run monopoly keeps prices significantly higher than the UK (e.g. £3 a pint!).

October 8th, 2003

The right to vote

Until I moved to Canada I had never really considered the question of when someone should be allowed to vote, and when they shouldn’t. When you are a citizen of a country, and almost all of the people you know are also citizens, the question of eligibility does not arise.

The issue is of particular importance in Latvia because 21% of Latvian residents are not citizens, and they are currently excluded from all elections including local elections. It seems clear to me that when nearly a quarter of the permanent residents in a country are disenfranchised in this way, something needs to change.

As a Brit in Canada I can’t vote in any Canadian election whether National or Provincial. If I “landed” (i.e. became a permanent resident) then I would still be excluded from voting, regardless of how long I lived here, unless I took up Canadian citizenship. Conversely I’m eligible to vote in the UK despite being out of the country for the last 3 years.

In the EU any EU Citizen is allowed to vote (even stand for office) at the local level if they are a resident. This logic hasn’t been extended to voting in national elections, and I doubt it will be any time soon.

With the increased mobility (particularly in Europe) of people between countries I would like to see the right to vote being tied to permanent residency. The latest EU directive on freedom of movement will bring an immediate right to permanent residency for EU citizens after 5 years in a member state. This seems to me like an appropriate length of time before someone is able to make an informed electoral choice in a country.

October 7th, 2003

New release of PubTal and SimpleTAL

PubTal 2.0 and SimpleTAL 3.6 are now available for download! Although the changes to SimpleTAL are minimal I need to do a simultaneous release to support a new feature in PubTal.

PubTal has had many changes made. It now supports XHTML, has a simpler configuration syntax, more content types, and better character set support.

Thanks to Florian Schulze for all of the patches and ideas!

October 6th, 2003

Even static pages are dynamic

Coding web pages is difficult. It has been difficult from the start of the web and has, in some respects, become harder as time has gone on and the technologies involved have grown. The preferred approach to making web site design easier used to be WYSIWYG (what you see is what you get), the idea being that Desktop Publishing was easy for anyone to do, so why shouldn’t web page publishing be the same way?

It is easy to denounce the WYSIWYG approach because of the poor quality HTML that it tends to generate, but this is to ignore its biggest flaw. The problem with using WYSIWYG design is not that the resulting code is a mess, but rather that the result of the design is a page.

The problem with building a web page is that at some point you will want to change the content of that page. Maybe you need to change your contact details that are at the bottom of the page. Maybe the site navigation bar down the side now needs another entry. Or it could simply be time to abandon the dark-purple on black colour scheme that looked so good when you first decided that you had something worth putting on the web.

Regardless of the motivation for wanting to update a web page there will certainly come a time when it needs to be done. If you have one page this isn’t a problem; if you have several hundred then it is. Part of the solution is to separate content from design, to keep the HTML in one place so that changes can be made once. This solution has been known for a long time and yet it has not been a technique that many had access to.
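To make the idea concrete, here is a small sketch of separating content from design using Python's standard string.Template. It is only a stand-in for a real template language; the page data, the build_site function, and the contact address are invented for the example.

```python
from string import Template

# The design lives in exactly one place.
PAGE_TEMPLATE = Template("""<html>
<head><title>$title</title></head>
<body>
<h1>$title</h1>
$body
<p class="contact">Contact: $contact</p>
</body>
</html>""")

# The content knows nothing about the surrounding HTML.
PAGES = {
    "index.html": {"title": "Home", "body": "<p>Welcome.</p>"},
    "about.html": {"title": "About", "body": "<p>About this site.</p>"},
}


def build_site(pages, contact):
    """Render every page through the single shared template."""
    return {
        name: PAGE_TEMPLATE.substitute(page, contact=contact)
        for name, page in pages.items()
    }


site = build_site(PAGES, "someone at example.com")
print(site["about.html"])
```

Changing the contact details, or the whole layout, now means editing one template and rebuilding, however many pages there are.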

The rise of blogging tools has brought this powerful technique to many, at least for journal style web pages such as this. Blogging tools have made the process of publishing on the web easy enough that almost any web reader can now become a web writer, should they choose to do so. There are still however many further improvements that can be made to make the task of publishing on the web easier. As Felix Salmon explains in today’s post, altering the templates of such blogging tools requires a significant technical ability. My own contribution to the ease of web publication, PubTal, certainly requires users to be able to code in HTML in order to generate their own templates.

I think the problem of web page template design can be solved by allowing users to work with components that fit together to form templates. Components can then be designed and built by those who know, or are willing to learn, the technologies behind them. Meanwhile users can mix-and-match components to form individual designs. Here’s an example of how this might work:

  • Have a layout component as the basis for a template design. The layout component defines areas of the screen, for example two columns and a heading, but not the content of those areas. Multiple different layout components can be developed, and users can choose to use any one as the basis for a new template.
  • Multiple “item” components can be produced which interact with the underlying content management system to provide certain pieces of functionality, e.g. links to archived material, or the content of the latest posts. These components can take parameters to allow some limited customisation. Users would then specify which of the item components should go in which parts of the layout (e.g. latest posts goes in the middle column, news snippets followed by my links in the left column, etc).
  • Both the layout and item components would produce HTML with standard class and id attributes, so that the site can be “themed” using CSS (the CSS Zen Garden shows just how far CSS can take you).
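The three points above can be sketched roughly in Python. Everything here (the function names, the area names, the class and id values) is hypothetical and only meant to show how layout and item components might fit together; it does not describe an existing tool.

```python
from html import escape


def latest_posts(posts, limit=2):
    """Item component: the newest post titles, customisable via limit."""
    items = "".join(f"<li>{escape(title)}</li>" for title in posts[:limit])
    return f'<ul class="latest-posts">{items}</ul>'


def links(entries):
    """Item component: a list of (label, url) links."""
    items = "".join(
        f'<li><a href="{escape(url)}">{escape(label)}</a></li>'
        for label, url in entries
    )
    return f'<ul class="links">{items}</ul>'


def two_column_layout(slots):
    """Layout component: defines the areas, not what goes in them."""
    parts = []
    for area in ("heading", "left", "middle"):
        fragments = "".join(slots.get(area, []))
        parts.append(f'<div id="{area}">{fragments}</div>')
    return "\n".join(parts)


# The user's only job: say which item goes in which area.
page = two_column_layout({
    "heading": ["<h1>My journal</h1>"],
    "left": [links([("CSS Zen Garden", "http://www.csszengarden.com/")])],
    "middle": [latest_posts(["First post", "Second post", "Third post"])],
})
print(page)
```

Because every component emits predictable class and id attributes, a separate style sheet can theme the result without touching any of the components.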

Using the scheme outlined here a GUI tool could be developed that allows for easy template design using the drag-n-drop of components. With components being distributed over the ‘net there would soon be a huge variety of template designs possible, without any of the problems of normal WYSIWYG design. The underlying technologies required to develop a system such as this are already in place; it’s just a matter of writing the tools to use them (no small task).

There are at least two other problems with the current crop of web publication tools that I’ve not written about yet: markup of the content, and the handling of non-journal style pages. That’ll have to wait for another day.

September 29th, 2003

Enhancements to PubTal

I’m working on some enhancements to PubTal at the moment, and so far it has been surprisingly easy. The next version will require updates to existing configuration files because I have consolidated a number of the configuration directives.

New features I’ve been able to implement include:

  • XHTML support.
  • Ability to set the template/output-file character set encoding.
  • Ability to specify the template/output content type (e.g. HTML, XHTML).
  • You can now specify a DOCTYPE for XML templates.

The next challenge will be adding the ability to specify an extra plugin directory, and an option to suppress output of the XML declaration for XHTML files (working around a CSS bug in IE 6).
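As a rough sketch of what such output options might look like (this is not PubTal's code, and the function and parameter names are invented), the Python below assembles an XHTML file with a configurable encoding, an optional DOCTYPE, and a switch to leave out the XML declaration. IE 6 is known to drop into quirks mode when anything precedes the DOCTYPE, which is why omitting the declaration helps.

```python
XHTML_STRICT = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
)


def render_xhtml(body, encoding="utf-8", doctype=None,
                 suppress_declaration=False):
    """Assemble the output file as bytes in the requested encoding.

    All parameter names are hypothetical, for illustration only.
    """
    parts = []
    if not suppress_declaration:
        parts.append(f'<?xml version="1.0" encoding="{encoding}"?>')
    if doctype:
        parts.append(doctype)
    parts.append(body)
    return "\n".join(parts).encode(encoding)


BODY = '<html xmlns="http://www.w3.org/1999/xhtml"><head><title>t</title></head><body/></html>'

print(render_xhtml(BODY, doctype=XHTML_STRICT).decode("utf-8"))
print(render_xhtml(BODY, doctype=XHTML_STRICT,
                   suppress_declaration=True).decode("utf-8"))
```

With the declaration suppressed the file starts with the DOCTYPE. The encoding then has to be communicated some other way, for example by the HTTP Content-Type header.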

Copyright 2015 Colin Stewart

Email: colin at owlfish.com