vBulletin 3.x Parser

A user submitted a board that could not be browsed with Owl even though the board runs vBulletin 3.6.0. This is an older version of vBulletin, but the Owl vBulletin 3.x parser should still support it. Investigation turned up an error in the vBulletin parser, and fixing it also required a change to the Owl code itself.

This fix will be available in the next release of Owl (version 0.7.1) within the next couple of weeks. In the meantime some vBulletin 3.x boards will be incompatible with Owl.

For more information on parsers and how they’re used in Owl see our wiki page.

Version 0.7.0

With the release of 0.5.0 on the Mac App Store, some users reported that Owl would not start and would instead crash immediately. This issue has been addressed; version 0.7.0 is currently waiting for review by Apple and should be published in the next few days.

An unsigned version for OS X and Windows can be downloaded here.

Download version 0.7.0.

Version 0.5.0 on Mac App Store

With the first version of Owl now published on the Mac App Store, development of the next release is already under way. Version 0.7.0 will be an iterative release leading up to the full 1.0 release.

We are not planning any major feature additions between versions 0.5.0 and 0.7.0. Version 0.7.0 will mostly consist of bug fixes and stability improvements in the app itself.

In the next several weeks we will be releasing a Road Map for additional features we’d like to add to Owl for version 1.0 and beyond.

 

Parsing HTML in Qt/C++

Parsing HTML is never easy since the markup parsers of commercial browsers are so forgiving of HTML that is simply wrong. Microsoft’s HTML DOM parser worked great but confines the user to Windows. Finding a cross-platform open solution proved to be difficult.

Owl uses three different solutions to accomplish HTML parsing, cleaning and data extraction.


QSgml – The appealing thing about this class was its set of functions that traverse the DOM in search of elements matching a specific query. Hence, your code could easily do something like:
QSgml doc;
doc.parse(html);
doc.getElementsByName("a", &tags);
This code collects all the <a> elements in the document into the QList tags. Even better, QSgml::getElementsByName provides overloads that allow you to specify elements with specific attributes and attributes with specific values.
However, QSgml often breaks when the HTML being parsed has errors such as a mismatched closing tag:
<html>
    <body>
        Happy <b><i>birthday</b>!!! Click <a href="…">here</a>!!</i>
    </body>
</html>
The above HTML causes any call to QSgml::getElementsByName to return an empty list. Even worse, there is no way to tell that QSgml doesn’t like the HTML since QSgml::parse still returns true.

Parsing with this library also proved to be quite slow by comparison, with a 100k document taking nearly 600 milliseconds to parse.
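Since QSgml::parse reports success even for the mismatched sample above, one way to catch the problem early is to run a cheap balance check before parsing. The following is only a sketch using the standard library: tagsAreBalanced is a hypothetical helper, not part of QSgml or Owl, and it assumes no '>' characters inside attribute values or comments and knows only a handful of void elements.

```cpp
#include <cctype>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper (not part of QSgml): returns true when every closing
// tag matches the most recently opened tag and nothing is left open.
// Simplifications: assumes no '>' inside attribute values or comments.
bool tagsAreBalanced(const std::string& html)
{
    // Elements that never take a closing tag (deliberately incomplete list).
    static const std::set<std::string> voidTags = {
        "br", "hr", "img", "input", "meta", "link"};

    std::vector<std::string> open;  // stack of currently open element names
    std::size_t pos = 0;

    while ((pos = html.find('<', pos)) != std::string::npos) {
        const std::size_t end = html.find('>', pos);
        if (end == std::string::npos)
            return false;  // unterminated tag

        const std::string body = html.substr(pos + 1, end - pos - 1);
        pos = end + 1;

        // Skip empty tags, comments, doctype and processing instructions.
        if (body.empty() || body[0] == '!' || body[0] == '?')
            continue;

        const bool closing = body[0] == '/';
        const bool selfClosing = body.back() == '/';

        // Extract the lower-cased element name.
        std::string name;
        for (std::size_t i = closing ? 1 : 0; i < body.size(); ++i) {
            const unsigned char c = static_cast<unsigned char>(body[i]);
            if (!std::isalnum(c))
                break;
            name += static_cast<char>(std::tolower(c));
        }
        if (name.empty())
            continue;

        if (closing) {
            if (open.empty() || open.back() != name)
                return false;  // mismatched closing tag
            open.pop_back();
        } else if (!selfClosing && voidTags.count(name) == 0) {
            open.push_back(name);
        }
    }
    return open.empty();
}
```

For the sample above the check fails at </b>, because <i> is still the innermost open element. A document that fails this check could be routed through a repair step before any query is attempted.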

HTML Tidy – This library turns bad HTML into “good” HTML and can even convert HTML into valid XHTML. The library is relatively easy to use and is easily configurable. The only downside was the size of the library.

Whereas QSgml is four files in total (two .h and two .cpp files), HTML Tidy consists of more than 20 files. Luckily, importing the source into Owl’s existing CMake projects proved to be relatively straightforward, though somewhat tedious.

The results, however, were great! Converting a 100k HTML document into XHTML averaged about 40-50 milliseconds. This is even more impressive when considering that HTML Tidy also builds a DOM tree while it’s doing its “clean and repair”. QSgml, on the other hand, took nearly 600 milliseconds just to parse the same document. Unfortunately, the traversal and query features available in HTML Tidy’s DOM were not as strong as those in QSgml.

tinyxml2 – The initial tests with tinyxml2 showed that parsing a ~100k XML document took roughly 150 milliseconds. Combining the time it takes for HTML Tidy to “clean and repair” and for tinyxml2 to parse, this process ended up averaging about 200-250 milliseconds for a 100k document.
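The post does not say how these timings were taken. One simple way to get comparable per-document averages is a small harness like the sketch below, where averageParseMillis is a hypothetical helper and the parse callback stands in for whichever library call is being measured.

```cpp
#include <chrono>
#include <functional>
#include <string>

// Hypothetical helper: calls `parse` on `document` `runs` times and returns
// the mean wall-clock time per call in milliseconds.
// Returns 0 when `runs` is not positive.
double averageParseMillis(const std::function<void(const std::string&)>& parse,
                          const std::string& document,
                          int runs)
{
    if (runs <= 0)
        return 0.0;

    // steady_clock is monotonic, so system clock adjustments cannot skew it.
    using Clock = std::chrono::steady_clock;

    const auto start = Clock::now();
    for (int i = 0; i < runs; ++i)
        parse(document);
    const std::chrono::duration<double, std::milli> elapsed =
        Clock::now() - start;

    return elapsed.count() / runs;
}
```

Passing the same 100k document to one callback per library (or to a callback that chains the clean-up and parse steps) keeps the comparison on equal footing. Averaging over several runs smooths out one-off effects such as cold caches.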


Owl currently uses all three libraries, although only the latter two are truly necessary. Future versions will remove QSgml entirely and use a combination of HTML Tidy and tinyxml2.

For any projects that do not necessarily need to maintain the byte-for-byte structure of HTML, using a combination of HTML Tidy and tinyxml2 might be the best route.

Mission Statement

For the message board enthusiast who reads multiple message boards on a daily or semi-daily basis, Owl is a message board client that provides an email-client-like interface to read multiple message boards running a variety of message board software.

Unlike similar products such as Tapatalk and ForumRunner, which require plugins and only offer mobile versions, our application will support a variety of message board software without requiring a plugin to be installed on the message board, and will be available on desktop and mobile platforms.

Additionally, Owl will provide scripting support to enable third-parties to develop support for a variety of message boards.