{"id":6,"date":"2014-03-11T20:31:19","date_gmt":"2014-03-12T01:31:19","guid":{"rendered":"http:\/\/www.owlclient.com\/blog\/?p=6"},"modified":"2014-09-23T08:48:01","modified_gmt":"2014-09-23T13:48:01","slug":"parsing-html-in-qtc","status":"publish","type":"post","link":"https:\/\/www.owlclient.com\/blog\/?p=6","title":{"rendered":"Parsing HTML in Qt\/C++"},"content":{"rendered":"<p>Parsing HTML is never easy since the markup parsers of commercial browsers are so forgiving of HTML that is simply wrong. Microsoft&#8217;s HTML DOM parser worked great but confines the user to Windows. Finding a cross-platform open solution proved to be difficult.<\/p>\n<div><\/div>\n<div>Owl uses three different solutions to accomplish HTML parsing, cleaning and data extraction.<\/p>\n<\/div>\n<hr \/>\n<div><a href=\"http:\/\/sourceforge.net\/apps\/mediawiki\/sgml-for-qt\/index.php?title=Main_Page\" target=\"_blank\"><b>QSGML<\/b><\/a>\u00a0&#8211; The appealing thing about this class was its ability to traverse the DOM in search of elements matching a specific query. With it, your code could easily do something like:<\/div>\n<div>\n<div class=\"p1\"><\/div>\n<\/div>\n<div>\n<div class=\"p1\"><span class=\"s1\">QSgml<\/span> doc;<\/div>\n<div class=\"p1\">doc.<span class=\"s2\">parse<\/span>(html);<\/div>\n<div class=\"p1\">doc.getElementsByName(&quot;a&quot;, &amp;tags);<\/div>\n<div class=\"p1\"><\/div>\n<div class=\"p1\">This code collects all the &lt;a&gt; elements in the document into the QList tags. 
Even better, QSgml::getElementsByName provides overloads that allow you to specify elements with specific attributes and attributes with specific values.<\/div>\n<div class=\"p1\"><\/div>\n<div class=\"p1\">However, QSgml often breaks when the HTML parsed has errors such as a mismatched closing tag.<\/div>\n<div class=\"p1\"><\/div>\n<div class=\"p1\">&lt;html&gt;<\/div>\n<div class=\"p1\">\u00a0 \u00a0 &lt;body&gt;<\/div>\n<div class=\"p1\">\u00a0 \u00a0 \u00a0 \u00a0 Happy &lt;b&gt;&lt;i&gt;birthday&lt;\/b&gt;!!! Click &lt;a href=&quot;&#8230;&quot;&gt;here&lt;\/a&gt;!!&lt;\/i&gt;<\/div>\n<div class=\"p1\">\u00a0 \u00a0 &lt;\/body&gt;<\/div>\n<div class=\"p1\">&lt;\/html&gt;<\/div>\n<div class=\"p1\"><\/div>\n<div class=\"p1\">The above HTML causes any use of QSgml::getElementsByName to return an empty list. Even worse, there is no way to tell that QSgml doesn&#8217;t like the HTML since QSgml::parse still returns true.<\/p>\n<p>Parsing with this library also proved to be quite slow by comparison, with a 100k document taking 600 milliseconds to parse.<\/p><\/div>\n<div class=\"p1\">\n<a href=\"http:\/\/tidy.sourceforge.net\/\" target=\"_blank\"><b>HTML Tidy<\/b><\/a><b>\u00a0<\/b>&#8211; This library turns bad HTML into &#8220;good&#8221; HTML and can even convert HTML into valid XHTML. The library is relatively easy to use and is easily configurable. The only downside was the size of the library.<\/p>\n<p>Whereas QSgml is four files in total (two .h and two .cpp files), HTML Tidy consists of more than 20 files. Luckily, importing the source into Owl&#8217;s existing CMake projects proved to be relatively straightforward though somewhat tedious.<\/p>\n<p>The results, however, were great! Converting a 100k byte HTML document into XHTML averaged about 40-50 milliseconds. This is even more impressive when considering that HTML Tidy also builds a DOM tree while it&#8217;s doing its &#8220;clean and repair&#8221;. 
QSgml on the other hand took nearly 600 milliseconds just to parse the same document.\u00a0Unfortunately the features available in the DOM were not as strong as those in QSgml.<\/p>\n<\/div>\n<div class=\"p1\"><a href=\"http:\/\/www.grinninglizard.com\/tinyxml2docs\/index.html\" target=\"_blank\"><b>tinyxml2<\/b><\/a><b>\u00a0<\/b>&#8211; The initial tests with tinyxml2 showed that parsing a ~100k XML document took roughly 150 milliseconds. Combining the time it takes for HTML Tidy to &#8220;clean and repair&#8221; and for tinyxml2 to parse, this process ended up averaging about 200-250 milliseconds for a 100k document.<\/p>\n<hr \/>\n<p>Owl currently uses all three classes while only the latter two are\u00a0truly\u00a0necessary. Future versions will remove QSgml entirely and use a combination of HTML Tidy and tinyxml2.<\/p>\n<p>For any projects that do not necessarily need to maintain the byte-for-byte structure of HTML, using a combination of HTML Tidy and tinyxml2 might be the best route.<\/p><\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Parsing HTML is never easy since the markup parsers of commercial browsers are so forgiving of HTML that is simply wrong. Microsoft&#8217;s HTML DOM parser worked great but confines the user to Windows. 
Finding a cross-platform open solution proved to &hellip;<\/p>\n<p class=\"read-more\"><a href=\"https:\/\/www.owlclient.com\/blog\/?p=6\">Read more &raquo;<\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[1],"tags":[],"_links":{"self":[{"href":"https:\/\/www.owlclient.com\/blog\/index.php?rest_route=\/wp\/v2\/posts\/6"}],"collection":[{"href":"https:\/\/www.owlclient.com\/blog\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.owlclient.com\/blog\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.owlclient.com\/blog\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.owlclient.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=6"}],"version-history":[{"count":1,"href":"https:\/\/www.owlclient.com\/blog\/index.php?rest_route=\/wp\/v2\/posts\/6\/revisions"}],"predecessor-version":[{"id":7,"href":"https:\/\/www.owlclient.com\/blog\/index.php?rest_route=\/wp\/v2\/posts\/6\/revisions\/7"}],"wp:attachment":[{"href":"https:\/\/www.owlclient.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=6"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.owlclient.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=6"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.owlclient.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=6"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}