HTML5 parsed as 4 with uppercase DOCTYPE in HTML Tidy
source link: https://www.ctrl.blog/entry/tidy-html5-doctype.html
![html5-tidy.544x306.png](https://www.ctrl.blog/media/hero/html5-tidy.544x306.png)
- Updated: 2021-02-28
- Published: 2019-05-16
HTML Tidy (libtidy) is a small program for identifying problems with, cleaning up, and producing consistent HTML formatting. It has full support for HTML5 parsing mode as well as the many legacy HTML 4 parsing modes. It can be a great aid for HTML experts and novices alike.
You may remember that I ran into another issue with php-tidy (also built on libtidy) when I tried using it with my old PHP-based content management system. I’ve since migrated to a static-file content management system, so that should have been the end of my problems with libtidy. However, I kept running into a strange bug where some of my documents were detected as HTML 5 but processed as if they were HTML 4.
None of the legacy HTML 4 parsing modes allow modern practices like wrapping a block element like `<h1>` inside what was formerly defined as an inline element like an `<a>` element. It should work in most old web browsers, as web authors had been doing this for years before HTML 5 became a formal standard. But HTML Tidy in HTML 4 mode would try to clean up the mess and break the document in the process.
An HTML document’s parsing mode is detected from the DOCTYPE declaration on the very first line of the document. In standards-compliant HTML 5, the DOCTYPE should always be the case-insensitive string `<!DOCTYPE HTML>`. HTML 4 has multiple modes and many variations on the DOCTYPE declaration, all of which are distinctly different from the HTML 5 DOCTYPE.
After some digging, I realized that HTML Tidy version 5.6.0 didn’t do case-insensitive matching of the DOCTYPE string. `<!DOCTYPE HTML>` (uppercase “HTML”) would always be parsed as HTML 4 whereas `<!DOCTYPE html>` (lowercase “html”) would be parsed as HTML 5. The HTML 5 standard is very clear that the DOCTYPE should be matched case-insensitively. Tidy would correctly indicate that both variants had been detected as HTML 5, but incorrectly state that both variants had used HTML 5 parsing. The casing of “DOCTYPE” had no impact on this issue.
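To make the difference concrete, here is a minimal Python sketch of the two matching strategies. It is not HTML Tidy's actual code (libtidy is written in C); the function names are hypothetical, and the regular expression only recognizes the short HTML 5 DOCTYPE, not the `about:legacy-compat` variant.

```python
import re

# Captures the name that follows the DOCTYPE keyword in a short declaration
# such as <!DOCTYPE html>. Longer HTML 4 declarations (with PUBLIC and a DTD
# identifier) deliberately do not match. The keyword itself is matched
# case-insensitively, mirroring the article's note that its casing was
# never the problem.
_SHORT_DOCTYPE = re.compile(r"\s*<!doctype\s+([^\s>]+)\s*>", re.IGNORECASE)


def doctype_name(markup):
    """Return the name in a short DOCTYPE declaration, or None."""
    match = _SHORT_DOCTYPE.match(markup)
    return match.group(1) if match else None


def is_html5_case_sensitive(markup):
    """Illustrates the described bug: only a lowercase "html" is accepted."""
    return doctype_name(markup) == "html"


def is_html5_case_insensitive(markup):
    """Illustrates what the standard asks for: the name's casing is ignored."""
    name = doctype_name(markup)
    return name is not None and name.lower() == "html"
```

With these definitions, `<!DOCTYPE HTML>` is rejected by the case-sensitive check but accepted by the case-insensitive one, while an HTML 4.01 Strict declaration is rejected by both.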
The obvious work-around for this bug is to always use a lowercase DOCTYPE when passing markup through HTML Tidy. I wrote a little program for myself that lowercases the DOCTYPE of any HTML files that I pass through it to help me avoid this particular issue in the future. It shouldn’t make any difference to web browsers or any other software that will parse the document.
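A work-around along those lines could look roughly like the following Python sketch. This is not the program mentioned above, which isn't shown in the article. The sketch assumes UTF-8 files, touches only the short HTML 5 DOCTYPE at the very start of a file, and leaves the casing of the “DOCTYPE” keyword alone; `lowercase_doctype` is a hypothetical name.

```python
import re
import sys

# Group 1: optional leading whitespace plus the DOCTYPE keyword.
# Group 2: optional whitespace plus the closing ">".
# Only the name "html" between them is rewritten.
_HTML5_DOCTYPE = re.compile(r"\A(\s*<!doctype\s+)html(\s*>)", re.IGNORECASE)


def lowercase_doctype(markup):
    """Return markup with the name in a leading HTML 5 DOCTYPE lowercased."""
    return _HTML5_DOCTYPE.sub(
        lambda m: m.group(1) + "html" + m.group(2), markup, count=1
    )


def fix_file(path):
    """Rewrite the file in place; return True if it was changed."""
    # newline="" keeps the file's original line endings untouched.
    with open(path, encoding="utf-8", newline="") as handle:
        original = handle.read()
    fixed = lowercase_doctype(original)
    if fixed == original:
        return False
    with open(path, "w", encoding="utf-8", newline="") as handle:
        handle.write(fixed)
    return True


if __name__ == "__main__":
    # Every command-line argument is treated as the path of an HTML file.
    for file_path in sys.argv[1:]:
        if fix_file(file_path):
            print("lowercased DOCTYPE in " + file_path)
```

Saved under a name of your choosing, the script would be run over the HTML files before they are handed to HTML Tidy. HTML 4 declarations are left untouched because the pattern requires the closing `>` right after the name.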
This was an annoying issue and I spent way too much time narrowing down why some documents behaved correctly and some were reformatted in strange and unexpected ways. I later realized that this “strange” reformatting was the expected result for HTML 4 strict parsing mode.
On a more positive note: I’m happy with HTML Tidy overall. It has helped me identify and correct dozens of issues with this site already. I’ve also found, reported, or fixed some other minor issues using Tidy with things like HTML5+RDFa extension attributes.
You can follow tidy-html5 issue #815 if you’re interested in updates on the DOCTYPE issue. A tiny patch for the issue is already available. HTML Tidy releases are few and far between so I don’t expect that we’ll see this patch make its way to a release and users any time soon.
You’re only required to use an uppercase “DOCTYPE” keyword when conforming to XML (e.g. XHTML). In all other situations, it’s case-insensitive.
Sources
- The HTML syntax, 2017-12-14, HTML 5.2, W3C Recommendation, W3C
- HTML Tidy version 5.6.0, 2017-11-25, HTACG