In the past few days I’ve been analysing a couple of web apps and websites to check whether tools can really help identify if a website is accessible. I’ve tested Lighthouse, Claude, ARC Toolkit, axe DevTools, and did a quick manual check myself (under 3 minutes) as a comparison. The conclusion: they can be helpful but are not complete, and I was almost always more reliable and significantly faster overall.
This is not meant as self-praise. But if Lighthouse reports a 95% accessibility score on a page where I identified more than 10 WCAG violations in under three minutes, it shows the problem with these scores. The tool says only 5% is missing and shows a green indicator, which (understandably) leads people to interpret this as “we’re doing good enough here”.
I’m going to explain my test with an example page that I don’t want to link to publicly. It looks great, and in parts the accessibility is already there.
Jump directly down to the summary.
Google Lighthouse
Lighthouse reports a single issue on the page: aria-hidden=true elements contain focusable descendants.
This is correct to call out but also wrong in its details: Lighthouse tries to be smart here and suggests that a descendant of the element is the real issue. However, the element in question is an empty a element.
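The flagged pattern might look roughly like this (a hypothetical sketch, not the site’s actual markup):

```html
<!-- Hypothetical sketch: an aria-hidden wrapper that still contains a
     focusable element. Screen readers hide the subtree, but keyboard
     users can still tab into the empty link and land on "nothing". -->
<div aria-hidden="true">
  <a href="/home"></a>
</div>

<!-- One possible fix: also remove the link from the tab order -->
<div aria-hidden="true">
  <a href="/home" tabindex="-1"></a>
</div>
```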
axe DevTools
axe DevTools reports a single issue on the page: ARIA hidden element must not be focusable or contain focusable elements.
ARC Toolkit
The ARC Toolkit reports 2 errors and 3 alerts:
- `aria-hidden` used on focusable: The same issue as the other tools, correctly described.
- Non-active element in tab order: There’s a `tabindex=0` attribute set on the navigation wrapper element, without a `role` assigned and without it being an interactive element. This is also a clear violation of the standards (4.1.2) and should be identified by all tools.
- No `image` role: Two SVGs acting as images have no image role set. This refers to section 1.1.1 Non-text Content of WCAG 2.1. The elements in question are SVG icons with no title or other information attached.
- Missing bypass methods: The page is missing a bypass option, like a "skip to main content" link.
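As hypothetical sketches (not the site’s actual markup), the three new ARC Toolkit findings correspond to patterns like these:

```html
<!-- 4.1.2: tabindex="0" on a non-interactive wrapper without a role -->
<div class="nav-wrapper" tabindex="0">…</div>
<!-- Fix: drop the tabindex, or use a real landmark element instead -->
<nav class="nav-wrapper">…</nav>

<!-- 1.1.1: SVG icon with neither an accessible name nor an image role -->
<svg viewBox="0 0 24 24"><path d="…" /></svg>
<!-- Fix: mark it decorative, or give it a role and a name -->
<svg viewBox="0 0 24 24" aria-hidden="true"><path d="…" /></svg>
<svg viewBox="0 0 24 24" role="img" aria-label="Search"><path d="…" /></svg>

<!-- 2.4.1: a bypass method, e.g. a skip link as the first focusable element -->
<a class="skip-link" href="#main">Skip to main content</a>
```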
Claude AI
As expected, AI is chatty and a bit more complex to handle than classical tools. Since I’m doing an external audit here, I’m relying on the classic Claude AI with the Sonnet 4.5 model instead of Claude Code. The results were quite good once I provided the actual rendered HTML code (its web parser cannot handle the markup):
Overall, it identified 9 critical issues:
- Improper landmark usage, 1.3.1, Level A, 1x
- Missing heading hierarchy, 2.4.6, Level AA, 11x
- Non-descriptive badges, 1.1.1 Level A, 11x
- Decorative logo marked functional, 1.1.1, Level A
- Missing focus indicators, 2.4.7, Level AA
- Keyboard trap potential, 2.1.2, Level A, multiple
- Missing skip link, 2.4.1, Level A
- Insufficient link context, 2.4.4, Level A, 11x
- External link indication, 3.2.4, Level AA, 11x
It was able to prioritise these issues and provide evidence for each, showing the code snippet where the issue occurs along with the relevant success criterion.
So overall it struggled with parsing the entire website and could not detect the actually existing title element, but apart from that it identified a couple of real issues. It also brought up a lot of generic advice I didn’t ask for.
The relevance of the output depends heavily on your ability to specify the prompt. If you simply ask it to check the accessibility of the website, the results are okayish but not great. You need to be very specific here: ask for evidence and tell it not to hallucinate. It is still necessary to re-evaluate the correctness and relevance of the reported problems, as AI does not always interpret things correctly.
Qwen3-vl-4b (Local LLM)
I also tested the same query with a local model. Because I’m feeding a lot of data (and with that, context) to the LLM, I couldn’t choose from many models and went for a Qwen model; they’re usually quite fast and lean (a 4b model) to run locally. This time, it wasn’t: I aborted the task after nearly an hour, when it was only starting to evaluate the code.
It found nearly a hundred issues, and the results were not very useful. The first 52 results are about color contrast, telling me that the text meets the contrast ratio but may not for small font sizes. It was hard to find out which issues could be real ones, so it cost more time than it added value to my evaluation.
Allen AI Olmo 3 32b Think (Local LLM)
This is another, relatively new model with a much larger context window, and much heavier to run. But it’s more powerful and may understand my question better. Looking at its thinking protocol, it’s interesting how much it deliberates about what’s relevant and what’s not. There you can see that we’re still talking about large language models, not brain-like models. For example, it evaluates whether a custom data attribute on a link element is relevant for accessibility or not, something no human would bother with.
Overall, it was not great. It aborted its thinking after a while because it reached the context length of 4096 tokens. On my MacBook Air M3 it’s not a good idea to run a 32b model with a much larger context length. What it discovered up to that point were two possible accessibility issues:
- »Missing ARIA roles or labels for interactive elements: But there are no form controls or interactive elements in the provided code snippet, so this isn't applicable here.«
- »Viewport meta tag missing 'initial-scale=1': While not a WCAG violation per se, best practices recommend including all three attributes: width=device-width, initial-scale=1, and user-scalable=yes/no. The absence of these might not break compliance but could affect mobile usability. However, since the user is strictly asking for WCAG 2.1 violations, this might not count unless it leads to a specific criterion failure.«
The first one is indeed an issue, but the conclusion is wrong: this model is not executing the code and therefore was not able to identify the interactivity of elements properly.
The second one is just a thought so far, without a conclusion, so let’s not go into further detail here. Overall, I can’t say whether this model would be suitable or not, but given that it was still evaluating after 10 minutes, I’d say it’s not worth it as a local assistant.
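For reference, a viewport tag carrying all three attributes the model mentioned would look like the sketch below. One caveat to its advice: `user-scalable=no` blocks pinch-zoom and is widely considered an accessibility anti-pattern (it can fail WCAG 1.4.4 Resize Text), so allowing scaling is the safer choice.

```html
<!-- Viewport tag with the attributes the model mentioned; user-scalable
     is best left at "yes" (or omitted) so users can still zoom. -->
<meta name="viewport" content="width=device-width, initial-scale=1, user-scalable=yes">
```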
Qwen3 Coder 30b (Local LLM)
This was quite fast and not too bad. The model returned the following recommendations after only about 10 seconds:
- »Remove the visually hidden but screen-reader accessible content from Citation 1«
- »Implement proper semantic HTML structure with landmarks and heading hierarchy«
- »Add `alt` attributes to all images in the content«
- »Ensure proper keyboard navigation and focus management«
- »Implement fallback fonts for accessibility«
- »Add proper language attributes throughout the content«
The first one is a weird one I had found manually, too: a visually hidden ARIA live region that announces the company’s name.
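Such a pattern might look roughly like this (a hypothetical sketch; the class name and company name are placeholders):

```html
<!-- Hypothetical sketch: a visually hidden live region announcing the
     company's name to screen readers on load. "Acme Corp" and the
     class name are placeholders, not the audited site's markup. -->
<div class="visually-hidden" aria-live="polite">Acme Corp</div>
```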
The second one is correct; there is nearly no semantic structure, which is one of the biggest issues on the site.
The rest is decent, but the last two are not very relevant, or not even correct in all cases.
Overall, I was impressed by the performance and quality of this LLM and would say it’s a good way to get quick validation feedback. Compared to Claude AI it did not perform worse, while running locally without additional costs and with far better privacy.
Manual check
My manual check took a bit longer than the AI (1 minute 32 seconds), but not much: about 6 minutes. In fact, reading the AI’s answer and verifying it took me almost as long as my manual review.
I easily identified missing landmarks, improper element usage, a missing skip link, missing image declarations, insufficient semantic markup structure (in many places, not only once as the AI and other tools suggest), too-low contrast for focus indicators, missing focus indicators, missing relationships (navigations), and many more.
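Several of these findings come down to basic page scaffolding. A minimal sketch of what landmarks, a skip link, and a heading hierarchy could look like (an assumed structure, not the audited page):

```html
<!-- Hypothetical page skeleton addressing landmarks, bypass, and headings -->
<body>
  <a class="skip-link" href="#main">Skip to main content</a>
  <header>…</header>
  <nav aria-label="Main">…</nav>
  <main id="main">
    <h1>Page title</h1>
    <section aria-labelledby="intro">
      <h2 id="intro">Introduction</h2>
      …
    </section>
  </main>
  <footer>…</footer>
</body>
```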
What’s hard for me in such a quick review is naming the exact success criterion for each of these issues. And to be clear, I also did not write all of this up but simply identified the issues.
Summary
Accessibility tools are valuable for extensive reports and deep-dive analysis. However, they only assist your individual knowledge and can’t replace a proper, manual review entirely.
For a quick check of a website, tools are secondary to me. I first check the markup and do a test with screen reader software; that tells me more about the status quo than any result from automated tools. But they can be a great assistance, if chosen and used properly.
To be honest, I found the results from the Claude LLM pretty good and useful. It’s great to let AI assist you and be a backup or data reference for you.