Broken Links Report - Link Checker Issues - Overview & Recommendations

Preface

This document was prepared to make the BizPaL partners aware of the efforts by BizPaL Support to find a suitable replacement for the software currently used to produce the bi-weekly Broken Links Report.

Some BizPaL partners have experienced frustration with the report and possibly feel that BizPaL Support is unaware of or does not appreciate their concerns in this matter.

History

BizPaL Support has been using Integrity Link Checker for Mac by PeacockMedia since moving from the legacy BizPaL system, which had its own built-in link checker, to ExpressionEngine in April 2011. Integrity Link Checker was one of the few solutions that would properly check the URLs in the ExpressionEngine entries; however, there were and still are known issues with it.

One of these issues is that a number of false positives are consistently reported in the output and make it into the final Broken Links Report, because there is no way of determining which errors are false positives just by looking at the output [1]. The only way to determine whether a URL is a false positive is to verify it manually in a web browser.

Starting in August 2012, in an effort to eliminate some of the false positives, the output from a second link checker, Xenu Link Sleuth, was compared to the output from Integrity Link Checker in order to identify any URLs that were being reported as broken by only one link checker. The logic was that if both link checkers indicated a similar status, then the URL likely was a problem and should be investigated by the BizPaL partner. If the second link checker did not report the same status, the URL might be a false positive and would be verified by BizPaL Support.

This typically resulted in a discrepancy of 300-400 URLs, which BizPaL Support then checked manually in a web browser to determine whether or not they should be on the Broken Links Report. This required a significant amount of time and manual effort during each Broken Links reporting cycle and did not improve the situation, as many false positives were still reported by both link checkers. This double link checking process was ended in October of 2013.

Since that time, BizPaL Support has been relying on the single link checker: Integrity Link Checker for Mac.

Some efforts were made over the years to find another solution. In 2014, a cloud-based solution, LinkTiger, was tried for a month; however, it was quickly found that many of the links in BizPaL were not even being recognized and checked by this program. When LinkTiger's support team was asked why this might be, the question went unanswered. The test use of LinkTiger was discontinued after a month, and it was never used to produce raw data for the Broken Links Report as it was considered too unreliable.

In March 2017, one of the BizPaL partners had their IT department look into the reason why they were receiving so many SSL errors for their links on the report. After BizPaL Support provided them with information on the hardware and software being used to produce the Broken Links Report, they concluded that the software and OS versions were outdated and probably responsible for the SSL errors they were receiving. The computer used to run the link checker was updated (OS as well as Integrity Link Checker software) and while this reduced the number of SSL errors to almost nothing, there were still a number of other false positives on the report.

In May of 2017, BizPaL Support again started looking into alternatives to the link checking solution being used.

Management at ISED indicated a preference for a cloud-based solution that would not be affected by either the speed of the ISED internal network, or the frequent lost connections experienced when using the department's internal Wi-Fi connection. These lost connections would usually necessitate the restart of a process that could take 2 to 3 hours to complete just to get the raw data used to create the final Broken Links Report. Despite the preference for a cloud-based solution, other locally-installed solutions were investigated as well and BizPaL Support asked a limited segment of the BizPaL partners what solution was being used by their own internal IT departments. Five responses were received which included the four listed here, plus Xenu Link Sleuth.

  1. LinkTiger - [cloud-based]
  2. Screaming Frog - SEO Spider - [local install]
  3. Deep Cognition Ltd. - Deep Trawl - [local install]
  4. Inspyder Software Inc. - InSite - [local install]
    • This solution can also perform spell-checking, which would be a benefit when preparing the annual IQM reports.
    • InSite identified several errors with a URL field which had never been picked up by Integrity and these will be investigated by BizPaL Support.

| SOURCE | TYPE OF ERROR | OUTCOME |
|---|---|---|
| LinkTiger | 404-Not found | Reported 77 fewer than Integrity; however some links were actually broken and should have been reported. |
| LinkTiger | 500-Internal server error | Correctly identified several as 404 Not found. |
| LinkTiger | 510-Server error | Largely ignored. |
| LinkTiger | Time-out | Ignored |
| LinkTiger | Too many HTTP redirects | Ignored |
| LinkTiger | Non-HTTP status code (1, 3, 10) | Almost 400 links were reported with one of these unrecognized codes and there appeared to be no consistency in these particular results; some were valid while others were actually broken so it could not be said that all links reported with one of these unrecognized codes should be eliminated from the report. ALL would need to be examined manually. |
| Screaming Frog - SEO Spider | 302-Found; 302-Moved temporarily; 302-Object moved; 302-Redirected | Approximately 1800 links were reported as “302-xxxxxx” by Screaming Frog. There were several links identified as “302-xxxxxx” which were not identified by Integrity and which defaulted to the website's Home page when the target was not found. There were over 100 “302-Found” which should have actually been a “404-Not found” error because it was redirected to a custom error page. There did not seem to be any way to identify these without manually checking them and for obvious reasons, the entire lot could not just be left off the report. |
| Screaming Frog - SEO Spider | 303-See other | Some 200 URLs were reported as “303-See other” where Integrity either reported them as “503-Service unavailable” or did not report them at all because there was actually nothing wrong with them. |
| Screaming Frog - SEO Spider | 404-Not found | The output of the Screaming Frog SEO Spider replaced actual spaces in otherwise good URLs with another character which would then appear as an error on the output report. (It should be noted that best practice is to not use spaces in URLs and partners should encourage their municipalities to follow this rule wherever possible). |
| Deep Trawl | any | The results did not provide the level of detail that is currently provided by Integrity for use in the Broken Links Report: (1) The status of the links reported was not clearly stated (ie: 404-Not found, 500-Internal server error, 502-Bad gateway, etc). (2) The actual location of the problem link was not provided (ie: Permit Cost URL - EN, Permit Form URL - FR, etc). |
| InSite | any | Where the identical link was broken in more than one field in a permit, or in more than one permit, it was often but not always reported only one single time by InSite. It was not consistent and it appeared that while InSite reported only 1 single instance of a particular broken link in Halifax permits where Integrity found 20 (which were manually confirmed as existing and being broken), InSite reported the same identical 6 false positives in Prince Rupert permits as Integrity. |
| InSite | 404-Not found | InSite reported fewer than Integrity however this appeared to be due to the fact that it reported some links only once when it should have been multiple times across multiple permits. (see above) |
| InSite | 404-Not found | Some false positives incorrectly reported by Integrity were also reported by InSite (princerupert.ca). |
| InSite | Time out | Some false positives incorrectly reported by Integrity were not reported by InSite (pcsp.ca, torbay.ca). |
| InSite | mal-formed URL; (http:éé) | InSite correctly identified this error which had not been previously identified by Integrity despite having existed for several weeks. |
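
Several of the outcomes above involve links that answer with a redirect (302 or 303) but land on the website's Home page or a custom error page instead of the requested document. The following Python sketch shows one way such cases might be flagged for manual review. It is a rough heuristic only and is not a feature of any of the products tested; the function name, the `ERROR_MARKERS` list, and the example addresses used with it are assumptions made for illustration.

```python
from urllib.parse import urlsplit

# Hypothetical substrings that often appear in the address of a custom error page.
ERROR_MARKERS = ("404", "error", "notfound", "not-found")


def looks_like_soft_404(requested_url, final_url):
    """Guess whether a redirect hid a missing page.

    requested_url is the link stored in the permit; final_url is the address
    reached after all redirects were followed. Returns True when the redirect
    ended on the site's home page or on something that looks like an error page.
    """
    if requested_url == final_url:
        return False  # no redirect happened, nothing to judge

    requested_path = urlsplit(requested_url).path.rstrip("/")
    final_path = urlsplit(final_url).path.rstrip("/").lower()

    # A deep link that ends up at "/" was probably not found.
    landed_on_home = final_path == "" and requested_path != ""
    # A final address containing an error marker suggests a custom error page.
    landed_on_error_page = any(marker in final_path for marker in ERROR_MARKERS)

    return landed_on_home or landed_on_error_page
```

A heuristic like this will also flag legitimate pages whose address happens to contain one of the markers, so any link it flags would still need to be confirmed in a web browser.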

Conclusion and Steps Forward

In parallel with the tests conducted on alternative link checking solutions, the team also reviewed the ProcessWire template fields where several of the URLs consistently reported by Integrity as being a problem were located. This was to determine if there might be a previously unknown issue with the actual link checker template which could be responsible for the errors. What we found was that the templates were fine, but in some cases the web servers were actually reporting an HTTP status other than '200 – OK' even though these URLs opened properly in a web browser.

Link-checker software can never provide 100% accurate results. We're using software that provides the most reliable results of the solutions we've tested.

There are many factors that can impact the results of an automated link-checker, and they can produce what seem to be false-positive results for many reasons:

  • Connectivity or network conditions could make it impossible to access a site.
  • The website may be undergoing updates/maintenance at the time the link-checker makes the request.
  • The website's server could be overloaded and too busy to handle the link-checker's request (this happens often on websites that have lower-end or shared hosting).
  • The website's server could be attempting to actively block link-checker bots.
  • Linked PDF files could be malformed or partly corrupted (viewable by a visitor with a browser, but not correct enough for evaluation by the link-checker).
  • There could be problems with, or misconfiguration of, the website's server that cause errors behind the scenes (for example, every page on http://www.clarenville.net/ returns a 500 server error, which should prevent the website from loading at all, but the site is still viewable in a browser).
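
Two of the causes listed above (a server that is temporarily unavailable and a server that blocks link-checker bots) can sometimes be told apart from a genuinely broken link by repeating the request with a browser-like `User-Agent` header. The Python sketch below illustrates that idea using only the standard library. It does not describe how Integrity Link Checker or Xenu Link Sleuth work; the function names, the `BROWSER_UA` string, and the three result labels are assumptions made for illustration.

```python
import http.client
import urllib.error
import urllib.request

# Hypothetical browser-like identifier; a real browser sends a longer string.
BROWSER_UA = "Mozilla/5.0 (compatible; link-recheck-sketch)"


def fetch_status(url, user_agent=None, timeout=20):
    """Return the HTTP status code for url, or None if no response arrived."""
    headers = {"User-Agent": user_agent} if user_agent else {}
    request = urllib.request.Request(url, headers=headers)
    try:
        # urlopen follows redirects, so this is the status of the final page.
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered with an error status (4xx/5xx)
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        return None  # time-out, DNS failure, dropped connection, etc.


def recheck(url, fetch=fetch_status):
    """Classify a link by trying it twice, the second time as a 'browser'."""
    if fetch(url) == 200:
        return "ok"
    if fetch(url, user_agent=BROWSER_UA) == 200:
        return "possible false positive"
    return "still failing"
```

The `fetch` parameter exists so that the classification logic can be tried without network access by passing in a stand-in function. A result of "possible false positive" only suggests that the server treats automated requests differently; it does not replace verification in a web browser.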

In light of all of the above, it was determined that no single link checking solution will be able to report on the status of the links in BizPaL with 100% accuracy and that, no matter which solution is used, there will always be links reported as having a problem where none actually exists. Even worse, some links that are truly broken will not be found and will not be reported to BizPaL partners.

Going forward, BizPaL Support is returning to using two link checkers to generate the raw data for the Broken Links Report which is published on the BizPaL Partner Site. Xenu Link Sleuth will be started a short time after Integrity Link Checker to re-check the same URLs. Instead of BizPaL Support manually checking the 300-400 links which appear as discrepancies between the two outputs, a new column will be added to the existing Broken Links Report to also indicate the status returned by Xenu Link Sleuth for each URL.
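
The addition of the second status column described above could be sketched as follows in Python. The actual export formats of Integrity Link Checker and Xenu Link Sleuth are not reproduced here; the `url` and `status` field names, the function names, and the "not reported" label are assumptions made for illustration. The two results are compared on the leading three-digit HTTP code only, since the two programs may word the same status differently.

```python
import re


def status_code(status_text):
    """Extract a leading three-digit HTTP code, or None (e.g. for a time-out)."""
    match = re.match(r"\s*(\d{3})\b", status_text)
    return int(match.group(1)) if match else None


def merge_checker_results(integrity_rows, xenu_rows):
    """Add the Xenu status to each Integrity row, matched on the URL.

    Both inputs are lists of dicts with hypothetical "url" and "status" keys.
    """
    xenu_status = {row["url"]: row["status"] for row in xenu_rows}
    merged = []
    for row in integrity_rows:
        second = xenu_status.get(row["url"], "not reported")
        first_code = status_code(row["status"])
        merged.append({
            "url": row["url"],
            "integrity_status": row["status"],
            "xenu_status": second,
            # Agreement means both checkers returned the same HTTP code.
            "agree": first_code is not None and first_code == status_code(second),
        })
    return merged
```

Rows where `agree` is false correspond to the discrepancies that BizPaL Support previously checked by hand; under the new process they are simply shown to the partner with both statuses side by side.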

This will increase the time it takes to generate the Broken Links Report and may result in delays publishing it on the BizPaL Partner Site.

It will remain the responsibility of the partner to confirm any URLs on the Broken Links Report in their jurisdiction(s), but they will now have the additional information provided by the second link checker regarding the status of a particular URL.

There is of course the possibility that the second link checker might also report the same false positives as Integrity but this is a situation beyond the control of BizPaL Support.

Partners who find links that appear repeatedly on the Broken Links Report may want to approach the municipality involved and advise them of the issue. If the municipality is unable to assist in correcting the issue with the municipal web server, the partner may request that URLs which appear on the report repeatedly and are known to be working properly be omitted from BizPaL Support's internal reporting and noted on the Broken Links Report as such. These links will still appear on the Broken Links Report posted on the BizPaL Partner Site, however, and it remains the responsibility of partners to verify any URLs listed on the Report.

Partners have to request that these URLs be “excluded” as BizPaL Support will not be seeking out these links to identify on the Broken Links Report.

Please review 'Appendix A' for the procedure to be followed, as described in the original email sent 2016-07-18.


Appendix A

Sent: July-18-16 10:07 AM
Subject: [FYI] Change to Broken Links Report | [PVI] Changement au rapport sur les liens brisés

BizPaL partners have often noted that some links which appear repeatedly on the Broken Links Report do actually work when tested in a browser or through the BizPaL client application.

In response to recent requests to remove these links from the Broken Links Report, the report will be modified as follows-

BizPaL partners can request (via email to support@bizpal.ca) that these links be flagged on the Broken Links Report. The link will not actually be removed from the report, but will be flagged as “Omitted from BizPaL internal reporting at the request of the partner. Partner is still responsible for verifying the validity of this link”. The “# Days Broken” column for that link will also be set to “n/a” in the same manner as the following status codes which are not included in our internal reporting:

  • The request timed out
  • Too many HTTP redirects
  • The certificate for this server is invalid [2]

This is the procedure that will be followed-

Partners-

  • Will be responsible for requesting that specific links be flagged on the report (if the partner finds that a link appears repeatedly on the Broken Links Report but works properly in a browser and through the BizPaL client application).
  • Will have to include the line number(s) from the Broken Links Report as well as the permit ID(s) when making their request.
  • Will still be responsible for checking the validity of any flagged link(s) on the Broken Links Report as BizPaL Support will not be manually verifying these links.

BizPaL Support-

  • Will not actually remove the link from the report as we have a responsibility to report any links with problems to the partners, but we will not include those links in our internal reporting and we will not repeatedly remind the partners about them.
  • Will not go looking for links to be flagged on the Broken Links Report and will be depending on partners to report these to us, but we will test those links that are requested by partners in order to verify that they do work in a browser and in the BizPaL client application.
  • Will not flag any link on the report which is found by BizPaL Support to actually be broken.
  • Will not flag any link on a report already generated, but will flag them on reports “going forward” from the time they are reported to us by the partner.

Footnotes

[1] A false positive is a URL that reports some sort of problem to the link checker (e.g., 400 Bad request, 403 Forbidden, 404 Not found, 500 Internal server error, 502 Bad gateway, time-outs, etc.) but works fine when tested in a web browser. False positives can be the result of a web server not accepting requests from web crawlers (like a link checker), or of a number of other causes.

[2] This particular section is no longer valid as these links are not automatically set to “n/a”.


Broken Links Report-Link Checker Issues-Overview & Recommendations - Sep 12, 2017 / 47 Kb.

broken_links_report.txt · Last modified: 2017/09/22 11:42 by Douglas Winmill