This document was prepared to make the BizPaL partners aware of the efforts by BizPaL Support to find a suitable replacement for the software currently being used to produce the bi-weekly Broken Links Report.
Some BizPaL partners have experienced frustration with the report and possibly feel that BizPaL Support is unaware of or does not appreciate their concerns in this matter.
BizPaL Support has been using Integrity Link Checker for Mac by PeacockMedia since April 2011, when BizPaL moved from the legacy system, which had its own built-in link checker, to ExpressionEngine. Integrity Link Checker was one of the few solutions that would properly check the URLs in the ExpressionEngine entries; however, there were, and still are, known issues with it.
One of these issues is that a number of false positives are consistently reported in the output and make it to the final Broken Links Report, because there is no way of determining which errors are false positives just by looking at the output¹. The only way to determine whether a URL is a false positive is to verify it manually in a web browser.
Starting in August 2012, in an effort to eliminate some of the false positives, the output from a second link checker, Xenu Link Sleuth, was compared to the output from Integrity Link Checker in order to identify any URLs that were reported as broken by only one link checker. The logic was that if both link checkers indicated a similar status, the URL likely was a problem and should be investigated by the BizPaL partner. If the second link checker did not report the same status, the URL might be a false positive and would be verified by BizPaL Support.
This typically resulted in a discrepancy of 300-400 URLs, which were then manually checked by BizPaL Support in a web browser to determine whether or not they should be on the Broken Links Report. This required a significant amount of time and manual effort during each Broken Links reporting cycle and did not improve the situation, as many false positives were erroneously reported by both link checkers. This double link checking process was ended in October 2013.
Since that time, BizPaL Support has been relying on the single link checker: Integrity Link Checker for Mac.
Some efforts were made over the years to find another solution. In 2014, a cloud-based solution, LinkTiger, was tried for a month; however, it was quickly found that many of the links in BizPaL were not even being recognized and checked by this program. When BizPaL Support asked the LinkTiger support team why this might be, the question went unanswered. The test use of LinkTiger was discontinued after a month, and it was never used to produce raw data for the Broken Links Report because it was considered too unreliable.
In March 2017, one of the BizPaL partners had their IT department look into the reason why they were receiving so many SSL errors for their links on the report. After BizPaL Support provided them with information on the hardware and software being used to produce the Broken Links Report, they concluded that the software and OS versions were outdated and probably responsible for the SSL errors they were receiving. The computer used to run the link checker was updated (OS as well as Integrity Link Checker software) and while this reduced the number of SSL errors to almost nothing, there were still a number of other false positives on the report.
In May of 2017, BizPaL Support again started looking into alternatives to the link checking solution being used.
Management at ISED indicated a preference for a cloud-based solution that would not be affected by either the speed of the ISED internal network, or the frequent lost connections experienced when using the department's internal Wi-Fi connection. These lost connections would usually necessitate the restart of a process that could take 2 to 3 hours to complete just to get the raw data used to create the final Broken Links Report. Despite the preference for a cloud-based solution, other locally-installed solutions were investigated as well and BizPaL Support asked a limited segment of the BizPaL partners what solution was being used by their own internal IT departments. Five responses were received which included the four listed here, plus Xenu Link Sleuth.
SOURCE | TYPE OF ERROR | OUTCOME |
---|---|---|
LinkTiger | 404-Not found | Reported 77 fewer than Integrity; however, some of the links it did not report were actually broken and should have been reported. |
LinkTiger | 500-Internal server error | Correctly identified several as 404 Not found. |
LinkTiger | 510-Server error | Largely ignored. |
LinkTiger | Time-out | Ignored |
LinkTiger | Too many HTTP redirects | Ignored |
LinkTiger | Non-HTTP status code (1, 3, 10) | Almost 400 links were reported with one of these unrecognized codes and there appeared to be no consistency in these particular results; some were valid while others were actually broken so it could not be said that all links reported with one of these unrecognized codes should be eliminated from the report. ALL would need to be examined manually. |
Screaming Frog - SEO Spider | 302-Found, 302-Moved temporarily, 302-Object moved, 302-Redirected | Approximately 1800 links were reported as “302-xxxxxx” by Screaming Frog. Several links reported as “302-xxxxxx” were not identified by Integrity; these defaulted to the website's Home page when the target was not found. Over 100 links reported as “302-Found” should actually have been “404-Not found” errors, because they were redirected to a custom error page. There did not seem to be any way to identify these without checking them manually, and the entire lot could not simply be left off the report, since it included links that were genuinely broken. |
Screaming Frog - SEO Spider | 303-See other | Some 200 URLs were reported as “303-See other” where Integrity either reported them as “503-Service unavailable” or did not report them at all because there was actually nothing wrong with them. |
Screaming Frog - SEO Spider | 404-Not found | The output of the Screaming Frog SEO Spider replaced actual spaces in otherwise good URLs with another character which would then appear as an error on the output report. (It should be noted that best practice is to not use spaces in URLs and partners should encourage their municipalities to follow this rule wherever possible). |
Deep Trawl | any | The results did not provide the level of detail that is currently provided by Integrity for use in the Broken Links Report: 1. The status of the links reported was not clearly stated (e.g., 404-Not found, 500-Internal server error, 502-Bad gateway, etc.). 2. The actual location of the problem link was not provided (e.g., Permit Cost URL - EN, Permit Form URL - FR, etc.). |
InSite | any | Where the identical link was broken in more than one field in a permit, or in more than one permit, InSite often, but not always, reported it only once. The behaviour was not consistent: InSite reported a single instance of a particular broken link in Halifax permits where Integrity found 20 (all manually confirmed as existing and broken), yet it reported the same 6 false positives in Prince Rupert permits as Integrity. |
InSite | 404-Not found | InSite reported fewer than Integrity; however, this appeared to be because it reported some links only once when they should have been reported multiple times across multiple permits (see above). |
InSite | 404-Not found | Some false positives incorrectly reported by Integrity were also reported by InSite (princerupert.ca). |
InSite | Time out | Some false positives incorrectly reported by Integrity were not reported by InSite (pcsp.ca, torbay.ca). |
InSite | mal-formed URL; (http:éé) | Insite correctly identified this error which had not been previously identified by Integrity despite having existed for several weeks. |
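The note on spaces in the Screaming Frog "404-Not found" row can be illustrated with a short sketch. The report does not say which character Screaming Frog substituted, so the example below only shows the standard alternative: percent-encoding spaces as `%20`, so that the stored URL contains no literal spaces. The function name `encode_spaces` and the example URL are hypothetical and are not part of any tool named in this document.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def encode_spaces(url: str) -> str:
    """Percent-encode spaces (and other unsafe characters) in a URL's path and query.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(url)
    # Keeping "/" and "%" safe leaves path separators and existing
    # escapes such as "%20" untouched, so nothing is encoded twice.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


if __name__ == "__main__":
    print(encode_spaces("http://example.ca/permit forms/application form.pdf"))
```

A URL stored in this encoded form should be read the same way by a browser and by a link checker, which removes one source of disagreement between the two.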
In parallel with the tests conducted on alternative link checking solutions, the team also reviewed the ProcessWire template fields where several of the URLs consistently reported by Integrity as being a problem were located. This was to determine if there might be a previously unknown issue with the actual link checker template which could be responsible for the errors. What we found was that the templates were fine, but in some cases the web servers were actually reporting an HTTP status other than '200 – OK' even though these URLs opened properly in a web browser.
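One way to investigate a URL that opens in a browser but is flagged by the link checker is to request it twice, once with a crawler-style User-Agent and once with a browser-style one, and compare the statuses. The sketch below assumes that the server is filtering on the User-Agent header, which is only one possible cause. Both User-Agent strings and all function names are illustrative and do not reflect how Integrity or Xenu identify themselves.

```python
import urllib.error
import urllib.request

# Illustrative values only; real tools send their own User-Agent strings.
BROWSER_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
CRAWLER_UA = "ExampleLinkChecker/1.0"


def build_request(url, user_agent):
    """Build a HEAD request that identifies itself with the given User-Agent."""
    return urllib.request.Request(
        url, headers={"User-Agent": user_agent}, method="HEAD"
    )


def fetch_status(url, user_agent, timeout=10.0):
    """Return the final HTTP status code, or None if no HTTP response arrived."""
    try:
        with urllib.request.urlopen(build_request(url, user_agent), timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx and 5xx responses are raised as exceptions but still carry a code.
        return err.code
    except OSError:
        # Covers URLError, connection failures and time-outs.
        return None


def looks_like_crawler_block(crawler_status, browser_status):
    """True when only the browser-style request received '200 OK'."""
    return browser_status == 200 and crawler_status != 200
```

Calling `fetch_status(url, CRAWLER_UA)` and `fetch_status(url, BROWSER_UA)` and passing both results to `looks_like_crawler_block` gives a first indication. Some servers reject HEAD requests outright, so a result obtained this way would still need to be confirmed manually.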
Link-checker software can never provide 100% accurate results. We're using software that provides the most reliable results of the solutions we've tested.
There are many factors that can impact the results of an automated link-checker, and they can produce what seem to be false-positive results for many reasons:
In light of all of the above, it was determined that no single link checking solution will be able to report on the status of the links in BizPaL with 100% accuracy. No matter which solution is used, there will always be links reported as having a problem where none actually exists or, even worse, links that are truly broken but are not found and not reported to BizPaL partners.
Going forward, BizPaL Support is returning to using two link checkers to generate the raw data for the Broken Links Report which is published on the BizPaL Partner Site. Xenu Link Sleuth will be started a short time after Integrity Link Checker to re-check the same URLs. Instead of BizPaL Support manually checking the 300-400 links which appear as discrepancies between the two outputs, a new column will be added to the existing Broken Links Report to also indicate the status returned by Xenu Link Sleuth for each URL.
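The additional column can be thought of as a simple lookup of each reported URL in the second checker's results. The sketch below is a minimal illustration that assumes both outputs have already been parsed into dictionaries mapping each URL to a status string. The export formats of Integrity and Xenu are not described in this document, so the function name, field names and sample data are all hypothetical.

```python
def merge_statuses(integrity, xenu):
    """Return one report row per URL flagged by the first checker.

    integrity: dict of URL -> status for links flagged by the first checker.
    xenu: dict of URL -> status for every link the second checker visited.
    Both inputs are assumed to be pre-parsed; names are hypothetical.
    """
    rows = []
    for url, status in sorted(integrity.items()):
        second = xenu.get(url, "not checked")
        rows.append(
            {
                "url": url,
                "integrity_status": status,
                "xenu_status": second,
                # Agreement suggests a real problem; disagreement suggests
                # a possible false positive for the partner to verify.
                "agree": second == status,
            }
        )
    return rows


if __name__ == "__main__":
    sample_integrity = {"http://a.example/x": "404", "http://b.example/y": "500"}
    sample_xenu = {"http://a.example/x": "404", "http://b.example/y": "200"}
    for row in merge_statuses(sample_integrity, sample_xenu):
        print(row)
```

Showing both statuses side by side leaves the judgement with the partner, in place of the earlier manual check of each discrepancy by BizPaL Support.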
This will increase the time it takes to generate the Broken Links Report and may result in delays publishing it on the BizPaL Partner Site.
It will remain the responsibility of the partner to confirm any URLs on the Broken Links Report in their jurisdiction(s), but they will now have the additional information provided by the second link checker regarding the status of a particular URL.
There is of course the possibility that the second link checker might also report the same false positives as Integrity but this is a situation beyond the control of BizPaL Support.
Partners who find links that appear repeatedly on the Broken Links Report may want to approach the municipality involved and advise them of the issue. If the municipality is unable to assist in correcting the issue with the municipal web server, the partner may request that URLs which appear on the report repeatedly and are known to be working properly be omitted from BizPaL Support's internal reporting and noted on the Broken Links Report as such. These links will, however, still appear on the Broken Links Report posted on the BizPaL Partner Site, and it is still the responsibility of partners to verify any URLs listed on the Report.
Partners must request that these URLs be “excluded”; BizPaL Support will not be seeking out such links to flag on the Broken Links Report.
Please review 'Appendix A' for the procedure to be followed as described in the original email sent 2016-07-16.
Sent: July-18-16 10:07 AM
Subject: [FYI] Change to Broken Links Report | [PVI] Changement au rapport sur les liens brisés
BizPaL partners have often noted that some links which appear repeatedly on the Broken Links Report do actually work when tested in a browser or through the BizPaL client application.
In response to recent requests to remove these links from the Broken Links Report, the report will be modified as follows:
BizPaL partners can request (via email to support@bizpal.ca) that these links be flagged on the Broken Links Report. The link will not actually be removed from the report, but will be flagged as “Omitted from BizPaL internal reporting at the request of the partner. Partner is still responsible for verifying the validity of this link”. The “# Days Broken” column for that link will also be set to “n/a” in the same manner as the following status codes which are not included in our internal reporting:
This is the procedure that will be followed-
Partners-
BizPaL Support-
¹ A false positive is a URL for which the link checker reports some sort of problem (e.g., 400 Bad request, 403 Forbidden, 404 Not found, 500 Internal server error, 502 Bad gateway, time-outs, etc.), but which works fine when tested in a web browser. False positives can be the result of a web server not accepting requests from web crawlers (like a link checker), or of a number of other causes.
² This particular section is no longer valid, as these links are not automatically set to “n/a”.
Broken Links Report-Link Checker Issues-Overview & Recommendations - Sep 12, 2017 / 47 Kb.