Advisory: Measures to Improve Stability of eduroam in UK
June 2011
Action to be taken in response to recent service stability issues caused by problematic behaviour by badly set up eduroam participants
Executive Summary
To reflect the increasing importance of the eduroam service to the community, Janet is making enhancements to the national proxy servers which will benefit organisations with well set up eduroam deployments. The enhancements have two components: an improvement to the performance of the servers (multi-threading) and new logic that will automatically send rejects to roaming users belonging to organisations whose organisational RADIUS proxy servers (ORPS) are offline. We are also emphasising the following recommendations on how organisations can improve the reliability of their own eduroam systems, which in turn affects the stability of the service as a whole.
- Ensure that your systems conform to the guidance on the eduroam(UK) website;
- Use the eduroam(UK) QA test to check the operational resilience of your service;
- Consider implementing a resilient RADIUS infrastructure;
- Employ own-service operation monitoring systems.
Increasing Importance of Stability of eduroam Service
eduroam is becoming increasingly important to the Janet community. The majority of the major Higher Education Institutions have now adopted it and actual usage has expanded significantly over the past year. The stability of the service is therefore becoming critical to the community. Furthermore, a number of Janet services, e.g. Janet 3G, are going to depend on the availability of the eduroam service, and there is growing interest in eduroam within local authorities, health and other areas of the public sector.
The service is federated, which means that participating organisations co-operate in order to provide the near-pervasive service, and a fabric of trust exists through which such participants provide the eduroam service to nationally set standards. The federated nature of the service also means that Janet Roaming has limited direct control over final delivery of the service at end sites, although of course we have total control of the national infrastructure of RADIUS servers.
Recent Problems Experienced
As most will be aware, in recent months the national infrastructure of RADIUS servers has experienced episodes of poor performance to the point where Janet Roaming has been unable to provide a responsive proxy service. These events have been caused by a number of major participating organisations creating problems that have compromised the national service we have been able to provide to the other, non-problematic, participants. Faced with an effective denial of service scenario, on a number of occasions eduroam(UK) has had to immediately suspend service to the sites causing the problems (some of them large universities). There have been two main causes of problems:
- Due in part to the massive growth in the service over the past year, and consequently the sheer number of individuals wanting to use it, the national RADIUS proxy servers (NRPS) are now constantly being hit by large numbers of malformed authentication requests from eduroam users (bad realm elements of usernames). It is unavoidable that users will on occasion try to use incorrect credentials, and this error is compounded by certain devices that ‘auto-correct’ or predictively complete words, with the unfortunate result that realm components of usernames are frequently malformed. Usually these bad authentication requests can be handled without difficulty by the NRPS (they drop them and send a response back to the relevant ORPS and thence to the NAS/client). The impact is more severe when a bad non-UK realm is encountered. The NRPS have no way of knowing whether a non-UK realm is valid or invalid, so they must forward the request and wait to hear back from the ETLRs. The NRPS depend on the response of the ETLRs and others in the chain, so they can be tied up for considerable periods (e.g. fred-the-student@ucl.ac.edu gets forwarded to the ETLR and thence to USA/Canada). The net effect of the above is a steady growth in unnecessary load being placed on the NRPS.
- By far and away the more serious problem occurs when an organisation's ORPS goes offline and stops responding to authentication requests. The NRPS, however, continue trying to send authentication requests to such a server. Since up until now we have had to run the NRPS in full debug, and hence single-threaded, mode, and the NRPS use standard RADIUS and hence UDP, each authentication request results in a UDP socket being held open in the UDP buffer. This can rapidly result in a huge amount of NRPS resources being tied up, and consequently the capacity of the NRPS to handle the rest of the community's authentication traffic becomes severely degraded.
Implications for Janet Roaming Participants
As a consequence of the increasing importance of stability of the service, Janet Roaming is having to adopt a much more robust approach towards ensuring that the service can be provided reliably for the majority of eduroam participants. Please be advised that, as per the terms of the eduroam(UK) Service Policy, eduroam(UK) will immediately suspend national RADIUS proxy service to sites causing problems that affect the performance of the service. An email notification will be sent to the ‘service alerts/notifications registered contact’ regarding the suspension. We will then await a response and corrective action. Suspension of service is something that we are reluctant to do and is a measure not undertaken lightly. However, it is hoped that eduroam participants will accept that in certain situations it is a necessary step.
Measures to Reduce Service-Affecting Incidents
In order to reduce the need for us to take such drastic action and to try to help participants avoid becoming the cause of problems that affect the national service, the following measures are proposed:
- eduroam(UK) is implementing a series of measures at national level to boost the processing capability of the national RADIUS servers. This comprises the implementation of new RADIUS traffic statistics gathering tools and the consequent opportunity for us to move to multi-threading operation.
- Automated authentication reject logic will be implemented on the NRPS to deal with transitory instances in which none of a Home site's ORPS respond to authentication requests for its realm. The effect will be that users will be sent Access-Rejects by the NRPS when the Home site ORPS are unresponsive. This will prevent the NRPS from being swamped by authentication requests which have no chance of being processed by the Home site. NB: It should be emphasised that the effect of this measure will not be to deprive users of access to eduroam since the relevant ORPS, being offline, would not have been able to process the request in any case. The NRPS will retry the ORPS after a one-minute back-off for subsequent authentication attempts.
- eduroam Technical Administrators at participating organisations are advised to review the guidance published on the eduroam(UK) website, which helps to ensure that your eduroam implementation conforms to the eduroam(UK) Technical Specification and best practice. In particular you should ensure your system passes the eduroam(UK) QA test, consider the resilience of your ORPS (dual ORPS strongly recommended) and employ your own ORPS-NRPS operational monitoring system.
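As an illustration only, the automated reject behaviour described above can be sketched in a few lines of Python. This is not the NRPS implementation: the names RealmRejectGate, handle_request and forward are hypothetical, replies are modelled as plain strings rather than RADIUS packets, and it is an assumption that the request which first detects the outage is itself answered with an Access-Reject.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional

BACKOFF_SECONDS = 60.0  # the one-minute back-off mentioned in the advisory


@dataclass
class RealmRejectGate:
    """Remembers which Home realms are currently treated as unresponsive.

    Hypothetical helper for illustration; not part of any RADIUS server.
    """
    backoff: float = BACKOFF_SECONDS
    clock: Callable[[], float] = time.monotonic
    _dead_since: Dict[str, float] = field(default_factory=dict, init=False)

    def should_reject(self, realm: str) -> bool:
        marked = self._dead_since.get(realm.lower())
        if marked is None:
            return False
        if self.clock() - marked >= self.backoff:
            # Back-off has expired: allow the next request to retry the ORPS.
            del self._dead_since[realm.lower()]
            return False
        return True

    def mark_unresponsive(self, realm: str) -> None:
        self._dead_since[realm.lower()] = self.clock()

    def mark_responsive(self, realm: str) -> None:
        self._dead_since.pop(realm.lower(), None)


def handle_request(
    gate: RealmRejectGate,
    realm: str,
    forward: Callable[[str], Optional[str]],
) -> str:
    """Return the reply for one authentication request.

    `forward` stands in for proxying to the Home site; it is assumed to
    return None when none of the realm's ORPS answered in time.
    """
    if gate.should_reject(realm):
        # Reject at once instead of tying up resources on a dead ORPS.
        return "Access-Reject"
    reply = forward(realm)
    if reply is None:
        gate.mark_unresponsive(realm)
        return "Access-Reject"  # assumption: the detecting request is rejected too
    gate.mark_responsive(realm)
    return reply
```

During the back-off window no traffic for the affected realm is forwarded, which is what keeps the proxy free for other participants. Once the window has passed, the next request is forwarded as a retry, and a successful reply restores normal handling for that realm.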
For further information please see:
Edward Wincott
Janet Roaming Service Manager
Janet