The root cause of the DNS problems is that “they” (for some value of “they”) have changed not only the entry giving the address for sotadata.org.uk, but also the identity of the name servers for the domain.
The address record has a TTL of 3600, so you would expect it to propagate within an hour - perfectly reasonable.
But the NS record is being served from the parent zone org.uk with a TTL of 172800, which is 2 days. The old nameservers are still serving the old address records (or at least one of them is - the other appears to be down). It is quite legitimate to carry on using these servers until the TTL of the NS record obtained from the parent expires.
I would therefore expect this to take up to 2 days plus 1 hour from the time of the original change to sort itself out. In the meantime, some people will get the old address, some will get the new. (Not counting any additional client-side caching outwith the DNS protocol proper).
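Putting rough numbers on that worst case, using the two TTLs observed above:

```shell
# Worst-case propagation: a resolver may hold the old NS set for up to
# the parent's NS TTL, then the old address record for its own TTL.
ns_ttl=172800   # NS record TTL served by the org.uk parent (2 days)
a_ttl=3600      # address record TTL (1 hour)
echo "$(( (ns_ttl + a_ttl) / 3600 )) hours"   # prints "49 hours"
```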
Doing a DNS change by changing the identity of the name servers is not a clever way to do it. Changing address records is routine; changing the nameservers themselves is not something to do lightly.
Indeed, but you might expect a service provider to be prepared for it. In my work life we never publish the real IP addresses of the machines offering DNS service. It is always a secondary address used for no other purpose. If we have a catastrophic failure of a server, we would just move the secondary address to a different machine and not have to trouble our parent with a change of nameserver.
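A sketch of that failover on Linux, assuming the published service address is an alias (the address 192.0.2.53 and interface name are illustrative, not anyone's real setup):

```shell
# On the failed server, if it is still reachable: release the alias.
ip addr del 192.0.2.53/32 dev eth0

# On the standby server: claim the alias. The NS records registered
# with the parent never change, so no delegation update is needed.
ip addr add 192.0.2.53/32 dev eth0

# Send gratuitous ARP so neighbours update their caches promptly.
arping -c 3 -U -I eth0 192.0.2.53
```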
And of the new nameservers, only one answers authoritatively for sotadata.org.uk, while the other (ns10.aspnethosting.co.uk) does not seem to know about the zone.
If you start from the main SOTA web site http://www.sota.org.uk/ you are led directly to the database in one hop, but there is no direct reference to this reflector at all. You can only get here in two hops via SOTAwatch. How can anybody wanting to do something on the database be reasonably expected to come here first on the off chance that there is an important notice telling them not to?
If I want to use the database, I go straight there. I don’t read the reflector first just in case. I won’t typically see a reflector posting until it comes in the mail digest the next day. Sometimes I have a backlog. In the event I wasn’t caught out, but I could have been quite easily.
The database front end web site was up and running throughout. Could it not have a “maintenance mode” that you could trigger to block user activity when you don’t want it?
I hope this doesn’t sound too grumpy. I know you must have put a huge amount of effort into sorting this problem out, and along with everybody else I am most grateful for that. But please don’t blame users for using something that appears to be working!
What really annoyed me about this outage is that I’m in the middle of a bunch of work to separate out the presentation and app layers, and because it was down, I was stuck twiddling my thumbs. If this had happened two or three weeks from now, then yep, we’d have been sweet.
NO!! Please don’t, as I will guarantee that it will get left in there and forgotten!
When any new combined services for SOTA come along, IP addresses will have to change, DNS entries will have to change, even URLs might have to change - and all of that happening while users still have manual DNS entries in a HOSTS file is going to cause real confusion!
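For anyone unfamiliar, the kind of entry being warned against looks like this (the address is illustrative only):

```
# /etc/hosts on Linux/macOS, C:\Windows\System32\drivers\etc\hosts on Windows
# Pins the name to a fixed address, bypassing DNS entirely until removed:
192.0.2.10    sotadata.org.uk
```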
Please wait, everyone, until Andy says everything is ready for use again. I’m not part of the MT, but I have worked in IT for over 40 years, and while I understand people’s wish to get using the system again, doing so before it’s 100% released by the MT will only cause problems!
Several years ago when 123REG had a major outage many SOTA users used a similar HOSTS-file insertion, and to the best of my knowledge no-one forgot to remove it once the problem was fixed.
I have had other experiences (in business situations) where local IT “gurus” have taken it upon themselves to set up or change HOST files - perhaps it wouldn’t happen with more technical SOTA users, but then again …
Indeed… Let’s hope that is because they are using some kind of firewall/NAT/load balancer in front of the servers, and are trying to save on IP addresses.
I think the last setup was the same, with 2 (different) nameservers but only 1 serving names! I’m happy to admit I’m a DNS user… I know the things I have to do to use someone else’s DNS and I understand the broad principles. The only one I’ve ever set up is on a Raspberry Pi, and that consisted of “apt-get install dnsmasq” and editing the conf file to set the IP base and range handed out. So whether this setup makes sense or not I leave to those who know.
I read the crash as someone applied one of these quality Intel Meltdown/Spectre microcode updates and it goosed the server. MS withdrew one today because Intel’s microcode patch caused more issues than it fixed and Red Hat did likewise a few days back. I think both MS and Red Hat were very sensible in washing their hands of Intel’s awesome failure.
This badness even has a name: it’s called a lame delegation. If you tell the service provider that sotadata.org.uk has a lame delegation to ns10.aspnethosting.co.uk they should be embarrassed and say they’ll fix it right away. If they say “what’s that?”, you know they are clueless.
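One way to check this for yourself, assuming you have dig available (the helper function name is mine):

```shell
# has_aa_flag reads dig output on stdin and reports (via exit status)
# whether the aa (authoritative answer) flag is set in the response header.
has_aa_flag() {
    grep -q 'flags:[^;]* aa[ ;]'
}

# Live check of one delegated server - expect "LAME" while the problem persists:
# dig +norecurse @ns10.aspnethosting.co.uk sotadata.org.uk SOA | has_aa_flag \
#     && echo authoritative || echo LAME
```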
DNS is incredibly resilient, but as with so many things, it works best if you follow the rules. If you don’t, you tend to get apparently random iffyness which gets characterised as “oh, just another DNS glitch”.