
That's the norm in network ops. Automated testing is pretty much impossible, easy rollback may be possible depending on exactly what was screwed, but not always.

Take this for example: it looks like the problem was an unplanned recalculation of routing tables. That's not going to show up on a small-scale test network, and rolling back won't help; indeed, in this case it would likely cause more problems.



One of the reasons I got out of network engineering was how frequently the work I was required to do would cause unintended consequences. You can do all your due diligence, get your work blessed by vendor support, and still get blown up by a bug or undocumented behaviors on a regular basis. The conspiratorial part of my brain says these network device makers intentionally provide unreliable software and terrible documentation to bolster their support contract profits. I was just the guy typing in the commands and getting all the blame.


I remember the first time I got access to an employer's production Cisco router. It’s pretty scary how easy it is to majorly fuck something up.

There isn’t a concept of a transaction or a rollback. You just enter a command, press enter and it’s live.

To counter this we’d write out all the commands we planned on executing and peer review them. Nothing was to be done “on the fly” (at least in theory).

In short, coming from a developer perspective with ample version controls and gated releases… networking is a very wild ride.


> There isn’t a concept of a transaction or a rollback.

Yeah, Cisco gear is bonkers.

Mikrotik has "Safe Mode", which undoes all commands since you entered "Safe Mode" if the connection that created the shell gets interrupted. It has saved my bacon on several occasions, but there are several obvious situations in which you can get yourself locked out.

Juniper gear has "commit confirmed $NUMBER_OF_MINUTES", which will roll back everything since your last commit if you don't do a "commit" within $NUMBER_OF_MINUTES. It will also apply all of the changes you've staged at once (and do configuration sanity checking before it performs the commit).
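To make the confirm-or-revert behaviour concrete, here is a toy Python model of it. This is only a sketch of the semantics described above, not Juniper's implementation: the class `CommitConfirmedDevice`, its method names, and the flat key/value config are all made up for illustration, and time is passed in explicitly instead of using a real timer.

```python
class CommitConfirmedDevice:
    """Toy model of a staged config with a confirm-or-revert timer (hypothetical)."""

    def __init__(self, config):
        self.active = dict(config)     # what the device is "running"
        self.candidate = dict(config)  # staged edits, not yet live
        self._previous = None          # config to restore on timeout
        self._deadline = None          # time (seconds) at which rollback fires

    def set(self, key, value):
        # Staging only: nothing goes live until a commit.
        self.candidate[key] = value

    def commit_confirmed(self, now, minutes):
        # Apply all staged changes at once and arm the rollback timer.
        self._previous = dict(self.active)
        self.active = dict(self.candidate)
        self._deadline = now + minutes * 60

    def commit(self):
        # Confirming disarms the timer; the change becomes permanent.
        self.active = dict(self.candidate)
        self._previous = None
        self._deadline = None

    def tick(self, now):
        # Called as time passes; reverts if nobody confirmed in time.
        if self._deadline is not None and now >= self._deadline:
            self.active = dict(self._previous)
            self.candidate = dict(self._previous)
            self._previous = None
            self._deadline = None


dev = CommitConfirmedDevice({"mgmt_acl": "permit"})
dev.set("mgmt_acl", "deny")            # oops: this would lock us out
dev.commit_confirmed(now=0, minutes=5)
dev.tick(now=300)                      # nobody confirmed within 5 minutes
print(dev.active["mgmt_acl"])          # back to the old value
```

The point of the model is that the operator who just locked themselves out never has to issue the rollback; doing nothing is what triggers it.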

I have no idea how Juniper's rollback works when multiple users are doing simultaneous config editing... maybe don't do that?


> I have no idea how Juniper's rollback works when multiple users are doing simultaneous config editing... maybe don't do that?

You get a warning

    Users currently editing the configuration:
      bob terminal p0....

But the failure here is actually sshing to a network switch in the first place.

Some Cisco kit has RESTCONF, which is better for automation, but it's buggy.


Modern router operating systems have this.

It’s been a long time since I’ve touched IOS-XE (Cisco enterprise gear), but Cisco IOS-XR, Junos, Arista EOS and the Nokia SRs all support some combination of configuration transactions with rollback and commit confirm on a timer.

This definitely doesn’t stop you shooting yourself in the foot, similar to how you can still push broken config to a k8s controller, but it’s some level of protection for certain types of changes.


>"There isn’t a concept of a transaction or a rollback. You just enter a command, press enter and it’s live."

This hasn't been true for a very long time. Juniper routers have rollbacks, commits and revisions:

https://www.juniper.net/documentation/us/en/software/junos/c...

and

https://www.juniper.net/documentation/us/en/software/junos/c...

Cisco has similar:

https://www.cisco.com/c/en/us/td/docs/ios/ios_xe/fundamental...


Except Cisco doesn’t have a commit feature in any of their OSes, and the rollback feature is not implemented everywhere either - NX-OS doesn’t have it, for example. Still, it’s better than the ‘reload in 5’ that we had to use back then.


That's not entirely true. You can roll back a change on modern switches/routers, either via a rollback command or with a revert timer (configure terminal revert timer X). The timer matters because the new configuration might have made the router unreachable, so you're never sure you'll be able to roll back manually if you're working remotely.


Interesting. There's also some stuff in Cisco that can't be done both atomically and remotely, so you may have to push a change as a file to the router and then source the file into the running config with some permutation of `copy`.


Hadn't thought about it from the perspective of support contract profits, but they also have their friendship stick firmly planted in technicians via the semi-required training since as you indicate the manuals are deficient.

At some point network vendors switched manuals from engineers documenting features whitebox to educated techs documenting features blackbox.

There's a clear transition for docs produced after 2008. Before that, more care went into tech notes and interpreting technologies -- after that, you're lucky to even get a complete set of steps and caveats without having to cross-reference bugs, release notes, old manuals, new manuals, draft manuals, reference manuals, licensing manuals, the inevitable errors that appear in the logs, and of course the configuration guide where this should all be in the first place.

In short, yes, this.


> The conspiratorial part of my brain says these network device makers intentionally provide unreliable software and terrible documentation to bolster their support contract profits.

As a dev who has worked at one of the major networking vendors, I can assure you that is not the case. You’d be surprised by how major bugs are handled internally, especially if the bug affects “important” customers.


Networking and storage changes are always butt clenching affairs. Way more stressful than anything else in IT due to their blast radius if something shits the bed.


> That's the norm in network ops. Automated testing is pretty much impossible, easy rollback may be possible depending on exactly what was screwed, but not always.

Ansible/Napalm is a thing in NetOps in some places. Some folks use Eve-ng / GNS3 to spin up virtual networks to test config changes, and it may be possible to do CI/CD changes if you track things in a repo.
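If configs are tracked in a repo, the most basic CI gate is showing reviewers exactly which lines would change. Below is a rough sketch using only Python's standard `difflib`; it does not use Ansible or NAPALM, and the function names `config_diff` and `changed_lines` as well as the sample config text are made up for illustration.

```python
import difflib


def config_diff(running, candidate):
    """Return the unified diff between two config texts as a list of lines."""
    return list(difflib.unified_diff(
        running.splitlines(),
        candidate.splitlines(),
        fromfile="running",
        tofile="candidate",
        lineterm="",
    ))


def changed_lines(diff):
    """Keep only added/removed lines, skipping the two file-header lines."""
    return [line for line in diff[2:] if line.startswith(("+", "-"))]


# Hypothetical configs: the only intended change is the MTU.
running = "hostname r1\ninterface eth0\n mtu 1500\n"
candidate = "hostname r1\ninterface eth0\n mtu 9000\n"

for line in changed_lines(config_diff(running, candidate)):
    print(line)
```

A pipeline could fail the job when `changed_lines` returns anything outside what the change request says it touches. That catches fat-finger edits, though it says nothing about how the network will behave once the change is live.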

Juniper JunOS has auto-rollback if you don't confirm the change after "x" minutes:

* https://www.juniper.net/documentation/us/en/software/junos/c...

So if you did something that causes breakage and disconnection from the router, you (ideally) don't have to do anything but wait it out.


Emulating even a mid-sized network in GNS3 requires massive resources, and my Cisco account manager doesn't seem to even get why I'd want to deploy a test system of 50 different multi-vendor switches (and key supporting services like syslog and TACACS) with Terraform, run some tests, apply a configuration change, and run more tests.

And virtual switches aren't the same as physical switches in any case, they have different bugs, different features, different responsiveness.


commit confirmed is such a life-saver. I ran a production network which spanned multiple continents and even though I probably only ever actually needed commit confirmed a single digit number of times, the fact that it was there made every change I did 99% less stressful. I knew that even if I made a mistake, all I had to do was wait 5-10 minutes and it would all revert.

Compare this to my Cisco/Foundry/other experience, where I would delay changes until I was in the office (physically colocated with the main routers) or call people to be onsite for what was 99% of the time an innocuous change. The stress of it led to me deferring changes or just skipping them entirely, which led to more issues/stress/etc.

I'm really not sure there is a single software feature which improved my life as much as "commit confirmed".


So instead of one ripple across your BGP network, you have two as it rolls back the change?

The problem is that the state in routing tables isn't stored in a single location; it's dynamically built over time. Breaking a single router in the wrong way can break that state, and there's no rollback of that state.


> So instead of one ripple across your BGP network, you have two as it rolls back the change?

It's possible, it depends on what the nature of the change is. If you use super short commit confirmed intervals (commit confirmed 1) then yes you can cause a situation where you revert a "good" commit and cause a second disturbance. You need to intelligently reason about commit confirmed times to consider this when you're making such changes.
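One way to reason about the interval is as back-of-the-envelope arithmetic: the timer has to outlast both the time the network needs to settle and the time you need to check it, plus some slack. The helper below is only a sketch; the function name, the inputs, and the 1.5x safety margin are assumptions for illustration, not vendor guidance.

```python
import math


def min_confirm_minutes(convergence_s, verify_s, margin=1.5):
    """Smallest whole-minute timer covering convergence plus verification.

    convergence_s: estimated seconds for routing to settle (assumed input)
    verify_s:      estimated seconds to run your checks (assumed input)
    margin:        safety multiplier, an arbitrary choice here
    """
    return max(1, math.ceil((convergence_s + verify_s) * margin / 60))


# E.g. 90 s to converge and 120 s to verify: (90 + 120) * 1.5 = 315 s.
print(min_confirm_minutes(90, 120))
```

With those example numbers a "commit confirmed 1" would fire the rollback in the middle of verification, which is exactly the second disturbance described above.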



