Friday, November 4, 2022

Netflix model for ammonia control

(posted Apr 20, 2023)

I'll explain here why I think the "Netflix/SaaS" model for industrial control is disastrous.

With Netflix, the "brain" of the system is the software that runs in "the cloud" -- on "someone else's computer", on a giant farm of computers in Seattle, WA, or Ireland, or who knows where. The software orchestrates the experience for hundreds of millions of subscribers, relying on an enormous database of movies and of the people who watch them and interact with each other.

The only hardware you need besides your TV display is a device connected to the internet, maybe a Roku stick, that listens to what the brain in the cloud says and streams the pixels and audio to your TV. That device combined with your TV is like a dumb terminal for the mainframe.

The upsides are huge. To begin with, there is no DVD player to hook up to your TV and no DVDs to keep around. But more than that, you have a near-infinite selection of movies available with a few clicks, and the company can update the movie database in the cloud behind the scenes and add new navigation and search features both in the cloud and in your locally running software. You can see what other people watch, get recommendations, rate, read reviews, comment, and so on. It's a huge network of movie watchers, hundreds of millions of them, consuming content and interacting with each other through the cloud. You don't have to do anything to get all the latest and greatest stuff (except pay monthly for the software service). You can even stream through your Xbox or Nintendo Switch if for whatever reason you'd want to. This model is well suited here because the environment, which consists of countless people who make, advertise, and distribute movies and countless people who watch different movies (almost) every time and talk about them, is, just as with music and video games, extremely dynamic.

The downsides are small: if your internet is out, you won't be able to watch a movie (and don't count on Netflix for DVDs come September). If the company accidentally releases faulty software, either in the cloud or in an update to your local hardware, you may not be able to log in or stream right then, but that's not a big deal, for obvious reasons: you'll go do something else until the service is back.

Now consider what happens when you apply the Netflix model to industrial cold storage facilities.

Today, these facilities typically have a Windows PC -- or several, so that if one crashes another takes over -- running robust, time-tested, battle-hardened industrial control software. From these PCs, network wires run into a hub, and from the hub more wires go to controllers that send electrical signals to giant compressors, blast fans, ammonia tanks, and other hazardous equipment. When you press a button on the screen to open or close a valve, there is a reliable electrical connection between your PC and the end equipment, which is often in the same building as you.

(Often, control goes from the PC through a smaller, tighter, ultra-reliable controller called a PLC -- a programmable logic controller -- which breaks the PC's more complex requests down into simpler ones: instead of the PC talking to the equipment directly, the PC tells the PLC what to do and the PLC tells the equipment what to do.)
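To make the chain of command concrete, here is a minimal sketch of that local control path. All names are hypothetical (no real vendor API): the point is only that the PC issues a high-level request, the PLC turns it into a concrete equipment command, and the whole loop lives inside the facility, with no network dependency beyond the building.

```python
# Hypothetical sketch of the local "DVD player" control path:
# PC -> PLC -> equipment, all on premises.

class Valve:
    """A piece of end equipment, e.g. an ammonia feed valve."""
    def __init__(self, name):
        self.name = name
        self.open = False

class PLC:
    """Breaks the PC's higher-level requests into direct equipment commands."""
    def __init__(self, equipment):
        self.equipment = equipment  # maps name -> Valve

    def execute(self, request):
        # e.g. request = ("open_valve", "ammonia_feed_1")
        action, target = request
        valve = self.equipment[target]
        valve.open = (action == "open_valve")
        return valve.open

class ControlPC:
    """The on-premises Windows PC: talks only to the local PLC."""
    def __init__(self, plc):
        self.plc = plc

    def press_button(self, action, target):
        # A button press travels over a wire in the same building.
        return self.plc.execute((action, target))

plc = PLC({"ammonia_feed_1": Valve("ammonia_feed_1")})
pc = ControlPC(plc)
pc.press_button("open_valve", "ammonia_feed_1")  # valve opens locally
```

Nothing in this loop can be broken by an outage anywhere outside the facility, which is the property the rest of the post is about.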

The control software is updated rarely and carefully; the electrical equipment is what it is, often years and decades old, and the refrigeration process by which you thaw or blast freezing air does not change much either -- food storage is a conservative enterprise.

This arrangement is not ideal: while the operators can log in remotely, typically via TeamViewer, it would be nice if they could do that from their phones, in case they need to check the status of the equipment or initiate an action in an emergency when they are not next to their PC. More than that, you would like to optimize the way these machines, which consume huge amounts of electricity, operate based on your business patterns, and thus save energy and cut costs to be more competitive. Of course, that comes a distant second to the safety of the facility's people and the food. (You don't want to damage the food either, by letting the freeze temperatures fluctuate by more than a couple of degrees.)

This is the "DVD player" model of the facility: 


(what looks like a monkey stealing a rocket is a symbol for a piece of equipment called a "compressor").


With the Netflix model, that Windows PC is taken out of the equation, or rather relegated to being used only as a browser. There is an advantage to this: you don't have to maintain the PC; a Chromebook fresh out of the box can do the job (as can your phone or tablet). All you need is a browser; you can throw your "DVD collection", along with the "player", in the trash.

Of course, when you use only the browser, there still needs to be some piece of hardware -- the equivalent of a Roku stick -- placed in the facility to talk to the equipment, but you do not control it anymore: it receives its orders from the cloud. What is worse, you have absolutely no control over which software runs in the cloud. Whereas previously you knew that the software was entirely contained within your PC, and you had that direct connection from the PC to the equipment nearby, now there are hundreds and thousands of components that hundreds and thousands of people -- most of them completely unknown to you, working for companies you have never heard of -- are constantly updating, changing, reversing, and doing who knows what to, and that software is the lifeblood of your facility.

(And if there was a PLC between the PC and the equipment, the PLC now just forwards commands from the program defined in the cloud to the equipment. Everyone has to submit to the cloud.)
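The same sketch, rearranged for the cloud-controlled case (again with hypothetical names, not any real vendor's API), makes the failure mode visible: the on-site gateway only relays what the cloud decides, so the cloud and everything it depends on sit in the critical path of every button press.

```python
# Hypothetical sketch of the "Cloud Controlled" path: the on-site
# gateway (the "Roku stick") forwards only what the cloud approves.

class Valve:
    def __init__(self):
        self.open = False

class PLC:
    """Now just forwards whatever the cloud-defined program decides."""
    def __init__(self, valve):
        self.valve = valve

    def execute(self, decision):
        self.valve.open = (decision == "open")
        return self.valve.open

class CloudBrain:
    """The 'brain' in the cloud, with its own third-party dependencies."""
    def __init__(self):
        self.reachable = True
        self.identity_provider_up = True  # stands in for something like Okta

    def authorize_and_decide(self, request):
        if not self.reachable:
            raise ConnectionError("cloud unreachable")
        if not self.identity_provider_up:
            raise PermissionError("cannot log in: identity provider is down")
        return request  # the cloud, not the operator, owns the logic

class EdgeGateway:
    """The on-site relay: the cloud sits in the critical path of every command."""
    def __init__(self, cloud, plc):
        self.cloud = cloud
        self.plc = plc

    def press_button(self, request):
        decision = self.cloud.authorize_and_decide(request)  # remote round trip
        return self.plc.execute(decision)

cloud = CloudBrain()
gateway = EdgeGateway(cloud, PLC(Valve()))
gateway.press_button("open")          # works while the cloud is healthy
cloud.identity_provider_up = False    # an Okta-style outage, far away
# gateway.press_button("close") would now raise -- the valve stays open
# even though it is in the room next to you.
```

The valve in this sketch stays in whatever state it was in when the remote dependency failed, which is exactly the lockout scenario described below.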

This is the "Cloud Controlled" model.

The "mainframe" -- the "brain" and the database -- is now in the cloud. Sometimes you literally cannot even log in to turn off a valve that may be within your eyesight, because something called "Okta" has failed somewhere "in the cloud". (This happened, several times.) Or something called "GitHub", also in the cloud, can't be reached and your system cannot boot. With the cloud in the critical path, you can easily end up locked out of controlling hazardous equipment in the room next to you, and the only thing you can do is call support.

(Support, it has to be said, was great -- very responsive, day and night -- but you would much rather have essential things work like they used to and not have to call someone at 11pm. And support can't do much if Okta, or any of the many third-party components the cloud system needs to work, is experiencing an outage.)

Or, simply, your internet is out. You have been given an alternative for emergencies: connect directly to the "Roku stick" the company has installed in your facility. But that is not the default way of working, because it has limited functionality: the Roku stick is tiny. And unless you use this approach all the time, an outage disrupts your flow at best and leaves you vulnerable in an emergency at worst, when it may be dark, hazard lights are flashing, and you have to remember how to use this alternative path. (Also happened.)

With the cloud, you depend on countless third-party providers who depend on yet other providers, and you have no control over any of it. And this in facilities that often even have their own electricity generation should the power grid fail.

More dangerous still: some software guy at the vendor company, who has no idea what you are doing or what your business needs are at the moment, decides to "push" a software update because some cool new feature needs to be in, and your system stops working (also happened), or starts doing something differently than it used to (also happened, and of course you don't notice it immediately; sometimes it takes days). Yes, support notifies the software people promptly, but how easy is it to figure out at midnight what exactly caused the bug? And, unlike with PC software, in a complex cloud system you often cannot restore the system to the previous working state (also happened). You have to find the bug and fix it while time is running out.

To top it off, because of the Netflix cloud model, adding one more "movie watcher" -- another facility -- to the single shared database can wreck the database and/or the "brain" in the cloud and make the software misbehave in infinite and unpredictable ways: it goes offline, or stops running, or turns off all the alarms. But unlike with Netflix, even when everything works great you get zero benefit from this sharing. You don't care what another cold storage facility 500 miles away (or 50 miles away) is doing; you are not going to rate a software feature and read ratings from other facilities.

That is the "completely unnecessary, architecturally wrong interdependence of facilities and overdependence on the cloud" mentioned earlier in the blog. Even your Office 365 subscription allows you to use Word on your PC independently of the Microsoft cloud (unless you want a fancy new template). You update your Word when you want.

The Netflix model is like a beehive; it opens up the software so that countless people can keep constructing and deconstructing it while countless other people feed it with data and consume from it:

(source: goodtechthings.com) 

This beehive nature of the model, which allows the software to change rapidly at the cost of instabilities and failures absorbed by the individual "bees", is incompatible with industrial control, where processes require stability. Midnight heroics happen when software people prevent disasters made by other (or the same!) software people who messed things up in the perpetual quest for new features. But a disaster at Netflix is very different from a disaster in an industrial facility.

This does not mean facilities should use their Windows 7 PCs forever without ever going online; progress is inevitable, and there is certainly room for collecting facility equipment data and sending it to the cloud for people and machines to analyze in order to optimize the business, as long as that doesn't endanger the essential function. There is also room for limited feedback from the cloud, in which the cloud optionally sends suggested optimization strategies that the local software can apply to save energy. But the integrity and independence of the local facility must be respected at all times.
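A third sketch, under the same assumptions as the others (all names hypothetical), shows what "cloud assisted" means in practice: the local controller owns the equipment and works with no network at all, while telemetry upload and cloud suggestions are strictly best-effort extras that the local software vets and can reject.

```python
# Hypothetical sketch of the "Cloud Assisted" model: local control is
# self-sufficient; the cloud only analyzes telemetry and offers hints.

class LocalController:
    def __init__(self):
        self.setpoint_f = -10.0   # freeze setpoint, degrees Fahrenheit
        self.telemetry = []

    def control_step(self, measured_f):
        # The essential function: runs entirely on premises, no network.
        self.telemetry.append(measured_f)
        return "cool" if measured_f > self.setpoint_f else "idle"

    def apply_suggestion(self, suggestion):
        # Optional cloud feedback, vetted locally: a bad hint cannot push
        # the setpoint outside a safe band of a couple of degrees.
        proposed = suggestion.get("setpoint_f", self.setpoint_f)
        if -12.0 <= proposed <= -8.0:
            self.setpoint_f = proposed
            return True
        return False

def upload_telemetry(controller, cloud_reachable):
    # Best effort only: a failure here never blocks control_step.
    if not cloud_reachable:
        return False
    # ... ship controller.telemetry to the analytics service ...
    return True
```

The design choice is that the cloud's output is advice, not commands: if the internet is out, `control_step` keeps running unchanged, and a faulty suggestion is rejected by a local bounds check instead of moving the equipment.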

This would be the "Cloud Assisted" model:


(The images were made on my home computer and outside of working hours. They merely describe the obvious architectural choices.)

The problem with this alternative approach, as far as investors are concerned, is that it's not that much of a "SaaS" anymore; SaaS becomes a sideshow. You still have to have considerable hardware on the premises -- a powerful PC, or several for failover -- and the company can't update its software whenever it wants. Its business doesn't "scale" as much, which investors don't like, even if it could provide valuable service to the customer, because everything moves much more slowly.

Except there are cases where slow is good, and this is one of them: industrial cold storage, not to mention food processing, is a dangerous activity. You have to roll out features conservatively and test extensively. There is no other way. "Move fast and break things" will break things.

"Cloud-assisted" is just one alternative to "cloud-controlled"; there are others, and certainly better ones, like the local cloud -- an on-premises cloud. But whatever the more appropriate solution, anything other than the "cloud-controlled" approach was never discussed.

(All this doesn't take into account security, which is a topic I do not have expertise in.)

* * *

Less obvious, and maybe the most concerning aspect of this kind of scheme, is that it takes power away from the facility people and puts it in the hands of software people outside your company. To save some money -- maybe -- you give those software people the keys to your kingdom. And that's the best-case scenario.

My opinion is that if companies are going to use a software service in a way that affects their livelihood, not to mention life and limb, they should make sure that control remains in their hands at all times.


