A Peek into Generative Control

Chinese version: 一窥生成式控制论 - 知乎 (zhihu.com)

As a peek into some of my most recent work applying generative modeling to various industries, I am proud to present an idea that illustrates how powerful generative models can be for industrial control. My team and I are working to rapidly expand this idea into many different areas, and we have not yet seen its limits.

A General Formulation of Generative Models

\[y = g(z), \quad z \sim \mathbb{D}\]

where \(y\) is a sample from the data, \(g\) is a generator neural network written as a function, and \(z\) follows a pre-defined distribution \(\mathbb{D}\).

It is easy to see that generative adversarial networks (GANs) naturally result in models of the above form. For autoregressive language models (transformers or otherwise), \(z\) is the concatenation of all the random sampling variables used during the decoding process. For Stable Diffusion, \(z\) is the concatenation of all the noise added in the diffusion process.
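
To make the formulation concrete, here is a minimal sketch of \(y = g(z)\) as a small PyTorch MLP; the architecture, the sizes, and the Gaussian choice for \(\mathbb{D}\) are illustrative assumptions, not any specific model.

```python
# A minimal sketch of y = g(z): a small MLP generator in PyTorch.
# LATENT_DIM, DATA_DIM, and the Gaussian prior are assumptions.
import torch
import torch.nn as nn

LATENT_DIM, DATA_DIM = 16, 8  # hypothetical sizes

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, DATA_DIM),
        )

    def forward(self, z):
        return self.net(z)

g = Generator()
z = torch.randn(32, LATENT_DIM)  # z ~ D, here a standard Gaussian
y = g(z)                         # a batch of generated samples
```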

A Generative Formulation of Industrial Control

\[[u,v,o] = g(z), \quad z \sim \mathbb{D}\]

where \(u\) is the collection of output variables that can be used to control some complex system, \(v\) represents the input variables that can be observed (by sensors or otherwise), and \(o\) is some optimization objective that we want to achieve by controlling. We collect a large number of samples of the form \([u, v, o]\) to train the generative model \(g\).
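
As a hedged sketch of what training \(g\) might look like, the snippet below fits a GAN on concatenated \([u, v, o]\) records; the dimensions, architectures, and optimizer settings are all assumptions made for illustration.

```python
# A sketch of training g on recorded [u, v, o] rows with a GAN objective.
# All dimensions, architectures, and optimizer settings are assumptions.
import torch
import torch.nn as nn

U_DIM, V_DIM, O_DIM, LATENT_DIM = 4, 6, 1, 16  # hypothetical sizes
X_DIM = U_DIM + V_DIM + O_DIM

g = nn.Sequential(nn.Linear(LATENT_DIM, 64), nn.ReLU(), nn.Linear(64, X_DIM))
d = nn.Sequential(nn.Linear(X_DIM, 64), nn.ReLU(), nn.Linear(64, 1))
opt_g = torch.optim.Adam(g.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(d.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

def gan_step(real):
    """One training step; `real` is a batch of recorded [u, v, o] rows."""
    z = torch.randn(real.size(0), LATENT_DIM)
    fake = g(z)
    # Discriminator: tell recorded system states from generated ones.
    loss_d = bce(d(real), torch.ones(real.size(0), 1)) + \
             bce(d(fake.detach()), torch.zeros(real.size(0), 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    # Generator: make generated states indistinguishable from recorded ones.
    loss_g = bce(d(fake), torch.ones(real.size(0), 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

gan_step(torch.randn(32, X_DIM))  # stand-in batch of [u, v, o] records
u, v, o = g(torch.randn(1, LATENT_DIM)).split([U_DIM, V_DIM, O_DIM], dim=-1)
```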

In industrial control, we usually need to answer the following question: given \(v = v_0\), find the \(u\) that optimizes \(o \rightarrow o_0\) (read as "\(o\) towards \(o_0\)"). To obtain this \(u\), we use a fixed-point algorithm on \(g\):

\[\begin{align}
& \text{Initialize } u_0 \\
& \text{For } t = 1 \rightarrow T \\
& \quad\quad z_t = \underset{z}{\text{argmin}} ~ \| g(z) - [u_{t-1}, v_0, o_0] \| \\
& \quad\quad [u_t, \_, \_] = g(z_t) \\
& \text{Output } u_T \text{.}
\end{align}\]
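
Below is a hedged Python sketch of this algorithm, assuming a trained generator `g` and solving the inner argmin by gradient descent on \(z\); the step counts, learning rates, and the untrained stand-in generator at the end are purely illustrative.

```python
# A sketch of the fixed-point control search; hyperparameters are illustrative.
import torch

def invert(g, target, latent_dim, steps=200, lr=1e-2):
    """Approximate argmin_z || g(z) - target || by gradient descent on z."""
    z = torch.randn(1, latent_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        loss = (g(z) - target).norm()
        opt.zero_grad(); loss.backward(); opt.step()
    return z.detach()

def fixed_point_control(g, u0, v0, o0, latent_dim, T=10):
    """Iterate z_t = argmin_z ||g(z) - [u_{t-1}, v0, o0]||, then read u_t off g(z_t)."""
    u = u0
    dims = [u0.size(-1), v0.size(-1), o0.size(-1)]
    for _ in range(T):
        z = invert(g, torch.cat([u, v0, o0], dim=-1), latent_dim)
        with torch.no_grad():
            u, _, _ = g(z).split(dims, dim=-1)
    return u

# Untrained stand-in generator, only to make the sketch executable.
g = torch.nn.Sequential(torch.nn.Linear(16, 11))
u_T = fixed_point_control(g, torch.zeros(1, 4), torch.zeros(1, 6),
                          torch.zeros(1, 1), latent_dim=16)
```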

Another common question in industrial control is whether it is possible to move the complex system from state \([u_1, v_1, o_1]\) to \([u_2, v_2, o_2]\) by changing only \(u\). Given \(z_1 = \underset{z}{\text{argmin}} ~ \| g(z) - [u_1, v_1, o_1] \|\) and \(z_2 = \underset{z}{\text{argmin}} ~ \| g(z) - [u_2, v_2, o_2] \|\), we find that this can be achieved by interpolating between \(g(z_1)\) and \(g(z_2)\), controlling how far we go and verifying the difference between the system and the model at every step. This is essentially walking on the system's state manifold, and in practice it succeeds much more easily than alternatives such as reinforcement learning.
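
A sketch of the manifold walk follows. It realizes the interpolation between \(g(z_1)\) and \(g(z_2)\) by interpolating in latent space and decoding through \(g\); the `system_distance` probe, which stands for the verification of the gap between the real system and the model at each step, is a hypothetical placeholder.

```python
# A sketch of walking the state manifold from g(z1) toward g(z2).
# `system_distance` is a hypothetical probe comparing system and model.
import torch

def walk(g, z1, z2, system_distance, n_steps=10, tol=1e-2):
    """Step from g(z1) toward g(z2), verifying the model at every step."""
    states = []
    for a in torch.linspace(0.0, 1.0, n_steps):
        z = (1 - a) * z1 + a * z2          # interpolate in latent space
        state = g(z)                       # candidate [u, v, o] on the manifold
        if system_distance(state) > tol:   # model and system diverged
            break                          # stop before applying a bad u
        states.append(state)
    return states
```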

Generative Control Can Disambiguate

To reveal the power of generative control, we consider the alternative formulation \(u = h(v, o)\). To obtain \(h\), we can train it as a regression or classification problem. However, simple regression or classification collapses when ambiguous values of \(u\) are present in the data.

Take self-driving as an example, in which \(u\) is the steering direction. If there is a vehicle ahead on a multi-lane road, it may be acceptable to steer either left or right. Since both directions are acceptable, both will appear in the data, and the regression averages them into steering straight rather than left or right, which is disastrous. One could perhaps extend \(u\) into a distribution over steering directions and sample from it during inference, but for more complex systems this is not acceptable because of the curse of dimensionality. In fact, striking a balance between full regression and full probabilistic modeling is a common reason why many systems today are so complex.
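
The tiny demo below makes the collapse concrete: with "left" coded as -1 and "right" as +1, least-squares regression on ambiguous data converges to their mean, 0, i.e. straight ahead. Everything here is a contrived illustration.

```python
# Regression collapse on ambiguous steering data: -1 = left, +1 = right.
import torch

steering_labels = torch.tensor([-1.0, 1.0, -1.0, 1.0])  # both directions OK
u_hat = torch.nn.Parameter(torch.tensor([-1.0]))        # start steering left
opt = torch.optim.SGD([u_hat], lr=0.1)
for _ in range(100):
    loss = ((u_hat - steering_labels) ** 2).mean()  # plain MSE regression
    opt.zero_grad(); loss.backward(); opt.step()
print(u_hat.item())  # ~0.0: the model steers straight into the vehicle
```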

Generative Control Can Disentangle Variable Dependencies

Let's continue from the last section. One improvement one may imagine is \(u = h(v, o, z), ~ z \sim \mathbb{D}\). When \(h\) is trained successfully, it suffers from neither ambiguity nor the curse of dimensionality. Indeed, many models are built this way. Two examples are DALL-E and Sora from OpenAI, in which \(u\) is the output image or video, \(v\) is the text prompt, and \(o\) is absent. Here \(z\) is the concatenation of the diffusion noise or the sampling random variables. This is also an effective model for industrial control.
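
A minimal sketch of this conditional form is below; the MLP, the sizes, and the Gaussian choice for \(z\) are assumptions, not the architecture of any of the models named above.

```python
# A sketch of u = h(v, o, z): a conditional generator that resolves
# ambiguity by sampling z. Sizes and architecture are assumptions.
import torch
import torch.nn as nn

U_DIM, V_DIM, O_DIM, LATENT_DIM = 4, 6, 1, 16  # hypothetical sizes

h = nn.Sequential(
    nn.Linear(V_DIM + O_DIM + LATENT_DIM, 64), nn.ReLU(),
    nn.Linear(64, U_DIM),
)

v, o = torch.randn(1, V_DIM), torch.randn(1, O_DIM)  # observed condition
z = torch.randn(1, LATENT_DIM)                # z ~ D picks one valid mode
u = h(torch.cat([v, o, z], dim=-1))           # different z, different valid u
```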

We observe that \(h\) carries a strong assumption: that \(u\) depends on \(v\) and \(o\). In the real universe, such a dependency can usually be obvious and strict, but in many cases it is not. Taking DALL-E or Sora as an example, the fact that only long, detailed prompts work well is a strong indication that the variable dependency only becomes strict as the prompt text grows longer.

On the other hand, the formulation of \(g\) does not suffer from such uncertainty in variable dependency. It can learn to model the dependencies among \(u\), \(v\), and \(o\) however strict they are, and disentangle those dependencies in the presence of \(z\).

The Key Question

Deterministic ambiguity and uncertain variable dependency are only two examples of what \(g\) can disentangle. We believe that \(g\) can disentangle many more kinds of mathematical constructs, such as probabilistic conditions, time and space relations, etc. In fact, we believe that \(g\) can even disentangle constructs that have not yet been discovered by current mathematics and physics. The key question is then

Who can make \(g\) work? 😜

Please do not hesitate to contact me if you have a problem where there is an abundance of data.
