
[FAN] Efficiency vs. Expressivity in Offline Reinforcement Learning

8 min read


Online vs. Offline RL

Reinforcement learning (RL) is about learning through trial and error. An agent takes an action, observes the result, and receives a reward signal. Over time, a policy is learned to maximize reward. This setting is known as **online RL**, where the agent continuously interacts with the environment while learning. Online RL is flexible, but it can be expensive, slow, or unsafe in real-world settings, e.g., deep-sea exploration or space missions.

In **offline RL**, the agent does not interact with the environment during training. Instead, it learns from a fixed, pre-collected dataset. This setup is generally safer and more cost-effective, especially when real-world interaction is risky. However, it is also more challenging: the agent cannot gather new data to correct its errors and must generalize solely from the dataset's support.

Therefore, the key challenge in offline RL is optimizing the learning policy while constraining it to the available dataset. Actions outside the dataset's support may look good during offline training but can perform poorly online. To tackle this challenge,

We need effective **policy**, **constraint**, and **value** estimators.

Expressive Offline RL

How does prior work make these three components effective? Given the use of neural networks, these function approximators should be “expressive” enough, just as in other fields of machine learning (ML).

(1) Expressive Policy

Let’s start with a simple yet expressive formulation of the policy. The agent’s deterministic action $a_\theta$ can be sampled using the deterministic function $f_\theta(s,\epsilon)$, where $s$ is the state and $\epsilon$ is any vector modeling the stochasticity of the policy. One standard approach is to sample $\epsilon$ from a normal distribution, i.e., $\epsilon\sim\mathcal{N}(0,I_d)$, where $d$ is the dimension of the action space. Samples $a_\theta=f_\theta(s,\epsilon)$ form the learning policy distribution $\pi_\theta$, i.e., $a_\theta\sim\pi_\theta(\cdot|s)$. Here, the push-forward function $f_\theta$ maps the distribution of $\epsilon$ to the state-conditional distribution of actions, i.e., $\pi_\theta(\cdot|s)$. Even with deterministic outputs, our policy $\pi_\theta$ becomes stochastic thanks to the noise input $\epsilon$.
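As a quick illustration, here is a minimal numpy sketch of such a push-forward policy. The network weights below are random placeholders, not a trained policy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random placeholder weights standing in for a trained f_theta.
W1 = rng.standard_normal((8, 6))   # input: state (4-dim) + noise (2-dim)
W2 = rng.standard_normal((2, 8))   # output: action (2-dim)

def f_theta(s, eps):
    """Deterministic push-forward map: (state, noise) -> action."""
    h = np.tanh(W1 @ np.concatenate([s, eps]))
    return np.tanh(W2 @ h)

s = np.array([0.1, -0.3, 0.5, 0.0])  # a fixed state
d = 2                                # action dimension

# Same state, different noise eps ~ N(0, I_d) -> different actions:
# the deterministic map f_theta induces a stochastic policy pi_theta(.|s).
a1 = f_theta(s, rng.standard_normal(d))
a2 = f_theta(s, rng.standard_normal(d))
assert a1.shape == (d,) and not np.allclose(a1, a2)
```

One forward pass per action sample, which is why this construction is so cheap.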

*(Figure: Policy)*

This has been widely used since DPG and DDPG. Here, $\pi_\theta$ becomes more expressive than standard Gaussian policies, since it can model all possible action distributions. Gaussian policies, i.e., $\pi_\theta(\cdot|s)=\mathcal{N}(\mu_\theta,\sigma_\theta^2)$, are limited to unimodal distributions.

(2) Expressive Constraint

Now, we need $\pi_\theta$ not to deviate too much from the behavior in the dataset. Among various approaches, diffusion and flow models have proven highly effective for setting up such constraints. With these methods, we train a behavior policy $\pi_\beta$, which represents the policy that originally generated the offline dataset. This $\pi_\beta$ is then used as the policy constraint (e.g., $D_{KL}(\pi_\theta\|\pi_\beta)$), forcing $\pi_\theta$ to output actions similar to the dataset. Learning $\pi_\beta$ is equivalent to a standard generative modeling problem, which explains why diffusion and flow matching are so successful here. Traditional Gaussian policies often assume a single mode and struggle to model $\pi_\beta$, whereas flow-based policies can easily capture complex, multimodal behaviors.
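To make the generative-modeling connection concrete, here is a sketch of the conditional flow-matching regression used to train such a behavior policy. The velocity field `v_beta` below is an untrained placeholder, just to exercise the loss:

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(v_beta, s, a_data, eps, t):
    """Conditional flow-matching regression for one dataset action:
    along the straight path a_t = (1-t)*eps + t*a_data, the regression
    target is the constant velocity (a_data - eps)."""
    a_t = (1 - t) * eps + t * a_data
    target = a_data - eps
    return np.sum((v_beta(s, t, a_t) - target) ** 2)

# Untrained placeholder velocity field.
v_beta = lambda s, t, a_t: np.zeros_like(a_t)

s = np.array([0.2, -0.1])        # dataset state
a_data = np.array([0.5, 0.5])    # dataset action
eps = rng.standard_normal(2)     # prior noise
t = rng.uniform()                # random time in [0, 1]
loss = flow_matching_loss(v_beta, s, a_data, eps, t)
assert loss >= 0.0  # squared error; minimized when v_beta fits the target
```

Minimizing this loss over the dataset trains $v_\beta$ to transport prior noise onto dataset actions, no matter how multimodal they are.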

*(Figure: Policy)*

FQL is one good example. Its constraint is $W_2(\pi_\theta,\pi_\beta)$, the Wasserstein-2 distance between the learning policy $\pi_\theta$ and the behavior policy $\pi_\beta$ trained with flow matching.

(3) Expressive Value

Next, is there a better way to estimate future returns than with expectations? Yes, indeed. Distributional RL is one line of work that models more expressive values. Instead of simply estimating the expected sum of rewards, distributional critics estimate the full distribution of returns. Consider two scenarios: a guaranteed return of 0, versus a 50% chance of 100 with a 50% chance of -100. A standard critic evaluates these as equivalent (since $1\cdot 0 = 0.5\cdot 100 + 0.5\cdot(-100) = 0$), but a distributional critic captures the higher potential maximum return in the second scenario.

A common way to represent these return distributions is with finite discrete bins or quantiles corresponding to different levels of the cumulative distribution function (CDF). But, yeah… this is quite abstract, so let’s walk through a simple example to get some intuition. Consider the following environment:

*(Figure: Simple MDP)*

Here, starting from the initial state $S_0$, the number on each edge is the reward the agent gets for that action (left or right). Let’s first estimate the expected return under a fixed policy that goes left and right with equal probability:

*(Figure: Expected Return)*

In this case, the high sparse reward of +30 is not captured effectively. However, for the distributional critics:

*(Figure: Distributional Return)*

The highest possible return (+24) is captured through high $\delta$ values using distributional quantile critics. Here, $\delta$ denotes the quantile level of the return CDF.
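The earlier 0-versus-±100 example can be checked numerically in a toy sketch: a standard critic collapses both distributions to their (identical) mean, while quantiles preserve the shape of the return distribution:

```python
import numpy as np

certain = np.array([0.0])            # guaranteed return of 0
gamble = np.array([100.0, -100.0])   # 50/50 between +100 and -100

# A standard critic reduces both to the expectation -- identical:
assert certain.mean() == gamble.mean() == 0.0

# A quantile critic keeps several points of the return CDF:
qs = np.quantile(gamble, [0.1, 0.5, 0.9])
print(qs)  # [-80.   0.  80.] -- the upper quantile reveals the +100 upside
```

The high quantile levels are exactly where the "+24 via high $\delta$" behavior in the figure comes from.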

Inefficiencies

While sampling actions from $\pi_\theta$ is highly effective and efficient, relying on the behavior flow policy $\pi_\beta$ for constraints and using distributional critics are both compute-heavy.

**(Policy) One-step $\pi_\theta$ sampling.** Great efficiency.

**(Constraint) $\pi_\beta$ with Flow Matching.** Behavior flow policies typically require multiple forward iterations to generate a single dataset action. Consequently, the cost of setting up the constraint scales proportionally with the number of flow steps.

**(Value) Distributional Critics.** These critics process multiple return samples (such as multiple bins or quantiles), so the critic's computational cost scales linearly with the number of samples.

Key Observations

To summarize, higher performance in expressive offline RL comes with more computation:

Efficiency vs. Expressivity trade-off exists in offline RL.

However, can these expressive mechanisms be made more efficient? Since the expressive policy $\pi_\theta$ is already modeled very efficiently, let's focus on improving the efficiency of (1) the flow-based constraint and (2) the distributional value.

(Constraint) Is Flow Iteration Necessary?

Flow-based policy constraints require solving an ordinary differential equation (ODE).

*(Figure: Observation for the behavior flow policy)*

For instance, FQL directly compares the action outcomes $a_\beta$ and $a_\theta$, which are sampled from the policy distributions $\pi_\beta$ and $\pi_\theta$, respectively. Therefore, during training, FQL must iterate over the behavior velocity field $v_\beta$ multiple times to obtain $a_\beta$, leading to more computation. But,
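For concreteness, here is a sketch of the ODE sampling that makes this expensive: Euler-integrating the velocity field over many steps. `v_toy` is a made-up velocity field pulling actions toward a fixed point, not a trained $v_\beta$:

```python
import numpy as np

def sample_a_beta(v_beta, s, eps, n_steps):
    """Euler-integrate the flow ODE da/dt = v_beta(s, t, a) from
    a(0) = eps to a(1) = a_beta. Cost is linear in n_steps."""
    a, dt = eps.copy(), 1.0 / n_steps
    for k in range(n_steps):
        a = a + dt * v_beta(s, k * dt, a)
    return a

# Made-up velocity field, just to make the iteration visible;
# a real v_beta is a trained network.
target = np.array([1.0, -1.0])
v_toy = lambda s, t, a: target - a

a_beta = sample_a_beta(v_toy, s=np.zeros(3), eps=np.zeros(2), n_steps=100)
# 100 velocity-field evaluations for ONE behavior action -- a cost
# paid at every training step.
print(np.round(a_beta, 2))
```

With a neural $v_\beta$, each of those loop iterations is a full network forward pass.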

What if we compare directions within the flow, rather than the action outputs directly?

Our intuition is that accurately sampling $a_\beta$ is not a strict requirement for successful $\pi_\theta$ learning; $a_\beta$ is merely used to "regularize" $a_\theta$ toward the dataset.

(Value) Should we model CDFs?

There is actually an alternative way to model return distributions other than with CDFs.

*(Figure: Observation for the distributional critic)*

Consider the problem of generative modeling. The goal is to find a function that maps a prior distribution to a target distribution. This function can be viewed as the push-forward function $f$, forwarding the prior distribution (e.g., $\mathcal{N}(0,I_d)$) to our target distribution (e.g., $p_\text{data}$). The distribution is modeled through this push-forward function $f$, which is distinct from CDF-based modeling. Then, similarly,
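A one-line example of push-forward modeling: a map $f$ sends the prior $\mathcal{N}(0,1)$ to a shifted and scaled target distribution, with no CDF in sight:

```python
import numpy as np

rng = np.random.default_rng(0)

# A push-forward map f sending the prior N(0, 1) to the target N(3, 0.5^2):
# the distribution is represented by a map on noise, not by its CDF.
f = lambda eps: 3.0 + 0.5 * eps

samples = f(rng.standard_normal(100_000))
print(samples.mean().round(1), samples.std().round(1))  # ~3.0 ~0.5
```

Every sample from $f$ is an equally valid draw from the target, which is what makes the single-sample critic idea below plausible.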

What if we model return distributions with push-forwards?

Our intuition is that if we use push-forwards, all distributional critic samples have a similar meaning (i.e., each is a possible return outcome sampled randomly). Consequently, using a single critic sample may be sufficient. In contrast, different samples from the distributional quantile critic have different meanings, since each sample models a different part of the return CDF.

Actually, Value Flows is one work that does exactly this. However, they solve ODEs for sampling the critic values, making it hard to say that efficiency has improved.

Proposed Method

Based on these insights, we propose:

Flow Anchored Noise-conditioned Q-Learning (FAN)
*(Figure: FAN)*

We use “Flow Anchoring” for expressive policy constraints, and apply “Noise-conditioned Q-Learning” for expressive values.

Constraint using Single Flow Iteration

**(1) Flow Anchoring.** For the behavior regularization part, we modify the prior objective as follows:

$$\texttt{(FQL): }\ \mathbb{E}_{\substack{s\sim\mathcal{D} \\ \epsilon\sim\mathcal{N}(0,I_d)}} \left[ \lVert a_\theta - a_\beta \rVert_2^2 \right]$$

$$\texttt{(FAN): }\ \mathbb{E}_{\substack{s\sim\mathcal{D} \\ \epsilon\sim\mathcal{N}(0,I_d) \\ t\sim\mathrm{Unif}([0,1])}} \left[ \lVert (a_\theta - \epsilon) - v_\beta(s, t, a_\theta^t) \rVert_2^2 \right]$$

Here is the variable breakdown:

  • $s \sim \mathcal{D}$ : Offline dataset state
  • $\epsilon \sim \mathcal{N}(0,I_d)$ : Prior noise vector
  • $t \sim \mathrm{Unif}([0,1])$ : Random time step
  • $a_\theta := f_\theta(s, \epsilon)$ : Learning policy action
  • $a_\beta := \epsilon + \int_0^1 v_\beta(s, \tau, a_\tau)\,d\tau$ : Dataset action (iterative ODE)
  • $a_\theta^t := (1 - t)\,\epsilon + t\,a_\theta$ : Interpolated point at time $t$
  • $a_\theta - \epsilon$ : Policy direction
  • $v_\beta(s, t, a_\theta^t)$ : Behavior flow velocity field (single step)

Our work requires only a single flow step, and is guaranteed to minimize the Wasserstein-2 distance between $\pi_\theta$ and $\pi_\beta$!
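Putting the pieces together, a minimal sketch of the FAN regularizer (both networks below are random placeholders; a real $v_\beta$ would be pretrained with flow matching and a real $f_\theta$ is the learning policy network):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 2  # action dimension

# Random placeholders for the policy network and the behavior velocity field.
f_theta = lambda s, eps: np.tanh(eps + s[:d])
v_beta = lambda s, t, a_t: np.zeros(d)

def fan_regularizer(s):
    """FAN constraint with a SINGLE v_beta evaluation: compare the policy
    direction (a_theta - eps) against v_beta at the interpolated point,
    instead of ODE-sampling a_beta as in FQL."""
    eps = rng.standard_normal(d)          # eps ~ N(0, I_d)
    t = rng.uniform()                     # t ~ Unif([0, 1])
    a_theta = f_theta(s, eps)             # one-step policy action
    a_t = (1 - t) * eps + t * a_theta     # interpolated point a_theta^t
    return np.sum(((a_theta - eps) - v_beta(s, t, a_t)) ** 2)

loss = fan_regularizer(np.array([0.1, -0.2, 0.3]))
assert loss >= 0.0
```

Note that the loop over flow steps from the FQL-style sampler is gone: one $v_\beta$ call per training sample.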

Distributional Critic with Push-Forwards

**(2) Noise-conditioned Q-Learning.** For the distributional value estimates, we propose the following novel Bellman operator:

$$(\mathcal{T}^{\pi}_{n} Q)(s, a, \epsilon') :\overset{d}{=} r + \gamma \operatorname*{ess\,sup}_{\epsilon \sim \mathcal{N}(0, I_d)} Q\big(s', \pi(s', \epsilon'), \epsilon\big)$$

Don’t be scared by the $\operatorname{ess\,sup}$! It stands for "essential supremum", which is just taking the maximum value while ignoring rare, zero-probability edge cases. With the simple environment example above, the critic values are calculated as follows:

*(Figure: Observation for Noise-conditioned Q-Learning)*

Understand this as the distributional extension of Q-Learning, i.e., $(\mathcal{T}Q)(s, a) = r + \gamma \max_{a'} Q(s', a')$. In Q-Learning, the maximum is taken over the "next" action, but we take it over $\epsilon$, which is closely related to the "next-next" action. In our case, the next action is tied to the value function input $\epsilon'$.
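As a sketch, the noise-conditioned target can be approximated by taking a max over finitely many noise samples in place of the $\operatorname{ess\,sup}$. The critic and policy below are hypothetical placeholders, only there to show the target's shape:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical placeholder critic and policy.
Q = lambda s, a, eps: float(np.tanh(s.sum() + a.sum()) + 0.1 * eps.sum())
pi = lambda s, eps: np.tanh(s + eps)

def fan_target(r, gamma, s_next, eps_prime, n_noise=32):
    """Noise-conditioned Bellman target: the next action is fixed by the
    critic input eps', while the ess-sup is approximated by a max over
    freshly sampled noise vectors eps."""
    a_next = pi(s_next, eps_prime)                    # tied to eps'
    qs = [Q(s_next, a_next, rng.standard_normal(2))   # max over eps samples
          for _ in range(n_noise)]
    return r + gamma * max(qs)

y = fan_target(r=1.0, gamma=0.99, s_next=np.array([0.3, -0.1]),
               eps_prime=np.zeros(2))
assert np.isfinite(y)
```

Because each noise-conditioned sample is an exchangeable draw from the return distribution, a single critic evaluation per update can suffice, unlike quantile critics where every quantile must be computed.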

Result

We tested FAN on numerous offline RL benchmarks such as D4RL and OGBench. Please check out our paper, FAN, for more details.

*(Figure: FLOPs and Success Rates)*

To summarize, FAN achieves the best success rates on average with the highest computational efficiency. Specifically, its computational efficiency is similar to that of highly efficient non-distributional offline RL algorithms (e.g., ReBRAC, FQL), while outperforming relatively inefficient distributional approaches (e.g., CODAC, Value Flows).

Takeaways

Just remember these about FAN:

  • **High Task Performance** through:

    1. Flow policy constraint
    2. Distributional critic
  • **High Efficiency** through:

    1. Flow Anchoring for the offline dataset constraint.
    2. Noise-conditioned Q-Learning for return distributions.

I hope you enjoyed this blog post!! 🤩

Acknowledgments

Huge thanks to Professor Pingali and Eshan for organizing the content together for easier understanding. 🙏