IQL vs. NQL

Setup

Let $\gamma\in(0,1)$, $Q:\mathcal S\times\mathcal A\to\mathbb R$, $Q_n:\mathcal S\times\mathcal A\times\mathcal E\to\mathbb R$, and $\pi:\mathcal S\times\mathcal E\to\mathcal A$. We write one-step sampled transitions as $(s,a,r,s')$. Let $\mu(a\mid s)$ denote the behavior/data action distribution.

Expectile $\tau\to1 \Rightarrow \operatorname*{ess\,sup}$

For a random variable $Z$, define the expectile:

$$
\operatorname{Expectile}_\tau(Z) :=\arg\min_v\;\mathbb E\!\left[\left|\tau-\mathbf 1(Z-v<0)\right|(Z-v)^2\right].
$$

As $\tau\to1$,

$$
\operatorname{Expectile}_\tau(Z)\to\operatorname*{ess\,sup} Z.
$$

Applied to actions at state $s'$ (with $a'\sim\mu(\cdot\mid s')$):

$$
\operatorname{Expectile}_\tau\!\big(Q(s',a')\big) \to \operatorname*{ess\,sup}_{a'\sim\mu(\cdot\mid s')} Q(s',a').
$$
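The limit above can be checked numerically on a finite sample. The following is a minimal sketch, not taken from the original post: the helper name `expectile`, the bisection solver, and the sample `q_values` are all illustrative assumptions. It uses the first-order condition of the expectile objective, $\tau\,\mathbb E[(Z-v)_+] = (1-\tau)\,\mathbb E[(v-Z)_+]$, and treats the sample as a uniform empirical distribution, so the essential supremum is simply the sample maximum.

```python
def expectile(samples, tau, iters=200):
    """Approximate the tau-expectile of a finite sample by bisection.

    Hypothetical helper for illustration. The minimizer v of
    E[|tau - 1(Z - v < 0)| (Z - v)^2] satisfies
        tau * E[(Z - v)_+] = (1 - tau) * E[(v - Z)_+],
    and (left side - right side) is decreasing in v.
    """
    lo, hi = min(samples), max(samples)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        above = sum(max(z - mid, 0.0) for z in samples)
        below = sum(max(mid - z, 0.0) for z in samples)
        if tau * above > (1.0 - tau) * below:
            lo = mid  # candidate is too small, move up
        else:
            hi = mid  # candidate is large enough, move down
    return 0.5 * (lo + hi)


# Toy values standing in for Q(s', a') with a' drawn from mu(. | s').
q_values = [0.0, 1.0, 2.0, 5.0]

for tau in (0.5, 0.9, 0.99, 0.999):
    print(tau, expectile(q_values, tau))
```

For this sample, $\tau=0.5$ recovers the mean (about 2.0), $\tau=0.9$ gives about 4.0, and larger $\tau$ moves the value toward the maximum 5.0 without exceeding it.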

IQL

IQL value target:

$$
V_\tau(s') :=\operatorname{Expectile}_\tau\!\big(Q(s',a')\big),\quad a'\sim\mu(\cdot\mid s').
$$

IQL Bellman update:

$$
(\mathcal T_{\mathrm{IQL}}^\tau Q)(s,a) := r(s,a)+\gamma V_\tau(s').
$$
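The value target and Bellman update above can be sketched for a single transition as follows. This is a toy illustration under stated assumptions, not the post's implementation: `fit_value`, `iql_target`, the learning rate, the step count, and the sample Q-values are hypothetical. A scalar $V_\tau(s')$ is fit by plain gradient descent on the asymmetric squared loss $|\tau-\mathbf 1(u<0)|\,u^2$ with $u = Q(s',a') - v$, whereas in practice a function approximator would be trained on minibatches.

```python
def expectile_loss(q, v, tau):
    """Asymmetric squared loss |tau - 1(u < 0)| * u^2 with u = q - v."""
    u = q - v
    weight = tau if u >= 0 else 1.0 - tau
    return weight * u * u


def fit_value(q_samples, tau, lr=0.1, steps=2000):
    """Fit a scalar V_tau(s') by gradient descent on the mean expectile loss.

    Hypothetical helper: lr and steps are illustrative choices.
    """
    v = 0.0
    n = len(q_samples)
    for _ in range(steps):
        grad = 0.0
        for q in q_samples:
            u = q - v
            weight = tau if u >= 0 else 1.0 - tau
            grad += -2.0 * weight * u / n  # d/dv of weight * (q - v)^2
        v -= lr * grad
    return v


def iql_target(reward, gamma, q_next_samples, tau):
    """One-sample IQL Bellman target r(s, a) + gamma * V_tau(s')."""
    return reward + gamma * fit_value(q_next_samples, tau)


# Toy Q(s', a') values for actions a' sampled from the data distribution.
q_next = [0.0, 1.0, 2.0, 5.0]
target = iql_target(reward=1.0, gamma=0.9, q_next_samples=q_next, tau=0.9)
print(target)  # roughly 1.0 + 0.9 * 4.0
```

Note that the loss only ever evaluates $Q$ at actions drawn from the data, which is why the resulting target stays within the support of $\mu$.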

In the limit $\tau\to1$:

$$
(\mathcal T_{\mathrm{IQL}}^\tau Q)(s,a) \approx r(s,a)+\gamma\,\operatorname*{ess\,sup}_{a'\sim\mu(\cdot\mid s')} Q(s',a').
$$

NQL

NQL in FAN uses a noise-conditioned essential-supremum target:

$$
(\mathcal T_n^\pi Q_n)(s,a,\epsilon') := r(s,a) + \gamma\,\operatorname*{ess\,sup}_{\epsilon} Q_n(s',\pi(s',\epsilon'),\epsilon).
$$
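To make the structure of this target concrete, here is a small sketch on a finite noise set. Everything in it is an illustrative assumption rather than FAN's actual implementation: the function `nql_target`, the table `Q_TABLE`, the toy `policy`, and the three-element `noise_set`. When every noise value has positive probability, the essential supremum over $\epsilon$ reduces to a plain maximum.

```python
def nql_target(reward, gamma, q_n, policy, next_state, eps_prime, noise_set):
    """Target r + gamma * max_eps Q_n(s', pi(s', eps'), eps).

    Hypothetical helper: the next action is fixed by eps_prime, and the
    maximum runs over the noise argument of Q_n only.
    """
    next_action = policy(next_state, eps_prime)
    best = max(q_n(next_state, next_action, eps) for eps in noise_set)
    return reward + gamma * best


# Toy noise-conditioned critic, keyed by (state, action, noise).
Q_TABLE = {
    ("s1", 0, 0): 1.0, ("s1", 0, 1): 2.0, ("s1", 0, 2): 0.5,
    ("s1", 1, 0): 3.0, ("s1", 1, 1): 1.5, ("s1", 1, 2): 2.5,
}


def q_n(state, action, eps):
    return Q_TABLE[(state, action, eps)]


def policy(state, eps):
    # Toy noise-conditioned policy: maps the noise index to one of two actions.
    return eps % 2


noise_set = [0, 1, 2]
target = nql_target(0.5, 0.9, q_n, policy, "s1", eps_prime=1, noise_set=noise_set)
print(target)  # roughly 0.5 + 0.9 * 3.0, since eps' = 1 selects action 1
```

The target depends on $\epsilon'$ only through the selected action $\pi(s',\epsilon')$, while the maximization acts on the critic's own noise argument.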

Connection Between IQL and NQL

Let $Q_{\mathrm{IQL}}^\tau$ and $Q_n^\pi$ denote fixed points of the two operators, that is, functions satisfying:

$$
Q_{\mathrm{IQL}}^\tau = \mathcal T_{\mathrm{IQL}}^\tau Q_{\mathrm{IQL}}^\tau, \qquad Q_n^\pi = \mathcal T_n^\pi Q_n^\pi.
$$

At the target level:

$$
\text{IQL }(\tau\to1):\ \operatorname*{ess\,sup}_{a'\sim\mu(\cdot\mid s')}Q(s',a'), \qquad \text{NQL}:\ \operatorname*{ess\,sup}_{\epsilon}Q_n(s',\pi(s',\epsilon'),\epsilon).
$$

Therefore, when $\pi(s',\epsilon')$ is expressive enough to realize near-maximizing actions and the induced targets are aligned,

$$
\operatorname*{ess\,sup}_{\epsilon'}Q_n^\pi(s,a,\epsilon')\approx Q_{\mathrm{IQL}}^{\tau\to1}(s,a).
$$
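The role of the expressiveness condition can be illustrated with a deliberately simplified toy check. It compares one-step targets only, not the fixed points, and it rests on strong assumptions that are not claims from the original: the critic `q_n` is taken to ignore its noise argument (so that the two targets are trivially aligned), the noise set is finite, and the names `Q_NEXT`, `NOISE`, `full_policy`, and `restricted_policy` are hypothetical.

```python
# Toy Q(s', a') for the actions in the support of mu(. | s').
Q_NEXT = {0: 0.0, 1: 1.0, 2: 2.0, 3: 5.0}
NOISE = [0, 1, 2, 3]
REWARD, GAMMA = 1.0, 0.9


def q_n(action, eps):
    # Assumption: the noise-conditioned critic ignores eps in this toy.
    return Q_NEXT[action]


def full_policy(eps):
    # Expressive case: every supported action is reachable from some eps'.
    return eps


def restricted_policy(eps):
    # Restricted case: only actions 0 and 1 are reachable.
    return eps % 2


def nql_sup_over_eps_prime(policy):
    """max over eps' of r + gamma * max over eps of Q_n(s', pi(s', eps'), eps)."""
    targets = [
        REWARD + GAMMA * max(q_n(policy(eps_prime), eps) for eps in NOISE)
        for eps_prime in NOISE
    ]
    return max(targets)


# IQL target in the tau -> 1 limit: r + gamma * max over supported actions.
iql_limit = REWARD + GAMMA * max(Q_NEXT.values())

print(iql_limit)                                  # roughly 5.5
print(nql_sup_over_eps_prime(full_policy))        # matches the IQL limit
print(nql_sup_over_eps_prime(restricted_policy))  # falls short, roughly 1.9
```

Under these assumptions the two targets agree only when the policy can reach the maximizing action; the restricted policy produces a strictly smaller value.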