Let $\gamma \in (0,1)$, $Q : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, $Q_n : \mathcal{S} \times \mathcal{A} \times \mathcal{E} \to \mathbb{R}$, and $\pi : \mathcal{S} \times \mathcal{E} \to \mathcal{A}$.
We write one-step sampled transitions as $(s, a, r, s')$, and let $\mu(a \mid s)$ denote the behavior (data-collection) action distribution.
For a random variable $Z$, define the $\tau$-expectile:
\[
\mathrm{Expectile}_\tau(Z) := \arg\min_v \, \mathbb{E}\!\left[\,\big|\tau - \mathbf{1}(Z - v < 0)\big|\,(Z - v)^2\,\right].
\]
As $\tau \to 1$,
\[
\mathrm{Expectile}_\tau(Z) \to \operatorname{ess\,sup} Z.
\]
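Numerically, the expectile can be computed by an iteratively reweighted mean: at the minimizer, the first-order condition gives $v = \sum_i w_i z_i / \sum_i w_i$ with $w_i = |\tau - \mathbf{1}(z_i < v)|$. A minimal NumPy sketch (the solver and sample values are illustrative, not from the source):

```python
import numpy as np

def expectile(z, tau, iters=200):
    """Solve argmin_v E[|tau - 1(z - v < 0)| (z - v)^2] over samples z
    by fixed-point iteration on the reweighted-mean optimality condition."""
    z = np.asarray(z, dtype=float)
    v = z.mean()  # the tau = 0.5 solution is exactly the mean
    for _ in range(iters):
        w = np.abs(tau - (z < v).astype(float))  # asymmetric weights
        v = (w * z).sum() / w.sum()              # weighted-mean update
    return v

z = np.arange(10)
expectile(z, 0.5)    # equals the mean, 4.5
expectile(z, 0.999)  # close to max(z) = 9, illustrating the tau -> 1 limit
```

For finite samples the $\tau \to 1$ limit is the sample maximum, matching the essential-supremum statement above.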
Applied to actions at state $s'$ (with $a' \sim \mu(\cdot \mid s')$):
\[
\mathrm{Expectile}_\tau\big(Q(s', a')\big) \to \operatorname*{ess\,sup}_{a' \sim \mu(\cdot \mid s')} Q(s', a').
\]
IQL value target:
\[
V_\tau(s') := \mathrm{Expectile}_\tau\big(Q(s', a')\big), \qquad a' \sim \mu(\cdot \mid s').
\]
IQL Bellman update:
\[
(T^\tau_{\mathrm{IQL}} Q)(s, a) := r(s, a) + \gamma\, V_\tau(s').
\]
In the limit $\tau \to 1$:
\[
(T^\tau_{\mathrm{IQL}} Q)(s, a) \approx r(s, a) + \gamma \operatorname*{ess\,sup}_{a' \sim \mu(\cdot \mid s')} Q(s', a').
\]
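A toy instance of this backup makes the effect of $\tau$ concrete. The sketch below uses hypothetical Q-values for the behavior actions at $s'$ (none of these numbers come from the source); as $\tau$ rises, the target moves from a mean-like backup toward $r + \gamma \max_{a'} Q(s', a')$:

```python
import numpy as np

def expectile(z, tau, iters=200):
    # argmin_v E[|tau - 1(z - v < 0)| (z - v)^2], via reweighted-mean iteration.
    z = np.asarray(z, dtype=float)
    v = z.mean()
    for _ in range(iters):
        w = np.abs(tau - (z < v).astype(float))
        v = (w * z).sum() / w.sum()
    return v

def iql_target(r, q_next_actions, tau, gamma=0.99):
    # r(s, a) + gamma * V_tau(s'), where V_tau(s') is the tau-expectile of
    # Q(s', a') over the dataset actions a' ~ mu(.|s').
    return r + gamma * expectile(q_next_actions, tau)

# Hypothetical Q-values of the behavior actions recorded at s'.
q_next = [0.0, 1.0, 4.0]
target_mid = iql_target(1.0, q_next, tau=0.5)    # mean-like backup
target_hi = iql_target(1.0, q_next, tau=0.999)   # near r + gamma * max Q
```

Because the expectile never exceeds the sample maximum, the $\tau \to 1$ target approaches $r + \gamma \max Q$ from below.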
NQL in FAN uses a noise-conditioned essential-supremum target:
\[
(T^\pi_n Q_n)(s, a, \epsilon') := r(s, a) + \gamma \operatorname*{ess\,sup}_{\epsilon} Q_n\big(s', \pi(s', \epsilon'), \epsilon\big).
\]
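In practice the essential supremum over $\epsilon$ can be approximated by a max over sampled noise. The sketch below is a toy empirical version of this operator; `pi` and `q_n` are hypothetical smooth stand-ins for the noise-conditioned policy and critic, not the paper's networks:

```python
import numpy as np

# Hypothetical stand-ins (toy choices, not from the source):
def pi(s_next, eps):
    # Noise-conditioned policy pi(s', eps).
    return np.tanh(s_next + eps)

def q_n(s_next, a, eps):
    # Noise-conditioned critic Q_n(s', a, eps).
    return -(a - 0.5 * s_next) ** 2 + 0.1 * eps

def nql_target(r, s_next, eps_prime, eps_samples, gamma=0.99):
    """Empirical version of r + gamma * ess sup_eps Q_n(s', pi(s', eps'), eps),
    with the essential supremum replaced by a max over sampled noise."""
    a_next = pi(s_next, eps_prime)  # next action fixed by the conditioning noise eps'
    return r + gamma * np.max(q_n(s_next, a_next, np.asarray(eps_samples)))

eps_samples = np.array([-1.0, 0.0, 1.0])
nql_target(1.0, 1.0, 0.0, eps_samples)
```

More noise samples can only raise the empirical max, so the estimate approaches the essential supremum from below as the sample count grows.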
Define the fixed points:
\[
Q^\tau_{\mathrm{IQL}} := T^\tau_{\mathrm{IQL}} Q^\tau_{\mathrm{IQL}}, \qquad Q^\pi_n := T^\pi_n Q^\pi_n.
\]
At the target level:
\[
\text{IQL } (\tau \to 1):\ \operatorname*{ess\,sup}_{a' \sim \mu(\cdot \mid s')} Q(s', a'), \qquad
\text{NQL}:\ \operatorname*{ess\,sup}_{\epsilon} Q_n\big(s', \pi(s', \epsilon'), \epsilon\big).
\]
Therefore, when π(s′,ϵ′) is expressive enough to realize near-maximizing actions and the induced targets are aligned,
\[
\operatorname*{ess\,sup}_{\epsilon'} Q^\pi_n(s, a, \epsilon') \approx Q^{\tau \to 1}_{\mathrm{IQL}}(s, a).
\]
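A single-backup numeric sanity check of this correspondence, under the alignment assumption that the noise-conditioned policy's sampled actions realize the same Q-values as the dataset actions (all numbers below are hypothetical):

```python
import numpy as np

def expectile(z, tau, iters=200):
    # argmin_v E[|tau - 1(z - v < 0)| (z - v)^2] via reweighted-mean iteration.
    z = np.asarray(z, dtype=float)
    v = z.mean()
    for _ in range(iters):
        w = np.abs(tau - (z < v).astype(float))
        v = (w * z).sum() / w.sum()
    return v

# Hypothetical Q-values of the dataset actions at a single next state s'.
q_dataset = np.array([0.2, 1.0, 3.0, 2.4])

# Alignment assumption: the policy's actions over sampled noise realize the
# same values, so the noise sup ranges over the same set.
q_policy_over_noise = q_dataset.copy()

r, gamma = 0.5, 0.99
target_iql = r + gamma * expectile(q_dataset, tau=0.999)  # expectile backup
target_nql = r + gamma * q_policy_over_noise.max()        # noise-sup backup
```

With $\tau$ close to 1 the two targets nearly coincide, with the expectile target slightly below the max, consistent with the approximation above.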