[Deep Reinforcement Learning (DRL) Quick Practice] Deep Q-learning

CODE_RabbitV

Last modified: 2025-04-26 16:53:32

Category: DRL. Tags: algorithms

First published: 2025-04-26 14:08:55

Article link: https://blog.csdn.net/code_rabbitv/article/details/147531266

Deep Q-learning (2015, DeepMind): Core Improvements

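Q-learning is a classic reinforcement learning algorithm: it learns an action-value function (the Q function) and uses it to find an optimal policy in an unknown environment.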

Model-free, off-policy, value-based

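Core improvement	Description
Q-function approximation	Approximates Q(s, a) with a deep neural network, which avoids storing one table entry per state-action pair as traditional Q-learning does (infeasible in large state spaces)
Experience replay	Stores the agent's past transitions in a buffer and samples from it, which breaks the temporal correlation between consecutive samples and makes training more stable
Target network	Computes the targets with a Q network whose weights are updated with a delay, which reduces oscillation of the Q values and improves stability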
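
The last two rows of the table can be sketched in a few lines. This is a minimal illustration, not the original DQN code: the names `ReplayBuffer` and `sync_target`, and the choice to represent network weights as a dict of NumPy arrays, are assumptions made for this example.

```python
import random
from collections import deque

import numpy as np


class ReplayBuffer:
    """Fixed-size FIFO buffer of (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # the oldest transitions are dropped first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks the temporal correlation between transitions
        batch = random.sample(list(self.buffer), batch_size)
        states, actions, rewards, next_states, dones = map(np.array, zip(*batch))
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)


def sync_target(online_params, target_params):
    """Hard update: copy the online weights into the target network (hypothetical helper)."""
    for name, value in online_params.items():
        target_params[name] = value.copy()
```

In a training loop, `sync_target` would typically be called only every fixed number of updates, so the targets stay constant in between.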

Deep Q-learning Network Update

$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) \right]$
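
As a small worked example, the bracketed target $r_t + \gamma \max_a Q(s_{t+1}, a)$ can be computed for a batch of transitions as follows. The function name `td_targets` and the `dones` mask (which switches off bootstrapping on terminal transitions) are assumptions added for this sketch; the numbers are made up.

```python
import numpy as np


def td_targets(rewards, next_q_values, dones, gamma=0.9):
    """y = r + gamma * max_a Q(s', a), with no bootstrapping on terminal steps."""
    max_next_q = next_q_values.max(axis=1)
    return rewards + gamma * (1.0 - dones) * max_next_q


rewards = np.array([1.0, -1.0])
next_q = np.array([[0.5, 2.0],    # Q(s', .) for the first transition
                   [3.0, 1.0]])   # Q(s', .) for the second transition (terminal)
dones = np.array([0.0, 1.0])

y = td_targets(rewards, next_q, dones)
print(y)  # first entry: 1 + 0.9 * 2.0, second entry: -1 (terminal, no bootstrap)
```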

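[Key property] Convergence of Q-learning: the update is grounded in the Bellman optimality equation. For tabular Q-learning, as long as every state-action pair is visited infinitely often and the learning rate decays appropriately, the Q values are guaranteed to converge to the optimal Q function, from which the optimal policy follows. This guarantee does not automatically carry over once Q is approximated by a neural network.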

Deep Q-learning: Problem Analysis and Solution Tips

Q values are usually overestimated: because of the max in the target, Q-learning tends to select the action whose value is overestimated

Solution: Double DQN. The core idea is to separate selecting an action from evaluating it and to use different Q networks for the two steps: select with $Q$, evaluate with $Q'$ (tip: $Q'$ can simply be the target network already used in Deep Q-learning, which requires the fewest changes)
Original DQN: $r_t + \gamma \max_a Q(s_{t+1}, a)$ → Double DQN: $r_t + \gamma Q'(s_{t+1}, \arg\max_a Q(s_{t+1}, a))$
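
The difference between the two targets can be seen on a single transition. This is only a sketch with made-up Q values; the function names `dqn_target` and `double_dqn_target` are illustrative.

```python
import numpy as np


def dqn_target(reward, q_next, gamma=0.9):
    """Original target as written above: the same Q selects and evaluates the action."""
    return reward + gamma * q_next.max()


def double_dqn_target(reward, q_next_online, q_next_target, gamma=0.9):
    """Double DQN: the online Q selects the action, the target network Q' evaluates it."""
    best_action = int(np.argmax(q_next_online))
    return reward + gamma * q_next_target[best_action]


q_online = np.array([1.0, 3.0, 2.0])  # Q(s_{t+1}, .): action 1 looks best
q_target = np.array([1.5, 2.0, 2.5])  # Q'(s_{t+1}, .)

print(dqn_target(1.0, q_online))                   # 1 + 0.9 * 3.0
print(double_dqn_target(1.0, q_online, q_target))  # 1 + 0.9 * 2.0
```

If the online network overestimates action 1, the second target is less affected, because the value it uses comes from a different set of weights.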

Low data efficiency: the Q value of an action $a$ that has never been selected is hard to update

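Solution: Dueling DQN. The core idea is to make better use of the sampled data: a sample (s, a) updates not only the Q value of a but also the Q values of the other actions a' in s. Concretely, the network architecture is changed so that the single Q prediction is split into V and Advantage, which are predicted separately and then added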

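[Detail] Constrain the Advantages of all actions a in state s to sum to 0, which pushes the network to actually learn V (otherwise it could learn V as 0 everywhere and degenerate into the original DQN)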
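
One common way to enforce the zero-sum constraint is to subtract the mean advantage before adding V. The sketch below shows only this aggregation step with made-up numbers; in a real network, `value` and `advantages` would be the outputs of two heads, and the name `dueling_q` is an assumption for this example.

```python
import numpy as np


def dueling_q(value, advantages):
    """Combine V(s) and A(s, a) into Q(s, a) using zero-mean advantages."""
    centered = advantages - advantages.mean(axis=-1, keepdims=True)
    return value + centered


value = np.array([[2.0]])                  # V(s), shape (batch, 1)
advantages = np.array([[1.0, -1.0, 3.0]])  # raw A(s, a), shape (batch, n_actions)

q = dueling_q(value, advantages)
print(q)               # [[2. 0. 4.]]
print(q.mean(axis=1))  # equals V(s), because the centered advantages sum to zero
```

Because V(s) is shared by all actions, a gradient step on one sampled action also shifts the Q values of the other actions in the same state.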

In practice there are many more improvements of this kind; see Rainbow

Quick Code Example

import numpy as np
import random

# Note: this example is tabular Q-learning (a Q table, no neural network);
# DQN replaces the table with a neural-network approximation of Q
class QLearningAgent:
    def __init__(self, n_actions, n_states, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.n_actions = n_actions
        self.n_states = n_states
        self.alpha = alpha      # Learning rate
        self.gamma = gamma      # Discount factor
        self.epsilon = epsilon  # Exploration probability
        self.Q = np.zeros((n_states, n_actions))  # Initialize the Q table

    def choose_action(self, state):
        # Epsilon-greedy action selection
        if random.uniform(0, 1) < self.epsilon:
            return random.randint(0, self.n_actions - 1)  # Explore
        else:
            return int(np.argmax(self.Q[state]))  # Exploit

    def update(self, state, action, reward, next_state):
        # TD target: r + gamma * max_a Q(s', a)
        td_target = reward + self.gamma * np.max(self.Q[next_state])
        self.Q[state, action] += self.alpha * (td_target - self.Q[state, action])

# Example: applying Q-learning to a simple chain problem
def q_learning_example():
    n_states = 5  # Number of states
    n_actions = 2  # Number of actions

    agent = QLearningAgent(n_actions, n_states)
    n_episodes = 1000  # Number of training episodes

    for episode in range(n_episodes):
        state = 0  # Each episode starts from state 0
        done = False
        while not done:
            action = agent.choose_action(state)

            # Toy reward and transition logic
            if action == 0:  # Move left
                next_state = max(state - 1, 0)
                reward = -1
            else:  # Move right
                next_state = min(state + 1, n_states - 1)
                reward = 1

            # The rightmost state is terminal; its Q row is never updated and stays 0,
            # so the update below does not bootstrap past the end of the episode
            if next_state == n_states - 1:
                done = True

            agent.update(state, action, reward, next_state)
            state = next_state

    return agent.Q

# Train and print the result
Q_table = q_learning_example()
print("Q table after training:")
print(Q_table)
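
The example above is purely tabular. As a rough sketch of how experience replay and a target network fit into the same toy chain environment, the loop below adds both. To stay dependency-free it uses a linear Q function over one-hot states (mathematically still a table) as a stand-in for a deep network. That simplification, and the hyperparameters `lr`, `batch_size` and `sync_every`, are assumptions of this sketch, not part of the original post.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 2
gamma, lr, epsilon = 0.9, 0.5, 0.1
batch_size, sync_every = 16, 20

W = np.zeros((n_states, n_actions))  # online "network": Q(s, .) = one_hot(s) @ W = W[s]
W_target = W.copy()                  # target network: delayed copy of W
replay = []                          # unbounded replay list, for brevity


def step(state, action):
    """Same toy chain as above: left gives -1, right gives +1, rightmost state is terminal."""
    if action == 0:
        next_state, reward = max(state - 1, 0), -1.0
    else:
        next_state, reward = min(state + 1, n_states - 1), 1.0
    return next_state, reward, next_state == n_states - 1


updates = 0
for episode in range(300):
    state, done = 0, False
    while not done:
        # Epsilon-greedy action from the online network
        if rng.random() < epsilon:
            action = int(rng.integers(n_actions))
        else:
            action = int(np.argmax(W[state]))
        next_state, reward, done = step(state, action)
        replay.append((state, action, reward, next_state, float(done)))
        state = next_state

        if len(replay) >= batch_size:
            # Sample a random minibatch of past transitions
            idx = rng.integers(len(replay), size=batch_size)
            s, a, r, s2, d = map(np.array, zip(*[replay[i] for i in idx]))
            # Targets come from the delayed target network, not the online one
            y = r + gamma * (1.0 - d) * W_target[s2].max(axis=1)
            td_error = y - W[s, a]
            # Gradient step on the mean squared TD error (duplicates accumulate)
            np.add.at(W, (s, a), lr * td_error / batch_size)
            updates += 1
            if updates % sync_every == 0:
                W_target = W.copy()  # periodic hard update of the target network

print(np.argmax(W[:n_states - 1], axis=1))  # greedy action in each non-terminal state
```

Replacing `W` with a neural network and the manual update with an optimizer step on the same TD error would give the usual DQN training loop.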