<style>img { display: block; margin-left: auto; margin-right: auto;}</style>

[Paper link](https://arxiv.org/abs/2202.01771) | [Note link](https://blog.csdn.net/m0_48948682/article/details/131722172) | [Code link](https://github.com/ShuangLI59/Pre-Trained-Language-Models-for-Interactive-Decision-Making) | NeurIPS 2022

:::success
**Thoughts**
This study uses a language model as an agent to predict the next action in a reinforcement learning task. The authors also design a data collection approach that gathers data during exploration.
:::

## Abstract
Pre-trained language models (PLMs) are highly versatile in handling various language processing tasks.
This study leverages PLMs for general sequential decision-making problems.
In this approach, goals and observations are represented as a sequence of embeddings, and a policy network, initialized with a pre-trained language model, predicts the next action.

## Background
Language models (LMs) are capable of handling various tasks such as:
1. Instruction following
2. Vision-language navigation
3. Visual question answering

This study poses an intriguing question: can LMs be used as a general framework, even for tasks that do not involve language at all?

## Method
This study proposes LID, a framework that uses Pre-Trained **L**anguage Models for **I**nteractive **D**ecision-Making.
Additionally, the study introduces an active data gathering (ADG) procedure for settings where expert data is not readily available and agents must actively gather their own data instead.

Before diving into the method, it is important to understand the concepts of decision-making and language modeling.

### POMDPs and Policy Learning
Partially Observable Markov Decision Processes, commonly known as POMDPs, are defined by the following components:
1. **A set of states**: the different possible conditions or situations in which the agent can find itself.
2. **A set of observations**: the signals the agent receives, which provide partial information about the current state.
3. **A set of actions**: the decisions or moves the agent can make in response to its observations.
4. **A transition model**: denoted $\mathcal{T}(s_{t+1} \mid s_t, a_t)$, a probability distribution defining the likelihood of transitioning from state $s_t$ to state $s_{t+1}$ when action $a_t$ is taken.

The policy is a parametric model $\pi_\phi (a_t \mid g, h_t, o_t)$ that outputs the probability of selecting action $a_t$ given the goal $g$, the history $h_t$, and the current observation $o_t$. Here:
- $o_t$ represents the current observation.
- $h_t$ refers to the history, which includes all past observations and actions up to time $t$, specifically $\{ o_1, a_1, \dots, o_{t-1}, a_{t-1} \}$.

![image](https://hackmd.io/_uploads/rkSdkdNq0.png)

The proposed method can be summarized as follows:
1. **Sequence Conversion**: All policy inputs, including goals, observations, and history, are converted into a sequence of embeddings and used as input to a transformer encoder.
2. **Action Prediction**: The representations generated by the encoder are then fed into a task-specific decoder, which predicts the actions.

The training data consists of $N$ trajectories, $\mathcal{D} = \{ d^i \}_{i=1}^N$, where each trajectory is defined as $d^i = \{ g^i, o_1^i, a_1^i, \dots, o_{T_i}^i, a_{T_i}^i \}$, with $T_i$ being its length.

The loss function used for training the model is:

$$ \phi^\ast = \underset{\phi}{\arg \min} \left( - \sum_{i=1}^N \sum_{t=1}^{T_i} \ln \pi_\phi \left( a_t^i \mid g^i, h_t^i, o_t^i \right) \right)$$

This loss optimizes the model parameters $\phi$ by minimizing the negative log-likelihood of the actions given the goals, histories, and observations across all training trajectories.
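To make the sequence-conversion and action-prediction steps concrete, here is a minimal PyTorch sketch of this kind of policy together with the negative-log-likelihood objective above. It assumes the goal, history, and current observation have already been flattened into one feature sequence and that actions are discrete ids; `LIDStylePolicy`, `obs_proj`, and `action_head` are illustrative names, not the authors' implementation.

```python
import torch
import torch.nn as nn
from transformers import GPT2Model  # pre-trained LM used as the policy backbone


class LIDStylePolicy(nn.Module):
    """Sketch of a LID-style policy: goal/observation/history features are
    projected into the LM's embedding space, encoded by pre-trained GPT-2,
    and a small task-specific head predicts the next action."""

    def __init__(self, obs_dim: int, num_actions: int):
        super().__init__()
        self.encoder = GPT2Model.from_pretrained("gpt2")    # policy initializer
        d_model = self.encoder.config.n_embd                # 768 for GPT-2 small
        # Hypothetical projection: map task features to LM-sized embeddings.
        self.obs_proj = nn.Linear(obs_dim, d_model)
        self.action_head = nn.Linear(d_model, num_actions)  # task-specific decoder

    def forward(self, feature_seq: torch.Tensor) -> torch.Tensor:
        # feature_seq: (batch, seq_len, obs_dim), the goal, history, and
        # current observation already flattened into one sequence.
        token_embeds = self.obs_proj(feature_seq)
        hidden = self.encoder(inputs_embeds=token_embeds).last_hidden_state
        # Predict the next action from the final position of the sequence.
        return self.action_head(hidden[:, -1])


# Behaviour-cloning objective on dummy data.
policy = LIDStylePolicy(obs_dim=64, num_actions=8)
features = torch.randn(2, 10, 64)        # dummy (goal + history + o_t) sequences
expert_actions = torch.tensor([3, 5])    # dummy expert action ids
logits = policy(features)
loss = nn.functional.cross_entropy(logits, expert_actions)
loss.backward()
```

The cross-entropy term here corresponds to the $-\ln \pi_\phi(a_t^i \mid g^i, h_t^i, o_t^i)$ term in the objective above.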
### Language models as policy initializers
The language models used in this study are autoregressive and transformer-based. These models are trained to fit a probability distribution over a text sequence $\boldsymbol{y} = \{ y_i \}_{i=1}^n$ using the chain rule:

$$p(\boldsymbol{y}) = p(y_1) \prod_{i=2}^n p(y_i \mid y_1, \dots, y_{i-1})$$

In this framework, each token in the sequence is predicted based on all the preceding tokens. The model used in this study is GPT-2, a widely known autoregressive transformer-based language model.

![image](https://hackmd.io/_uploads/rJ6O1dE9R.png)

For data collection, the approach iteratively repeats three stages: exploration, hindsight relabeling, and policy update.
This active data gathering procedure enables learning an effective policy without relying on pre-collected expert data.
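The loop below is a schematic sketch of this iterative procedure, assuming the environment-specific pieces are supplied as callables; `explore`, `relabel`, and `update` are placeholders for exploration with the current policy, hindsight relabeling of goals, and the supervised policy update, not the authors' API.

```python
from typing import Callable, List, Tuple

Trajectory = Tuple  # placeholder: (goal, observations, actions)


def active_data_gathering(
    policy,
    explore: Callable[[object], List[Trajectory]],            # roll out the current policy
    relabel: Callable[[List[Trajectory]], List[Trajectory]],  # hindsight relabeling of goals
    update: Callable[[object, List[Trajectory]], object],     # supervised policy update
    num_iterations: int = 100,
):
    """Schematic loop: exploration -> hindsight relabeling -> policy update."""
    dataset: List[Trajectory] = []
    for _ in range(num_iterations):
        # 1. Exploration: gather new trajectories with the current policy.
        new_trajectories = explore(policy)
        # 2. Hindsight relabeling: failed episodes are kept by relabeling them
        #    with a goal the trajectory did achieve, so they still provide
        #    useful (goal, action) supervision.
        dataset += relabel(new_trajectories)
        # 3. Policy update: the same negative-log-likelihood training as with
        #    expert demonstrations, but on the actively gathered dataset.
        policy = update(policy, dataset)
    return policy
```

In this way the agent bootstraps its own training data instead of relying on expert demonstrations.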
## Experiment
The table below shows the success rates on BabyAI tasks.

![image](https://hackmd.io/_uploads/BkBTku45C.png)

The table below shows the success rates on VirtualHome tasks.

![image](https://hackmd.io/_uploads/H1TAk_Nc0.png)