
CraftAssist Instruction Parsing: Semantic Parsing For A Minecraft Assistant
We propose a large-scale semantic parsing dataset focused on instruction-driven communication with an agent in Minecraft. We describe the data collection process, which yields an additional 35K human-generated instructions with their semantic annotations. We report the performance of three baseline models and find that while a dataset of this size helps us train a usable instruction parser, it still poses interesting generalization challenges which we hope will help develop better and more robust models.

Semantic parsing is used as a component for natural language understanding in human-robot interaction systems Tellex et al. (2011); Matuszek et al. (2013), and for virtual assistants Kollar et al. (2018). Recently, researchers have shown success with deep learning methods for semantic parsing, e.g. Dong and Lapata (2016); Jia and Liang (2016); Zhong et al. (2017). However, to fully utilize powerful neural network approaches, it is necessary to have large numbers of training examples. In the space of human-robot (or human-assistant) interaction, the publicly available semantic parsing datasets are small. Furthermore, it can be difficult to reproduce the end-to-end results (from utterance to action) because of the wide variety of robot setups and proprietary nature of personal assistants.

In this work, we introduce a new semantic parsing dataset for human-bot interactions. Our “robot” or “assistant” is embodied in Minecraft (https://minecraft.net/en-us/), a popular multiplayer open-world voxel-based sandbox construction game; we limit ourselves to creative mode for this work. We also provide the associated platform for executing the logical forms in game.

Situating the assistant in Minecraft has several benefits for studying task oriented natural language understanding (NLU). Compared to physical robots, Minecraft allows less technical overhead irrelevant to NLU, such as difficulties with hardware and large scale data collection. On the other hand, our bot has all the basic in-game capabilities of a player, including movement and placing or removing voxels. Thus Minecraft preserves many of the NLU elements of physical robots, such as discussions of navigation and spatial object reference.

Furthermore, working in Minecraft may enable large scale human interaction because of its large player base, in the tens of millions. Although Minecraft’s simulation of physics is simplified, the task space is complex. There are many atomic objects in Minecraft, such as animals and block-types, that require no perceptual modeling. For researchers interested in the interactions between perception and language, collections of voxels making up a “house” or a “hill” are not atomic objects and the assistant cannot apprehend them without a perceptual system.

Our contributions in this paper are as follows: Grammar: We develop a set of action primitives and a grammar over these primitives that comprise a mid-level interface to Minecraft for machine learning agents; see Section 3. Data: Using a collection of language templates to convert logical forms over the primitives into pseudo-natural language, we build a dataset of language instructions with logical form annotations by having crowd-sourced workers rephrase the language outputs, as in Wang et al. (2015). We also collect a test set of crowd-sourced annotations of commands generated independently of our grammar. In addition to the natural language commands and the associated logical forms, we make available the code to execute these in the game, allowing the reproduction of end-to-end results; see Section 4. Models: We show the results of several neural semantic parsing models trained on our data; see Sections 5 and 6. We will also provide access to an interactive bot using these models for parsing (instructions will be available at http://craftassist.io/acl2019demo).

There have been a number of datasets of natural language paired with logical forms to evaluate semantic parsing approaches, e.g. Price (1990); Tang and Mooney (2001); Cai and Yates (2013); Wang et al. (2015); Zhong et al. (2017). The dataset presented in this work is an order of magnitude larger than those in Price (1990); Tang and Mooney (2001); Cai and Yates (2013) and is similar in scale to Wang et al. (2015); Zhong et al. (2017). We use the data collection strategy in Wang et al. (2015) to build the pairings between logical forms and natural language: first building the grammar, then generating from the grammar via templates, and then using crowd-sourced workers to rephrase the templated generations. However, we also collect a test set of “free” commands and use crowd-sourced workers to annotate these.

In addition to connecting natural language to logical forms, our dataset connects both of these to a dynamic environment. In Tellex et al. (2011); Matuszek et al. (2013) semantic parsing has been used for interpreting natural language commands for robots. In our paper, the “robot” is embodied in the Minecraft game instead of in the physical world.

Semantic parsing in a voxel-world recalls Wang et al. (2017), where the authors describe a method for building up a programming language from a small core via interactions with players.

We demonstrate the results of several neural parsing models on our dataset. In particular, we show the results of a reimplementation of Dong and Lapata (2016) adapted to our grammar. There have been several other papers proposing neural architectures for semantic parsing, for example Jia and Liang (2016); Zhong et al. (2017). In those papers, as in this one, the models are trained with full supervision of the mapping from natural language to logical forms, without considering the results of executing the logical form (in this case, the effect on the environment of executing the actions denoted by the logical form). There has been work towards “weakly supervised” semantic parsing Artzi and Zettlemoyer (2013); Liang et al. (2016); Guu et al. (2017) where the logical forms are hidden variables, and the only supervision given is the result of executing the logical form. There are now approaches that have shown promise without even passing through (discrete) logical forms at all Riedel et al. (2016); Neelakantan et al. (2016). We hope that the dataset introduced here, which has supervision at the level of the logical forms, but whose underlying grammar and environment can be used to generate essentially unlimited weakly supervised data or execution rewards, will also be useful for studying these models.

Minecraft, especially via the MALMO project Johnson et al. (2016), has been used as a base environment for several machine learning papers. Often Minecraft is used as a testbed for reinforcement learning Shu et al. (2017); Udagawa et al. (2016); Alaniz (2018); Oh et al. (2016); Tessler et al. (2017). In these papers, the agent is trained to complete tasks by issuing low-level actions (as opposed to our higher-level primitives) and receiving a reward on success. Some of these papers (e.g. Oh et al. (2017)) do consider simplified, templated language as a method for composably specifying tasks, but training an RL agent to execute the scripted primitives in our grammar is already nontrivial, and so the task space and language are more constrained than what we use here. Nevertheless, our work may be useful to researchers interested in RL: our grammar and in-game execution can supply (hard) tasks and descriptions. Another set of papers Kitaev and Klein (2017); Yi et al. (2018) have used Minecraft for visual question answering with logical forms. Our work extends these to interactions with the environment. Finally, Allison et al. (2018) is a more focused study on how a human might interact with a Minecraft agent; our collection of free generations (see 4.2.2) includes annotated examples from similar studies of players interacting with a player pretending to be a bot.

3 A Natural Language Interface

We want to interpret natural language commands given to an agent with a pre-defined set of capabilities. We start by providing an overview of these capabilities and the action space that they entail, then define a grammar to capture this action space.

3.1 Agent Action Space

The goal of the proposed agent is to help a player create structures and mechanisms in a voxelized world by moving around, and placing and removing blocks. To this end, the agent needs to be able to understand a number of high-level commands, which we present here.

Basic action commands

First, we need commands corresponding to high-level actions of the agent. For example, we may ask it to build an object from a known schematic, to copy an existing structure at a given location, or to destroy one. Similarly, it might be useful to be able to ask the agent to dig a hole of a given shape at a specified location, or on the contrary to fill one up. The agent can also be asked to complete an already-started structure however it sees fit (this action is called freebuild), or to spawn game mobs. Finally, we need to be able to direct the agent to move to a location.

Teaching and querying the bot

In order to understand most of the above commands, the agent needs to have an internal representation of the world. We want to be able to add to this representation by allowing the user to tag existing objects with names or properties. This can be considered a basic version of the self-improvement capabilities in Kollar et al. (2013); Thomason et al. (2015); Wang et al. (2016, 2017). Conversely, to query this internal state, we can ask the agent to answer questions about the world. This part of the grammar is similar to the visual question-answering in Yi et al. (2018).

Control commands

Additionally, we want to be able to ask the agent to stop or resume an action, or to undo the result of a recent command. Finally, the agent needs to be able to understand when a sentence does not correspond to any of the above mentioned actions, and map it to a noop command.

3.2 Parsing Grammar

All of the above commands are represented as trees encoding all of the information necessary for their execution. Figure 1 presents an example parse tree for the build command “Make three oak wood houses to the left of the dark grey church.”

Internal nodes

Each action has a set of possible arguments, which themselves have a recursive argument structure. Each of these action types and complex arguments corresponds to an internal node (blue rounded rectangles in Figure 1), with its children providing more specific information. For example, the build action can specify a schematic (what we want to build) and a location child (where we want to build it). In turn, the schematic can specify a general category (house, bridge, temple, etc…), as well as a set of properties (size, color, building material, etc…), and in our case also has a repeat child subtree specifying how many we want to build. Similarly, the location can specify an absolute location, a distance, direction, and information about the location reference object stored in a child subtree.

One notable feature of this representation is that we do not know a priori which of a node’s possible children will be specified. For example, build can have a schematic and a location specified (“Build a house over there.”), just a schematic (“Build a house.”), just a location (“Build something next to the bridge.”), or neither (“Make something.”).

The full grammar is specified in Figure 3. In addition to the various location, reference object, schematic, and repeat nodes which can be found at various levels, another notable subtree is the action’s stop condition, which essentially allows the agent to understand “while” loops (for example: “dig down until you hit the bedrock” or “follow me”).

Leaf nodes

Eventually, arguments have to be specified in terms of values which correspond to agent primitives. We call these nodes categorical leaves (green rectangles in Figure 1). The root of the tree has a categorical leaf child which specifies the action type, build in our example. There are also nodes specifying the repeat type in the repeat sub-tree (“make three houses” corresponds to executing a for loop), the location type (the location is given in reference to the block object that is the “dark grey church”), and the relative direction to the reference, here left.

However, there are limits to what we can represent with a pre-specified set of hard-coded primitives, especially if we want our agent to be able to learn new concepts or new values. Additionally, even when there is a pre-specified agent primitive, mapping some parts of the command to a specific value might be better left to an external module (e.g. mapping a number string to an integer value). For both of these reasons, we also have span leaves (red ovals in Figure 1). This way, a model can learn to generalize to e.g. colors or size descriptions that it has never seen before. The schematic is specified by the command sub-string corresponding to its name (“houses”) and the requested block type (“oak wood”). The range of the for loop is specified by the repeat’s for value (“three”), and the reference object is denoted in the command by its generic name and specific color (“church” and “dark grey”).
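To make the representation concrete, the tree for the Figure 1 command can be rendered as a nested dictionary. This is only a sketch: the key names below follow the node types described in this section but are not necessarily the dataset's exact schema, and span leaves are shown as (start, end) word indices.

```python
# Illustrative action tree for:
# "Make three oak wood houses to the left of the dark grey church."
# Internal nodes are dicts, categorical leaves are fixed values,
# and span leaves are (start, end) word-index pairs.
command = "Make three oak wood houses to the left of the dark grey church."
words = command.rstrip(".").split()

def span(phrase):
    # Return the (start, end) word indices of `phrase` within the command.
    tokens = phrase.split()
    for i in range(len(words) - len(tokens) + 1):
        if words[i:i + len(tokens)] == tokens:
            return (i, i + len(tokens) - 1)
    raise ValueError(phrase)

action_tree = {
    "action_type": "build",                      # categorical leaf
    "schematic": {                               # internal node
        "name": span("houses"),                  # span leaf
        "block_type": span("oak wood"),          # span leaf
        "repeat": {"repeat_type": "for", "count": span("three")},
    },
    "location": {                                # internal node
        "location_type": "reference_object",     # categorical leaf
        "relative_direction": "left",            # categorical leaf
        "reference_object": {
            "name": span("church"),
            "color": span("dark grey"),
        },
    },
}
```

Note that only the children actually mentioned in the command appear in the tree; a bare "Make something." would omit the schematic properties and location subtree entirely.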

4 The CAIP Dataset

This paper introduces the CraftAssist Instruction Parsing (CAIP) dataset of English-language commands and their associated “action trees”, as defined in Section 3 (see Appendix A for examples and a full grammar specification). CAIP is a composite dataset containing a combination of algorithmically generated commands and human-written natural language commands.

4.1 Generated Data

We start by algorithmically generating action trees (logical forms over the grammar) with associated surface forms through the use of templates. To that end, we first define a set of template objects, which link an atomic concept in the game world to the several ways it can be described in language. For example, the template object Move links the action type move to the utterances go, walk, move, etc. Likewise, the template object RelativeDirection links all of the direction primitives to their names. Some template objects also have purely linguistic functions, making a sentence more natural without referring to any information relevant to the tree. For example, the object ALittle can be realized as a bit, a little, somewhat, etc.

Then, we build templates for each action as recursive sequences of templates and template objects. For each of these templates, we can sample a game value and its corresponding string; by concatenating these, we obtain an action tree and its corresponding language description. Consider for example the template [Move, ALittle, RelativeDirection] made up of the template objects described above. One possible realization is the description go a little to the left, paired with an action tree specifying the action type as move, with an action location sub-tree containing a relative direction categorical child node whose value is left. Finally, in addition to the action-specific templates, we also generate training data for the noop action type by sampling dialogue lines from the Cornell Movie Dataset Danescu-Niculescu-Mizil and Lee (2011).

We wrote 3,900 templates in total. We can create a training example for a parsing model by choosing one of them at random, and then sampling a (description, tree) pair from it, which, given the variety and modularity of the template objects, yields virtually unlimited data (for practical reasons, we pre-generate a set of 800K training, 5K validation, and 5K test examples for our experiments). The complete list of templates and template objects is included in the Supplementary Material.
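The generation process above can be sketched in a few lines. The template objects and their surface strings below are taken from the paper's own examples (Move, ALittle, RelativeDirection); the tree fragments and the merge logic are illustrative assumptions, not the released generator.

```python
import random

# Each template object maps surface strings to tree fragments.
# (Fragment shapes are illustrative; the real dataset's schema may differ.)
MOVE = [("go", {"action_type": "move"}),
        ("walk", {"action_type": "move"}),
        ("move", {"action_type": "move"})]
A_LITTLE = [("a bit", {}), ("a little", {}), ("somewhat", {})]  # purely linguistic
REL_DIR = [("to the left", {"location": {"relative_direction": "left"}}),
           ("to the right", {"location": {"relative_direction": "right"}})]

def merge(tree, frag):
    # Recursively merge a sampled tree fragment into the action tree.
    for k, v in frag.items():
        if isinstance(v, dict):
            merge(tree.setdefault(k, {}), v)
        else:
            tree[k] = v

def sample(template, rng=random):
    """Sample one (description, action_tree) pair from a template."""
    parts, tree = [], {}
    for obj in template:
        text, frag = rng.choice(obj)
        parts.append(text)
        merge(tree, frag)
    return " ".join(parts), tree

desc, tree = sample([MOVE, A_LITTLE, REL_DIR], random.Random(0))
```

Because every template object contributes both a string and a tree fragment, the description and its annotation stay aligned by construction, which is what makes pre-generating hundreds of thousands of examples cheap.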

4.2 Collected Data

To supplement the generated data, natural language commands written by crowd-sourced workers were collected in a variety of settings.

4.2.1 Rephrases

While the template generations yield a great variety of language, they cannot cover all possible ways of phrasing a specific instruction. In order to supplement them, we asked crowd-sourced workers to rephrase some of the produced instructions into commands in alternate, natural English that does not change the meaning of the sentence. This setup enables the collection of unique English commands whose action trees are already known. Note that a rephrased sentence will have the same action tree structure, but the positions of the words corresponding to span nodes may change. To account for this, words contained in a span range in the original sentence are highlighted in the task, and crowd-sourced workers are asked to highlight the corresponding words in their rephrased sentence. Then the action tree span values are substituted for the rephrased sentence to get the corresponding tree. This yields a total of 32K rephrases. We use 30K for training, 1K for validation, and 1K for testing.
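The span re-alignment step described above can be sketched as follows, under the assumption that each span leaf can be addressed by a unique key (a hypothetical simplification; the real tool tracks highlighted word ranges per span):

```python
# Sketch: rewrite the span leaves of an action tree so they point into the
# rephrased sentence. `highlighted` maps span-leaf names to the (start, end)
# word indices the worker marked in their rephrase.
def realign_spans(tree, highlighted):
    out = {}
    for key, value in tree.items():
        if isinstance(value, dict):
            out[key] = realign_spans(value, highlighted)  # recurse into internal nodes
        elif key in highlighted:
            out[key] = highlighted[key]                   # span leaf: new indices
        else:
            out[key] = value                              # categorical leaf: unchanged
    return out

# Original: "Make three oak wood houses ..." with name span at word 4;
# rephrase: "please put up three houses ..." with name span at word 4 as well,
# but e.g. the count moves from word 1 to word 3.
orig_tree = {"action_type": "build",
             "schematic": {"name": (4, 4), "count": (1, 1)}}
new_tree = realign_spans(orig_tree, {"name": (4, 4), "count": (3, 3)})
```

The tree structure is copied verbatim; only span indices change, matching the observation that a rephrase preserves the action tree but moves the words that span nodes point to.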

4.2.2 Image and Text Prompts

We also presented crowd-sourced workers with a description of the capabilities of an assistant bot in a creative virtual environment (which matches the set of allowed actions in the grammar), and (optionally) some images of a bot in a game environment. They were then asked to provide examples of commands that they might issue to an in-game assistant. We refer to these instructions as “prompts” in the rest of this paper. The complete instructions shown to workers are included in appendix 19.

4.2.3 Interactive Gameplay

We asked crowd-sourced workers to play creative-mode Minecraft with an assistant bot, and they were instructed to use the in-game chat to direct the bot in whatever way they chose. The exact instructions are included in appendix B.2. Players in this setting had no prior knowledge of the bot’s capabilities or the parsing grammar.

4.2.4 Annotation Tool

Both prompts and interactive instructions come without a reference tree and need to be annotated. To facilitate this process, we designed a web-based tool which asks users a series of multiple-choice questions to determine the semantic content of a sentence. The responses to some questions will prompt other more specific questions, in a process that mirrors the hierarchical structure of the grammar. The responses are then processed to produce an action tree. This allows crowd-sourced workers to provide annotations with no knowledge of the specifics of the grammar described above. For each sentence annotated with the tool, three responses from distinct users were collected, and a sentence was included in the dataset only if at least two out of three responses matched exactly. This yields 1265 annotated prompts, and 817 annotated interactive instructions. A screenshot of the tool is included in Appendix 21.
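The two-out-of-three agreement filter can be sketched as below; comparing trees via a canonical JSON serialization is an implementation assumption, not necessarily what the authors' pipeline does.

```python
import json
from collections import Counter

# Keep a sentence only if at least two of its three independent
# annotations produced exactly the same action tree.
def majority_tree(annotations, min_agree=2):
    """Return the agreed-upon tree, or None if no `min_agree` annotations match."""
    # Canonicalize each tree so structurally equal dicts compare equal.
    keys = [json.dumps(t, sort_keys=True) for t in annotations]
    key, count = Counter(keys).most_common(1)[0]
    return json.loads(key) if count >= min_agree else None

assert majority_tree([{"a": 1}, {"a": 1}, {"a": 2}]) == {"a": 1}
assert majority_tree([{"a": 1}, {"a": 2}, {"a": 3}]) is None
```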

4.3 Dataset Statistics

Action Frequencies

Since the different data collection settings described in Section 4.2 imposed different constraints and biases on the crowd-sourced workers, the distribution of actions in each subset of the data is very different. For example, in the Interactive Gameplay scenario, workers were given no prior indication of the bot’s capabilities, and spent much of their time asking the bot to build things. The action frequencies of each subset are shown in Figure 2.

Grammar coverage

Some crowd-sourced commands describe an action that is outside the scope of the grammar. To account for this, users of the action tree annotation tool are able to mark that a sentence is a command to perform an action that is not listed. The resulting action trees are labelled OtherAction, and their frequency in each dataset is shown in Figure 2. Note that annotators who choose OtherAction still have the option to label other nodes in the action tree, such as location and reference object.

5 Baseline Models

In order to assess the challenges of the dataset, we implement several baseline models which read a sentence and output an Action Tree, including an adaptation of the Seq2Tree model of (Dong and Lapata, 2016) to our grammar.

Sentence Encoder

All of our models rely on a sentence encoder. In this work, we use a bidirectional GRU encoder (Cho et al., 2014) which encodes a sentence $\mathbf{s} = (w_1, \ldots, w_T)$ of length $T$ into a sequence of $T$ $d$-dimensional vectors:

$$f_{GRU}(\mathbf{s}) = (\mathbf{h}_1, \ldots, \mathbf{h}_T) \in \mathbb{R}^{d \times T}$$

Multi-Headed Attention

Our models also use multi-head attention over the sentence representation. We use the implementation of Klein et al. (2017), with a residual connection. Given $K$ matrices $\mathbf{M}^{\alpha} = (M^{\alpha}_1, \ldots, M^{\alpha}_K) \in \mathbb{R}^{d \times d \times K}$, we define:

$$\alpha^k = \mathrm{softmax}\left(\frac{\mathbf{x}^{\mathrm{T}} M^{\alpha}_k (\mathbf{h}_1, \ldots, \mathbf{h}_T)}{\sqrt{d}}\right)$$

$$\mathbf{x}^{\alpha} = \sum_{k=1}^{K} {\alpha^k}^{\mathrm{T}} (\mathbf{h}_1, \ldots, \mathbf{h}_T)$$

$$\mathrm{attn}(\mathbf{x}, (\mathbf{h}_1, \ldots, \mathbf{h}_T); \mathbf{M}^{\alpha}) = \mathbf{x} + \mathbf{x}^{\alpha}$$
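A minimal NumPy sketch of this attention operator, written one head at a time rather than batched: each head scores every position with a bilinear form between the query and the encoder states, softmaxes over positions, and the attended vectors are summed back onto the query as a residual.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(z - z.max())
    return e / e.sum()

def attn(x, H, M):
    """Multi-head attention with a residual connection.
    x: (d,) query vector; H: (d, T) encoder states h_1..h_T;
    M: (K, d, d) one bilinear matrix per head."""
    d, _T = H.shape
    out = x.copy()
    for M_k in M:
        scores = (x @ M_k @ H) / np.sqrt(d)  # (T,) bilinear scores
        alpha = softmax(scores)              # attention weights over positions
        out = out + H @ alpha                # residual add of the attended vector
    return out

rng = np.random.default_rng(0)
d, T, K = 4, 5, 2
x = rng.normal(size=d)
H = rng.normal(size=(d, T))
M = rng.normal(size=(K, d, d))
y = attn(x, H, M)  # shape (d,)
```

A quick sanity check: with all-zero head matrices the scores vanish, the weights are uniform, and each head simply adds the mean encoder state to the query.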

5.1 Node Predictions

The output tree is made up of internal, categorical, and span nodes. We denote these sets by $\mathcal{I}$, $\mathcal{C}$ and $\mathcal{S}$ respectively, and the full set of nodes by $\mathcal{N} = \mathcal{I} \cup \mathcal{C} \cup \mathcal{S}$. Given a sentence, our aim is to predict the state of each node $n \in \mathcal{N}$ in the corresponding Action Tree.

Each node in an Action Tree is either active or inactive. We denote the state of a node $n$ by $a_n \in \{0, 1\}$. All descendants of an inactive internal node $n \in \mathcal{I}$ are considered inactive. Additionally, each categorical node $n \in \mathcal{C}$ has a set of possible values $C^n$; in a specific Action Tree, each active categorical node has a category label $c_n \in \{1, \ldots, |C^n|\}$. Finally, each active span node $n \in \mathcal{S}$ for a sentence of length $T$ has a start and end index $(s_n, e_n) \in \{1, \ldots, T\}^2$.

We take the following approach to predicting the state of a tree. First, we compute a node representation $\mathbf{r}_n$ for each node $n \in \mathcal{N}$ based on the input sentence $\mathbf{s}$:

$$(\mathbf{r}_1, \ldots, \mathbf{r}_{|\mathcal{N}|}) = f_{REP}((\mathbf{h}_1, \ldots, \mathbf{h}_T))$$

Then, we compute the probabilities of each of the labels as:

$$\forall n \in \mathcal{N}, \quad p(a_n) = \sigma(\langle \mathbf{r}_n, \mathbf{p}_n \rangle) \qquad (1)$$

$$\forall n \in \mathcal{C}, \quad p(c_n) = \mathrm{softmax}(M^c_n \mathbf{r}_n) \qquad (2)$$

$$\forall n \in \mathcal{S}, \quad p(s_n) = \mathrm{softmax}(\mathbf{r}_n^{\mathrm{T}} M^s_n (\mathbf{h}_1, \ldots, \mathbf{h}_T)), \quad p(e_n) = \mathrm{softmax}(\mathbf{r}_n^{\mathrm{T}} M^e_n (\mathbf{h}_1, \ldots, \mathbf{h}_T)) \qquad (3)$$

where the following are model parameters:

$$\forall n \in \mathcal{N}, \quad \mathbf{p}_n \in \mathbb{R}^d; \qquad \forall n \in \mathcal{C}, \quad M^c_n \in \mathbb{R}^{d \times d}; \qquad \forall n \in \mathcal{S}, \quad (M^s_n, M^e_n) \in \mathbb{R}^{d \times d \times 2}$$
Our proposed baselines differ from each other in how the node representations $\mathbf{r}_n$ are computed from the sentence. We present three implementations of $f_{REP}$ in Section 5.2.
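The three heads in Equations (1)-(3) can be sketched in NumPy as follows. One liberty taken for readability: the category matrix is shaped (|C^n|, d) so the softmax ranges over that node's label set, whereas the text states the parameter as d x d.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(z - z.max())
    return e / e.sum()

def node_predictions(r, H, p_vec, M_c, M_s, M_e):
    """Per-node prediction heads.
    r: (d,) node representation; H: (d, T) encoder states;
    p_vec: (d,) active/inactive parameter (Eq. 1);
    M_c: (C, d) category head (Eq. 2);
    M_s, M_e: (d, d) span start/end heads (Eq. 3)."""
    p_active = 1.0 / (1.0 + np.exp(-(r @ p_vec)))  # sigmoid of inner product, Eq. (1)
    p_category = softmax(M_c @ r)                  # distribution over labels, Eq. (2)
    p_start = softmax(r @ M_s @ H)                 # distribution over positions, Eq. (3)
    p_end = softmax(r @ M_e @ H)
    return p_active, p_category, p_start, p_end

rng = np.random.default_rng(1)
d, T, C = 4, 6, 3
p_a, p_c, p_s, p_e = node_predictions(
    rng.normal(size=d), rng.normal(size=(d, T)), rng.normal(size=d),
    rng.normal(size=(C, d)), rng.normal(size=(d, d)), rng.normal(size=(d, d)))
```

The span heads score every sentence position against the node representation, so span prediction reduces to two softmaxes of length $T$, one for the start and one for the end index.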

5.2 Node Representation

Independent predictions

Our first model computes $\mathbf{r}_n$ independently for each node by attending over the sentence representation. More specifically, each node $n \in \mathcal{N}$ has a parameter $\mathbf{v}_n \in \mathbb{R}^d$. We compute $\mathbf{r}_n$ by simply using $\mathbf{v}_n$ to attend over the sequence encoding $(\mathbf{h}_1, \ldots, \mathbf{h}_T)$ with $K$-headed attention parameterized by $\mathbf{M}^{\nu} \in \mathbb{R}^{d \times d \times K}$:

$$\mathbf{r}_n = \mathrm{attn}(\mathbf{v}_n, (\mathbf{h}_1, \ldots, \mathbf{h}_T); \mathbf{M}^{\nu}) \qquad (4)$$

Seq2Tree

We also implement the recurrent node representation function from the Seq2Tree model of Dong and Lapata (2016). It uses a recurrent decoder to compute representations for the children of a node in sequence, based on the previously predicted siblings and the parent's representation. Let $n^p \in \mathcal{I}$ be an internal node, let $(c_1, \ldots, c_m)$ be its children, let the recurrent hidden state of a node $n$ be denoted $\mathbf{g}^n$, and let $\circ$ denote concatenation. We then compute:

$$\mathbf{r}_{c_t} = \text{attn}\big(\mathbf{v}_{c_t} + \mathbf{g}^{c_{t-1}}, (\mathbf{h}_1, \ldots, \mathbf{h}_T); \mathbf{M}^\sigma\big) \quad (5)$$

$$\mathbf{g}^{c_t} = \begin{cases} f_{rec}\big(\mathbf{g}^{c_{t-1}}, \mathbf{v}'_{c_t} \circ \mathbf{g}^{n^p}\big) & \text{if } a_{c_t} = 1 \\ \mathbf{g}^{c_{t-1}} & \text{else} \end{cases} \quad (6)$$

where $\mathbf{M}^\sigma \in \mathbb{R}^{d \times d \times K}$ is a tree-wise parameter (as in the independent prediction case), $f_{rec}$ is the GRU recurrence function, and $\mathbf{v}'_{c_t}$ is a node parameter (one per category for categorical nodes).
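The child-decoding loop of Equations 5 and 6 can be sketched as follows. This is a toy control-flow sketch, not the paper's implementation: `attend` stands in for the attention over $(\mathbf{h}_1, \ldots, \mathbf{h}_T)$, `f_rec` for the GRU cell, and `predict_active` for the activity classifier; vectors are plain lists, so `+` on the primed parameters below is list concatenation, matching $\circ$.

```python
def decode_children(children, g_parent, g_init, f_rec, attend,
                    predict_active, params):
    """Seq2Tree child loop: compute each child's representation from its
    node parameter plus the running hidden state (Eq. 5), and advance the
    hidden state only past children predicted active (Eq. 6)."""
    g_prev = g_init
    reps = {}
    for c in children:
        v, v_prime = params[c]
        query = [a + b for a, b in zip(v, g_prev)]   # v_{c_t} + g^{c_{t-1}}
        reps[c] = attend(query)                       # Eq. (5)
        if predict_active(reps[c]):                   # a_{c_t} = 1
            # v'_{c_t} concatenated with the parent's hidden state
            g_prev = f_rec(g_prev, v_prime + g_parent)  # Eq. (6)
        # inactive child: hidden state carried over unchanged
    return reps, g_prev
```

With toy callables (identity attention, a sum-based recurrence), the second child's representation visibly depends on the first child having been decoded, which is the point of the recurrent variant.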

SentenceRec

One possible limitation of the Seq2Tree model presented above is that the tree-side recurrent updates do not directly depend on the input sentence. This can be addressed by a simple modification: we add the node representation $\mathbf{r}_{c_t}$ to the input of the recurrent update:

$$\mathbf{r}_{c_t} = \text{attn}\big(\mathbf{v}_{c_t} + \mathbf{g}^{c_{t-1}}, (\mathbf{h}_1, \ldots, \mathbf{h}_T); \mathbf{M}^\sigma\big) \quad (7)$$

$$\mathbf{g}^{c_t} = \begin{cases} f_{rec}\big(\mathbf{g}^{c_{t-1}}, (\mathbf{v}'_{c_t} + \mathbf{r}_{c_t}) \circ \mathbf{g}^{n^p}\big) & \text{if } a_{c_t} = 1 \\ \mathbf{g}^{c_{t-1}} & \text{else} \end{cases} \quad (8)$$

We refer to this model as SentenceRec.

5.3 Sequential Prediction

We predict the state of the Action Tree given a sentence in a sequential manner, by predicting the states of the nodes ($a_n \; \forall n \in \mathcal{N}$, $c_n \; \forall n \in \mathcal{C}$, and $(s_n, e_n) \; \forall n \in \mathcal{S}$) in Depth First Search order. Additionally, since an inactive node's descendants are all inactive, we can skip the sub-tree rooted at $n$ whenever we predict $a_n = 0$. Let us denote the parent of a node $n$ by $\pi(n)$. Given Equations 1 to 3, the log-likelihood of a tree with states $(\mathbf{a}, \mathbf{c}, \mathbf{s}, \mathbf{e})$ given a sentence can be written as:

$$\mathcal{L} = \sum_{n \in \mathcal{N}} a_{\pi(n)} \log(p(a_n)) + \sum_{n \in \mathcal{C}} a_n \log(p(c_n)) + \sum_{n \in \mathcal{S}} a_n \Big( \log(p(s_n)) + \log(p(e_n)) \Big) \quad (9)$$

Note that since in all of our models the representation $\mathbf{r}_n$ of a node $n$ only depends on nodes seen before it in a DFS traversal, this loss lends itself well to beam search prediction.
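The masking in Equation 9 can be sketched with a dict-based toy implementation: a node's activity term is masked by its parent's activity, while categorical and span terms are masked by the node's own activity. All names here are illustrative.

```python
import math

def tree_log_likelihood(nodes, parent, kind, active,
                        p_active, p_cat, p_span):
    """Toy version of Eq. (9). `active` holds the gold a_n states,
    `p_active`/`p_cat`/`p_span` hold the model's probabilities for the
    gold values; the root has no parent, so its mask defaults to 1."""
    ll = 0.0
    for n in nodes:
        if active.get(parent.get(n), 1):      # a_{pi(n)} mask
            ll += math.log(p_active[n])
        if active[n]:                          # a_n mask
            if kind[n] == "categorical":
                ll += math.log(p_cat[n])
            elif kind[n] == "span":
                ps, pe = p_span[n]
                ll += math.log(ps) + math.log(pe)
    return ll
```

An inactive span node contributes nothing beyond its own (masked-in) activity term, which is what lets prediction skip pruned sub-trees.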

6 Experiments

Training Data

We train our models jointly on the (virtually unlimited) template generations and the set of 33K training rephrases. Early experiments showed that a model trained exclusively on templated generations failed to reach accuracies better than 40% on the validation rephrases. Training on rephrases alone did a little better (up to 65%), but still trailed models trained on both (around 80%, see Table 1).

The action types represented in the three test datasets (rephrases, prompts, and interactive) have very different distributions, as shown in Figure 2. To address this, we sample training examples evenly between templates and rephrases according to each test setting's distribution (without replacement, until all examples of a subset have been seen).
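The sampling scheme above can be sketched as a generator that draws each training example from one of several pools with fixed mixture weights, reshuffling a pool only once it is exhausted, so no example repeats before every example in its pool has been seen. Pool names and weights here are illustrative placeholders.

```python
import random

def mixture_sampler(pools, weights, seed=0):
    """Yield training examples drawn from named pools (e.g. templates vs.
    rephrases) with the given mixture weights. Each pool is consumed
    without replacement; once empty it is reshuffled and reused."""
    rng = random.Random(seed)
    names = list(pools)
    ws = [weights[n] for n in names]
    remaining = {n: [] for n in names}
    while True:
        name = rng.choices(names, weights=ws)[0]
        if not remaining[name]:
            remaining[name] = list(pools[name])
            rng.shuffle(remaining[name])
        yield remaining[name].pop()
```

Within each pool, every consecutive full pass is a permutation of that pool, which is the "no replacement until all examples have been seen" property.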

Modeling Details

We use a 2-layer GRU sentence encoder and all hidden layers in our model have dimension d=256𝑑256d=256italic_d = 256. We use pre-trained word embeddings computed with FastText with subword information (Bojanowski et al., 2017), to which we concatenate free learnable dimensions (these are initialized to be 0, and we tried adding 0, 8, 32 and 64 free dimensions). All models are trained with Adagrad, using label smoothing, dropout, and word dropout for regularization. In all settings, we selected the model which reached the best accuracy on the validation rephrases to evaluate on the test sets. We provide our model and training code.
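The embedding layer described above (frozen pretrained vectors with extra trainable dimensions initialized to zero) can be sketched as a simple concatenation; in a real model the free dimensions would be parameters updated by the optimizer, and the function name and table layout here are ours.

```python
def embed(tokens, pretrained, n_free, free_table=None):
    """Look up each token's frozen pretrained vector (FastText-style) and
    concatenate n_free trainable dimensions, initialized to zero. The
    free_table persists the per-token free dimensions across calls."""
    if free_table is None:
        free_table = {}
    out = []
    for t in tokens:
        free = free_table.setdefault(t, [0.0] * n_free)
        out.append(list(pretrained[t]) + free)  # list concat = vector concat
    return out
```

At initialization the extra dimensions contribute nothing, so the model starts from the pure pretrained embeddings and learns task-specific offsets only as needed.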

Overview of Results

Table 1 presents tree-level accuracies for the proposed training settings. First, we notice that all models are able to reach near-perfect accuracy on generations from our templates, which means they can invert the generation process described in Section 4.1. The accuracy on the validation and test rephrased data is also high, up to 80.7% for the SentenceRec model. However, the worse performance on instructions from both prompts and interactive play shows that our setting poses significant generalization challenges. In particular, all models have significant trouble with the prompts, which come from crowd-sourced workers asked to imagine general game commands and may not fit the exact Minecraft setting. Still, 86% of those annotations are valid under our grammar, and we hope that future work will better address this domain shift.

On the “interactive” commands, the models do a little better. In general, SentenceRec seems to have a small edge over the base Seq2Tree model, but the main difference is between the independent prediction model and the recurrent ones. While the latter do much better when trained in-distribution (a 12% absolute gap), the former seems to adapt better to the distribution shift when trained using the rephrases or prompts sampling.

Analysis

Table 2 gives insights into model behaviors on categorical (CAT), internal (INT) and span (SPAN) nodes. Accurate prediction of a categorical or span node depends on having predicted all of the internal nodes on the path to the root, which explains why CAT and SPAN P/R/F numbers are lower than INT. Additionally, both models have more trouble predicting span than categorical nodes.

We also computed confusion matrices for the best SentenceRec model (see Appendix C). For internal nodes, the model seems to have trouble identifying the scope of some location and repeat nodes: even when identifying that the command specifies a location, is it the location where the command needs to be executed, or the location of the command's argument? There is also confusion between schematic and action reference objects, which we assume comes from the difficulty of interpreting whether the speaker is asking the model to build an object it knows (a schematic) or another copy of an object in the world (a reference object), a prediction which must rely on an understanding of the context.

Finally, the internal parent being absent seems to account for most of the CAT and SPAN mistakes (aside from the action type, which is a child of the root). For action types, the model mostly has trouble recognizing questions and Fill requests. The model also often confuses Mobs (animated creatures in the game) with Objects, which is indeed difficult to disambiguate without some background knowledge. For spans, the model mostly makes the mistake of predicting a node as inactive when it is present. It should be noted that span mismatches are especially rare, except that the model sometimes confuses the depth and height of an object when both are present.

7 Conclusion

In this work, we have described a grammar over a control system for a Minecraft assistant. We then discussed the creation of a dataset of natural language utterances with associated logical forms from this grammar that can be executed in-game. Finally, we showed the results of using this new dataset to train several neural models for parsing natural language instructions. We find that the models we trained were able to fit the templated data nearly perfectly and the rephrased data with some accuracy, but struggled to adapt to the human-generated data. In our view, the problem of combining the small amount of annotated (grammar-free) human data with the unlimited generations of our grammar, in order to improve results on the human distribution, is an exciting area of research.
