Language Models are Unsupervised Multitask Learners
Alec Radford * 1 Jeffrey Wu * 1 Rewon Child 1 David Luan 1 Dario Amodei ** 1 Ilya Sutskever ** 1
*, ** Equal contribution. 1 OpenAI, San Francisco, California, United States. Correspondence to: Alec Radford.

Abstract

Natural language processing tasks, such as question answering, machine translation, reading comprehension, and summarization, are typically approached with supervised learning on task-specific datasets. We demonstrate that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText. When conditioned on a document plus questions, the answers generated by the language model reach 55 F1 on the CoQA dataset - matching or exceeding the performance of 3 out of 4 baseline systems without using the 127,000+ training examples. The capacity of the language model is essential to the success of zero-shot task transfer and increasing it improves performance in a log-linear fashion across tasks. Our largest model, GPT-2, is a 1.5B parameter Transformer that achieves state of the art results on 7 out of 8 tested language modeling datasets in a zero-shot setting but still underfits WebText. Samples from the model reflect these improvements and contain coherent paragraphs of text. These findings suggest a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.

1. Introduction

Machine learning systems now excel (in expectation) at tasks they are trained for by using a combination of large datasets, high-capacity models, and supervised learning (Krizhevsky et al., 2012) (Sutskever et al., 2014) (Amodei et al., 2016). Yet these systems are brittle and sensitive to slight changes in the data distribution (Recht et al., 2018) and task specification (Kirkpatrick et al., 2017). Current systems are better characterized as narrow experts rather than competent generalists. We would like to move towards more general systems which can perform many tasks - eventually without the need to manually create and label a training dataset for each one.

The dominant approach to creating ML systems is to collect a dataset of training examples demonstrating correct behavior for a desired task, train a system to imitate these behaviors, and then test its performance on independent and identically distributed (IID) held-out examples. This has served well to make progress on narrow experts. But the often erratic behavior of captioning models (Lake et al., 2017), reading comprehension systems (Jia & Liang, 2017), and image classifiers (Alcorn et al., 2018) on the diversity and variety of possible inputs highlights some of the shortcomings of this approach.

Our suspicion is that the prevalence of single task training on single domain datasets is a major contributor to the lack of generalization observed in current systems. Progress towards robust systems with current architectures is likely to require training and measuring performance on a wide range of domains and tasks. Recently, several benchmarks have been proposed such as GLUE (Wang et al., 2018) and decaNLP (McCann et al., 2018) to begin studying this.

Multitask learning (Caruana, 1997) is a promising framework for improving general performance. However, multitask training in NLP is still nascent. Recent work reports modest performance improvements (Yogatama et al., 2019) and the two most ambitious efforts to date have trained on a total of 10 and 17 (dataset, objective) pairs respectively (McCann et al., 2018) (Bowman et al., 2018). From a meta-learning perspective, each (dataset, objective) pair is a single training example sampled from the distribution of datasets and objectives. Current ML systems need hundreds to thousands of examples to induce functions which generalize well. This suggests that multitask training may need just as many effective training pairs to realize its promise with current approaches. It will be very difficult to continue to scale the creation of datasets and the design of objectives to the degree that may be required to brute force our way there with current techniques. This motivates exploring additional setups for performing multitask learning.
The current best performing systems on language tasks utilize a combination of pre-training and supervised fine-tuning. This approach has a long history with a trend towards more flexible forms of transfer. First, word vectors were learned and used as inputs to task-specific architectures (Mikolov et al., 2013) (Collobert et al., 2011), then the contextual representations of recurrent networks were transferred (Dai & Le, 2015) (Peters et al., 2018), and recent work suggests that task-specific architectures are no longer necessary and transferring many self-attention blocks is sufficient (Radford et al., 2018) (Devlin et al., 2018).

Figure 1. Zero-shot task performance of WebText LMs as a function of model size on many NLP tasks. Reading Comprehension results are on CoQA (Reddy et al., 2018), translation on WMT-14 Fr-En (Artetxe et al., 2017), summarization on CNN and Daily Mail (See et al., 2017), and Question Answering on Natural Questions (Kwiatkowski et al., 2019). Section 3 contains detailed descriptions of each result.

These methods still require supervised training in order to perform a task. When only minimal or no supervised data is available, another line of work has demonstrated the promise of language models to perform specific tasks, such as commonsense reasoning (Schwartz et al., 2017) and sentiment analysis (Radford et al., 2017).

In this paper, we connect these two lines of work and continue the trend of more general methods of transfer. We demonstrate language models can perform down-stream tasks in a zero-shot setting - without any parameter or architecture modification. We demonstrate this approach shows potential by highlighting the ability of language models to perform a wide range of tasks in a zero-shot setting. We achieve promising, competitive, and state of the art results depending on the task.

2. Approach

At the core of our approach is language modeling. Language modeling is usually framed as unsupervised distribution estimation from a set of examples (x_1, x_2, ..., x_n) each composed of variable length sequences of symbols (s_1, s_2, ..., s_n). Since language has a natural sequential ordering, it is common to factorize the joint probabilities over symbols as the product of conditional probabilities (Jelinek & Mercer, 1980) (Bengio et al., 2003):

    p(x) = \prod_{i=1}^{n} p(s_i \mid s_1, \ldots, s_{i-1})    (1)

This approach allows for tractable sampling from and estimation of p(x) as well as any conditionals of the form p(s_{n-k}, ..., s_n | s_1, ..., s_{n-k-1}). In recent years, there have been significant improvements in the expressiveness of models that can compute these conditional probabilities, such as self-attention architectures like the Transformer (Vaswani et al., 2017).

Learning to perform a single task can be expressed in a probabilistic framework as estimating a conditional distribution p(output | input). Since a general system should be able to perform many different tasks, even for the same input, it should condition not only on the input but also on the task to be performed. That is, it should model p(output | input, task). This has been variously formalized in multitask and meta-learning settings. Task conditioning is often implemented at an architectural level, such as the task specific encoders and decoders in (Kaiser et al., 2017) or at an algorithmic level such as the inner and outer loop optimization framework of MAML (Finn et al., 2017). But as exemplified in McCann et al. (2018), language provides a flexible way to specify tasks, inputs, and outputs all as a sequence of symbols. For example, a translation training example can be written as the sequence (translate to french, english text, french text). Likewise, a reading comprehension training example can be written as (answer the question, document, question, answer). McCann et al. (2018) demonstrated it was possible to train a single model, the MQAN,
to infer and perform many different tasks on examples with this type of format.

Language modeling is also able to, in principle, learn the tasks of McCann et al. (2018) without the need for explicit supervision of which symbols are the outputs to be predicted. Since the supervised objective is the same as the unsupervised objective but only evaluated on a subset of the sequence, the global minimum of the unsupervised objective is also the global minimum of the supervised objective. In this slightly toy setting, the concerns with density estimation as a principled training objective discussed in (Sutskever et al., 2015) are sidestepped. The problem instead becomes whether we are able to, in practice, optimize the unsupervised objective to convergence. Preliminary experiments confirmed that sufficiently large language models are able to perform multitask learning in this toy-ish setup but learning is much slower than in explicitly supervised approaches.

While it is a large step from the well-posed setup described above to the messiness of "language in the wild", Weston (2016) argues, in the context of dialog, for the need to develop systems capable of learning from natural language directly and demonstrated a proof of concept - learning a QA task without a reward signal by using forward prediction of a teacher's outputs. While dialog is an attractive approach, we worry it is overly restrictive. The internet contains a vast amount of information that is passively available without the need for interactive communication. Our speculation is that a language model with sufficient capacity will begin to learn to infer and perform the tasks demonstrated in natural language sequences in order to better predict them, regardless of their method of procurement. If a language model is able to do this it will be, in effect, performing unsupervised multitask learning. We test whether this is the case by analyzing the performance of language models in a zero-shot setting on a wide variety of tasks.

"I'm not the cleverest man in the world, but like they say in French: Je ne suis pas un imbecile [I'm not a fool].

In a now-deleted post from Aug. 16, Soheil Eid, Tory candidate in the riding of Joliette, wrote in French: "Mentez mentez, il en restera toujours quelque chose," which translates as, "Lie lie and something will always remain."

"I hate the word 'perfume,'" Burr says. 'It's somewhat better in French: 'parfum.'

If listened carefully at 29:55, a conversation can be heard between two guys in French: "-Comment on fait pour aller de l'autre coté? -Quel autre coté?", which means "- How do you get to the other side? - What side?".

If this sounds like a bit of a stretch, consider this question in French: As-tu aller au cinéma?, or Did you go to the movies?, which literally translates as Have-you to go to movies/theater?

"Brevet Sans Garantie Du Gouvernement", translated to English: "Patented without government warranty".

Table 1. Examples of naturally occurring demonstrations of English to French and French to English translation found throughout the WebText training set.

2.1. Training Dataset

Most prior work trained language models on a single domain of text, such as news articles (Jozefowicz et al., 2016), Wikipedia (Merity et al., 2016), or fiction books (Kiros et al., 2015). Our approach motivates building as large and diverse a dataset as possible in order to collect natural language demonstrations of tasks in as varied domains and contexts as possible.

A promising source of diverse and nearly unlimited text is web scrapes such as Common Crawl. While these archives are many orders of magnitude larger than current language modeling datasets, they have significant data quality issues. Trinh & Le (2018) used Common Crawl in their work on commonsense reasoning but noted a large amount of documents "whose content are mostly unintelligible". We observed similar data issues in our initial experiments with Common Crawl. Trinh & Le (2018)'s best results were achieved using a small subsample of Common Crawl which included only documents most similar to their target dataset, the Winograd Schema Challenge. While this is a pragmatic approach to improve performance on a specific task, we want to avoid making assumptions about the tasks to be performed ahead of time.

Instead, we created a new web scrape which emphasizes document quality. To do this we only scraped web pages which have been curated/filtered by humans. Manually filtering a full web scrape would be exceptionally expensive so as a starting point, we scraped all outbound links from Reddit, a social media platform, which received at least 3 karma. This can be thought of as a heuristic indicator for whether other users found the link interesting, educational, or just funny.

The resulting dataset, WebText, contains the text subset of these 45 million links. To extract the text from HTML responses we use a combination of the Dragnet (Peters & Lecocq, 2013) and Newspaper [1] content extractors. All results presented in this paper use a preliminary version of WebText which does not include links created after Dec 2017 and which after de-duplication and some heuristic based cleaning contains slightly over 8 million documents for a total of 40 GB of text. We removed all Wikipedia documents from WebText since it is a common data source for other datasets and could complicate analysis due to over-

[1] https://github.com/codelucas/newspaper
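The link selection described in Section 2.1 (keep outbound links with at least 3 karma, drop Wikipedia pages, de-duplicate) can be sketched in a few lines. This is only an illustration under stated assumptions: the `Link` record, the `build_corpus` and `is_wikipedia` helpers, the hostname test for Wikipedia, and de-duplication by whitespace-normalized lowercase text are all hypothetical choices made for the example. The paper does not specify its de-duplication or cleaning heuristics.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class Link:
    """Hypothetical record for one outbound link and its extracted text."""
    url: str
    karma: int
    text: str


def is_wikipedia(url: str) -> bool:
    # Assumption: a page counts as Wikipedia if its host is under wikipedia.org.
    host = urlparse(url).hostname or ""
    return host == "wikipedia.org" or host.endswith(".wikipedia.org")


def build_corpus(links, min_karma=3):
    """Keep links with enough karma, skip Wikipedia, drop duplicate texts."""
    seen = set()
    corpus = []
    for link in links:
        if link.karma < min_karma:
            continue
        if is_wikipedia(link.url):
            continue
        # Assumption: duplicates are detected on normalized exact text only.
        key = " ".join(link.text.lower().split())
        if not key or key in seen:
            continue
        seen.add(key)
        corpus.append(link)
    return corpus


if __name__ == "__main__":
    sample = [
        Link("https://example.com/a", 5, "Hello  world"),
        Link("https://example.com/b", 2, "Too little karma"),
        Link("https://en.wikipedia.org/wiki/X", 10, "Encyclopedia page"),
        Link("https://example.org/c", 3, "hello world"),
    ]
    for kept in build_corpus(sample):
        print(kept.url)
```

The karma threshold acts as a cheap proxy for human curation, which is why it is applied before any text processing.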

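The chain-rule factorization of p(x) in Section 2 can be made concrete with a toy model. The sketch below assumes a hand-written bigram table (`TOY_TABLE`, purely hypothetical) in place of a Transformer. It shows that the log-probability of a sequence is the sum of per-symbol conditional log-probabilities, and that a conditional over a continuation is obtained by summing only the continuation's terms.

```python
import math

# Hypothetical toy conditionals: each symbol depends only on the previous one.
# The empty tuple is the context for the first symbol.
TOY_TABLE = {
    (): {"a": 0.5, "b": 0.5},
    ("a",): {"a": 0.25, "b": 0.75},
    ("b",): {"a": 0.5, "b": 0.5},
}


def toy_cond_prob(symbol, prefix):
    """Return p(symbol | prefix) under the toy bigram table."""
    context = tuple(prefix[-1:])
    return TOY_TABLE[context][symbol]


def sequence_log_prob(symbols, cond_prob):
    """log p(x) as the sum over i of log p(s_i | s_1, ..., s_{i-1})."""
    return sum(
        math.log(cond_prob(symbols[i], symbols[:i]))
        for i in range(len(symbols))
    )


def continuation_log_prob(prefix, continuation, cond_prob):
    """log p(continuation | prefix), summing only the continuation's terms."""
    full = list(prefix) + list(continuation)
    return sum(
        math.log(cond_prob(full[i], full[:i]))
        for i in range(len(prefix), len(full))
    )


if __name__ == "__main__":
    print(math.exp(sequence_log_prob(["a", "b", "a"], toy_cond_prob)))
    print(math.exp(continuation_log_prob(["a"], ["b", "a"], toy_cond_prob)))
```

Working in log space avoids numerical underflow on long sequences, which is the usual reason language-model scores are reported as log-probabilities.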

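Section 2 describes writing a task, its inputs, and its output as one sequence of symbols, and notes that the supervised objective is the unsupervised objective evaluated on a subset of that sequence. The sketch below illustrates that relationship under simplifying assumptions: whitespace tokenization, the helper names `build_example`, `lm_loss`, and `supervised_loss`, and the per-token negative log-likelihood values are all hypothetical and do not come from the paper.

```python
def build_example(task, inputs, output):
    """Flatten (task, inputs..., output) into tokens plus an output mask."""
    tokens, is_output = [], []
    segments = [(task, False)] + [(text, False) for text in inputs] + [(output, True)]
    for text, flag in segments:
        words = text.split()  # assumption: whitespace tokenization
        tokens.extend(words)
        is_output.extend([flag] * len(words))
    return tokens, is_output


def lm_loss(token_nll):
    """Unsupervised objective: negative log-likelihood over every position."""
    return sum(token_nll)


def supervised_loss(token_nll, is_output):
    """Supervised objective: the same terms, restricted to output positions."""
    return sum(nll for nll, flag in zip(token_nll, is_output) if flag)


if __name__ == "__main__":
    tokens, mask = build_example("translate to french", ["good morning"], "bonjour")
    # Hypothetical per-token losses, as a model might assign them.
    nll = [2.0, 1.0, 1.5, 3.0, 2.5, 0.5]
    print(tokens)
    print(lm_loss(nll), supervised_loss(nll, mask))
```

Because every term of the supervised sum also appears in the language-modeling sum, a model that drives all per-token losses to their minimum also minimizes the supervised loss.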
Title: Language Models are Unsupervised Multitask Learners
Subject: Proceedings of the International Conference on Machine Learning 2019
Keywords: Machine Learning, ICML
Author: Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei**, Ilya Sutskever**
CreationDate: Thu Feb 14 23:21:39 2019
ModDate: Thu Feb 14 23:21:39 2019
