
[Clickbait] Language Model on One Billion Word Benchmark

Posted 2017-8-16 14:39:02

Language Model on One Billion Word Benchmark

Authors:

Oriol Vinyals (vinyals@google.com, github: OriolVinyals), Xin Pan (xpan@google.com, github: panyx0718)

Paper Authors:

Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, Yonghui Wu

TL;DR

This is a pretrained model on the One Billion Word Benchmark. If you use this model in your publication, please cite the original paper:

@article{jozefowicz2016exploring,
  title={Exploring the Limits of Language Modeling},
  author={Jozefowicz, Rafal and Vinyals, Oriol and Schuster, Mike and Shazeer, Noam and Wu, Yonghui},
  journal={arXiv preprint arXiv:1602.02410},
  year={2016}
}

Introduction

In this release, we open-source a model trained on the One Billion Word Benchmark (http://arxiv.org/abs/1312.3005), a large English-language corpus released in 2013. The dataset contains about one billion words and has a vocabulary of about 800K words; it consists mostly of news data. Since sentences in the training set are shuffled, models can ignore the context and focus on sentence-level language modeling.

In the original release and subsequent work, people have trained models on this dataset and evaluated them on the same test set, making it a standard benchmark for language modeling. Recently, we wrote an article (http://arxiv.org/abs/1602.02410) describing a hybrid model that combines a character-level CNN, a large and deep LSTM, and a specific Softmax architecture, which allowed us to train the best model on this dataset so far, almost halving the best perplexity previously obtained by others.
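For readers unfamiliar with the metric: perplexity is the exponentiated average negative log-likelihood the model assigns to the test tokens, so lower is better. In the usual notation, for a test set of N tokens:

\mathrm{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log p(w_i \mid w_1, \ldots, w_{i-1})\right)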

Code Release

The open-sourced components include (a short loading sketch follows the list):

TensorFlow GraphDef protocol buffer text file.
TensorFlow pre-trained checkpoint shards.
Code used to evaluate the pre-trained model.
Vocabulary file.
Test set from LM-1B evaluation.
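For orientation, below is a minimal sketch of how a text-format GraphDef of this kind is typically loaded with TensorFlow 1.x. The file path and tensor name are placeholders, not the names shipped in the release; the repository's evaluation code is the authoritative loader for the checkpoint shards and vocabulary.

# Minimal sketch (not the release's own loader): read a text-format GraphDef
# and import it into a TensorFlow 1.x graph. All names below are placeholders.
import tensorflow as tf
from google.protobuf import text_format

GRAPH_PBTXT = 'graph.pbtxt'   # placeholder path to the released GraphDef text file

with tf.gfile.GFile(GRAPH_PBTXT, 'r') as f:
    graph_def = tf.GraphDef()
    text_format.Merge(f.read(), graph_def)

graph = tf.Graph()
with graph.as_default():
    tf.import_graph_def(graph_def, name='')

sess = tf.Session(graph=graph)
# From here, tensors are fetched by name and the sharded checkpoint is restored
# through the saver ops contained in the imported graph; see the repository's
# evaluation script for the exact node names, e.g.:
# softmax = graph.get_tensor_by_name('softmax_out:0')  # placeholder tensor name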
The code supports 4 evaluation modes (a toy sketch follows the list):

Given a provided dataset, calculate the model's perplexity.
Given a prefix sentence, predict the next words.
Dump the softmax embeddings and the character-level CNN word embeddings.
Given a sentence, dump the embedding from the LSTM state.
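To make the first two modes concrete, here is a toy sketch in plain NumPy (not the release's code; all numbers and names are made up) of how per-token log-probabilities reduce to a corpus perplexity and how a next-word prediction falls out of a softmax over the vocabulary:

import numpy as np

def perplexity(token_log_probs):
    # token_log_probs: natural-log probabilities the model assigned to each
    # token actually observed in the evaluation data (mode 1).
    return float(np.exp(-np.mean(token_log_probs)))

def predict_next(softmax_probs, vocab):
    # softmax_probs: model distribution over the vocabulary given a prefix (mode 2);
    # vocab: words aligned with that distribution.
    return vocab[int(np.argmax(softmax_probs))]

# Toy usage with made-up numbers:
print(perplexity(np.log([0.2, 0.1, 0.05])))                             # 10.0
print(predict_next(np.array([0.1, 0.7, 0.2]), ['cat', 'dog', 'bird']))  # dog

The released evaluation code performs the equivalent aggregation over the LM-1B held-out test set using the vocabulary file listed above.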
Project page: https://github.com/tensorflow/models/tree/master/lm_1b
Source: tensorflow, http://www.tensorflownews.com/