
chainerで自然言語処理できるかマン

A blog for studying natural language processing with chainer

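Generating sentences with a model trained on ptb

This post generates sentences using rnnlm.model, the model file produced by the training run described in the earlier post "example\ptbを読む" (Reading example\ptb) on this blog.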

Preparation

Place the following files in the same directory.

  • ptb.train.txt
  • ptb.test.txt
  • ptb.valid.txt
  • rnnlm.model
  • net.py
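The three ptb text files are needed because the word-to-ID table is not stored in rnnlm.model and has to be rebuilt the same way it was built during training. The sketch below is only an illustration of why the read order matters: it uses short in-memory strings instead of the real files, and assign_ids is a hypothetical helper name, not something from train_ptb.py.

```python
def assign_ids(texts):
    """Assign IDs to words in order of first appearance across the texts."""
    vocab2id = {}
    for text in texts:
        # Same tokenization idea as the script: newlines become <eos> tokens
        for word in text.replace('\n', '<eos>').strip().split():
            if word not in vocab2id:
                vocab2id[word] = len(vocab2id)
    return vocab2id


# Toy stand-ins for two data files (illustrative text, not real ptb data)
text_a = "the market fell \n"
text_b = "the fund rose \n"

ids_ab = assign_ids([text_a, text_b])
ids_ba = assign_ids([text_b, text_a])

# Same vocabulary, but the IDs differ when the files are read in another order
print(ids_ab)
print(ids_ba)
```

If the IDs differ from the ones used in training, the embedding and output layers of the loaded model would be indexed with the wrong words, so the files should be read in the same order as in training.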

Code

#encoding: utf-8
#
# Copyright (c) 2016 chainer_nlp_man
#
# This software is released under the MIT License.
# http://opensource.org/licenses/mit-license.php
#
import argparse
import random
import bisect

import numpy as np

import chainer
import chainer.links as L
import chainer.functions as F
from chainer import serializers

import net

# Take the model file as a command-line argument
parser = argparse.ArgumentParser()
parser.add_argument('--model', '-m', required=True,
                    help='path to the trained model file (e.g. rnnlm.model)')
args = parser.parse_args()

# Lookup tables for word <-> ID conversion
vocab2id = {}
id2vocab = {}

# Read the data the same way train_ptb.py does, so that the word/ID pairs match
def load_data(filename):
    # The dictionaries are only mutated in place, so no global declaration is needed
    with open(filename) as f:
        words = f.read().replace('\n', '<eos>').strip().split()
    for word in words:
        if word not in vocab2id:
            vocab2id[word] = len(vocab2id)
            id2vocab[vocab2id[word]] = word

load_data('ptb.train.txt')
load_data('ptb.valid.txt')
load_data('ptb.test.txt')

# Use the same setting as train_ptb.py
n_units = 650

lm = net.RNNLM(len(vocab2id), n_units, False)
model = L.Classifier(lm)

# Load the trained model parameters
serializers.load_hdf5(args.model, model)

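# Generate sentences by random sampling
for i in range(10):
    print(i+1, end=": ")
    # Reset the model state before each sentence
    model.predictor.reset_state()
    word = "<eos>"
    while True:
        # Prepare the input to the RNNLM (word IDs are passed as int32)
        x = chainer.Variable(np.array([vocab2id[word]], dtype=np.int32))
        # Apply softmax to the RNNLM output
        y = F.softmax(model.predictor(x))
        # Treat the output as per-word probabilities and sample the next word
        y_accum = np.add.accumulate(y.data[0])
        r = random.random()
        # Clamp the index in case float rounding leaves the total just below r
        next_id = min(bisect.bisect(y_accum, r), len(id2vocab) - 1)
        word = id2vocab[next_id]
        # Stop when the end of the sentence is reached
        if word == "<eos>":
            print(".")
            break
        else:
            print(word, end=" ")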
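The sampling step in the loop can be looked at in isolation. The following is a minimal sketch using only the standard library, assuming made-up scores for a three-word toy vocabulary; softmax and sample_index are hypothetical helper names for illustration and do not use chainer. It turns scores into probabilities, builds the running totals, and picks the first position whose running total exceeds a uniform random number.

```python
import bisect
import itertools
import math
import random


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_index(probs, rng=random):
    """Draw index i with probability proportional to probs[i]."""
    # Running totals: cumulative[i] = probs[0] + ... + probs[i]
    cumulative = list(itertools.accumulate(probs))
    # Scale by the final total so rounding in the sum cannot push r past the end
    r = rng.random() * cumulative[-1]
    # First position whose running total exceeds r (clamped as a last guard)
    return min(bisect.bisect(cumulative, r), len(probs) - 1)


# Toy vocabulary and scores (illustrative values, not model output)
toy_vocab = ["the", "market", "<eos>"]
toy_probs = softmax([2.0, 1.0, 0.1])

rng = random.Random(0)
counts = [0] * len(toy_vocab)
for _ in range(10000):
    counts[sample_index(toy_probs, rng)] += 1

# Words with higher probability are drawn more often
print(dict(zip(toy_vocab, counts)))
```

Because the next word is drawn at random rather than taken as the most probable one, each run of the script produces different sentences.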

Run

$ python gen_sentence.py -m rnnlm.model

Results

1: mr. burton said certificates of annuity and acquisitions received by the government to be submitted on a <unk> basis by painewebber inc. will have just been involved in the forest business and private incentives but declined to disclose what full licensing only .
2: <unk> white house of duff & trecker said the new trading company 's cash flow has been reduced by $ N million for the sale of those shares under the agreement .
3: the fund called cholesterol .
4: gary l. <unk> head of the only reinsurance department 's office raised his <unk> account title to revise washington motor co. cleveland securities .
5: short interest in shares of high-yield high-risk junk bonds moved up N last week mostly as <unk> from boston 's N N N high over $ N billion .
6: the cut would focus on <unk> loans designed to distribute <unk> even to the ldp earlier this year .
7: but top trial activist david <unk> d. ore. cited the unprecedented factors of operating in damages of gifts from fashionable banks that be undervalued in <unk> division .
8: the fda 's successor will become married artistic efforts by a <unk> in the press as well as the quality of the reasons .
9: the house which will help lay <unk> out steppenwolf 's <unk> operations once became a direct <unk> to international environmental protection although it sold the $ N million procter & gamble co. buddy <unk> thompson operations and rjr  nabisco inc. in a fraudulent interview .
10: a number of agencies not <unk> legal corruption and lawyers at new york have <unk> the government 's <unk> plea into l. <unk> .

The generated output looks fairly sentence-like.