Train an RNN model to write a Bahasa paragraph by itself

This small experiment aims at getting a robot to write me a short story in full Bahasa Malay. I used a small text file (3.1 MB) containing cleaned Malay text with complete sentences.

What is the application? - In digital marketing, a robot which understands the local natural language can engage in interactive conversations personalised to customers and communicate insight-driven answers or recommendations with confidence.

Below are two paragraphs written entirely by the robot.

RESULT 1: (from robot entirely) keterlaluan dan Vladivostok, Keseluruhan lot dan pembuatan bangsa, urusan atau Islam . Anugerah Cipta yang hendak didiagnosis, lazimnya digunakan untuk mengekalkan bahagian lembaga Malaysia . Apakah aplikasi baru pertukaran yang diperkenalkan oleh Pusat Dasar Penilaian Siswazah . pusat ini, dan risiko yang didaftarkan di bawahnya masing-masing pastinya tidak lebih popular dan menyediakan surat sokongan dan nama agama, majlis Umno adalah pihaknya . Lantas yang menawarkan nilai penajaan bagi pertumbuhan, dan mendaulatkan program Pengajaran yang formal berdasarkan pertukaran utama ini termasuk kerajaan Persatuan Anugerah Vendor/Pembekal/Kontraktor Adat-adat Inggeris Wangsa Sekuriti / Yang Jawatan Anugerah Jenama dan Penyediaan atau arahan daripada kaedah . Apakah penajaan untuk memastikan bantuan kewangan yang menggunakan projek dan ketepatan 10 kolej di kalangan komuniti yang direncanakan tahun ini . Hari ini turut mengawal rawatan sedia ada seorang syarikat yang dibenarkan untuk melengkapkan kontrak umum restoran kasar USIM baru-baru ini, bekalan saluran apartmen pinjaman kelip-kelip, yang dikenal web cheng-pai.com/event/chapgohmei . Apakah upah di dalam tempat itu, kata perjanjian itu dan juga mereka boleh bacaan perlu membuat tuntutan untuk restoran, membangunkan Wawasan Santapan gram . Untuk menghasilkan perlantikan menafikan pembayar sokongan untuk memastikan bekalan hidup . Perarakan bulan lepas, Pengerusi Maal Labuan hari ini terletak ini . Katanya Dewan Malaysia dijangka

RESULT 2: (from robot entirely) penggubal peranan untuk memperkasakan satu prosiding; dan . Waris-waris simpanan atau struktur yang adalah sebanyak Amalan zakat . Surat lain . Apakah prosedur dan kewangan dan agensi bagi bagi laman web http://www.sprm.gov.my . Tan Sri Ismail Chiew menganggarkan Alam Keluarga dan laman web TUKAR . Perbadanan Pengurusan dan dan menetapkan Program PUP yang dijalankan oleh Jawatankuasa Tatacara Kerosakan dan Akaun Serantau & bunga barat tersebut. Majlis Tanah . Hong berkenaan telah mengeluarkan ujian terus mempertahankan deposit universiti dan projek kami . Ng turut menjangka pelaburan kekal di luar rantau ini. Dengan Mengurus Kumpulan Perkhidmatan Pengurusan Minyak . Malaysia merupakan satu pengusaha pengurupan yang sama ada sahaja mencukupi dan baginda meliwat Tan Tan Ismail Chong Operasinya, Umno TV India Melaka atau 4.3 Ahad, . Sebelum ini, Ongkili berada seperti perniagaan perumahan . Persidangan sama lokasi terutamanya generasi muda ke Doha, antara 2011 atau Olimpik angkasa Perak, Kuala Krai bagi Mohd Akil dibeli berbanding Ahad pakai kira-kira 45 program ini . Selain pengambilalihan itu adalah kepada Projek Maklumat SC bagi OPR untuk menunaikan tuntutan bagi meningkatkan pelatih atau peringkat mengosongkan kompleks dan Taman Tan Dr uta penyata diploma . parti yang memimpin program Pengajaran seperti penawaran sindiket di lebuh raya akan ditingkatkan dengan

The content it produced was somewhat random and only partly comprehensible, but in some sentences the robot was able to learn the basic constituent structure of a Malay sentence, as well as the placement of punctuation marks. For example:

Hari ini turut mengawal rawatan sedia ada seorang syarikat yang dibenarkan untuk melengkapkan kontrak umum restoran kasar USIM baru-baru ini (Google translation: Today also controls existing treatment that allowed a company to complete a public contract rude restaurant recently USIM)

Sebelum ini, Ongkili berada seperti perniagaan perumahan . (Google translation with my own tweak: Previously, Maximus was like in the housing business)

For this experiment, I used the model produced by Sung Kim (from HKUST), which uses TensorFlow. The recurrent network type is LSTM, the number of hidden states is 50, the number of RNN layers is 2, the RNN sequence length is 20, and the number of epochs is 10.
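For readers who want to try something similar, below is a minimal sketch of a character-level LSTM language model with those hyperparameters. It is written with tf.keras rather than Sung Kim's original code; the corpus file name, the batch size, and the optimizer are my own assumptions for illustration.

```python
# Minimal sketch (not Sung Kim's original code) of a character-level LSTM
# language model with the settings described above: 2 LSTM layers,
# 50 hidden units, sequence length 20, 10 epochs.
# "malay_corpus.txt", the batch size, and the optimizer are assumptions.
import numpy as np
import tensorflow as tf

SEQ_LEN, HIDDEN, EPOCHS = 20, 50, 10

text = open("malay_corpus.txt", encoding="utf-8").read()   # ~3.1 MB of cleaned Malay text
chars = sorted(set(text))
char2idx = {c: i for i, c in enumerate(chars)}
encoded = np.array([char2idx[c] for c in text])

# Each training example: 20 characters of context -> the next character.
X = np.stack([encoded[i:i + SEQ_LEN] for i in range(len(encoded) - SEQ_LEN)])
y = encoded[SEQ_LEN:]

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(len(chars), HIDDEN),
    tf.keras.layers.LSTM(HIDDEN, return_sequences=True),   # RNN layer 1 of 2
    tf.keras.layers.LSTM(HIDDEN),                          # RNN layer 2 of 2
    tf.keras.layers.Dense(len(chars), activation="softmax"),
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
model.fit(X, y, batch_size=128, epochs=EPOCHS)
```

Text is then generated by repeatedly sampling the next character from the softmax output and feeding it back in as part of the next 20-character context.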

What is an RNN? - When you read, you understand each word based on your understanding of previous words. You don’t throw everything away and start thinking from scratch again. Your thoughts have persistence. Unlike traditional neural networks, RNNs have loops in them, allowing information to persist. A recurrent neural network can be thought of as multiple copies of the same network, each passing a message to a successor. This chain-like nature reveals that recurrent neural networks are intimately related to sequences and lists. They’re the natural neural network architecture to use for such data (sequence and list data: speech recognition, language modeling, translation, image captioning, etc.).

(Image: a recurrent neural network unrolled through time, copied from colah's blog)
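To make the "loop" concrete, here is a tiny sketch (plain NumPy, my own illustration rather than anything from the experiment's code) of a vanilla RNN processing a sequence: the same weights are applied at every time step, and the hidden state carries information forward from one step to the next.

```python
# Toy vanilla RNN step, repeated over a sequence (illustration only).
# The hidden state h is the "persistent" information passed along the chain.
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, steps = 8, 16, 5

W_xh = rng.standard_normal((hidden_size, input_size)) * 0.1   # input -> hidden
W_hh = rng.standard_normal((hidden_size, hidden_size)) * 0.1  # hidden -> hidden (the loop)
b = np.zeros(hidden_size)

xs = rng.standard_normal((steps, input_size))  # a sequence of 5 input vectors
h = np.zeros(hidden_size)                      # initial hidden state

for x_t in xs:
    # The same cell ("copy of the network") is applied at every time step;
    # the new h depends on the current input and on the previous hidden state.
    h = np.tanh(W_xh @ x_t + W_hh @ h + b)

print(h.shape)  # (16,) -- a summary of everything seen so far
```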

One of the appeals of RNNs is the idea that they might be able to connect previous information to the present task, such as using previous video frames to inform the understanding of the present frame. In theory, RNNs can do this, but in practice they struggle with long-term dependencies (longer gaps). Consider a language model trying to predict the next word based on the previous ones. If we are trying to predict the last word in “the clouds are in the sky,” we don’t need any further context – it’s pretty obvious the next word is going to be sky. In such cases, where the gap between the relevant information and the place that it’s needed is small, RNNs can learn to use the past information. But there are also cases where we need more context. Consider trying to predict the last word in the text “I grew up in France… I speak fluent French.” Recent information suggests that the next word is probably the name of a language, but if we want to narrow down which language, we need the context of France, from further back. It’s entirely possible for the gap between the relevant information and the point where it is needed to become very large.

What is an LSTM? - Long Short Term Memory networks (LSTMs) are a special kind of RNN capable of learning long-term dependencies, and they are now widely used. LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering information for long periods of time is practically their default behaviour. All recurrent neural networks have the form of a chain of repeating modules of neural network. In standard RNNs, this repeating module will have a very simple structure, such as a single tanh layer. The repeating module in a standard RNN contains a single layer.

(Image: the repeating module in a standard RNN, containing a single tanh layer, copied from colah's blog)
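In equation form (using colah’s notation, where $[h_{t-1}, x_t]$ denotes the previous hidden state concatenated with the current input), that single tanh layer computes:

$$h_t = \tanh(W \cdot [h_{t-1}, x_t] + b)$$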

LSTMs also have this chain-like structure, but the repeating module has a different structure. Instead of having a single neural network layer, there are four, interacting in a very special way.

(Image: the repeating module in an LSTM, containing four interacting layers, copied from colah's blog)

(Each line carries an entire vector, from the output of one node to the inputs of others. The pink circles represent pointwise operations, like vector addition, while the yellow boxes are learned neural network layers. Lines merging denote concatenation, while a line forking denotes its content being copied and the copies going to different locations.)

The repeating module in an LSTM contains four interacting layers. The key to LSTMs is the cell state, the horizontal line running through the top of the diagram. The cell state is kind of like a conveyor belt. It runs straight down the entire chain, with only some minor linear interactions. It’s very easy for information to just flow along it unchanged. Gates are a way to optionally let information through. They are composed out of a sigmoid neural net layer and a pointwise multiplication operation. The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through. A value of zero means “let nothing through,” while a value of one means “let everything through!”

An LSTM has three of these gates, to protect and control the cell state. The first step in our LSTM is to decide what information we’re going to throw away from the cell state. This decision is made by a sigmoid layer called the “forget gate layer.” It looks at $h_{t-1}$ and $x_t$, and outputs a number between $0$ and $1$ for each number in the cell state $C_{t-1}$. A $1$ represents “completely keep this” while a $0$ represents “completely get rid of this.” Let’s go back to our example of a language model trying to predict the next word based on all the previous ones. In such a problem, the cell state might include the gender of the present subject, so that the correct pronouns can be used. When we see a new subject, we want to forget the gender of the old subject.
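In the same notation, the forget gate is:

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$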

The next step is to decide what new information we’re going to store in the cell state. This has two parts. First, a sigmoid layer called the “input gate layer” decides which values we’ll update. Next, a tanh layer creates a vector of new candidate values, $\tilde{C}_t$, that could be added to the state. In the next step, we’ll combine these two to create an update to the state. In the example of our language model, we’d want to add the gender of the new subject to the cell state, to replace the old one we’re forgetting.
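The two parts correspond to:

$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$

$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$$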

It’s now time to update the old cell state, $C_{t-1}$, into the new cell state $C_t$. The previous steps already decided what to do; we just need to actually do it. We multiply the old state by $f_t$, forgetting the things we decided to forget earlier. Then we add $i_t * \tilde{C}_t$. These are the new candidate values, scaled by how much we decided to update each state value. In the case of the language model, this is where we’d actually drop the information about the old subject’s gender and add the new information, as we decided in the previous steps.
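That is:

$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$$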

Finally, we need to decide what we’re going to output. This output will be based on our cell state, but will be a filtered version. First, we run a sigmoid layer which decides what parts of the cell state we’re going to output. Then, we put the cell state through $\tanh$ (to push the values to be between $-1$ and $1$) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to. For the language model example, since it just saw a subject, it might want to output information relevant to a verb, in case that’s what is coming next. For example, it might output whether the subject is singular or plural, so that we know what form a verb should be conjugated into if that’s what follows next.
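In equations:

$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$

$$h_t = o_t * \tanh(C_t)$$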

Courtesy of colah’s blog

Written on January 5, 2017