Tommy is a student at Comberton Village College and he joined us for two weeks of work experience. Here is his account of the work he did.
The challenge was to experiment with various deep learning libraries and produce a neural network to generate text in different styles.
As I had never seen, let alone programmed, a neural network, the first thing to do was to find out some background information on how they work. At the most basic level, a neural network takes a matrix as input (e.g. the letters in a sentence are given numeric IDs and concatenated into a matrix), passes it through layers of ‘neurons’ which process the matrix, and outputs a new vector (e.g. the next predicted letter). The output vector is then processed and compared to the correct value (e.g. the actual next letter), and information from the comparison is used to tweak the weights and biases in the network. This is called ‘training’ the neural network, and is usually done over thousands of iterations. Once training is complete, the neural network should know enough about the input data to produce its own examples of it. This synthesis is the ‘testing/output’ stage.
The quality of the results can vary, and can be improved by increasing the number of training iterations, the number of neurons per layer or the number of layers. There is a catch with adding more layers and neurons, however: the more neurons there are, the longer the network takes to train, and eventually, with too many neurons or layers, the law of diminishing returns kicks in. There are several different types of neural network; for what I was doing, the most applicable were convolutional neural networks and recurrent neural networks.
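The predict-compare-tweak loop can be sketched in a few lines of Python. This is an illustrative toy (a single linear ‘neuron’ with one weight and one bias, which I made up for this explanation), not the networks used in the project:

```python
# Toy training loop: predict, compare to the proper value, then tweak
# the weight and bias a little in the right direction.

def train(pairs, iterations=1000, learning_rate=0.01):
    w, b = 0.0, 0.0  # the weight and bias to be learned
    for _ in range(iterations):
        for x, target in pairs:
            prediction = w * x + b       # forward pass through the 'neuron'
            error = prediction - target  # compare output to the proper value
            w -= learning_rate * error * x  # tweak the weight
            b -= learning_rate * error      # tweak the bias
    return w, b

# Learn y = 2x + 1 from four example pairs
w, b = train([(0, 1), (1, 3), (2, 5), (3, 7)])
```

After 1000 passes over the data, w and b end up very close to 2 and 1. Real networks do the same thing, just with millions of weights and gradients computed by backpropagation.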
Convolutional neural networks are used extensively for computer vision, for example in self-driving cars. They take many inputs simultaneously, for example all the pixels of an image, and compute results over groups of inputs at a time to reach a decision about the inputs (for example, what the image shows, or how near or far the subject is). A classic example of using convolutional neural networks for machine learning is reading the MNIST dataset of handwritten digits. In this example, the neural network takes in an image of a digit and guesses which digit it really is. It then tells itself the answer and tweaks its algorithms, until it performs well when tested against unseen images of digits.
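The ‘groups of inputs’ idea is the convolution itself: a small filter slides over the image and computes a weighted sum over each patch of neighbouring pixels. A minimal pure-Python sketch of my own (the project used TensorFlow’s built-in convolutions, not this):

```python
# Slide a small kernel over an image, computing a weighted sum over each
# group of neighbouring pixels -- the core operation of a convolutional layer.

def convolve(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = sum(image[i + di][j + dj] * kernel[di][dj]
                        for di in range(kh) for dj in range(kw))
            row.append(total)
        output.append(row)
    return output

# A vertical-edge filter responds strongly where dark meets bright
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]
result = convolve(image, kernel)
```

In a real convolutional layer the kernel values are themselves learned weights, and many kernels run in parallel, each learning to detect a different feature.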
Recurrent Neural Networks (RNNs), Long Short-Term Memory networks (LSTMs) and Gated Recurrent Units (GRUs) are widely used for text synthesis. These types of neural network feature an input layer, one or more hidden layers where processing of the input data occurs, and an output layer. In RNNs, LSTMs and GRUs, the hidden layer is looped over multiple times, with the output of previous steps being fed back into the hidden layer so that the final response depends on both the current input and previous outputs. Traditional RNNs are designed to remember the immediately preceding items in a sequence, but have trouble remembering key themes from further back in the sequence, such as the gender of the subject, which leads to the sentences they generate not making much sense. LSTMs and GRUs solve this problem by choosing which parts to remember or forget; for example, an LSTM might forget the gender of a previous subject in order to remember the gender of the current one.
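The looping of the hidden layer can be shown with a toy recurrence (hand-picked weights of my own, not trained values). It also shows why plain RNNs forget: the influence of an early input shrinks a little more at every step:

```python
import math

def rnn_step(x, h_prev, w_input=0.5, w_hidden=0.8):
    # The new hidden state depends on the current input AND the
    # previous hidden state, squashed into (-1, 1) by tanh.
    return math.tanh(w_input * x + w_hidden * h_prev)

# Feed in a single 'signal' followed by silence
h = 0.0
history = []
for x in [1.0, 0.0, 0.0, 0.0]:
    h = rnn_step(x, h)
    history.append(h)
# history shows the memory of the first input fading at each step --
# exactly the problem that LSTM and GRU gating was designed to fix.
```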
I identified three deep machine learning APIs to try: TensorFlow, Theano and Torch.
The first deep learning library I tried was TensorFlow, which is accessed through a Python API. I used TensorFlow to write a convolutional neural network for the MNIST dataset and programmed the network to train itself on 20 000 images. To monitor how the network improved with training, I programmed it to output, every 100 images during the training phase, 4 of its guesses generated using the current iteration of the network, along with the actual values and the images used. These are some of the results from the training:
After 0 iterations, the neural network seems to be predicting a 1 for everything. This is probably because it has no idea about what any of the numbers look like, as it hasn’t trained at all yet.
After 4000 iterations, the network is getting some of its predictions correct, but still making some understandable mistakes. For the number 8, the network doesn’t seem to be looking at the entire 8, but only parts of it. If you remove some of the lines, it becomes a 7:
Here are its results after 19 400 iterations:
Here, the network is making mistakes that are more like what a human would make. For example, the 7 here looks very much like a 9, and could easily be mistaken for one. After this many iterations, the network seems to only be making mistakes on the very ambiguous characters.
As the training continues the accuracy of the predictions continues to improve.
Once training was complete, I tested the resulting network on a further 10 000 images on which the network achieved around 90% accuracy. Not even humans get 100% accuracy on tests like these (such as CAPTCHAs), so this is a very good result.
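Stripped of the TensorFlow specifics, the train-and-monitor loop described above looks roughly like this. This is a plain-Python skeleton of my own, with predict and train_step standing in for the real network:

```python
import random

def train_with_monitoring(images, labels, predict, train_step,
                          total=20000, every=100, samples=4):
    """Train on `total` images; every `every` iterations, record a few of
    the current network's guesses alongside the actual labels."""
    log = []
    for i in range(total):
        train_step(images[i % len(images)], labels[i % len(labels)])
        if i % every == 0:
            for j in random.sample(range(len(images)), samples):
                log.append((i, predict(images[j]), labels[j]))
    return log
```

Each log entry pairs an iteration number with a guess and the actual label, which is the kind of record the progress snapshots above came from.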
As I moved onto experimenting with recurrent neural networks and text generation, I also changed to using Theano, another Python deep learning API.
For my first attempt at generating text I used 7.25MB (7 250 000 characters) of Reddit comments from August 2015. There were about 1 387 000 words in the file, but a dictionary of the 8000 most common words (counting punctuation marks as words) was generated, both to keep training time down and to stop the neural network using too many infrequent words. The original document was a CSV file with each comment on a new line, but the output was a plain .txt file, because these are easier for the neural network to create.
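Building the most-common-words dictionary can be sketched with Python’s collections.Counter. This is my own illustration of the idea (with a cap of 3 words for the demo, and an assumed catch-all ID for anything outside the dictionary), not the exact code used:

```python
from collections import Counter

def build_vocab(tokens, size=8000):
    # Keep only the most frequent words; everything rarer is mapped to a
    # single 'unknown' ID so the network never has to learn rare words.
    vocab = [word for word, _ in Counter(tokens).most_common(size)]
    word_to_id = {word: i for i, word in enumerate(vocab)}
    unknown_id = len(vocab)
    ids = [word_to_id.get(t, unknown_id) for t in tokens]
    return ids, word_to_id

# Tiny demo with a 3-word dictionary; 'on', 'mat' and '.' all share the
# unknown ID because they fall outside the dictionary
ids, word_to_id = build_vocab("the cat sat on the mat .".split(), size=3)
```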
The network I used had 2 hidden layers with 128 neurons each. Some of the best examples from this word-based network were:
few lbs spent .
you look reactionary as
bugs ? .
good, was ? unemployed time last sorry .
father high .
And the very summative:
nonsense notes .
These really were the best examples; most of the output made no sense at all:
- see n't libraries honda a invalid these its void why known fitness your bin
but you shootings a % it our indicates as
all you cuz cop grad took is in active .
One noticeable thing about these phrases is the terrible grammar. This is because, to reduce the number of different words, the data used for training was pre-processed, adding spaces before punctuation and before suffixes like “n’t” (from “isn’t”) so that they were treated as separate words. Then, when generating the new text, the neural network would put a space between all ‘words’, whether they were actual words or not. Therefore, I decided to move on to a neural network that looked at individual characters instead of words.
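The pre-processing step responsible for that grammar might look something like this. This is my own sketch of the idea described above, not the exact code used:

```python
import re

def pretokenise(text):
    # Split "n't" contractions and punctuation off as separate tokens.
    # This keeps the word count down -- at the cost of odd spacing when
    # the generated tokens are later joined back together with spaces.
    text = re.sub(r"n't", " n't", text)
    text = re.sub(r"([.,!?;:])", r" \1", text)
    return text

example = pretokenise("isn't this great, really?")
# example is "is n't this great , really ?" -- and anything generated
# from tokens like these inherits the same spacing.
```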
To save time I decided to use a pre-written recurrent neural network. There are many examples available to download and I used Justin Johnson’s, available from https://github.com/jcjohnson/torch-rnn. This is written in Lua and uses the Torch API to implement recurrent neural networks and long short-term memory modules.
As this network trains on individual characters rather than whole words, to begin with I tested how it compared to the previous word-based network. To make the comparison fair, I set it up to use 2 hidden layers with 128 neurons each. However, as it was being trained on characters rather than words, there was no need to limit the number of different words, so there was no pre-processing stage adding spaces to the initial text, and this is reflected in the output.
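Working at the character level also makes the encoding step much simpler: the whole ‘dictionary’ is just the set of distinct characters in the corpus, so no 8000-entry cap is needed. A sketch of my own:

```python
def char_encode(text):
    # The vocabulary is simply every distinct character, sorted for a
    # stable ordering -- typically well under a few hundred entries,
    # compared with hundreds of thousands of distinct words.
    chars = sorted(set(text))
    char_to_id = {c: i for i, c in enumerate(chars)}
    return [char_to_id[c] for c in text], char_to_id

ids, char_to_id = char_encode("hello world")
```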
Some of the best examples from the character-based network were:
So find I was horrible stoin. When I say these because it's guides around it, there are kirmag with a argument or animals.
Can I feel there. It's juff. Some jake is an orgine eights to have a Priest is so dutarding to find formative things to get a little logic degar. I have watch it.
Not amazing, but an improvement. Ignoring the bad syntax and made-up words, it actually reads a little like the original Reddit comments. Unlike the word-based network, the character-based network doesn’t need a list of the 8000 most common words in the dataset, because it only has to choose between characters; a side effect is that it occasionally makes up words like ‘stoin’ and ‘dutarding’. It also learns to put capital letters at the beginning of sentences and full stops at the end. As for “So. /r/game?”, this subreddit actually exists, but is private and invitation-only. URLs are understood very well, and the neural network even invented a valid one.
However, there were many syntactical errors, and the neural network couldn’t seem to keep writing about a particular topic for a long time. This was probably because it could only remember a few words at a time. There are a few terrible examples, such as:
****Worttle Corrimum Since
Results could probably be improved by using more layers in the network, and more neurons in each layer, although this will significantly increase training time.
As a final test of the character-based neural network, I modified it to have 256 neurons in each of its 2 hidden layers and trained it on 1.64MB of Shakespeare’s plays in XML format (I used Antony and Cleopatra, A Midsummer Night’s Dream, Hamlet, Julius Caesar, Macbeth, The Merchant of Venice, Othello, and Romeo and Juliet). It came up with something looking slightly plausible:
<LINE>Why, ond all sense here of she; for whose heads have</LINE>
<LINE>In such love dim. Brutus Antony so,</LINE>
<LINE>Ever to otherwise. We leave me like of Corth; make</LINE>
<LINE>forficiers Portia, for, valleys! look you, my lord.</LINE>
<LINE>If he foe one satisfaction.</LINE>
<LINE>Then play'd me to act so tyranny in heaven;</LINE>
<LINE>Patient that I may entertaid to-day,</LINE>
<LINE>And his corvincious my world were authors.</LINE>
<LINE>Thou could not tender your part of our feast:</LINE>
<LINE>Speak, brave me, as you hear, adorrous all.</LINE>
The neural network understands the structure of the XML very well. It learns to always put a name in between the <SPEAKER> tags, and always put the lines in between the <LINE> tags, etc. It even properly learns stage directions, and sometimes re-uses characters:
<LINE>How like a learned of a tongue:</LINE>
<LINE>Double as one do attactly.</LINE>
<STAGEDIR>Enter a Messenger</STAGEDIR>
<LINE>Leave the wind; Hercules.</LINE>
<LINE>I tell you of a like a prison of his sounds,</LINE>
<LINE>To o'erlain'd alter Charitor; I would company</LINE>
<LINE>bewarket not my horse: the enchantia's will</LINE>
<LINE>To poor my eye, very more loves are not a memory.</LINE>
<LINE>Therefore, get thee say as that many child;</LINE>
<LINE>For if he not be that cowards contriver did,</LINE>
Occasionally, the neural network invents its own new words, just like in the Reddit example, except this time, this is more acceptable, as Shakespeare himself did this too.
Out of the three libraries I tested, Torch was by far the best, because it understood the structure of text quickly and semi-accurately. It also trained extremely quickly compared to the other two: the Reddit comments took 20 minutes to train on Torch, while the more basic networks I trained using Theano and TensorFlow both took hours. Theano might have produced more realistic Reddit comments if I had used a dictionary of characters instead of words. As for the TensorFlow neural network I wrote, it achieved good results, but was very much tailored to the file format of the MNIST dataset I was using and couldn’t be adapted to other images. Overall, this was a very fun project, and it would be great in the future to make a convolutional neural network which could enhance the quality of images or draw its own objects.
Do you have a project that you would like to discuss with us, or a general enquiry? Please feel free to contact us.