Natural Language Translation

With TensorFlow Keras and Recurrent Neural Networks

Machine Translation Bridges Barriers of Language

(figure: recurrent neural network diagram)

Preprocessing Data

Data Acquisition

A function downloads and unpacks the ZIP file and returns the name of the txt file for further processing.
English-French sentence pairs from the Tatoeba Project were used in this example.
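
A minimal sketch of such a helper, using only the standard library (the archive URL in the example call is an assumption, not necessarily the exact file used in the project):

    import os
    import zipfile
    from urllib.request import urlretrieve

    def download_and_unpack(url, workdir="data"):
        """Download a ZIP archive, unpack it and return the path of the txt file inside."""
        os.makedirs(workdir, exist_ok=True)
        zip_path = os.path.join(workdir, "fra-eng.zip")
        if not os.path.exists(zip_path):
            urlretrieve(url, zip_path)
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(workdir)
            txt_name = next(n for n in archive.namelist() if n.endswith(".txt"))
        return os.path.join(workdir, txt_name)

    # example call (URL assumed):
    # txt_file = download_and_unpack("http://www.manythings.org/anki/fra-eng.zip")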

Data Cleaning

The dataset contains tab-separated English - French sentence pairs, one pair per line:
(figure: sample of the raw tab-separated data)
The data needs some cleaning: removing punctuation, numbers and non-printable characters, and lowercasing everything. Duplicate lines and empty entries also need to be excluded.
After cleaning we get a two-dimensional numpy array that can be used by Keras:


    array([['have fun', 'amusetoi bien'],
          ['he tries', 'il essaye'],
          ['hes wet', 'il est mouille'],
          ['hi guys', 'salut les mecs'],
          ['how cute', 'comme cest mignon'],
          ['how deep', 'quelle profondeur'],
          ['how nice', 'comme cest chouette'],
          ['humor me', 'faismoi rire'],
          ['hurry up', 'depechetoi'],
          ['i am fat', 'je suis gras']])

Array shape for English-French pairs is (96518, 2).
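
A minimal sketch of the cleaning step described above, assuming the raw tab-separated lines have already been read from the txt file (helper names are illustrative, not the project's exact code):

    import string
    import unicodedata
    import numpy as np

    def clean_pairs(lines):
        """Normalize accents, lowercase, strip punctuation/digits, drop empty and duplicate pairs."""
        table = str.maketrans("", "", string.punctuation + string.digits)
        seen, pairs = set(), []
        for line in lines:
            parts = line.strip().split("\t")[:2]
            if len(parts) != 2:
                continue
            cleaned = []
            for sentence in parts:
                # decompose accented characters and drop the combining marks (é -> e)
                sentence = unicodedata.normalize("NFD", sentence)
                sentence = "".join(ch for ch in sentence if unicodedata.category(ch) != "Mn")
                cleaned.append(sentence.lower().translate(table).strip())
            eng, fra = cleaned
            if eng and fra and (eng, fra) not in seen:
                seen.add((eng, fra))
                pairs.append([eng, fra])
        return np.array(pairs)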

The data is then ready to be split into training and testing sets (80/20).
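
A one-line sketch of the split, assuming scikit-learn is available:

    from sklearn.model_selection import train_test_split

    # pairs is the cleaned (96518, 2) array; column 0 is English, column 1 is French
    train, test = train_test_split(pairs, test_size=0.2, random_state=42)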

Data Tokenization

Both language sets are tokenized (the whole datasets). The tokenizers give some insight into the data:

  • number of unique words in the English dataset: 13,547
  • maximum sentence length in the English dataset: 44
  • number of unique words in the target (French) dataset: 23,556
  • maximum sentence length in the target (French) dataset: 54
Since the number of unique words is not very large, all tokens were used for training.
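
A minimal sketch of the tokenization step with the Keras Tokenizer (variable names are illustrative):

    from tensorflow.keras.preprocessing.text import Tokenizer

    def create_tokenizer(sentences):
        """Fit a word-level Keras tokenizer on a list of sentences."""
        tokenizer = Tokenizer()
        tokenizer.fit_on_texts(sentences)
        return tokenizer

    eng_tokenizer = create_tokenizer(pairs[:, 0])
    fra_tokenizer = create_tokenizer(pairs[:, 1])

    eng_vocab_size = len(eng_tokenizer.word_index) + 1   # unique words + 1 for the padding index
    fra_vocab_size = len(fra_tokenizer.word_index) + 1
    eng_length = max(len(s.split()) for s in pairs[:, 0])  # max sentence length in words
    fra_length = max(len(s.split()) for s in pairs[:, 1])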

Data Encoding

Sentences from each language dataset were transformed into sequences of integers based on the tokenizer indices and then padded to the maximum sentence length.
At this step it would probably be better to use the median length and truncate sentences that exceed it.
The first two sentences of the English dataset after padding:


    array([[ 85,  55, 193,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0],
      [ 49, 105, 103, 361,   8,  17,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0]], dtype=int32)
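
A minimal sketch of this encode-and-pad step, assuming the tokenizers and maximum lengths from the previous step:

    from tensorflow.keras.preprocessing.sequence import pad_sequences

    def encode_sequences(tokenizer, max_length, sentences):
        """Map words to the tokenizer's integer indices and pad every sequence to max_length."""
        seqs = tokenizer.texts_to_sequences(sentences)
        return pad_sequences(seqs, maxlen=max_length, padding="post")

    train_x = encode_sequences(eng_tokenizer, eng_length, train[:, 0])
    test_x = encode_sequences(eng_tokenizer, eng_length, test[:, 0])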

In addition, we apply one-hot encoding to the target datasets (both train and test):


    array([[0., 0., 0., ..., 0., 0., 0.],
          [0., 0., 0., ..., 0., 0., 0.],
          [0., 0., 0., ..., 0., 0., 0.],
          ...,
          [1., 0., 0., ..., 0., 0., 0.],
          [1., 0., 0., ..., 0., 0., 0.],
          [1., 0., 0., ..., 0., 0., 0.]], dtype=float32)
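
A minimal sketch of the one-hot step with to_categorical, assuming the encoded target sequences:

    from tensorflow.keras.utils import to_categorical

    def one_hot_targets(sequences, vocab_size):
        """Turn integer target sequences into one-hot vectors of shape (samples, time steps, vocab)."""
        return to_categorical(sequences, num_classes=vocab_size)

    train_y = one_hot_targets(encode_sequences(fra_tokenizer, fra_length, train[:, 1]), fra_vocab_size)
    test_y = one_hot_targets(encode_sequences(fra_tokenizer, fra_length, test[:, 1]), fra_vocab_size)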

Define Models

Long Short-Term Memory

(figure: LSTM model architecture)
(figure: LSTM model summary)
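
The exact architecture and hyperparameters are shown in the figures above; a minimal sketch of a comparable encoder-decoder LSTM in Keras (the layer size of 256 units is an assumption):

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Embedding, LSTM, RepeatVector, TimeDistributed, Dense

    def build_lstm_model(src_vocab, tgt_vocab, tgt_length, units=256):
        """Embedding + LSTM encoder, RepeatVector bridge, LSTM decoder with a per-timestep softmax."""
        model = Sequential([
            Embedding(src_vocab, units, mask_zero=True),
            LSTM(units),                          # encoder: compress the source sentence into a vector
            RepeatVector(tgt_length),             # repeat it for every target time step
            LSTM(units, return_sequences=True),   # decoder
            TimeDistributed(Dense(tgt_vocab, activation="softmax")),
        ])
        model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
        return model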

Gated Recurrent Unit

(figure: GRU model architecture)
(figure: GRU model summary)
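
The GRU variant shown above follows the same pattern, with GRU cells in place of the LSTM cells:

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Embedding, GRU, RepeatVector, TimeDistributed, Dense

    def build_gru_model(src_vocab, tgt_vocab, tgt_length, units=256):
        """Same encoder-decoder sketch as above, with GRU cells instead of LSTM cells."""
        model = Sequential([
            Embedding(src_vocab, units, mask_zero=True),
            GRU(units),
            RepeatVector(tgt_length),
            GRU(units, return_sequences=True),
            TimeDistributed(Dense(tgt_vocab, activation="softmax")),
        ])
        model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
        return model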

Model Training

After hours of training the LSTM model and experimenting with different settings (batch size, number of neurons, number of epochs), we were able to achieve the following results:

(figure: training results for the English-French model)
The minimum test loss is 1.11437 and the test accuracy is 0.8089.
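
For reference, a minimal sketch of the training loop for such a model (the batch size, epoch count and checkpoint file name are placeholders, not the settings that produced the results above):

    from tensorflow.keras.callbacks import ModelCheckpoint

    model = build_lstm_model(eng_vocab_size, fra_vocab_size, fra_length)
    checkpoint = ModelCheckpoint("model.h5", monitor="val_loss", save_best_only=True)
    model.fit(train_x, train_y,
              validation_data=(test_x, test_y),
              epochs=30, batch_size=64,
              callbacks=[checkpoint])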

The GRU model was trained on 100,000 rows and 10,000 tokens for four days; some translation examples are shown below:

(figure: GRU translation examples)
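
A minimal sketch of how a trained model can be used to translate a new sentence, assuming the encode_sequences helper from the encoding step (the lookup logic is illustrative):

    import numpy as np

    def translate(model, sentence, src_tokenizer, src_length, tgt_tokenizer):
        """Encode a source sentence, predict target word indices and map them back to words."""
        seq = encode_sequences(src_tokenizer, src_length, [sentence])
        probs = model.predict(seq)[0]                     # shape: (tgt_length, tgt_vocab)
        index_to_word = {i: w for w, i in tgt_tokenizer.word_index.items()}
        words = [index_to_word.get(int(np.argmax(step)), "") for step in probs]
        return " ".join(w for w in words if w)

    # translate(model, "how are you", eng_tokenizer, eng_length, fra_tokenizer)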

Haiku Generator

Inspired by Andrej Karpathy, Stanford Computer Science Ph.D. student and Director of AI at Tesla

A multi-layer recurrent neural network (LSTM, GRU, RNN) for character-level language modeling was used to create a haiku generator.
The dataset consists of more than 400 haiku scraped from the web:

(figure: sample of the haiku dataset)

The output is generated character by character, based on what the model learned during training.
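
A minimal sketch of that sampling loop, assuming a model that takes a fixed-length window of character indices and outputs a softmax over the character vocabulary (the window size and temperature are assumptions):

    import numpy as np

    def generate_haiku(model, seed, char_to_idx, idx_to_char, window, n_chars=60, temperature=0.8):
        """Repeatedly predict the next character and append it to the running text."""
        text = seed
        for _ in range(n_chars):
            context = text[-window:].rjust(window)                  # pad/trim the context window
            x = np.array([[char_to_idx.get(c, 0) for c in context]])
            probs = model.predict(x)[0]
            probs = np.exp(np.log(probs + 1e-8) / temperature)      # temperature sampling
            probs /= probs.sum()
            next_idx = np.random.choice(len(probs), p=probs)
            text += idx_to_char[next_idx]
        return text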

Conclusions

  • The more powerful the machine, the easier it is to adjust model settings and get cleaner results
  • Training on large datasets produces better results, but is very time-consuming
  • It is much easier to work with Keras than with pure TensorFlow
  • When deploying large models, the model weights and architecture have to be adjusted

The Team

nothing but the empty row

All the code can be found on GitHub.
The deployed application is hosted on Heroku.