Machine Translation Bridges Language Barriers
A function downloads and unpacks the ZIP file and returns the name of the txt file for further processing.
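A minimal sketch of such a helper using only the standard library (the URL and directory names are placeholders, not the ones from the original script):

import os
import zipfile
import urllib.request

def download_and_unpack(url, out_dir="data"):
    """Download a ZIP archive, extract it and return the path of the first txt file inside."""
    os.makedirs(out_dir, exist_ok=True)
    zip_path = os.path.join(out_dir, os.path.basename(url))
    if not os.path.exists(zip_path):
        urllib.request.urlretrieve(url, zip_path)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out_dir)
        txt_names = [n for n in archive.namelist() if n.endswith(".txt")]
    return os.path.join(out_dir, txt_names[0])

txt_path = download_and_unpack("https://example.com/fra-eng.zip")  # placeholder URL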
English-French sentence pairs from the Tatoeba Project were used in this example.
The dataset contains tab-separated English-French sentence pairs, one pair per line:
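Loading the file comes down to splitting every line on the tab character. A small sketch (txt_path stands for the path returned by the download helper above):

# Each line holds an English sentence, a tab and its French translation;
# keep only the first two columns
with open(txt_path, encoding="utf-8") as f:
    pairs = [line.split("\t")[:2] for line in f.read().strip().split("\n")]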
The data needs some cleaning: removing punctuation, numbers and non-printable characters, and converting everything to lower case. There are also duplicate lines and empty entries that need to be excluded.
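A cleaning routine along these lines would do the job (a sketch, not necessarily the exact code used here):

import re
import string
import unicodedata
import numpy as np

def clean_pairs(pairs):
    """Lower-case, strip accents, and remove punctuation, digits and non-printable characters."""
    cleaned = set()
    for eng, fra in pairs:
        row = []
        for sentence in (eng, fra):
            # Normalise to ASCII: this drops accents and non-printable characters
            sentence = unicodedata.normalize("NFD", sentence).encode("ascii", "ignore").decode("ascii")
            sentence = sentence.lower()
            # Remove punctuation and digits
            sentence = re.sub(f"[{re.escape(string.punctuation)}0-9]", "", sentence)
            row.append(" ".join(sentence.split()))
        if row[0] and row[1]:            # skip empty entries
            cleaned.add(tuple(row))      # a set takes care of duplicates
    return np.array(sorted(cleaned))

clean = clean_pairs(pairs)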
After cleaning we get a two-dimensional numpy array that can be used by Keras:
array([['have fun', 'amusetoi bien'],
['he tries', 'il essaye'],
['hes wet', 'il est mouille'],
['hi guys', 'salut les mecs'],
['how cute', 'comme cest mignon'],
['how deep', 'quelle profondeur'],
['how nice', 'comme cest chouette'],
['humor me', 'faismoi rire'],
['hurry up', 'depechetoi'],
['i am fat', 'je suis gras']])
Both language sets are tokenized (the whole datasets). The tokenizers also give some insight into the data:
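A sketch of the tokenization step, assuming the classic Keras preprocessing API (clean refers to the cleaned array from the previous step):

from keras.preprocessing.text import Tokenizer

def build_tokenizer(sentences):
    """Fit a word-level tokenizer on a list of sentences."""
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(sentences)
    return tokenizer

eng_tokenizer = build_tokenizer(clean[:, 0])
fra_tokenizer = build_tokenizer(clean[:, 1])

# Vocabulary sizes and the longest sentences give a feel for the data
eng_vocab_size = len(eng_tokenizer.word_index) + 1
fra_vocab_size = len(fra_tokenizer.word_index) + 1
eng_max_len = max(len(s.split()) for s in clean[:, 0])
fra_max_len = max(len(s.split()) for s in clean[:, 1])
print(eng_vocab_size, fra_vocab_size, eng_max_len, fra_max_len)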
Sentences from each language dataset were transformed into integer sequences based on the tokenizer indices and then padded to the maximum sentence length. At this step it would probably be better to use the median length and truncate all sentences above it.
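A sketch of the encoding and padding step; the train/test split and the translation direction (English as source, French as target) are assumptions:

from keras.preprocessing.sequence import pad_sequences
import numpy as np

def encode_sequences(tokenizer, max_len, sentences):
    """Turn sentences into integer sequences and zero-pad them up to max_len."""
    seq = tokenizer.texts_to_sequences(sentences)
    return pad_sequences(seq, maxlen=max_len, padding="post")

# Assumed 90/10 train/test split of the cleaned pairs
np.random.shuffle(clean)
n_train = int(0.9 * len(clean))
train, test = clean[:n_train], clean[n_train:]

train_x = encode_sequences(eng_tokenizer, eng_max_len, train[:, 0])
test_x = encode_sequences(eng_tokenizer, eng_max_len, test[:, 0])
train_y = encode_sequences(fra_tokenizer, fra_max_len, train[:, 1])
test_y = encode_sequences(fra_tokenizer, fra_max_len, test[:, 1])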
The first two sentences of the English dataset after padding:
array([[ 85, 55, 193, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0],
[ 49, 105, 103, 361, 8, 17, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0]], dtype=int32)
In addition, we apply one-hot encoding to the target datasets (both train and test); a sketch of this step is shown after the array:
array([[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[1., 0., 0., ..., 0., 0., 0.],
[1., 0., 0., ..., 0., 0., 0.],
[1., 0., 0., ..., 0., 0., 0.]], dtype=float32)
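The one-hot step comes down to Keras' to_categorical (French as the target language is an assumption carried over from the previous sketch):

from keras.utils import to_categorical

# One-hot encode every token id; the result has shape (samples, timesteps, vocab_size)
train_y_ohe = to_categorical(train_y, num_classes=fra_vocab_size)
test_y_ohe = to_categorical(test_y, num_classes=fra_vocab_size)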
After hours of training the LSTM model and experimenting with different settings (batch size, number of neurons, number of epochs), we were able to achieve the following results.
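For reference, a minimal sketch of how such a sequence-to-sequence LSTM could be put together in Keras; the unit count, batch size and number of epochs here are assumptions, not the settings used for the results above:

from keras.models import Sequential
from keras.layers import Embedding, LSTM, RepeatVector, TimeDistributed, Dense

def build_lstm_model(src_vocab, tgt_vocab, src_len, tgt_len, units=256):
    """Encode the source sentence into a fixed vector, repeat it for every target
    timestep and decode it into target words."""
    model = Sequential()
    model.add(Embedding(src_vocab, units, input_length=src_len, mask_zero=True))
    model.add(LSTM(units))                          # encoder
    model.add(RepeatVector(tgt_len))                # bridge to the decoder
    model.add(LSTM(units, return_sequences=True))   # decoder
    model.add(TimeDistributed(Dense(tgt_vocab, activation="softmax")))
    model.compile(optimizer="adam", loss="categorical_crossentropy")
    return model

model = build_lstm_model(eng_vocab_size, fra_vocab_size, eng_max_len, fra_max_len)
model.fit(train_x, train_y_ohe, epochs=30, batch_size=64,
          validation_data=(test_x, test_y_ohe))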
A GRU model was trained on 100,000 rows and 10,000 tokens for four days; below are some translation examples.
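The examples are obtained by decoding the model's predictions back into words. A minimal sketch of such a helper (the function name is illustrative):

import numpy as np

def translate(model, tokenizer, source_seq):
    """Predict a target sequence and map the predicted indices back to words."""
    index_to_word = {i: w for w, i in tokenizer.word_index.items()}
    probs = model.predict(source_seq.reshape(1, -1), verbose=0)[0]
    indices = np.argmax(probs, axis=-1)
    words = [index_to_word[i] for i in indices if i > 0]   # index 0 is padding
    return " ".join(words)

print(translate(model, fra_tokenizer, test_x[0]))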
Multi-layer Recurrent Neural Networks (LSTM, GRU, RNN) for character-level language models were used to create a haiku generator.
The dataset is more than 400 haiku scraped from the web.
The output is generated character by character, based on what the model learned during training.
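Character-level generation boils down to repeatedly predicting the next character and feeding it back in. A minimal sketch of the sampling loop, assuming a trained Keras model that maps a window of character ids to a probability distribution over the next character (the mapping dictionaries are assumed to come from the training script):

import numpy as np

def generate_text(model, seed, char_to_idx, idx_to_char, seq_len, n_chars=120, temperature=0.8):
    """Generate n_chars characters one at a time, feeding each prediction back as input."""
    text = seed
    for _ in range(n_chars):
        # Encode the last seq_len characters as integer ids
        window = [char_to_idx[c] for c in text[-seq_len:]]
        x = np.array(window).reshape(1, -1)
        probs = model.predict(x, verbose=0)[0]
        # Temperature sampling: lower values make the output more conservative
        probs = np.log(probs + 1e-8) / temperature
        probs = np.exp(probs) / np.sum(np.exp(probs))
        next_idx = int(np.random.choice(len(probs), p=probs))
        text += idx_to_char[next_idx]
    return text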