Series 1 LSTM Gender Classification Tensorflow

September 27, 2020
Tensorflow Text Classification NLP LSTM

Hello! This post is the first in a series about using a deep learning approach for a simple text classification model, from training the model to serving it in a “production ready” application with TensorFlow Serving or Flask. The series is divided into three posts; this first one covers preparing and training a simple model that predicts gender from a name.

Let's say our application needs a model that predicts a user’s gender from their name. We start with a dataset of name and gender pairs in a .csv file. Then we build a model with an LSTM architecture that takes the characters of the user’s name as input and outputs the gender (male/female). In this series we use TensorFlow and tf.keras with as few additional components as possible and a small architecture, so you can still train the model on your own laptop without excessive resources.
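
To make "characters of the name as input" concrete, here is a minimal pure-Python sketch of the idea. The two sample rows and the helper `encode_name` are hypothetical and only for illustration; the real code later in this post uses Keras `pad_sequences`, which pads with zeros at the front by default.

```python
# Hypothetical sample rows; the real dataset is loaded from a .csv file later.
rows = [("Andi", "male"), ("Sari", "female")]

# Build a character -> integer index; 0 is reserved for padding.
chars = sorted({ch for name, _ in rows for ch in name.lower()})
char_index = {ch: i + 1 for i, ch in enumerate(chars)}

def encode_name(name, max_len):
    """Map each character to its index and left-pad with zeros to max_len."""
    ids = [char_index[ch] for ch in name.lower()]
    return [0] * (max_len - len(ids)) + ids

# Pad every name to the length of the longest one, plus two for illustration.
max_len = max(len(name) for name, _ in rows) + 2
for name, gender in rows:
    print(name, encode_name(name, max_len), gender)
```

Every name becomes a list of integers of the same length, which is the shape the embedding and LSTM layers expect.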

In this series I use the Jupyter notebook from the TensorFlow Docker image, so you don't need to build the environment and install the required packages one by one. To use TensorFlow Docker with Jupyter, please read: https://www.tensorflow.org/install/docker

  • Import Necessary Packages
    import re
    
    import numpy as np 
    import pandas as pd 
    
    import tensorflow as tf
    
    from sklearn.feature_extraction.text import CountVectorizer
    from tensorflow.keras.preprocessing.text import Tokenizer
    from tensorflow.keras.preprocessing.sequence import pad_sequences
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Input, Dense, Embedding, LSTM, Dropout
    from sklearn.model_selection import train_test_split
    
  • Load Datasets
    data = pd.read_csv('name_gender_pair.csv')
    
  • Create Character Dictionary for Name Features
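    # lowercase the names so the vocabulary is case-insensitive
    data['name'] = data['name'].apply(lambda x: x.lower())
    
    # collect every distinct character that appears in the names
    human_vocab = set()
    for name in data['name']:
        human_vocab.update(name)
    
    # sort the characters so the index is the same on every run;
    # index 0 is left free because pad_sequences uses it for padding
    vocab_index = {ch: i + 1 for i, ch in enumerate(sorted(human_vocab))}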
    
    print(vocab_index)
    print(len(vocab_index))
    
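  • Preprocess the dataset and split it into train and test sets
    # encode each name as a list of character indices
    name_datasets = data['name'].apply(lambda x: [vocab_index[key] for key in x])
    # left-pad with zeros to the length of the longest name
    X = pad_sequences(name_datasets)
    # one-hot encode the labels as floats; columns follow the sorted label order
    Y = pd.get_dummies(data['gender'], dtype='float32').values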
    
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
    
    print('data shape')
    print(X_train.shape, y_train.shape)
    print(X_test.shape, y_test.shape)
    
  • Build the classification model using LSTM
    # build model
    model = Sequential()
    
    # embedding layer maps each character index to a 16-dimensional vector;
    # the + 1 accounts for index 0, which is used for padding
    model.add(Embedding(len(vocab_index) + 1, 16, input_length=X.shape[1]))
    model.add(LSTM(64, dropout=0.2, recurrent_dropout=0.2))
    # two output units, one per gender class
    model.add(Dense(2, activation='sigmoid'))
    
    # compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    
    print('model summary')
    model.summary()
    
  • Train the model
    batch_size = 64
    n_epochs = 12
    # train the model
    model.fit(X_train, y_train, batch_size=batch_size, epochs=n_epochs, validation_data=(X_test, y_test), verbose=1)
    
  • Evaluate the model
    #test evaluate
    score, acc = model.evaluate(X_test, y_test, batch_size=64)
    print('score', score)
    print('accuracy', acc)
    
  • Test the model
    #test 
    name = 'Aminarti'
    name = list(name.lower())
    test_dt = [vocab_index[x] for x in name]
    test_dt = pad_sequences([test_dt], maxlen=X.shape[1])
    print(test_dt) #print encoded test input sequence
    
    # predict with the model; test_dt already has shape (1, max_len)
    res = model.predict(test_dt, batch_size=1, verbose=2)[0]
    print(res)
    if np.argmax(res) == 0:
        print('Female')
    elif np.argmax(res) == 1:
        print('Male')
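
The test above raises a `KeyError` if the name contains a character that never appeared in the training data. It also relies on class 0 being female and class 1 being male, which holds when `pd.get_dummies` sorts labels such as `female`/`male` or `F`/`M` alphabetically. Below is a minimal sketch of two helpers that make both points explicit. The names `safe_encode` and `decode_prediction` are hypothetical, and mapping unknown characters to 0 (the padding index) is an assumption of this sketch, not something the notebook does.

```python
def safe_encode(name, vocab_index, max_len):
    """Encode a name as a fixed-length list of character ids.

    Unknown characters map to 0 (assumption: treat them like padding).
    Names longer than max_len keep their last max_len characters, which
    mirrors the default 'pre' truncation of Keras pad_sequences.
    """
    ids = [vocab_index.get(ch, 0) for ch in name.lower()]
    ids = ids[-max_len:]
    return [0] * (max_len - len(ids)) + ids

def decode_prediction(probs, labels=("Female", "Male")):
    """Return the label with the highest score together with that score."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return labels[best], probs[best]

# Toy vocabulary standing in for the vocab_index built from the dataset.
toy_vocab = {"a": 1, "i": 2, "m": 3, "n": 4}
print(safe_encode("Mina", toy_vocab, 6))
print(safe_encode("Mina!", toy_vocab, 6))
print(decode_prediction([0.8, 0.2]))
```

With the trained model, the encoded list can be wrapped as `np.array([safe_encode(name, vocab_index, X.shape[1])])` and passed to `model.predict`, and the first row of the result handed to `decode_prediction`.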
    

For the complete code, you can download the Jupyter notebook from this repo: https://github.com/yudanta/lstm-gender-classification/blob/master/LSTM-Character-Level-Gender-Classification.ipynb

The second part of the series covers exporting the trained model and running it with TensorFlow Serving.
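
Whatever code later sends names to the exported model has to encode them with exactly the same character index and sequence length that were used during training. One possible way to carry them over is to store both in a JSON file next to the model. This is only a sketch: the file name `vocab_index.json`, the toy mapping, and the `max_len` value are assumptions for illustration and are not taken from the notebook.

```python
import json
import os
import tempfile

# Toy values standing in for vocab_index and X.shape[1] from training.
vocab_index = {"a": 1, "i": 2, "m": 3, "n": 4}
max_len = 6

# Hypothetical file location; any path the serving code can read will do.
path = os.path.join(tempfile.gettempdir(), "vocab_index.json")

# Save the mapping and the padded sequence length together.
with open(path, "w", encoding="utf-8") as f:
    json.dump({"vocab_index": vocab_index, "max_len": max_len}, f, ensure_ascii=False)

# Later, in the serving code, load them back before encoding a name.
with open(path, encoding="utf-8") as f:
    restored = json.load(f)

print(restored["vocab_index"] == vocab_index, restored["max_len"])
```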

P.S. The second post is already published at this link: https://yudanta.github.io/posts/series-2-exporting-lstm-gender-classification-and-serving-with-tensorflowserving/
