Seq2Seq - Spelling Correction

jinyeah 2020. 11. 27. 05:52

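In a project I have been working on recently, I needed to correct the spelling of words.

To solve this, I ported the existing TensorFlow version of the Deep-Spelling model to PyTorch.

Since the model is an encoder-decoder seq2seq model, I used a GitHub repository that implements seq2seq in PyTorch as a reference.

However, the seq2seq models in that repository are built for translation or question-answering tasks, so the model input is a sentence, whereas in spelling correction the model input is a single word. I therefore had to modify the preprocessing and the model configuration.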

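Preprocessing

Model input preprocessing

Each character is assigned a specific integer, and every word in the training data is converted into a sequence of these integers. During training and testing, the character-to-integer mapping of the source words and that of the target words must be kept apart and each applied to the right side.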

import random


def ordered_unique_list(input_list):
    """Return the unique items of input_list, keeping first-occurrence order."""
    input_dic = {}
    r_list = []
    for i, v in enumerate(input_list):
        # Record an item only the first time it is seen.
        if input_dic.get(v) is None:
            input_dic[v] = i
            r_list.append(v)
    return r_list


def extract_character_vocab(data):
    """
    :param data: contents of the txt file (one word per line)
    :return: int_to_vocab {0: '<PAD>', 1: '<UNK>', 2: '<GO>', 3: '<EOS>', 4: 'o', 5: 'r', ..}
             vocab_to_int {'<PAD>': 0, '<UNK>': 1, '<GO>': 2, '<EOS>': 3, 'o': 4, 'r': 5, ..}
    """
    special_words = ['<PAD>', '<UNK>', '<GO>', '<EOS>']

    # set_words = sorted(set([character for line in data.split('\n') for character in line]))
    set_words = ordered_unique_list([character for line in data.split('\n') for character in line])
    # Fixed seed, so the shuffled character order is the same on every run.
    random.seed(4)
    random.shuffle(set_words)
    # Special tokens take ids 0-3; characters follow from id 4.
    int_to_vocab = {word_i: word for word_i, word in enumerate(special_words + list(set_words))}
    vocab_to_int = {word: word_i for word_i, word in int_to_vocab.items()}

    return int_to_vocab, vocab_to_int
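To see why the two mappings must not be mixed, here is a minimal sketch. `build_vocab` and `encode_word` are hypothetical helpers written for this illustration (they are not from the repository), the vocabulary is built without shuffling so the ids are easy to follow, and appending `<EOS>` to the target is an assumption about the training setup.

```python
SPECIAL_TOKENS = ['<PAD>', '<UNK>', '<GO>', '<EOS>']


def build_vocab(words):
    """Build character <-> integer mappings in first-occurrence order."""
    chars = []
    for word in words:
        for c in word:
            if c not in chars:
                chars.append(c)
    int_to_vocab = dict(enumerate(SPECIAL_TOKENS + chars))
    vocab_to_int = {c: i for i, c in int_to_vocab.items()}
    return int_to_vocab, vocab_to_int


def encode_word(word, vocab_to_int, max_len, add_eos=False):
    """Convert a word to character ids and right-pad it with <PAD>."""
    ids = [vocab_to_int.get(c, vocab_to_int['<UNK>']) for c in word]
    if add_eos:
        ids.append(vocab_to_int['<EOS>'])
    return ids + [vocab_to_int['<PAD>']] * (max_len - len(ids))


source_words = ['wrod', 'teh']   # misspelled inputs
target_words = ['word', 'the']   # corrected outputs

source_int_to_vocab, source_vocab_to_int = build_vocab(source_words)
target_int_to_vocab, target_vocab_to_int = build_vocab(target_words)

src_ids = encode_word('wrod', source_vocab_to_int, max_len=6)
tgt_ids = encode_word('word', target_vocab_to_int, max_len=6, add_eos=True)

print(src_ids)  # [4, 5, 6, 7, 0, 0]
print(tgt_ids)  # [4, 5, 6, 7, 3, 0]

# The same ids mean different characters on each side:
print(''.join(source_int_to_vocab[i] for i in src_ids[:4]))  # wrod
print(''.join(target_int_to_vocab[i] for i in src_ids[:4]))  # word
```

In this toy case 'r' is id 5 in the source mapping but id 6 in the target mapping, so decoding with the wrong dictionary silently yields a different word.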

Decoder input preprocessing

Remove the trailing <PAD> from the end of each word and prepend <GO> at the front.

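import numpy as np


def process_decoder_input(target_data, vocab_to_int_GO, batch_size):
    '''Remove the last token id from each row and prepend the <GO> id to each row.'''
    # batch_size is unused in this function.

    # Drop the last column: (batch, seq_len) -> (batch, seq_len - 1)
    ending = target_data[:, :-1]
    # Insert a column of <GO> ids at position 0, restoring the length to seq_len.
    dec_input = np.insert(ending, 0, vocab_to_int_GO, axis=1)

    return dec_input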
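A small usage sketch of the function above on a hand-made batch. The ids are assumptions for illustration: `<PAD>`=0, `<GO>`=2, `<EOS>`=3 as in the vocabulary docstring, and the character ids 5 to 8 are arbitrary.

```python
import numpy as np


def process_decoder_input(target_data, vocab_to_int_GO, batch_size):
    """Same logic as above: drop the last column, prepend <GO>."""
    ending = target_data[:, :-1]
    return np.insert(ending, 0, vocab_to_int_GO, axis=1)


GO = 2  # assumed id of <GO>

# Two padded target words; the first one fills the whole row.
target_batch = np.array([
    [5, 6, 7, 3],   # c1 c2 c3 <EOS>
    [8, 3, 0, 0],   # c4 <EOS> <PAD> <PAD>
])

dec_input = process_decoder_input(target_batch, GO, batch_size=2)
print(dec_input)
# [[2 5 6 7]
#  [2 8 3 0]]
```

Note that for the longest word in a batch there is no trailing `<PAD>`, so the id that gets dropped is its `<EOS>`; shorter words lose one `<PAD>`.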

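Setting the embedding dimensions

In both the encoder and the decoder, the preprocessed word is embedded before it enters the first RNN cell. For translation or question-answering tasks, the encoder's input dimension and the decoder's output dimension are the number of words in the source vocabulary and in the target vocabulary, respectively. For spelling correction, however, the encoder's input dimension is the number of source characters and the decoder's output dimension is the number of target characters.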

import torch.nn as nn

# Vocabulary sizes are counted in characters (including special tokens), not words.
input_dim = len(source_letter_to_int)
output_dim = len(target_letter_to_int)

class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, hid_dim, num_layers, dropout=0):
        super().__init__()
        # One embedding vector per source character id.
        self.embedding = nn.Embedding(input_dim, emb_dim)
        ...

class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, hid_dim, num_layers, dropout=0):
        super().__init__()
        # One embedding vector per target character id.
        self.embedding = nn.Embedding(output_dim, emb_dim)
        ...
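As a rough illustration of how these two sizes are counted, the sketch below computes character-level and word-level vocabulary sizes for a toy word list. The word lists and the helpers `char_vocab_size` and `word_vocab_size` are made up for this example; only the four special tokens come from the preprocessing above.

```python
SPECIAL_TOKENS = ['<PAD>', '<UNK>', '<GO>', '<EOS>']

source_words = ['recieve', 'adress', 'wich', 'qick']    # misspelled
target_words = ['receive', 'address', 'which', 'quick']  # corrected


def char_vocab_size(words):
    """Special tokens plus the number of distinct characters."""
    return len(SPECIAL_TOKENS) + len({c for w in words for c in w})


def word_vocab_size(words):
    """Special tokens plus the number of distinct words (sentence-level tasks)."""
    return len(SPECIAL_TOKENS) + len(set(words))


input_dim = char_vocab_size(source_words)    # encoder embedding rows
output_dim = char_vocab_size(target_words)   # decoder embedding rows

print(input_dim, output_dim)          # 16 17
print(word_vocab_size(source_words))  # 8
```

Here the target side contains 'u' while the source side does not, so `input_dim` and `output_dim` differ. With character-level counting both stay bounded by the size of the alphabet no matter how many words are added.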

Implementation on GitHub

github.com/YeJinJeon/Seq2Seq-pytorch