# "Can I Say This In Chinese" with BERT

### Introduction

As it turns out, most people are not very inclined to teach. I'm learning Chinese, my wife is Chinese, seems like a match made in heaven. Except that she has no patience whatsoever with my broken Chinese (though she's wonderful in many other ways). Whenever I ask how to say something in Chinese, she answers with either "I don't know" or "you can't say that" (followed by no explanation). The only way I can get anything out of her is by trying to say something in Chinese and asking whether it sounds right or not. This is less mentally taxing for her than actually having to translate from English, which I understand, especially for two languages so dissimilar.

Now I'm thinking, with the recent advances in Natural Language Processing with Deep Learning, maybe I can create something to replace my unwilling wife. The academic name for this task seems to be "Linguistic Acceptability". Exactly what this includes seems to be up for debate. For example, "the mouse ate the cat" is perfectly grammatical, although highly unlikely. Then there are sentences which are grammatical but seem logically impossible, like "the cat is a bus". This sentence makes no sense unless you've watched the movie Totoro, which features a... cat that is also a bus. Since telling sensical from nonsensical seems like a very difficult problem, I'll be focusing on distinguishing grammatical from ungrammatical sentences instead.

### Defining the problem

Recent Deep Learning architectures like BERT and GPT-2 basically train a language model (LM), i.e. given the surrounding context, they try to predict the missing word. In GPT-2's case, it predicts the next word given all the previous words in the sentence, while BERT predicts a missing word (a cloze) given the words both before and after it (the B in BERT stands for bidirectional). As such, GPT-2 works better as a language model, defining the joint probability over a sequence of words, while BERT's masked LM is less straightforward to use as such. As a reminder, the joint probability can be factored recursively using the chain rule:

$$P(w_{1:n}) = P(w_n | w_{1:n-1})P(w_{1:n-1}) = P(w_n | w_{1:n-1}) \cdot \ldots \cdot P(w_2 | w_1)P(w_1)$$

Each of these factors is exactly what we get out of GPT-2, which means that if we run inference and multiply the factors, we get the joint probability (or really more of an unnormalized likelihood) of the whole sentence. BERT, on the other hand, gives us $P(w_k | w_{1:k-1}, w_{k+1:n})$, which is harder to interpret. There is research exploring ways of getting a joint probability model out of BERT using MRFs (Markov Random Fields), but I'd like to keep things simple for this little project.
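As a small illustration of the chain rule above, here is a minimal sketch that combines per-token conditional probabilities into a sentence score. The probabilities are made-up numbers, not GPT-2 output, and `sentence_log_prob` is a hypothetical helper name. Summing logs instead of multiplying raw probabilities avoids numerical underflow on long sentences.

```python
import math

def sentence_log_prob(conditional_probs):
    """Sum log P(w_i | w_1..w_{i-1}) over the sentence (chain rule in log space)."""
    return sum(math.log(p) for p in conditional_probs)

# Hypothetical conditionals for a three-word sentence:
# P(w1), P(w2 | w1), P(w3 | w1, w2)
probs = [0.2, 0.5, 0.1]
log_p = sentence_log_prob(probs)
joint = math.exp(log_p)  # same value as multiplying 0.2 * 0.5 * 0.1 directly
print(f"log P = {log_p:.4f}, P = {joint:.4f}")
```

A masked LM does not provide these left-to-right factors, which is why the rest of this post works with relative losses instead of a joint probability.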

Using GPT-2 will be difficult, since training it from scratch, with its 1.5 billion weights, requires a cluster of GPUs and roughly $50k. So I'm constrained to pre-trained versions, of which there are none for Chinese AFAIK. The Python library pytorch-transformers does however have a pre-trained BERT for Chinese.

#### How can we use BERT?

Being constrained by time and money leaves me no option but to use BERT at this point. While BERT can't be used as a language model per se, we can perhaps use its output in some useful way. We'd like to get a binary decision on whether a sentence is acceptable or not. We could try to use the masked probability for each word in the sentence, but again, it will be difficult to find some absolute threshold to distinguish unlikely sentences from unacceptable ones.

What we could do is train a classifier based on BERT with a dataset of positive and negative examples. While there are such datasets for other languages (CoLA, the Corpus of Linguistic Acceptability), I have not found such a dataset for Chinese. I was however able to crawl some examples from the AllSet grammar wiki (licensed under CC-NC), with 436 negative and 461 positive examples in total, split into grammar groups based on page (note: this will take some time to run):

    ! wget --quiet --mirror --convert-links --adjust-extension --follow-tags=a --no-parent resources.allsetlearning.com/chinese/grammar/
    ! grep -r -e 'class="x"' resources.allsetlearning.com/chinese/**/* |\
      sed -e 's/<li class="x">//g' -e 's/<span .*//g' -e 's/<\/*[a-z]*>//g' -e 's/ //g' -e 's/:.*→/:/g' \
      > "$cache_path/allset_negative_examples.txt"
    ! grep -r -e 'class="o"' resources.allsetlearning.com/chinese/**/* |\
      sed -e 's/<li class="o">//g' -e 's/<span .*//g' -e 's/<\/*[a-z]*>//g' -e 's/ //g' -e 's/:.*→/:/g' \
      > "$cache_path/allset_positive_examples.txt"

While it's putting the cart before the horse a bit, I suspected (correctly) that this small dataset would not be enough to train a classifier that generalizes well to arbitrary sentences. There are just too few examples to cover all the ways sentences can be right or wrong, although these examples do contain many important and subtle errors learners commit.

### Self-supervised learning

Instead of only training on the small dataset, the idea is to pre-train a classifier in a self-supervised way by generating negative examples from positive ones. While the masked probabilities of all the words in a sentence are not enough to tell the acceptability of the sentence, we can assume there is useful information in the relative scores, or losses, between sentences. Using relative losses, we can generate negative samples from positive ones by finding a mutation that significantly increases the loss. Let's define the loss for a sentence as the average (since we may be comparing sentences of differing lengths) cross-entropy loss over its words:

$$L(S) = -\frac{1}{N}\sum_{i=1}^{N}{\log(P(w_i | w_{1:i-1}, w_{i+1:N}))}$$

Then we can take a correct sentence $S_c$ and perform a random mutation to get $S_m$. If $L(S_m) - L(S_c) > \epsilon$ we consider it to be unacceptable. Note that even if we could use the bidirectional probabilities/losses to do classification directly, this is something we'd like to avoid, since calculating this loss requires a forward pass for every token in the sentence. Using these expensively generated examples to train a classifier lets us bypass this problem.

#### Hard Negatives

This way we can generate unacceptable sentences from any acceptable one. Since there are many possible ways to mutate a sentence that increase the loss by more than $\epsilon$, we can pick the minimal one that passes this threshold.
This is similar to hard negative mining, where if you already have a model, you can improve it by sampling hard negatives and retraining the model. This is common in image classification and localization, where any part of an image not containing the specified object is a potential negative example. It then makes sense to pick the ones that are misclassified or get high losses from the initial model.

#### Mutations

For the actual sentences, we could use the original corpus, but I prefer using sentences from Tatoeba since it is a good source of informal language suitable for learners. For mutating the sentences, there are a few things we can do:

- Permute the words
- Swap two words
- Insert word (sampled based on corpus frequency)
- Replace word (sampled based on corpus frequency)
- Delete word

While we want to mutate the sentences to get unacceptable ones, there are degrees of unacceptability, and we want to generate ones that are hard, i.e. just barely unacceptable. Therefore I exclude random permutations, since they are very unlikely to produce something close to acceptability. Similarly, for insertions and word replacements it makes more sense to sample common words more frequently than rare words, since the language has a very long tail of very infrequent words.

Below is the code for loading the Tatoeba dataset and generating hard negatives. (NOTE: this is a lot of not very interesting code, but it is runnable if you run this in a Jupyter Notebook or Google Colab environment). Also worth mentioning is that the starting point for the PyTorch training was this Colab Notebook, which serves as a good tutorial for fine-tuning BERT for sequence classification.
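Before the full listing, the selection rule from the Hard Negatives section (keep the mutation whose loss increase is smallest while still exceeding $\epsilon$) can be sketched in isolation. This is only an illustration with made-up loss values; `pick_hard_negative` is a hypothetical name, and in the real pipeline the losses would come from the masked LM.

```python
def pick_hard_negative(original_loss, mutations, epsilon=0.5):
    """Return the mutation with the smallest loss increase that still exceeds epsilon.

    mutations: list of (sentence, loss) pairs. Returns None if nothing qualifies.
    """
    candidates = [(loss - original_loss, sentence)
                  for sentence, loss in mutations
                  if loss - original_loss > epsilon]
    if not candidates:
        return None
    # min() compares the loss increase first, so the "hardest" negative wins
    return min(candidates)[1]

# Hypothetical losses for one correct sentence and three of its mutations
original_loss = 1.2
mutations = [
    ("mutation A", 1.4),  # increase of about 0.2: below epsilon, may still be acceptable
    ("mutation B", 2.0),  # increase of about 0.8: smallest increase above epsilon
    ("mutation C", 5.5),  # increase of about 4.3: clearly wrong, an "easy" negative
]
chosen = pick_hard_negative(original_loss, mutations)
print(chosen)
```

A larger `epsilon` gives fewer false negatives among the generated examples at the cost of easier, less informative ones.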
First, installing some pip packages:

    !pip install --quiet pytorch-transformers pytorch-nlp hanziconv jieba sympy

Import a pre-trained Masked LM BERT model and define functions for preparing data for this model, as well as functions for predicting based on it, and calculating losses for whole sentences:

    import io
    import os
    import re
    import torch
    import jieba
    import random
    import tensorflow as tf
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    from collections import defaultdict
    from tqdm import tqdm, trange
    from hanziconv import HanziConv
    from sympy.ntheory import factorint
    from functools import lru_cache
    from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
    from keras.preprocessing.sequence import pad_sequences
    from sklearn.model_selection import train_test_split, GroupKFold
    from pytorch_transformers import BertTokenizer, BertConfig, BertModel
    from pytorch_transformers import AdamW, BertForSequenceClassification, BertForMaskedLM
    from pynvml import nvmlInit, nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo
    from sklearn.metrics import matthews_corrcoef, precision_score, recall_score, accuracy_score
    %matplotlib inline

    device_name = tf.test.gpu_device_name()
    if device_name != '/device:GPU:0':
        raise SystemError('GPU device not found')
    print('Found GPU at: {}'.format(device_name))

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    n_gpu = torch.cuda.device_count()
    torch.cuda.get_device_name(0)

    def gpu_usage(print_stats=False):
        """
        Convenience function to check GPU memory usage.
        Returns free memory in GB
        """
        nvmlInit()
        handle = nvmlDeviceGetHandleByIndex(0)
        info = nvmlDeviceGetMemoryInfo(handle)
        if print_stats:
            print(f"Total memory: {info.total/1e9:.2f} GB")
            print(f"Free memory: {info.free/1e9:.2f} GB")
            print(f"Used memory: {info.used/1e9:.2f} GB")
        return info.free/1e9

    # Make sure we have enough memory
    if gpu_usage(print_stats=True) < 8:
        raise SystemError('Not enough memory')

    # Load pre-trained model tokenizer (vocabulary)
    tokenizer = BertTokenizer.from_pretrained('bert-base-chinese', do_lower_case=True)

    # Load pre-trained model (weights)
    masked_lm_model = BertForMaskedLM.from_pretrained('bert-base-chinese')
    masked_lm_model.cuda()

    def prepare_data(df, test_size=0.1, batch_size=32, shuffle=True, add_cls_sep=True):
        sentences = df.sentence.values

        # We need to add special tokens at the beginning and end of each sentence for BERT to work properly
        if add_cls_sep:
            sentences = ["[CLS] " + sentence + " [SEP]" for sentence in sentences]

        has_labels = 'label' in df.columns
        if has_labels:
            labels = df.label.values
        else:
            labels = np.zeros(len(sentences))

        tokenized_texts = [tokenizer.tokenize(sent) for sent in sentences]

        # Set the maximum sequence length. The longest sequence in our training set is 47, but we'll leave room on the end anyway.
        # In the original paper, the authors used a length of 512.
        MAX_LEN = 128

        # Use the BERT tokenizer to convert the tokens to their index numbers in the BERT vocabulary
        input_ids = [tokenizer.convert_tokens_to_ids(x) for x in tokenized_texts]

        # Pad our input tokens
        input_ids = pad_sequences(input_ids, maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")

        # Create attention masks
        attention_masks = []

        # Create a mask of 1s for each token followed by 0s for padding
        for seq in input_ids:
            seq_mask = [float(i > 0) for i in seq]
            attention_masks.append(seq_mask)

        # Use train_test_split to split our data into train and validation sets for training
        # but if test_size is zero then only generate training sets
        if test_size > 0.0:
            train_inputs, validation_inputs, train_labels, validation_labels = train_test_split(
                input_ids, labels, random_state=2018, test_size=test_size, shuffle=shuffle)
            train_masks, validation_masks, _, _ = train_test_split(
                attention_masks, input_ids, random_state=2018, test_size=test_size, shuffle=shuffle)
        else:
            train_inputs = input_ids
            train_labels = labels
            train_masks = attention_masks
            validation_inputs = []
            validation_labels = []
            validation_masks = []

        # Convert all of our data into torch tensors, the required datatype for our model
        train_inputs = torch.tensor(train_inputs)
        validation_inputs = torch.tensor(validation_inputs)
        train_labels = torch.tensor(train_labels)
        validation_labels = torch.tensor(validation_labels)
        train_masks = torch.tensor(train_masks)
        validation_masks = torch.tensor(validation_masks)

        # Create an iterator of our data with torch DataLoader. This helps save on memory during training because, unlike a for loop,
        # with an iterator the entire dataset does not need to be loaded into memory
        train_data = TensorDataset(train_inputs, train_masks, *([train_labels] if has_labels else []))
        if shuffle:
            train_sampler = RandomSampler(train_data)
        else:
            train_sampler = SequentialSampler(train_data)
        train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

        validation_data = TensorDataset(validation_inputs, validation_masks, *([validation_labels] if has_labels else []))
        validation_sampler = SequentialSampler(validation_data)
        validation_dataloader = DataLoader(validation_data, sampler=validation_sampler, batch_size=batch_size)
        return train_dataloader, validation_dataloader

    def predict(dataloader, model, has_labels=True):
        """
        Evaluates data from a data loader on a model and returns either a tuple of predicted
        probability and true label if has_labels=True otherwise it returns the raw logits
        """
        # Put model in evaluation mode
        model.eval()

        # Predict
        for i, batch in enumerate(dataloader):
            # Add batch to GPU
            batch = tuple(t.to(device) for t in batch)
            # Unpack the inputs from our dataloader
            if has_labels:
                b_input_ids, b_input_mask, b_labels = batch
            else:
                b_input_ids, b_input_mask = batch
            # Telling the model not to compute or store gradients, saving memory and speeding up prediction
            with torch.no_grad():
                # Forward pass, calculate logit predictions
                logits, *_ = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)

            # Move logits and labels to CPU
            logits = logits.detach().cpu().numpy()
            if has_labels:
                softmax_probs = np.exp(logits[:, 1]) / np.exp(logits).sum(axis=1)
                label_ids = b_labels.to('cpu').numpy()
                for prob, label in zip(softmax_probs, label_ids):
                    yield prob, label
            else:
                yield logits

    def eval_loss_sentences(sentences, masking='char'):
        """
        Evaluate the loss for a list of sentences
        sentences: the list of sentences
        masking: 'word' for whole word, and 'char' for single character masking
        """
        assert masking in ['word', 'char']
        masking_words = masking == 'word'
        indexed_sentence_tokens = []
        tokenized_sentences = []
        sentence_mask_indices = []
        all_examples = []
        for sentence in sentences:
            # NOTE: the tokenizer removes spaces
            tokenized_sentence = tokenizer.tokenize(sentence)
            tokenized_sentence = tokenized_sentence[:128]
            indexed_sentence_tokens.append(tokenizer.convert_tokens_to_ids(tokenized_sentence))
            if masking_words:
                tokenized_sentence = list(t[0] for t in jieba.tokenize(''.join(tokenized_sentence)))
            tokenized_sentences.append(tokenized_sentence)
            mask_indices = []
            char_idx = 0
            for i in range(len(tokenized_sentence)):
                mask_token = tokenized_sentence[i]
                mask_token_parts = len(tokenizer.tokenize(mask_token)) if masking_words else 1
                all_examples.append(''.join(tokenized_sentence[:i]) +
                                    ''.join(mask_token_parts*['[MASK]']) +
                                    ''.join(tokenized_sentence[i+1:]))
                mask_indices.append((char_idx, char_idx+mask_token_parts))
                char_idx += mask_token_parts
            mask_indices.append('[SEP]')
            sentence_mask_indices.append(mask_indices)

        df = pd.DataFrame(data={'sentence': all_examples})
        dataloader, _ = prepare_data(df, test_size=0.0, batch_size=32, shuffle=False)

        sentence_losses = []
        curr_sentence_loss = 0
        curr_sentence = 0
        curr_mask_idx = 0
        curr_example = 0
        for batch_logits in predict(dataloader, masked_lm_model, has_labels=False):
            for i in range(batch_logits.shape[0]):
                mask_start, mask_end = sentence_mask_indices[curr_sentence][curr_mask_idx]
                for m in range(mask_start, mask_end):
                    # +1 skips the [CLS] token at position 0
                    mask_logits = batch_logits[i][m+1]
                    mask_logits_exp = np.exp(mask_logits)
                    mask_token_probs = mask_logits_exp / mask_logits_exp.sum()
                    mask_entropy = -(mask_token_probs * np.log(mask_token_probs)).sum()
                    masked_token_index = indexed_sentence_tokens[curr_sentence][m]
                    # Cross-Entropy Loss
                    curr_sentence_loss += -np.log(mask_token_probs[masked_token_index])
                curr_mask_idx += 1
                curr_example += 1

                if curr_mask_idx == len(tokenized_sentences[curr_sentence]):
                    # We've reached a new sentence, reset and append log prob
                    # Normalize sentence loss by number of tokens
                    curr_sentence_loss /= len(tokenized_sentences[curr_sentence])
                    sentence_losses.append(curr_sentence_loss)
                    curr_sentence_loss = 0
                    curr_mask_idx = 0
                    curr_sentence += 1
        return sentence_losses

Download example sentences from Tatoeba and word frequency dataset:

    ! wget http://downloads.tatoeba.org/exports/sentences.tar.bz2
    ! bzip2 -dc sentences.tar.bz2 > "$cache_path/sentences.txt"


Below is the code for reading the Tatoeba and Weibo frequency datasets and generating hard negatives:

    orig_sentences = []
    with open(cache_path+'/sentences.txt', 'r') as f:
        for line in f:
            splits = line.split('\t')
            if len(splits) < 3:
                continue
            _, lang, zh = line.split('\t')
            if lang != 'cmn': continue
            zh = HanziConv.toSimplified(zh.strip())
            orig_sentences.append(zh)

    words = []
    counts = []
    with open(cache_path+'/weibo.txt', 'r', encoding='utf-8-sig') as f:
        for line in f:
            word, count = line.split('\t')
            tokenized_word = tokenizer.tokenize(word)
            if len(tokenized_word) == 0:
                continue

            # Skip [UNK] or other garbage unknown to the BERT tokenizer
            skip = False
            for t in tokenized_word:
                if len(t) > 1:
                    skip = True
                    break
            if skip: continue
            words.append(word)
            counts.append(int(count))

    # Calculate the probability and cumulative probability function for words over
    # the frequency
    counts = np.array(counts)
    word_probs = counts / counts.sum()
    cdf = np.cumsum(word_probs)

    def sample_word():
        """ Sample a random word based on frequency """
        r = random.random()
        idx = np.searchsorted(cdf, r)
        return words[idx]

    @lru_cache(maxsize=128)
    def middle_coprime(n):
        """ Find the middle coprime of a number, e.g. of all the
        sorted coprimes of n, pick the middle one """
        factors = list(factorint(n).keys())
        coprimes = [1]
        for i in range(n-2, 1, -1):
            coprime = True
            for f in factors:
                if i % f == 0:
                    coprime = False
                    break
            if coprime:
                coprimes.append(i)
        return coprimes[len(coprimes) // 2]

    def pseudo_random_range(from_idx, to_idx=None):
        """
        Visit all indices in a range pseudo-randomly by visiting (ax + b) mod n,
        where a and n are co-prime. Small and large coprimes tend to not look random,
        so pick the middle one.
        """
        if to_idx is None:
            from_idx, to_idx = 0, from_idx

        n = to_idx - from_idx
        coprime = middle_coprime(n)
        offset = random.randint(0, n-1) if n > 1 else 0
        for i in range(0, n):
            yield from_idx + (coprime*i + offset) % n

    IGNORE = set(['。', '」', '「', '，', ' ', '！', '？', '?', '!', '.', ','])
    # Swaps that usually produce acceptable sentences:
    POSITIVE_SWAP_GROUPS = [set(['我', '你', '他', '她']), # personal pronouns
                            set(['我们', '你们', '他们', '她们'])] # plural personal pronouns
    def is_positive_swap(from_token, to_token):
        swap_set = set([from_token, to_token])
        # Check if both tokens are in a positive swap group, if so we don't swap
        for swap_group in POSITIVE_SWAP_GROUPS:
            if len(swap_set & swap_group) == 2:
                return True
        return False

    def generate_delete(sentence, tokens):
        for idx in pseudo_random_range(len(tokens)):
            token = tokens[idx][0]
            if token in IGNORE:
                continue
            tokens_deleted = tokens[:idx] + tokens[idx+1:]
            yield ''.join(t[0] for t in tokens_deleted)

    def generate_insert(sentence, tokens):
        for idx in pseudo_random_range(len(tokens)):
            word = sample_word()
            tokens_inserted = tokens[:idx] + [(word,)] + tokens[idx:]
            yield ''.join(t[0] for t in tokens_inserted)

    def generate_replace(sentence, tokens):
        for idx in pseudo_random_range(len(tokens)):
            token = tokens[idx][0]
            if token in IGNORE:
                continue
            # Sample words until it's not equal to the token we're replacing
            word = token
            while word == token:
                word = sample_word()
            tokens_replaced = tokens[:idx] + [(word,)] + tokens[idx+1:]
            yield ''.join(t[0] for t in tokens_replaced)

    def generate_swap(sentence, tokens):
        token_set = set([t[0] for t in tokens])
        for from_idx in pseudo_random_range(len(tokens)-1):
            from_token = tokens[from_idx][0]
            if from_token in IGNORE:
                continue

            for to_idx in pseudo_random_range(from_idx, len(tokens)):
                to_token = tokens[to_idx][0]
                if (from_token == to_token or
                        to_token in IGNORE):
                    continue

                if is_positive_swap(from_token, to_token):
                    continue

                # Swap the tokens and return the new string
                mtokens = list(tokens)
                mtokens[to_idx], mtokens[from_idx] = mtokens[from_idx], mtokens[to_idx]
                yield ''.join(t[0] for t in mtokens)

    def generate_mutated(sentence):
        tokens = list(jieba.tokenize(sentence))
        generators = [#generate_delete(sentence, tokens),
                      generate_insert(sentence, tokens),
                      generate_replace(sentence, tokens),
                      generate_swap(sentence, tokens)]
        pick_probs = np.array([0.15, 0.15, 0.7])
        while len(generators) > 0:
            gen_idx = np.random.choice(np.arange(len(generators)), p=pick_probs)
            random_gen = generators[gen_idx]
            try:
                yield next(random_gen)
            except StopIteration:
                # The generator is out of sentences to generate, so remove it
                del generators[gen_idx]
                pick_probs = np.delete(pick_probs, gen_idx)
                # Need to normalize so probabilities add up to 1
                pick_probs /= pick_probs.sum()

    def generate_hard_negatives(sentences, model, loss_threshold=0.5, generate_max=10,
                                debug_print=False):
        """
        Creates hard negative examples, which are sampled based on mutations that
        increase the loss the least but still significantly enough to very likely be a
        true negative.
        """
        sentence_examples = list(sentences)
        for i, sentence in enumerate(sentence_examples):
            # Skip sentences with unknown words or other garbage
            predict_sentences = [sentence]
            generator = generate_mutated(sentence)
            for _ in range(generate_max):
                try:
                    predict_sentences.append(next(generator))
                except StopIteration:
                    break

            # Assumed step: score the original (index 0) and each mutation,
            # inferred from how `losses` is used below
            losses = eval_loss_sentences(predict_sentences)

            print('C: ', sentence)
            # Walk the mutations from lowest to highest loss and keep the first
            # one that exceeds the threshold
            for s, l in sorted(zip(predict_sentences[1:], losses[1:]), key=lambda x: x[1]):
                if l - losses[0] > loss_threshold:
                    if debug_print:
                        print('W: ', s, l, ' +', l-losses[0])
                    yield s
                    break

    negatives_path = cache_path + '/negatives.txt'
    if os.path.exists(negatives_path):
        with open(negatives_path, 'r') as f:
            hard_negatives = [l.strip() for l in f.readlines()]
    else:
        # NOTE: generating hard negatives takes a long time since to check a single mutation
        # we need to run inference len(sentence) times, and we need to generate a number
        # of mutations for each sentence in order to find a good one
        # So we run a few thousand at a time and store them in case runtime gets recycled
        use_num = 30000
        num_at_a_time = 3000
        use_sentences = orig_sentences[:use_num]
        hard_negatives = []
        for i in range(0, use_num // num_at_a_time):
            if os.path.exists(f'{cache_path}/negatives{i+1}.txt'):
                continue
            with open(f'{cache_path}/negatives{i+1}.txt', 'w') as f:
                sentences = use_sentences[i*num_at_a_time:(i+1)*num_at_a_time]
                for negative in generate_hard_negatives(sentences, masked_lm_model, debug_print=True):
                    hard_negatives.append(negative)
                    f.write(negative + '\n')

        # Concatenate all files to one
        with open(negatives_path, 'w') as n:
            for i in range(0, use_num // num_at_a_time):
                with open(f'{cache_path}/negatives{i+1}.txt', 'r') as f:
                    # Assumed: append each partial file to the combined file
                    n.write(f.read())


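The `pseudo_random_range` helper in the listing above relies on a number-theory fact: stepping through `(a*i + b) mod n` visits every index exactly once as long as `a` and `n` share no common factor. The sketch below demonstrates this with small hand-picked numbers; `stepped_order` is a hypothetical stand-in, not the function from the listing.

```python
import math

def stepped_order(n, step, offset=0):
    """Return the visiting order (step*i + offset) mod n for i in 0..n-1."""
    return [(step * i + offset) % n for i in range(n)]

# step=3 is coprime with n=10, so every index shows up exactly once
coprime_order = stepped_order(10, step=3, offset=4)
# step=4 shares the factor 2 with n=10, so only half the indices are reached
shared_factor_order = stepped_order(10, step=4, offset=4)

print(math.gcd(3, 10), coprime_order)
print(math.gcd(4, 10), sorted(set(shared_factor_order)))
```

This gives a shuffled-looking traversal without materializing and shuffling a list of indices, which is convenient inside generators that may be abandoned early.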
### Fine-tuning BERT

There are plenty of tutorials on how to fine-tune a BERT model. For this experiment I'll use the pre-trained Chinese model in the Python library pytorch-transformers by huggingface. This model is trained with a character-by-character tokenizer, meaning multi-character Chinese words are split into separate word embeddings for each character. This may be suboptimal, unless the model is powerful enough to capture the structure of words, but for now this is what we have to work with.

Below is the code for training and validating the BERT model for classification:

    def train(dataloader, epochs=4, model=None, debug_print=False):
        if model is None:
            model = BertForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=2)
        model.cuda()

        param_optimizer = list(model.named_parameters())
        no_decay = ['bias', 'gamma', 'beta']
        optimizer_grouped_parameters = [
            {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
             'weight_decay_rate': 0.01},
            {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
             'weight_decay_rate': 0.0}
        ]

        # Set our model to training mode (as opposed to evaluation mode)
        model.train()
        train_loss_set = []

        # trange is a tqdm wrapper around the normal python range which prints progress
        r = trange(epochs, desc="Epoch") if debug_print else range(epochs)
        for _ in r:
            # Tracking variables
            train_loss = 0
            num_examples, num_steps = 0, 0

            # Train the data for one epoch
            for step, batch in enumerate(dataloader):
                if debug_print: print(f'Batch: {step}')
                batch = tuple(t.to(device) for t in batch)
                # Unpack the inputs from our dataloader
                b_input_ids, b_input_mask, b_labels = batch
                # Clear out the gradients (by default they accumulate)
                optimizer.zero_grad()
                # Forward pass
                loss, *_ = model(b_input_ids, token_type_ids=None,
                                 attention_mask=b_input_mask, labels=b_labels)
                train_loss_set.append(loss.item())
                # Backward pass
                loss.backward()
                # Update parameters and take a step using the computed gradient
                optimizer.step()

                # Update tracking variables
                train_loss += loss.item()
                num_examples += b_input_ids.size(0)
                num_steps += 1

            if debug_print: print("Train loss: {}".format(train_loss/num_steps))

        return model

y_true = []
y_pred = []
for prob, label in predict(dataloader, model):
y_true.append(label)
y_pred.append(1 if prob > 0.5 else 0)
return y_true, y_pred

    def print_stats(y_true, y_pred, sentences=None, label=None):
        tab = ''
        if label is not None:
            print(f'{label}:')
            tab = '\t'
        print(f'{tab}Matthews Correlaton Coefficient:', matthews_corrcoef(y_true, y_pred))
        print(f'{tab}Accuracy:', accuracy_score(y_true, y_pred))
        print(f'{tab}Precision:', precision_score(y_true, y_pred))
        print(f'{tab}Recall:', recall_score(y_true, y_pred))


Now we can train our first classification model on positive examples from the Tatoeba dataset and our generated hard negatives. Here I'll train the classifier with an increasing number of examples to see if we need more data. Training and iterating is slow, so I prefer to keep the dataset as small as possible for now.

model_path = cache_path + '/self_supervised_classification_model.pt'
if os.path.exists(model_path):
else:
training_accuracies = []
validation_accuracies = []
classification_model = None
for num in [3000, 6000, 9000, len(hard_negatives)]:
hard_negatives_df = pd.DataFrame(data={
'sentence': hard_negatives[:num] + orig_sentences[num:2*num],
'orig': orig_sentences[:2*num],
'label': num*[0]+num*[1]})

hard_negatives_df, test_size=0.1, batch_size=32)

# Save to disk, for rerunning and making copies
torch.save(classification_model, model_path)

df = pd.DataFrame(data={
'sentence': hard_negatives + orig_sentences[len(hard_negatives):2*len(hard_negatives)],
'label': len(hard_negatives)*[0] + len(hard_negatives)*[1]})
dataloader, _ = prepare_data(df, test_size=0.0, batch_size=32)


Now let's load the AllSet grammatical wiki examples and train models with cross-validation either from scratch or using the pre-trained model.

One important difference from the previous dataset is that we want to know how well the model generalizes to new, unseen grammatical rules rather than just unseen examples. Therefore we split the data into training and validation sets based on the grammatical rule/group, such that examples from the same group are never split between the train and test sets.
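To make the group-based split concrete, here is a minimal pure-Python sketch. It is a simplified stand-in for scikit-learn's `GroupKFold` (which additionally balances fold sizes); `group_folds` and the group labels are hypothetical. The point it shows is that every example of a given grammar page ends up on the same side of the split.

```python
from collections import defaultdict

def group_folds(groups, n_splits):
    """Assign whole groups to folds round-robin; return (train_idx, test_idx) pairs."""
    members = defaultdict(list)
    for idx, g in enumerate(groups):
        members[g].append(idx)
    # Each group belongs to exactly one test fold
    fold_of_group = {g: i % n_splits for i, g in enumerate(sorted(members))}
    folds = []
    for fold in range(n_splits):
        test = [i for g, idxs in members.items() if fold_of_group[g] == fold for i in idxs]
        train = [i for g, idxs in members.items() if fold_of_group[g] != fold for i in idxs]
        folds.append((train, test))
    return folds

# Hypothetical grammar-page labels for six example sentences
groups = ["ba", "ba", "le", "le", "de", "de"]
folds = group_folds(groups, n_splits=3)
for train_idx, test_idx in folds:
    train_groups = {groups[i] for i in train_idx}
    test_groups = {groups[i] for i in test_idx}
    print(sorted(test_groups), "held out; shared groups:", train_groups & test_groups)
```

With a plain random split, sentences illustrating the same grammar point would leak into both sets and the test score would mostly measure memorization of that rule.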

    allset_negative_examples = defaultdict(list)
    with open(cache_path+'/allset_negative_examples.txt', 'r') as f:
        for l in f:
            filename, sentence = l.split(':')
            allset_negative_examples[filename].append(sentence.strip())
    allset_positive_examples = defaultdict(list)
    with open(cache_path+'/allset_positive_examples.txt', 'r') as f:
        for l in f:
            filename, sentence = l.split(':')
            allset_positive_examples[filename].append(sentence.strip())

    all_files = list(set(allset_negative_examples.keys()) |
                     set(allset_positive_examples.keys()))
    allset_sentences = []
    allset_labels = []
    allset_groups = []
    # Each wiki page (file) is one grammar group, identified by its index g
    for g, filename in enumerate(all_files):
        negative = allset_negative_examples[filename]
        positive = allset_positive_examples[filename]
        allset_sentences += negative + positive
        allset_labels += [0]*len(negative) + [1]*len(positive)
        allset_groups += (len(negative)+len(positive))*[g]

    allset_sentences = np.array(allset_sentences)
    allset_labels = np.array(allset_labels)
    allset_groups = np.array(allset_groups)

tatoeba_sample = np.random.choice(orig_sentences, 10000)
hard_negative_sample = np.random.choice(hard_negatives, 10000)
self_supervised_df = pd.DataFrame(data={
'sentence': list(hard_negative_sample) + list(tatoeba_sample),
'label': len(hard_negative_sample)*[0] + len(tatoeba_sample)*[1]})
batch_size=32, shuffle=False)

def cross_validate_allset(initial_model_path=None, epochs=4, n_splits=10,
print_progress=True):
train_results = [[], []]
test_results = [[], []]
self_supervised_results = [[], []]
new_model = None
if n_splits == 1:
generator = [(np.arange(len(allset_sentences)),
np.arange(len(allset_sentences)))]
else:
group_kfold = GroupKFold(n_splits=n_splits)
generator = group_kfold.split(allset_sentences, allset_labels, allset_groups)

for i, (train_index, test_index) in enumerate(generator):
train_examples = allset_sentences[train_index]
train_labels = allset_labels[train_index]
test_examples = allset_sentences[test_index]
test_labels = allset_labels[test_index]

pd.DataFrame(data={'sentence': train_examples, 'label': train_labels}),
test_size=0.0, batch_size=32)
pd.DataFrame(data={'sentence': test_examples, 'label': test_labels}),
test_size=0.0, batch_size=32)

model = None
if initial_model_path is not None:

debug_print=print_progress)

if print_progress:
print_stats(*train_result, label='AllSet Train')
print_stats(*test_result, label='AllSet Test')
print_stats(*self_supervised_result, label='Self-Supervised')

train_results[0] += train_result[0]
train_results[1] += train_result[1]
test_results[0] += test_result[0]
test_results[1] += test_result[1]
self_supervised_results[0] += self_supervised_result[0]
self_supervised_results[1] += self_supervised_result[1]

print_stats(*train_result, label='Overall AllSet Train')
print_stats(*test_result, label='Overall AllSet Test')
print_stats(*self_supervised_result, label='Overall Self-Supervised')

# Return the last model
return new_model


First, let's train a model from scratch on the AllSet data and see how well it does against itself as well as against our self-supervised Tatoeba + hard negative dataset:

cross_validate_allset(initial_model_path=None, epochs=6, n_splits=10, print_progress=False);

Overall AllSet Train:
Matthews Correlaton Coefficient: 0.978298651254621
Accuracy: 0.9891304347826086
Precision: 0.9838337182448037
Recall: 0.9953271028037384
Overall AllSet Test:
Matthews Correlaton Coefficient: 0.9366607354497857
Accuracy: 0.967391304347826
Precision: 0.94
Recall: 1.0
Overall Self-Supervised:
Matthews Correlaton Coefficient: 0.46815654446892113
Accuracy: 0.7165
Precision: 0.6568613244457325
Recall: 0.9066


As you can see, it seems to generalize well on the AllSet data across the folds, meaning it somehow generalizes to unseen grammatical rules. But the performance on the self-supervised dataset is poor. This is probably due to the AllSet data being biased towards easier, illustrative examples, which are substantially different from the average sentence from Tatoeba. It also doesn't cover all the more "obvious" ways sentences can be ungrammatical.
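To make the self-supervised numbers above easier to read, here is a small sketch that recomputes them from confusion-matrix counts. The counts are not printed anywhere in this post; they are inferred from the reported precision, recall and accuracy, under the assumption that the evaluation sample holds 10,000 Tatoeba sentences and 10,000 hard negatives. `mcc` is a hypothetical hand-written helper, not the scikit-learn function.

```python
import math

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Counts inferred from the reported metrics (assumption: 10,000 positives
# and 10,000 negatives in the self-supervised evaluation sample)
tp, fn = 9066, 934    # acceptable sentences: accepted / rejected
fp, tn = 4736, 5264   # hard negatives: accepted / rejected

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
score = mcc(tp, fp, fn, tn)
print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} mcc={score:.4f}")
```

Under that assumption, almost half of the hard negatives are accepted as correct, which is why the Matthews correlation coefficient is so much lower than the recall alone would suggest.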

Now let's do the same thing, but with a model pre-trained on the self-supervised dataset, with the hope that we can generalize on both datasets:

cross_validate_allset(initial_model_path=model_path, epochs=6, n_splits=10, print_progress=False);

Overall AllSet Train:
Matthews Correlaton Coefficient: 0.9927488225424451
Accuracy: 0.9963768115942029
Precision: 0.9976580796252927
Recall: 0.9953271028037384
Overall AllSet Test:
Matthews Correlaton Coefficient: 0.9784719757905218
Accuracy: 0.9891304347826086
Precision: 0.9791666666666666
Recall: 1.0
Overall Self-Supervised:
Matthews Correlaton Coefficient: 0.8895640148971811
Accuracy: 0.94365
Precision: 0.9777107785075912
Recall: 0.908


The overall results show that the model has generalized relatively well to both datasets, although the scores are lower for the self-supervised data set compared to before.

For training the final model, we can get an even better result on the self-supervised data by training from scratch on both datasets, with the AllSet data upsampled to match the self-supervised set in size so that both carry equal weight. Here I'll train it once with a single test set instead of k-fold cross-validation, so I don't time out in Google Colab.

final_model_path = cache_path+'/final_model.pt'
if os.path.exists(final_model_path):
    # Reuse the cached model written by torch.save at the end of the else branch
    final_model = torch.load(final_model_path)
else:
    # Again, need to split AllSet into train/test using GroupKFold
    # GroupKFold.split returns all cross-validation sets, but we'll just use the first
    allset_train_idx, allset_test_idx = next(GroupKFold(n_splits=10).split(allset_sentences, allset_labels, allset_groups))
    allset_train = allset_sentences[allset_train_idx]
    allset_train_labels = allset_labels[allset_train_idx]
    allset_test = allset_sentences[allset_test_idx]
    allset_test_labels = allset_labels[allset_test_idx]

    # Next split the self-supervised data set into train/test as well
    ss_train, ss_test, ss_train_labels, ss_test_labels = train_test_split(
        orig_sentences[len(hard_negatives):2*len(hard_negatives)] + hard_negatives,
        [1]*len(hard_negatives) + [0]*len(hard_negatives), test_size=0.1)

    # Then combine both data sets, but with upsampling for AllSet so that they are
    # of equal size
    upsample_times = 2*len(hard_negatives) // len(allset_sentences)
    all_train = (list(ss_train) + upsample_times*list(allset_train))
    all_train_labels = (ss_train_labels + upsample_times*list(allset_train_labels))

        pd.DataFrame(data={'sentence': all_train, 'label': all_train_labels}),
        test_size=0.0, batch_size=32)
        pd.DataFrame(data={'sentence': allset_test, 'label': allset_test_labels}),
        test_size=0.0, batch_size=32)
        pd.DataFrame(data={'sentence': ss_test, 'label': ss_test_labels}),
        test_size=0.0, batch_size=32)

        debug_print=True)

    print_stats(*train_result, label='Train')
    print_stats(*allset_test_result, label='AllSet Test')
    print_stats(*ss_test_result, label='Self-Supervised Test')
    torch.save(final_model, final_model_path)
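
The upsampling step can be illustrated on toy data. This is a minimal sketch, not the notebook's code: the `large` and `small` lists are made-up stand-ins for the self-supervised and AllSet training sets.

```python
# Toy stand-ins (hypothetical data): a large and a small labeled dataset.
large = [f'large-{i}' for i in range(10)]
large_labels = [i % 2 for i in range(10)]
small = ['small-a', 'small-b', 'small-c']
small_labels = [1, 0, 1]

# Integer number of copies of the small set that fits into the large one.
upsample_times = len(large) // len(small)

# Repeat sentences and labels together so each sentence keeps its label.
combined = large + upsample_times * small
combined_labels = large_labels + upsample_times * small_labels

print(upsample_times, len(combined), len(combined_labels))
```

Because the factor is rounded down, the two halves end up only approximately equal in size. The repetition is applied to the training portion after the train/test split, so duplicated sentences can't leak into the test set.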



And a sanity check on a few new examples I've found by googling, and some I've come up with myself:

incorrect_sentences = [
'你有没有车吗？',
'你是很高',
'你得包很漂亮',
'这个车很贵',
'这本车很贵',
'我碰到他在公园昨天了',
'在一家中国饭店，马丽见面了汤姆。',
'他们在法国见面了对方。',
'马丽结婚了汤姆。',
'汤姆结婚了马丽。',
'我喜欢都学生。',
'这是我的都。',
'我们开会在明天上午九点 。',
'我不有时间。'
]
correct_sentences = [
'你有没有车',
'你很高',
'你的包很漂亮',
'这辆车很贵',
'这辆车很贵',
'我昨天在公园碰到他了',
'在一家中国饭店，马丽和汤姆见面了。',
'他们在法国和对方见面了。',
'马丽嫁了汤姆。',
'汤姆娶了马丽。',
'我喜欢所有学生。',
'这是我的所有。',
'我们明天上午九点开会。',
'我没有时间。'
]

incorrect_df = pd.DataFrame(data={'sentence': incorrect_sentences, 'label': len(incorrect_sentences)*[0]})
incorrect_dataloader, _ = prepare_data(incorrect_df, test_size=0.0, batch_size=1, shuffle=False)
correct_df = pd.DataFrame(data={'sentence': correct_sentences, 'label': len(correct_sentences)*[1]})
correct_dataloader, _ = prepare_data(correct_df, test_size=0.0, batch_size=1, shuffle=False)
gen = zip(correct_sentences, predict(correct_dataloader, model=final_model, has_labels=True),
          incorrect_sentences, predict(incorrect_dataloader, model=final_model, has_labels=True))
print('Correct | Incorrect')
for correct, (prob_correct, _), incorrect, (prob_incorrect, _) in gen:
    print(f'{correct}: {prob_correct:.2f} | {incorrect}: {prob_incorrect:.2f}')

Correct | Incorrect



For those of you who don't know any Chinese, I'll explain the 3 false positives out of these examples.

The first two false positives use the wrong "measure word" for the noun "car". In English we have measure words for some things, like a pair of shoes or a loaf of bread, but Chinese has loads of them. It seems the model hasn't managed to learn this, but it's also a simple thing to add more data for: we can just find sentences with measure words and swap each one for a wrong one.
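
As a rough sketch of that idea, the snippet below builds negatives by swapping a correct measure word for each wrong one. The noun-to-measure-word table, the pattern and the `swap_measure_words` helper are all hypothetical and far too small for real use; a real version would need a proper measure word dictionary and word segmentation.

```python
import re

# Hypothetical, tiny noun -> correct measure word table (illustration only).
MEASURE_WORDS = {'车': '辆', '书': '本', '猫': '只', '桌子': '张'}
ALL_MEASURE_WORDS = sorted(set(MEASURE_WORDS.values()))

# Match "<determiner or numeral><measure word><noun>", e.g. 这辆车.
PATTERN = re.compile(
    '(这|那|一|两|三)(' + '|'.join(ALL_MEASURE_WORDS) + ')('
    + '|'.join(MEASURE_WORDS) + ')')


def swap_measure_words(sentence):
    """Return negatives made by replacing a correct measure word with each wrong one."""
    negatives = []
    for match in PATTERN.finditer(sentence):
        _, measure_word, noun = match.groups()
        if MEASURE_WORDS[noun] != measure_word:
            continue  # not the measure word from our table; leave it alone
        for wrong in ALL_MEASURE_WORDS:
            if wrong != measure_word:
                negatives.append(
                    sentence[:match.start(2)] + wrong + sentence[match.end(2):])
    return negatives


print(swap_measure_words('这辆车很贵'))
```

Among the generated negatives is 这本车很贵, one of the sentences the model wrongly accepted above. The generic measure word 个 is left out of the table on purpose, since it is often tolerated in casual speech and would make for noisy negatives.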

The last error is one of word order: in Chinese, the time and place come before the verb rather than after it. Getting this wrong is a bit surprising, but the model only gave it a probability of 0.58, so at least it's not very sure about it.