
"Can I Say This In Chinese" with BERT


As it turns out, most people are not very inclined to teach. I'm learning Chinese and my wife is Chinese, which seems like a match made in heaven. Except that she has no patience whatsoever with my broken Chinese (though she's wonderful in many other ways). Whenever I ask how to say something in Chinese, she answers with either "I don't know" or "you can't say that" (followed by no explanation). The only way I can get anything out of her is by trying to say something in Chinese and asking whether it sounds right or not. This is less mentally taxing for her than actually having to translate from English, which I understand, especially for two languages so dissimilar.

Now I'm thinking, with the recent advances in Natural Language Processing with Deep Learning, maybe I can create something to replace my unwilling wife. The academic name for this task seems to be "Linguistic Acceptability". Exactly what this includes seems to be up for debate. For example, "the mouse ate the cat" is perfectly grammatical, although highly unlikely. Then there are sentences which are grammatical but seem logically impossible, like "the cat is a bus". This sentence makes no sense unless you've watched the movie Totoro, which features a... cat that is also a bus. Since separating sense from nonsense seems like a very difficult problem, I'll be focusing on distinguishing grammatical vs. ungrammatical rather than sensical vs. nonsensical.

Defining the problem

Recent Deep Learning architectures like BERT and GPT-2 basically train a language model or LM, i.e. given the surrounding context, they try to predict the missing word. In GPT-2's case, it predicts the next word given all the previous words in the sentence, while BERT predicts a missing word (a cloze) given both the words before and after it (the B in BERT stands for bidirectional). As such, GPT-2 works better as a language model, defining the joint probability over a sequence of words, while BERT's masked LM is less straightforward to use as such. As a reminder, the joint probability can be factored recursively using the chain rule:

$$P(w_{1:n}) = P(w_n | w_{1:n-1})P(w_{1:n-1}) = P(w_n | w_{1:n-1}) \cdot \ldots \cdot P(w_2 | w_1)P(w_1)$$

Each of these factors is exactly what we get out of GPT-2, which means if we run inference and multiply the factors we get the joint probability, or actually more of an unnormalized likelihood, of the whole sentence. BERT on the other hand gives us $P(w_k | w_{1:k-1}, w_{k+1:n})$, which is harder to interpret. There is research exploring ways of getting a joint probability model out of BERT using MRFs (Markov Random Fields), but I'd like to keep things simple for this little project.
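To make the chain rule concrete, here is a minimal sketch of how the per-word conditionals from a left-to-right model combine into a sentence score. The probabilities are made up for illustration and `joint_log_prob` is a hypothetical helper, not something GPT-2 or any library provides. Summing logs instead of multiplying probabilities avoids numerical underflow on long sentences.

```python
import math

# Made-up conditionals P(w_i | w_{1:i-1}) for a three-word sentence,
# standing in for what a left-to-right language model would return.
conditionals = [0.2, 0.5, 0.1]

def joint_log_prob(conditional_probs):
    """Log of the chain-rule product: the sum of the log conditionals."""
    return sum(math.log(p) for p in conditional_probs)

log_p = joint_log_prob(conditionals)
joint_p = math.exp(log_p)  # same as multiplying the three factors
print(log_p, joint_p)
```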

Using GPT-2 will be difficult, since training it from scratch, having 1.5 billion weights, requires a cluster of GPUs and roughly $50k. So I'm constrained to pre-trained versions, of which there are none for Chinese AFAIK. The Python library pytorch-transformers does however have a pre-trained BERT for Chinese.

How can we use BERT?

Being constrained by time and money leaves me no option but to use BERT at this point. While BERT can't be used as a language model per se, we can perhaps use the output in some useful way.

We'd like to get a binary decision whether a sentence is acceptable or not. We could try to use the masked probability for each word in the sentence, but again, it will be difficult to find some absolute threshold to distinguish unlikely sentences from unacceptable ones. What we could do is to train a classifier based on BERT with a dataset of positive and negative examples. While there are such datasets for other languages (CoLA - Corpus of Linguistic Acceptability), I have not found such a dataset for Chinese.

I was, however, able to crawl some examples from the AllSet grammar wiki (licensed with CC-NC): 436 negative and 461 positive examples in total, split into grammar groups based on the wiki page they came from (note: this will take some time to run):

! wget --quiet --mirror --convert-links --adjust-extension --follow-tags=a --no-parent
! grep -r -e 'class="x"'**/* |\
  sed -e 's/<li class="x">//g' -e 's/<span .*//g' -e 's/<\/*[a-z]*>//g' -e 's/ //g' -e 's/:.*→/:/g' \
  > "$cache_path/allset_negative_examples.txt"
! grep -r -e 'class="o"'**/* |\
  sed -e 's/<li class="o">//g' -e 's/<span .*//g' -e 's/<\/*[a-z]*>//g' -e 's/ //g' -e 's/:.*→/:/g' \
  > "$cache_path/allset_positive_examples.txt"

While it's putting the cart before the horse a bit, I suspected (correctly) that this small dataset would not be enough to train a classifier that generalizes well to arbitrary sentences. There are just too few examples to generalize to all the ways sentences can be correct and wrong, although these examples do contain many important and subtle errors learners commit.

Self-supervised learning

Instead of only training on the small dataset, the idea is to pre-train a classifier in a self-supervised way by generating negative examples from positive ones. While the masked probabilities of all the words in a sentence are not enough to tell the acceptability of the sentence, we can assume there is useful information in the relative scores, or losses, between sentences.

Using relative losses, we can generate negative samples from positive ones by finding a mutation that significantly increases the loss. Let's define the loss for a sentence as the average Cross-Entropy loss over its words (averaged since we may be comparing sentences of differing lengths):

$$L(S) = -\frac{1}{N}\sum_{i=1}^{N}{\log(P(w_i | w_{1:i-1}, w_{i+1:N}))}$$

Then we can take a correct sentence $S_c$ and perform a random mutation to get $S_m$. If $L(S_m) - L(S_c) > \epsilon$ we consider $S_m$ to be unacceptable.
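As a small sketch of this criterion, the snippet below computes $L(S)$ from a list of per-token masked probabilities and applies the threshold test. The probabilities are invented for illustration, the helper names are hypothetical, and $\epsilon = 0.5$ is only an assumed value (it matches the default `loss_threshold` used in the code further down).

```python
import math

def sentence_loss(token_probs):
    """Average negative log-probability of the true tokens, i.e. L(S)."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def is_unacceptable(correct_probs, mutated_probs, epsilon=0.5):
    """True if the mutation raises the average loss by more than epsilon."""
    return sentence_loss(mutated_probs) - sentence_loss(correct_probs) > epsilon

# Invented masked probabilities P(w_i | rest of sentence) for each token
correct = [0.9, 0.8, 0.7, 0.9]
mutated = [0.9, 0.05, 0.1, 0.9, 0.8]  # one inserted token, two now unlikely

print(sentence_loss(correct), sentence_loss(mutated))
print(is_unacceptable(correct, mutated))
```

Because the loss is averaged per token, the inserted token in the mutated sentence does not by itself push the loss up; only the drop in the probabilities does.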

Note that even if we could use the bidirectional probabilities/losses to directly do classification, this is something we'd like to avoid since calculating this loss requires a forward pass for every token in the sentence. Using these expensively generated examples to train a classifier lets us bypass this problem.

Hard Negatives

This way we can generate unacceptable sentences from any acceptable one. Now since there are many possible ways to mutate a sentence that increase the loss by more than $\epsilon$, we can pick the minimal one that passes this threshold. This is similar to hard negative mining where, if you already have a model, you can improve it by sampling hard negatives and retraining the model. This is common in image classification and localization, where any part of an image not containing the specified object is a potential negative example. Then it makes sense to pick the ones that are misclassified or get high losses from the initial model.
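The selection step can be sketched as follows: among the candidate mutations, keep those whose loss increase exceeds $\epsilon$ and return the one with the smallest increase. `pick_hard_negative`, the placeholder sentences and the loss values are all hypothetical, chosen only to show the rule.

```python
def pick_hard_negative(base_loss, candidates, epsilon=0.5):
    """candidates is a list of (sentence, loss) pairs.

    Returns the sentence with the smallest loss increase that still
    exceeds epsilon, or None if no candidate passes the threshold.
    """
    passing = [(loss - base_loss, sentence)
               for sentence, loss in candidates
               if loss - base_loss > epsilon]
    if not passing:
        return None
    return min(passing)[1]

# Placeholder mutations with invented losses; the original sentence has loss 1.0
candidates = [("mutation A", 1.3), ("mutation B", 2.9), ("mutation C", 1.9)]
hardest = pick_hard_negative(1.0, candidates)
print(hardest)
```

Here "mutation A" is rejected as possibly still acceptable, and "mutation C" is preferred over "mutation B" because it is closer to the decision boundary.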


For the actual sentences, we could use the original corpus, but I prefer using sentences from Tatoeba since it is a good source of informal language suitable for learners.

For mutating the sentences, there are a few things we can do:

  • Permute the words
  • Swap two words
  • Insert word (sampled based on corpus frequency)
  • Replace word (sampled based on corpus frequency)
  • Delete word

While we want to mutate the sentences to get unacceptable ones, there is some degree of unacceptability, and we want to generate ones that are hard, i.e. just barely unacceptable. Therefore I exclude random permutations since they are very unlikely to produce something close to acceptability.

Similarly for insertions and word replacements, it makes more sense to sample common words more frequently than rare words since the language has a very long tail of very infrequent words.

Below is the code for loading the Tatoeba dataset and generating hard negatives. (NOTE: this is a lot of not very interesting code, but it is runnable if you run this in a Jupyter Notebook or Google Colab environment). Also worth mentioning is that the starting point for the PyTorch training was this Colab Notebook, which serves as a good tutorial for fine-tuning BERT for sequence classification.

First, installing some pip packages:

!pip install --quiet pytorch-transformers pytorch-nlp hanziconv jieba sympy

Import a pre-trained Masked LM BERT model and define functions for preparing data for this model, as well as functions for predicting based on it, and calculating losses for whole sentences:

import io
import os
import re
import torch
import jieba
import random
import tensorflow as tf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from collections import defaultdict
from tqdm import tqdm, trange
from hanziconv import HanziConv
from sympy.ntheory import factorint
from functools import lru_cache
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split, GroupKFold
from pytorch_transformers import BertTokenizer, BertConfig, BertModel
from pytorch_transformers import AdamW, BertForSequenceClassification, BertForMaskedLM
from pynvml import nvmlInit, nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo
from sklearn.metrics import matthews_corrcoef, precision_score, recall_score, accuracy_score

% matplotlib inline

device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_gpu = torch.cuda.device_count()

def gpu_usage(print_stats=False):
  """ Convenience function to check GPU memory usage. Returns free memory in GB """
  nvmlInit()
  handle = nvmlDeviceGetHandleByIndex(0)
  info = nvmlDeviceGetMemoryInfo(handle)
  if print_stats:
    print(f"Total memory: {info.total/1e9:.2f} GB")
    print(f"Free memory: {info.free/1e9:.2f} GB")
    print(f"Used memory: {info.used/1e9:.2f} GB")
  return info.free/1e9

# Make sure we have enough memory
if gpu_usage(print_stats=True) < 8:
  raise SystemError('Not enough memory')

# Load pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-base-chinese', do_lower_case=True)

# Load pre-trained model (weights) and move it to the same device as the input batches
masked_lm_model = BertForMaskedLM.from_pretrained('bert-base-chinese')
masked_lm_model.to(device)

def prepare_data(df, test_size=0.1, batch_size=32, shuffle=True, add_cls_sep=True):
  sentences = df.sentence.values
  # We need to add special tokens at the beginning and end of each sentence for BERT to work properly
  if add_cls_sep:
    sentences = ["[CLS] " + sentence + " [SEP]" for sentence in sentences]
  has_labels = 'label' in df.columns
  if has_labels:
    labels = df.label.values
  else:
    # No labels given: use dummy labels so the splitting code below still works
    labels = np.zeros(len(sentences))

  tokenized_texts = [tokenizer.tokenize(sent) for sent in sentences]

  # Set the maximum sequence length. The longest sequence in our training set is 47, but we'll leave room on the end anyway. 
  # In the original paper, the authors used a length of 512.
  MAX_LEN = 128

  # Use the BERT tokenizer to convert the tokens to their index numbers in the BERT vocabulary
  input_ids = [tokenizer.convert_tokens_to_ids(x) for x in tokenized_texts]

  # Pad our input tokens
  input_ids = pad_sequences(input_ids, maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")

  # Create attention masks
  attention_masks = []

  # Create a mask of 1s for each token followed by 0s for padding
  for seq in input_ids:
    seq_mask = [float(i > 0) for i in seq]
    attention_masks.append(seq_mask)

  # Use train_test_split to split our data into train and validation sets for training
  # but if test_size is zero then only generate training sets
  if test_size > 0.0:
    train_inputs, validation_inputs, train_labels, validation_labels = train_test_split(
        input_ids, labels, random_state=2018, test_size=test_size, shuffle=shuffle)
    train_masks, validation_masks, _, _ = train_test_split(
        attention_masks, input_ids, random_state=2018, test_size=test_size, shuffle=shuffle)
  else:
    train_inputs = input_ids
    train_labels = labels
    train_masks = attention_masks
    validation_inputs = []
    validation_labels = []
    validation_masks = []
  # Convert all of our data into torch tensors, the required datatype for our model
  train_inputs = torch.tensor(train_inputs)
  validation_inputs = torch.tensor(validation_inputs)
  train_labels = torch.tensor(train_labels)
  validation_labels = torch.tensor(validation_labels)
  train_masks = torch.tensor(train_masks)
  validation_masks = torch.tensor(validation_masks)

  # Create an iterator of our data with torch DataLoader. This helps save on memory during training because, unlike a for loop, 
  # with an iterator the entire dataset does not need to be loaded into memory
  train_data = TensorDataset(train_inputs, train_masks, *([train_labels] if has_labels else []))
  if shuffle:
    train_sampler = RandomSampler(train_data)
  else:
    train_sampler = SequentialSampler(train_data)
  train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

  validation_data = TensorDataset(validation_inputs, validation_masks, *([validation_labels] if has_labels else []))
  validation_sampler = SequentialSampler(validation_data)
  validation_dataloader = DataLoader(validation_data, sampler=validation_sampler, batch_size=batch_size)
  return train_dataloader, validation_dataloader

def predict(dataloader, model, has_labels=True):
  """
  Evaluates data from a data loader on a model and returns either a tuple of
  predicted probability and true label if has_labels=True otherwise it returns
  the raw logits
  """
  # Put model in evaluation mode
  model.eval()

  # Predict 
  for i, batch in enumerate(dataloader):
    # Add batch to GPU
    batch = tuple(t.to(device) for t in batch)
    # Unpack the inputs from our dataloader
    if has_labels:
      b_input_ids, b_input_mask, b_labels = batch
    else:
      b_input_ids, b_input_mask = batch

    # Telling the model not to compute or store gradients, saving memory and speeding up prediction
    with torch.no_grad():
      # Forward pass, calculate logit predictions
      logits, *_ = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)

    # Move logits and labels to CPU
    logits = logits.detach().cpu().numpy()
    if has_labels:
      softmax_probs = np.exp(logits[:, 1]) / np.exp(logits).sum(axis=1)
      label_ids = b_labels.to('cpu').numpy()
      for prob, label in zip(softmax_probs, label_ids):
        yield prob, label
    else:
      yield logits

def eval_loss_sentences(sentences, masking='char'):
  """
  Evaluate the loss for a list of sentences
  sentences: the list of sentences
  masking: 'word' for whole word, and 'char' for single character masking
  """
  assert masking in ['word', 'char']
  masking_words = masking == 'word'
  indexed_sentence_tokens = []
  tokenized_sentences = []
  sentence_mask_indices = []
  all_examples = []
  for sentence in sentences:
    # NOTE: the tokenizer removes spaces
    tokenized_sentence = tokenizer.tokenize(sentence)
    tokenized_sentence = tokenized_sentence[:128]
    if masking_words:
      tokenized_sentence = list(t[0] for t in jieba.tokenize(''.join(tokenized_sentence)))
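    mask_indices = []
    char_idx = 0
    for i in range(len(tokenized_sentence)):
      mask_token = tokenized_sentence[i]
      mask_token_parts = len(tokenizer.tokenize(mask_token)) if masking_words else 1
      # Replace token i with one [MASK] per character and keep the rest of the sentence
      all_examples.append(''.join(tokenized_sentence[:i]) +
                          ''.join(mask_token_parts*['[MASK]']) +
                          ''.join(tokenized_sentence[i+1:]))
      mask_indices.append((char_idx, char_idx+mask_token_parts))
      char_idx += mask_token_parts
    # Per-sentence bookkeeping used below to match logits back to the masked tokens
    char_tokens = tokenizer.tokenize(''.join(tokenized_sentence))
    indexed_sentence_tokens.append(tokenizer.convert_tokens_to_ids(char_tokens))
    tokenized_sentences.append(tokenized_sentence)
    sentence_mask_indices.append(mask_indices)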

  df = pd.DataFrame(data={'sentence': all_examples})
  dataloader, _ = prepare_data(df, test_size=0.0, batch_size=32, shuffle=False)

  sentence_losses = []
  curr_sentence_loss = 0
  curr_sentence = 0
  curr_mask_idx = 0
  curr_example = 0
  for batch_logits in predict(dataloader, masked_lm_model, has_labels=False):
    for i in range(batch_logits.shape[0]):
      mask_start, mask_end = sentence_mask_indices[curr_sentence][curr_mask_idx]
      for m in range(mask_start, mask_end):
        mask_logits = batch_logits[i][m+1]
        mask_logits_exp = np.exp(mask_logits)
        mask_token_probs = mask_logits_exp / mask_logits_exp.sum()
        mask_entropy = -(mask_token_probs * np.log(mask_token_probs)).sum()
        masked_token_index = indexed_sentence_tokens[curr_sentence][m]
        # Cross-Entropy Loss
        curr_sentence_loss += -np.log(mask_token_probs[masked_token_index])

      curr_mask_idx += 1
      curr_example += 1
      if curr_mask_idx == len(tokenized_sentences[curr_sentence]):
        # We've reached a new sentence, reset and append log prob
        # Normalize sentence loss by number of tokens
        curr_sentence_loss /= len(tokenized_sentences[curr_sentence])
        sentence_losses.append(curr_sentence_loss)
        curr_sentence_loss = 0
        curr_mask_idx = 0
        curr_sentence += 1
  return sentence_losses

Download example sentences from Tatoeba and word frequency dataset:

! wget
! bzip2 -dc sentences.tar.bz2 > "$cache_path/sentences.txt"
! wget -O "$cache_path/weibo.txt"

Below is the code for reading the Tatoeba and Weibo frequency datasets and generating hard negatives:

orig_sentences = []
with open(cache_path+'/sentences.txt', 'r') as f:
  for line in f:
      splits = line.split('\t')
      if len(splits) < 3:
        continue
      _, lang, zh = line.split('\t')
      # Keep only Mandarin sentences, converted to simplified characters
      if lang != 'cmn': continue
      zh = HanziConv.toSimplified(zh.strip())
      orig_sentences.append(zh)

words = []
counts = []
with open(cache_path+'/weibo.txt', 'r', encoding='utf-8-sig') as f: 
    for line in f.readlines():
        word, count = line.split('\t')
        tokenized_word = tokenizer.tokenize(word)
        if len(tokenized_word) == 0:
          continue
        # Skip [UNK] or other garbage unknown to the BERT tokenizer
        skip = False
        for t in tokenized_word:
          if len(t) > 1:
            skip = True
        if skip: continue
        words.append(word)
        counts.append(float(count))

# Calculate the probability and cumulative probability function for words over
# the frequency
counts = np.array(counts)
word_probs = counts / counts.sum()
cdf = np.cumsum(word_probs)

def sample_word():
  """ Sample a random word based on frequency """
  r = random.random()
  idx = np.searchsorted(cdf, r)
  return words[idx]

def middle_coprime(n):
  """ Find the middle coprime of a number, e.g. of all the
      sorted coprimes of n, pick the middle one """
  factors = list(factorint(n).keys())
  coprimes = [1]
  for i in range(n-2, 1, -1):
    coprime = True
    for f in factors:
      if i % f == 0:
        coprime = False
    if coprime:
      coprimes.append(i)
  return coprimes[len(coprimes) // 2]

def pseudo_random_range(from_idx, to_idx=None):
  """
  Visit all indices in a range pseudo-randomly by visiting (ax + b) mod n, 
  where a and n are co-prime. Small and large coprimes tend to not look random,
  so pick the middle one.
  """
  if to_idx is None:
    from_idx, to_idx = 0, from_idx

  n = to_idx - from_idx
  coprime = middle_coprime(n)
  offset = random.randint(0, n-1) if n > 1 else 0
  for i in range(0, n):
    yield from_idx + (coprime*i + offset) % n 

IGNORE = set(['。', '」', '「', ',', ' ', '!', '?', '?', '!', '.', ','])
# Swaps that usually produce acceptable sentences:
POSITIVE_SWAP_GROUPS = [set(['我', '你', '他', '她']), # personal pronouns
                       set(['我们', '你们', '他们', '她们'])] # plural personal pronouns
def is_positive_swap(from_token, to_token):
  swap_set = set([from_token, to_token])
  # Check if both tokens are in a positive swap group, if so we don't swap
  for swap_group in POSITIVE_SWAP_GROUPS:
    if len(swap_set & swap_group) == 2:
      return True
  return False

def generate_delete(sentence, tokens):
  for idx in pseudo_random_range(len(tokens)):    
    token = tokens[idx][0]
    if token in IGNORE:
      continue
    tokens_deleted = tokens[:idx] + tokens[idx+1:]
    yield ''.join(t[0] for t in tokens_deleted)

def generate_insert(sentence, tokens):
  for idx in pseudo_random_range(len(tokens)):    
    word = sample_word()
    tokens_inserted = tokens[:idx] + [(word,)] + tokens[idx:]
    yield ''.join(t[0] for t in tokens_inserted)

def generate_replace(sentence, tokens):
  for idx in pseudo_random_range(len(tokens)):    
    token = tokens[idx][0]
    if token in IGNORE:
      continue
    # Sample words until it's not equal to the token we're replacing
    word = token
    while word == token:
      word = sample_word()
    tokens_replaced = tokens[:idx] + [(word,)] + tokens[idx+1:]
    yield ''.join(t[0] for t in tokens_replaced)

def generate_swap(sentence, tokens):
  token_set = set([t[0] for t in tokens])
  for from_idx in pseudo_random_range(len(tokens)-1):    
    from_token = tokens[from_idx][0]
    if from_token in IGNORE:
      continue

    for to_idx in pseudo_random_range(from_idx, len(tokens)):
      to_token = tokens[to_idx][0]
      if (from_token == to_token or
          to_token in IGNORE):
        continue

      if is_positive_swap(from_token, to_token):
        continue
      # Swap the tokens and return the new string
      mtokens = list(tokens)
      mtokens[to_idx], mtokens[from_idx] = mtokens[from_idx], mtokens[to_idx]
      yield ''.join(t[0] for t in mtokens)

def generate_mutated(sentence):
  tokens = list(jieba.tokenize(sentence))
  generators = [#generate_delete(sentence, tokens),
                generate_insert(sentence, tokens),
                generate_replace(sentence, tokens),
                generate_swap(sentence, tokens)]
  pick_probs = np.array([0.15, 0.15, 0.7])
  while len(generators) > 0:
    gen_idx = np.random.choice(np.arange(len(generators)), p=pick_probs)
    random_gen = generators[gen_idx]
    try:
      yield next(random_gen)
    except StopIteration:
      # The generator is out of sentences to generate, so remove it
      del generators[gen_idx]
      pick_probs = np.delete(pick_probs, gen_idx)
      # Need to normalize so probabilities add up to 1
      pick_probs /= pick_probs.sum()

def generate_hard_negatives(sentences, model, loss_threshold=0.5, generate_max=10,
                            debug_print=False):
  """
  Creates hard negative examples, which are sampled based on mutations that
  increase the loss the least but still significantly enough to very likely be a
  true negative.
  """
  sentence_examples = list(sentences)
  for i, sentence in enumerate(sentence_examples):
    # Skip sentences with unknown words or other garbage
    if '[UNK]' in tokenizer.tokenize(sentence):
      continue
    # The first entry is the original sentence, followed by its mutations
    predict_sentences = [sentence]
    generator = generate_mutated(sentence)
    for _ in range(generate_max):
      try:
        predict_sentences.append(next(generator))
      except StopIteration:
        break
    losses = eval_loss_sentences(predict_sentences, masking='char')
    print('C: ', sentence)
    # Go through the mutations from smallest to largest loss
    for s, l in sorted(zip(predict_sentences[1:], losses[1:]), key=lambda x: x[1]):
      if l - losses[0] > loss_threshold:
        if debug_print:
          print('W: ', s, l, ' +', l-losses[0])
        yield s
        # Keep only the mutation with the smallest loss increase above the threshold
        break

negatives_path = cache_path + '/negatives.txt'
if os.path.exists(negatives_path):
  with open(negatives_path, 'r') as f:
    hard_negatives = [l.strip() for l in f.readlines()]
else:
  # NOTE: generating hard negatives takes a long time since to check a single mutation
  # we need to run inference len(sentence) times, and we need to generate a number
  # of mutations for each sentence in order to find a good one
  # So we run a few thousand at a time and store them in case runtime gets recycled
  use_num = 30000
  num_at_a_time = 3000
  use_sentences = orig_sentences[:use_num]
  hard_negatives = []
  for i in range(0, use_num // num_at_a_time):
    # Skip chunks that were already generated in a previous run
    if os.path.exists(f'{cache_path}/negatives{i+1}.txt'):
      continue
    with open(f'{cache_path}/negatives{i+1}.txt', 'w') as f:
      sentences = use_sentences[i*num_at_a_time:(i+1)*num_at_a_time]
      for negative in generate_hard_negatives(sentences, masked_lm_model, debug_print=True):
        f.write(negative + '\n')

  # Concatenate all files to one
  with open(negatives_path, 'w') as n:
    for i in range(0, use_num // num_at_a_time):
      with open(f'{cache_path}/negatives{i+1}.txt', 'r') as f:
        lines = f.readlines()
        n.writelines(lines)
        hard_negatives += [l.strip() for l in lines]
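
The `pseudo_random_range` helper in the code above relies on a small number-theory fact: if $a$ and $n$ are coprime, then $(a \cdot i + b) \bmod n$ for $i = 0, \ldots, n-1$ visits every index exactly once. The standalone sketch below illustrates this with a hypothetical `affine_order` function and arbitrarily chosen values of $a$ and $b$; it is not part of the notebook itself.

```python
import math

def affine_order(n, a, b):
    """Indices 0..n-1 in the order (a*i + b) mod n; a must be coprime with n."""
    assert math.gcd(a, n) == 1, "a and n must be coprime to visit every index"
    return [(a * i + b) % n for i in range(n)]

# 3 is coprime with 10, so every index appears exactly once
order = affine_order(10, 3, 4)
print(order)

# 4 shares the factor 2 with 10, so only some indices are ever visited
broken = [(4 * i + 4) % 10 for i in range(10)]
print(sorted(set(broken)))
```

This gives a shuffled-looking traversal without materializing and shuffling a list, which is convenient inside generators that may be abandoned early.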

Fine-tuning BERT

There are plenty of tutorials on how to fine-tune a BERT model. For this experiment I'll use the pre-trained Chinese model in the Python library pytorch-transformers by huggingface. This model is trained with a character-by-character tokenizer, meaning multi-character Chinese words are split into separate word embeddings for each character. This may be suboptimal, unless the model is powerful enough to capture the structure of words, but for now this is what we have to work with.

Below is the code for training and validating the BERT model for classification:

def train(dataloader, epochs=4, model=None, debug_print=False):
  if model is None:
    model = BertForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=2)
  # Make sure the model lives on the same device as the batches
  model.to(device)

  param_optimizer = list(model.named_parameters())
  no_decay = ['bias', 'gamma', 'beta']
  optimizer_grouped_parameters = [
      {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
       'weight_decay_rate': 0.01},
      {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
       'weight_decay_rate': 0.0}
  ]
  optimizer = AdamW(optimizer_grouped_parameters, lr=2e-5)
  # Set our model to training mode (as opposed to evaluation mode)
  model.train()
  train_loss_set = []

  # trange is a tqdm wrapper around the normal python range which prints progress
  r = trange(epochs, desc="Epoch") if debug_print else range(epochs)
  for _ in r:
    # Tracking variables
    train_loss = 0
    num_examples, num_steps = 0, 0

    # Train the data for one epoch
    for step, batch in enumerate(dataloader):
      if debug_print: print(f'Batch: {step}')
      # Add batch to GPU
      batch = tuple(t.to(device) for t in batch)
      # Unpack the inputs from our dataloader
      b_input_ids, b_input_mask, b_labels = batch
      # Clear out the gradients (by default they accumulate)
      optimizer.zero_grad()
      # Forward pass
      loss, *_ = model(b_input_ids, token_type_ids=None,
                       attention_mask=b_input_mask, labels=b_labels)
      # Backward pass
      loss.backward()
      # Update parameters and take a step using the computed gradient
      optimizer.step()

      # Update tracking variables
      train_loss += loss.item()
      num_examples += b_input_ids.size(0)
      num_steps += 1

    if debug_print: print("Train loss: {}".format(train_loss/num_steps))

  return model

def evaluate(model, dataloader, df=None):
    y_true = []
    y_pred = []
    for prob, label in predict(dataloader, model):
      y_true.append(label)
      y_pred.append(1 if prob > 0.5 else 0)
    return y_true, y_pred

def print_stats(y_true, y_pred, sentences=None, label=None):
  tab = ''
  if label is not None:
    # Print the label as a heading and indent the metrics below it
    print(f'{label}:')
    tab = '\t'
  print(f'{tab}Matthews Correlaton Coefficient:', matthews_corrcoef(y_true, y_pred))
  print(f'{tab}Accuracy:', accuracy_score(y_true, y_pred))
  print(f'{tab}Precision:', precision_score(y_true, y_pred))
  print(f'{tab}Recall:', recall_score(y_true, y_pred))

Now we can train our first classification model on positive examples from the Tatoeba dataset and our generated hard negatives. Here I'll train the classifier with an increasing number of examples to see if we need more data. Training and iterating is slow, so I prefer to keep the dataset as small as possible for now.

model_path = cache_path + '/'
if os.path.exists(model_path):
  classification_model = torch.load(model_path)
else:
  training_accuracies = []
  validation_accuracies = []
  classification_model = None
  for num in [3000, 6000, 9000, len(hard_negatives)]:
    hard_negatives_df = pd.DataFrame(data={
        'sentence': hard_negatives[:num] + orig_sentences[num:2*num],
        'orig': orig_sentences[:2*num],
        'label': num*[0]+num*[1]})

    train_dataloader, validation_dataloader = prepare_data(
        hard_negatives_df, test_size=0.1, batch_size=32)
    classification_model = train(train_dataloader, epochs=4, debug_print=False)
    print('Train accuracy: ', accuracy_score(*evaluate(classification_model, train_dataloader)))
    print('Validation accuracy: ', accuracy_score(*evaluate(classification_model, validation_dataloader)))
  # Save to disk, for rerunning and making copies
  torch.save(classification_model, model_path)

df = pd.DataFrame(data={
    'sentence': hard_negatives + orig_sentences[len(hard_negatives):2*len(hard_negatives)],
    'label': len(hard_negatives)*[0] + len(hard_negatives)*[1]})
dataloader, _ = prepare_data(df, test_size=0.0, batch_size=32)
print_stats(*evaluate(classification_model, dataloader), label='Final')

Now let's load the AllSet grammatical wiki examples and train models with cross-validation either from scratch or using the pre-trained model.

One important difference from the previous dataset is that we want to know how well the model generalizes to new unseen grammatical rules rather than just unseen examples. Therefore we split the data into training and validation sets based on the grammatical rule/group, such that examples from the same group are never split between the train and test sets.
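To illustrate what a group-based split guarantees, here is a simplified pure-Python stand-in for scikit-learn's `GroupKFold` (which is what the code below actually uses). The `group_folds` function and the group names are hypothetical; the greedy balancing is only an approximation of how a real implementation distributes groups.

```python
from collections import defaultdict

def group_folds(groups, n_splits):
    """Yield (train_idx, test_idx) pairs where no group spans both sides."""
    sizes = defaultdict(int)
    for g in groups:
        sizes[g] += 1
    # Greedily assign the largest groups first to the currently smallest fold
    fold_of_group = {}
    fold_sizes = [0] * n_splits
    for g, size in sorted(sizes.items(), key=lambda kv: -kv[1]):
        fold = fold_sizes.index(min(fold_sizes))
        fold_of_group[g] = fold
        fold_sizes[fold] += size
    for k in range(n_splits):
        test_idx = [i for i, g in enumerate(groups) if fold_of_group[g] == k]
        train_idx = [i for i, g in enumerate(groups) if fold_of_group[g] != k]
        yield train_idx, test_idx

# Hypothetical grammar pages, one entry per example sentence
groups = ["ba", "ba", "le", "le", "le", "de", "de", "guo"]
folds = list(group_folds(groups, n_splits=2))
for train_idx, test_idx in folds:
    print(sorted({groups[i] for i in train_idx}), sorted({groups[i] for i in test_idx}))
```

With an ordinary shuffled split, sentences illustrating the same grammar point would land on both sides, and the test score would mostly measure memorization of that rule.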

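allset_negative_examples = defaultdict(list)
with open(cache_path+'/allset_negative_examples.txt', 'r') as f:
  for l in f.readlines():
    filename, sentence = l.split(':')
    # Group the examples by the wiki page (grammar point) they came from
    allset_negative_examples[filename].append(sentence.strip())
allset_positive_examples = defaultdict(list)
with open(cache_path+'/allset_positive_examples.txt', 'r') as f:
  for l in f.readlines():
    filename, sentence = l.split(':')
    allset_positive_examples[filename].append(sentence.strip())

all_files = list(set(allset_negative_examples.keys()) |
                 set(allset_positive_examples.keys()))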
allset_sentences = []
allset_labels = []
allset_groups = []
for g, filename in enumerate(all_files):
  negative = allset_negative_examples[filename]
  positive = allset_positive_examples[filename]
  allset_sentences += negative + positive
  allset_labels += [0]*len(negative) + [1]*len(positive)
  allset_groups += (len(negative)+len(positive))*[g]

allset_sentences = np.array(allset_sentences)
allset_labels = np.array(allset_labels)
allset_groups = np.array(allset_groups)

tatoeba_sample = np.random.choice(orig_sentences, 10000)
hard_negative_sample = np.random.choice(hard_negatives, 10000)
self_supervised_df = pd.DataFrame(data={
    'sentence': list(hard_negative_sample) + list(tatoeba_sample),
    'label': len(hard_negative_sample)*[0] + len(tatoeba_sample)*[1]})
self_supervised_dataloader, _ = prepare_data(self_supervised_df, test_size=0.0,
                                             batch_size=32, shuffle=False)

def cross_validate_allset(initial_model_path=None, epochs=4, n_splits=10,
                          print_progress=True):
  train_results = [[], []]
  test_results = [[], []]
  self_supervised_results = [[], []]
  new_model = None
  if n_splits == 1:
    generator = [(np.arange(len(allset_sentences)),
  else:
    group_kfold = GroupKFold(n_splits=n_splits)
    generator = group_kfold.split(allset_sentences, allset_labels, allset_groups)

  for i, (train_index, test_index) in enumerate(generator):
    train_examples = allset_sentences[train_index]
    train_labels = allset_labels[train_index]
    test_examples = allset_sentences[test_index]
    test_labels = allset_labels[test_index]
    train_dataloader, _ = prepare_data(
        pd.DataFrame(data={'sentence': train_examples, 'label': train_labels}),
        test_size=0.0, batch_size=32)
    test_dataloader, _ = prepare_data(
        pd.DataFrame(data={'sentence': test_examples, 'label': test_labels}),
        test_size=0.0, batch_size=32)
    model = None
    if initial_model_path is not None:
      model = torch.load(initial_model_path)

    new_model = train(train_dataloader, epochs=epochs, model=model,
                      debug_print=False)
    train_result = evaluate(new_model, train_dataloader)
    test_result = evaluate(new_model, test_dataloader)
    self_supervised_result = evaluate(new_model, self_supervised_dataloader)
    if print_progress:
      print_stats(*train_result, label='AllSet Train')
      print_stats(*test_result, label='AllSet Test')
      print_stats(*self_supervised_result, label='Self-Supervised')

    train_results[0] += train_result[0]
    train_results[1] += train_result[1]
    test_results[0] += test_result[0]
    test_results[1] += test_result[1]
    self_supervised_results[0] += self_supervised_result[0]
    self_supervised_results[1] += self_supervised_result[1]
  # Report on the results accumulated over all folds, not just the last one
  print_stats(*train_results, label='Overall AllSet Train')
  print_stats(*test_results, label='Overall AllSet Test')
  print_stats(*self_supervised_results, label='Overall Self-Supervised')

  # Return the last model
  return new_model

First, let's train a model from scratch on the AllSet data and see how well it does against itself as well as against our self-supervised Tatoeba + hard negative dataset:

cross_validate_allset(initial_model_path=None, epochs=6, n_splits=10, print_progress=False);
Overall AllSet Train:
	Matthews Correlaton Coefficient: 0.978298651254621
	Accuracy: 0.9891304347826086
	Precision: 0.9838337182448037
	Recall: 0.9953271028037384
Overall AllSet Test:
	Matthews Correlaton Coefficient: 0.9366607354497857
	Accuracy: 0.967391304347826
	Precision: 0.94
	Recall: 1.0
Overall Self-Supervised:
	Matthews Correlaton Coefficient: 0.46815654446892113
	Accuracy: 0.7165
	Precision: 0.6568613244457325
	Recall: 0.9066

As you can see, it seems to generalize well on the AllSet data across the folds, which suggests it somehow generalizes to unseen grammar rules. But the performance on the self-supervised dataset is poor. This is probably because the AllSet data is biased towards easier, illustrative examples, which are substantially different from the average Tatoeba sentence. It also doesn't cover all the more "obvious" ways sentences can be ungrammatical.
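
To see where the self-supervised score goes wrong, the reported metrics can be turned back into a confusion matrix. The sketch below assumes the self-supervised evaluation set holds 10,000 grammatical and 10,000 ungrammatical sentences; that split is an assumption rather than something stated above, although it is consistent with the reported numbers. The helper names are made up for illustration.

```python
import math

def confusion_from_metrics(n_pos, n_neg, precision, recall):
    """Recover (tp, tn, fp, fn) from precision/recall and assumed class sizes."""
    tp = round(recall * n_pos)
    fn = n_pos - tp
    fp = round(tp / precision) - tp
    tn = n_neg - fp
    return tp, tn, fp, fn

def mcc_from_counts(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Assumed class sizes (10,000 each); precision and recall are copied
# from the "Overall Self-Supervised" output above.
tp, tn, fp, fn = confusion_from_metrics(10000, 10000, 0.6568613244457325, 0.9066)
mcc = mcc_from_counts(tp, tn, fp, fn)
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(tp, tn, fp, fn, mcc, accuracy)
```

Under that assumption, 4,736 of the 10,000 ungrammatical sentences are accepted as grammatical while only 934 grammatical ones are rejected, so the weakness is almost entirely false positives.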

Now let's do the same thing, but starting from a model pre-trained on the self-supervised dataset, in the hope that it generalizes to both datasets:

cross_validate_allset(initial_model_path=model_path, epochs=6, n_splits=10, print_progress=False);
Overall AllSet Train:
	Matthews Correlaton Coefficient: 0.9927488225424451
	Accuracy: 0.9963768115942029
	Precision: 0.9976580796252927
	Recall: 0.9953271028037384
Overall AllSet Test:
	Matthews Correlaton Coefficient: 0.9784719757905218
	Accuracy: 0.9891304347826086
	Precision: 0.9791666666666666
	Recall: 1.0
Overall Self-Supervised:
	Matthews Correlaton Coefficient: 0.8895640148971811
	Accuracy: 0.94365
	Precision: 0.9777107785075912
	Recall: 0.908

The overall results show that the model has generalized relatively well to both datasets, although the scores on the self-supervised dataset are lower than before.

For training the final model, we can get an even better result on the self-supervised data by training from scratch on both datasets, with the AllSet data upsampled to match the self-supervised data in size so that both get equal importance. Here I'll train it once with a single test set instead of k-fold cross-validation, so I don't time out in Google Colab.

final_model_path = cache_path+'/' 
if os.path.exists(final_model_path):
  final_model = torch.load(final_model_path)
else:
  # Again, need to split AllSet into train/test using GroupKFold
  # GroupKFold.split returns all cross-validation sets, but we'll just use the first
  allset_train_idx, allset_test_idx = next(GroupKFold(n_splits=10).split(allset_sentences, allset_labels, allset_groups))
  allset_train = allset_sentences[allset_train_idx]
  allset_train_labels = allset_labels[allset_train_idx]
  allset_test = allset_sentences[allset_test_idx]
  allset_test_labels = allset_labels[allset_test_idx]
  # Next split the self-supervised data set into train/test as well
  ss_train, ss_test, ss_train_labels, ss_test_labels =  train_test_split(
      orig_sentences[len(hard_negatives):2*len(hard_negatives)] + hard_negatives,
      [1]*len(hard_negatives) + [0]*len(hard_negatives), test_size=0.1)
  # Then combine both data sets, but with upsampling for AllSet so that they are
  # of equal size
  upsample_times = 2*len(hard_negatives) // len(allset_sentences)
  all_train = (list(ss_train) + upsample_times*list(allset_train))
  all_train_labels = (ss_train_labels + upsample_times*list(allset_train_labels))
  all_train_dataloader, _ = prepare_data(
      pd.DataFrame(data={'sentence': all_train, 'label': all_train_labels}),
      test_size=0.0, batch_size=32)
  allset_test_dataloader, _ = prepare_data(
      pd.DataFrame(data={'sentence': allset_test, 'label': allset_test_labels}),
      test_size=0.0, batch_size=32)
  ss_test_dataloader, _ = prepare_data(
      pd.DataFrame(data={'sentence': ss_test, 'label': ss_test_labels}),
      test_size=0.0, batch_size=32)
  final_model = train(all_train_dataloader, epochs=4,
  train_result = evaluate(final_model, all_train_dataloader)
  allset_test_result = evaluate(final_model, allset_test_dataloader)
  ss_test_result = evaluate(final_model, ss_test_dataloader)
  print_stats(*train_result, label='Train')
  print_stats(*allset_test_result, label='AllSet Test')
  print_stats(*ss_test_result, label='Self-Supervised Test')
  torch.save(final_model, final_model_path)
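
The upsampling step in the code above can be looked at in isolation. This is a minimal sketch with made-up stand-in data and a hypothetical helper name (`upsample_to_match`), not the notebook's actual variables; it only illustrates the integer-repeat arithmetic.

```python
def upsample_to_match(small, large_size):
    """Repeat `small` a whole number of times so it roughly matches `large_size`."""
    times = max(1, large_size // len(small))
    return times * list(small), times

# Stand-ins for the real sentence lists (sizes chosen only for illustration)
ss_train = [f"ss_{i}" for i in range(1000)]
allset_train = [f"allset_{i}" for i in range(90)]

upsampled, times = upsample_to_match(allset_train, len(ss_train))
combined = ss_train + upsampled
print(times, len(upsampled), len(combined))
```

Because of the integer division the two parts only match approximately in size. The labels have to be repeated the same number of times, and the repetition should happen after the train/test split (as it does above), so that copies of a test sentence never end up in the training data.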

And a sanity check on a few new examples I've found by googling, and some I've come up with myself:

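# Sentence pairs, in the same order as the printed output below
incorrect_sentences = [
  '你有没有车吗?', '你是很高', '你得包很漂亮', '这个车很贵', '这本车很贵',
  '我碰到他在公园昨天了', '在一家中国饭店,马丽见面了汤姆。',
  '他们在法国见面了对方。', '马丽结婚了汤姆。', '汤姆结婚了马丽。',
  '我喜欢都学生。', '这是我的都。', '我们开会在明天上午九点 。', '我不有时间。']
correct_sentences = [
  '你有没有车', '你很高', '你的包很漂亮', '这辆车很贵', '这辆车很贵',
  '我昨天在公园碰到他了', '在一家中国饭店,马丽和汤姆见面了。',
  '他们在法国和对方见面了。', '马丽嫁了汤姆。', '汤姆娶了马丽。',
  '我喜欢所有学生。', '这是我的所有。', '我们明天上午九点开会。', '我没有时间。']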

incorrect_df = pd.DataFrame(data={'sentence': incorrect_sentences, 'label': len(incorrect_sentences)*[0]})
incorrect_dataloader, _ = prepare_data(incorrect_df, test_size=0.0, batch_size=1, shuffle=False)
correct_df = pd.DataFrame(data={'sentence': correct_sentences, 'label': len(correct_sentences)*[1]})
correct_dataloader, _ = prepare_data(correct_df, test_size=0.0, batch_size=1, shuffle=False)
gen = zip(correct_sentences, predict(correct_dataloader, model=final_model, has_labels=True),
          incorrect_sentences, predict(incorrect_dataloader, model=final_model, has_labels=True))
print('Correct | Incorrect')
for correct, (prob_correct, _), incorrect, (prob_incorrect, _) in gen:
  print(f'{correct}: {prob_correct:.2f} | {incorrect}: {prob_incorrect:.2f}')
Correct | Incorrect
你有没有车: 1.00 | 你有没有车吗?: 0.29
你很高: 1.00 | 你是很高: 0.02
你的包很漂亮: 1.00 | 你得包很漂亮: 0.00
这辆车很贵: 1.00 | 这个车很贵: 1.00
这辆车很贵: 1.00 | 这本车很贵: 1.00
我昨天在公园碰到他了: 0.87 | 我碰到他在公园昨天了: 0.00
在一家中国饭店,马丽和汤姆见面了。: 1.00 | 在一家中国饭店,马丽见面了汤姆。: 0.01
他们在法国和对方见面了。: 1.00 | 他们在法国见面了对方。: 0.00
马丽嫁了汤姆。: 0.84 | 马丽结婚了汤姆。: 0.00
汤姆娶了马丽。: 1.00 | 汤姆结婚了马丽。: 0.00
我喜欢所有学生。: 1.00 | 我喜欢都学生。: 0.00
这是我的所有。: 1.00 | 这是我的都。: 0.02
我们明天上午九点开会。: 1.00 | 我们开会在明天上午九点 。: 0.58
我没有时间。: 1.00 | 我不有时间。: 0.00

For those of you who don't know any Chinese, I'll explain the 3 false positives out of these examples.

The first two false positives come from using the wrong "measure word" for the noun "car". In English we have measure words for some things, like a pair of shoes or a loaf of bread, but Chinese has loads of them. It seems like the model hasn't managed to learn this, but it's also a simple thing to add more data for: we can just find sentences with measure words and swap them for the wrong one.
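
A rough sketch of what such a measure-word swap could look like follows. The measure-word list and the trigger characters are a tiny hand-picked assumption, and `swap_measure_word` is a hypothetical helper, not part of the code above.

```python
import random
import re

# Hypothetical, hand-picked measure words; a real list would be much longer.
MEASURE_WORDS = ["辆", "本", "只", "张", "条"]
# Assume a measure word directly follows a demonstrative or a numeral.
PATTERN = re.compile("([这那一两三几每])([" + "".join(MEASURE_WORDS) + "])")

def swap_measure_word(sentence, rng):
    """Replace the first measure word found with a different one, or return None."""
    match = PATTERN.search(sentence)
    if match is None:
        return None
    wrong = rng.choice([m for m in MEASURE_WORDS if m != match.group(2)])
    return sentence[:match.start(2)] + wrong + sentence[match.end(2):]

rng = random.Random(0)
negative = swap_measure_word("这辆车很贵", rng)  # a sentence with a measure word
no_match = swap_measure_word("你很高", rng)      # no measure word, so None
print(negative, no_match)
```

Matching on raw characters is crude: 只 after 这 can also mean "only", for instance, and a randomly chosen replacement might happen to be acceptable for the noun. A more careful version would probably need word segmentation and a noun-to-measure-word lexicon.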

The last error is one of word order: in Chinese the time and place always come first in a sentence. Getting this wrong is a bit surprising, but the model only gave it a probability of 0.58, so at least it's not very sure about it.
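
One way to make use of that uncertainty is to report a three-way verdict instead of a hard yes/no. The sketch below is only an illustration: the thresholds 0.2 and 0.8 are arbitrary assumptions, `verdict` is a hypothetical helper, and the probabilities are copied from the output above.

```python
def verdict(prob_acceptable, low=0.2, high=0.8):
    """Map a predicted probability to a three-way verdict (thresholds are illustrative)."""
    if prob_acceptable >= high:
        return "acceptable"
    if prob_acceptable <= low:
        return "unacceptable"
    return "unsure"

# Probabilities taken from the sanity-check output above
examples = {
    "这本车很贵": 1.00,
    "我们开会在明天上午九点 。": 0.58,
    "我不有时间。": 0.00,
}
verdicts = {sentence: verdict(p) for sentence, p in examples.items()}
for sentence, v in verdicts.items():
    print(sentence, v)
```

This would flag the word-order sentence as "unsure" rather than accepting it, but it does nothing for confident mistakes such as the measure-word examples, which score 1.00.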