SNAILS Identifier Naturalness Data
Dataset Generation
We tried several approaches to classifying the naturalness of schema identifiers, including human scoring, heuristics, and machine learning. Ultimately, we built a ground-truth dataset by validating and, where needed, correcting the labels produced by a fine-tuned GPT davinci classification model. Using this dataset, we then trained a local model based on Canine that outperforms the GPT-based model.
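The validate-and-correct step described above can be pictured as merging model-predicted labels with reviewer corrections. The following is a minimal sketch under stated assumptions, not the project's actual tooling: the label names (`regular`, `low`, `least`), the `LabeledIdentifier` record, the `build_ground_truth` helper, and the sample identifiers are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# Hypothetical label set; the actual dataset's label names may differ.
LABELS = {"regular", "low", "least"}


@dataclass(frozen=True)
class LabeledIdentifier:
    identifier: str
    label: str
    modified: bool  # True if a reviewer changed the model's label


def build_ground_truth(model_labels, reviewer_corrections):
    """Merge model-predicted labels with reviewer corrections.

    A reviewer correction, when present, overrides the model's label;
    otherwise the model's label is accepted as validated.
    """
    rows = []
    for identifier, predicted in model_labels.items():
        final = reviewer_corrections.get(identifier, predicted)
        if final not in LABELS:
            raise ValueError(f"unknown label {final!r} for {identifier!r}")
        rows.append(LabeledIdentifier(identifier, final, final != predicted))
    return rows


# Illustrative inputs only; these are not rows from the real dataset.
model_labels = {"customer_name": "regular", "cust_nm": "regular", "cnm": "least"}
reviewer_corrections = {"cust_nm": "low"}

ground_truth = build_ground_truth(model_labels, reviewer_corrections)
modified_count = sum(row.modified for row in ground_truth)
```

Tracking a `modified` flag per row makes it easy to report how often the reviewer disagreed with the classifier, which is one way to gauge how much the ground truth depends on manual correction.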
Dataset Download
The dataset can be accessed on our GitHub repository.