

Abstract

We adapt the BLiMP (Benchmark of Linguistic Minimal Pairs) language model evaluation framework to the context of poetry, introducing the first of a series of tasks titled Benchmark of Poetic Minimal Pairs (BPoMP). The tasks presented herein use one genre of English-language poetry, the limerick (five lines, rhyme scheme AABBA). Following the BLiMP schema, the BPoMP tasks use 10,000 minimal pairs, each consisting of a limerick and a corrupted version of it. The corrupted limerick is created by (1) shuffling two rhyming end-of-the-line words, (2) shuffling two rhyming lines, or (3) replacing an end-of-the-line word with a non-rhyming synonym. Our general task is detection of the original limerick, which we believe tests a language model's capacity to utilize "end rhymes", a common feature of poetry. We evaluate Transformer-based models by checking whether they assign a higher probability to the non-corrupted limerick in each minimal pair. We find that the models identify the original limerick at rates better than chance, but with a nontrivial gap relative to human accuracy (average of 98.3% across tasks). The publicly available curated set of limericks accompanying this paper is an additional contribution.
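The minimal-pair procedure described in the abstract can be illustrated with a small sketch. It is not the authors' implementation. The function names (`swap_end_words`, `swap_lines`, `minimal_pair_accuracy`), the choice of which lines to corrupt, and the toy scorer standing in for a Transformer's log-probability are all assumptions made for illustration. The example limerick has its punctuation stripped so that whitespace splitting isolates the end words. The synonym-replacement corruption is omitted because it requires a lexical resource.

```python
from typing import Callable, Iterable, List, Tuple

Limerick = List[str]  # five lines, rhyme scheme AABBA


def swap_end_words(limerick: Limerick, i: int, j: int) -> Limerick:
    """Return a copy with the final words of lines i and j exchanged."""
    lines = [line.split() for line in limerick]
    lines[i][-1], lines[j][-1] = lines[j][-1], lines[i][-1]
    return [" ".join(words) for words in lines]


def swap_lines(limerick: Limerick, i: int, j: int) -> Limerick:
    """Return a copy with whole lines i and j exchanged."""
    corrupted = list(limerick)
    corrupted[i], corrupted[j] = corrupted[j], corrupted[i]
    return corrupted


def minimal_pair_accuracy(
    pairs: Iterable[Tuple[Limerick, Limerick]],
    log_prob: Callable[[str], float],
) -> float:
    """Fraction of (original, corrupted) pairs where the original scores higher."""
    pairs = list(pairs)
    correct = sum(
        log_prob("\n".join(original)) > log_prob("\n".join(corrupted))
        for original, corrupted in pairs
    )
    return correct / len(pairs)


# Illustrative input: an Edward Lear limerick with punctuation removed.
LIMERICK = [
    "There was an Old Man with a beard",
    "Who said it is just as I feared",
    "Two Owls and a Hen",
    "Four Larks and a Wren",
    "Have all built their nests in my beard",
]


def toy_log_prob(text: str) -> float:
    """Hypothetical scorer: a placeholder for a real model's log-probability."""
    return 0.0 if text == "\n".join(LIMERICK) else -1.0


pairs = [
    (LIMERICK, swap_end_words(LIMERICK, 0, 1)),  # corruption type (1)
    (LIMERICK, swap_lines(LIMERICK, 2, 3)),      # corruption type (2)
]
accuracy = minimal_pair_accuracy(pairs, toy_log_prob)
```

In an actual evaluation, `toy_log_prob` would be replaced by a function that sums a language model's token log-probabilities over the text. A scorer that cannot tell the two versions apart would be expected to reach about 50% accuracy, which is the chance baseline for a two-way choice.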
