NC Bench - A Benchmark for Creative Writing Models
This AI benchmark was built by Novelcrafter (hence the name NC Bench) and is designed to evaluate the performance of models on creative writing as well as various related tasks.
The following categories are included in the benchmark:
- Creative Writing: Assessing the model's ability to generate pleasing, well-structured text that adheres to good writing principles.
- Instruction Following: Evaluating the model's capacity to accurately follow specific instructions.
- Utility: Testing the model's ability to perform data extraction and reformatting tasks without hallucinations.
- Tooling: Checking whether the model can be driven through programmatic interfaces and produce error-free output.
- Language: Assessing the model's proficiency in generating high-quality text across multiple languages.
Test Focus & AI Ethics
Our focus is on enhancing the writing process with AI assistance rather than replacing it entirely.
This benchmark tests creative writing quality rather than a model's ability to write complete stories or replace the entire writing process. We believe AI should serve as a tool to assist writers, which is reflected in our test focus:
- Text Manipulation: Evaluating the model's ability to modify given text without introducing hallucinations. This is valuable for writers who need to change tenses, rephrase paragraphs, or make minor adjustments.
- Text Generation: Assessing the model's capacity to provide inspiration or ideas while closely following human-given instructions and maintaining coherence with the provided storyline.
- Text Summarization: Testing the model's ability to create concise elevator pitches or summaries, useful for quick overviews or marketing purposes.
- Text Translation: Evaluating the model's proficiency in translating text into other languages, enabling writers to reach broader audiences or draw inspiration from diverse linguistic sources.
In summary, we do not focus on replacing the full writing process, but rather on assisting writers by providing specific tools that help them with their work.
How We Rank
Each scenario page features multiple tabs that rank models in different ways. Here's what each one means:
Total Score (Top Performers)
The Total Score is the straightforward average of a model's evaluation results across all runs for a given scenario. Each run is scored between 0% and 100% based on how well the model's output met the scenario's specific evaluation criteria. The Top Performers chart shows the top 20 models ranked by their median total score, displayed as box plots so you can see the spread of results across runs.
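As a minimal sketch of the two numbers described above, using made-up run scores rather than real benchmark data:

```python
from statistics import median

# Hypothetical per-run scores (fractions of 1.0) for one model on one scenario.
run_scores = [0.82, 0.90, 0.78, 0.85, 0.88]

# Total Score: the straightforward average across all runs.
total_score = sum(run_scores) / len(run_scores)

# The Top Performers chart ranks by the median run score, which is
# less sensitive to a single outlier run than the mean.
median_score = median(run_scores)

print(f"total:  {total_score:.1%}")   # average of the five runs
print(f"median: {median_score:.1%}")  # middle value of the five runs
```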
Stability (Most Stable)
A model that scores 90% on one run but 30% on the next isn't very useful in practice — you can't rely on it to deliver consistent results. The Stability score captures this by combining two factors:
- Median score: The middle value of all run scores. This ensures a model that consistently scores 0% isn't rewarded for being "stable."
- Consistency: Calculated as 1 − 2 × standard deviation. A model with no variance scores 1.0 (perfectly consistent), while maximum variance maps to 0.
The final stability score is the product of these two: median × consistency. This means a model needs to both score well and do so reliably to rank high on the Most Stable chart.
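The formula above can be sketched as follows. The use of the population standard deviation and the clamp at zero are assumptions for illustration; with scores bounded to [0, 1], the standard deviation can be at most 0.5, which is why 1 − 2 × stddev maps maximum variance to 0:

```python
from statistics import median, pstdev

def stability(run_scores):
    """Stability = median × consistency.

    run_scores are fractions in [0, 1]. Consistency is 1 - 2 × standard
    deviation (clamped at 0 as a defensive assumption), so a model with
    no variance gets consistency 1.0 and maximum variance gets 0.
    """
    med = median(run_scores)
    consistency = max(0.0, 1.0 - 2.0 * pstdev(run_scores))
    return med * consistency

# A steady model outranks an erratic one with a similar middle score:
print(stability([0.80, 0.82, 0.81]))  # high: good median, tiny spread
print(stability([0.30, 0.90, 0.85]))  # much lower: similar median, wild swings
```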
Rank Score (Top Overall)
The Rank Score is a composite that balances four dimensions to give a holistic view of model quality:
- Performance: How well the model scores on evaluations (higher is better).
- Cost: How much each run costs via the model's API (lower is better).
- Speed: How long the model takes to respond (lower is better).
- Stability: How consistent the results are across runs (higher is better).
Each dimension is min-max normalized across all models in a scenario, mapping values to a 0–1 range. The composite score is the average of all available dimensions for each model — if cost or speed data isn't available for a model, those dimensions are simply excluded rather than penalized. The Top Overall chart ranks models by this composite, and the rank badges in the Details table reflect the same ordering.
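The normalization and averaging steps can be sketched like this. The model names and raw numbers are invented, and the handling of ties (mapping a constant dimension to 0.5) is an assumption, not part of the published method:

```python
# Raw per-model measurements; None means the data isn't available.
# Performance and stability: higher is better; cost and speed: lower is better.
raw = {
    "model-a": {"performance": 0.90, "cost": 0.004, "speed": 2.5, "stability": 0.80},
    "model-b": {"performance": 0.70, "cost": 0.001, "speed": 1.0, "stability": 0.95},
    "model-c": {"performance": 0.80, "cost": None,  "speed": None, "stability": 0.85},
}

HIGHER_IS_BETTER = {"performance": True, "cost": False, "speed": False, "stability": True}

def normalized(raw):
    """Min-max normalize each dimension across all models that report it."""
    out = {m: {} for m in raw}
    for dim, higher in HIGHER_IS_BETTER.items():
        vals = {m: r[dim] for m, r in raw.items() if r[dim] is not None}
        lo, hi = min(vals.values()), max(vals.values())
        for m, v in vals.items():
            n = 0.5 if hi == lo else (v - lo) / (hi - lo)  # tie handling: assumption
            out[m][dim] = n if higher else 1.0 - n         # flip so 1.0 is always best
    return out

def rank_score(dims):
    # Average only the dimensions the model has; missing ones aren't penalized.
    return sum(dims.values()) / len(dims)

scores = {m: rank_score(d) for m, d in normalized(raw).items()}
for m in sorted(scores, key=scores.get, reverse=True):
    print(m, round(scores[m], 3))
```

Note how model-c is averaged over only two dimensions, so its missing cost and speed data neither help nor hurt it.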