Awesome De Novo Peptide Sequencing

A comprehensive, interactive map of the field — algorithms, post-processors, downstream applications, and adjacent tools, deep-learning and classical alike.

Scope. A comprehensive map of de novo peptide sequencing — core algorithms, post-processors (re-rankers / FDR / refinement), downstream applications (immunopeptidomics, metaproteomics, cyclopeptides), adjacent tools (database-search hybrids, glycopeptide pipelines), reviews / surveys, and benchmarks. Both deep-learning and classical methods are tracked; the filters below let you slice by approach, acquisition mode (DDA / DIA), and paper kind. Want a paper added? See Contributing.

🎚️ Filters — Kind · Approach · Acquisition. Apply across the whole page; pinned to the top while you scroll.

The wave

The architectures

De novo sequencing has cycled through several methodological families: first hand-engineered dynamic programming and learning-to-rank, then a long stretch of CNN+RNN models, then transformers, graph neural networks (GNNs), non-autoregressive (NAR) variants, and most recently diffusion. Use the filters to focus on one slice of the field; hover over a dot to read the method’s signature contribution.

Where the work happens

Authors and the institutions behind them span multiple countries. Pan and zoom the map; once you zoom in past about 2×, country circles split into the individual cities behind them. Circle area scales with the chosen metric; fill color shows the quartile rank.
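The encoding described above (area proportional to the metric, fill by quartile) can be sketched as follows. This is only an illustration: the function names, the `max_radius` parameter, and the choice of `statistics.quantiles` for the quartile cut points are assumptions, not the site's actual code.

```python
import math
from statistics import quantiles


def circle_radius(value, max_value, max_radius=40.0):
    """Radius such that the circle's AREA (not its radius) scales with value."""
    if max_value <= 0:
        return 0.0
    # area ~ r^2, so r must grow with the square root of the metric
    return max_radius * math.sqrt(value / max_value)


def quartile_rank(value, all_values):
    """Return 1..4: which quartile of all_values the value falls into."""
    cuts = quantiles(all_values, n=4)  # three cut points; needs >= 2 values
    return 1 + sum(value > cut for cut in cuts)


# Hypothetical per-country paper counts, for illustration only.
counts = [1, 2, 3, 4, 5, 6, 7, 8]
for c in (1, 5, 8):
    print(c, round(circle_radius(c, max(counts)), 1), quartile_rank(c, counts))
```

Scaling the radius by the square root keeps a country with four times the papers at four times the area, rather than sixteen times.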

Top institutions

Who’s driving it

The chart shows the twenty most-published authors. The network below it shows how authors with ≥ 3 papers are connected through co-authorship; drag a node to reshape the layout, or hover to highlight a neighborhood.
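A minimal sketch of how such a co-authorship network could be derived, assuming each paper is available as a plain list of author names. The function name and input shape are hypothetical; the ≥ 3 papers threshold is the one stated above.

```python
from collections import Counter
from itertools import combinations


def coauthor_edges(papers, min_papers=3):
    """Count co-authored papers between every pair of prolific authors.

    papers: iterable of author-name lists, one list per paper.
    Returns a Counter mapping (author_a, author_b) -> shared paper count.
    """
    papers = [sorted(set(authors)) for authors in papers]
    paper_counts = Counter(a for authors in papers for a in authors)
    prolific = {a for a, n in paper_counts.items() if n >= min_papers}

    edges = Counter()
    for authors in papers:
        kept = [a for a in authors if a in prolific]
        # Names are sorted, so each pair is always emitted in the same order.
        for pair in combinations(kept, 2):
            edges[pair] += 1
    return edges


# Placeholder author names, for illustration only.
demo = [["A", "B"], ["A", "B", "C"], ["A", "B"], ["A", "C"]]
print(coauthor_edges(demo))
```

In the demo, author C has only two papers and is dropped, leaving a single A–B edge with weight 3.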

Models and the authors behind them

This second view rewires the same network as a bipartite graph: every prolific author (≥ 3 papers) is linked to the models they helped publish. Algorithm nodes are diamonds colored by their architecture family — so you can see which research groups own which slice of the architectural landscape.
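The bipartite rewiring can be sketched in a few lines, assuming each paper record carries its author list and the algorithms it publishes. The record layout and the function name are assumptions made for this example.

```python
from collections import Counter


def author_model_edges(papers, min_papers=3):
    """Link each prolific author to every model they helped publish.

    papers: list of dicts with 'authors' and 'algorithms' lists.
    Returns a sorted list of unique (author, algorithm) edges.
    """
    paper_counts = Counter(a for p in papers for a in set(p["authors"]))
    prolific = {a for a, n in paper_counts.items() if n >= min_papers}
    edges = {
        (author, algorithm)
        for p in papers
        for author in p["authors"]
        if author in prolific
        for algorithm in p["algorithms"]
    }
    return sorted(edges)


# Placeholder names, for illustration only.
demo_papers = [
    {"authors": ["A", "B"], "algorithms": ["ModelX"]},
    {"authors": ["A"], "algorithms": ["ModelY"]},
    {"authors": ["A", "C"], "algorithms": []},  # e.g. a review with no model
]
print(author_model_edges(demo_papers))
```

Because edges only ever run author → algorithm, the graph is bipartite by construction; papers that introduce no model still count toward an author's paper total.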

How the field cites itself

A chronological citation arc diagram. Papers are placed left-to-right by publication date and stratified vertically by kind; within each row, the most-cited papers float to the top. Each arc connects a citing paper (right end) to a paper it cites (left end), curving upward above the row. Hover any paper to highlight the citations into it (red) and out of it (blue), and dim everything else.

Edges are resolved from Crossref (by DOI) and Semantic Scholar (by DOI, with a title-search fallback), then matched back to catalog publications by exact DOI or, for references without a DOI, by fuzzy title match with a token-set ratio ≥ 92. Only intra-catalog citations are drawn; references to papers outside the catalog are filtered out. Every arrow runs citing → cited, so the arrowhead always lands on the older paper.
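To make the matching rule concrete, here is a standard-library sketch of DOI-exact matching with a fuzzy-title fallback. The source does not say which library computes the token-set ratio, so `token_set_ratio` below is an approximation built on `difflib`; its scores may differ slightly from those of dedicated fuzzy-matching libraries. The record layout and function names are assumptions.

```python
import re
from difflib import SequenceMatcher


def _ratio(a, b):
    """Similarity of two strings on a 0-100 scale."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def token_set_ratio(a, b):
    """Approximate token-set ratio: word order and repeated words are ignored."""
    ta = set(re.findall(r"\w+", a.lower()))
    tb = set(re.findall(r"\w+", b.lower()))
    common = " ".join(sorted(ta & tb))
    s1 = f"{common} {' '.join(sorted(ta - tb))}".strip()
    s2 = f"{common} {' '.join(sorted(tb - ta))}".strip()
    return max(_ratio(common, s1), _ratio(common, s2), _ratio(s1, s2))


def match_reference(ref, catalog, threshold=92):
    """Return the catalog entry a reference points to, or None."""
    if ref.get("doi"):  # DOI-exact path
        doi = ref["doi"].lower()
        return next((p for p in catalog if (p.get("doi") or "").lower() == doi), None)
    title = ref.get("title")
    if not title or not catalog:
        return None
    # Fuzzy-title fallback, used only for references without a DOI.
    best = max(catalog, key=lambda p: token_set_ratio(title, p["title"]))
    return best if token_set_ratio(title, best["title"]) >= threshold else None


# Made-up catalog entries, for illustration only.
catalog = [
    {"doi": "10.1/abc", "title": "De novo peptide sequencing by deep learning"},
    {"doi": "10.1/xyz", "title": "A survey of mass spectrometry"},
]
print(match_reference({"title": "De Novo Peptide Sequencing by Deep Learning."}, catalog))
```

One property worth knowing: a title whose words are a strict subset of another title's words scores 100 under a token-set ratio, so a high threshold alone does not rule out every false match.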

Where it appears

Most papers in this space appear first on bioRxiv or arXiv. Toggle preprints vs. peer-reviewed to see how the venue distribution shifts.

Venue citedness (open-data analog of the Impact Factor)

Two-year mean citedness from OpenAlex (summary_stats.2yr_mean_citedness). Methodologically equivalent to the Clarivate Impact Factor formula — mean citations in year t to articles published in years t-1 and t-2 — but computed over OpenAlex’s open Crossref-aggregated citation graph rather than the paywalled Web of Science one. Conferences and preprint servers are omitted (their non-rolling publication schedule makes the metric misleading). Built offline via build_journal_metrics.py; refresh annually.
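The formula in words above can be written out as a small function. This is a sketch of the definition only, not of how OpenAlex or build_journal_metrics.py computes it; the record layout (`year`, `cites_by_year`) is a made-up shape for the example.

```python
def two_year_mean_citedness(works, year):
    """Mean citations received in `year` by works published in year-1 and year-2.

    works: list of dicts with 'year' (publication year) and
           'cites_by_year' (mapping of citing year -> citation count).
    Returns None when the venue published nothing in the two-year window.
    """
    window = [w for w in works if w["year"] in (year - 1, year - 2)]
    if not window:
        return None
    citations = sum(w["cites_by_year"].get(year, 0) for w in window)
    return citations / len(window)


# Toy venue with made-up numbers, for illustration only.
venue_works = [
    {"year": 2022, "cites_by_year": {2024: 4}},
    {"year": 2023, "cites_by_year": {2023: 9, 2024: 2}},
    {"year": 2021, "cites_by_year": {2024: 100}},  # outside the window
]
print(two_year_mean_citedness(venue_works, 2024))  # (4 + 2) / 2 = 3.0
```

Only citations made in year t count, and only works from the two preceding years are in the denominator, which is why the heavily cited 2021 item does not move the result.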

Publication lifecycle

How a method goes from arXiv / bioRxiv preprint to a peer-reviewed publication. For each algorithm, every preprint is paired greedily with the earliest following peer-reviewed publication (peer-reviewed / ML-conference / thesis all count as “post-preprint”). The Status column then tells you whether each row is paired (lifecycle complete), preprint-only (still in flight), or peer-reviewed-only (published without a preprint we have on file).
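The greedy pairing can be sketched as below for a single algorithm's publications. Two details are assumptions not stated above: each peer-reviewed publication is consumed by at most one preprint, and a publication dated the same day as the preprint counts as "following". The record layout, kind labels, and function name are likewise hypothetical.

```python
from datetime import date

# Kinds that count as "post-preprint", per the description above.
POST_PREPRINT = {"peer-reviewed", "ml-conference", "thesis"}


def pair_lifecycle(pubs):
    """Pair each preprint with the earliest later post-preprint publication.

    pubs: list of dicts with 'kind' and 'date', all for ONE algorithm.
    Returns a list of (preprint, published, status) rows.
    """
    preprints = sorted((p for p in pubs if p["kind"] == "preprint"), key=lambda p: p["date"])
    finals = sorted((p for p in pubs if p["kind"] in POST_PREPRINT), key=lambda p: p["date"])
    used, rows = set(), []
    for pre in preprints:
        # Greedy step: the first unused publication dated on/after the preprint.
        hit = next(
            (i for i, f in enumerate(finals) if i not in used and f["date"] >= pre["date"]),
            None,
        )
        if hit is None:
            rows.append((pre, None, "preprint-only"))
        else:
            used.add(hit)
            rows.append((pre, finals[hit], "paired"))
    # Anything left over was published without a preprint on file.
    rows += [(None, f, "peer-reviewed-only") for i, f in enumerate(finals) if i not in used]
    return rows


# Made-up dates, for illustration only.
demo_pubs = [
    {"kind": "preprint", "date": date(2020, 1, 1)},
    {"kind": "peer-reviewed", "date": date(2019, 6, 1)},
    {"kind": "peer-reviewed", "date": date(2021, 3, 1)},
    {"kind": "preprint", "date": date(2022, 5, 1)},
]
for pre, pub, status in pair_lifecycle(demo_pubs):
    print(status, pre and pre["date"], pub and pub["date"])
```

Being greedy, the pairing is not guaranteed to be the best global assignment when an algorithm has several preprints and publications, but it is simple and deterministic.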

Browse all papers

Browse all authors

Aggregated from the currently-filtered set of papers. Searching is case-insensitive across every column (name, affiliation, country, methods).
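A minimal sketch of a case-insensitive, all-column search, assuming each author row is a dict keyed by the four columns named above. The function name and the sample rows are invented for illustration.

```python
def search_rows(rows, query):
    """Keep rows where any column contains the query, ignoring case."""
    needle = query.casefold()
    return [
        row for row in rows
        if any(needle in str(value).casefold() for value in row.values())
    ]


# Made-up rows, for illustration only.
author_rows = [
    {"name": "Ada Example", "affiliation": "Example University", "country": "Canada", "methods": "ModelX"},
    {"name": "Bo Sample", "affiliation": "Sample Institute", "country": "Denmark", "methods": "ModelY"},
]
print([row["name"] for row in search_rows(author_rows, "MODELy")])
```

`str.casefold` is used rather than `str.lower` because it also normalizes some non-ASCII case pairs, which matters for author names.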

Contributing

Easiest path: open a GitHub issue with a link to the paper (DOI / arXiv / bioRxiv / OpenReview / …) and I’ll wire it into the database. Corrections are equally welcome — wrong author lists, missing affiliations, mis-classified kind / DL / acquisition, broken hyperlinks, anything that looks off.

Advanced: edit the database directly

The site is generated from denovo.db (SQLite, the source of truth). If you’re comfortable with SQL:

  1. Edit denovo.db with any SQLite tool (sqlite3 CLI, DB Browser for SQLite, DataGrip, …). A new paper typically needs rows in publication, publication_author, and publication_algorithm; a new model also needs a row in algorithm (set kind, is_deep_learning, acquisition_mode). Affiliations cascade through country → city → affiliation and link to authors via author_affiliation.

  2. Regenerate the human-readable SQL dump so the diff is reviewable:

    sqlite3 denovo.db .dump > denovo.sql

  3. Open a PR with both denovo.db and denovo.sql. The GitHub Action rebuilds the site and publishes to gh-pages on merge, typically live within ~3 minutes.
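Before adding rows in step 1, it can help to check which columns each table actually expects. The helper below is a hedged sketch using Python's built-in sqlite3 module and SQLite's `PRAGMA table_info`. The table names come from the steps above; the demo table is a stand-in containing only the three `algorithm` columns named there, and the real table will likely have more.

```python
import sqlite3

# Table names taken from step 1 above.
TABLES = ["publication", "publication_author", "publication_algorithm", "algorithm"]


def describe_tables(conn, tables=TABLES):
    """Return {table: [column names]} for each listed table that exists."""
    schema = {}
    for table in tables:
        if not table.isidentifier():  # PRAGMA arguments cannot be bound as parameters
            raise ValueError(f"unexpected table name: {table!r}")
        rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
        if rows:  # an empty result means the table does not exist
            schema[table] = [row[1] for row in rows]  # row[1] is the column name
    return schema


# Stand-in database so the sketch runs without denovo.db.
demo_db = sqlite3.connect(":memory:")
demo_db.execute(
    "CREATE TABLE algorithm (kind TEXT, is_deep_learning INTEGER, acquisition_mode TEXT)"
)
print(describe_tables(demo_db))
```

To inspect the real schema, pass `sqlite3.connect("denovo.db")` instead of the in-memory stand-in.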


Source data and code: GitHub. The site is rebuilt automatically on every push to main.
