Palimpsest

An open-source tool that reads through the noise of scanned lecture notes and rewrites them as clean, modern LaTeX documents. Made by a student, for students.

§ 01 · WHY THIS EXISTS

A specific kind of frustration

University courses are full of brilliant lecturers whose course materials haven't aged well: scanned photocopies of handwritten notes, pages photographed with CamScanner, image-only PDFs with no text layer, no copy-paste, no search. Dense physics, statics, fluid mechanics, tensor notation — locked inside blurry pixels.

Generic OCR doesn't work. Tesseract chokes on ∂²u/∂x². Google Docs mangles integrals. There's nothing built for STEM content that doesn't cost a fortune or require a Mathpix subscription.

Palimpsest is the workaround. Drop a scanned PDF, get a clean .tex and a compiled .pdf back — readable, searchable, printable, hand-in-able. That's the whole promise.

palimpsest /ˈpalɪmp.sɛst/ · a manuscript page that has been scraped clean and written over, so that traces of the original text still show through. That's exactly what this tool does: it reads through the noise and rewrites the content cleanly.

§ 02 · BEFORE & AFTER

What you put in, what comes out

A typical page from a 1980s mechanics course, before and after Palimpsest:

Original scan

░▒▓▓░ Statique ░▒░
▓░ Solide en équilibre ░▒
░ ∑F = 0 ∑M = 0 ░░
▒░ exemple : poutre ▓░
░ encastrée en A ░░░▒
▓░ M(A) = -F·L ░░▓░
░ R(A) = F ░░░▒▓░░
▒ (figure 4.2) ░░▒░
░░ démonstration… ░░
▓░ d²θ/dt² = -k/m·θ ▒

▷▷▷

Clean LaTeX

Section 4 · Statique
Pour un solide en équilibre statique : ∑ F⃗ = 0 ∑ M⃗ = 0 Exemple. Une poutre encastrée en A subit une force F à son extrémité, à distance L. Les réactions à l'encastrement sont : R_A = F M_A = −F·L

Equations are typeset properly. Sections are numbered. The Table of Contents is generated. Every figure reference points somewhere real. Open it in Overleaf, edit, hand in.

§ 03 · HOW IT WORKS

The pipeline, in six steps

Each page goes through a chain of small, replaceable stages. If one fails, the rest keep going; if you stop midway, you can resume from the last cached page.

01 · EXTRACT

Page → image

Every page of the PDF is rasterised at 400 DPI so even the smallest indices stay legible.

pdf2image · poppler

02 · PREPROCESS

Clean the scan

Adaptive binarisation kills the photocopy grain, Hough deskew straightens the page, denoising smooths the rest.

OpenCV

03 · READ

Vision OCR

A vision LLM reads the page image directly — formulas, Greek letters, integrals, indices — and emits a first LaTeX pass.

OpenAI · Anthropic

04 · CONTEXT

Carry meaning across pages

A small YAML ledger tracks variables, conventions and section structure so notation on page 3 still makes sense on page 50.

page-to-page memory

05 · SANITISE

Make it compile

A post-pass scrubs the patterns that break Overleaf: stray code fences, banned macros, orphan TikZ, unbalanced math.

src/sanitize.py

06 · COMPILE

Typeset to PDF

All pages are assembled into one document with a proper preamble, cover page and TOC, then compiled with xelatex.

xelatex · LaTeX

§ 04 · WHAT YOU GET

Small details that matter

Real LaTeX, not Markdown

Proper \section{}, \begin{equation}, \begin{tikzpicture}. Edit it like any other LaTeX document.

Cover page included

Every output ships with a typeset titlepage (title, subject, author, credit) so you can hand it in as-is.

Page-by-page cache

Interrupted runs resume exactly where they stopped. You don't pay for the same page twice.

Inter-page memory

A notation defined on page 4 is still understood on page 47 — variables and conventions persist.

Job history

Every document is logged. Come back days later and the archive is still there with download links.

Overleaf compatible

The output compiles with both xelatex and pdflatex via an iftex conditional.

Multi-model

Eight models across OpenAI and Anthropic. Use o4-mini for cheap-and-good, Claude Opus for hard pages.

No Mathpix needed

Vision-direct mode lets the LLM OCR straight from images — no extra subscription, no extra API key.

§ 05 · FAQ

Questions that come up

How much does it cost to process a document?

With o4-mini (the default), roughly $0.04 per page. A 50-page lecture costs around $2. A 300-page textbook chapter, around $12. Costs vary with page complexity and image size — heavy figures and dense equations cost more than plain text.

What happens to my PDF? Is it private?

Your file is uploaded to the server, sent to the chosen LLM provider (OpenAI or Anthropic), and the output is stored locally on the server. Uploads are auto-purged after PALIMPSEST_UPLOAD_RETENTION_DAYS (default: 7 days). The hosted instance is for personal/educational use — if you're working on something sensitive, self-host it. The whole codebase is on GitHub.

Why LaTeX instead of just Word?

STEM content is full of equations, Greek letters, indices, integrals, and matrix notation. LaTeX is the only output format that renders all of that correctly without compromise. As a bonus, your hand-in will look like it was typeset by a publisher.

What if my scan is really bad?

The pipeline degrades gracefully: even very degraded scans usually produce a readable first pass, sometimes with a few transcription errors. Use the slower but more accurate claude-opus or gpt-4.1 for rough scans. The page-by-page cache means you can re-process individual pages without restarting the whole run.

Can I self-host this?

Yes. Clone the repo, fill in your API key in config.yaml, run python server.py. There's also a Dockerfile + a docker-compose.yml for a one-command deploy. See the README on GitHub for the full instructions.

What languages does it support?

The pipeline is currently optimised for French STEM content (because that's what I needed it for as a student), but it works with any Latin-script language the underlying LLM understands — English, Spanish, Italian, German, etc. Multi-language UX is on the roadmap.

Who built this?

A student called Abdullah Camur who got fed up with photocopied lecture notes. No company, no startup, no monetisation. Just a tool that exists because it had to. Source on GitHub, MIT licensed.

Ready to clean up a scan?

Drop a PDF on the workshop page. The pipeline takes a few minutes per document — you'll get a live progress feed while it runs, and a clean .tex plus .pdf at the end.

▷ open the workshop view source ↗