Archival text workflows
Lingua is PureTensor’s language research initiative. It brings together research notes, technical essays, and prototype work around language preservation, archival text workflows, and low-resource computational linguistics.
Lingua concentrates on the practical side of language technology: how source material is digitised, organised, searched, analysed, and turned into usable research material.
Testing whether OCR, alignment, and corpus-building workflows can make fragile manuscripts and dispersed archives more legible to scholars.
Exploring how multilingual models, transcription systems, and annotation pipelines might support endangered and historical language work without overstating what current systems can do.
Treating compute, storage, search, and careful review processes as part of the research problem, not just the delivery layer around it.
For PureTensor, the interesting question is not only whether a model can classify or transcribe. It is whether the surrounding workflow — digitisation, review, corpus assembly, search, storage, provenance, and human judgement — can be improved enough to make small, careful research efforts materially more effective.
That is why Lingua combines research direction with infrastructure thinking, rather than treating models in isolation from the workflows around them.
Lingua operates as a compact, publication-driven initiative. The emphasis is on careful problem selection, technical clarity, and work that can support future prototypes, partnerships, or archive-building efforts.
Research notes, essays, and working papers from the Lingua research line, retained as part of the PureTensor archive.
Of the roughly 7,000 languages spoken on Earth today, a significant proportion have never been written down. They exist only in the mouths and ears of their speakers — in conversation, in song, in the stories told at nig…
For most of recorded history, the decipherment of ancient scripts has been a fundamentally human endeavour — part intuition, part obsessive pattern recognition, part luck. Michael Ventris spent years working on Linear B…
When we talk about endangered languages, the conversation almost always centres on spoken words — the fading voices of elderly speakers, the unwritten grammars of remote communities, the oral traditions that die when the…
In a small classroom on the Big Island of Hawai’i, a three-year-old greets her teacher entirely in ʻōlelo Hawaiʻi — the Hawaiian language. Her parents don’t speak it. Her grandparents don’t speak it. But she does, becaus…
Somewhere in the world, a language is falling silent. Not with a dramatic last word or a ceremonial farewell, but quietly — in the gap between an elderly grandmother who dreams in her mother tongue and grandchildren who…
A language dies roughly every two weeks. With nearly half of the world’s 7,000 languages at risk of vanishing within a generation, linguists and technologists are locked in an unprecedented race against time. But a new a…
The useful framing for Lingua is not launch theatre. It is disciplined sequencing: concept first, then pilot, then partnership only if the work warrants it.
Concept: clarify the thesis, publish research notes, and narrow the actual problem worth building for.
Pilot: assemble a small demonstration corpus and test a limited workflow end-to-end with human review.
Partnership: if the work proves useful, pursue collaborators, source access, and formal research relationships.
Lingua is led by Heimir Helgason within PureTensor, combining research direction, technical development, and infrastructure design.
If the work matures, the outward form can mature with it. Until then, the site is intentionally modest and direct.
Heimir Helgason
Founder, PureTensor
Infrastructure, systems design, and applied AI are the current backbone of the project. Linguistic depth and source access would need to be built through future collaboration rather than implied by branding alone.
Use the form below if you want to discuss archival material, potential collaboration, or whether there is a genuinely useful pilot worth attempting.