Context-sensitive languages are exactly the LBA languages (CSL = LBA)

Statement

The automaton model for the context-sensitive (type-1) languages is the linear bounded automaton (LBA): a nondeterministic Turing machine restricted to the portion of tape occupied by its input. The classical Myhill–Landweber–Kuroda theorem says that LBAs and context-sensitive grammars characterize the same class of languages.

Langlib formalizes this as a full class equality

\[\mathrm{CS} = \mathrm{LBA}.\]

In Lean

CS_eq_LBA — the headline class equality (CS : Set (Language T)) = LBA for the canonical endmarker LBA (is_LBA/LBA).
CS_subset_LBA / LBA_subset_CS — the Kuroda and Myhill directions.
is_LBA_pos_iff — is_LBA_pos L ↔ is_CS L ∧ [] ∉ L: the genuinely space-bounded core. The marker-free |w|-cell model is_LBA_pos recognizes exactly the ε-free context-sensitive languages.
myhill_language_eq — the core LBA → CS bisimulation: the Myhill grammar generates exactly the bounded-tape machine’s (non-empty) language.

The endmarker’s two simulations are simMachine/language_simMachine_eq — run a bounded-tape machine on ⊢ w ⊣ — and flagMachine/language_flagMachine_eq — fold the markers away onto |w| cells.

The two construction halves are each split into a machine/grammar construction plus completeness and soundness arguments, mirroring the TM = RE and PDA = CFG developments:

LBA → CS (Myhill) — the grammar myhillGrammar, with completeness and soundness arguments establishing myhill_language_eq.
CS → LBA (Kuroda) — the simulator kMachine and capstone noncontracting_finite_to_LBA, with kMachine_complete and kMachine_sound.

The LBA model and the empty word

The canonical model is the endmarker LBA. The input w is written bracketed between a left endmarker ⊢ and a right endmarker ⊣, so the tape is ⊢ w ⊣ with |w| + 2 cells, over the alphabet EndAlpha T Γ = Option (T ⊕ Γ) ⊕ Bool (interior = blank / input / work over a finite work alphabet Γ; the two Bool symbols are the endmarkers). The head is confined to the bracketed region by the usual boundary clamping, and the working space comes entirely from Γ.

Crucially the empty word gets the genuine two-cell tape ⊢⊣, on which the machine runs like any other input: it accepts ε exactly when its transitions accept with the two markers adjacent. So the machine itself decides ε — no external flag is needed, and CS = LBA holds on the nose (rather than merely “up to ε”).

The genuinely space-bounded content is isolated in the marker-free |w|-cell model is_LBA_pos: a tape of exactly |w| cells over Option (T ⊕ Γ), which (having no cell to run on for ε) recognizes precisely the ε-free context-sensitive languages — is_LBA_pos L ↔ is_CS L ∧ [] ∉ L. The endmarker model is this plus the ε case, run natively on ⊢⊣; that single extra cell is exactly what upgrades “ε-free CSL” to all of CS. So there is no empty-word flag anywhere — the only place ε is decided is the machine’s behaviour on the two-cell tape ⊢⊣.

Proof idea

CS → LBA (Kuroda). Given a context-sensitive language, restrict to a non-contracting grammar g₀ with finitely many nonterminals (exists_noncontracting_offEmpty_of_CS), whose language is the original one minus ε. The simulator kMachine g₀ works on a tape of |w| cells whose work track decodes to a sentential form. It nondeterministically runs the grammar backwards: from the input word it repeatedly rewrites an occurrence of a rule’s output back to the rule’s input pattern (patList), and accepts once the track has been reduced to the single start symbol [S]. Non-contraction guarantees every sentential form of a derivation of w fits in |w| cells, so the bounded tape suffices. Completeness replays a forward derivation [S] ⇒* w in reverse (kMachine_complete); soundness maintains a forward invariant — at every reachable configuration the decoded form still derives to w, so an accepting run that verifies [S] exhibits a genuine derivation (kMachine_sound).

LBA → CS (Myhill). The Myhill grammar encodes LBA configurations as bracketed sentential forms of bounded length and runs the LBA’s transitions in reverse. Because the LBA never leaves the input region, intermediate forms stay bounded and the rules never erase, so the construction is structurally context-sensitive; myhill_language_eq proves it generates exactly the automaton’s non-empty language.

The reduction between the non-contracting core and Langlib’s canonical class is_CS gives the ε-free characterization is_LBA_pos_iff; running the bounded-tape machine on ⊢ w ⊣ and deciding ε on ⊢⊣ (the two language_*_eq simulations) then upgrades it to the headline CS_eq_LBA.

Every LBA language is also Turing-recognizable — lba_language_subset_tm0_language shows NLBA languages ⊆ Turing-machine languages via BFS determinization. Combined with CS ⊆ recursive this places the context-sensitive languages inside the recursive (hence recursively enumerable) ones. The non-contracting form of context-sensitive grammar used in the Kuroda direction is bridged to Langlib’s canonical class in non-contracting = context-sensitive.

Keywords / also known as

context-sensitive languages equal linear bounded automata, CSL = LBA, LBA ⇔ CSG, Myhill–Landweber–Kuroda theorem, Kuroda construction, Myhill grammar construction, type-1 languages automaton characterization, NLBA, nondeterministic linear bounded automaton.

Formalized in Lean 4 with Mathlib, in Langlib.