Here are my thoughts...
I think the starting point would be to use the uncorrected DOI string: "ihaledeposotedinthecopnttol..."
And then have AI separate and correct the message.
"I have deposited in the county of.."
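To make the target concrete, here is a rough non-AI baseline for the same "separate and correct" task. It is only a sketch under assumptions that are mine: the function names and the tiny vocabulary are made up, it uses plain dynamic programming over a word list rather than any AI model, and it only handles substituted letters (no dropped or extra letters).

```python
def hamming(a: str, b: str) -> int:
    """Count the positions where two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def segment_and_correct(text: str, vocabulary: list[str]) -> str:
    """Split an unspaced string into vocabulary words, choosing the split
    that needs the fewest substituted letters (hypothetical helper)."""
    n = len(text)
    inf = float("inf")
    best = [inf] * (n + 1)   # best[i]: fewest substitutions to explain text[:i]
    back = [None] * (n + 1)  # back[i]: (start index, word) of the word ending at i
    best[0] = 0
    for end in range(1, n + 1):
        for word in vocabulary:
            start = end - len(word)
            if start < 0 or best[start] == inf:
                continue
            cost = best[start] + hamming(text[start:end], word)
            if cost < best[end]:
                best[end] = cost
                back[end] = (start, word)
    if best[n] == inf:
        raise ValueError("no segmentation found with this vocabulary")
    # Walk the back-pointers from the end to recover the chosen words.
    words = []
    pos = n
    while pos > 0:
        start, word = back[pos]
        words.append(word)
        pos = start
    return " ".join(reversed(words))


# Assumed toy vocabulary; a real run would need a period-appropriate word list.
VOCABULARY = ["i", "have", "deposited", "in", "the", "county", "of"]
print(segment_and_correct("ihaledeposotedinthecopnttol", VOCABULARY))
# -> i have deposited in the county of
```

With a real word list the choice between close candidates gets much harder, which is where the AI step would earn its keep.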
Once AI can do that, the next step would be to bring in a "text" document and remove the punctuation, headers, spaces, etc. Documents from before 1820 put the first word of the next page at the bottom of each page. That needs to be removed too.
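The cleanup step could look something like the sketch below. The assumptions are mine and would need checking against real documents: the header is the first non-empty line of each page, the repeated first word sits alone on the last line of the page, and the names `flatten_pages` and `letters_only` are just made up for illustration.

```python
def letters_only(text: str) -> str:
    """Lowercase the text and keep only letters and digits."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


def body_lines(page: str, header_lines: int) -> list[str]:
    """Return the non-empty lines of a page with the assumed header removed."""
    lines = [line.strip() for line in page.splitlines() if line.strip()]
    return lines[header_lines:]


def flatten_pages(pages: list[str], header_lines: int = 1) -> str:
    """Turn page texts into one unspaced, unpunctuated string."""
    kept = []
    for index, page in enumerate(pages):
        lines = body_lines(page, header_lines)
        if lines and index + 1 < len(pages):
            next_words = " ".join(body_lines(pages[index + 1], header_lines)).split()
            last_words = lines[-1].split()
            # Drop the last line if it is a single word that repeats the
            # first word of the next page (assumed layout of the repeat).
            if (
                len(last_words) == 1
                and next_words
                and letters_only(last_words[0]) == letters_only(next_words[0])
            ):
                lines = lines[:-1]
        kept.extend(lines)
    return letters_only(" ".join(kept))
```

The output is the same kind of run-together string as the starting example, so it can be fed straight into the separate-and-correct step.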
Then do the same for different formats: PDFs, JPGs, etc. There are probably converters that could be leveraged for this: PDF to text, JPG to text, etc.
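As one possible way to wire in converters, the sketch below picks a command-line tool by file extension. It assumes Poppler's `pdftotext` and the Tesseract OCR command-line tool are installed, which is my choice and not something we have settled on; the function names are hypothetical.

```python
import subprocess
from pathlib import Path
from typing import Optional


def converter_command(path: str) -> Optional[list[str]]:
    """Return the external command that prints the file's text to stdout,
    or None when the file is already plain text."""
    suffix = Path(path).suffix.lower()
    if suffix == ".txt":
        return None
    if suffix == ".pdf":
        # Assumes Poppler's pdftotext; "-" sends the text to stdout.
        return ["pdftotext", path, "-"]
    if suffix in {".jpg", ".jpeg", ".png"}:
        # Assumes the Tesseract CLI; "stdout" as output base prints the text.
        return ["tesseract", path, "stdout"]
    raise ValueError(f"no converter configured for {suffix!r}")


def to_text(path: str) -> str:
    """Read a document as text, converting it first when needed."""
    command = converter_command(path)
    if command is None:
        return Path(path).read_text(encoding="utf-8")
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout
```

Whatever comes out of `to_text` would then go through the same cleanup and correction steps as the plain text documents.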