Today, while surfing the web, I found this YouTube video of Eric Gilson and John Joergensen of the Rutgers Camden Law Library presenting at the 2005 CALI (Center for Computer-Assisted Legal Instruction) conference. In it, Gilson and Joergensen discuss how to digitize congressional documents and build a low-cost digital library using open-source software. Unfortunately, the projector's screen output is hard to make out in the video, but the concepts are still relevant.
For more specifics about full-text indexing, here is Joergensen at CALI 2010 explaining the Swish-e search engine, which Rutgers uses to index the congressional documents:
Some of the processing details have changed since 2005, but the Digital Library is still running today, with more than 13,000 congressional documents processed in its U.S. Congressional Documents collection.
If you are interested in legal informatics, I highly recommend CALI’s YouTube channel.