recoll: Going paperless without getting moneyless

Context

For ages, I have wanted to go paperless.  Not that I particularly fear the effects of time on paper, as it can be very cruel to digital media too.  My issue was much more pragmatic:

  • Unless you have a very efficient memory and/or a very clever organisation system, quickly finding THAT letter you got from the bank 3 years ago is not always that easy;
  • While I am lucky enough to live in a decently large house, I don’t want to fill it with paper.  More precisely, I want what I do keep on paper to fit in one or two binders.

Going paperless involves four things:

  1. Some form of IT infrastructure to store the digital documents;
  2. Pieces of software that will allow for content search in these documents;
  3. A scanner to digitize received paper documents;
  4. A backup strategy that includes off-site and confidentiality aspects.

This article will focus on topics 1 and 2.  Topic 3 is up to your budget and willingness to spend ages or not on scanning, and topic 4 will be handled later on.

Some form of IT infrastructure to store the digital documents

I already had an infrastructure in place that revolved around what you could call a home server on steroids:

  • A 4-core Core i5-4570S server with 16GB of RAM running Proxmox.  I primarily use LXC containers, except when functionally impossible and for a virtualised pfSense firewall.  At the time of building the solution, the OS and most of the VM/CT boot disks were on an SSD, and the actual data storage of the guests lived on a 4TB RAID1 array of NAS-grade HDDs.
  • To store my media files, including the scanned documents and eBooks from various sources, a Synology DS216j NAS running a 4TB RAID1 array of, once more, NAS-grade HDDs.
  • And some more stuff not relevant to this article.

Although I run a worrying amount of stuff at home, I do not belong to this gang of people who have a datacenter-grade rack full of servers in their attic.

However, there is one key takeaway: if you decide to store important documents that matter in your everyday life electronically, plan for copies.  That means having RAID arrays, for redundancy, AND externalized backups.  A RAID is NOT a backup by any means; this topic will be covered in another post.

Pieces of software that will allow for content search in these documents

Choosing a solution

The solutions to achieve what the title announces broadly fall into two categories:

  • The Document Management Systems, like Mayan EDMS.  These solutions will manage versions of your files, review workflows, and so on.  The overall concept is that they will ingest files you put in an input folder, move them elsewhere, create entries in a database, and so on.  This is not what I was looking for, since I just wanted to be able to quickly search inside my files.
  • The Full-Text Index solutions, which analyze a given set of folders, create indexes of their contents, and allow you to search inside them.

I was obviously looking for the latter.  The solution had to run, preferably, on Linux and be compatible with containers (which pretty much everything is nowadays).
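
To make the difference concrete, here is a toy sketch of the full-text index idea in plain shell: build the index once, then answer queries from it without re-reading the documents.  It only handles plain .txt files, and the function names (build_index, search_index) are made up for illustration; real indexers also extract text from PDFs and office formats, stem words and rank results.

```shell
# Toy inverted index: one "word<TAB>path" line per distinct word per file.
build_index() {
    # $1: directory holding plain-text files, $2: index file to write
    find "$1" -type f -name '*.txt' | while IFS= read -r file; do
        # Split the file into lowercase words, one per line, then
        # emit each distinct word once together with the file path.
        tr -cs '[:alnum:]' '\n' < "$file" | tr '[:upper:]' '[:lower:]' |
            awk -v f="$file" 'NF && !seen[$0]++ { print $0 "\t" f }'
    done | sort > "$2"
}

# Print the path of every indexed file that contains the given word.
search_index() {
    # $1: index file, $2: word to look up (case-insensitive)
    word=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
    awk -F '\t' -v w="$word" '$1 == w { print $2 }' "$1"
}
```

A search is then a lookup in one file instead of a scan of every document, which is why the index costs disk space and indexing time up front.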

Two solutions came up as interesting:

  • The Ambar Document Search Engine, which appears in several OSS lists;
  • Recoll, which is more aimed at providing a single-user, GUI-based full-text index of your documents.  It also has a web GUI, so that did seem interesting.

I initially set my sights on Ambar.  On paper, it did everything I wanted and even more.  Then I opened the documentation and looked at the requirements:

Ambar requirements

Operating System: 64-bit Unix system (CentOS recommended)
CPU: 2xCPU (If you have a lot of documents to OCR, please use high-perfomance CPU)
RAM: 8GB (If you have <8 GB of RAM, Ambar will crash due to low memory exceptions)
HDD: Only SSD, slow disk drives will dramatically decrease perfomance. Ambar’s index will take up to 30% of raw documents size

Oï!  That’s half the server just for a full-text index.  Nope, sorry.  I’m not part of the right gang for this solution.

And so I took a deeper look at Recoll, and I liked what I found.  And I ended up giving a boost to the server because, you know, the Ambar people know what an index means in terms of performance.  But we’ll discuss that later.

What a glorious logo

By the way, recoll uses a Xapian database, whatever the hell that means.  I haven’t read the docs about the internals yet, but in practice I’ve found it works pretty well with a limited amount of resources.

Deploying Recoll

I usually run stuff inside Debian containers; it makes my life easier to be consistent with the host.

Make your files available to the container through whatever method works for you.  In my case, I expose relevant NFS shares from the NAS and mount them wherever necessary in my systems.

So, here, I just add what’s needed into /etc/fstab:

eviseur@Memoria:~$ cat /etc/fstab
# UNCONFIGURED FSTAB FOR BASE SYSTEM
daguerreo.nas.ev1z.be:/volume1/docs /srv/docs nfs ro 0 0
daguerreo.nas.ev1z.be:/volume1/media /srv/media nfs ro 0 0

You’ve guessed it, Memoria is the container holding recoll and Daguerreo is the NAS.  We’ve got some form of theme in the naming going on here, guys.

Someone will probably point out in the comments that’s stupid, because I could just mount the share in the Proxmox host and expose it through LXC.  I know.  I just wasn’t bothered to change everything yet.

Anyway, I’m digressing.

The recoll author is kind enough to provide repositories with up to date packages.

Add them to your systems, and then it’s install time!  We will deploy several packs of software:

  • Apache, to serve the Web GUI later on;
  • recoll itself, which will build the index that the Web GUI can then consume for you;
  • tesseract, which is an open-source OCR (Optical Character Recognition) tool.  It will be used for images and scanned PDF’s.  The characters in the picture will be recognised and included in the recoll index, so that this content also becomes searchable.
  • A lot of small tools that recoll can call upon to look into various file formats.

# apt install tesseract-ocr tesseract-ocr-all poppler-utils netpbm imagemagick python-recoll python3-recoll unrtf groff ghostscript antiword psutils recollcmd libapache2-mod-wsgi-py3 apache2 git djvulibre-bin wv untex lyx unrar catdvi
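
Since recoll shells out to these helpers, a quick sanity check after the install can save some head-scratching during the first indexing run.  The sketch below just tests what is on the PATH; the function name is made up, and the list of command names is my assumption of what the packages above provide.

```shell
# Print every helper from the argument list that is not on the PATH.
missing_helpers() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Assumed command names shipped by the packages installed above.
missing_helpers tesseract pdftotext antiword unrtf djvutxt catdvi untex unrar
```

No output means every listed helper was found.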

As a good practice, I have recoll run with a shell-less dedicated user:

# useradd -s /bin/false -d /var/lib/recoll/ -r -m -U recoll

Next, you want to create the .recoll folder, create the recoll configuration file and create the index for the first time:

# cd /var/lib/recoll
# mkdir .recoll
# vim .recoll/recoll.conf
topdirs = /srv/docs /srv/media/Artbooks /srv/media/BD /srv/media/eBooks
ocrprogs = tesseract
tesseractcmd = /usr/bin/tesseract
# chown -R recoll: ./
# sudo -u recoll -n -- recollindex -z

Of course, the topdirs will depend on your personal case.  The recoll documentation about its settings can be found in the docs and is self-explanatory.

Depending on how much you’ve given recoll to eat, that last step may take time.  You will see any errors that come up during indexing and can put your Google-fu to work to fix them.  Everything is pretty straightforward at this point.

Keeping the index up to date

There are two strategies to keep the recoll index in sync with the contents of your topdirs:

  • Use inotify.  This will keep watch on the file system and update the index in more or less realtime.
  • Use cron jobs.

The former solution is of course the quickest and may be suited when your indexed data changes a lot.  That was not my use case, and I did not like the idea of constantly monitoring the disks for nothing.
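
A periodic job does not have to re-read everything: it only needs to spot what changed since the last pass.  The sketch below illustrates the principle with find and a hypothetical stamp file.  It is not how recoll tracks changes internally (recollindex handles incremental updates by itself), just the reason a nightly run stays cheap when little has changed.

```shell
# List the files under a directory modified after a reference stamp file.
changed_since() {
    # $1: directory to scan, $2: stamp file left behind by the previous run
    find "$1" -type f -newer "$2"
}

# Hypothetical usage:
#   changed_since /srv/docs /var/tmp/last-index.stamp
#   touch /var/tmp/last-index.stamp
```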

So, I went with the second option and added a crontab entry to refresh the index every night:

45 1 * * * su -s /bin/sh recoll -c "/usr/bin/recollindex -k"
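
If the NFS shares are not mounted when the job fires, the indexer would be looking at empty directories.  A small guard in front of recollindex avoids depending on how it reacts to that.  This is a sketch under assumptions: the function names are hypothetical, and the directories in the usage comment are taken from my topdirs.

```shell
# Succeeds only if the directory exists and contains at least one entry.
dir_has_content() {
    [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

# Succeeds only if every directory given as argument has content.
topdirs_ready() {
    for dir in "$@"; do
        if ! dir_has_content "$dir"; then
            echo "not indexing: $dir is missing or empty" >&2
            return 1
        fi
    done
}

# Intended use as the command run for the recoll user:
#   topdirs_ready /srv/docs /srv/media/eBooks && /usr/bin/recollindex -k
```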

Getting the Web GUI to work

We now have an up-to-date index, but no Web GUI to use (and actually no GUI at all.  Remember, this is a headless container).

When I installed recoll, the documentation for the WebUI was piss poor and did not match the actual implementation.  But I’ll spare you the rant because it’s now reasonably well documented.

I personally went with the WSGI option so that I could also deploy SSL.

I cloned the git repo that contains the WebUI to /var/www/recollwebui and finalised the configuration according to the aforementioned documentation.

My resulting Apache vHost looks like this:

ServerName memoria.srv.ev1z.be
ServerAdmin ***@***
DocumentRoot /var/www/html

# Enable SSL
SSLEngine on
SSLCertificateFile /etc/ssl/certs/memoria.srv.ev1z.be.crt
SSLCertificateKeyFile /etc/ssl/private/memoria.srv.ev1z.be.pem

# Since the Python WSGI app lives on /recoll, redirect the root
# for ease of use
RedirectMatch ^/$ https://memoria.srv.ev1z.be/recoll/

# Serve the Recoll WebUI with the recoll user
WSGIDaemonProcess recoll user=recoll group=recoll \
  threads=1 processes=5 display-name=%{GROUP} \
  python-path=/var/www/recollwebui
WSGIScriptAlias /recoll /var/www/recollwebui/webui-wsgi.py

# Create Apache aliases to the folders that contain
# the indexed documents so that links are clickable
# inside the WebUI
Alias "/docs" "/srv/docs"
<Directory "/srv/docs">
  Require all granted
</Directory>

Alias "/artbooks" "/srv/media/Artbooks"
<Directory "/srv/media/Artbooks">
  Require all granted
</Directory>

Alias "/bd" "/srv/media/BD"
<Directory "/srv/media/BD">
  Require all granted
</Directory>

Alias "/ebooks" "/srv/media/eBooks"
<Directory "/srv/media/eBooks">
  Require all granted
</Directory>

<Directory /var/www/recollwebui>
  WSGIProcessGroup recoll
  Require all granted
</Directory>

ErrorLog ${APACHE_LOG_DIR}/error.log
CustomLog ${APACHE_LOG_DIR}/access.log combined
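
The Alias blocks above are what make result links clickable: each indexed directory is reachable under a URL prefix.  As a sketch of that mapping (the function name and the sample file name are hypothetical), the translation from an indexed path to a URL path looks like this:

```shell
# Translate an indexed file path into the URL path exposed by the
# Alias directives. Returns status 1 for paths outside the aliases.
path_to_url() {
    case "$1" in
        /srv/docs/*)           printf '/docs/%s\n'     "${1#/srv/docs/}" ;;
        /srv/media/Artbooks/*) printf '/artbooks/%s\n' "${1#/srv/media/Artbooks/}" ;;
        /srv/media/BD/*)       printf '/bd/%s\n'       "${1#/srv/media/BD/}" ;;
        /srv/media/eBooks/*)   printf '/ebooks/%s\n'   "${1#/srv/media/eBooks/}" ;;
        *) return 1 ;;
    esac
}

path_to_url /srv/docs/bank/2019-statement.pdf   # prints /docs/bank/2019-statement.pdf
```

Whatever you configure in the WebUI for document links has to agree with these prefixes, otherwise the links in the result list will 404.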

And voilà, you have a working index AND a Web GUI to use it.

Isn’t that absolutely marvellous?

Some figures

The resulting index eats a little over 6.5 GB for about 100 GB of input data of varying types.  I would say that’s not bad at all.
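
Put as arithmetic, and compared with the "up to 30%" that the Ambar requirements quote for its own index (the function name is just for illustration):

```shell
# Index size as a percentage of the indexed data (same unit for both).
index_ratio() {
    awk -v index_size="$1" -v data_size="$2" \
        'BEGIN { printf "%.1f\n", 100 * index_size / data_size }'
}

index_ratio 6.5 100    # my recoll index: prints 6.5
index_ratio 30 100     # Ambar's documented upper bound: prints 30.0
```

To get your own numbers, du -sh on the index directory does the job; assuming the default layout, that would be the xapiandb folder inside the .recoll directory.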

The container’s RAM use is almost nonexistent.

The tool may not have all the bells and whistles of Ambar and such, but it does what I wanted and it does so quite well.

Accepting that a long-considered upgrade is now due

Recoll did work from the get go, with acceptable performance.  However, the server quickly made its point about HDD’s being inadequate for this.

Every time the index was updated, and sometimes while I was performing searches, the IO wait and overall load of the physical server would go batsh*t crazy.  A quick investigation with iostat et al. revealed the HDD’s were the bottleneck.
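
If iostat is not at hand, the IO wait share can be estimated from two samples of the first line of /proc/stat on Linux.  This is a rough sketch rather than what I actually ran; the function name is made up, and it only counts the user-to-steal columns of the aggregate "cpu" line.

```shell
# Percentage of CPU time spent in iowait between two "cpu ..." lines
# of /proc/stat, given as $1 (earlier sample) and $2 (later sample).
iowait_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            # Fields 2..9: user nice system idle iowait irq softirq steal
            total = 0
            for (i = 2; i <= 9 && i <= NF; i++) total += $i
            t[NR] = total
            w[NR] = $6          # field 6 is iowait
        }
        END {
            d = t[2] - t[1]
            if (d > 0) printf "%.1f\n", 100 * (w[2] - w[1]) / d
            else print "0.0"
        }'
}

# Usage on the host while the index is being refreshed:
#   a=$(head -n 1 /proc/stat); sleep 5; b=$(head -n 1 /proc/stat)
#   iowait_pct "$a" "$b"
```

A value that stays high during indexing while the CPU itself is mostly idle points at the disks rather than at recoll.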

Well, the Ambar people did warn about it.  I had been considering the move for a long time, for heat, power draw and performance reasons.  You’ve guessed it…

SSD time!

The RAID1 array inside the server had to move to SSD’s.  Out of the 4 TB, I was barely using 1.2, so I waited for good deals on Amazon for 2TB SSD’s and made the switcheroo.

And everyone was happy, that day was deemed good and that’s the end of this story.
