Fscrawler – File System Crawl & Indexing Library

Shaharia Azam

3 years ago

File system crawler and indexing library

Have you ever wanted to index your entire filesystem in a database, including file metadata and contents? Or have you run into a problem that required searching and querying documents across a large filesystem? It is a common experience, and chances are you have faced it too. For crawling a filesystem and indexing every file, along with its metadata and contents, fscrawler is a fantastic library. It is already very popular among system administrators, DevOps teams, and anyone who manages IT infrastructure.

So let's talk about what exactly fscrawler is.

What is fscrawler?

The name gives away its purpose: fs (file system) plus crawler (watch for changes, crawl recursively) makes fscrawler. It is an open source library that is actively maintained in its GitHub repository. It is already popular, as you can tell from the activity in its GitHub issues and open pull requests.

Another important detail: David Pilato is the owner of this library, and he works at Elastic, the company behind Elasticsearch. I will explain later why that matters for a library.

Feature – crawling & indexing file system

This is the primary feature of fscrawler: it crawls your filesystem, watches for changes, and indexes file metadata and contents in Elasticsearch, so you can search efficiently across your entire filesystem.
With fscrawler, you can –

Set how frequently your filesystem is watched
Configure a custom directory, so only that directory is watched and crawled at a regular interval
Exclude or include files based on patterns
Extract content from PDF and Docs files and make it indexable
OCR integration
Index into Elasticsearch
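
To make the include/exclude idea concrete, here is a minimal Python sketch of pattern-based filtering. It is not fscrawler's own code (fscrawler is written in Java). The function name should_index and the rule that an exclude match always wins are assumptions made for illustration, and the sketch matches bare file names only.

```python
from fnmatch import fnmatch


def should_index(filename, includes=None, excludes=None):
    """Return True if filename passes the include/exclude patterns.

    Assumed semantics (not taken from fscrawler's source): an exclude
    match always wins, and an empty include list means "include all".
    """
    name = filename.lower()
    # Reject the file as soon as any exclude pattern matches.
    if excludes and any(fnmatch(name, p.lower()) for p in excludes):
        return False
    # With include patterns present, at least one of them must match.
    if includes:
        return any(fnmatch(name, p.lower()) for p in includes)
    return True


files = ["report.doc", "budget.xls", "resume.doc", "photo.png"]
kept = [f for f in files if should_index(f, ["*.doc", "*.xls"], ["resume.doc"])]
print(kept)
```

Here resume.doc is dropped because of the exclude pattern, and photo.png is dropped because it matches no include pattern.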

Here is an example configuration file, so you can see how flexible this tool can be. One thing I forgot to mention: it is a Java library, and you can run it from the command line.

name: "job_name"
fs:
  url: "/path/to/docs"
  update_rate: "5m"
  includes:
  - "*.doc"
  - "*.xls"
  excludes:
  - "resume.doc"
  json_support: false
  filename_as_id: true
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: true
  index_content: true
  indexed_chars: "10000.0"
  attributes_support: false
  raw_metadata: true
  xml_support: false
  index_folders: true
  lang_detect: false
  continue_on_error: false
  pdf_ocr: true
  ocr:
    language: "eng"
    path: "/path/to/tesseract/if/not/available/in/PATH"
    data_path: "/path/to/tesseract/tessdata/if/needed"
server:
  hostname: "localhost"
  port: 22
  username: "dadoonet"
  password: "password"
  protocol: "SSH"
  pem_path: "/path/to/pemfile"
elasticsearch:
  nodes:
  # With Cloud ID
  - cloud_id: "CLOUD_ID"
  # With URL
  - url: "http://127.0.0.1:9200"
  index: "docs"
  bulk_size: 1000
  flush_interval: "5s"
  byte_size: "10mb"
  username: "elastic"
  password: "password"
rest:
  url: "https://127.0.0.1:8080/fscrawler"

That should give you a good idea of what it can do.
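
Once a job like this has run, the indexed documents can be searched through the regular Elasticsearch search API. The Python sketch below only builds the request. It assumes the index name docs and the node URL from the configuration above, and it assumes the extracted text lives in a field called content with file details under file.filename and path.real. Check your own index mapping before relying on those names. The helper names build_search_request and search are hypothetical, and authentication is left out.

```python
import json
import urllib.request


def build_search_request(term, index="docs",
                         base_url="http://127.0.0.1:9200", size=5):
    """Build a POST request for a full-text match query.

    The field names "content", "file.filename" and "path.real" are
    assumptions about how the documents were indexed.
    """
    body = {
        "size": size,
        "query": {"match": {"content": term}},
        "_source": ["file.filename", "path.real"],
    }
    return urllib.request.Request(
        url=f"{base_url}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(term, **kwargs):
    """Send the request and return the hits (needs a reachable cluster)."""
    with urllib.request.urlopen(build_search_request(term, **kwargs)) as resp:
        return json.load(resp)["hits"]["hits"]


request = build_search_request("invoice")
print(request.get_method(), request.full_url)
```

Calling search("invoice") would send the query. It is kept separate so the request can be inspected without a running cluster.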

Development platform & Stack

Written in Java
Indexing on Elasticsearch
Can be run via command line
Docker image is also available

Installation & Documentation

You can read the full fscrawler documentation here. You can browse the GitHub repository at https://github.com/dadoonet/fscrawler

Note: I am also a contributor to fscrawler.