Have you ever thought about indexing your entire filesystem in a database, along with each file’s metadata and contents? Or have you faced a problem that required you to search and run queries to find documents or content across a large filesystem? It’s a common experience, and chances are you have faced it too. For crawling a filesystem and indexing all the files, their metadata, and their contents, fscrawler is a fantastic library, and it’s already very popular among system administrators, DevOps teams, and anyone who manages IT infrastructure.
So let’s talk about exactly what fscrawler is.
What is fscrawler?
From the name, you can probably guess its purpose: fs (file system) + crawler (watch for changes, crawl recursively) = fscrawler. It’s an open source library actively maintained in its GitHub repository, and it’s already quite popular; you will notice that if you look at the GitHub issues, open PRs, and so on.
Another important piece of information: David Pilato is the owner of this library, and he works at Elastic, the company behind Elasticsearch. I will explain later why that matters for a library.
Feature – crawling & indexing the file system
This is the primary feature of fscrawler: it crawls your filesystem, watches for changes, and indexes file metadata and contents into Elasticsearch, so you can search your entire filesystem efficiently.
With fscrawler, you can –
- set the frequency at which your filesystem is watched
- point it at a specific directory, so it will only watch and crawl that directory at a regular interval
- include/exclude files based on patterns
- extract content from PDF, Word and other document formats and make it indexable
- integrate OCR (Tesseract) for scanned documents
- index everything into Elasticsearch, so it can be searched as in the sketch below
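Just to make the last point concrete, here is a rough sketch of what searching the indexed documents could look like. The index name docs, the local Elasticsearch address, and the content field are assumptions for illustration and depend on your own settings:

# Full-text search over the extracted file contents
# (assumes the "docs" index from the configuration example below)
curl -s "http://127.0.0.1:9200/docs/_search?pretty" \
  -H "Content-Type: application/json" \
  -d '{ "query": { "match": { "content": "quarterly report" } } }'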
Here is an example configuration file, so you can see how flexible this tool can be. Oh, one thing I forgot to mention: it’s written in Java, and you can run it from the command line.
name: "job_name"
fs:
url: "/path/to/docs"
update_rate: "5m"
includes:
- "*.doc"
- "*.xls"
excludes:
- "resume.doc"
json_support: false
filename_as_id: true
add_filesize: true
remove_deleted: true
add_as_inner_object: false
store_source: true
index_content: true
indexed_chars: "10000.0"
attributes_support: false
raw_metadata: true
xml_support: false
index_folders: true
lang_detect: false
continue_on_error: false
pdf_ocr: true
ocr:
language: "eng"
path: "/path/to/tesseract/if/not/available/in/PATH"
data_path: "/path/to/tesseract/tessdata/if/needed"
server:
hostname: "localhost"
port: 22
username: "dadoonet"
password: "password"
protocol: "SSH"
pem_path: "/path/to/pemfile"
elasticsearch:
nodes:
# With Cloud ID
- cloud_id: "CLOUD_ID"
# With URL
- url: "http://127.0.0.1:9200"
index: "docs"
bulk_size: 1000
flush_interval: "5s"
byte_size: "10mb"
username: "elastic"
password: "password"
rest:
url: "https://127.0.0.1:8080/fscrawler"
Now I think you have a good idea of what it can do.
Development platform & Stack
- Written in Java
- Indexing on Elasticsearch
- Can be run via command line
- Docker image is also available
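As a sketch of the Docker route (the dadoonet/fscrawler image and the /tmp/es mount point follow the project’s Docker instructions; your local paths and job name will differ):

# Mount the fscrawler config directory and the documents to crawl, then start the job
docker run -it --rm \
  -v ~/.fscrawler:/root/.fscrawler \
  -v /path/to/docs:/tmp/es:ro \
  dadoonet/fscrawler fscrawler job_name

With this layout, the fs.url in the settings file should point to /tmp/es, since that is where the documents appear inside the container.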
Installation & Documentation
You can read the full fscrawler documentation online, and you can browse the GitHub repository at https://github.com/dadoonet/fscrawler
Note: I am also a contributor to fscrawler.