Have you ever thought about indexing your entire filesystem in a database, along with each file's metadata and contents? Or have you faced a problem that required you to search and run queries to find documents or content across a large filesystem? It's a common experience, and chances are you have run into it too. For crawling a filesystem and indexing all of its files, their metadata, and their contents, fscrawler is a fantastic library, and it's already popular among system administrators, DevOps teams, and anyone who manages IT infrastructure.
So let's talk about exactly what fscrawler is.
What is fscrawler?
As the name suggests: fs (filesystem) + crawler (watch for changes, crawl recursively) = fscrawler. It's an open-source library actively maintained in its GitHub repository, and it's already quite popular; a glance at its GitHub issues, open PRs, and so on will show you that.
Another important piece of information: David Pilato, the owner of this library, works at Elastic, the company behind Elasticsearch. I will explain later why that matters for a library.
Feature – crawling & indexing the filesystem
This is the primary feature of fscrawler: it crawls your filesystem, watches for changes, and indexes file metadata and contents in Elasticsearch, so you can search your entire filesystem efficiently.
With fscrawler, you can –
- set the frequency at which it watches your filesystem
- configure custom directories, so it only watches and crawls those directories at a regular interval
- exclude/include files based on patterns
- extract text from PDF and Office documents and make it indexable
- integrate OCR (via Tesseract)
- index into Elasticsearch
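To make the include/exclude idea concrete, here is a small sketch in Python. This is not fscrawler's actual implementation (fscrawler is written in Java and has its own matching rules); it just demonstrates glob-style include/exclude filtering like the patterns used in fscrawler's settings:

```python
from fnmatch import fnmatch

def should_index(filename, includes, excludes):
    """Return True if a file matches the include patterns and none of the excludes.

    Illustrative sketch only: fscrawler itself is Java and may match
    differently; this just shows the general include/exclude idea.
    """
    # Excludes win over includes.
    if any(fnmatch(filename, pattern) for pattern in excludes):
        return False
    # With no include patterns, everything not excluded is indexed.
    return not includes or any(fnmatch(filename, pattern) for pattern in includes)

files = ["report.doc", "data.xls", "resume.doc", "notes.txt"]
matched = [f for f in files
           if should_index(f, includes=["*.doc", "*.xls"], excludes=["resume.doc"])]
print(matched)  # -> ['report.doc', 'data.xls']
```

With the patterns above, `resume.doc` is dropped by the exclude rule even though it matches `*.doc`, and `notes.txt` is dropped because it matches no include pattern.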
Here is an example configuration file, so you can see how flexible this tool can be. Oh, one thing I forgot to mention: fscrawler is a Java application, and you can run it from the command line.
```yaml
name: "job_name"
fs:
  url: "/path/to/docs"
  update_rate: "5m"
  includes:
    - "*.doc"
    - "*.xls"
  excludes:
    - "resume.doc"
  json_support: false
  filename_as_id: true
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: true
  index_content: true
  indexed_chars: "10000.0"
  attributes_support: false
  raw_metadata: true
  xml_support: false
  index_folders: true
  lang_detect: false
  continue_on_error: false
  pdf_ocr: true
  ocr:
    language: "eng"
    path: "/path/to/tesseract/if/not/available/in/PATH"
    data_path: "/path/to/tesseract/tessdata/if/needed"
server:
  hostname: "localhost"
  port: 22
  username: "dadoonet"
  password: "password"
  protocol: "SSH"
  pem_path: "/path/to/pemfile"
elasticsearch:
  nodes:
    # With Cloud ID
    - cloud_id: "CLOUD_ID"
    # With URL
    - url: "http://127.0.0.1:9200"
  index: "docs"
  bulk_size: 1000
  flush_interval: "5s"
  byte_size: "10mb"
  username: "elastic"
  password: "password"
rest:
  url: "https://127.0.0.1:8080/fscrawler"
```
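Once documents are indexed, you search them like any other Elasticsearch index. As a sketch, a full-text query body against the `docs` index from the sample configuration could be built like this. Note that the field names (`content`, `file.filename`, `file.extension`, `path.real`) are assumptions based on fscrawler's default document mapping; verify them against your own index before relying on them:

```python
import json

# Hypothetical search body for the "docs" index from the sample configuration.
# Field names assume fscrawler's default mapping; check your index's mapping.
query = {
    "query": {
        "bool": {
            # Full-text match on the extracted file contents.
            "must": [{"match": {"content": "quarterly revenue"}}],
            # Restrict results to .doc files via the file metadata.
            "filter": [{"term": {"file.extension": "doc"}}],
        }
    },
    # Return only the filename and real path, not the whole document.
    "_source": ["file.filename", "path.real"],
}

# This JSON body would be POSTed to http://127.0.0.1:9200/docs/_search
print(json.dumps(query, indent=2))
```

You could send this body with curl or any Elasticsearch client; the hits would give you the matching files and where they live on disk.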
By now I think you have a good picture of what it can do.
Development platform & Stack
- Written in Java
- Indexing on Elasticsearch
- Can be run from the command line
- Docker image is also available
Installation & Documentation
Note: I am also a contributor to fscrawler.