
Search Engine


Cogniva Research Blog » How to Set up Solr and ManifoldCF on an Ubuntu Based Computer. This blog post is intended to provide some guidance on how to set up a computer to run Apache Solr and Apache ManifoldCF. Solr is a wrapper for Lucene.

Cogniva Research Blog » How to Set up Solr and ManifoldCF on an Ubuntu Based Computer

It provides a web UI and a variety of features such as document text extraction (via Apache Tika). ManifoldCF is a utility for scheduling jobs and providing repository connectors. We have used it to import documents from both Windows (CIFS) file shares and MS SharePoint 2010 into Solr. This guide was written while installing and configuring Solr and ManifoldCF on a VirtualBox virtual machine running Linux Mint 15 (MATE) x64. I chose Linux Mint because it is a “hot” GNU/Linux distribution these days. These instructions can also be used to install and configure Solr and ManifoldCF on Ubuntu. You just need to be aware that the standard text editor on MATE is pluma and on GNOME it is gedit.

The development of this guide was a joint effort of Chris Salter and myself. Intro To Search API (Part 1) - How To Create Search Pages. The core Search module in Drupal 7 is great for simple search pages; however, its configuration options are fairly limited.

Intro To Search API (Part 1) - How To Create Search Pages

If you want to change the look and feel of the search results, you could use the Display Suite Search sub-module that ships with Display Suite. You can go one step further and create a custom search page using just Views. All you need to do is create a page display and expose the Search: Search Terms filter, and you're done. But the filter still relies on the index data that the Search module creates. If you want flexibility and control over your search pages, then you should take a serious look at the Search API module. In this tutorial, we'll create a custom search page using Views and Search API for the results. Getting Started. Before we can begin, download Search API, Search API Database Search, Entity API and Views.

If you use Drush, run the following command: Crawl and index files and directories. Crawl and index directories and files from your filesystem.

Crawl and index files and directories

If you use Linux, that means you can crawl whatever can be mounted on Linux (a hard disk or partitions formatted with FAT, ext3 or ext4, a file server connected via ntfs, shares like SMB, or even sshfs or sftp on servers) into Apache Solr. Integrates automatic text recognition (OCR) for images and photos (i.e. files like PNG, JPG, GIF ...) or inside PDFs (i.e. scanned documents) using Tesseract-OCR.

Usage. Index a file. Using the web admin interface: open the page Files, enter the filename in the form, and press the button "crawl". Using the command line: solr-index-file filename. Using the REST-API: Index directories. Open the page Files, enter the directory name in the form, and press the button "crawl".

Connecting Drupal to Solr Server. Happily, Solr also plays nicely with Drupal.

Connecting Drupal to Solr Server

So my colleagues want to connect their Drupal application to the installed Solr server (see my last blog entry). Drupal provides two modules for Solr server integration. The modules link an external Solr server with the Drupal application, passing data into Solr to index, and then enabling Drupal to serve up the search results. As of the writing of this how-to, each of the two modules has an advantage and a disadvantage. The first is Search API Solr search, with the advantage that you can use the Solr index directly in Views. Whichever module we choose, the installation of the configuration files is the same. Uploading Data with Solr Cell using Apache Tika - Apache Solr Reference Guide. Solr uses code from the Apache Tika project to provide a framework for incorporating many different file-format parsers such as Apache PDFBox and Apache POI into Solr itself.

Uploading Data with Solr Cell using Apache Tika - Apache Solr Reference Guide

Working with this framework, Solr's ExtractingRequestHandler can use Tika to support uploading binary files, including files in popular formats such as Word and PDF, for data extraction and indexing. When this framework was under development, it was called the Solr Content Extraction Library or CEL; from that abbreviation came this framework's name: Solr Cell. If you want to supply your own ContentHandler for Solr to use, you can extend the ExtractingRequestHandler and override the createFactory() method. This factory is responsible for constructing the SolrContentHandler that interacts with Tika, and allows literals to override Tika-parsed values.
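As a rough illustration of how a client might address the ExtractingRequestHandler, the sketch below only builds a request URL; it does not contact a server. It assumes the handler is registered at the path /update/extract (as in the stock example configuration) and that the Solr base URL is http://localhost:8983/solr; both are assumptions, as is the helper name build_extract_url. The literal.id and literalsOverride parameters are the ones described in the reference guide.

```python
from urllib.parse import urlencode


def build_extract_url(base_url, doc_id, literals_override=True, commit=True):
    """Build a URL for a Solr Cell extraction request (hypothetical helper).

    Assumes the ExtractingRequestHandler is mapped to /update/extract,
    which depends on the solrconfig.xml of the target installation.
    """
    params = {
        # literal.<field> supplies a field value directly instead of via Tika
        "literal.id": doc_id,
        # false appends Tika-parsed values to literals instead of overriding
        "literalsOverride": "true" if literals_override else "false",
        "commit": "true" if commit else "false",
    }
    return base_url.rstrip("/") + "/update/extract?" + urlencode(params)


if __name__ == "__main__":
    # Assumed local Solr base URL, for illustration only
    print(build_extract_url("http://localhost:8983/solr", "doc1"))
```

The binary file itself would then be sent as the body of a POST to that URL, for example with an HTTP client of your choice.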

Set the parameter literalsOverride, which normally defaults to true, to false to append Tika-parsed values to literal values. Topics covered in this section: Key Concepts. Solr Reference Guide - Apache Solr Reference Guide. Install Solr on Tomcat. Solr has been tested on Tomcat 5.5, 6, and 7.

Install Solr on Tomcat

In Tomcat 7 there was a bug with resolving URLs ending in "/". This should be fixed in Tomcat 7.0.5+; see SOLR-2022 for full details. See the instructions in the generic Solr installation page for general info before consulting this page. Simple Example Install. Solr 4.3 requires a completely different deployment. Though this page needs to be completely rewritten for the latest Solr version, here are the main differences with Solr 4.3 (at least for running a single instance).

Java 1.7 is required. The JAR files from the Solr lib/ext directory (something like /opt/solr/example/lib/ext) must be copied to $CATALINA_HOME/lib/. The log4j.properties file from the resources directory (something like /opt/solr/example/resources) must be copied to $CATALINA_HOME/lib/.

Installing Tomcat 6. Apache Tomcat is a web application server for Java servlets.
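The two Solr 4.3 copy steps listed above could be scripted roughly as follows. This is a minimal sketch, not part of the original instructions: the function name copy_solr_support_files is hypothetical, and the example paths in the usage block (/opt/solr/example, /usr/share/tomcat6) are assumptions that will differ between installations.

```python
import shutil
from pathlib import Path


def copy_solr_support_files(solr_example_dir, catalina_home):
    """Copy Solr's logging JARs and log4j.properties into Tomcat's lib/.

    Returns the names of the files that were copied, in copy order.
    """
    example = Path(solr_example_dir)
    target = Path(catalina_home) / "lib"
    target.mkdir(parents=True, exist_ok=True)

    copied = []
    # Step 1: every JAR from <example>/lib/ext goes to $CATALINA_HOME/lib/
    for jar in sorted((example / "lib" / "ext").glob("*.jar")):
        shutil.copy2(jar, target / jar.name)
        copied.append(jar.name)

    # Step 2: log4j.properties from <example>/resources goes to the same place
    log4j = example / "resources" / "log4j.properties"
    if log4j.is_file():
        shutil.copy2(log4j, target / log4j.name)
        copied.append(log4j.name)
    return copied


if __name__ == "__main__":
    # Assumed locations; adjust both to the actual installation
    print(copy_solr_support_files("/opt/solr/example", "/usr/share/tomcat6"))
```

Writing into Tomcat's lib/ directory usually requires elevated privileges, so the script would need to run with the same rights you would use for a manual copy.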

Carrot