日期:2014-05-16  浏览次数:20406 次

Solr 创建索引 From DataBase

The Data Import Handler Framework

Solr includes a very popular contrib module for importing data known as the DataImportHandler (DIH in short). It's a data processing pipeline built specificallyfor Solr. Here's a summary of notable capabilities:

???? ?Imports data from databases through JDBC (Java Database Connectivity)
??? ?° Supports importing only changed records, assuming a last-updated date
???? ?Imports data from a URL (HTTP GET)
???? ?Imports data from files (that is it crawls files)
???? ?Imports e-mail from an IMAP server, including attachments
???? ?Supports combining data from different sources
???? ?Extracts text and metadata from rich document formats
???? ?Applies XSLT transformations and XPath extraction on XML data
???? ?Includes a diagnostic/development tool

?

?

The DIH is not considered a core part of Solr, even though it comes with the Solr download, and so you must add its Java JAR files to your Solr setup to use it. If this isn't done, you'll eventually see a ClassNotFoundException error. The DIH's JAR files are located in Solr's dist directory: apache-solr-dataimporthandler-3.4.0.jar and apache-solr-dataimporthandler-extras-3.4.0.jar. The easiest way to add JAR files to a Solr configuration is to copy them to the <solr_home>/lib directory; you may need to create it. Another method is to reference them from solrconfig.xml via <lib/> tags—see Solr's example configuration for examples of that. You will most likely need some additional JAR files as well. If you'll be communicating with a database, then you'll need to get a JDBC driver for it. If you will be extracting text from various document formats then you'll need to add the JARs in /contrib/extraction/lib. Finally, if you'll be indexing e-mail then you'll need to add the JARs in /contrib /dataimporthandler/lib.

?

?

The DIH needs to be registered with Solr in solrconfig.xml like so:

<requestHandler name="/dih_artists_jdbc" 
      class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
        <str name="config">mb-dih-artists-jdbc.xml</str>
    </lst>
</requestHandler>

?This reference mb-dih-artists-jdbc.xml is located in <solr-home>/conf, which specifies the details of a data importing process. We'll get to that file in a bit.

?

DIHQuickStart

http://wiki.apache.org/solr/DIHQuickStart

?

Index a DB table directly into Solr

Step 1 : Edit your solrconfig.xml to add the request handler

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
  <str name="config">data-config.xml</str>
</lst>
</requestHandler>

?

?Step 2 : Create a data-config.xml file as follows and save it to the conf dir

<dataConfig>
  <dataSource type="JdbcDataSource" 
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/dbname" 
              user="user-name" 
              password="password"/>
  <document>
    <entity name="id" 
            query="select id,name,desc from mytable">
    </entity>
  </document>
</dataConfig>

?

Step 3 : Ensure that your solr schema (schema.xml) has the fields 'id', 'name', 'desc'. Change the appropriate details in the data-config.xml

?

Step 4: Drop your JDBC driver jar file into the <solr-home>/lib directory .

?

Step 5 : Run the command

http://solr-host:port/solr/dataimport?command=full-import .

Keep in mind that every time a full-import is executed the index is