User: 陈岳峰

Lucene in Action

Meet Lucene

recall metrics

Recall measures how well the search system finds relevant documents.

precision metrics

Precision measures how well the system filters out irrelevant documents.
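As a rough illustration of the two metrics, here is a minimal sketch that computes them over sets of document ids. The class name RetrievalMetrics and the convention of returning 0.0 for an empty set are assumptions made for this example; nothing here is part of Lucene.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy helper for precision and recall over sets of document ids (hypothetical, not a Lucene class). */
final class RetrievalMetrics {
    private RetrievalMetrics() {}

    /** Fraction of the relevant documents that the search actually returned. */
    static double recall(Set<Integer> retrieved, Set<Integer> relevant) {
        if (relevant.isEmpty()) {
            return 0.0; // convention chosen for this sketch
        }
        return (double) overlap(retrieved, relevant) / relevant.size();
    }

    /** Fraction of the returned documents that are relevant. */
    static double precision(Set<Integer> retrieved, Set<Integer> relevant) {
        if (retrieved.isEmpty()) {
            return 0.0; // convention chosen for this sketch
        }
        return (double) overlap(retrieved, relevant) / retrieved.size();
    }

    /** Number of ids present in both sets. */
    private static int overlap(Set<Integer> a, Set<Integer> b) {
        Set<Integer> both = new HashSet<>(a);
        both.retainAll(b);
        return both.size();
    }
}
```

If a search returns documents {1, 2, 3, 4} and the relevant documents are {3, 4, 5, 6, 7, 8, 9, 10}, two of the four results are relevant (precision 0.5) and two of the eight relevant documents were found (recall 0.25).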

Understanding the core indexing classes

IndexWriter

Analyzer

Document

Field

Keyword

Not analyzed (not tokenized), but indexed and stored in the index verbatim. Keyword is suitable for values that should be kept whole, such as names, file paths, URLs, and telephone numbers.

UnIndexed

Neither analyzed nor indexed, but stored in the index. This field type is suitable for values that people don't search on but that should appear in the search results. Because the value is stored in the index, it should not be too large.

UnStored

Analyzed and indexed, but not stored. Suitable for large bodies of text such as HTML bodies or DOC documents.

Text

Analyzed and indexed. This implies that fields of this type can be searched against. If the data indexed is a String, it’s also stored; but if the data (as in our Indexer example) is from a Reader, it isn’t stored.
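The four field kinds above differ only in three yes/no properties, which can be summarized in a small model. This is a sketch for comparison only: the enum FieldKind and its method names are hypothetical and are not Lucene's API.

```java
/** Toy summary of the four field kinds described in these notes (hypothetical, not Lucene's API). */
enum FieldKind {
    KEYWORD(false, true, true),    // kept whole, searchable, returned with results
    UNINDEXED(false, false, true), // not searchable, only returned with results
    UNSTORED(true, true, false),   // searchable, but the original text is not kept
    TEXT(true, true, true);        // stored only when built from a String, not from a Reader

    final boolean analyzed;
    final boolean indexed;
    final boolean stored;

    FieldKind(boolean analyzed, boolean indexed, boolean stored) {
        this.analyzed = analyzed;
        this.indexed = indexed;
        this.stored = stored;
    }

    /** A field can be searched only if it is indexed. */
    boolean searchable() {
        return indexed;
    }

    /** A field's original value can be shown in results only if it is stored. */
    boolean retrievable() {
        return stored;
    }
}
```

Read this way, UnIndexed is the only kind that cannot be searched, and UnStored is the only kind whose value never comes back with a result.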

Understanding the core searching classes

IndexSearcher

IndexSearcher is to searching what IndexWriter is to indexing. You can think of it as a class that opens an index in read-only mode. It offers several search methods; the simplest takes a single Query object as a parameter and returns a Hits object. A typical use of this method looks like this:

IndexSearcher is = new IndexSearcher(FSDirectory.getDirectory("/tmp/index", false));
Query q = new TermQuery(new Term("contents", "lucene"));
Hits hits = is.search(q);

Term

Similar to the Field object, it consists of a pair of string elements: the name of the field and the value of that field. Terms are also involved in indexing, but there they are created by Lucene's internals. During searching, you can construct Term objects yourself, like this:

Query q = new TermQuery(new Term("contents", "lucene"));
Hits hits = is.search(q);

Query

Query is the abstract base class for queries. The most basic concrete Query is TermQuery. Other Query types are BooleanQuery, PhraseQuery, PrefixQuery, PhrasePrefixQuery, RangeQuery, FilteredQuery, and SpanQuery.

TermQuery

It's used for matching documents that contain fields with specific values

Hits

Hits instances don't load all documents that match a query from the index, but only a small portion of them at a time.

Indexing

Conceptual document model

Conceptual index model
Performing basic index operations
Boosting Documents and Fields during indexing
Indexing dates, numbers, and Fields for use in sorting search results
Using parameters that affect Lucene's indexing performance and resource consumption
Optimizing indexes
Understanding concurrency, multithreading, and locking issues
Advanced indexing functions

Documents and Fields

three things Lucene can do with each field

The field may be indexed or not. Only text fields may be indexed (binary-valued fields may only be stored). If a field is indexed, tokens are derived from its value by analysis, and those tokens are then indexed.

If it is indexed, the field may also optionally store term vectors, which are really a miniature inverted index for that one field, allowing you to retrieve all tokens for that field.

The field's value may be stored: a verbatim copy of the unanalyzed value is kept in the index so that it can later be retrieved.

Flexible Schema

Denormalization

Understanding the indexing process

Extracting text and creating the document

Extract the text from whatever format the information is in and put it into a Document that Lucene can deal with. Once you have the text you'd like to index, and you've created a Document with all the Fields you'd like to index, all text must then be analyzed.

Analysis

The original source of tokens, followed by the series of filters that modify the tokens produced by that source, together make up the Analyzer.
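The "token source plus a chain of filters" idea can be sketched in plain Java. This is a toy model under assumptions made for this example: the class ToyAnalyzer, its methods, and the stop-word list are hypothetical, and it does not use Lucene's own Analyzer, Tokenizer, or TokenFilter classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.UnaryOperator;

/** Toy analyzer: one token source followed by an ordered chain of filters (hypothetical). */
final class ToyAnalyzer {
    private final List<UnaryOperator<List<String>>> filters = new ArrayList<>();

    /** Appends a filter to the end of the chain and returns this analyzer for chaining. */
    ToyAnalyzer addFilter(UnaryOperator<List<String>> filter) {
        filters.add(filter);
        return this;
    }

    /** Token source: splits on runs of characters that are neither letters nor digits. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Filter: lowercases every token. */
    static List<String> lowerCase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(t.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    /** Filter factory: drops every token that appears in the given stop-word list. */
    static UnaryOperator<List<String>> removeStopWords(String... stopWords) {
        Set<String> stop = new HashSet<>(Arrays.asList(stopWords));
        return tokens -> {
            List<String> out = new ArrayList<>();
            for (String t : tokens) {
                if (!stop.contains(t)) {
                    out.add(t);
                }
            }
            return out;
        };
    }

    /** Runs the token source, then each filter in the order it was added. */
    List<String> analyze(String text) {
        List<String> tokens = tokenize(text);
        for (UnaryOperator<List<String>> filter : filters) {
            tokens = filter.apply(tokens);
        }
        return tokens;
    }
}
```

For example, new ToyAnalyzer().addFilter(ToyAnalyzer::lowerCase).addFilter(ToyAnalyzer.removeStopWords("the", "in")).analyze("The Lucene in Action book") yields the tokens lucene, action, book. Filter order matters here: the stop-word filter only catches "The" because lowercasing runs first.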

Index writing and files

inverted structure
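An inverted structure maps each term to the documents that contain it, rather than mapping each document to its terms. The following is a minimal in-memory sketch of that idea; the class ToyInvertedIndex and its simple lowercase-and-split tokenization are assumptions for illustration and say nothing about Lucene's on-disk format.

```java
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy in-memory inverted index: term -> sorted ids of documents containing it (hypothetical). */
final class ToyInvertedIndex {
    private final Map<String, SortedSet<Integer>> postings = new TreeMap<>();
    private int nextDocId = 0;

    /** Tokenizes the text naively, records each term under a new document id, and returns that id. */
    int addDocument(String text) {
        int docId = nextDocId++;
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (token.isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(token, k -> new TreeSet<Integer>()).add(docId);
        }
        return docId;
    }

    /** Returns the ids of all documents containing the exact term, or an empty set. */
    SortedSet<Integer> search(String term) {
        SortedSet<Integer> hits = postings.get(term);
        return hits == null ? new TreeSet<Integer>() : hits;
    }
}
```

A lookup is then a single map access on the term, which is why searching does not need to scan the documents themselves.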

Index Segments

Each segment, in turn, consists of multiple files, of the form _X.<ext>, where X is the segment’s name and <ext> is the extension that identifies which part of the index that file corresponds to. There are separate files to hold the different parts of the index (term vectors, stored fields, inverted index, etc.).

Basic Index Operations

Adding Documents to an index

addDocument(Document)

Adds the Document using the default analyzer, which you specified when creating the IndexWriter, for tokenization.

addDocument(Document, Analyzer)

Adds the Document using the provided analyzer for tokenization. But be careful! For searches to work correctly, the analyzer used at search time needs to "match" the tokens produced by the analyzers at indexing time.
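The warning about matching analyzers can be made concrete with a toy example. The sketch below assumes an index-time analysis that splits on whitespace and lowercases; the class AnalyzerMismatchDemo and its methods are hypothetical and do not use Lucene.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Toy demonstration of index-time versus search-time analysis (hypothetical, not Lucene). */
final class AnalyzerMismatchDemo {
    private AnalyzerMismatchDemo() {}

    /** Index-time analysis assumed for this sketch: split on whitespace, then lowercase. */
    static Set<String> indexTerms(String text) {
        Set<String> terms = new HashSet<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                terms.add(t.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    /** Search-time analysis that applies the same lowercasing as indexing. */
    static String matchingQueryTerm(String input) {
        return input.trim().toLowerCase(Locale.ROOT);
    }

    /** Search-time analysis that keeps the user's casing, so it can miss indexed terms. */
    static String mismatchedQueryTerm(String input) {
        return input.trim();
    }
}
```

With the text "Lucene in Action" indexed this way, the query term produced by matchingQueryTerm("Lucene") is found among the indexed terms, while the one produced by mismatchedQueryTerm("Lucene") is not, even though the word is in the document.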

Deleting Documents from an index

deleteDocuments(Term) deletes all documents containing the provided term.
deleteDocuments(Term[]) deletes all documents containing any of the terms in the provided array.
deleteDocuments(Query) deletes all documents matching the provided query.
deleteDocuments(Query[]) deletes all documents matching any of the queries in the provided array.

In each case, the deletes are not done immediately. Instead, they are buffered in memory, just like the added documents, and periodically flushed to disk. As with added documents, you must call commit() or close() on your writer to commit the changes to the index.

When you delete a document, the disk space for that document is not immediately freed.

maxDoc() returns the total number of documents in the index, deleted or not.
numDocs() returns the number of undeleted documents in the index.

After optimize() is called, maxDoc() is equal to numDocs().
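The relationship between the two counters can be modeled with a list of documents and a set of deletion marks. This is only a sketch of the behavior described above: the class ToyDocStore is hypothetical, matches terms by plain substring, and is not how Lucene implements deletes.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Toy model of deletion marks and compaction (hypothetical, not Lucene's implementation). */
final class ToyDocStore {
    private List<String> docs = new ArrayList<>();
    private BitSet deleted = new BitSet();

    void add(String doc) {
        docs.add(doc);
    }

    /** Marks every document containing the term as deleted; no space is reclaimed yet. */
    void deleteContaining(String term) {
        for (int i = 0; i < docs.size(); i++) {
            if (docs.get(i).contains(term)) {
                deleted.set(i);
            }
        }
    }

    /** Counts all documents, including those marked as deleted. */
    int maxDoc() {
        return docs.size();
    }

    /** Counts only the documents not marked as deleted. */
    int numDocs() {
        return docs.size() - deleted.cardinality();
    }

    /** Rewrites the store without the deleted documents and clears the marks. */
    void optimize() {
        List<String> live = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            if (!deleted.get(i)) {
                live.add(docs.get(i));
            }
        }
        docs = live;
        deleted = new BitSet();
    }
}
```

After adding three documents and deleting one, maxDoc() is 3 and numDocs() is 2 in this model; after optimize() both are 2.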

Updating Documents in an index

JavaClass

StringBuffer

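String and StringBuffer can both store and manipulate strings, that is, character data made up of multiple characters. A String is a string constant and cannot be changed, whereas a StringBuffer is a string variable whose objects can be extended and modified.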

public StringBuffer()

public StringBuffer append(x), overloaded for x of type int, String, boolean, char[], ...

public StringBuffer insert(int offset, x), overloaded for x of type int, String, boolean, char[], ...

charAt()

Returns the single character at a given position in the string.

setCharAt(0, 'x')

Replaces the single character at a given position in the string.
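The methods listed above can be tried together in a short example. The class name StringBufferDemo and the sample text are made up for illustration; the StringBuffer calls themselves are standard JDK methods.

```java
/** Small walk-through of the StringBuffer methods noted above (sample values are arbitrary). */
final class StringBufferDemo {
    private StringBufferDemo() {}

    static String build() {
        StringBuffer sb = new StringBuffer();          // starts empty
        sb.append("ucene");                            // buffer: ucene
        sb.insert(0, 'l');                             // insert at offset 0: lucene
        sb.append(' ').append(3);                      // append a char, then an int: lucene 3
        sb.append(' ').append(true);                   // append a boolean: lucene 3 true
        char first = sb.charAt(0);                     // reads 'l' without changing the buffer
        sb.setCharAt(0, Character.toUpperCase(first)); // replaces it in place: Lucene 3 true
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(build());
    }
}
```

Every call modifies the same buffer object, which is the difference from String, where each such operation would produce a new object.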