User: 陈岳峰
Lucene in Action
Meet Lucene
recall metrics
- Recall measures how well the search system finds relevant documents.
precision metrics
- Precision measures how well the system filters out irrelevant documents.
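As a rough illustration of the two metrics, here is a minimal plain-Java sketch. It assumes the usual definitions (recall = relevant documents returned / all relevant documents, precision = relevant documents returned / all documents returned). The class `RelevanceMetrics` and the document ids are made up for this note and are not part of Lucene.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Lucene: computes recall and precision
// for one query from sets of document ids.
class RelevanceMetrics {

    // Number of distinct returned documents that are actually relevant.
    static int relevantReturned(Set<String> relevant, List<String> returned) {
        Set<String> hits = new HashSet<>(returned);
        hits.retainAll(relevant);
        return hits.size();
    }

    // Recall: fraction of all relevant documents that the search found.
    static double recall(Set<String> relevant, List<String> returned) {
        if (relevant.isEmpty()) return 0.0;
        return (double) relevantReturned(relevant, returned) / relevant.size();
    }

    // Precision: fraction of the returned documents that are relevant.
    static double precision(Set<String> relevant, List<String> returned) {
        Set<String> distinct = new HashSet<>(returned);
        if (distinct.isEmpty()) return 0.0;
        return (double) relevantReturned(relevant, returned) / distinct.size();
    }

    public static void main(String[] args) {
        // Assumed example data: four relevant documents, three returned.
        Set<String> relevant = new HashSet<>(Arrays.asList("d1", "d2", "d3", "d4"));
        List<String> returned = Arrays.asList("d1", "d2", "d9");
        System.out.println("recall = " + recall(relevant, returned));
        System.out.println("precision = " + precision(relevant, returned));
    }
}
```

With this data the search found two of the four relevant documents, and two of its three results were relevant.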
Understanding the core indexing classes
IndexWriter
Directory
FSDirectory
RAMDirectory
Analyzer
Document
Field
Keyword
Not analyzed (no tokenization), but indexed and stored in the index verbatim. Suitable for values that must be kept whole, such as names, file paths, URLs, and telephone numbers.
UnIndexed
Neither analyzed nor indexed, but stored in the index. Suitable for values that users don't search on but that should appear in the search results. Because the value is stored in the index, it should not be too large.
UnStored
Analyzed and indexed but not stored. Suitable for large bodies of text such as HTML bodies or Doc documents.
Text
Analyzed and indexed. This implies that fields of this type can be searched against. If the data indexed is a String, it’s also stored; but if the data (as in our Indexer example) is from a Reader, it isn’t stored.
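To keep the four field types apart, the flags described above can be written down as a small table in code. This is only a plain-Java model of these notes; `FieldKinds` and `Kind` are hypothetical names, and nothing here calls the Lucene API.

```java
// Plain-Java model of the four field types in these notes.
// FieldKinds and Kind are hypothetical names, not Lucene classes.
class FieldKinds {

    enum Kind {
        KEYWORD(false, true, true),    // whole value indexed and stored
        UNINDEXED(false, false, true), // stored only, not searchable
        UNSTORED(true, true, false),   // searchable, value not kept
        TEXT(true, true, true);        // stored only when built from a String

        final boolean analyzed;
        final boolean indexed;
        private final boolean storedForString;

        Kind(boolean analyzed, boolean indexed, boolean storedForString) {
            this.analyzed = analyzed;
            this.indexed = indexed;
            this.storedForString = storedForString;
        }

        // A Text field whose data comes from a Reader is not stored;
        // for every other case the flag from the table applies.
        boolean stored(boolean fromReader) {
            if (this == TEXT && fromReader) return false;
            return storedForString;
        }
    }

    public static void main(String[] args) {
        for (Kind k : Kind.values()) {
            System.out.println(k + ": analyzed=" + k.analyzed
                    + ", indexed=" + k.indexed
                    + ", stored(String)=" + k.stored(false));
        }
    }
}
```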
Understanding the core searching classes
IndexSearcher
IndexSearcher is to searching what IndexWriter is to indexing. Think of it as a class that opens an index in read-only mode, much like opening a file read-only to find a word in it. It offers several search methods; the simplest takes a single Query object as a parameter and returns a Hits object. A typical use of this method looks like this:
IndexSearcher is = new IndexSearcher(FSDirectory.getDirectory("/tmp/index", false));
Query q = new TermQuery(new Term("contents", "lucene"));
Hits hits = is.search(q);
Term
Similar to the Field object, a Term consists of a pair of string elements: the name of the field and the value of that field. Terms are also involved in indexing, but there they are created by Lucene's internals. During searching, you can construct Term objects yourself and use them with TermQuery, like this:
Query q = new TermQuery(new Term("contents", "lucene"));
Hits hits = is.search(q);
Query
The most basic Query type is TermQuery. Other Query types are BooleanQuery, PhraseQuery, PrefixQuery, PhrasePrefixQuery, RangeQuery, FilteredQuery, and SpanQuery.
TermQuery
It's used for matching documents that contain fields with specific values.
Hits
Hits instances don't load from the index all documents that match a query, but only a small portion of them at a time.
Indexing
Conceptual document model
- Conceptual index model
- Performing basic index operations
- Boosting Documents and Fields during indexing
- Indexing dates, numbers, and Fields for use in sorting search results
- Using parameters that affect Lucene's indexing performance and resource consumption
- Optimizing indexes
- Understanding concurrency, multithreading, and locking issues
- Advanced indexing functions
Documents and Fields
Three things Lucene can do with each field:
- The field may be indexed or not. Only text fields may be indexed (binary-valued fields may only be stored). If it is indexed, tokens are derived from its value by analysis, and those tokens are then added to the index.
- If it is indexed, the field may also optionally store term vectors, which are really a miniature inverted index for that one field, allowing you to retrieve all tokens for that field.
- Separately, the field's value may be stored: a verbatim, unanalyzed copy of the value is written to the index so that it can later be retrieved.
Flexible Schema
Denormalization
Understanding the indexing process
Extracting text and creating the document
Extract the information from whatever format the source is in and put it into a Document, which is the form Lucene can deal with. Once you have the text you'd like to index, and you've created a Document with all the Fields you'd like to index, all text must then be analyzed.
Analysis
The combination of an original source of tokens followed by the series of filters that modify the tokens produced by that source, together make up the Analyzer.
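The "token source plus a series of filters" idea can be sketched in plain Java. This is a toy model under stated assumptions, not Lucene's Analyzer API: `ToyAnalyzer`, the whitespace token source, and the two filters are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.UnaryOperator;

// Toy model of an analyzer: one token source followed by a chain of filters.
// Hypothetical class, not Lucene's Analyzer.
class ToyAnalyzer {
    private final List<UnaryOperator<List<String>>> filters = new ArrayList<>();

    // Appends a filter to the end of the chain; filters run in insertion order.
    ToyAnalyzer addFilter(UnaryOperator<List<String>> filter) {
        filters.add(filter);
        return this;
    }

    // Token source: splits the text on runs of whitespace.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Runs the token source, then each filter over the previous result.
    List<String> analyze(String text) {
        List<String> tokens = tokenize(text);
        for (UnaryOperator<List<String>> filter : filters) {
            tokens = filter.apply(tokens);
        }
        return tokens;
    }

    // Filter that lowercases every token.
    static UnaryOperator<List<String>> lowerCase() {
        return tokens -> {
            List<String> out = new ArrayList<>();
            for (String t : tokens) out.add(t.toLowerCase(Locale.ROOT));
            return out;
        };
    }

    // Filter that drops tokens found in the given stop-word set.
    static UnaryOperator<List<String>> stop(Set<String> stopWords) {
        return tokens -> {
            List<String> out = new ArrayList<>();
            for (String t : tokens) {
                if (!stopWords.contains(t)) out.add(t);
            }
            return out;
        };
    }

    public static void main(String[] args) {
        ToyAnalyzer analyzer = new ToyAnalyzer()
                .addFilter(lowerCase())
                .addFilter(stop(new HashSet<>(Arrays.asList("the", "in"))));
        System.out.println(analyzer.analyze("Lucene in Action  The Book"));
    }
}
```

The order of the filters matters in this sketch: the stop filter only sees "the" because the lowercase filter ran first.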
Index writing and files
inverted structure
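A minimal sketch of what "inverted" means here: instead of mapping each document to its words, the index maps each term to the documents that contain it. The class below is a toy in-memory model written for these notes (`ToyInvertedIndex` is a hypothetical name); it says nothing about how Lucene lays the structure out on disk.

```java
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy in-memory inverted index: term -> ids of documents containing the term.
// Hypothetical class for illustration, not Lucene's implementation.
class ToyInvertedIndex {
    private final Map<String, SortedSet<Integer>> postings = new TreeMap<>();
    private int nextDocId = 0;

    // Lowercases the text, splits it on non-word characters, and records
    // the new document id under every token. Returns the assigned id.
    int addDocument(String text) {
        int docId = nextDocId++;
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (token.isEmpty()) continue;
            postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
        }
        return docId;
    }

    // Looks the term up directly; no document text is scanned at search time.
    SortedSet<Integer> search(String term) {
        SortedSet<Integer> docs = postings.get(term.toLowerCase(Locale.ROOT));
        return docs == null ? new TreeSet<>() : docs;
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.addDocument("Lucene in Action");
        index.addDocument("Action movies");
        System.out.println("action -> " + index.search("action"));
        System.out.println("lucene -> " + index.search("lucene"));
    }
}
```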
Index Segments
Each segment, in turn, consists of multiple files, of the form _X.<ext>, where X is the segment’s name and <ext> is the extension that identifies which part of the index that file corresponds to. There are separate files to hold the different parts of the index (term vectors, stored fields, inverted index, etc.).
Basic index operations
Adding Documents to an index
- addDocument(Document)
adds the Document using the default analyzer, which you specified when creating the IndexWriter, for tokenization.
- addDocument(Document, Analyzer)
adds the Document using the provided analyzer for tokenization. But be careful! In order for searches to work correctly, the analyzer used at search time needs to "match" the tokens produced by the analyzers at indexing time.
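Why the analyzers have to "match" can be shown with a toy example in plain Java. Both analyzers below are hypothetical stand-ins written for these notes (one lowercases, one keeps case); the point is only that a query token the index-time analyzer would never have produced cannot be found.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Function;

// Toy demonstration of an index-time / search-time analyzer mismatch.
// Hypothetical code, not the Lucene API.
class AnalyzerMismatch {

    // Splits on whitespace and lowercases each token.
    static List<String> lowerCaseAnalyzer(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t.toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    // Splits on whitespace and keeps the original case.
    static List<String> caseKeepingAnalyzer(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // "Indexes" a text by keeping the set of tokens the analyzer produced.
    static Set<String> index(String text, Function<String, List<String>> analyzer) {
        return new HashSet<>(analyzer.apply(text));
    }

    // A query matches only if every query token is among the indexed tokens.
    static boolean matches(Set<String> indexedTerms, String query,
                           Function<String, List<String>> analyzer) {
        List<String> queryTokens = analyzer.apply(query);
        return !queryTokens.isEmpty() && indexedTerms.containsAll(queryTokens);
    }

    public static void main(String[] args) {
        Set<String> indexed = index("Lucene in Action", AnalyzerMismatch::lowerCaseAnalyzer);
        // Same analyzer at search time: "Lucene" becomes "lucene" and is found.
        System.out.println(matches(indexed, "Lucene", AnalyzerMismatch::lowerCaseAnalyzer));
        // Different analyzer at search time: "Lucene" stays capitalized and is missed.
        System.out.println(matches(indexed, "Lucene", AnalyzerMismatch::caseKeepingAnalyzer));
    }
}
```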
Deleting Documents from an index
- deleteDocuments(Term) deletes all documents containing the provided term.
- deleteDocuments(Term[]) deletes all documents containing any of the terms in the provided array.
- deleteDocuments(Query) deletes all documents matching the provided query.
- deleteDocuments(Query[]) deletes all documents matching any of the queries in the provided array.
In each case, the deletes are not done immediately. Instead, they are buffered in memory, just like the added documents, and periodically flushed to disk. As with added documents, you must call commit() or close() on your writer to commit the changes to the index.
When you delete a document, the disk space for that document is not immediately freed.
- maxDoc() returns the total number of documents in the index, deleted or not.
- numDocs() returns the number of documents in the index that have not been deleted.
After you call optimize(), maxDoc() equals numDocs().
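The relationship between the two counts can be modelled with a small plain-Java class. This is a toy under the assumption that a delete only marks a slot and that optimizing drops the marked slots; `ToyDocStore` is a hypothetical name, and its methods merely borrow the names used above without calling Lucene.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of "mark as deleted now, reclaim space later".
// Hypothetical class; method names only mirror the ones in these notes.
class ToyDocStore {
    private final List<String> docs = new ArrayList<>();
    private final List<Boolean> deleted = new ArrayList<>();

    void addDocument(String doc) {
        docs.add(doc);
        deleted.add(false);
    }

    // Marks every document equal to the given value as deleted.
    // The slot itself is kept, so maxDoc() does not change.
    void deleteDocuments(String value) {
        for (int i = 0; i < docs.size(); i++) {
            if (docs.get(i).equals(value)) deleted.set(i, true);
        }
    }

    // Total number of slots, deleted or not.
    int maxDoc() {
        return docs.size();
    }

    // Number of slots that are not marked as deleted.
    int numDocs() {
        int live = 0;
        for (boolean d : deleted) {
            if (!d) live++;
        }
        return live;
    }

    // Rewrites the store without the deleted slots, reclaiming their space.
    void optimize() {
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            if (!deleted.get(i)) kept.add(docs.get(i));
        }
        docs.clear();
        deleted.clear();
        for (String doc : kept) addDocument(doc);
    }

    public static void main(String[] args) {
        ToyDocStore store = new ToyDocStore();
        store.addDocument("alpha");
        store.addDocument("beta");
        store.addDocument("gamma");
        store.deleteDocuments("beta");
        System.out.println("before optimize: maxDoc=" + store.maxDoc() + ", numDocs=" + store.numDocs());
        store.optimize();
        System.out.println("after optimize:  maxDoc=" + store.maxDoc() + ", numDocs=" + store.numDocs());
    }
}
```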
Updating Documents in an index
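JavaClass
StringBuffer
Both String and StringBuffer can store and manipulate strings, that is, data made up of multiple characters. A String is a string constant: once created, it cannot be changed. A StringBuffer is a string variable: its objects can be extended and modified.
public StringBuffer()
public StringBuffer append(...), overloaded for int, String, boolean, char[], ...
public StringBuffer insert(int offset, ...), overloaded for int, String, boolean, char[], ...
charAt()
Returns the single character at a given position in the string.
setCharAt(0, 'x')
Replaces the single character at a given position in the string.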
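A short sketch that exercises the methods listed above, using only `java.lang`. The class name `StringBufferDemo` and the sample strings are made up for this note.

```java
// Exercises the StringBuffer methods from the notes above.
// StringBufferDemo is a hypothetical class name.
class StringBufferDemo {

    static String build() {
        StringBuffer sb = new StringBuffer(); // starts empty
        sb.append("ucene");                   // append(String)
        sb.append(' ');                       // append(char)
        sb.append(3);                         // append(int)
        sb.insert(0, "l");                    // insert(int offset, String)
        char first = sb.charAt(0);            // reads a single character
        // Replace the first character with its uppercase form, in place.
        sb.setCharAt(0, Character.toUpperCase(first));
        return sb.toString();
    }

    static String immutableString() {
        String s = "abc";
        s.concat("d"); // returns a new String; s itself is unchanged
        return s;
    }

    public static void main(String[] args) {
        System.out.println(build());
        System.out.println(immutableString());
    }
}
```

The same StringBuffer object is modified by every call in `build()`, whereas `concat` in `immutableString()` leaves the original String untouched.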