Apache Tika - Apache Tika

State of Adversarial Stylometry: can you change your prose-style? Today at the Chaos Computer Congress in Berlin (28C3), Sadia Afroz and Michael Brennan presented a talk called "Deceiving Authorship Detection," about research from Drexel University on "Adversarial Stylometry," the practice of identifying the authors of texts who don't want to be identified, and the process of evading detection. Stylometry has made great and well-publicized advances in recent years (and it made the news with scandals like "Gay Girl in Damascus"), but typically this has been against authors who have not taken active, computer-assisted countermeasures to disguise their distinctive "voice" in prose. As part of the presentation, the Drexel team released Anonymouth, a free/open tool that partially automates the process of evading authorship detection. The tool is still a rough alpha, and it requires human intervention to oversee the texts it produces, but it is still an exciting move in adversarial stylometry tools. Privacy, Security and Automation Lab

Lucene Tutorial - Lucene Installation, Configuration, Examples. Index Microsoft Office Files with Lucene | Christoph Hartmann on January 7th, 2009. Within my current research project I faced the challenge of indexing a whole bunch of files. To be platform independent, the Java programming language was the first choice. Then I came across the Lucene project. Lucene is an open-source project that “provides Java-based indexing and search technology”. I looked at two projects, Tika and Aperture. While Tika is not available as a binary download, Aperture is. Just download the Tika source code via svn checkout tika and use Maven to install the binary into your local Maven repository. The following part does the core binding between Tika and Lucene.
logger.debug("Indexing " + file);
try {
    Document doc = null;
    // parse the document
    synchronized (contentParserAccess) {
        doc = contentParser.getDocument(file);
    }
    // put it into Lucene
    if (doc !
The ContentParser calls the TikaParser for each file and puts the metadata it returns into a Lucene document. A custom Tika parser may look like:

Adam Parrish · Getting data from the web. Python: hidden details. In the interest of brevity, we’ve skipped over some fairly important details of Python. Here’s our chance to play catch-up. Other kinds of loops; loop control. The for loop is far and away the most common loop in Python. But there’s another kind of loop that you’ll encounter frequently: the while loop.
>>> i = 0
>>> while i < 10:
...     i += 1
...     print(i)
...
1
2
3
4
5
6
7
8
9
10
Python also has two loop control statements.
>>> i = 0
>>> while i < 10:
...     i += 1
...     if i % 2 == 1:
...         continue
...     print(i)
...
2
4
6
8
10
The continue statement causes Python to skip back to the top of the loop; the remaining statements in the loop body aren't executed for that iteration. Finally, we have break, which causes Python to drop out of the loop altogether.
>>> i = 0
>>> while i < 10:
...     i += 1
...     if i > 5:
...         break
...     print(i)
...
1
2
3
4
5
Here, as soon as i reaches a value greater than 5, the break statement gets executed, and Python stops executing the loop. Remaining sections: Tuples; from module import stuff; File objects; URLs.

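Using the full-text search engine "Lucene.Net". "Lucene.Net" is a full-text search engine that runs on the .NET Framework. For example, Lucene.Net can be used when you build a website with ASP.NET and want a search page for the site's content, or when you want full-text search in a Windows application. Lucene.Net is one of the projects developed by the Apache Software Foundation, and it is developed as open source. Its original is "Lucene", written in Java, which is currently used by many websites, including Wikipedia (see PoweredBy on the Lucene-java Wiki). Lucene.NET, the .NET version of Lucene, is provided by the Apache Software Foundation's "Lucene.Net project", just like the Java version. This article introduces Lucene.Net. Overview of Lucene.Net: Full-text search is, simply put, the ability to search for a specific string across multiple texts. A "sequential search" scans all of the text every time it runs, like the UNIX grep command. Lucene.NET is index-based: it extracts tokens from the text in advance, builds an index, and then runs searches against that index. In addition to the Lucene.NET core that performs the full-text search, a Japanese environment needs its own processing to extract tokens from text. In English, words are separated by spaces, so analyzing a sentence is not very complex, but analyzing Japanese text, where word boundaries are not explicit, requires dedicated, sophisticated morphological analysis. The Lucene distributions from the Apache Software Foundation, including the original Java version, do not ship with a feature for extracting tokens from Japanese (that is, a Japanese analyzer). Full-text search with Lucene.Net: Lucene.Net mainly performs the following two tasks: index creation and searching. Once an index has been created, searching becomes available.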
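
The difference between sequential (grep-like) search and index-based search can be illustrated with a minimal Python sketch. This is not Lucene.Net code: the sample documents, the function names, and the whitespace tokenizer are all assumptions made for illustration, and whitespace splitting only works for space-delimited languages such as English (Japanese would need a morphological analyzer, as the excerpt notes).

```python
from collections import defaultdict

# Hypothetical sample corpus: document id -> text.
DOCS = {
    1: "lucene is a full text search engine",
    2: "grep scans every line of text",
    3: "an index maps each token to documents",
}

def sequential_search(docs, word):
    """Scan every document on every query, the way grep does."""
    return sorted(doc_id for doc_id, text in docs.items() if word in text.split())

def build_index(docs):
    """Tokenize once, up front, and map each token to the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():  # naive whitespace tokenizer (assumption)
            index[token].add(doc_id)
    return index

def indexed_search(index, word):
    """Answer a query with a single dictionary lookup instead of a full scan."""
    return sorted(index.get(word, set()))

INDEX = build_index(DOCS)
print(sequential_search(DOCS, "text"))
print(indexed_search(INDEX, "text"))
```

Both functions return the same document ids; the index-based version pays the tokenization cost once at build time, which is the trade-off the excerpt describes.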

Lucene - Index File Formats. Index File Formats: This document defines the index file formats used in Lucene version 3.0. If you are using a different version of Lucene, please consult the copy of docs/fileformats.html that was distributed with the version you are using. Apache Lucene is written in Java, but several efforts are underway to write versions of Lucene in other programming languages. As Lucene evolves, this document should evolve. Compatibility notes are provided in this document, describing how file formats have changed from prior versions. In version 2.1, the file format was changed to allow lock-less commits (i.e., no more commit lock). In version 2.3, the file format was changed to allow segments to share a single set of doc store (vectors & stored fields) files. Definitions: The fundamental concepts in Lucene are index, document, field and term. An index contains a sequence of documents. A document is a sequence of fields. The same string in two different fields is considered a different term. Segments
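
To make the definitions concrete, here is a minimal Python sketch of the idea that the same string in two different fields is a different term. It is an in-memory model only, not Lucene's file format or API; the sample documents, the `collect_terms` name, and the whitespace tokenization are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical index: a sequence of documents, each a sequence of (field, text) pairs.
documents = [
    [("title", "lucene"), ("body", "lucene index formats")],
    [("title", "index"), ("body", "segments")],
]

def collect_terms(docs):
    """Map each term, modeled as a (field, string) pair, to the document numbers containing it."""
    postings = defaultdict(list)
    for doc_number, fields in enumerate(docs):
        for field, text in fields:
            for token in text.split():  # naive whitespace tokenizer (assumption)
                term = (field, token)
                if doc_number not in postings[term]:
                    postings[term].append(doc_number)
    return dict(postings)

terms = collect_terms(documents)
print(terms[("title", "index")])
print(terms[("body", "index")])
```

Because the field name is part of the key, `("title", "index")` and `("body", "index")` are tracked separately and point at different documents.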

Mastering Google Analytics Custom Variables. I’ve got a stack of posts that I want to write, and realized that they all deal with Custom Variables. So, to make sure that we’re all on the same page when it comes to custom vars, here’s my guide to Mastering Google Analytics Custom Variables. For those of you who have not used custom variables, CVs are a way for you to insert custom data into Google Analytics. There are 4 parts to a custom variable: 1. Name & Value: Custom variables are name-value pairs of data. Google Analytics will show you a list of all the custom variable names and then let you drill down to see all of the values. Here’s an example. Then I can click on “Year” to get a list of all the values: Custom variables can also be used in custom reports and advanced segments. Index or Slot: The index is a way to organize your custom variables. You can technically have more than 5 custom variables, but we need to discuss the next concept, scope, and how it impacts the index. Scope. The Code. Super Nerd Stuff
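
The four parts (name, value, index, scope) can be modeled as a small data structure. The sketch below is a hypothetical Python model, not Google's API: in the classic ga.js tracker the real call is the JavaScript `_setCustomVar(index, name, value, opt_scope)`, and the `CustomVar` class, its validation rules (slots 1 to 5; scope 1 = visitor, 2 = session, 3 = page), and the sample value "2011" are assumptions made for illustration.

```python
from dataclasses import dataclass

# Scope codes as used by the classic ga.js _setCustomVar call.
SCOPES = {1: "visitor", 2: "session", 3: "page"}

@dataclass(frozen=True)
class CustomVar:
    """Hypothetical model of one custom variable: slot, name-value pair, and scope."""
    index: int      # slot number
    name: str
    value: str
    scope: int = 3  # page-level when no scope is given

    def __post_init__(self):
        if not 1 <= self.index <= 5:
            raise ValueError("index (slot) must be between 1 and 5")
        if self.scope not in SCOPES:
            raise ValueError("scope must be 1 (visitor), 2 (session), or 3 (page)")

    def as_push_args(self):
        """Return the argument list in the order _setCustomVar expects."""
        return ["_setCustomVar", self.index, self.name, self.value, self.scope]

year = CustomVar(index=1, name="Year", value="2011")
print(year.as_push_args())
```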

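Encoding problems in Lucene indexing, so frustrating - 巴士飞扬-技术BLOG. I have been studying Lucene indexing for the past few days and ran into some problems that gave me a real headache. I don't know how other people handle this. At first, I read the content and stored it directly in the index, which made it easy to read the content back and highlight it at query time. But I found that this is heavily affected by character encoding, so I added an encoding when reading the files, yet Chinese text still could not be found by search. This is how I indexed the contents field: doc.add(new Field("contents", new FileReader(file))); This turned out to cause the problem: FileReader reads file contents using the system encoding, so a UTF-8 file may be read in as GBK (I noticed that GBK files had no problem) and indexed that way, and as a result searches find nothing. Later, I switched to: doc.add(new Field("contents", FileDocument.readFileContents(file.getCanonicalPath(), charset), Field.Store.YES, Field.Index.TOKENIZED)); where readFileContents is a function I wrote myself that reads the file contents using the given charset. This solves the indexing problem. However, it writes the entire file content into the index, which makes the index files very large and adds to the indexing load. So I gave up on this solution. I went back to the first indexing approach, but improved it. Thinking it over, I only needed the second argument of doc.add(new Field("contents", new FileReader(file))); that is, the FileReader, to load the file with the right encoding. So I looked at FileReader's constructors, and none of them takes an encoding. What to do? Somewhat at a loss, I tried an InputStreamReader instance in place of the FileReader, and to my surprise it worked. The code is as follows: doc.add(new Field("contents", new InputStreamReader(new FileInputStream(file.getCanonicalPath()), charset))); Fields created this way are Field.Index.TOKENIZED and Field.Store.NO. Tags: FileReader Lucene InputStreamReader index search-engine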
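
The underlying failure, decoding a UTF-8 file with the wrong charset, can be reproduced outside Lucene. The following is a minimal Python sketch of the same mistake and fix, not the post's Java code: the temporary file, the sample string, and the use of GBK as a stand-in for the "system default" encoding are assumptions for illustration.

```python
import os
import tempfile

text = "中文检索"  # sample Chinese text (assumption)

# Write a UTF-8 file, matching the failing case described in the post.
with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".txt", delete=False) as f:
    f.write(text)
    path = f.name

try:
    # Wrong: decode with a mismatched charset (GBK stands in for the system default).
    with open(path, encoding="gbk", errors="replace") as reader:
        garbled = reader.read()

    # Right: pass the file's real charset explicitly, which is the role
    # InputStreamReader(stream, charset) plays in the post's Java fix.
    with open(path, encoding="utf-8") as reader:
        decoded = reader.read()
finally:
    os.remove(path)

print(garbled == text)
print(decoded == text)
```

Tokens produced from the garbled string would never match a query typed in correct Chinese, which is why the search in the post returned nothing.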