background preloader

Apache Tika - Apache Tika

Apache Tika - Apache Tika
Related:  da valutare

State of Adversarial Stylometry: can you change your prose-style? Today at the Chaos Computer Congress in Berlin (28C3), Sadia Afroz and Michael Brennan presented a talk called "Deceiving Authorship Detection," about research from Drexel College on "Adversarial Stylometry," the practice of identifying the authors of texts who don't want to be identified, and the process of evading detection. Stylometry has made great and well-publicized advances in recent years (and it made the news with scandals like "Gay Girl in Damascus"), but typically this has been against authors who have not taken active, computer-assisted countermeasures at disguising their distinctive "voice" in prose. As part of the presentation, the Drexel Team released Anonymouth, a free/open tool that partially automates the process of evading authorship detection. The tool is still a rough alpha, and it requires human intervention to oversee the texts it produces, but it is still an exciting move in adversarial stylometry tools. Privacy, Security and Automation Lab

Lucene教程 - Lucene安装、配置、实例 Index Microsoft Office Files with Lucene | acidum.de Christoph Hartmann on January 7th, 2009 Within my current research project I faced the challenge to index a whole bunch of files. To be platform independent the Java programming language was the first choice. Then I came along the Lucene project. Lucene is an open-source project that “provides Java-based indexing and search technology”. I looked at two projects: While Tika is not available as a binary download Aperture is. Just download the Tika source code viasvn checkout tika and use maven to install the binary into your local maven repository. The following part do the core binding between Tika and Lucene. logger.debug("Indexing " + file);try { Document doc = null; // parse the document synchronized (contentParserAccess) { doc = contentParser.getDocument(file); } // put it into Lucene if (doc ! The ContentParser calls the TikaParser for each file and put the metadata it returns into a Lucene document. A custom tika parser may looks like:

Chapter 1. Getting Started Your application can send events into the engine using the sendEvent method that is part of the runtime interface: engine.getEPRuntime().sendEvent(new PersonEvent("Peter", 10)); The output you should see is: Name: Peter, Age: 10 Upon sending the PersonEvent event object to the engine, the engine consults the internally-maintainced shared filter index tree structure to determine if any EPL statement is interested in PersonEvent events. The EPL statement that was created as part of this example has PersonEvent in the from-clause, thus the engine delegates processing of such events to the statement. For different event representations and sendEvent methods please see Section 3.5, “Comparing Event Representations”. Specialized event senders are explained in Section 14.4.1, “Event Sender”. For reference, here is the complete code without event class:

Adam Parrish · Getting data from the web Python: hidden details In the interest of brevity, we’ve skipped over some fairly important details of Python. Here’s our chance to play catch-up. Other kinds of loops; loop control The for loop is far and away the most common loop in Python. But there’s another kind of loop that you’ll encounter frequently: the while loop. >>> i = 0 >>> while i < 10: ... i += 1 ... print i ... 1 2 3 4 5 6 7 8 9 10 Python also has two loop control statements. >>> i = 0 >>> while i < 10: ... i += 1 ... if i % 2 == 1: ... continue ... print i ... 2 4 6 8 10 The continue statement causes Python to skip back to the top of the loop; the remaining statements aren't executed. Finally, we have break, which causes Python to drop out of the loop altogether. >>> i = 0 >>> while i < 10: ... i += 1 ... if i > 5: ... break ... print i ... 1 2 3 4 5 Here, as soon as i achieves a value greater than 5, the break statement gets executed, and Python stops executing the loop. Tuples from module import stuff File objects URLs 01.<? 02.

全文検索エンジン「Lucene.Net」を使う 「Lucene.Net」は.NET Framework上で利用できる「全文検索エンジン」です。例えば、ASP.NETを使ってWebサイトを作成する際に、サイト内のコンテンツを検索する検索ページを作成したいという場合や、Windowsアプリケーションで全文検索機能を利用したい場合にLucene.Netが利用できます。 Lucene.NetはApache Software Foundationが開発しているプロジェクトの1つで、オープンソースで開発されています。Java言語で記述された「Lucene」がそのオリジナルであり、これは、Wikipediaをはじめ多くのWebサイトで現在利用されています( Lucene-java WikiのPowerdBy ) Luceneの.NET版であるLucene.NETは、Java版と同様Apache Software Foundationの「 Lucene.Netプロジェクト 」で提供されています。今回は、このLucene.Netを紹介します。 Lucene.Netの概要 全文検索とは、簡単にいうと「複数のテキストから特定の文字列を検索する」機能です。 「逐次検索型」は、UNIXのgrepコマンドのように、実行するたびにテキストをすべて走査して検索を行うものです。 Lucene.NETはインデックス型で、事前にテキストからトークンを切り出しておき、インデックスを作成したうえで検索処理を実行します。 全文検索を行うLucene.NET本体に加えて、日本語の環境では、テキストからトークンを切り出すための独自の処理が必要となります。 英文では単語ごとにスペースで区切られて文章が記述されているため、文章を解析するための処理はそれほど複雑ではありませんが、単語の区切りが明確でない日本語の文章を解析するには、独自の高度な形態素解析処理が必要となります。 オリジナルであるJava版のLuceneを含め、Apache Software Foundationで配布されているLuceneには、日本語に対応したトークンを取り出す機能(=日本語アナライザ)が付属していません。 Lucene.Netによる全文検索処理 Lucene.Netでは、主に次の2つの処理を行います。 インデックス作成処理 検索処理 インデックスを作成すれば、検索処理が利用可能となります。

Lucene - Index File Formats Index File Formats This document defines the index file formats used in Lucene version 3.0. If you are using a different version of Lucene, please consult the copy of docs/fileformats.html that was distributed with the version you are using. Apache Lucene is written in Java, but several efforts are underway to write versions of Lucene in other programming languages. As Lucene evolves, this document should evolve. Compatibility notes are provided in this document, describing how file formats have changed from prior versions. In version 2.1, the file format was changed to allow lock-less commits (ie, no more commit lock). In version 2.3, the file format was changed to allow segments to share a single set of doc store (vectors & stored fields) files. Definitions The fundamental concepts in Lucene are index, document, field and term. An index contains a sequence of documents. A document is a sequence of fields. The same string in two different fields is considered a different term. Segments

Selenium-Cucumber | Code Less… Test More…

Related: