
Binary parsers


Regex 2013-03-11. Performance - Why doesn't Python's mmap work with large files. 16.7. mmap — Memory-mapped file support. Memory-mapped file objects behave like both strings and like file objects.

16.7. mmap — Memory-mapped file support

Unlike normal string objects, however, these are mutable. Python - How do I re.search or re.match on a whole file without reading it all into memory. Regex - Python: find regexp in a file. The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) By Joel Spolsky Wednesday, October 08, 2003 Ever wonder about that mysterious Content-Type tag?
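The mmap and re.search entries above fit together: a memory-mapped file exposes the buffer interface, so a bytes pattern can be matched against it without first reading the file into a string. A minimal sketch, assuming Python 3; the function name `search_file` is hypothetical, not taken from any of the linked pages:

```python
import mmap
import re

def search_file(path, pattern):
    """Return (offset, matched bytes) for the first match, or None.

    `pattern` must be a bytes pattern because the map holds raw bytes.
    Assumes a non-empty file: mapping an empty file raises ValueError.
    """
    with open(path, "rb") as f:
        # Length 0 maps the whole file; ACCESS_READ keeps the map read-only.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            m = re.search(pattern, mm)
            # Copy the match out while the map is still open.
            return (m.start(), m.group(0)) if m else None
```

The operating system pages the file in on demand, so only the parts the regex engine touches need to be resident. On a 32-bit interpreter the address space still limits how large a file can be mapped.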

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

You know, the one you're supposed to put in HTML and you never quite know what it should be? Did you ever get an email from your friends in Bulgaria with the subject line "???? ?????? I've been dismayed to discover just how many software developers aren't really completely up to speed on the mysterious world of character sets, encodings, Unicode, all that stuff. But it won't. Regex Tutorial - Unicode Characters and Properties. Unicode is a character set that aims to define all characters and glyphs from all human languages, living and dead.

Regex Tutorial - Unicode Characters and Properties

With more and more software being required to support multiple languages, or even just any language, Unicode has been strongly gaining popularity in recent years. Using different character sets for different languages is simply too cumbersome for programmers and users. Unfortunately, Unicode brings its own requirements and pitfalls when it comes to regular expressions. Of the regex flavors discussed in this tutorial, Java, XML and the .NET framework use Unicode-based regex engines.

Perl supports Unicode starting with version 5.6. RegexBuddy's regex engine is fully Unicode-based starting with version 2.0.0. Characters, Code Points, and Graphemes or How Unicode Makes a Mess of Things Most people would consider à a single character. All Unicode regex engines discussed in this tutorial treat any single Unicode code point as a single character. Parsing huge file without reading into memory (Performance forum at JavaRanch)

Originally posted by David Harkness: Not if you set the ByteBuffer's position and limit of the buffer before decoding it.

Parsing huge file without reading into memory (Performance forum at JavaRanch)

Loop over the mapped buffer, setting up a good block size using position and limit. Decoding will now just decode the bytes in the range you specify. Use CharsetDecoder.decode(ByteBuffer, CharBuffer) or one of the other similar methods so you can reuse the same CharBuffer. Since decoding advances the position, it should leave you at the next correct spot, dealing with multi-byte character encodings for you; just set limit to be position + BLOCK_SIZE and keep going. If you want ultimate speed, cannot count on ASCII files, and don't want to write your own specialized decoder, this is the way to go. You are right, but my needs don't allow me to perform the operations you described: the CharBuffer I want to get out of the big log file is going to be parsed by regexp... The result works correctly with ASCII files only, but log files are usually ASCII compliant...
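The block-wise decoding described in the first post can be sketched in Python rather than Java. This is an analogue, not the CharsetDecoder API: an incremental decoder from the standard `codecs` module plays the role of the decoder that carries a split multi-byte character over to the next block. The name `decode_blocks` and the default block size are assumptions made for illustration:

```python
import codecs

def decode_blocks(buf, encoding="utf-8", block_size=4096):
    """Yield decoded text for successive fixed-size byte blocks of `buf`.

    `buf` can be bytes or a memory-mapped file; slicing either gives bytes.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    for pos in range(0, len(buf), block_size):
        block = buf[pos:pos + block_size]
        # Bytes of a character cut by the block edge are held back by the
        # decoder and completed when the next block arrives.
        text = decoder.decode(block)
        if text:
            yield text
    # final=True raises if the input stops in the middle of a character.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail
```

This keeps only one block of text in memory at a time, but it also shows the objection raised in the reply: a regex applied to each block separately would miss any match that straddles two blocks.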

Java regex for support Unicode. Regex - Unicode equivalents for \w and \b in Java regular expressions. Java - I'd like to apply a regex efficiently to an entire file. Searching a File - Java Regular Expressions: Taming the java.util.regex Engine - 图书 - JAVA 编程资料牛鼻站. Building on the previous example, I decide to provide a utility for searching the content of a file and returning all matching strings within that file.

Searching a File - Java Regular Expressions: Taming the java.util.regex Engine - 图书 - JAVA 编程资料牛鼻站

I'll use FileChannels for the actual file I/O. Although a discussion of FileChannels is beyond the scope of this book, in my opinion they're the best way to access files in Java. My strategy is to use a FileChannel to open a file, read its content into a String, release the FileChannel, and then use the searchString method to parse the String. This is faster than reading through the file line by line and examining its content, though it is memory intensive. Listing 5-7 shows the code for doing this. Java - Memory-Mapped MappedByteBuffer or Direct ByteBuffer for DB Implementation. Java tip: How to read files quickly. Technologies: Java 5+ Java has several classes for reading files, with and without buffering, random access, thread safety, and memory mapping. Some of these are much faster than the others. This article benchmarks 13 ways to read bytes from a file and shows which ways are the fastest. A quick review of file reading classes Let's quickly run through several ways to open and read a file of bytes in Java.

FileInputStream with byte reads: FileInputStream f = new FileInputStream( name ); int b; long checkSum = 0L; while ( (b=f.read()) ! Java NIO MappedByteBuffer OutOfMemoryException. Searching a large text file... (I/O and Streams forum at JavaRanch) SGREP: Boyer-Moore regular expression searching. Boyer-Moore scanner for finite automata. Sometimes very fast. Copyright 1998, Sean Barrett - sean at nothings dot org. Abstract: An analogous algorithm to Boyer-Moore string searching, but for regular expressions not strings (actually, for searching finite automata), allows searching in a text of n characters for a pattern whose shortest match is m characters while looking at only n/m characters in the best case, n worst case.

SGREP: Boyer-Moore regular expression searching

Searching while allowing k errors examines kn/m characters in the best case. Worst-case performance could be pretty bad, as a new finite automaton is (lazily) constructed, whose possible states number on the order of m*2^(s^2)*2^s, where s is the number of states in the original NFA; this should be compared to 2^s, the number of possible states in the traditional search DFA. Performance: The algorithm described provides a tight inner loop quite comparable to both traditional DFA scanners and to traditional Boyer-Moore scanners.
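The n/m best case is easiest to see in the plain string version of the idea. The sketch below is the Boyer-Moore-Horspool simplification for fixed strings, not the SGREP automaton construction; it counts character comparisons so the skipping behaviour is visible. The function name and the returned counter are assumptions made for illustration:

```python
def horspool_search(text, pattern):
    """Return (index of first match or -1, number of character comparisons)."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0, 0
    # For each character except the last one in the pattern, the distance
    # from its rightmost occurrence to the end of the pattern.
    shift = {c: m - 1 - i for i, c in enumerate(pattern[:-1])}
    comparisons = 0
    pos = 0
    while pos + m <= n:
        # Compare the current window right to left.
        j = m - 1
        while j >= 0:
            comparisons += 1
            if text[pos + j] != pattern[j]:
                break
            j -= 1
        if j < 0:
            return pos, comparisons
        # Skip by the shift of the window's last character; characters
        # that never occur in the pattern allow a full skip of m.
        pos += shift.get(text[pos + m - 1], m)
    return -1, comparisons
```

Searching 100 copies of `a` for `bbbbb` inspects one character per window and skips five positions each time, which is the n/m behaviour the abstract refers to.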

Combining Boyer-Moore String Search with Regular Expressions. Grammar-Based Specification and Parsing of Binary File Formats. William Underwood, 2012, Vol. 7, No. 1, pp. 95-106, doi:10.2218/ijdc.v7i1.217.

Grammar-Based Specification and Parsing of Binary File Formats

ANTLR - Binary support. ANTLR is overkill for binary file formats: I know of no binary file format that requires more than one (variable length) item of lookahead for processing, nor would I expect to find one--binary formats are intentionally designed and evolved.

ANTLR - Binary support

It is fairly simple to design a language for dealing with binary file formats and to support item (byte, various length integer, IEEE float and double numbers, etc) encode/decode logic for individual fields and thence to provide one or more backends for processing files. ASN.1 is an extreme example of this; when I implemented such a language, the grammar only took up two pages or so. For my language, backends included generation of C struct definitions, file reader/writer generation, and some others that I have forgotten.

ANTLR makes it easy to design, implement, and extend such DSLs, but you do not need the ANTLR machinery for processing the files.
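The point that field encode/decode logic needs very little machinery can be illustrated with a tiny declarative layout driven by Python's standard `struct` module. This is a sketch under stated assumptions, not the language the poster built: the `HEADER_SPEC` layout, the field names, and the helper functions are all hypothetical, and only fixed-length fields are handled.

```python
import struct

# Hypothetical record layout: (field name, struct format code).
HEADER_SPEC = [
    ("magic", "4s"),    # 4 raw bytes
    ("version", "H"),   # unsigned 16-bit integer
    ("count", "I"),     # unsigned 32-bit integer
    ("scale", "d"),     # IEEE 754 double
]

def compile_spec(spec, byte_order="<"):
    """Turn a field list into (names, struct.Struct); '<' is little-endian, unpadded."""
    names = [name for name, _ in spec]
    packer = struct.Struct(byte_order + "".join(code for _, code in spec))
    return names, packer

def decode(spec, data, byte_order="<"):
    """Read one record from the start of `data` into a dict."""
    names, packer = compile_spec(spec, byte_order)
    return dict(zip(names, packer.unpack_from(data)))

def encode(spec, record, byte_order="<"):
    """Write one record dict to bytes, in the field order of the spec."""
    names, packer = compile_spec(spec, byte_order)
    return packer.pack(*(record[name] for name in names))
```

A variable-length item would need one extra step: read its length field first, then build the format for the remainder. That matches the single item of lookahead mentioned in the post.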