
Forum: archived-discussions

RE: Reinventing the wheel? [ reply ]
By: Isaac Gouy on 2006-02-26 18:45
[forum:2284]
"May we keep our regexp engine?"
OK

RE: Reinventing the wheel? [ reply ]
By: Chris Kuklewicz on 2006-02-26 00:16
[forum:2282]
I have a question (mainly for Isaac) to help us with the replacement Haskell entry:

We used parallel regexps -- we have replaced that with one at a time

We used a combinator expression for regexps -- we have replaced that with string regexps.

We used a regexp evaluation engine written in Haskell (the code we copied from the CTK library) -- we would like to keep this instead of replacing it with Parsec calls.

May we keep our regexp engine?

RE: Reinventing the wheel? [ reply ]
By: Joshua Isom on 2006-02-24 19:30
[forum:2276]
The two regexes should have nearly identical behavior. The syntax is different, so here it is alongside its Perl 5 equivalent.

P5: /(?: ( (?: > .*? ) ) | .*?(\n) )*/x
P6: /[ ( [ \> \N*: ] ) | \N*:(\n) ]*/

It then uses the Perl 5 equivalent of @+ and @-, which store the offsets of the beginnings and ends of matches. Doing the 'match-change-match-change' method would require messing with the internals of PGE. If it did mess with PGE internals, it could use something more like Perl 5's s/^>.*\n|\n//go, which would have the exact same behavior, but in my opinion the added lines of code and the risk to future compatibility would outweigh that. If PGE adds a method for doing this, I will use it, but it does not yet exist.
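
For what it's worth, the "collect offsets, delete afterwards" idea is language-neutral. A minimal sketch in Haskell, purely as an illustration and not taken from any entry:

-- Given the (begin, end) offsets of the matches -- half-open, sorted,
-- non-overlapping, as @-/@+ would give them -- rebuild the string with
-- those spans removed in a single pass, instead of substituting per match.
removeSpans :: [(Int, Int)] -> String -> String
removeSpans = go 0
  where
    go _ [] rest = rest
    go i ((b, e) : more) rest =
        let (keep, fromB) = splitAt (b - i) rest  -- text before this match
            afterMatch    = drop (e - b) fromB    -- skip the matched span
        in  keep ++ go e more afterMatch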

RE: Reinventing the wheel? [ reply ]
By: Isaac Gouy on 2006-02-24 18:58
[forum:2275]
the conventional '^>.*\n|\n'

please also provide an implementation with "the conventional"

RE: Reinventing the wheel? [ reply ]
By: Isaac Gouy on 2006-02-24 18:56
[forum:2274]
"unreasonable to disallow the use of functional or table based substitutions. They are used very often in practice."

And we could look at functional or table based substitutions as yet another piece of functionality that isn't currently being exercised by these tests.

The Computer Language Shootout /is/ horribly compromised by "dumbing-down" to some arbitrary level, so that many language implementations can play.


RE: Reinventing the wheel? [ reply ]
By: Isaac Gouy on 2006-02-24 18:30
[forum:2273]
"setting a precendent for the shootout"
The shootout weighs apples & oranges based on adhoc decisions about what should be allowed on the scales. Of course, this is completely unsatisfactory and so are the obvious alternatives. (There's no escaping the basic silliness of weighing apples & oranges.)

"Text.Regex ... It's not the only regular expressions lib distributed with the Haskell base libraries"
Hmmm has anyone tried doing regex-dna with these other regex libs?

"makes the shootout less fun"
There's always the 'interesting alternative programs' fun-room.

RE: Reinventing the wheel? [ reply ]
By: Nobody on 2006-02-24 12:14
[forum:2272]
Josh, sorry to intrude, but did you notice that Text.Regex is marked "Stability: experimental"? I read that as "if you have the choice, use something different".

And I'd argue that Haskell's native regexes go by the name of Parsec (or ReadP, whatever you like better). Sure, they have unconventional syntax, and they provide far more than regexes, but they are still Haskell-native regexes.

--Udo

RE: Reinventing the wheel? [ reply ]
By: Don Stewart on 2006-02-24 04:22
[forum:2269]
"Code that uses that library, rather than the library that is part of the GHC implementation, does not test the performance of the GHC implementation. If the CTK library shipped with GHC, then the answer would be different."

I think this would be a new and difficult precedent to adopt. We would be setting a precedent for the shootout that:

a) You can only use what is "shipped with the language"

b) You cannot work around flaws in those libraries

c) (And possibly) Except if your language doesn't provide anything -- then you can implement whatever extra code you want, possibly even installing it into CVS.

Firstly, c) with b) seems obviously unfair. If languages with no support can hand code workarounds, then languages with some support should be able to hand code as well.

Now, a) is a very difficult rule. What is "shipped with the language"? Only that which comes with the compiler? Does this rule out C's pthreads or GMP, or cpan modules? Can C only use libc?

It's unclear how to define this for each language -- as what is shipped is a matter of packaging and policy, not of language implementation quality.

For example, GHC distributes a smallish base library, with all other code offloaded to 'packages'. As yet we haven't attempted to use any of these packages. Perhaps we've been unnecessarily strict on ourselves?

If C can use pthreads or libgmp, an external package to GCC, then perhaps GHC/Haskell can use other external Haskell packages? How do we decide how 'external' is too external?

(Btw, I only cite C because its entries are the examples I've studied most closely.)


Coming back to the Haskell case. Requiring that we use Text.Regex is particularly irksome for us I think, as:
* It is an imperative binding to C's regex.c
* It's rarely used
* It's not the only regular expressions lib distributed with the Haskell base libraries

Using it only tests the quality of C's regex library; it's not an indication of the quality of the Haskell language/compiler/implementation in any way.

Using CTK, Parsec or some other purely Haskell regex lib tells you much more about the quality of a _GHC-compiled Haskell implementation_ of regular expressions.
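
For concreteness, a rough sketch (not the actual entry) of the counting half of regex-dna written against Text.Regex -- every match is one more call into the C binding:

import Text.Regex (Regex, mkRegex, matchRegexAll)

-- Count non-overlapping matches by repeatedly re-matching on the remainder;
-- each call goes through Text.Regex into the underlying C regex engine.
countMatches :: Regex -> String -> Int
countMatches re s =
    case matchRegexAll re s of
        Nothing              -> 0
        Just (_, _, rest, _) -> 1 + countMatches re rest

-- e.g. countMatches (mkRegex "agggtaaa|tttaccct") dna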

I argue that:
* Work arounds for missing libs, or poor libs, have always been allowed, with the penalty of more lines of code

* What constitutes a work-around is more often a matter of packaging/distribution policy than of language implementation quality, and is thus not something we can make sensible rules about.

* There is a grey area between what constitutes 'part of the language' and what doesn't. Formulating a clear definition of this seems difficult or meaningless at best.

Finally, ruling out work-arounds makes the shootout less fun :)

-- Don

P.S. Isaac, can you rule on what should happen with this benchmark?

RE: Reinventing the wheel? [ reply ]
By: Josh G0ldfoot on 2006-02-24 03:27
[forum:2268]
Mike:

The problem with [BDHKMNRSVWY] is that some entries are using it and others are not. That makes comparisons between entries less valid: if some entries are doing 11 regex searches while others are doing a single but more complicated search, then the same-way rule is violated. At the very least, all of our entries need to be running the same regex searches.

I am not sure I understand your point about C. Using standard libraries that ship as part of an implementation (like GCC's regex) is uncontroversial. Re-inventing and running away from them is not.

Don:

1) As I said the lack of regex string parsing is only a small part of the problem. The big problem is re-inventing the Text.Regex wheel.

2) Where the CTK library is installed does not make a difference. Code that uses that library, rather than the library that is part of the GHC implementation, does not test the performance of the GHC implementation. If the CTK library shipped with GHC, then the answer would be different.

RE: Reinventing the wheel? [ reply ]
By: Mike Pall on 2006-02-23 23:53
[forum:2267]
Ok, so by your definition C is out of the game.
C doesn't ship with a regex library. In fact C
doesn't have a single built-in function. So remove
all C entries from the contest. Happy now?

Sorry, but this is a complete non-argument.
If a language is extensible it's entirely
reasonable to allow the use of its extensions.

[Well, extensions that are generally used to solve
problems of its category. GMP, PCRE or sockets
are reasonable examples. An extension to solve
fannkuch in hand-tuned assembler is of course
pointless and/or cheating. Submit an assembler
entry instead.]

And about [BDHKMNRSVWY]: Yes, this _is_ a regex
substitution. All modern regex libraries/bindings
support functional substitution and/or table based
substitution (in addition to the more traditional
string substitution).

In Lua 5.1 you can use all three variants:

string.gsub("abcdef", "[bd]", "x%0x")
string.gsub("abcdef", "[bd]", function(c) return c..f(c) end)
string.gsub("abcdef", "[bd]", { b="foo", d="bar" })

And as I said: Lua is really minimal ...
Perl, Python, PCRE are all able to do this.

IMHO it's unreasonable to disallow the use of
functional or table based substitutions. They are
used very often in practice.

If a regex binding for a specific language
doesn't have such a feature: tough luck, you can
easily work around it, but it'll show in the score.
This is no different from having/not having '|'.
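
For instance, a binding whose built-in substitution is string-only (Haskell's Text.Regex subRegex, say) can layer a functional substitution on top of plain matching. A sketch only -- gsubWith is a hypothetical helper, not part of any standard library:

import Text.Regex (Regex, matchRegexAll)

-- Hypothetical helper: apply a function to every non-overlapping match,
-- in the style of Lua's string.gsub with a function argument.
gsubWith :: Regex -> (String -> String) -> String -> String
gsubWith re f s =
    case matchRegexAll re s of
        Nothing                        -> s
        Just (before, match, after, _) -> before ++ f match ++ gsubWith re f after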

RE: Reinventing the wheel? [ reply ]
By: Nobody on 2006-02-23 22:51
[forum:2266]
"We get a result that shows that Haskell requires gigantic programs to
do something other languages do in 30 lines"

No. The body of the entry is in fact 30 lines of code (roughly). But
there's the 100 line CTK regex library added to the rest of the entry.
(Everything from "And everything else is the CTK library"). GHC doesn't
require gigantic programs, it requires you import the standard CTK
library -- like C requires you to have pthreads installed for chameneos.

I've punished the entry in terms of lines of code here, by inlining the
CTK lib, rather than installing it. The latter would have been neater.

RE: Reinventing the wheel? [ reply ]
By: Nobody on 2006-02-23 22:39
[forum:2265]
"Seeing a flaw in the GHC regex implementation, the Haskell GHC #2 entry
tries to give the GHC regex implementation a "good" score nonetheless by
running away from it. That's useless as a measure of implementation
performance."

Seems like a very useful result:
1) One of Haskell's libraries, Text.Regex, is useless for large
amounts of data, GHC #1 shows this

we (the Haskell community) didn't know this until we attempted the shootout.
But are we to then conclude that Haskell is useless for this problem? Many would.

No, by using the standard CTK regex library you get GHC #2, a very
interesting purely functional solution, with the result that:

2) You can still write reasonable regex programs in Haskell

Josh, two questions:

1) Would you be satisfied if it used string-based regular expressions?
(I have written a variant that does indeed do this).

2) Would you still object if the CTK regex library was not appended
to the GHC #2 entry, but was instead installed on the system (a
la pthreads or gmp or hashtable.c)?

It's a standard Haskell library, after all, I just chose to
append it to the GHC #2 entry. (Does this confuse people into
thinking I hacked it up specially? I didn't -- you can find it
from haskell.org if you want it.)

My opinion is that requiring the regexes be in the form of strings
unfairly punishes purely functional languages that use higher-order
combinators for these problems -- but I'm willing to add a preprocessor
to the entry that converts strings into combinators.

Or we could add another regexdna benchmark to test how well languages
implement higher-order functions...

Cheers,
Don

RE: Reinventing the wheel? [ reply ]
By: Josh G0ldfoot on 2006-02-23 21:55
[forum:2264]
I am curious about how well each language implementation does regex searches. My curiosity is not satisfied by comparing entries that go out of their way to avoid using their language's implementation of regex searches. I don't agree that the standard should be "try to use whatever is used normally in your language." This isn't supposed to be an anthropological study of how people normally use programming languages. It's a study of the performance of language implementations.

Regarding the Haskell GHC #2 entry specifically, I remain convinced that it does not comply with the benchmark. Its lack of actual regex string parsing is only a small part of the problem; the problem is that the entry is not using GHC's regex module at all. Seeing a flaw in the GHC regex implementation, the Haskell GHC #2 entry tries to give the GHC regex implementation a "good" score nonetheless by running away from it. That's useless as a measure of implementation performance. We get a result that shows that Haskell requires gigantic programs to do something other languages do in 30 lines (something that is rarely true) and that suggests GHC offers programmers tools to handle regular expressions faster than C# Mono.

Also: When language implementations don't ship with regex, I think they should sit this benchmark out. That's some nifty Fortran programming, but it is not useful as a measure of GFortran's implementation of regex searches.

I do think it is OK for the Lua entry to use the | workaround, now that I understand that Lua does not offer | natively. It is a workaround, not a running-away, and the result is very close to regexp parsing that other languages offer. I still disapprove of the Lua entry's use of [BDHKMNRSVWY].

RE: Reinventing the wheel? [ reply ]
By: Joshua Isom on 2006-02-23 04:12
[forum:2259]
I figured this would be a thread I could weigh in on reasonably well, perhaps. I wrote the PIR version of regex-dna (it may time out, but it is correct). Parrot supports binding to PCRE (albeit not for me), but it contains PGE, the Parrot Grammar Engine, which actually supports several regex styles (Perl 5, shell glob, and primarily Perl 6 rules) and is written solely in PIR (just as PCRE is written in C). I chose to use the Perl 6 rules, which do not follow the regex man page. Aside from comments I should have deleted before submitting, there are some comments in there I would like to point out: "Ok, using a regex to match a single fixed character is probably excessive. But it's what's wanted..." Also, the specs say "regex substitution", but regexes are used for matching and, in many cases, capturing. PIR lacks any regex-based substitution other than a custom one; for a relatively simple regex such as this, the beginning and ending offsets of a match are sufficient. The P6Rule patterns are very different, so it takes the extra measure of storing the "expected" output regexes to get the proper output. Since not all regexes use the same syntax, even printing the regex seems odd.

If shell glob supported something like '{[cgt]gggtaaa,tttaccc[acg]}', would it be valid still? Perhaps the benchmark should be modified to use a wider range of regex features to dissuade custom engines. Perhaps not say "custom engines are not allowed," but instead make it very difficult. If a "native" regex does not support a necessary option, compensating for it should be allowed.

Regarding question 2, the P6Rule I chose (with help from others) is '[ ( [ \> \N*: ] ) | \N*:(\n) ]*'. Whitespace is ignored (to encourage readable regexes), and it looks little like the conventional '^>.*\n|\n', but the effect is largely the same. As an optimization, it's a "global" match: it matches the entire FASTA file. By making use of capturing rules, it stores the beginning and ending of each match and later removes them (the Perl 5 s/// method would be match-change-match-change). The match-change-match-change approach would have required more complexity and likely slowed it down. Would that be an acceptable optimization? There is no regex-based substitution, only matching/capturing. I feel that either this departure from the specs should be noted, as with the single-character matching, or the program should be disqualified. After all, it can demonstrate how well the regex engine can optimize a single-character match.

A custom engine would of course be faster. But using the "native" method where available is what people will use. It benchmarks the language implementation, not the program, and after all, isn't that part of the intent? True, the compiler options affect the timings of the program (a fair fault I find with the benchmarks is inconsistency in that regard), but one of the things the benchmarks are supposed to do is show the languages' speeds. Many of the programs are math-oriented, or do things that use few high-level concepts. This one is a benchmark that does not measure such things, so in a sense it is a benchmark of the language's regex speed. For C, it should use a "common" regex library instead of a custom one. For PIR, the "native" regex library is PGE, which is written in PIR (it's continually developed and not yet optimized for speed). Binding to PCRE is possible, but PCRE is not the native engine. The benchmark, in my opinion, should show "real language usage," with perhaps some exceptions such as the partial-sums benchmark.

But that's just my two cents, or so.

RE: Reinventing the wheel? [ reply ]
By: Isaac Gouy on 2006-02-23 04:00
[forum:2258]
"Showing advantages and disadvantages of a language and its libraries"

Sorry.
FAQ "We are not trying to ...
showcase the capabilities of different languages..."

RE: Reinventing the wheel? [ reply ]
By: Nobody on 2006-02-23 02:44
[forum:2257]
"Forcing that a specific external library with full PCRE functionality _must_ be used makes the who benchmark pointless."

Yes.

So then, I think we've established that there is one new requirement to add to the spec:

* All parts of the problem should be solved with regular expression matching and substitution. String search or substitution is illegal.

But we should leave open the particular form the regular expression functionality takes, whether it be:

* pcre
* regex(3)
* combinator based regexes
* other forms, e.g. Lua's missing |

You should try to use whatever is normally used in your language.

-- Don

RE: Reinventing the wheel? [ reply ]
By: Mike Pall on 2006-02-23 02:25
[forum:2256]
Forcing that a specific external library with full PCRE functionality _must_ be used makes the whole benchmark pointless.

Because then it becomes a contest of who has the best binding to libpcre ... instead of demonstrating how everyday users of a language do regular expression processing.

Case in point: Lua's builtin regex functions do not support '|'. This is a consequence of Lua's minimalistic design. Still every Lua user is successfully using these functions.

'|' is not missed much by Lua users. It can easily be worked around. Only rarely does someone ask for a PCRE binding (usually because of the different syntax).

Note that in this particular benchmark it's actually slower to scan the huge string twice for every pattern (because of cache eviction). So Lua _does_ pay a price for its minimalism (less so than you would think in real-world scenarios).

Showing advantages and disadvantages of a language and its libraries is a part of what I want to see in the shootout. And I find idiomatic solutions very interesting and thought-provoking. (*1)

Showing who has the best binding generator for a specific C library is boring. (*2)

[Disclaimer: I wrote/tuned almost all of the Lua entries.]

(*1) E.g. I'm thinking about generating real machine code for regex patterns in a future version of LuaJIT. This isn't too difficult and would be very useful. Text processing in Lua is very regex-centric due to the minimal set of available string functions.

(*2) Every language with a decent GMP binding would probably get the same (highest) score on pidigits. But the different languages use very different mpn bindings. This is the only thing which keeps this benchmark somewhat interesting.

RE: Reinventing the wheel? [ reply ]
By: Nobody on 2006-02-23 02:13
[forum:2255]
2) I was thinking of the strength-reducing optimisations: for k^0.5 it's ok to use the power or sqrt functions, for example.
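
For example (a sketch, not from any entry), a 1/k^0.5 term can be written either way:

termPow, termSqrt :: Double -> Double
termPow  k = 1.0 / k ** 0.5   -- the form closest to the spec
termSqrt k = 1.0 / sqrt k     -- the strength-reduced form; same value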

So perhaps we could specify some alternative regular expressions that
are also considered valid, along the lines of Josh's examples.


RE: Reinventing the wheel? [ reply ]
By: Nobody on 2006-02-23 02:09
[forum:2254]
I don't understand this point. By PCRE do you mean PCRE _represented as strings_?
Would you be inclined to disallow PCRE represented with combinators?

Why would this be ok:

clean = (regex "." `action` \[c] -> Just c)
>||< (regex "\n|>[^\n]+\n" `action` const Nothing)

when the semantically identical:

clean = (dot `action` \[c] -> Just c)
>||< (char '\n' >|< char '>' +> anyButNL `star` char '\n' `action` const Nothing)

is not legal?

I would strongly argue that: it's the PCRE semantics that matter, not the syntax
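
To spell the equivalence out, using a made-up combinator type purely for illustration (this is not CTK's real API), the string "\n|>[^\n]+\n" is just notation for a value like:

-- A toy regex representation; combinator and string forms denote the same language.
data RE = Chr Char   -- literal character
        | AnyButNL   -- any character except '\n'
        | Seq RE RE  -- concatenation
        | Alt RE RE  -- alternation, '|'
        | Star RE    -- zero or more, '*'

cleanPat :: RE
cleanPat = Chr '\n'
    `Alt` (Chr '>' `Seq` AnyButNL `Seq` Star AnyButNL `Seq` Chr '\n')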

Cheers,
Don

RE: Reinventing the wheel? [ reply ]
By: Isaac Gouy on 2006-02-23 01:45
[forum:2253]
"I'll cite..."
Which just moves me to add "PCRE" to the description ;-)

RE: Reinventing the wheel? [ reply ]
By: Isaac Gouy on 2006-02-23 01:42
[forum:2252]
1) "C's hashtable" - afaik (which isn't far) C doesn't provide a hashtable, so there's little option but to implement one.

2) "it's ok in other benchmarks (partial-sums comes to mind)" - not sure exactly what you mean, code example...

3) chameneos - yes you have to use something that looks more like concurrency than an integer queue, but that something could be POSIX threads, green threads, coroutines ...

RE: Reinventing the wheel? [ reply ]
By: Nobody on 2006-02-23 01:38
[forum:2251]
On the question of "how specialised should a custom library be". Hmm.
It's a good question.

a) A full regex(3) library would be ok.

b) Enough to solve the problem would be ok, i.e.:

* + | [] () $ ^ [^]

c) Only implementing | and [] would not be ok.

------------------------------------------------------------------------

A 4th point:

4) The _syntax_ of the regular expression language is not important, as long as
the _semantics_ hold. That is, you can implement the operators from point b) above.

Einar and I clearly hold the view that combinator-based patterns are valid.
An embedded domain-specific combinator language has well-known advantages:
you get statically typesafe regular expressions (instead of using an untyped
string to represent regular expressions). I'll cite:

Graham Hutton. Higher-order functions for parsing. (1992)
Journal of Functional Programming 2(3): 323-343.

Jeroen Fokker. Functional Parsers. (1995)
Lecture Notes of the Baastad Spring school on Functional Programming.

Graham Hutton and Erik Meijer. Monadic Parser Combinators. (1996)
Technical report NOTTCS-TR-96-4. Department of Computer Science, University of Nottingham.

Steve Hill. Combinators for parsing expressions. (1996)
Journal of Functional Programming 6(3): 445-463.

Andrew Partridge and David Wright.
Predictive parser combinators need four values to report errors. (1996)
Journal of Functional Programming 6(2): 355-364.

Doaitse Swierstra and Luc Duponcheel.
Deterministic, Error-Correcting Combinator Parsers. (1996)
Advanced Functional Programming. LNCS 1129: 185-207.

Pieter Koopman and Rinus Plasmeijer. Efficient Combinator Parsers. (1999)
Implementation of Functional Languages. Springer Verlag, LNCS 1595: 122-138.

Doaitse Swierstra and Pablo Azero.
Fast, Error Correcting Parser Combinators: A Short Tutorial. (1999)
SOFSEM'99 Theory and Practice of Informatics. LNCS 1725: 111-129.

Arthur Baars, Andres Loh, and Doaitse Swierstra. Parsing Permutation Phrases. (2001)
Proceedings of the ACM SIGPLAN Haskell Workshop, 171-183.

-- Don

RE: Reinventing the wheel? [ reply ]
By: Nobody on 2006-02-23 01:27
[forum:2250]
In response to Josh's thoughtful questions, my views are:

Q1. Yes. Writing your own lib is ok.
Precedent: writing your own lib has been an option across many benchmarks.
(C's hashtable, and many others)

Q2. Yes. Substitution of more efficient regexes should be ok.
Precedent: it's ok in other benchmarks (partial-sums comes to mind).

Q3. No. Non-regex search should never be allowed.
Precedent: you _have_ to use threads in the chameneos threads benchmark
(and not integer queues for example), so you should have to use regexes here.

-- Don

RE: Reinventing the wheel? [ reply ]
By: Isaac Gouy on 2006-02-23 01:21
[forum:2249]
I'm likely to allow some flexibility where language implementations don't provide full pcre - iirc Lua doesn't provide |

Apart from that your viewpoint seems reasonable.

RE: Reinventing the wheel? [ reply ]
By: Josh G0ldfoot on 2006-02-23 00:07
[forum:2248]
1) No, 2) Yes, and 3) No, which is why I think all 5 violate the "same way" rule and should be Interesting Alternatives. It's regex-dna, not strstr-dna.
