
Comments/Ratings for a Single Item

Piece Values [Subject Thread]
Derek Nalls wrote on Wed, Jun 18, 2008 07:06 AM EDT:
Using the mirror of Embassy Chess as a *.fen, TJChess10x8 runs fine now
under Winboard F.  Thanks!

Reinhard Scharnagl wrote on Wed, Jun 18, 2008 06:53 AM EDT:
H.G.M.: '... This move blundered away the Queen with which Smirf was
supposed to mate, after which Fairy-Max had no trouble winning with an
Archbishop against some five Pawns. ...'

Well, there still is a mating bug within SMIRF. Though I have improved
SMIRF's behavior near mating situations (key owners may request a copy
of this unpublished engine for testing), the bug still seems to be there.
There might be a minimal chance that it is sometimes caused by a
communication problem in the adapter, but I am still convinced that it
is caused by a bad internal design for storing evaluations, which will
hopefully be corrected in Octopus sometime ...

H. G. Muller wrote on Wed, Jun 18, 2008 03:24 AM EDT:
OK, I see the problem now. I forgot that the Embassy array is a mirrored
one, with the King starting on e1, rather than f1. And that to avoid any
problems with it in Battle of the Goths, I did not really play Embassy,
but the fully equivalent mirrored Embassy. And with that one, none of the
engines had problems, of course.

Actually it seems that it is not TJchess that is in error here: e1b1 does
seem to be a legal castling in Embassy. It is WinBoard_F which unjustly
rejects the move, most likely because its FEN reader ignores specified
castling rights for which it does not find a King on f1 and a Rook in the
indicated corner.
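
As an illustration of how such a bug could arise, here is a hypothetical
sketch in C (not WinBoard's actual source; all names are made up) of a FEN
reader that validates castling rights against the Capablanca array:

/* Hypothetical sketch of a FEN reader that silently drops castling
   rights when the pieces are not where Capablanca chess expects them.
   Board layout (8 ranks x 10 files, a-j) and names are illustrative. */
int accept_castling_right(char board[8][10], int white, int kingside)
{
    int rank    = white ? 0 : 7;          /* rank 1 or rank 8        */
    char king   = white ? 'K' : 'k';
    char rook   = white ? 'R' : 'r';
    int corner  = kingside ? 9 : 0;       /* j-file or a-file corner */

    /* Demands the King on the f-file (index 5). Embassy's King on
       e1 (index 4) fails this test, so its castling rights would be
       dropped, and e1b1 later rejected as illegal. */
    return board[rank][5] == king && board[rank][corner] == rook;
}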

The fact that you don't have this problem with Joker80 is because Joker80
is buggy. (Well, not really; it is merely outside its specs. Joker80
considers all castlings with a non-corner Rook and King not in the f-file
as CRC castlings, which are only allowed in variant caparandom, but not in
variants capablanca or *UNSPEAKABLE*. And Joker80 does not support
caparandom yet.) So the fact that you don't see any problems with Joker80
is because it will never castle when you feed it the Embassy setup, so that
WinBoard doesn't get the chance to reject the castling as illegal. And if
the opponent castles, WinBoard would reject it as illegal, and not pass it
on to Joker80.

I guess the fundamental fix will have to wait until I implement variant
caparandom in WinBoard; I think that both WinBoard and Joker80 are correct
in identifying the Embassy opening position as not belonging to Capablanca
Chess, but needing the CRC extension of castling. (Even if it is only a
limited extension, as the Rooks are still in the corner.) And after I fix
it in WinBoard, I still would have to equip Joker80 with CRC capability
before you could use it to play the Embassy setup.

It is not very high on my list of priorities, though, as I see little
reason to play Embassy rather than mirrored Embassy.

Derek Nalls wrote on Tue, Jun 17, 2008 08:14 PM EDT:
Hecker:

It was fairly easy for me to replicate the bug I experienced.  In fact, I
have never successfully played a computer vs. computer game to completion
using TJChess10x8 in my life.  So, you should be able to replicate the bug
I experienced using the information I have provided.  I hope you can fix it
as well.

Bug Report
TJChess10x8
http://www.symmetryperfect.com/report

H. G. Muller wrote on Tue, Jun 17, 2008 05:40 PM EDT:
| However, TJChess cannot handle my favorite CRC opening setup, 
| Embassy Chess, without issuing false 'illegal move' warnings and 
| stopping the game.

Remarkable. I played this opening setup too, in Battle of the Goths, and
never noticed any problems with TJchess. It might have been another
version, though.

If you have somehow saved the game, be sure to send it to Tony, so he can
fix the problem.

Derek Nalls wrote on Tue, Jun 17, 2008 01:57 PM EDT:
'Of course you could also use Joker80 or TJchess10x8, which do not suffer
from such problems.'
____________________

While you were on vacation, I started a series of 'minimized
asymmetrical playtests' using SMIRF.  So, I will complete them using SMIRF.

Joker80, running under Winboard F, has never acted buggy in computer
vs. computer games.  However, TJChess cannot handle my favorite CRC
opening setup, Embassy Chess, without issuing false 'illegal move'
warnings and stopping the game.

H. G. Muller wrote on Tue, Jun 17, 2008 01:27 PM EDT:
Well, never mind. The symmetrical playtesting would not have given any
conclusive results with anything less than 2000 games anyway.

The asymmetrical playtesting sounds more interesting. I am not completely
sure what Smirf bug you are talking about, but in the Battle of the Goths
Championship it happened that Smirf played a totally random move when it
could give mate in 3 (IIRC) according to both programs (Fairy-Max was the
lucky opponent). This move blundered away the Queen with which Smirf was
supposed to mate, after which Fairy-Max had no trouble winning with an
Archbishop against some five Pawns. 

This seems to happen when Smirf has seen the mate, and stored the tree
leading to it completely in its hash table. It is then no longer
searching, and it reports score and depth zero, playing the stored moves
(at least, that was the intention).

I have never seen any such behavior when Smirf was reporting non-zero
search depth, and in particular, the last non-zero-depth score before such
an occurrence (a mate score) seemed to be correct. So I don't think there
is much chance of an error when you believe the mate announcement and call
the game.

Of course you could also use Joker80 or TJchess10x8, which do not suffer
from such problems.

Derek Nalls wrote on Tue, Jun 17, 2008 11:44 AM EDT:
Muller:

Thank you for the helpful response.  Frankly, I considered my own question
so obvious as to be borderline-stupid but I just wanted to be certain.

The following entries within the 'winboard.ini' file should enable me to
playtest (limited) randomized and non-randomized versions of Joker80
against one another.  Does it look alright?  If/When I run out of more
pressing playtesting missions, I may undertake this one after all.

/firstChessProgramNames={'Joker80 22' /firstInitString='new\n'
'Joker80 22'
}
/secondChessProgramNames={'Joker80 22' /secondInitString='new\n'
'Joker80 22'
}

Unfortunately, I no longer plan to playtest sets of CRC piece values by
Muller, Scharnagl and Nalls against one another.  I think having the pawn
set to 85 and the queen set to 950 (as required by Joker80) for all three sets of material values would have the unintentional side effect of equalizing their scales (which are normally different).  This means that the Muller set would, in fact, be tested against something other than a true, accurate representation of the Scharnagl and Nalls sets.

I am currently in the midst of conducting several 'minimized asymmetrical
playtests' using SMIRF at moderate time controls.  I want to tentatively
determine who is correct in disagreements between our models involving 2:1
or 1:2 exchanges (with supreme pieces).  I have to avoid its checkmate bug,
though.  This requires me to take back one move whenever the program
declares checkmate and to 'call the game' if a sizeable material and/or
positional advantage indisputably exists for one player.  Fortunately,
this is almost always the case.  I will give a report in a few-several
weeks.

H. G. Muller wrote on Tue, Jun 17, 2008 02:51 AM EDT:
George Duke:
| Has initial array positioning already entered discussion for 
| value determinations?

No, it hasn't, and I don't think it should, as this discussion is about
Piece Values, and not about positional play. Piece values are by
definition averages over all positions, and thus independent of the
placement of pieces on the board.

Note furthermore that the heuristic of evaluation is only useful for
strategic characteristics of a position, i.e. characteristics that tend to
be persistent, rather than volatile. Piece placement can be such a trait,
but not always. In particular, in the opening phase, pieces are not locked
in the places they start, but can find plenty better places to migrate to,
as the center of the board is still complete no-man's land. Therefore, in
the opening phase, the concept of 'tempo' becomes important: if you waste
too much time, the opponent gets the chance to conquer space, and prevent
your pieces that were badly positioned in the array from developing properly.

I did some asymmetric playtesting for positional values in normal Chess, swapping Knights and Bishops for one side, or Knights and Rooks. I was not able to detect any systematic advantage the engines might have been deriving from this. In my piece value testing I eliminate positional influences by playing from positions that are as symmetric as possible given the material imbalance. And the effect of starting the pieces involved in the imbalance in different places is averaged out by playing from shuffled arrays, so that each piece is tried in many different locations.

H. G. Muller wrote on Tue, Jun 17, 2008 02:33 AM EDT:
Derek:
| Could you please give me example lines within the 'winboard.ini' 
| file that would successfully do so?  I need to make sure every 
| character is correct.

Sorry for the late response; I was on holiday for the past two weeks. The
best way to do it is probably to make the option dependent on the engine
selection. That means you have to write it behind the engine name in the
list of pre-selectable engines like:

/firstChessProgramNames={...
'C:/engines/joker/joker80.exe 23' /firstInitString='new\n'
...
}

And something similar for the second engine, using /secondInitString. The
path name of the joker80 executable would of course have to be where you
installed it on your computer; the argument '23' sets the hash-table
size. You could add other arguments, e.g. for setting the piece values,
there as well. Note that the executable name and all engine arguments are
enclosed by the first set of quotes (which are double quotes, but these
for some reason refuse to print in this forum), and everything after this
first syntactical unit on the line is interpreted as WinBoard arguments
that should be used with this engine when it gets selected. Note that
string arguments are C-style strings, enclosed in double quotes, and
making use of escape sequences like '\n' for newline. The default value
for the init strings is 'new\nrandom\n'.
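
Since the forum strips the double quotes, here is the same fragment with
the intended quoting restored (reconstructed from the description above):

/firstChessProgramNames={...
"C:/engines/joker/joker80.exe 23" /firstInitString="new\n"
...
}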

Gary Gifford wrote on Mon, Jun 2, 2008 04:44 PM EDT:
When GD writes: 'Computers will never write rhymed lines this century where every syllable matches in rhyme like: ''The avatar Horus' all-seeing Eye/ We have a star-chorus rallying cry.'' Granted most would not like style of writing, but still Computer cannot do it, rhyme every word with meaning.'

I bet if you offered a $20,000 reward we'd see many programs coming to meet the poetic challenge within a matter of months. You can read about computer generated writing here:

http://www.evolutionzone.com/kulturezone/c-g.writing/index_body.html

Anyway, I believe that computers are up to such a poetic task... it just takes a motivated programmer.

Back to CVs: Chess is a great game. And just because computers can play it far better than most, are we to discard it? I don't think so; not as long as humans play humans and enjoy the game while doing so. The same goes for other variants.

As for the poetry, just because computers don't write that style certainly doesn't motivate me to do so.


George Duke wrote on Mon, Jun 2, 2008 03:27 PM EDT:
With Centaur and Champion(RN) the array must affect values on 8x10
especially. Detraction of 0.1 or more for both cornered, one would expect.
In Falcon Chess, of the 453,600 initial arrays, cornered positions for
Falcon lower value relatively. Cheops' 'FRNBQ...' or Pyramids'
'FBRNQ...' each take away 0.1 or 0.2 of more general 6.5. Templar's
'RBFNQ...' and Osiris' 'RNBQFF...' are harder to distinguish from
standard 'RNBFQ...' and 'RNFBQ...' Has initial array positioning
already entered discussion for value determinations?

George Duke wrote on Mon, Jun 2, 2008 02:29 PM EDT:
Let's open the discussion to designers not actively programming 
now. There is a lot of two-tracking in CVPage. One example of double track
is that most designers see their work as art and become prolificists
(Betza, Gilman), the more ''paintings'' in their portfolio the better;
whereas few others want to replace standard FIDE form logically (Duke,
Trice, FischerRandom, Duniho's Eurasian, Joyce's Great Shatranj). The two camps talk at cross-purposes. Two other (different) opposite tracks may be seen in this thread, namely, between player and programmer. The straightforward heuristic for the player (usually designer too hereabouts), to make ongoing alterable piece-value estimates, certainly refining if possible to within 0.1 of a Pawn, there being so many hundreds of CVs to compute, of course will not do in itself for the programmer. It is interesting, that's all, that the player's recipe is rejected immediately by the programmer. Player would gravitate to '1)' and '4)' rather than programmer-popular '2)' and '3)'. Another topic to relate here is the proven fallacy, after 400 years, of emphasized Centaur(BN) and Champion(RN) anyway, discussed much in 2007, to be resurrected in follow-up. // In response to Gifford's: Computers will never write rhymed lines this century where every syllable matches in rhyme like: ''The avatar Horus' all-seeing Eye/ We have a star-chorus rallying cry.'' Granted most would not like the style of writing, but still the Computer cannot do it, rhyme every word with meaning. Similarly, we need games the Computer cannot play well, or be expected to ever, using hidden information like Kriegspiel if that's what it takes, or Rules changing within the score, or something else. Surely the main reason for vanishing interest in Mad Queen is Computer dominance in all aspects.

Derek Nalls wrote on Mon, Jun 2, 2008 09:43 AM EDT:
Upon reflection, I have no conceivable reason to be distrustful of using
Joker80 IF I shut off its limited randomization of move selection, which
Winboard F activates by default.  

Could you please give me example lines within the 'winboard.ini' file
that would successfully do so?  I need to make sure every character is
correct.

H. G. Muller wrote on Sun, Jun 1, 2008 01:24 PM EDT:
George Duke:
| However, the reality is if one is playing many CVs, precisely 
| Number One, not any of the other 3, is far and away the most valuable
| and reliable tool, effectively building on experience. Time is also a
| factor, and unless Player can adjust quickly, without extensive
| playtesting, and make ballpark estimates of values, all is lost on 
| new enterprise. We recommend just this Method One, increasing
| facility at it, for serious CV play, and in turn the designer
| needs to try to keep the game somewhat out of reach for Computer.

Well, I guess that it depends on what your standards are. If you are
satisfied with values that are sometimes off by 2 full Pawns, (as the case
of the Archbishop demonstrates to be possible), I guess method #1 will do
fine for you. But, as 2 Pawns is almost a totally winning advantage, my
standards are a bit higher than that. If I build an engine for a CV, I
don't want it to strive for trades that are immediately losing.

Gary Gifford wrote on Sun, Jun 1, 2008 05:53 AM EDT:
In response to GD's previous comment, his very last line, '...for serious CV play, and in turn the designer needs to try to keep the game somewhat out of reach for Computer.'

From what I have seen in regard to both variants and programmers, it seems logical to conclude that any game a human mind can play, a program can be written for. The program may be flawed, but the bugs can be worked out.

In my opinion, designers need not worry about computers. If you make a great game, likely someone will get a computer to play it. That is not to say all great games end up having associated programs... but they could.


George Duke wrote on Sat, May 31, 2008 04:10 PM EDT:
''Educated guessing based on known 8x8 piece values and assumptions on
synergy values of compound pieces'' -- [immediately] Muller rejects it out
of hand from his list of four 3.May.2008. ''We can safely dismiss method
(1) as unreliable...'' Then he touts the more scientific, roughly: 2)
board-averaged piece mobilities 3) best fit from computer-computer games
deliberately imbalanced 4) Playtesting.  However, the reality is if one is
playing many CVs, precisely Number One, not any of the other 3, is far and
away the most valuable and reliable tool, effectively building on
experience. Time is also a factor, and unless Player can adjust quickly,
without extensive playtesting, and make ballpark estimates of values, all
is lost on new enterprise. We recommend just this Method One, increasing
facility at it, for serious CV play, and in turn the designer needs to try
to keep the game somewhat out of reach for Computer.

H. G. Muller wrote on Tue, May 27, 2008 01:13 PM EDT:
No engine I know of prunes in the root, in any iteration. They might reduce
the depth of poor moves compared to that of the best move by one ply, but
they will be searched in every iteration except a very early one (where
they were classified as poor) to a larger depth than they were ever
searched before. So at any time their score can recover, and if it does,
they are re-searched within the same iteration at the un-reduced depth.

This is absolutely standard, and also how Joker80 works. Selective search,
in engines that do it, is applied only very deep in the tree. Never close
to the root.

Derek Nalls wrote on Tue, May 27, 2008 12:11 PM EDT:
I have read that most computer chess programmers use the brute force method
initially, when the plies can be cut thru quickly, and then switch to
advanced pruning techniques to focus the search from then on.  This led
to my misinterpretation that Joker80 would have more moves under
consideration as the best at short time controls than at long time controls. 

Some moves that score highly-positive after only a few-several plies will
score lowly-positive, neutral or negative after more plies.  Thus, I do
not see how the number of moves under consideration as the best could
avoid shrinking slightly as plies are completed.  As a practical
concern, there is rarely any benefit in accepting the CPU load associated
with, for example, checking a low-score positive move returned after
13-ply completion thru 14-ply completion when other
high-score positive moves exist in sufficient number.

H. G. Muller wrote on Tue, May 27, 2008 02:14 AM EDT:
Derek:
| The moral of the story is that randomization of move selection 
| reduces  the growth in playing strength that normally occurs with 
| time and plies completed.

This is not how it works. For one, you assume that at long TC there would
be fewer moves to choose from, and that they would be farther apart in score.
This is not the case. The average distribution of move scores in the root
depends on the position, not on search depth.

And in cases where the scores of the best and second-best move are far
apart, the random component of the score propagating from the end-leaves
to the root is limited to some maximum value, and thus could never cause
the second-best move to be preferred over the best move. The mechanism can
only have any effect on moves that would score nearly equal (within the
range of the maximum addition) in absence of the randomization.

For moves that are close enough in score to be affected, the random
contribution in the end-leaves will be filtered by minimax while trickling
down to the root in such a way that it is no longer a homogeneously
distributed random contribution to the root score, but on average
suppresses scores of moves leading to sub-trees where the opponent had a
lot of playable options and we only a few, while on average increasing
scores where we have many options and the opponent only a few. And the
latter are exactly the moves that, in the long term, will lead you to
positions of the highest score.

Tony Hecker wrote on Tue, May 27, 2008 12:29 AM EDT:
I'm not very familiar with H.G.'s randomization technique, so I really
have no idea how well it works.  It sounds like he adds small random
values to leaf node evaluations, which is of course different from
selecting a random 'good' move from the root of the search.

Note that it is definitely true that randomness can be helpful for a chess
engine, even though it might seem counter-intuitive.

For example, basically all strong chess engines (as far as I know) use
random (pseudo-random) Zobrist keys for hashing.  The random keys may be
generated at run-time, or pre-generated, but they are random either way. 
Using different random keys will cause the engine to give slightly
different results without necessarily changing the engine's overall
strength.
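
For readers unfamiliar with the technique, here is a minimal Zobrist-hashing
sketch in C (illustrative only, not taken from any engine mentioned in this
thread):

/* One random 64-bit key per (piece, square) pair; the position hash
   is the XOR of the keys of all pieces present, and a move updates
   it incrementally with two XORs for the moving piece. */
#include <stdint.h>
#include <stdlib.h>

#define PIECES  12                /* 6 piece types x 2 colors     */
#define SQUARES 80                /* e.g. a 10x8 Capablanca board */

static uint64_t zobrist[PIECES][SQUARES];

static uint64_t rand64(void)      /* any decent PRNG will do      */
{
    uint64_t r = 0;
    for (int i = 0; i < 4; i++)
        r = (r << 16) ^ (uint64_t)(rand() & 0xFFFF);
    return r;
}

void init_zobrist(void)
{
    for (int p = 0; p < PIECES; p++)
        for (int s = 0; s < SQUARES; s++)
            zobrist[p][s] = rand64();
}

/* XOR the moving piece out of its old square and into the new one. */
uint64_t update_hash(uint64_t h, int piece, int from, int to)
{
    return h ^ zobrist[piece][from] ^ zobrist[piece][to];
}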

Obviously, if used incorrectly, randomness could severely hurt an
engine's strength as well.  For example, if an engine just plays random
moves.  :)

Derek Nalls wrote on Mon, May 26, 2008 07:36 PM EDT:
Rest assured, I intend to drop this futile topic of conversation soon and
leave you alone.

The following is my impression of how the limited randomization of 
move selection that you have described as being at work within Joker80
must be harmful to the quality of moves made (on average) at long 
time controls.  Since you have experience and knowledge as the
developer of Joker80, I will defer to you the prerogative to correct 
errors in my inferred, general understanding of its workings.
_______________________________________________________

short time control
1x

At an example time control of 10 seconds per move (average),
Joker80 cuts thru 8 plies before it runs out of time and must
produce a move.  At the moment the time expires, it has selected 12 
high-scoring moves as candidates out of a much larger number of 
legal moves available.  Generally, all of them score closely together
with a few of them even tied for the same score.  So, when Joker80 
randomly chooses one move out of this select list, it has probably not 
chosen a move (on average) that is beneath the quality of the best 
move it could have found (within those severe time constraints)
by anything except a minor amount.  In other words, the damage to 
playing strength via randomization of move selection is minimized 
under minimal time controls.
___________________________

long time control
360x

At an example time control of 60 minutes per move (average),
Joker80 cuts thru 14 plies (due to its sophisticated advanced pruning
techniques) before it runs out of time and must produce a move.  
At the moment the time expires, it has selected only 4 high-scoring 
moves as candidates out of a much larger number of legal moves 
available.  Generally, all of them score far apart with a probable 
best move scored significantly higher than the probable second best 
move.  So, when Joker80 randomly chooses one move out of this 
select list, the chances are 3/4 that it has ignored its probable best
move.  Furthermore, it may not have chosen the probable second best move,
either.  It just as likely could have chosen the probable third or fourth
best move, instead.  Ultimately, it has probably chosen a move 
(on average) that is beneath the quality of the best move it may have 
successfully found by a moderate-major amount.  In other words, 
the damage to playing strength via randomization of move selection is 
maximized under maximal time controls.
_______________________________________

The moral of the story is that randomization of move selection reduces 
the growth in playing strength that normally occurs with time and plies 
completed.

Derek Nalls wrote on Mon, May 26, 2008 06:42 PM EDT:
'It would be very educational then to get yourself acquainted with the
current state of the art of Go programming ...'

Go is a connection game that is not related to Chess or its variants.
The only thing Go has in common with Chess is that it is played upon a 
board using pieces.  You did not directly address my comment.

H. G. Muller wrote on Mon, May 26, 2008 06:10 PM EDT:
| I just cannot understand how any rational, intelligent man could 
| believe that introducing chaos (i.e., randomness) is beneficial
| (instead of detrimental) to achieving a goal defined in terms of 
| filtering-out disorder to pinpoint order.

It would be very educational then to get yourself acquainted with the
current state of the art of Go programming, where Monte-Carlo techniques
are the most successful paradigm to date...

| When you reduce the power of your algorithm in any way to 
| filter out inferior moves, you thereby reduce the average 
| quality of the moves chosen and consequently, you reduce 
| the playing strength of your program- esp. at long time controls.  

Exactly. This is why I _enhance_ the power of my algorithm to filter out
inferior moves, as the inferior moves have a smaller probability to draw a
large positive random bonus than the better moves. They thus have a lower
probability to be chosen, which enhances the average quality of the moves,
and thus playing strength. At any time control.

It is a pity this suppression of inferior moves is only probabilistic, and
some inferior moves by sheer luck can still penetrate the filter. But I
know of no deterministic way to achieve the same thing. So something is
better than nothing, and I settle for the inferior moves only getting a
lower chance to pass. Even if it is not a zero chance, it is still better
than letting them pass unimpeded.

| In any event, the addition of the completely-unnecessary module of 
| code used to create the randomization effect within Joker80 that 
| you desire irrefutably makes your program larger, more complicated 
| and slower.  Can that be a good thing?

Everything you put into a Chess engine makes it larger and slower. Yet,
taking almost everything out only leaves you with a weak engine like
micro-Max 1.6. The point is that putting code in can also make the engine
smarter, improve its strategic understanding, reduce its branching ratio,
etc. So whether it is a good thing does not depend on whether it makes the
engine larger, more complicated, or slower. It depends on whether the engine
still fits in the available memory, and from there produces better moves
in the same time. Which larger, more complicated and slower engines often
do. As always, testing is the bottom line.

Actually the 'module of code' consists of only 6 instructions, as I
derive the pseudo-random number from the hashKey.
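
In the same spirit, such a hash-derived bonus could look roughly like this
(a hypothetical C sketch, not Joker80's actual code; the mask size is an
assumed tuning parameter):

/* Because the bonus is a pure function of the position's hash key,
   a given position always receives the same bonus within a search,
   yet the bonus looks random across different positions. */
#include <stdint.h>

#define RAND_EVAL_MASK 7          /* bonus of 0..7 cP; tunable */

int randomized_eval(int static_eval, uint64_t hashKey)
{
    int bonus = (int)((hashKey ^ (hashKey >> 32)) & RAND_EVAL_MASK);
    return static_eval + bonus;
}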

But the point you are missing is this: I have a theoretical understanding
of how Chess engines work, and am therefore able to extrapolate their
behavior with high confidence from what I observe under other conditions
(i.e. at fast TC). Just like I don't have to travel to the Moon and back
to know its distance from the Earth, because I understand geometry and
triangulation. So if including a certain evaluation term gives me more
accurate scores (and thus more reliable selection of the best move) from
8-ply search trees, I know that this can only give better moves from
18-ply search trees, as the latter is nothing but millions of 8-ply search
trees grafted on the tips of a mathematically exact 10-ply minimax
propagation of the score from the 8-ply trees towards the root. 

Anyway, it is not of any interest to me to throw months of valuable CPU
time to answer questions I already know the answer to.

Derek Nalls wrote on Mon, May 26, 2008 03:04 PM EDT:
'Joker80's strength increases with time as expected, 
in the range from 0.4 sec to 36 sec per move, 
in a regular and theoretically expected way.'

'The effect you mention is observed NOT to occur
and thus cannot explain anything that was observed to occur.'

Admittedly, I have no proof ... yet.  Of course, this is due to Joker80
never having been playtested at truly long time controls (from my point of
view).
_______________________________________________________________

'Now if you want to conjecture that this will all miraculously become
very different at longer TC, you are welcome to test it and show us convincing results. I am not going to waste my computer time on such a wild and expensive goose chase.'

I respect your bravery to issue the challenge.  Although I would surely
find the results of a randomized Joker80 vs. non-randomized Joker80
tournament at 60 minutes per move (on average) interesting, I am not
willing either to invest a few (3-4) months of my computer time that I
estimate it would require to playtest 16 games under acceptable, reliable
conditions.

My refusal is due to it not being extremely important or worthwhile to me
just to keep the chess variant community from losing one potentially great
talent to numerology (or some such).  Besides, I have nothing to gain and
nothing new to learn by conducting this long, difficult experiment.  
Only you stand to benefit tangibly from its results.

I just cannot understand how any rational, intelligent man could believe
that introducing chaos (i.e., randomness) is beneficial (instead of
detrimental) to achieving a goal defined in terms of filtering-out
disorder to pinpoint order.  

When you reduce the power of your algorithm in any way to filter out
inferior moves, you thereby reduce the average quality of the moves chosen
and consequently, you reduce the playing strength of your program,
especially at long time controls.  In other words, you are counteracting a
portion of everything desirable that you achieve thru advanced pruning
techniques used elsewhere within your program.

Since you argue that randomization is no problem at all and I argue
that randomization is a moderate-major problem, everything we say to 
one another is becoming purely argumentative.  Only tests (that neither 
one of us intend to perform) can prove who is correct and settle the
issue.
___________________________________________________________________

'As I explained, it is very easy to switch this feature off. 
But you should be prepared for significant loss of strength if you do
that.'

To the contrary, you should be prepared for a significant gain of strength
if you do that.  Notably, you do not dare.

In any event, the addition of the completely-unnecessary module of code 
used to create the randomization effect within Joker80 that you desire 
irrefutably makes your program larger, more complicated and slower.  
Can that be a good thing?

H. G. Muller wrote on Mon, May 26, 2008 12:47 PM EDT:
Derek Nalls:
| Nonetheless, completing games of CRC (where a long, close, 
| well-played game can require more than 80 moves per player) 
| in 0:24 minutes - 36 minutes does NOT qualify as long or even, 
| moderate time controls.  In the case of your longest 36-minute games, 
| with an example total of 160 moves, that allows just 13.5 seconds per 
| move per player.  In fact, that is an extremely short time by any 
| serious standards.  

In my experience most games on the average take only 60 moves (perhaps
because of the large strength difference of the players). As early moves
are more important for the game result than late moves (even the best moves
late in the game do not help you if your position is already lost), most
engines use 2.5% of the remaining time for their next move (on average,
depending on how the iterations end compared to the target time). That
would be nearly 54 sec/move at 36 min/game in the decisive phase of the
game. That is more than you thought, but admittedly still fast. Note,
however, that I also played 60-min games in the General Championship
(without time odds), and that Joker80 confirms the lead over its
competitors that it manifested at faster time controls.
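
The 2.5% rule mentioned above amounts to something as simple as this (an
illustrative C fragment; real engines also account for increments and for
how far the last iteration overshot its target):

int target_time_ms(int remaining_ms)
{
    return remaining_ms / 40;     /* 1/40 = 2.5% of remaining time */
}

With 36 min = 2,160,000 ms on the clock this yields 54,000 ms, the 54
sec/move figure quoted above.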

But I don't see the point: Joker80's strength increases with time as
expected, in the range from 0.4 sec to 36 sec per move, in a regular and
theoretically expected way. This is over the entire range where I tested
the dependence of the scoring percentage of various material imbalances,
which extended to only 15 sec/move, and found it to be independent of TC.
So your 'explanation' for the latter phenomenon is just nonsense. The
effect you mention is observed NOT to occur, and thus cannot explain
anything that was observed to occur.

Now if you want to conjecture that this will all miraculously become very
different at longer TC, you are welcome to test it and show us convincing
results. I am not going to waste my computer time on such a wild and
expensive goose chase. Because from the way I know the engines work, I
know that they are 'scalable': their performance at 10 ply results from
one ply being put in front of 9-ply search trees. And that extra ply will
always help. If they have good 9-ply trees, they will have even better
10-ply trees. But you don't have to take my word for it. You have the
engine, and if you don't want to believe that at 1 hour per move you will
get the same win probability as at 1 sec/move, or that at 1 hour per move
it won't beat 10 min/move, just play the games, and you will see for
yourself. It would even be appreciated if you publish the games here or on
your website. But, needless to say, one or two games won't convince anyone
of anything.

| 'since I am not a computer chess programmer, I cannot possibly 
| know what I am talking about when I dare criticize an important 
| working of your Joker80 program'
Well, you certainly make it appear that way, as, despite the elaborate
explanation I gave of why programs derive extra strength from this
technique, you still draw a conclusion that in practice was already shown
to be 100% wrong earlier. And if you think you will run into the problem
you imagine at enormously longer TC, well, very simple: don't use
Joker80, but use some other engine. You are on your own there, as I am not
specifically interested in extremely long TC. There is always a risk in
using equipment outside the range of conditions for which it was designed
and tested, and that risk is entirely yours. So better tread carefully,
and make sure you rule out the perceived dangers by concise testing.

| You must decide upon and define the primary function of your 
| Joker80 program.

I do not see the dilemma you sketch. The purpose is to play ON AVERAGE the
best possible move. If you do that, you have the best chance to win the
game. If I can achieve that through a non-deterministic algorithm better
than through a deterministic one, I go for the non-deterministic method.
That it also diversifies play, and makes me less sensitive to prepared
openings from the opponent, is a win/win situation. Not a compromise.

As I explained, it is very easy to switch this feature off. But you should
be prepared for significant loss of strength if you do that.

Derek Nalls wrote on Mon, May 26, 2008 10:54 AM EDT:
I am slightly relieved and surprised that Joker80 measurably improves the
quality of its moves as a function of time or plies completed over a range
of speed chess tournaments.  Nonetheless, completing games of CRC (where a
long, close, well-played game can require more than 80 moves per player)
in 0:24 minutes - 36 minutes does NOT qualify as long or even, moderate
time controls.  In the case of your longest 36-minute games, with an example total of 160 moves, that allows just 13.5 seconds per move per player.  In fact, that is an extremely short time by any serious standards.  

I consider 10 minutes per move a moderate time that produces results of
marginal, unreliable quality and 60-90 minutes per move a long time that
produces results of acceptable, reliable quality.  Ask Reinhard Scharnagl or ET about the longest time per move they have used testing openings with their programs playing 'Unmentionable Chess'- 24 hours per move!

It is noteworthy that you are now resorting to playing dirty by using the
'exclusivist argument' that essentially 'since I am not a computer
chess programmer, I cannot possibly know what I am talking about when I
dare criticize an important working of your Joker80 program'.  What you
fail to take into account is that I am a playtester with more experience
than you at truly long time controls.  If you will not listen to what I am
trying to tell you, then why will you not listen to Scharnagl?  After all,
he is also a computer chess programmer with a lot of knowledge in
important subject matters (such as mathematics).

You really should not be laughing.  This is a serious problem.  Your
sarcastic reaction does nothing to reassure my trust or confidence that
you will competently investigate it, confirm it and fix it.

Now, please do not misconstrue my remarks.  My intent is not to overstate
the problem.  I realize Joker80 in its present form is not a totally
random 'woodpusher'.  It would not be able to win any short time control
tournaments if that were the case.  In fact, I believe you when you state
that you have not experienced any problems with it but ... I think this is
strictly because you have not done any truly long time control playtesting with it.

You must decide upon and define the best primary function for your Joker80
program:

1.  To pinpoint the single, very best move available from any position. 
[Ideally, repeats could produce an identical move.]

OR

2.  To produce a different move from any position upon most repeats. 
[At best, by randomly choosing amongst a short list of the best available
moves.]

These two objectives are mutually exclusive.  It is impossible and
self-contradictory for a program to somehow accomplish both.  Virtually
every AI game developer in the world except you chooses #1 as preferable
to #2 by a long shot in terms of the move quality produced on average.  

If you do not even commit your AI program to TRYING to find the single
best move available because you think variety is just a whole lot more
interesting and fun, then it will be soft competition at truly long time
controls facing other quality AI programs that are frequently-sometimes
pinpointing the single, best move available and playing it against you.

H. G. Muller wrote on Mon, May 26, 2008 04:09 AM EDT:
Derek: 'I hope you can handle constructive advice.'

It gives me a big laugh, that's for sure.

Of course none of what you say is even remotely true. That is what happens
if you jump to conclusions regarding complex matters you are not
knowledgeable about, without even taking the trouble to verify your ideas.


Of course I extensively TESTED how the playing strength of Joker80, (and
all available other engines), varied as a function of time control. This
was the purpose of several elaborate time-odds tournaments I conducted,
where various versions of most engines participated that had to play their
games in 36, 12, 4, 1:30, 0:40 or 0:24 min, where handicapped engines were meeting non-handicapped ones in a full round robin. (I.e. the handicaps were factors 3, 9, 24, 54 or 90, where only the strongest engines were handicapped up to the very maximum, and the weakest only participated in an unhandicapped version.) 

And of course Joker80 behaves similarly to any Shannon-type engine that is reasonably free of bugs: its playing strength measured in Elo monotonically increases in a logarithmic fashion, approximately following the formula rating = 100*ln(time), so that doubling the time is worth about 100*ln(2) = 70 rating points. So Joker80 at 5 min/move crushes Joker80 at 1 sec per move, as you could have easily found out for yourself. So much for your nonsense about Joker80 failing to improve its move quality with time. For some discussion on one of the tournaments, see:

http://www.talkchess.com/forum/viewtopic.php?t=19764&postdays=0&postorder=asc&topic_view=flat&start=34
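
To put numbers on the rating = 100*ln(time) formula for the time-odds
factors used in those tournaments (3, 9, 24, 54 and 90), a quick check in C:

#include <math.h>
#include <stdio.h>

int main(void)
{
    /* expected rating deficit for each time-odds handicap factor */
    static const double factor[] = { 3, 9, 24, 54, 90 };
    for (int i = 0; i < 5; i++)
        printf("time odds %2.0f:1  ->  about %3.0f Elo\n",
               factor[i], 100.0 * log(factor[i]));
    return 0;   /* roughly 110, 220, 318, 399 and 450 Elo */
}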

At that time Fairy-Max still had a hash-table bug that made it hang (and
subsequently forfeit on time). The bug was striking at a fixed rate per
second, so that Fairy-Max forfeited more and more games at longer
TC. Since then the bug has been identified and repaired, and now
Fairy-Max also performs progressively better at longer TC.

So nice try, but next time better save your breath for telling the surgeon
how to do his job before he performs open-heart surgery on you. Because
he no doubt has much more to learn from you regarding cardiology than I
have in the area of building Chess engines...

Things are as they are, and can become known by observation and testing.
Believing in misconceptions born out of ignorance is not really helpful.
Or, more explicitly: if you think you know how to build better Chess
engines than other people, by all means, do so. It will be fun to confront
your ideas with reality. In the meantime I will continue to build them as
I think best, (and know is best, through extensive testing), so you should have every chance to surpass them. Lacking that, you could at least _use_ the engines of others to check whether your theories of how they behave have any reality value. You don't have to depend on the time-odds tourneys and other tests I conduct. You might not even be aware of them, as the developers of Chess engines hardly ever publish the thousands of games they play to test whether their ideas work in practice.

Derek Nalls wrote on Sun, May 25, 2008 06:03 PM EDT:
The reason you have never been able to find any correlation between winning
probabilities for one army and time controls [contrary to the experiences
of people using other AI programs] in asymmetrical playtests using Joker80
is that you have destructively randomized the algorithm within your program
to such an extent that it fails to measurably improve the quality of its
moves as a function of time or plies completed.  A program with serious
problems of this nature may do well in speed chess but at truly long time
controls against quality programs that improve as they should with time or
plies per move, it cannot consistently win.

I have two useful, important pieces of news for you:

1.  All of the statistical data you have generated using Joker80 (appr.
20,000+ games) is corrupt.  It must all be thrown out and started over
from scratch after you repair Joker80.

2.  All of your material values for CRC pieces are unreliable since they
are based upon and derived from #1 (corrupt statistical data).

I hope you can handle constructive advice.

H. G. Muller wrote on Sun, May 25, 2008 11:07 AM EDT:
I would have thought that 'twice the same flip in a row' was pretty
unambiguous, especially in combination with the remark about two-sided
testing. But let's not quibble about the wording.

The point was that for two-sided testing, if you suspect a coin to be
loaded, but have no idea if it is loaded to produce tails or heads, the two
flips tell you exactly nothing. They are either the same or different, and
on an unbiased coin that would occur with equal probability. So the
'confidence' of any conclusion as to the fairness of the coin drawn from
the two flips would be only 50%. I.e. no better than totally random; you
might as well have guessed if it was fair or not without flipping it at
all. That would also have given you a 50% chance of guessing correctly.

Derek Nalls wrote on Sun, May 25, 2008 10:14 AM EDT:
Well, when you said ...

'Actually the chance for twice the same flip in a row is 1/2.'

... that was vague and misleading.

I thought you meant 'heads' twice OR 'tails' twice each equals a chance of
1/2, instead of the sum of 'heads' twice AND 'tails' twice equaling a chance
of 1/2.

Since English is a second language to you, of course I will overlook this
minor mis-communication and even apologize for implicitly accusing you 
of incompetence.  However, you should expect that you will draw critical 
reactions from others when you have previously, falsely, explicitly
accused them of incompetence in a subject matter.

Tony Hecker wrote on Sun, May 25, 2008 09:47 AM EDT:
'Actually the chance for twice the same flip in a row is 1/2.'

H.G. is correct here.
- The probability of two heads in a row is 1/4.
- The probability of two tails in a row is 1/4.
- The probability of two same flips in a row is the sum of these two
outcomes: 1/4 + 1/4 = 1/2.

Another way to think about it:
With two coin flips, there are 4 equally likely outcomes: HH, HT, TH, TT.
In 2 of the 4 (equally likely) outcomes, the same flip result occurs twice
in a row.
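
Anyone still in doubt can confirm the 1/2 figure with a few lines of C (the
low bit of rand() is crude, but fine for this illustration):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int same = 0, trials = 1000000;
    srand(12345);
    for (int i = 0; i < trials; i++) {
        int a = rand() & 1, b = rand() & 1;   /* two coin flips */
        if (a == b) same++;                   /* HH or TT       */
    }
    printf("P(same) ~ %.4f\n", (double)same / trials);  /* ~0.5 */
    return 0;
}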

H. G. Muller wrote on Sun, May 25, 2008 07:13 AM EDT:
Indeed, it is a stochastic way to simulate mobility evaluation. In the presence of other terms it should of course not be made so large that it dominates the total evaluation. Like explicit mobility terms should not dominate the evaluation. But its weight should not be set to zero either: properly weighted mobility might add more than 100 Elo to an engine.

Joker has no explicit mobility in its evaluation, and relies entirely on
this probabilistic mechanism to simulate it. The disadvantage is that,
because of the probabilistic nature, it is not 100% guaranteed to always
take the best decision. On rare occasions the single acceptable end leaf
does draw a higher random bonus than one hundred slightly better positions
in another branch. OTOH it is extremely cheap to implement, while explicit
mobility is very expensive. As a result, I might gain an extra ply in
search depth. And then it becomes superior to explicit mobility, as it
only counts tactically sound moves, rather than just every move. So it is
like safe mobility verified by a full Quiescence Search.

In my assessment, the probabilistic mobility adds more strength to Joker
than changing the Rook value by 50cP would add or subtract. This can be
easily verified by play-testing. It is possible to switch this evaluation
term off. In fact, you have to switch it on, but WinBoard does this by
default. To prevent it from being switched on, one should run WinBoard
with the command-line option /firstInitString='new'. (The default
setting is 'new\nrandom'. If Joker is running as second engine, you
will of course have to use /secondInitString='new'.)

Reinhard Scharnagl wrote on Sun, May 25, 2008 06:39 AM EDT:
Harm wrote: ... 'OTOH, a program that evaluates every position as a
completely random number starts to play quite reasonable chess, once the
search reaches 8-10 ply. Because it is biased to seek out moves that lead
to pre-horizon nodes that have the largest number of legal moves, which
usually are the positions where the strongest pieces are still in its
possession.' ...

This is nothing but a probability-based heuristic simulating a mobility
evaluation component. But when there is a working positional evaluation,
especially one also covering mobility, that randomizing method is not
orthogonal to the calculated, much more appropriate knowledge. Thus you
will overlay a much better evaluation with a disturbing noise generator. 

Nevertheless this approach might have advantages through the opening,
defeating otherwise working implementations of pre-investigated killer
combinations.

H. G. Muller wrote on Sun, May 25, 2008 05:14 AM EDT:
'Do not you realize that forcing Joker80 to do otherwise must reduce its
playing strength significantly from its maximum potential?'

On the contrary, it makes it stronger. The explanation is that by adding a
random value to the evaluation, branches with very many equal end leaves
have a much larger probability to have the highest random bonus amongst
them than a branch that leads to only a single end-leaf of that same
score.

The difference can be observed most dramatically when you evaluate all
positions as zero. This makes all moves totally equivalent at any search
depth. Such a program would always play the first legal move it finds, and
would spend the whole game moving its Rook back and forth between a1 and
b1, while the opponent is eating all its other pieces. OTOH, a program
that evaluates every position as a completely random number starts to play
quite reasonable chess, once the search reaches 8-10 ply. Because it is
biased to seek out moves that lead to pre-horizon nodes that have the
largest number of legal moves, which usually are the positions where the
strongest pieces are still in its possession.

It is always possible to make the random addition so small that it only
decides between moves that would otherwise have exactly equal evaluation.
But this is not optimal, as it would then prefer a move (in the root) that
could lead (after 10 ply or so) to a position of score 53 (centiPawn),
while all other choices later in the PV would lead to -250 or worse, over
a move that could lead to 20 different positions (based on later move
choices) all evaluating as 52cP. But, as the scores were just
approximations based on finite-depth search, two moves later, when it can
look ahead further, all the end-leaf scores will change from what they
were, because those nodes are now no longer end-leaves. The 53 cP might
now be 43cP because deeper search revealed it to disappoint by 10cP. But
alas, there is no choice: the alternatives in this branch might have
changed a little too, but now all range from -200 to -300. Not much help,
we have to settle for the 43cP... 

Had it taken the root move that keeps the option open to go to any of the
20 positions of 52cP, it would now see that their scores on deeper search
would have been spread out between 32cP and 72cP, and it could now go for
the 72cP. In other words, the investment of keeping its options open
rather than greedily commit itself to going for an uncertain, only
marginally better score, typically pays off. 

To properly weight the expected pay-back of keeping options that at the
current search depth seem inferior, it must have an idea of the typical
change of a score from one search depth to the next. And match the size of
the random eval addition to that, to make sure that even slightly (but
insignificantly) worse end-leaves still contribute to enhancing the
probability that the branch will be chosen. Playing a game in the face of
an approximate (and thus noisy) evaluation is all about contingency
planning.

As to the probability theory, you don't seem to be able to see the math
because of the formulae...

P(hh) = 0.5*0.5 = 0.25
P(tt) = 0.5*0.5 = 0.25
______________________+
P(two equal)    = 0.5

Derek Nalls wrote on Sat, May 24, 2008 10:16 AM EDT:
'... in Joker the source of indeterminism is much less subtle: it is
programmed explicitly.'

This renders Joker80 totally unsuitable for my playtesting purposes.  [I
am just relieved that you told me this bizarre fact now before I invested
large amounts of computer time and effort.]

It is critically important that any AI program attempt (to its greatest
capability) to pinpoint the single, very best possible move in the time allowed upon every move in the game even if this means that it would
often-sometimes repeat an identical move from an identical position.

Do not you realize that forcing Joker80 to do otherwise must reduce its
playing strength significantly from its maximum potential?

Derek Nalls wrote on Sat, May 24, 2008 09:39 AM EDT:
'Actually the chance for twice the same flip in a row is 1/2.'
______________________________________________________

Really?
You obviously need a lesson on probability.
Let us start with elementary stuff.

Mathematical Ideas
fifth edition
Miller & Heeren
1986

It is an old college textbook from a class I took in the mid-90's.
[Yes, I passed the class.]
______________________

It says interesting things such as-

'The relative frequency with which an outcome happens 
represents its probability.'

'In probability, each repetition of an experiment is a trial.
The possible results of each trial are outcomes.'
____________________________________________

An example of a probability experiment is 'tossing a coin'.
Each 'toss' (trial of the experiment) has only two equally-possible 
outcomes, 'heads' or 'tails' ... assuming the condition that the 
coin is fair (i.e., not loaded).

probability = p
heads = h
tails = t
number of tosses = x
addition = +
involution = ^

[This is a substitute upon a single line for superscript representation 
of an exponent to the upper right of a base.]

probability of heads = p(h)
probability of tails = p(t)

p(h) is a base.
p(t) is a base.

x is an exponent.

p(h) = 0.5
p(t) = 0.5
_________________

What follows are examples of the chances of getting the same result
upon EVERY consecutive toss.

1 time
x = 1

p(h) ^ x = 0.5 ^ 1 = 0.5
p(t) ^ x = 0.5 ^ 1 = 0.5

Note:  In this case only ...
p(h) + p(t) = 1.0

2 times
x = 2

p(h) ^ x = 0.5 ^ 2 = 0.25
p(t) ^ x = 0.5 ^ 2 = 0.25

3 times
x = 3

p(h) ^ x = 0.5 ^ 3 = 0.125
p(t) ^ x = 0.5 ^ 3 = 0.125

Etc ...
______________________

By a function that is the inverse of successive exponents of base 2,
the chance for consecutive tosses to yield the same result rapidly
becomes extremely small.

When this occurs, there are only two possibilities: 'random good-bad
luck' or an unfair advantage-disadvantage exists (i.e., 'the coin is loaded').  The sum of these two possibilities always equals 1.

random luck (good or bad) = l
unfair (advantage or disadvantage) = u

luck (heads) = l(h)
luck (tails) = l(t)

unfair (heads) = u(h)
unfair (tails) = u(t)

p(h) ^ x = l(h)
p(t) ^ x = l(t)

l(h) + u(h) = 1
l(t) + u(t) = 1

Therefore, as the chances of 'random good-bad luck' become extremely low in the example, the chances of an advantage-disadvantage existing for 'one side of the coin' or (if you follow the analogy) 'one side of the gameboard' or 'one player' or 'one set of piece values' become likewise extremely high.

Only if it can be proven that an advantage-disadvantage does not exist for one player can it be accepted that the extremely unlikely event by
'random good-bad luck' is indeed the case.

It is essential to understand that random good luck or random bad luck
cannot be consistently relied upon.  From this fact alone, firm
conclusions can be responsibly drawn with a strong probability of
correctness.
____________________________________________________________

1 time
x = 1

p(h) ^ x = 0.5
u(h) = 0.5

p(t) ^ x = 0.5
u(t) = 0.5

2 times
x = 2

p(h) ^ x = 0.25
u(h) = 0.75

p(t) ^ x = 0.25
u(t) = 0.75

3 times
x = 3

p(h) ^ x = 0.125
u(h) = 0.875

p(t) ^ x = 0.125
u(t) = 0.875

Etc ...

H. G. Muller wrote on Sat, May 24, 2008 05:49 AM EDT:
Derek:
| Conclusions drawn from playing at normal time controls are 
| irrelevant compared to extremely-long time controls.

First, that would only be true if the conclusions would actually depend on
the TC. Which is a totally unproven conjecture on your part, and in fact
contrary to any observation made at TCs where such observations can be
made with any accuracy (because enough games can be played). This whole thing reminds me of my friend, who always claims that stones fall upward. When I then drop a stone to refute him, he just shrugs, and says it proves nothing because the stone is 'not big enough'. Very conveniently for him, the upward falling of stones can only be observed on stones that are too big for anyone to lift...
  But the main point is of course, if you draw a conclusion that is valid
only at a TC that no one is interested in playing, what use would such a
conclusion be?

| The chance of getting the same flip (heads or tails) twice-in-a-row 
| is 1/4.  Not impressive but a decent beginning.  Add a couple or a 
| few or several consecutive same flips and it departs 'luck' by a 
| huge margin.

Actually the chance for twice the same flip in a row is 1/2. Unless you
are biased as to what the outcome of the flip should be (one-sided
testing). And indeed, 10 identical flips in a row would be unlikely to
occur by luck by a large margin. But that is rather academic, because you
won't see 10 identical results in a row between the subtly different
models. You will see results like 6-4 or 7-3, which will again be very
likely to be a result of luck (as that is exactly what they are the result
of, as you would realize after 10,000 games when the result is standing at
4,628-5,372).

Calculate the number of games you need to typically get a result for a
53-47 advantage that could not just as easily have been obtained from a
50-50 chance with a little luck. You will be surprised...
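
For the curious, here is the invited calculation via the normal
approximation (ignoring draws): with N games the standard error of the
score fraction is about 0.5/sqrt(N), so a 53-47 result only becomes a
~2-sigma deviation from 50-50 once N reaches roughly 1100 games.

#include <math.h>
#include <stdio.h>

int main(void)
{
    double edge   = 0.03;   /* 53% minus 50%              */
    double sigmas = 2.0;    /* ~95% confidence, two-sided */
    double n = pow(sigmas * 0.5 / edge, 2);
    printf("games needed: about %.0f\n", n);   /* ~1111 */
    return 0;
}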

| I have wondered why the performance of computer chess programs is
| unpredictable and varied even under identical controls.  Despite 
| their extraordinary complexity, I think of computer hardware, 
| operating systems and applications (such as Joker80) as deterministic.

In most engines there always is some residual indeterminism, due to timing
jitter. There are critical decision points, where the engine decides if it
should do one more iteration or not (or search one more move vs aborting
the iteration). If it took such decisions purely on internal data,
like node count, it would play 100% reproducibly. But most engines use the
system clock (to not forfeit on time if the machine is also running other
tasks), and experience the timing jitter caused by other processes
running, or rotational delays of the hard disk they had been using. In
multi-threaded programs this is even worse, as the scheduling of the
threads by the OS is unpredictable. Even where exactly the program is
loaded in physical memory might have an effect.

But in Joker the source of indeterminism is much less subtle: it is
programmed explicitly. Joker uses the starting time of the game as the
seed of a pseudo-random-number generator, and uses the random numbers
generated with the latter as a small addition to the evaluation, in order
to lift the degeneracy of exactly identical scores, and provide a bias for
choosing the move that leads to the widest choice of equivalent positions
later.
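A hypothetical sketch of such deliberate randomization (the function name and the size of the noise are illustrative; this is not Joker80's actual code):

import random
import time

rng = random.Random(int(time.time()))  # seed = starting time of the game

def perturbed_eval(static_eval_cp: int) -> int:
    # A few centipawns of noise: enough to break exact ties between
    # equivalent moves, far too small to override real material or
    # positional differences.
    return static_eval_cp + rng.randint(0, 3)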

The non-determinism is a boon rather than a bust, as it allows you to
play several games from an identical position, and still do a meaningful
sampling of possible games, and of the decisions that lead to their
results. If one position would always lead to the same game, with the same
result (as would occur if you were playing a simple end-game with the aid
of tablebases), it would not tell you anything about the relative strength
of the armies. It would only tell you that this particular position was won
/ drawn. But nothing about the millions of other positions with the same
material on the board. And the value of the material is by definition an
average over all these positions. So with deterministic play, you would be
forced to sample the initial positions, rather than using the indeterminism
of the engine to create a representative sample of positions before
anything is decided.

| In fact, to the extent that your remarks are true, they will 
| support my case if my playtesting is successful that the 
| unlikelihood of achieving the same outcome (i.e., wins or 
| losses for one player) is extreme.
This sentence is too complicated for me to understand. 'Your case' is
that 'the unlikelihood of achieving the same outcome is extreme'? If the
unlikelihood is extreme, is that the same as that the likelihood is
extreme? Is the 'unlikelihood to be the same' the same as the
'likelihood to be different'? What does 'extreme' mean for a
likelihood? Extremely low or extremely high? I wonder if anything is
claimed here at all...

I think you make a mistake by seeing me as a low-quality advocate. I only
advocate minimum quantity to not make the results inconclusive.
Unfortunately, that is high, despite my best efforts to make it as low as
possible through asymmetric playtesting and playing material imbalances in
pairs (e.g. 2 Chancellors against two Archbishops, rather than one vs one).
And that minimum quantity puts limits to the maximum quality that I can
afford with my limited means. So it would be more accurate to describe me
as a minimum-(significant)-quantity, maximum-(affordable)-quality
advocate...

Derek Nalls wrote on Fri, May 23, 2008 06:22 PM EDT:
'If the result would be different from playing at a more 'normal' TC,
like one or two hours per game, it would only mean that any conclusions 
you draw on them would be irrelevant for playing Chess at normal TC.'

Conclusions drawn from playing at normal time controls are irrelevant
compared to extremely-long time controls.  It is desirable to see what
secrets can be discovered from a rarely viewed vantage of extremely
well-played games.  Aren't you interested at all in analyzing, move by
move, games played better than almost any pair of human players is capable of?

You do not seem to understand that I, too, am discontent with the
probability of a small number of wins or losses in a row.  The very long
time control is a compensation that reduces, to the greatest extent
attainable, the chance that the games were randomly played and,
consequently, that the winner or loser was randomly determined.
_____________________________

'... playing 2 games will be like flipping a coin.'

Correction-

Playing 1 game will be like flipping a coin ... once.
Playing 2 games will be like flipping a coin ... twice.

The chance of getting the same flip (heads or tails) twice-in-a-row is
1/4.  Not impressive but a decent beginning.  Add a couple or a few or several consecutive same flips and it departs 'luck' by a huge margin.
_______________________________________________________________

'The result, whatever it is, will not prove anything, as it would be
different if you would repeat the test. Experiments that do not give a
fixed outcome will tell you nothing, unless you conduct enough of them to
get a good impression on the probability for each outcome to occur.'

I have wondered why the performance of computer chess programs is
unpredictable and varied even under identical controls.  Despite their
extraordinary complexity, I think of computer hardware, operating systems
and applications (such as Joker80) as deterministic.

The details of the differences in outcomes do not concern me.  In fact,
to the extent that your remarks are true, they will support my case if my
playtesting is successful that the unlikelihood of achieving the same
outcome (i.e., wins or losses for one player) is extreme.

I am pleased to report that I estimate it will be possible, over time, to
generate enough experiments using Joker80 to have meaning for a
high-quality, low-quantity advocate (such as myself) and even a
moderate-quality, moderate-quantity advocate (such as Scharnagl).  As for
a low-quality, high-quantity advocate (such as you), you will always be
disappointed as you are impossible to please.

Derek Nalls wrote on Fri, May 23, 2008 05:38 PM EDT:
I have recently been sufficiently convinced via asymmetrical playtesting
(still underway) that the 2 rooks : 1 queen advantage in material values
is appr. the same in CRC as in FRC.  [I used to think it was higher in
CRC.] Consequently, I revised my model (again) and my CRC piece values:

universal calculation of piece values
http://www.symmetryperfect.com/shots/calc.pdf

CRC
material values of pieces
http://www.symmetryperfect.com/shots/values-capa.pdf

FRC
material values of pieces
http://www.symmetryperfect.com/shots/values-chess.pdf

This change was implemented by raising the value of the queen in CRC- not
by lowering the value of the rook.

revised Joker80 values
Nalls standard CRC model
P85=268=307=518=818=835=950

H. G. Muller wrote on Fri, May 23, 2008 05:36 AM EDT:
Derek Nalls:
| This might require very deep runs of moves with a completion time 
| of a few weeks to a few months per pair of games to achieve 
| conclusive results.

It still escapes me what you hope to prove by playing at such an
excessively long Time Control. If the result would be different from
playing at a more 'normal' TC, like one or two hours per game (which
IMO will not be the case), it would only mean that any conclusions you draw
on them would be irrelevant for playing Chess at normal TC.

Furthermore, playing 2 games will be like flipping a coin. The result,
whatever it is, will not prove anything, as it would be different if you
would repeat the test. Experiments that do not give a fixed outcome will
tell you nothing, unless you conduct enough of them to get a good
impression on the probability for each outcome to occur.

H. G. Muller wrote on Fri, May 23, 2008 04:16 AM EDT:
'Because of all this, I suggest evaluating the entire configuration of
pieces, rather than a single piece.'

This is exactly what Chess engines do. But it is a subject that transcends
piece values. Material evaluation is supposed to answer the question:
'what combination of pieces would you rather have, without knowing where
they stand on the board'. Piece values are an attempt to approximate the
material evaluation as a simple sum of the value of the individual pieces,
making up the army.

It turns out that material evaluation is by far the largest component of
the total evaluation of a Chess position. And this material evaluation
again can be closely approximated by a sum of piece values. The most
well-known exception is the Bishop pair: having two Bishops is worth about
half a Pawn more than double the value of a single Bishop. Other
non-additive terms are those that make the Bishop and Rook value dependent
on the number of Pawns present. To account for such effects some engines
(e.g. Rybka) have tabulated the total value of all possible combinations
of material (ignoring promotions) in a 'material table'. Such tables can
then also account for the material component of the evaluation that gives
the deviation from the sum of piece values due to cooperative effects
between the various pieces.
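As a toy illustration of that decomposition (the piece letters follow this thread's P, N, B, R, A, C, Q convention; the values are placeholders, and the +50 cP pair term stands in for the 'about half a Pawn' figure mentioned above):

PIECE_VALUES = {'P': 100, 'N': 300, 'B': 350, 'R': 500,
                'A': 875, 'C': 900, 'Q': 950}  # centipawns, illustrative

def material_eval(pieces):
    # Sum of the individual piece values ...
    total = sum(PIECE_VALUES[p] for p in pieces)
    # ... plus the best-known non-additive term: the Bishop pair.
    if pieces.count('B') >= 2:
        total += 50
    return total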

Useful as this may be, it remains true that piece values are by far the
largest contribution to the total evaluation. The only positional terms
that can compete with it are passed pawns (a Pawn on 7th rank is worth
nearly 2.5 normal Pawns) and King Safety (having a completely exposed King
in the middle game, when the opponent still has a Queen or similar
super-piece, can be worth nearly a Rook).

Rich Hutnik wrote on Thu, May 22, 2008 09:56 PM EDT:
Perhaps we need to look back to exactly why we need piece values.  Is it to
balance different armies, or just because people are curious?  Is the
objective to turn Chess Variants into a single balanced game, or something
else?  Maybe need to think of the reason for the discussion, so then you
can perhaps find a way to cut the Gordian knot instead of trying to
untangle it.

Derek Nalls wrote on Thu, May 22, 2008 08:47 PM EDT:
Originally, I planned two 'internal playtests'.  [By this self-invented
term I mean playtests of the standard model of a person against a special
model that I have compelling reasons to think may be superior by a
provable margin.]

The first planned test involves the standard CRC model of Muller against a
special CRC model with a higher, closer-to-conventional rook value.  Upon
closer examination, I suspected that the discrepancy was possibly too
small to be detected even with very long time controls.  So, I announced
that this test was cancelled.

Notwithstanding, I may change my mind and return to this unsolved mystery
if Joker80 demonstrates unusually-high aptitude as a playtesting tool. 
This might require very deep runs of moves with a completion time of a few
weeks to a few months per pair of games to achieve conclusive results.

The second planned test involves the standard CRC model of Scharnagl
against a special CRC model with a higher, unconventional archbishop
value.

Scharnagl currently assigns the archbishop a material value of appr.
77% that of the chancellor in his standard CRC model.

Muller currently assigns the archbishop a material value of greater
than 97% that of the chancellor in his standard CRC model.

Nalls currently assigns the archbishop a material value of less
than 98% that of the chancellor in his standard CRC model.

I devised a special CRC model using identical material values for every
piece in the standard CRC model by Scharnagl except that it assigns the
archbishop a material value of exactly 95% that of the chancellor
(18% or 1.65 pawns higher).  [Note that this figure is slightly more
moderate than those by Muller & Nalls.]  A discrepancy this large should
be detectable at short-moderate time controls.  This test is now
underway.

If either of these tests is successful at establishing or implicating a
probability that the special models play stronger than the standard
models, then revisions to the standard models may occur.  At that
juncture, we would be ready to begin 'external playtests'.  [By this
self-invented term I mean playtests of the standard models of different
persons against one another.]

Gary Gifford wrote on Thu, May 22, 2008 05:37 PM EDT:
Rich suggested '...evaluating entire configuration of pieces, rather than a single piece.'

I believe that is correct [that is what programs like Fritz and Chess Master seem to do... evaluating the two configurations and giving a score for the deviation] but also I would say, evaluate the pieces within the given position. The values are relative and change with every move.

The lowly pawn about to queen is a fine example. The Knight that attacks 8 spaces compared to one that attacks 4 is another, as is the 'bad' [blockaded] Bishop.

Another concept is that of brain power. For example, the late Bobby Fischer's Knights would be much more powerful than mine... not in potential, but in reality of games played. Pieces have potential, but the amount of creative power behind them is an important factor.


Rich Hutnik wrote on Thu, May 22, 2008 05:24 PM EDT:
It seems like a normal FIDE pawn, but by simply shifting all the pawns up
one row, the value of all of them changes.  In other words, their value is
dependent upon their proximity to other pawns.  In light of this, are
pieces worth the same in every configuration of Chess960?

This issue is more complicated than it appears.  Take Near vs Normal
Chess, for example.  Which side has an advantage?  The Near side moves
everything up one row, but drops castling, but has a back row to either
drop the king back or mobilize the rooks.  And, against this, Near can En
Passant the pawns of Normal, but Normal can't do the same to Near.

Because of all this, I suggest evaluating the entire configuration of pieces,
rather than a single piece.

H. G. Muller wrote on Thu, May 22, 2008 03:05 PM EDT:
'Let me provide another challenge for people here regarding pawns.  How
much is a pawn that moves only one space forward (not initial 2) but
starts on the third row instead of second worth in contrast to a normal
chess pawn?  How much is it worth alone, and then in a line of pawns that
start on the third row?'

But this is a totally normal FIDE Pawn...

It would get a pretty large positional penalty if it was alone
(isolated-pawn penalty). In a complete line of pawns on the 3rd rank it
would be worth a lot more, as it would not be isolated, and not be
backward. All in all it would be fairly similar to having a line of Pawns
on second rank, as the bonus for pushing the Pawns forward 1 square is
approximately cancelled by not having Pawn control anymore over any of the
squares on the 3rd rank.

Rich Hutnik wrote on Thu, May 22, 2008 01:28 PM EDT:
I believe the value of a piece should relate to its mobility first and
foremost.  If one were to end up rating a piece, come up with a value of 1
for the most pathetic potential piece in the game, and then adjust
accordingly.  How about a pawn that starts out on the second space, only
moves backward one square, and doesn't capture?  That pawn has
a value of one.  How much more, in contrast, is an Asian chess pawn worth
that moves only one space forward and doesn't promote?

To base it on a normal chess pawn is to not provide a full solution for
the variant community.

Let me provide another challenge for people here regarding pawns.  How
much is a pawn that moves only one space forward (not initial 2) but
starts on the third row instead of second worth in contrast to a normal
chess pawn?  How much is it worth alone, and then in a line of pawns that
start on the third row?

H. G. Muller wrote on Thu, May 22, 2008 04:13 AM EDT:
'Do you think these piece values will work smoothly with Joker80 running
under Winboard F yet remain true to all three models?'

Yes, I think these values will not conflict in any way with any of the
hard-wired value approximates that are used for pruning decisions. At
least not to the point where it would lead to any observable effect on
playing strength. (Prunings based on the piece values occur only close to
the leaves, and engines are usually quite insensitive as to how exactly
you prune there.)

H. G. Muller wrote on Thu, May 22, 2008 04:07 AM EDT:
'I cannot speak for Reinhard Scharnagl at all, though.'

This is exactly the problem. The 'base value' for Pawns is a very
ill-defined concept, as it is the smallest of all piece base values, while
the positional terms regarding to Pawns are usually the largest of all
positional terms. And the whole issue of pawn-structure evaluation in
Joker is so complex that I am not even sure if the average of positional
terms (over all pawns and over a typical game) is positive or negative.
Pawns get penalties for being doubled, or having no Pawns next or behind
them on neighboring files. They get points for advancing, but they get
penalties for creating squares that no longer can be defended by any Pawn.
My guess is that in general, the positional terms are slightly positive,
even for non-passers not involved in King Safety.

A statement like 'a Knight is worth exactly 3 Pawns' is only meaningful
after exactly specifying which kind of pawn. If the Scharnagl model
evaluates all non-passers exactly the same (except, perhaps, edge Pawns),
then the question still arises how to most-closely approximate that in
Joker80, which doesn't. And simply setting the Joker80 base value equal
to the single value of the Scharnagl model is very unlikely to do it.

Good differentiation in Pawn evaluation is likely to impact play strength
much more than the relative value of Pawns and Pieces, as Pawns are traded
for other Pawns (or such trades are declined by pushing the Pawn and
locking the chains) much more often than they can be traded for Pieces.

Derek Nalls wrote on Wed, May 21, 2008 08:33 PM EDT:
Muller:

Here is my latest revision to my 'winboard.ini' file.
Are these piece values acceptable to you?
Do you think these piece values will work smoothly with Joker80 running
under Winboard F yet remain true to all three models?
______________________________________________________

/firstChessProgramNames={'C:\winboard-F\Joker80\w\M-st\w-M-st 22
P85=300=350=475=875=900=950'
'C:\winboard-F\Joker80\w\M-sp\w-M-sp 22
P85=300=350=560=875=900=950'
'C:\winboard-F\Joker80\w\S-st\w-S-st 22
P85=302=339=551=694=902=950'
'C:\winboard-F\Joker80\w\S-sp\w-S-sp 22
P85=302=339=551=857=902=950'
'C:\winboard-F\Joker80\w\N-st\w-N-st 22
P85=284=326=548=866=884=950'
'C:\winboard-F\Joker80\w\N-sp\w-N-sp 22
P85=284=326=548=866=884=950'
'C:\winboard-F\TJchess\TJChess10x8'
}
/secondChessProgramNames={'C:\winboard-F\Joker80\b\M-st\b-M-st 22
P85=300=350=475=875=900=950'
'C:\winboard-F\Joker80\b\M-sp\b-M-sp 22
P85=300=350=560=875=900=950'
'C:\winboard-F\Joker80\b\S-st\b-S-st 22
P85=302=339=551=694=902=950'
'C:\winboard-F\Joker80\b\S-sp\b-S-sp 22
P85=302=339=551=857=902=950'
'C:\winboard-F\Joker80\b\N-st\b-N-st 22
P85=284=326=548=866=884=950'
'C:\winboard-F\Joker80\b\N-sp\b-N-sp 22
P85=284=326=548=866=884=950'
'C:\winboard-F\TJchess\TJChess10x8'
}

Derek Nalls wrote on Wed, May 21, 2008 08:13 PM EDT:
'If I were you, I would normalize all models to Q=950 but then replace
the pawn value everywhere by 85.'

Since this is what you (the developer of Joker80) recommend as optimum, 
this is what I will do.

Are you sure that replacing any pawn values different than 85 points
after renormalization to queen = 950 points still renders an accurate 
and complete representation, more or less, of the Scharnagl and Nalls 
models?

At a par of queen = 950 points, the pawn value in the Nalls model
is no longer represented as being only 92.19% as high as that in the Muller
model, and the pawn value in the Scharnagl model is no longer represented
as being only 98.95% as high as that in the Muller model.

Thru it all ... If a perfect representation is not quite possible, 
I can accept that without reservation.
__________________________________

'I don't think you could say then that you deviate from the
model as the models do not really specify which type of Pawn they use as
a standard.'

Correctly calculating pawn values at the start of the game (much less, 
throughout the game) requires finesse as it is indeed a complex issue.
In fact, its excessive complexity is the reason my 66-page paper on
material values of pieces is silent in the case of calculating pawn values
in FRC & CRC.  Instead, someone needs to read an entire book from an 
outside source about calculating the material values of the pieces in 
Chess to sufficiently understand it.

Personally, I am content with the test situation as long as Joker80 
handles all pawns under all three models initially valued at 85 points
as fairly and equally as realistically possible.

I cannot speak for Reinhard Scharnagl at all, though.
________________________________________________

'The way you did it now would make the first Bishop to be traded of the 
value the model prescribes, but would make the second much lighter. 
If you would subtract half the bonus, then on the average they would 
be what the model prescribes.'

Now, I understand better.
It makes sense.
[I am glad I asked you.]

Yes, I will subtract 20 points (1/2 of the 'bishop pair bonus') from the
model-independent material values for the bishop under the
Scharnagl & Nalls models.
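A small Python sketch of that renormalization; truncating (rather than rounding) the scaled values before subtracting the 20 points happens to reproduce the S-st numbers quoted in the winboard.ini excerpt above, so that is what this sketch assumes:

def renormalize(model):
    scale = 950.0 / model['Q']                     # peg the queen to 950
    out = {pc: int(v * scale) for pc, v in model.items()}
    out['P'] = 85    # pawn base value fixed at 85 across all models
    out['B'] -= 20   # half of Joker80's hard-wired 40 cP pair bonus
    return out

scharnagl = {'P': 100, 'N': 306, 'B': 363, 'R': 557,
             'A': 702, 'C': 912, 'Q': 960}
print(renormalize(scharnagl))
# {'P': 85, 'N': 302, 'B': 339, 'R': 551, 'A': 694, 'C': 902, 'Q': 950}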

H. G. Muller wrote on Wed, May 21, 2008 04:29 PM EDT:
Is there any special reason you want to keep the Pawn value equal in all
trial versions, rather than, say, the total value of the army, or the
value of the Queen? Especially in the Scharnagl settings it makes almost
every piece rather light compared to the quick guesses used for pruning.

Note that there are so many positional modifiers on the value of a pawn
(not only determined by its own position, but also by the relation to
other friendly and enemy pawns) that I am not sure what the base value
really means. Even if I say that it represents the value of a Pawn at g2,
the evaluation points lost on deleting a pawn on g2 will depend on if
there are pawns on e- and i-file, and how far they are advanced, and on
the presence of pawns on the f- and h-file (which might become backward
or isolated), and of course if losing the pawn would create a passer for
the opponent.

If I were you, I would normalize all models to Q=950, but then replace
the
pawn value everywhere by 85 (I think the standard value used in Joker is
even 75). I don't think you could say then that you deviate from the
model, as the models do not really specify which type of Pawn they use as
a standard. My value refers to the g2 pawn in an opening setup. Perhaps
Reinhard's value refers to an 'average' pawn, in a typical pawn chain
occurring in the early middle game, or a Pawn on d4/e4 (which is the most
likely to be traded).

As to the B-pair: tricky question. The way you did it now would make the
first Bishop to be traded of the value the model prescribes, but would
make the second much lighter. If you would subtract half the bonus, then
on the average they would be what the model prescribes. The value is
indeed hard-wired in Joker, but if you really want, I could make it
adjustable through a 8th parameter.

Derek Nalls wrote on Wed, May 21, 2008 03:02 PM EDT:
Muller:

Please have another look at this except from my 'winboard.ini' file. 
There are standard and special versions of piece values by Muller,
Scharnagl & Nalls for the white and black players renormalized to pawn =
85 points.

The special version of the Muller model has a rook value exactly 85 points
or 1.00 pawn higher than the standard version.

The special version of the Scharnagl model has an archbishop value (736
points) at appr. 95% of the chancellor value (775 points), instead of 597
points at appr. 77% for the standard version.

The special version of the Nalls model is identical to the standard
version until some test is needed and planned.

Since I assume that the 'bishop pair bonus' is hardwired into Joker80,
40 points has been subtracted from the model-independent material values
of the bishop under all three models.  Is this correct?
_____________________________________________________

/firstChessProgramNames={'C:\winboard-F\Joker80\w\M-st\w-M-st 22
P85=300=350=475=875=900=950'
'C:\winboard-F\Joker80\w\M-sp\w-M-sp 22
P85=300=350=560=875=900=950'
'C:\winboard-F\Joker80\w\S-st\w-S-st 22
P85=260=269=474=597=775=816'
'C:\winboard-F\Joker80\w\S-sp\w-S-sp 22
P85=260=269=474=736=775=816'
'C:\winboard-F\Joker80\w\N-st\w-N-st 22
P85=262=279=505=799=815=876'
'C:\winboard-F\Joker80\w\N-sp\w-N-sp 22
P85=262=279=505=799=815=876'
'C:\winboard-F\TJchess\TJChess10x8'
}
/secondChessProgramNames={'C:\winboard-F\Joker80\b\M-st\b-M-st 22
P85=300=350=475=875=900=950'
'C:\winboard-F\Joker80\b\M-sp\b-M-sp 22
P85=300=350=560=875=900=950'
'C:\winboard-F\Joker80\b\S-st\b-S-st 22
P85=260=269=474=597=775=816'
'C:\winboard-F\Joker80\b\S-sp\b-S-sp 22
P85=260=269=474=736=775=816'
'C:\winboard-F\Joker80\b\N-st\b-N-st 22
P85=262=279=505=799=815=876'
'C:\winboard-F\Joker80\b\N-sp\b-N-sp 22
P85=262=279=505=799=815=876'
'C:\winboard-F\TJchess\TJChess10x8'
}

H. G. Muller wrote on Wed, May 21, 2008 01:49 PM EDT:
Well, I share that concern. But note that the low Rook value was not only
based on the result of Q-2R asymmetric testing. I also played R-BP and
NN-RP, which ended unexpectedly badly for the Rook, and set the value of
the Rook compared to that of the minor pieces. While the value of the
Queen was independently tested against that of the minor pieces by playing
Q-BNN.

The low difference between R and B does make sense to me now, as the wider
board should upgrade the Bishop a lot more than the Rook. The Bishop gets
extra forward moves, and forward moves are worth a lot more than lateral
moves. I have seen that in testing cylindrical pieces, (indicated by *),
where the periodic boundary condition w.r.t. the side edges effectively
simulates an infinitely wide board. In a context of normal Chess pieces,
B* = B+P, while R* = R + 0.25P. OTOH, Q* = Q+2P. So it doesn't surprise
me that on wider boards R loses compared to Q and B.

I can think of several systematic errors that lead to unrealistically poor
performance of the Rook in asymmetric playtesting from an opening position.
One is that Capablanca Chess is a very violent game, where the three
super-pieces are often involved in inflicting an early chekmate (or nearly
so, where the opponent has to sacrifice so much material to prevent the
mate, that he is lost anyway). The Rooks initially offer not much defense
against that. But your chances for such an early victory would be strongly
reduced if you were missing a super-piece. So perhaps two Rooks would do
better against Q after A and C are traded. This explanation would do
nothing for explaining poor Rook performance of R vs B, but perhaps it is
B that is strong (it is also strong compared to N). The problem then would
be not so much low R value, but high Q value, due to cooperativity between
superpieces. So perhaps the observed scores should not be entirely
interpreted as high base values for Q, C and A, but might be partly due to
super-piece pair bonuses similar to that for the Bishop pair. Which I would
then (mistakenly) include in the base value, as the other super-pieces are
always present in my test positions.

Another possible source of error is that the engine plays a strategy that
is not well suited for playing 2R vs Q. Joker80's evaluation does not
place a lot of importance to keeping all its pieces defended. In general
this might be a winning strategy, giving the engine more freedom in using
its pieces in daring attacks. But 2R vs Q might be a case where this
backfires, and where you can only manifest the superiority of your Rook
force by very careful and meticulous, nearly allergic defense of your
troops, slowly but surely pushing them forward. This is not really the
style of Joker's play. So it would be interesting to do the asymmetric
playtesting for Q vs 2R also with other engines. But TJchess10x8 only
became available long after I started my piece value project, TSCP-G does
not allow setting up positions (although now I know a work-around for
that, forcing initial moves with both ArchBishops to capture all pieces to
delete, and then retreating them before letting the engine play). And Smirf
initially could not play automatically at all, and when I finally made a WB
adapter for it so that it could, fast games by it were more decided by
timing issues than by play quality (many losses on time with scores like
+12!). And Fairy-Max is really a bit too simplistic for this, not knowing
the concept of a Bishop pair or passed pawns, besides being a slower
searcher.

Derek Nalls wrote on Wed, May 21, 2008 12:53 PM EDT:
As I moved to renormalize all of the values used in Joker80 (written into
the 'winboard.ini' file) with the pawn at a par of 85 points, I looked
at my notes again.  They reminded me that your use of the 'bishop pair'
refinement (with a bonus of 40 points) implies that the material value of
the rook is either 1.00 pawns or 1.47 pawns greater than the material value
of the bishop in CRC, depending upon whether or not only one bishop or both
bishops, respectively, remain in the game.  At that point, I realized that
I would be attempting to playtest for a discrepancy that I know from
experience is just too small to detect even at very long time controls. 
So, this planned test has been cancelled.

I am not implying that this matter is unimportant, though.  I remain
concerned for the standard Muller model whenever it allows the exchange of
its 2 rooks for 1 queen belonging to its opponent.

H. G. Muller wrote on Wed, May 21, 2008 08:48 AM EDT:
It looks OK to me.

One caveat: the normalization (e.g. Pawn = 100) is not completely
arbitrary, as the engine weights material against positional terms, and
doubling all piece values would effectively scale down the importance of
passers and King Safety.

In addition, the engine also uses some heavily rounded 'quick' piece
values internally, where B=N=3, R=5, A=C=8 and Q=9, to make a rough guess
if certain branches stand any chance to recoup the material it gave
earlier in the branch. So in certain situations, when it is behind 800
cP, it won't consider capturing a Rook, because it expects that to be
worth about 500 cP, and thus falls 300 cP below the target. Such a large
deficit would be beyond the safety margin for pruning the move. But if the
piece values were scaled up such that the 800 merely represented being a
Bishop behind, this obviously would be an unjustified pruning.

The safety margin is large enough to allow some leeway here, but don't
overdo it. It would be safest to keep the value of Q close to 950.
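A schematic sketch of that kind of pruning decision, using the quick values quoted above scaled to centipawns (the 200 cP safety margin is an invented placeholder, not Joker80's actual figure):

QUICK_VALUES = {'P': 100, 'N': 300, 'B': 300, 'R': 500,
                'A': 800, 'C': 800, 'Q': 900}  # the rough internal guesses
SAFETY_MARGIN = 200  # cP of slack before a branch is given up on (assumed)

def prune_capture(deficit_cp, captured_piece):
    # E.g. 800 cP behind, capturing a Rook (~500 cP) still leaves a
    # 300 cP gap -- beyond the margin, so the move would be pruned.
    return deficit_cp - QUICK_VALUES[captured_piece] > SAFETY_MARGIN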

I am indeed skeptical to the possibility to do enough games to measure the
difference you want to see in the total score percentage. But perhaps some
sound conclusions could be drawn by not merely looking at the result, but
at the actual games, and single out the Q vs 2R trades. (Or actually any
Rook versus other material trade before the end-game. Rooks capturing
Pawns to prevent their promotion probably should not count, though.) These
could then be used to separately extract the probability for such a
trade for the two sets of piece values, and determine the winning
probability for each of the piece values once such a trade would have
occurred. By filtering the raw data this way, we get rid of the stochastic
noise produced by the (majority of) games where the event we want to
determine the effect of would not have occurred.

Derek Nalls wrote on Tue, May 20, 2008 05:17 PM EDT:
Muller:

Please confirm that these are legal values for the 'winboard.ini' file.

/firstChessProgramNames={'C:\winboard-F\Joker80\w\M-st\w-M-st 22
P100=353=459=559=1029=1059=1118'
'C:\winboard-F\Joker80\w\M-sp\w-M-sp 22
P100=353=459=659=1029=1059=1118'
'C:\winboard-F\Joker80\w\S-st\w-S-st 22
P100=306=363=557=702=912=960'
'C:\winboard-F\Joker80\w\S-sp\w-S-sp 22
P100=306=363=557=866=912=960'
'C:\winboard-F\Joker80\w\N-st\w-N-st 22
P100=308=376=594=940=958=1031'
'C:\winboard-F\Joker80\w\N-sp\w-N-sp 22
P100=308=376=594=940=958=1031'
'C:\winboard-F\TJchess\TJChess10x8'
}
/secondChessProgramNames={'C:\winboard-F\Joker80\b\M-st\b-M-st 22
P100=353=459=559=1029=1059=1118'
'C:\winboard-F\Joker80\b\M-sp\b-M-sp 22
P100=353=459=659=1029=1059=1118'
'C:\winboard-F\Joker80\b\S-st\b-S-st 22
P100=306=363=557=702=912=960'
'C:\winboard-F\Joker80\b\S-sp\b-S-sp 22
P100=306=363=557=866=912=960'
'C:\winboard-F\Joker80\b\N-st\b-N-st 22
P100=308=376=594=940=958=1031'
'C:\winboard-F\Joker80\b\N-sp\b-N-sp 22
P100=308=376=594=940=958=1031'
'C:\winboard-F\TJchess\TJChess10x8'
}

Derek Nalls wrote on Tue, May 20, 2008 05:05 PM EDT:
Of course, I would bet anything that there are no 1:1 exchanges supported
under the standard Muller CRC model that could cause material losses.  If
that were the case, yours would not be one of the three most credible CRC
models under close consideration.  In fact, even your excellent Joker80
program would play poorly if stuck with using faulty CRC piece values.

Obviously, the longer the exchange, the rarer its occurrence during
gameplay.  The predominance of simple 1:1 exchanges over even the least
complicated, 1:2 or 2:1 exchanges, in gameplay is large although I do not
know the stats.

In fact, there is a certain 1:2 or 2:1 exchange I am hoping to see that is
likely to support my contention that the Muller rook value should be
higher: the 1 queen for 2 rooks or 2 rooks for 1 queen exchange.  Please
recall that under the standard Muller model, this is an equal exchange. 
However, under asymmetrical playtesting of comparable quality to and
similar to that I used to confirm the correctness of your higher
archbishop value, I played numerous CRC games at various moderate time
controls where the player without 1 queen (yet with 2 rooks) defeated the
player without 2 rooks (yet with 1 queen).  Ultimately, a key mechanism to conclusive results is that while the standard Muller model is neutral toward a 2 rook : 1 queen or 1 queen : 2 rook exchange, the special Muller model regards its 1 queen as significantly less valuable than 2 rooks of its opponent.  Consequently, this contrast in valuation could be played into ... and we would see who wins.

I am actually pleased that you are a realist who shares my pessimism in
this experiment.  In any case, low odds do not deter a best effort to
succeed.  The main difference between us is that you calculate your
pessimism by extreme statistical methods whereas I calculate my pessimism
by moderate probabilistic methods.  I remain hopeful that eventually I
will prove to you that the method Scharnagl & I developed is occasionally
productive.

H. G. Muller wrote on Tue, May 20, 2008 02:43 PM EDT:
Well, to get an impression at what you can expect: In my first versions of
Joker80 I still used the Larry-Kaufman piece values of 8x8 Chess. So the
Bishop was half a Pawn too low, nearly equal to the Knight (as with more
than 5 Pawns, Kaufman has a Knight worth more than a lone Bishop,
neutralizing a large part of the pair bonus.) Now unlike a Rook, a Bishop is
very easy to trade for a Knight, as both get into play early. Making the
trade usually wrecks the opponent's pawn structure by creating a doubled
Pawn, giving enough compensation to make it attractive.

So in almost all games Joker played with two Knights against two Bishops
after 12 moves or so. Fixing that did increase the playing strength by
~100 Elo points. So where the old version would score 50%, the improved
version would score 57%.

Now a similarly bad value for the Rook would manifest itself much less
readily: the Rooks get into play late, there is no nearly equal piece
for which a 1:1 trade changes sign, and you would need 1:3 trades (R vs
B+2P) or 2:2 trades (R+P for N+N), which are much more difficult to set
up. So I would expect that being half a Pawn off on the Rook value would
only reduce your score by about 3%, rather than 7% as with the Bishop.
After playing 100 games, the score differs by more than 3% from the true
win probability more often than not. So you would need at least 400 games
to show with minimal confidence that there was a difference.
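An exact check of that claim, in Python, under a win/loss-only simplification (draws, which narrow the spread, are ignored here):

from math import comb

n, p = 100, 0.5
within = sum(comb(n, k) * p**k * (1 - p)**(n - k)
             for k in range(47, 54))  # scores of 47..53, i.e. within 3%
print(f'P(off by more than 3%) = {1 - within:.3f}')  # ~0.48, about half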

Beware that the result of the games are stochastic quantities. Replay the
game at the same time control, and the game Joker80 plays will be
different. And often the result will be different. This is true at 1 sec
per move, but it is equally true at 1 year per move. The games that will
be played, are just a sample from the myriads of games Joker80 could play
with non-zero probability. And with fewer than 400 games, the difference
between the actually measured score percentage and the probability you
want to determine will be in most cases larger than the effect of the
piece values, if they are not extremely wrong (e.g. setting Q < B).

Derek Nalls wrote on Tue, May 20, 2008 12:48 PM EDT:
Everything is working fine.
Thank you!

I now have 12 instances of the Joker80 program running in various
sub-directories of Winboard F with the 'winboard.ini' file set to
conveniently initiate any desired standard or special material values for
the CRC models by Muller, Scharnagl and Nalls.

In the first test, I am going to attempt to find a playtesting time where
a distinct separation in playing strength occurs between the standard
Muller model wherein the rook is 1 pawn more valuable than the bishop and
a special Muller model wherein the rook is 2 pawns more valuable than the
bishop.  If I successfully find a playtesting time that is survivable by
humans, then we can hopefully establish a tentative probability as to
which CRC model plays decisively better after a few-several games.

At par 100 (for the pawn), the bishop is at 459 under both models with the
rook at 559 under the standard Muller model and 659 under the special
Muller model.

I want to playtest a special Muller model with a rook value 2.00 pawns higher than the bishop because the Nalls model has a rook value 2.19 pawns higher than the bishop and the Scharnagl model has a rook value 1.94 pawns higher than the bishop (for an average of 2.06 pawns).

Since I am attempting to test for such a small difference in the material value of only one type of piece (the rook), I have doubts that I will be able to obtain conclusive results.  In any case ... If I obtain conclusive results, then very long time controls will surely be required to produce them.

H. G. Muller wrote on Tue, May 20, 2008 07:39 AM EDT:
One small refinement:

If the command-line argument was used to modify the piece values, Joker80
will give its own name to WinBoard as 'Joker80.xp', instead of
'Joker80.np', so that it becomes less hard to figure out which engine
was winning (e.g. from the PGN file).

Note also that at very long time control you might want to enlarge the
hash table; default is 128MB, but if you invoke Joker80 as

'joker80.exe 22 P100=300=....'

it will use 256MB (and with 23 instead of 22 it will use 512MB, etc.)

H. G. Muller wrote on Tue, May 20, 2008 07:06 AM EDT:
OK, I replaced the joker80.exe on my website by one with adjustable piece
values. (If you run it from the command line, it should say version 1.1.14
(h).) I also tried to fix the bug in undo (which I discovered was disabled
altogether in the previous version), and although it seemed to work, it
might remain a weak spot. (I foresee problems if the game contained a
promotion, for instance, as it might not remember the correct promotion
piece on replay.) So try to avoid using the undo.

I decided to make the piece values adjustable through a command-line
option, rather than from a file, to avoid problems if you want to run two
different sets of piece values (where you then would have to keep the
files separate somehow). The way it works now is that for the engine name
(that WinBoard asks in the startup dialog, or that you can put in the
winboard.ini file to appear in the selectable engines there), you should
write:

joker80.exe P85=300=350=475=875=900=950

The whole thing should be put between double quotes, so that WinBoard
knows the P... is an option to the engine, and not to WinBoard. The
numerical values are those of P, N, B, R, A, C and Q, respectively, in
centiPawn. You can replace them by any value you like. If you don't give
the P argument, it uses the default values. If you give a P argument with
not enough values, the engine exits.

Note that these are base values, for the positionally average piece. For N
and B this would be on c3, in the presence (for B) of ~ 6 own Pawns, half
of them on the color of the Bishop. A Bishop pair further gets 40cP bonus.
For the Rook it is the value for one in the absence of (half-)open files.
The Pawn value will be heavily modified by positional effects
(centralization, support by own Pawns, blocking by enemy Pawns), which on
the average will be positive.

Note that you can play two different versions against each other
automatically. The first engine plays white, in two-machines mode. (You
won't be able to recognize them from their name...)

Derek Nalls wrote on Tue, May 20, 2008 03:16 AM EDT:
'Human vs. engine play is virtually untested. 
Did you at any point of the game use 'undo'
(through the WinBoard 'retract move')?'

Yes.
Many of us error-prone humans use it frequently.
________________________________________________

'This is indeed something I should fix but
the current work-around would be not to use 'undo'.'

Makes sense to me.
I can avoid using the 'retract move' command altogether.
________________________________________________________

'I could make a Joker80 version that reads the piece base values from a
file 'joker.ini' at startup. Then you could change them to anything you
want to test, without the need to re-compile. Would that satisfy your
needs?'

Yes, better than I ever imagined.
Thank you!

H. G. Muller wrote on Tue, May 20, 2008 02:39 AM EDT:
First about the potential bug: I am afraid that I need more information to figure out what exactly was the problem. This is not a plain move-generator bug; when I feed the game to my version of Joker80 here (which is presumably the same as that you are using), it accepts the move without complaints. It would be inconceivable anyway that a move-generator bug in such a common move would not have manifested itself in the many hundreds of games I had it play against other engines.

OTOH, Human vs. engine play is virtually untested. Did you at any point of the game use 'undo' (through the WinBoard 'retract move')? It might be that the undo is not correctly implemented, and I would not notice it in engine-engine play. In fact it is very likely to be broken after setting up a position, as I implemented it by resetting to the opening position and replaying all moves from there. But this won't work after loading a FEN (a feature I added only later). This is indeed something I should fix, but the current work-around would be not to use 'undo'.

To make sure what happened, I would have to see the winboard.debug file (which records all communication between engine and GUI, including a lot of debug output from the engine itself). Unfortunately this file is not made by default. You would have to start WinBoard with the command-line option /debug, or press + + after starting WinBoard. And then immediately rename the winboard.debug to something else if a bug manifests itself, to prevent it from being overwritten when you run WinBoard again. Joker80 also makes a log file 'jokerlog.txt', but this also is overwritten each time you re-run it. If you didn't run Joker80 since the bug, it might help if you sent me that file. Otherwise, I am afraid that there is little I can do at the moment; we would have to wait until the problem occurs again, and examine the recorded debug information.

About the piece values: I could make a Joker80 version that reads the piece base values from a file 'joker.ini' at startup. Then you could change them to anything you want to test, without the need to re-compile. Would that satisfy your needs? Note that currently Joker80 is not really able to play CRC, as it only supports normal castling.

Derek Nalls wrote on Mon, May 19, 2008 09:13 PM EDT:
Muller:

Please investigate this potentially serious bug I may have discovered
while testing Joker80 under Winboard F ...

Bugs, Bugs, Bugs!
http://www.symmetryperfect.com/pass

I am having a hard time with software today.

Joe Joyce wrote on Mon, May 19, 2008 07:40 PM EDT:
This sounds like an interesting proposition.

Derek Nalls wrote on Mon, May 19, 2008 06:28 PM EDT:
Muller:

I would like to conduct two focused playtests using Joker80 at very long
time controls (e.g., 30 minutes per move) to investigate these important questions-

1.  Is Muller's rook value within the CRC set too low?
2.  Is Scharnagl's archbishop value within the CRC set too low?

I would need for you to compile special versions of Joker80 for me using
significantly different values for those CRC pieces as well as
Scharnagl's CRC piece set.  To isolate the target variable, these games would be Muller (standard values) vs. Muller (test values) and Scharnagl (standard values) vs. Scharnagl (test values) via symmetrical playtesting.  Anyway, we can discuss the details if you are interested or willing.  Please let me know.

Derek Nalls wrote on Mon, May 19, 2008 06:13 PM EDT:
Since Muller's Joker80 has recently established itself via 'The Battle Of
The (Unspeakables)' tournament as the best free CRC program in the world,
I checked it out.  I must report that setting up Winboard F (also written
by Muller) to use it was straightforward with helpful documentation.
Generally, I am finding the features of Joker80 to be versatile and
capable for any reasonable uses.

Derek Nalls wrote on Mon, May 19, 2008 05:58 PM EDT:
To anyone who was interested ...

My playtesting efforts using SMIRF have been suspended indefinitely due to a serious checkmate bug which tainted the first game at 30 minutes per move between Scharnagl's and Muller's sets of CRC piece values.

H. G. Muller wrote on Thu, May 15, 2008 12:22 PM EDT:
Rich Hutnik:
| Anyone think this might be a sound approach?

Well, not me! Science is not a democracy. We don't interview people in
the street to determine if a neutron is heavier than a proton, or what the 100th decimal of the number pi is.

At best, you could use this method to determine the CV rating of the
interviewed people. But even if a million people would think that piece A
is worth more than piece B, and none the other way around, that doesn't
make it so. The only thing that counts is if A makes you win more often
than B would. If it doesn't, then it is of lower value. No matter what people say, or how many say it.

Rich Hutnik wrote on Wed, May 14, 2008 10:26 PM EDT:
Here is another approach I would suggest for strength of pieces.  How about
we pick 100 and people order them from strongest to weakest?  Work on a
scoring system for position, and then at least get an idea of order of
strength.

Anyone think this might be a sound approach?

H. G. Muller wrote on Wed, May 14, 2008 03:09 AM EDT:
This discussion is too silly for words anyway. Because even if it were true that the winning probability for a given material imbalance would be different at 1 hour per move than it would be at 10 sec/move, it would merely mean that piece values are different for different quality players. And although that is unprecedented, that revelation in itself would not make the piece values at 1 hour per move of any use, as that is a time control that no one wants to play anyway.

So the whole endeavor is doomed from the start: by testing at 1 hour per move, either you measure the same piece values as you would at 10  sec/move, and wasted 99.7% of your time, or you find different values, and then you have wrong values, which cannot be used at any time control you would actually want to play...

H. G. Muller wrote on Tue, May 13, 2008 05:13 PM EDT:
Reinhard, that is not relevant. It will happen on the average as often for
the other side. It is in the nature of Chess. Every game that is won is
won by an error that might not have been made on longer thinking, as the
initial position is not a won position for either side. But most games are
won by one side or the other, and if they are allowed to think longer, most
games are still won by one side or the other.

What is so hard to understand about the statement 'the win probability
(score fraction, if you allow for draws) obtained from a given quiet, but
complex (many pieces) position between equal opponents does not depend on
time control' that it prompts people to come up with irrelevancies? Why do
you think that saying anything at all that does not mention an observed
probability would have any bearing on this statement whatsoever?

I don't think the ever more hollow-sounding self-declared superiority of
Derek needs much comment. He obviously knows zilch about
probability theory and statistics. Shouting that he does won't make it
so, and won't fool anyone.

Reinhard Scharnagl wrote on Tue, May 13, 2008 03:05 PM EDT:
To H.G.M.: why do you have to be that unfriendly? But to give you a strong
argument that longer thinking phases could change a game result, have a
look at:
[site removed],
where [a claim is made] that there would be a mate in 9. In
fact, SMIRF was in a lost situation there. But watching a chess engine
calculate on that position, you could see that an initial heavy
disadvantage switches into a secure win. Having engines calculate with
short time frames would probably lead to another result. Here increasing
thinking time indeed leads to a result switch.

[The above has been edited to remove a name and site reference. It is the
policy of cv.org to avoid mention of that particular name and site to
remove any threat of lawsuits. Sorry to have to do that, but we must
protect ourselves. -D. Howe]

Derek Nalls wrote on Tue, May 13, 2008 01:27 PM EDT:
'Is this story meant to illustrate that you have no clue as to how to
calculate statistical significance?'

No.

This story is meant to illustrate that you have no clue as to how to
calculate probabilistic significance ... and it worked perfectly.
________________________________________________________

There you go again.  Missing the point entirely and ranting about
probabilities not being proper statistics.

H. G. Muller wrote on Tue, May 13, 2008 01:06 PM EDT:
Reinhard Scharnagl:
| I am still convinced, that longer thinking times would have an 
| influence on the quality of the resulting moves.

Yes, so what? Why do you think that is a relevant remark? The better moves
won't help you at all, if the opponent also does better moves. The result
will be the same. And the rare cases where it is not, on the average, cancel
each other.

So for the umpteenth time:
NO ONE DENIES THAT LONGER THINKING TIME PRODUCES SOMEWHAT BETTER MOVES.
THE ISSUE IS THAT IF BOTH SIDES PLAY WITH LONGER TC, THEIR WINNING
PROBABILITIES WON'T CHANGE.

And don't bother to tell us that you are also convinced that the
winning probabilities will change, without showing us proof. Because no
one is interested in unfounded opinions, not even if they are yours.

H. G. Muller wrote on Tue, May 13, 2008 12:57 PM EDT:
Is this story meant to illustrate that you have no clue as to how to
calculate statistical significance? Or perhaps that you don't know what
it is at all?

The observation of a single tails event rules out the null hypothesis that
the lottery was fair (i.e. that the probability for this to happen was
0.000,000,01) with a confidence of 99.999,999%.

Be careful, though, that this only describes the case where the winning
android was somehow special or singled out in advance. If the other
participants to the lottery were 100 million other cheating androids, it
would not be remarkable in any way that one of them won. The null
hypothesis that the lottery was fair predicted a 100% probability for
that.

But, unfortunately for you, it doesn't work for lotteries with only 2
tickets. Then you can rule out the null hypothesis that the lottery was fair
(and hence the probability 0.5) with a confidence of 50%. And 50%
confidence means that in 50% of the cases your conclusion is correct, and
in the other 50% of the cases not. In other words, a confidence level of
50% is a completely blind, uninformed random guess.
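The arithmetic behind those confidence figures, as a short sketch (the 1,024-ticket case, which is 2^10, corresponds to ten identical flips in a row):

# confidence = 1 - P(observed win | the lottery was fair)
for tickets in (100_000_000, 1024, 2):
    print(f'{tickets:>11} tickets -> {1 - 1 / tickets:.8%} confidence')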

Derek Nalls wrote on Tue, May 13, 2008 12:18 PM EDT:
Since I had to endure one of your long bedtime stories (to be sure),
you are going to have to endure one of mine.  Yet unlike yours
[too incoherent to merit a reply], mine carries an important point:

Consider it a test of your common sense-

Here is a scenario ...

01.  It is the year 2500 AD.

02.  Androids exist.

03.  Androids cannot tell lies.

04.  Androids can cheat, though.

05.  Androids are extremely intelligent in technical matters.

06.  Your best friend is an android.

07.  It tells you that it won the lottery.

08.  You verify that it won the lottery.

09.  It tells you that it purchased only one lottery ticket.

10.  You verify that it purchased only one lottery ticket.

11.  The chance of winning the lottery with only one ticket is 1 out of
100 million.

12.  It tells you that it cheated to win the lottery by hacking into its
computer system immediately after the winning numbers were announced,
purchasing one winning ticket and back-dating the time of the purchase.
____________________________________________

You have only two choices as to what to believe happened-

A.  The android actually won the lottery by cheating.

OR

B.  The android actually won the lottery by good luck.
The android was mistaken in thinking it successfully cheated.
______________________________________________________

The chance of 'A' being true is 99,999,999 out of 100,000,000.
The chance of 'B' being true is 1 out of 100,000,000.
________________________________________________

I would place my bet upon 'A' being true
because I do not believe such unlikely coincidences
will actually occur.

You would place your bet upon 'B' being true
because you do not believe such unlikely coincidences
have any statistical significance whatsoever.
_________________________________________

I make this assessment of your judgment ability fairly because you think
it is a meaningless result if a player with one set of CRC piece values
wins against its opponent 10-times-in-a-row even as the chance of it being
'random good luck' is indisputably only 1 out of 1024.

By the way ...

base 2 to exponent 100 equals 1,267,650,600,228,229,401,496,703,205,376.

Can you see how ridiculous your demand of 100 games is?

Reinhard Scharnagl wrote on Tue, May 13, 2008 11:50 AM EDT:
H.G.M. wrote: '... he threw the coin only 10 feet up into the air, on each try. While I brought my coin up to 30,000 feet in an airplane ...'

Understanding your example as an argument against Derek Nalls' testing method, I wonder why your chess engines always think using the full given timeframe. It would be much more impressive if your engine would always decide immediately. ;-)

I am still convinced, that longer thinking times would have an influence on the quality of the resulting moves.

Jianying Ji wrote on Tue, May 13, 2008 11:28 AM EDT:
I really am completely lost, so I won't comment until I can see what the
debate is about.

Derek Nalls wrote on Tue, May 13, 2008 11:08 AM EDT:
'This discussion is pointless.'

On this one occasion, I agree with you.

However, I cannot just let you get away with some of your most 
outrageous remarks to date.

So, unfortunately, this discussion is not yet over.
____________________________________________

'First you should have results, 
then it becomes possible to talk about what they mean. 
You have no result.'

Of course, I have a result!

The result is obviously the game itself as a win, loss or draw
for the purposes of comparing the playing strengths of two
players using different sets of CRC piece values.

The result is NOT statistical in nature.
Instead, the result is probabilistic in nature.

I have thoroughly explained this purpose and method to you.
I understand it.
Reinhard Scharnagl understands it.
You do not understand it.
I can accept that.
However, instead of admitting that you do not understand it,
you claim there is nothing to understand.
______________________________________

'Two sets of piece values as different as day and night, and the only
thing you can come up with is that their comparison is
'inconclusive'.'

Yes.  Draws make it impossible to determine which of two sets of
piece values is stronger or weaker.  However, by increasing the
time (and plies) per move, smaller differences in playing strength 
can sometimes be revealed with 'conclusive' results.

I will attempt the next pair of Scharnagl vs. Muller and Muller vs.
Scharnagl games at 30 minutes per move.  Knowing how much
you appreciate my efforts on your behalf motivates me.
___________________________________________________

'Talk about pathetic: even the two games you played are the same.'

Only one game was played.

The logs you saw were produced by the Scharnagl (standard) version
of SMIRF for the white player and the Muller (special) version of SMIRF
for the black player.  The game is handled in this manner to prevent 
time from expiring without computation occurring.
___________________________________________________

'... does your test setup s*ck!'

What, now you hate Embassy Chess too?
Take up this issue with Kevin Hill.

H. G. Muller wrote on Tue, May 13, 2008 09:58 AM EDT:
Jianying Ji:
| Two suggestions for settling debates such as these: first, distributed
| computing to provide as much data as possible; second, Bayesian statistical
| methods to provide statistical bounds on results.

Agreed: one first needs to generate data. Without data, there isn't even
a debate, and everything is just idle talk. What bounds would you expect
from a two-game dataset? And what if these two games were actually the
same?

But the problem is that the proverbial fool can always ask more than
anyone can answer. If, by recruiting all PCs in the World, we could
generate 100,000 games at an hour per move, an hour per move will of
course not be 'good enough'. It will at least have to be a week per
move. Or, if that is possible, 100 years per move.

And even 100 years per move are of course no good, because the computers
will still not be able to search into the end-game, as they will search
only 12 ply deeper than with 1 hour per move. So what's the point?

Not only is this an end-of-the-rainbow-type endeavor; even if you would get
there, and generate the perfect data, where it is 100% sure and proven for
each position what the outcome under perfect play is, what then? Because
for simple end-games we are already in a position to reach perfect play,
through retrograde analysis (tablebases).
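
As an aside, the backward-induction idea behind such tablebases can be
shown on a toy game (a simple subtraction game, purely illustrative and
much smaller than any chess ending):

    def solve(n_max=20, moves=(1, 2, 3)):
        # Label every position win/loss by working backward from the
        # terminal position, as retrograde analysis does for endings.
        result = {0: 'loss'}  # the player to move has no move and loses
        for n in range(1, n_max + 1):
            reachable = [result[n - m] for m in moves if m <= n]
            result[n] = 'win' if 'loss' in reachable else 'loss'
        return result

    print(solve()[8])  # 'loss': multiples of 4 are lost in this toy game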

So why not start there, to show that such data is of any use whatsoever,
in this case for generating end-game piece values? If you have the EGTB
for KQKAN, and KAKBN, how would you extract a piece value for A from it?

Jianying Ji wrote on Tue, May 13, 2008 08:59 AM EDT:
Two suggestions for settling debates such as these: first, distributed
computing to provide as much data as possible; second, Bayesian statistical
methods to provide statistical bounds on results.

H. G. Muller wrote on Tue, May 13, 2008 06:59 AM EDT:
Once upon a time I had a friend in a country far, far away, who had
obtained a coin from the bank. I was sure this coin was counterfeit, as it
had a far larger probability of producing tails. I even PROVED it to him: I
threw the coin twice, and both times tails came up. But do you think the
fool believed me? No, he DIDN'T! 

He had the AUDACITY to claim there was nothing wrong with the coin,
because he had tossed it a thousand times, and 523 times heads had come up!
While it was clear to everyone that he was cheating: he threw the coin only
10 feet up into the air, on each try. While I brought my coin up to 30,000
feet in an airplane, before I threw it out of the window, BOTH times! And,
mind you, both times it landed tails! And it was not just an ordinary
plane, like a Boeing 747. No sir, it was a ROCKET plane!

And still this foolish friend of mine insisted that his measly 10 feet
throws made him more confident that the coin was OK than my IRONCLAD PROOF
with the rocket plane. Ridiculous! Anyone knows that you can't test a
coin by only tossing it 10 feet. If you do that, it might land on any
side, rather than the side it always lands on. He might as well have
flipped a coin! No wonder they sent him to this far, far away country: no
one would want to live in the same country as such an idiot. He even went
as far as to buy an ICE CREAM for that coin, and even ENJOYED eating it!
Scandalous! I can tell you, he ain't my friend anymore! Using coins that
always land on one side as if it were real money.

For more fairy tales and bed-time stories, read Derek's postings on piece
values...
:-) :-) :-)

H. G. Muller wrote on Tue, May 13, 2008 03:17 AM EDT:
This discussion is pointless. In dealing with a stochastic quantity, if
your statistics are no good, your observations are no good, and any
conclusions based on them are utterly meaningless. Nothing of what you say
here has any reality value; it is just your own fantasies. First you
should have results, then it becomes possible to talk about what they
mean. You have no result. Get statistically meaningful test results. If
your method can't produce them, or you don't feel it important enough to
make your method produce them, don't bother us with your cr*p instead.

Two sets of piece values as different as day and knight, and the only
thing you can come up with is that their comparison is 'inconclusive'.
Are you sure that you could conclusively rule out that a Queen is worth 7,
or a Rook 8, by your method of 'playtesting'? Talk about pathetic: even
the two games you played are the same. Oh man, does your test setup s*ck!
If you cannot even decide simple issues like this, what makes you think
you have anything meaningful to say about piece values at all?

Derek Nalls wrote on Mon, May 12, 2008 11:38 PM EDT:
CRC piece values tournament
http://www.symmetryperfect.com/pass/

Just push the 'download now' button.

Game #1
Scharnagl vs. Muller
10 minutes per move
SMIRF MS-174c

Result- inconclusive.
Draw after 87 moves by black.
Perpetual check declared.

Derek Nalls wrote on Mon, May 12, 2008 10:39 PM EDT:
'Of course, that is easily quantified. The entire mathematical field of
statistics is designed to precisely quantify such things, through
confidence levels and uncertainty intervals.'

No, it is not easily quantified.  Some things of numerical importance
as well as geometric importance that we try to understand or prove 
in the study of chess variants are NOT covered or addressed by statistics.
I wish our field of interest was that simple (relatively speaking) and
approachable but it is far more complicated and interdisciplinary.  
All you talk about is statistics.  Is this because statistics is all you
know well?
___________

'That difference just can't be seen with two games. Play 100.
There is no shortcut.'

I agree.  Not with only 2 games.  

However ...

With only 4 games, IF they were ALL victories or defeats for the player 
using a given piece values model, I could tell you with confidence 
that there is at least a 15/16 chance the given piece values model is 
stronger or weaker, respectively, than the piece values model used by 
its opponent.  [Otherwise, the results are inconclusive and useless.]

Furthermore, based upon the average number of moves per game 
required for victory or defeat compared to the established average 
number of moves in a long, close game, I could probably, correctly 
estimate whether one model was a little or a lot stronger or weaker, 
respectively, than the other model.  Thus, I will not play 100 games 
because there is no pressing, rational need to reduce the 'chance of 
random good-bad luck' to the ridiculously-low value of 
'the inverse of (base 2 to exponent 100)'.

Is there anything about the odds associated with 'flipping a coin'
that is beyond your ability to understand?  This is a fundamental 
mathematical concept applicable without reservation to symmetrical 
playtesting.  In any case, it is a legitimate 'shortcut' that I can and
will use freely.
________________

'Even perfect play doesn't help. We do have perfect play for all 6-men 
positions.'

I meant perfect play throughout an entire game of a CRC variant 
involving 40 pieces initially.  That is why I used the word 'impossible'
with reference to state-of-the-art computer technology.
_______________________________________________________

'This is approximately master-level play.'

Well, if you are getting master-level play from Joker80 with speed
chess games, then I am surely getting a superior level of play from 
SMIRF with much longer times and deeper plies per move.  You see,
I used the term 'virtually random moves' appropriately in a 
comparative context based upon my experience.
_____________________________________________

'Doesn't matter if you play at an hour per move, a week per move, 
a year per move, 100 year per move. The error will remain >=32%. 
So if you want to play 100 years per move, fine. But you will still
need 100 games.'

Of course, it matters a lot.  If the program is well-written, then the 
longer it runs per move, the more plies it completes per move
and consequently, the better the moves it makes.  Hence,
the entire game played will progressively approach the ideal of 
perfect play ... even though this finite goal is impossible to attain.
Incisive, intelligent, resourceful moves must NOT be confused with 
or dismissed as purely random moves.  Although I could humbly limit 
myself to applying only statistical methods, I am totally justified,
in this case, in more aggressively using the 'probabilities associated 
with N coin flips ALL with the same result' as an incomplete, minimum 
value before even taking the playing strength of SMIRF at extremely-long 
time controls into account to estimate a complete, maximum value.
______________________________________________________________

'The advantage that a player has in terms of winning probability is the
same at any TC I ever tried, and can thus equally reliably be determined
with games of any duration.'

You are obviously lacking completely in the prerequisite patience and 
determination to have EVER consistently used long enough time controls 
to see any benefit whatsoever in doing so.  If you had ever done so, 
then you would realize (as everyone else who has done so realizes) 
that the quality of the moves improves and even if the winning probability
has not changed much numerically in your experience, the figure you 
obtain is more reliable.  

[I cannot prove to you that this 'invisible' benefit exists
statistically. Instead, it is an important concept that you need to
understand in its own terms.  This is essential to what most playtesters do, with the notable exception of you.  If you want to understand what I do and why, then you must come to grips with this reality.]

H. G. Muller wrote on Mon, May 12, 2008 06:12 PM EDT:
Derek Nalls:
| They definitely mean something ... although exactly how much is not 
| easily known or quantified (measured) mathematically.
Of course that is easily quantified. The entire mathematical field of
statistics is designed to precisely quantify such things, through
confidence levels and uncertainty intervals. The only thing you proved
with reasonable confidence (say 95%) is that two Rooks are not 1.66 Pawn
weaker than a Queen. So if Q=950, then R > 392. Well, no one claimed
anything different. What we want to see is if Q-RR scores 50% (R=475) or
62% (R=525). That difference just can't be seen with two games. Play 100.
There is no shortcut. Even perfect play doesn't help. We do have perfect
play for all 6-men positions. Can you derive piece values from that, even
end-game piece values???

| Statistically, when dealing with speed chess games populated 
| exclusively with virtually random moves ... YES, I can understand and 
| agree with you requiring a minimum of 100 games.  However, what you 
| are doing is at the opposite extreme from what I am doing via my 
| playtesting method.
Where do you get this nonsense? This is approximately master-level play.
Fact is that results from playing opening-type positions (with 35 pieces
or more) are a stochastic quantity at any level of play we are likely to
see in the next few million years. And even if they weren't, so that you
could answer the question 'who wins' through a 35-men tablebase, you would
still have to make some average over all positions (weighted by relevance)
with a certain material composition to extract piece values. And if you
would do that by sampling, the result would again be a stochastic quantity.
And if you would do it by exhaustive enumeration, you would have no idea
which weights to use.
And if you are sampling a stochastic quantity, the error will be AT LEAST
as large as the statistical error. Errors from other sources could add to
that. But if you have two games, you will have at least 32% error in the
result percentage. Doesn't matter if you play at an hour per move, a week
per move, a year per move, or 100 years per move. The error will remain >=
32%. So if you want to play 100 years per move, fine. But you will still
need 100 games.

| Nonetheless, games played at 100 minutes per move (for example) have 
| a much greater probability of correctly determining which player has 
| a definite, significant advantage than games played at 10 seconds per 
| move (for example).
Why do I get the suspicion that you are just making up this nonsense? Can
you show me even one example where you have shown that a certain material
advantage would be more than 3-sigma different for games at 100 min / move
than for games at 1 sec/move? Show us the games, then. Be aware that this
would require at least 100 games at each time control. That seems to make
it a safe guess that you did not do that for 100 min/move.
On the other hand, instead of just making things up, I have actually
done such tests, not with 100 games per TC, but with 432, and for the
faster ones even with 1728 games per TC. And there was no difference beyond the
expected and unavoidable statistical fluctuations corresponding to those
numbers of games, between playing 15 sec or 5 minutes. 
The advantage that a player has in terms of winning probability is the
same at any TC I ever tried, and can thus equally reliably be determined
with games of any duration. (Provided you have the same number of games.)
If you think it would be different for extremely long TC, show us
statistically sound proof.

I might comment on the rest of your long posting later, but have to go
now...

Derek Nalls wrote on Mon, May 12, 2008 03:06 PM EDT:
'You hardly have the possibility of trading it before there are open
files. So it stands to reason that you might as well use the higher value
during the entire game.'

Well, I understand and accept your reasons for leaving your lower rook 
value in CRC as is.  It is interesting that you thoroughly understand and
accept the reasons of others for using a higher rook value in CRC as
well.  Ultimately, is not the higher rook value in CRC more practical and useful to the game by your own logic?
_____________________________

'... if we both play a Q-2R match from the opening, it is a serious
problem if we don't get the same result. But you have played only 2
games. Statistically, 2 games mean NOTHING.'

I never falsely claimed or implied that only 2 games at 10 minutes per 
move mean everything or even mean a great deal (to satisfy probability
overwhelmingly).  However, they mean significantly more than nothing.  
I cannot accept your opinion, based upon a purely statistical viewpoint,
since it comes at the exclusion of another applicable mathematical viewpoint.
They definitely mean something ... although exactly how much is not 
easily known or quantified (measured) mathematically.
__________________________________________________

'I don't even look at results before I have at least 100 games, because
before they are about as likely to be the reverse from what they will 
eventually be, as not.'

Statistically, when dealing with speed chess games populated 
exclusively with virtually random moves ... YES, I can understand and 
agree with you requiring a minimum of 100 games.  However, what you 
are doing is at the opposite extreme from what I am doing via my 
playtesting method.

Surely you would agree that IF I conducted only 2 games with perfect 
play for both players that those results would mean EVERYTHING.  
Unfortunately, with state-of-the-art computer hardware and chess variant 
programs (such as SMIRF), this is currently impossible and will remain 
impossible for centuries-millennia.  Nonetheless, games played at 100 
minutes per move (for example) have a much greater probability of 
correctly determining which player has a definite, significant advantage 
than games played at 10 seconds per move (for example).

Even though these 'deep games' are of nowhere near 600 times better
quality than these 'shallow games', as one might naively expect
(due to a non-linear correlation), they are far from random events
(to which statistical methods would then be fully applicable).
Instead, they occupy a middleground between perfect play games and 
totally random games.  [In my studied opinion, the example 
'middleground games' are more similar to and closer to perfect play 
games than totally random games.]  To date, much is unknown to
combinatorial game theory about the nature of these 'middleground 
games'.

Remember the analogy to coin flips that I gave you?  Well, in fact, 
the playtest games I usually run go far above and beyond such random 
events in their probable significance per event.

If the SMIRF program running at 90 minutes per move cast all of its 
moves randomly and without any intelligence at all (as a perfect 
woodpusher), only then would my 'coin flip' analogy be fully applicable.
Therefore, when I estimate that it would require 6 games (for example) 
for me to determine, IF a player with a given set of piece values loses 
EVERY game, that there is only a 63/64 chance that the result is
meaningful (instead of random bad luck), I am being conservative to the
extreme.  The true figure is almost surely higher than a 63/64 chance.

By the way, if you doubt that SMIRF's level of play is intelligent and
non-random, then play a CRC variant of your choice against it at 90 
minutes per move.  After you lose repeatedly, you may not be able to 
credit yourself with being intelligent either (although you should) ... 
if you insist upon holding an impractically high standard to define the 
word.
______

'If you find a discrepancy, it is enormously more likely that the result
of your 2-game match is off from its true win probability.'

For a 2-game match ... I agree.  However, this may not be true for a 
4-game, 6-game or 8-game match and surely is not true to the extremes 
you imagine.  Everything is critically dependent upon the specifications 
of the match.  The number of games played (of course), the playing 
strength or quality of the program used, the speed of the computer and 
the time or ply depth per move are the most important factors.
_________________________________________________________

'Play 100 games, and the error in the observed score is reasonably
certain (68% of the cases) to be below 4.5% ~1/3 Pawn, so 16 cP per Rook. Only then you can see with reasonable confidence if your observations differ from mine.'

It would require est. 20 years for me to generate 100 games with the 
quality (and time controls) I am accustomed to and somewhat satisfied 
with.  Unfortunately, it is not that important to me just to get you to
pay attention to the results for the benefit of only your piece values
model.  As a practical concern to you, everyone else who is working to
refine quality piece values models in FRC and CRC will have likely
surpassed your achievements by then IF you refuse to learn anything from
the results of others who use methods of playtesting and mathematical
analysis different from yours, yet equally valid and meaningful.

H. G. Muller wrote on Mon, May 12, 2008 01:57 AM EDT:
To Derek:

I am aware that the empirical Rook value I get is suspiciously low. OTOH,
it is an OPENING value, and Rooks get their value in the game only late.
Furthermore, this only is the BASE VALUE of the Rook; most pieces have a
value that depends on the position on the board where it actually is, or
where you can quickly get it (in an opening situation, where the opponent
is not yet able to interdict your moves, because his pieces are in
inactive places as well). But Rooks only increase their value on open
files, and initially no open files are to be seen. In a practical game, by
the time you get to trade 2 Rooks for a Queen, there usually are open
files. So by that time, the value of the Q vs 2R trade will have gone up
by two times the open-file bonus. You hardly have the possibility of
trading it before there are open files. So it stands to reason that you
might as well use the higher value during the entire game.

In 8x8 Chess, the Larry Kaufman piece values include the rule that a Rook
should be devalued by 1/8 Pawn for each Pawn on the board over
five. In the case of 8 Pawns that is a really large penalty of 37.5 cP for
having no open files. If I add that to my opening value, the late
middle-game / end-game value of the Rook gets to 512, which sounds a lot
more reasonable.
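
That rule in a one-line sketch (the 475 cP opening value is the R=475
figure used earlier in this discussion; all other numbers as quoted):

    def kaufman_rook_penalty(pawns_on_board):
        # 1/8 Pawn (12.5 cP) per Pawn over five, per Kaufman's 8x8 rule.
        return 12.5 * max(0, pawns_on_board - 5)

    print(kaufman_rook_penalty(8))        # 37.5 cP
    print(475 + kaufman_rook_penalty(8))  # 512.5, roughly the 512 quoted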

There are two different issues here:
1) The winning chances of a Q vs 2R material imbalance game
2) How to interpret that result as a piece value

All I say above has no bearing on (1): if we both play a Q-2R match from
the opening, it is a serious problem if we don't get the same result. But
you have played only 2 games. Statistically, 2 games mean NOTHING. I don't
even look at results before I have at least 100 games, because before they
are about as likely to be the reverse from what they will eventually be,
as not. The standard deviation of the result of a single Gothic Chess game
is ~0.45 (it would be 0.5 point if there were no draws possible, and in
Gothic Chess the draw percentge is low). This error goes down as the
square root of the number of games. In the case of 2 games this is
45%/sqrt(2) = 32%. The Pawn-odds advantage is only 12%. So this standard
error corresponds to 2.66 Pawns. That is 1.33 Pawns per Rook. So with this
test you could not possibly see if my value is off by 25, 50 or 75. If you
find a discrepancy, it is enormously more likely that the result of your
2-game match is off from the true win probability.

Play 100 games, and the error in the observed score is reasonably certain
(68% of the cases) to be below 4.5% ~1/3 Pawn, so 16 cP per Rook. Only then
you can see with reasonable confidence if your observations differ from
mine.
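
The arithmetic of this paragraph in a minimal sketch (the 0.45 per-game
standard deviation and the 12% score excess per Pawn are the values
quoted above):

    import math

    def result_error(n_games, sigma_per_game=0.45):
        # Standard error of the mean score over n independent games.
        return sigma_per_game / math.sqrt(n_games)

    for n in (2, 100):
        err = result_error(n)    # fraction of the total score
        pawns = err / 0.12       # 12% score excess per Pawn of advantage
        print(n, 'games:', round(err * 100, 1), '% error =',
              round(pawns, 2), 'Pawns =', round(pawns / 2, 2), 'per Rook')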

Derek Nalls wrote on Sun, May 11, 2008 06:05 PM EDT:
Before Scharnagl sent me three special versions of SMIRF MS-174c compiled
with the CRC material values of Scharnagl, Muller & Nalls, I began
playtesting something else that interested me using SMIRF MS-174b-O.

I am concerned that the material value of the rook (especially compared to
the queen) amongst CRC pieces in the Muller model is too low:

rook  55.88
queen  111.76

This means that 2 rooks exactly equal 1 queen in material value.

According to the Scharnagl model:

rook  55.71
queen  91.20

This means that 2 rooks have a material value (111.42) 22.17% greater than
1 queen.

According to the Nalls model:

rook  59.43
queen  103.05

This means that 2 rooks have a material value (118.86) 15.34% greater than
1 queen.

Essentially the Scharnagl & Nalls models are in agreement in predicting
victories in a CRC game for the player missing 1 queen yet possessing 2
rooks.  By contrast, the Muller model predicts draws (or appr. equal
number of victories and defeats) in a CRC game for either player.
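
The comparison in a small script (material values as listed above):

    models = {
        'Muller':    {'rook': 55.88, 'queen': 111.76},
        'Scharnagl': {'rook': 55.71, 'queen': 91.20},
        'Nalls':     {'rook': 59.43, 'queen': 103.05},
    }
    for name, v in models.items():
        surplus = 100 * (2 * v['rook'] / v['queen'] - 1)
        # prints 0.0 (Muller), 22.17 (Scharnagl), 15.34 (Nalls)
        print(name, round(surplus, 2), '% for 2 rooks over 1 queen')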

I put this extraordinary claim to the test by playing 2 games at 10
minutes per move on an appropriately altered Embassy Chess setup with the
missing-1-queen player and the missing-2-rooks player each having a turn
at white and black.

The missing-2-rooks player lost both games and was always behind.  They
were not even long games at 40-60 moves.

Muller:

I think you need to moderately raise the material value of your rook in
CRC.  It is out of its proper relation with the other material values
within the set.

H. G. Muller wrote on Sun, May 4, 2008 04:57 AM EDT:
Derek Nalls:
| The additional time I normally give to playtesting games to improve
| the move quality is partially wasted because I can only control the
| time per move instead of the number of plies completed using most
| chess variant programs.

Well, on Fairy-Max you won't have that problem, as it always finishes an
iteration once it decides to start it. But although Fairy-Max might be
stronger than most other variant-playing AIs you use, it is not stronger
than SMIRF, so using it for 10x8 CVs would still be a waste of time.

Joker80 tries to minimize the time wastage you point out by attempting
only to start iterations when it has time to finish them. It cannot always
accurately guess the required time, though, so unlike Fairy-Max it has
built in some emergency brakes. If they are triggered, you would have an
incomplete iteration. Basically, the mechanism works by no longer starting
the search of new moves in the root if there already is a move with a
score similar to that of the previous iteration, once it gets into
'overtime'. In practice, these unexpectedly long iterations mainly occur
when the previously best move runs into trouble that so far was just
beyond the horizon. As the tree for that move will then look completely
different from before, it takes a long time to search (no useful
information in the hash), and the score will have a huge drop. It then
continues searching new moves even in overtime in a desperate attempt to
find one that avoids the disaster. Usually this is time well spent: even
if there is no guarantee it finds the best move of the new iteration, if
it aborts it early, it at least has found a move that was significantly
better than that found in the previous iteration.

Of course both Joker80 and Fairy-Max support the WinBoard 'sd' command,
allowing you to limit the depth to a certain number of plies, although I
never use that. I don't like to fix the ply depth, as it makes the engine
play like an idiot in the end-game.

| Can you explain to me in a way I can understand how and why
| you are able to successfully obtain valuable results using this
| method?

Well, to start with, Joker80 at 1 sec per move still reaches a depth of
8-9 ply in the middle-game, and would probably still beat most Humans at
that level. My experience is that, if I immediately see an obvious error,
it is usually because the engine makes a strategic mistake, not a tactical
one. And such strategic mistakes are awfully persistent, as they are a
result of faulty evaluation, not search. If it makes them at 8 ply, it is
very likely to make that same error at 20 ply, as even 20 ply is usually
not enough to get the resolution of the strategical feature within the
horizon.

That being said, I really think that an important reason I can afford fast
games is a statistical one: by playing so many games I can be reasonably
sure that I get a representative number of gross errors in my sample, and
they more or less cancel each other out on the average. Suppose at a
certain level of play 2% of the games contain a gross error that turns a
totally won position into a loss. If I play 10 games, there is a 20%
chance that one game contains such an error (affecting my result by 10%),
and only ~2% probability of two such errors (which then in half the cases
would cancel, but in the other cases would put the result off by 20%). If,
OTOH, I would play 1000 faster games, with an increased 'blunder rate' of
5% because of the lower quality, I would expect 50 blunders.
But the probability that they were all made by the same side would be
negligible. In most cases the imbalance would be around sqrt(50) ~ 7. That
would impact the 1000-game result by only 0.7%. So virtually all results
would be off, but only by about 0.7%, so I don't care too much.

Another way of visualizing this would be to imagine the game state-space
as a 2-dimensional plane, with two evaluation terms determining the x- and
y-coordinate. Suppose these terms can both run from -5 to +5 (so the state
space is a square), and the game is won if we end in the unit circle
(x^2 + y^2 < 1), but that we don't know that. Now suppose we want to know
how large the probability of winning is if we start within the square with
corners (0,0) and (1,1) (say this is the possible range of the evaluation
terms when we possess a certain combination of pieces). This should be the
area of a quarter circle, PI/4, divided by the area of the square (1), so
PI/4 = 79%. We try to determine this empirically by randomly picking
points in the square (by setting up the piece combination in some shuffled
configuration), and letting the engines play the game.

The engines know that getting closer to or farther away from (0,0) is
associated with changing the game result, and are programmed to maximize
or minimize this distance to the origin. If they both play perfectly, they
should by definition succeed in doing this. They don't care about the
'polar angle' of the game state, so the point representing the game state
will make a random walk on a circle around the origin. When the game ends,
it will still be in the same region (inside or outside the unit circle),
and games starting in the won region will all be won.

Now with imperfect play, the engines will not conserve the distance to the
origin, but their tug of war will sometimes change it in favor of one or
the other (i.e. towards the origin, or away from it). If the engines are
still equally strong, by definition on the average this distance will not
change. But its probability distribution will now spread out over a ring
with finite width during the game. This might lead to won positions close
to the boundary (the unit circle) now ending up outside it, in the lost
region. But if the ring of final game states is narrow (width << 1), there
will be a comparable number of initial game states that diffuse from
within the unit circle to the outside, as in the other direction. In other
words, the game score as a function of the initial evaluation terms is no
longer an absolute all or nothing, but the circle is radially smeared out
a little, making a smooth transition from 100% to 0% in a narrow band
centered on the original circle. This will hardly affect the averaging,
and in particular, making the ring wider by decreasing playing accuracy
will initially hardly have any effect.

Only when play gets so wildly inaccurate that the final positions (where
win/loss is determined) diverge so far from the initial point that they
could cross the entire circle, will you start to see effects on the score.
In the extreme case where the radial diffusion is so fast that you could
end up anywhere in the 10x10 square when the game finishes, the result
score will only be PI/100 = 3%. So it all depends on how much the
imperfections in the play spread out the initial positions in the
game-state space. If this is only small compared to the measures of the
won and lost areas, the result will be almost independent of it.
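
This picture can be checked with a minimal Monte Carlo sketch (the
Gaussian radial noise standing in for imperfect play is my own
simplification of the diffusion described above):

    import math, random

    def expected_score(sigma, trials=100000):
        # Start uniformly in the square (0,0)-(1,1); a game is 'won' if
        # the final point lies inside the unit circle. sigma models the
        # radial diffusion that imperfect play adds during the game.
        wins = 0
        for _ in range(trials):
            r = math.hypot(random.random(), random.random())
            r += random.gauss(0.0, sigma)
            wins += r < 1.0
        return wins / trials

    print(expected_score(0.0))  # ~0.785 = PI/4
    print(expected_score(0.1))  # still close to PI/4, as argued above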

Reinhard Scharnagl wrote on Sun, May 4, 2008 03:09 AM EDT:
Harm, I am thinking of a simpler formula, because it seems easier to
find an approximation than to weight a lot of parameters against a lot
of other unhandled strange effects. Therefore my lower-dimensional
approach looks like: f(s := sum of unbalanced big pieces' values, n :=
number of unbalanced big pieces, v := value of the biggest opponent piece).

So I intend to calculate the presumed value reduction e.g. as:

(s - v*n)/constant

P.S.: Maybe it will make sense to limit v from below by s/(2*n) to prevent too big a reduction, e.g. when no big opponent piece is present at all.

P.P.S.: I have had some more thoughts on this question. Let w := sum of the n biggest opponent pieces, limited from below by s/2. Then the formula should be:

(s - w)/constant

P.P.P.S.: My experiments suggest that the constant is about 2.0.

P^4.S.: I have implemented this 'Elephantiasis-Reduction' (as I will name it) in a new private SMIRF version and it is working well. My constant is currently 8/5. I found out that it is good to include one more piece in the calculation than are actually without value compensation, because that bottom piece pair could be of switched size and would thus reduce the reduction. Non-existing opponent pieces are replaced by a Knight's piece value within the calculation. I noticed a speed-up of SMIRF when searching for mating combinations (in normal play). I also noticed that SMIRF now makes sacrifices where the vanishing of such penalties compensates the sacrificed material.
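
A sketch of the P.P.S. formula as stated (the P^4.S. refinements, such as
the extra bottom piece pair and the Knight substitution, are left out;
function and variable names are mine):

    def elephantiasis_reduction(unbalanced_big_values, opponent_values,
                                constant=2.0):
        # s: sum of the unbalanced big pieces' values; w: sum of the n
        # biggest opponent pieces, limited from below by s/2, so the
        # reduction can never exceed s/(2*constant).
        s = sum(unbalanced_big_values)
        n = len(unbalanced_big_values)
        w = max(sum(sorted(opponent_values, reverse=True)[:n]), s / 2)
        return (s - w) / constant

    # Three unopposed Chancellors against nothing bigger than a Knight;
    # the piece values (9.0 and 3.0) are only illustrative.
    print(elephantiasis_reduction([9.0, 9.0, 9.0], [3.0, 3.0, 3.0]))  # 6.75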

Derek Nalls wrote on Sun, May 4, 2008 02:38 AM EDT:
'I never found any effect of the time control on the scores I measure for
some material imbalance. Within statistical error, the combinations I
tried produced the same score at 40/15', 40/20', 40/30', 40/40',
40/1', 40/2', 40/5'. Going to even longer TC is very expensive, and I
did not consider it worth doing just to prove that it was a waste of
time...'
_________

The additional time I normally give to playtesting games to improve the
move quality is partially wasted because I can only control the time per
move instead of the number of plies completed using most chess variant
programs.  This usually results in the time expiring while it is working
on an incomplete ply.  Then, it prematurely spits out a move
representative of an incomplete tour of the moves available within that
ply at a random fraction of that ply.  Since there is always more than one
move (often, a few-several) under evaluation as being the best possible
move [Otherwise, the chosen move would have already been executed.], this
means that any move on this 'list of top candidates' is equally likely
to be randomly executed.

Here are two typical scenarios that should cover what usually happens:

A.  If the list of top candidates in an 11-ply search consists of 6 moves
while the list of top candidates in a 10-ply search consists of 7 moves,
then only 1 discovered-to-be-less-than-the-best move has been successfully
excluded and cannot be executed.  

Of course, an 11-ply search completion may typically require est. 8-10
times as much time as the search completions for all previous plies (1-ply
thru 10-ply) up until then added together.

OR

B.  If the list of top candidates in an 11-ply search consists of 7 moves
[Moreover, the exact same 7 moves.] just as the preceding 10-ply search, 
then there is no benefit at all in expending 8-10 times as much time.
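
A toy sketch of the mechanism behind these scenarios (the cost model and
all names are invented; real engines are far more sophisticated):

    import random, time

    def toy_search(move, depth):
        # Stand-in for a real search: cost grows steeply with depth.
        time.sleep(2 ** depth * 1e-6)
        return random.random()

    def think(root_moves, budget_seconds, max_depth=32):
        # Iterative deepening under a wall-clock budget. If time expires
        # mid-iteration, the move returned comes from a partial tour of
        # the root moves at the deepest ply, as described above.
        deadline = time.monotonic() + budget_seconds
        best_move, completed_depth = root_moves[0], 0
        for depth in range(1, max_depth + 1):
            best_here, best_score = None, float('-inf')
            for move in root_moves:
                if time.monotonic() >= deadline:
                    return best_move, completed_depth, 'partial last ply'
                score = toy_search(move, depth)
                if score > best_score:
                    best_here, best_score = move, score
            best_move, completed_depth = best_here, depth
        return best_move, completed_depth, 'complete'

    print(think(list(range(30)), budget_seconds=0.05))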
______________________________________________________________

The reason I endure this brutal waiting game is not for the purely
masochistic experience but because the additional time has a tangible
chance (although no guarantee) of yielding a better move on every
occasion.  Throughout
the numerous moves within a typical game, it can be realistically expected
to yield better moves on dozens of occasions.

We usually playtest for purposes at opposite extremes of the spectrum 
yet I regard our efforts as complementary toward building a complete 
picture involving material values of pieces.

You use 'asymmetrical playtesting' with unequal armies on fast time 
controls, collect and analyze statistics ... to determine a range, with a
margin of error, for individual material piece values.

I remain amazed (although I believe you) that you actually obtain any 
meaningful results at all via games that are played so quickly that the AI
players do not have 'enough time to think' while playing games so complex
that every computer (and person) needs time to think to play with minimal
competence.  Can you explain to me in a way I can understand how and why
you are able to successfully obtain valuable results using this method? 
The quality of your results was utterly surprising to me.  I apologize for
totally doubting you when you introduced your results and mentioned how you
obtained them.

I use 'symmetrical playtesting' with identical armies on very slow time
controls to obtain the best moves realistically possible from an
evaluation function thereby giving me a winner (that is by some margin
more likely than not deserving) ... to determine which of two sets of
material piece values is probably (yet not certainly) better. 
Nonetheless, as more games are likewise played ...  If they present a
clear pattern, then the results become more probable to be reliable, 
decisive and indicative of the true state of affairs.

The chances of flipping a coin once and it landing 'heads' are equal to
it landing 'tails'.  However, the chances of flipping a coin 7 times and
it landing 'heads' all 7 times in a row are 1/128.  Now, replace the
concepts 'heads' and 'tails' with 'victory' and 'defeat'.  I
presume you follow my point.

The results of only a modest number of well-played games can definitely
establish their significance beyond chance and to the satisfaction of 
reasonable probability for a rational human mind.  [Most of us, including
me, do not need any better than a 95%-99% success to become convinced that
there is a real correlation at work even though such is far short of an
absolute 100% mathematical proof.]

In my experience, I have found that using any less than 10 minutes per
move will cause at least one instance within a game when an AI player
makes a move that is obvious to me (and correctly assessed as truly being)
a poor move.  Whenever this occurs, it renders my playtesting results 
tainted and useless for my purposes.  Sometimes this occurs during a 
game played at 30 minutes per move.  However, this rarely occurs during 
a game played at 90 minutes per move.

For my purposes, it is critically important above all other considerations
that the winner of these time-consuming games be correctly determined 
'most of the time' since 'all of the time' is impossible to assure.
I must do everything within my power to get as far from 50% toward 100%
reliability in correctly determining the winner.  Hence, I am compelled to
play test games at nearly the longest survivable time per move to minimize
the chances that any move played during a game will be an obviously poor 
move that could have changed the destiny of the game thereby causing 
the player that should have won to become the loser, instead.  In fact, 
I feel as if I have no choice under the circumstances.

H. G. Muller wrote on Sat, May 3, 2008 05:31 PM EDT:
Reinhard, if I understand you correctly, what you basically want to
introduce in the evaluation is terms of the type w_ij*N_i*N_j, where N_i
is the number of pieces of type i of one side, and N_j is the number of
pieces of type j of the opponent, and w_ij is a tunable weight.

So that, if type i = A and type j = N, a negative w_ij would describe a
reduction of the value of each Archbishop by the presence of the enemy
Knights, through the interdiction effect. Such a term would for instance
provide an incentive to trade A in a QA vs ABNN for the QA side, as his A
is suppressed in value by the presence of the enemy N (and B), while the
opponent's A would not be similarly suppressed by our Q. On the contrary,
our Q value would be suppressed by the opponent's A as well, so
trading A also benefits him there.

I guess it should be easy enough to measure if terms of this form have
significant values, by playing Q-BNN imbalances in the presence of 0, 1
and 2 Archbishops, and deducing from the score whose Archbishops are worth
more (i.e. add more winning probability). And similarly for 0, 1, 2
Chancellors each, or extra Queens. And then the same thing with a Q-RR
imbalance, to measure the effect of Rooks on the value of A, C or Q.

In fact, every second-order term can be measured this way. Not only for
cross products between own and enemy pieces, but also cooperative effects
between own pieces of equal or different type. With 7 piece types for each
side (14 in total) there would be 14*13/2 = 91 terms of this type possible.
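
In evaluation-function terms, such a cross term could look like this
minimal sketch (the piece letters and the weight value are hypothetical
placeholders, not measured quantities):

    def second_order_terms(own_counts, enemy_counts, weights):
        # Sum of w_ij * N_i * N_j over (own type i, enemy type j) pairs.
        total = 0.0
        for (i, j), w_ij in weights.items():
            total += w_ij * own_counts.get(i, 0) * enemy_counts.get(j, 0)
        return total

    # Hypothetical: each enemy Knight suppresses our Archbishop a little.
    print(second_order_terms({'A': 1}, {'N': 2}, {('A', 'N'): -0.05}))  # -0.1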

H. G. Muller wrote on Sat, May 3, 2008 04:18 PM EDT:
| And by that it would create just the problem I have tried to
| demonstrate. The three Chancellors could not possibly be covered,
| thus disabling their potential to risk their own existence by
| entering squares already influenced by the opponent's side.

You make it sound like it is a disadvantage to have a stronger piece,
because it cannot go on squares attacked by the weaker piece. To a certain
extent this is true, if the difference in capabilities is not very large.
Then you might be better off ignoring the difference in some cases, as
respecting the difference would actually deteriorate the value of the
stronger piece to the point where it was weaker than the weak piece. (For
this reason I set the B and N value in my 1980 Chess program Usurpator to
exactly the same value.) But if the difference between the pieces is
large, then the fact that the stronger one can be interdicted by the
weaker one is simply an integral part of its piece value.

And IMO this is not the reason the 4A-9N example is so biased. The problem
there is that the pieces of one side are all worth more than TWICE that of
the other. Rooks against Knights would not have the same problem, as they
could still engage in R vs 2N trades, capturing a singly defended Knight,
in a normal exchange on a single square. But 3 vs 1 trades are almost
impossible to enforce, and require very special tactics.

It is easy enough to verify by playtesting that playing CCC vs AAA (as
substitutes for the normal super-pieces) will simply produce 3 times the
score excess of playing a normal setup with on one side a C deleted, and
at the other an A. The A side will still have only a single A to harass
every C. Most squares on enemy territory will be covered by R, B, N or P
anyway, in addition to A, so the C could not go there anyway. And it is
not true that anything defended by A would be immune to capture by C, as
A+anything > C (and even 2A+anything > 2C). So defending by A will not
exempt the opponent from defending as many times as there is attack, by
using A as defenders. And if there was one other piece amongst the
defenders, the C had no chance anyway. 

The effect you point out does not occur nearly as easily as you think.
And, as you can see, only 5 of my different armies had duplicated
superpieces. All the other armies were just what you would get if you
traded the mentioned pieces, thus detecting whether such a trade would
enhance or deteriorate your winning chances.

Reinhard Scharnagl wrote on Sat, May 3, 2008 04:06 PM EDT:
H.G.M. wrote: ... Both imbalances are large enough to cause 80-90% win percentages, so that just a few games should make it obvious which value is very wrong.

Hard to see. You will wait for White to lose because of insufficient material, and I will await a loss by White because of the lonely-big-pieces disadvantage. It will then be the task to find out the true reasons for that.

I will try to create two arrays where each side thinks it has the advantage.

Reinhard Scharnagl wrote on Sat, May 3, 2008 02:43 PM EDT:
H.G.M. wrote: ... After an unequal trade, any Chess game becomes a game between different armies. ...

And thus I am convinced that I have to include this aspect in the detail evaluation function of SMIRF's successor.

... This can still be done in a reasonably realistic mix of pieces, e.g. replacing Q and C on one side by A, and on the other side Q and A by C, so that you play 3C vs 3A, and then give additional Knight odds to the Chancellors. ...

And by that it would create just the problem I have tried to demonstrate. The three Chancellors could not possibly be covered, thus disabling their potential to risk their own existence by entering squares already influenced by the opponent's side.

H. G. Muller wrote on Sat, May 3, 2008 02:42 PM EDT:
Derek Nalls:
| Given enough years (working with only one server), this quantity of 
| well-played games may eventually become adequate.

I never found any effect of the time control on the scores I measure for
some material imbalance. Within statistical error, the combinations I
tried produced the same score at 40/15', 40/20', 40/30', 40/40',
40/1', 40/2', 40/5'. Going to even longer TC is very expensive, and I
did not consider it worth doing just to prove that it was a waste of
time...

The way I see it, piece-values are a quantitative measure for the amount
of control that a piece contributes to steering the game tree in the
direction of the desired evaluation. He who has more control, can
systematically force the PV in the direction of better and better
evaluation (for him). This is a strictly local property of the tree. The
only advantage of deeper searches is that you average out this control
(which highly fluctuates on a ply-by-ply basis) over more plies. But in
playing the game, you average over all plies anyway.
