Comments/Ratings for a Single Item
H.G.M.: '... This move blundered away the Queen with which Smirf was supposed to mate, after which Fairy-Max had no trouble winning with an Archbishop against some five Pawns. ...' Well, there still is a mating bug within SMIRF. Though I have improved SMIRF's behavior near mating situations (please request a copy of this unpublished engine if needed for key owners' testing), it still seems to be there. There might be a minimal chance that it sometimes could be caused by a communication problem using the adapter. But I am still convinced that it is caused by an internal bad evaluation-storing design, which hopefully will be corrected in Octopus sometime ...
OK, I see the problem now. I forgot that the Embassy array is a mirrored one, with the King starting on e1, rather than f1. And that to avoid any problems with it in Battle of the Goths, I did not really play Embassy, but the fully equivalent mirrored Embassy. And with that one, none of the engines had problems, of course. Actually it seems that it is not TJchess that is in error here: e1b1 does seem to be a legal castling in Embassy. It is WinBoard_F which unjustly rejects the move. Most likely because of the FEN reader ignoring specified castling rights for which it does not find a King on f1 and a Rook in the indicated corner. The fact that you don't have this problem with Joker80 is because Joker80 is buggy. (Well, not really; it is merely outside its specs. Joker80 considers all castlings with a non-corner Rook and King not in the f-file as CRC castlings, which are only allowed in variant caparandom, but not in variants capablanca or *UNSPEAKABLE*. And Joker80 does not support caparandom yet.) So the fact that you don't see any problems with Joker80 is because it will never castle when you feed it the Embassy setup, so that WinBoard doesn't get the chance to reject the castling as illegal. And if the opponent castles, WinBoard would reject it as illegal, and not pass it on to Joker80. I guess the fundamental fix will have to wait until I implement variant caparandom in WinBoard; I think that both WinBoard and Joker80 are correct in identifying the Embassy opening position as not belonging to Capablanca Chess, but as needing the CRC extension of castling. (Even if it is only a limited extension, as the Rooks are still in the corner.) And after I fix it in WinBoard, I would still have to equip Joker80 with CRC capability before you could use it to play the Embassy setup. It is not very high on my list of priorities, though, as I see little reason to play Embassy rather than mirrored Embassy.
Hecker: It was fairly easy for me to replicate the bug I experienced. In fact, I have never successfully played a computer vs. computer game to completion using TJChess10x8 in my life. So, you should be able to replicate the bug I experienced using the information I have provided. I hope you can fix it as well. Bug Report TJChess10x8 http://www.symmetryperfect.com/report
| However, TJChess cannot handle my favorite CRC opening setup, | Embassy Chess, without issuing false 'illegal move' warnings and | stopping the game. Remarkable. I played this opening setup too, in Battle of the Goths, and never noticed any problems with TJchess. It might have been another version, though. If you have somehow saved the game, be sure to send it to Tony, so he can fix the problem.
'Of course you could also use Joker80 or TJchess10x8, which do not suffer from such problems.' ____________________ While you were on vacation, I started a series of 'minimized asymmetrical playtests' using SMIRF. So, I will complete them using SMIRF. Joker80, running under Winboard F, has never acted buggy in computer vs. computer games. However, TJChess cannot handle my favorite CRC opening setup, Embassy Chess, without issuing false 'illegal move' warnings and stopping the game.
Well, never mind. The symmetrical playtesting would not have given any conclusive results with anything less than 2000 games anyway. The asymmetrical playtesting sounds more interesting. I am not completely sure what Smirf bug you are talking about, but in the Battle of the Goths Championship it happened that Smirf played a totally random move when it could give mate in 3 (IIRC) according to both programs (Fairy-Max was the lucky opponent). This move blundered away the Queen with which Smirf was supposed to mate, after which Fairy-Max had no trouble winning with an Archbishop against some five Pawns. This seems to happen when Smirf has seen the mate, and stored the tree leading to it completely in its hash table. It is then no longer searching, and it reports score and depth zero, playing the stored moves (at least, that was the intention). I have never seen any such behavior when Smirf was reporting non-zero search depth, and in particular, the last non-zero-depth score before such an occurrence (a mate score) seemed to be correct. So I don't think there is much chance of an error when you believe the mate announcement and call the game. Of course you could also use Joker80 or TJchess10x8, which do not suffer from such problems.
Muller: Thank you for the helpful response. Frankly, I considered my own question so obvious as to be borderline-stupid but I just wanted to be certain. The following entries within the 'winboard.ini' file should enable me to playtest (limited) randomized and non-randomized versions of Joker80 against one another. Does it look alright? If/When I run out of more pressing playtesting missions, I may undertake this one after all. /firstChessProgramNames={'Joker80 22' /firstInitString='new\n' 'Joker80 22' } /secondChessProgramNames={'Joker80 22' /secondInitString='new\n' 'Joker80 22' } Unfortunately, I no longer plan to playtest sets of CRC piece values by Muller, Scharnagl and Nalls against one another. I think having the pawn set to 85 and the queen set to 950 (as required by Joker80) for all three sets of material values would have the unintentional side effect of equalizing their scales (which are normally different). This means that the Muller set would, in fact, be tested against something other than a true, accurate representation of the Scharnagl and Nalls sets. I am currently in the midst of conducting several 'minimized asymmetrical playtests' using SMIRF at moderate time controls. I want to tentatively determine who is correct in disagreements between our models involving 2:1 or 1:2 exchanges (with supreme pieces). I have to avoid its checkmate bug, though. This requires me to take back one move whenever the program declares checkmate and 'call the game' if a sizeable material and/or positional advantage indisputably exists for one player. Fortunately, this is almost always the case. I will give a report in a few-several weeks.
George Duke: | Has initial array positioning already entered discussion for | value determinations? No, it hasn't, and I don't think it should, as this discussion is about Piece Values, and not about positional play. Piece values are by definition averages over all positions, and thus independent of the placement of pieces on the board. Note furthermore that the heuristic of evaluation is only useful for strategic characteristics of a position, i.e. characteristics that tend to be persistent, rather than volatile. Piece placement can be such a trait, but not always. In particular, in the opening phase, pieces are not locked in the places they start, but can find plenty of better places to migrate to, as the center of the board is still complete no-man's land. Therefore, in the opening phase, the concept of 'tempo' becomes important: if you waste too much time, the opponent gets the chance to conquer space, and prevent your pieces that were badly positioned in the array from properly developing. I did some asymmetric playtesting for positional values in normal Chess, swapping Knights and Bishops for one side, or Knights and Rooks. I was not able to detect any systematic advantage the engines might have been deriving from this. In my piece value testing I eliminate positional influences by playing from positions that are as symmetric as possible given the material imbalance. And the effect of starting the pieces involved in the imbalance in different places is averaged out by playing from shuffled arrays, so that each piece is tried in many different locations.
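The shuffle-averaging idea can be made concrete with a small sketch (purely illustrative: the function name and piece string are invented, this is not the actual test harness, and real shuffle setups may add constraints such as Bishops on opposite colors, omitted here):

```python
import random

def shuffled_symmetric_array(back_rank_pieces, rng):
    """Shuffle one side's back rank and give the other side the identical
    arrangement, so the position stays symmetric while piece placement
    varies from game to game and positional effects average out."""
    rank = list(back_rank_pieces)
    rng.shuffle(rank)
    white = "".join(rank)
    return white, white  # Black mirrors White exactly

rng = random.Random(0)
white_rank, black_rank = shuffled_symmetric_array("RNBQKBNR", rng)
```

Over many games, each piece type is tried on many different squares, so no single array placement biases the measured piece values.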
Derek: | Could you please give me example lines within the 'winboard.ini' | file that would successfully do so? I need to make sure every | character is correct. Sorry for the late response; I was on holiday for the past two weeks. The best way to do it is probably to make the option dependent on the engine selection. That means you have to write it behind the engine name in the list of pre-selectable engines, like: /firstChessProgramNames={... 'C:/engines/joker/joker80.exe 23' /firstInitString='new\n' ... } And something similar for the second engine, using /secondInitString. The path name of the joker80 executable would of course have to be where you installed it on your computer; the argument '23' sets the hash-table size. You could add other arguments, e.g. for setting the piece values, there as well. Note that the executable name and all engine arguments are enclosed by the first set of quotes (which are double quotes, but these for some reason refuse to print in this forum), and everything after this first syntactical unit on the line is interpreted as WinBoard arguments that should be used with this engine when it gets selected. Note that string arguments are C-style strings, enclosed in double quotes, and making use of escape sequences like '\n' for newline. The default value for the init strings is 'new\nrandom\n'.
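Assembled into a complete pair of entries (with plain double quotes as they would actually appear in the file), a winboard.ini fragment along these lines might read as follows; the installation path and the hash-size argument '23' are only examples, to be adjusted to your setup:

```
/firstChessProgramNames={
"C:/engines/joker/joker80.exe 23" /firstInitString="new\n"
}
/secondChessProgramNames={
"C:/engines/joker/joker80.exe 23" /secondInitString="new\nrandom\n"
}
```

With these settings the first engine would play without move-selection randomization (no 'random' command is sent), while the second keeps the default behavior via the default-style init string.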
I bet if you offered a $20,000 reward we'd see many programs coming to meet the poetic challenge within a matter of months. You can read about computer generated writing here:
http://www.evolutionzone.com/kulturezone/c-g.writing/index_body.html
Anyway, I believe that computers are up to such a poetic task... it just takes a motivated programmer.
Back to CVs: Chess is a great game. And just because computers can play it far better than most, are we to discard it? I don't think so; not as long as humans play humans and enjoy the game while doing so. The same goes for other variants.
As for the poetry, just because computers don't write that style certainly doesn't motivate me to do so.
With Centaur and Champion(RN) the array must affect values on 8x10 especially. Detraction of 0.1 or more for both cornered, one would expect. In Falcon Chess, of the 453,600 initial arrays, cornered positions for Falcon lower value relatively. Cheops' 'FRNBQ...' or Pyramids' 'FBRNQ...' each take away 0.1 or 0.2 of more general 6.5. Templar's 'RBFNQ...' and Osiris' 'RNBQFF...' are harder to distinguish from standard 'RNBFQ...' and 'RNFBQ...' Has initial array positioning already entered discussion for value determinations?
Let's open the discussion to designers not actively programming now. There is a lot of two-tracking in CVPage. One example of double track is that most designers see their work as art and become prolificists (Betza, Gilman), the more ''paintings'' in their portfolio the better; whereas a few others want to replace standard FIDE form logically (Duke, Trice, FischerRandom, Duniho's Eurasian, Joyce's Great Shatranj). The two camps talk at cross-purposes. Two other (different) opposite tracks may be seen in this thread, namely, between player and programmer. Straightforward heuristic for player (usually designer too hereabouts), to make ongoing alterable piece-value estimates, certainly refining if possible to within 0.1 of a Pawn, there being so many hundreds of CVs to compute, of course will not do in itself for programmer. It is interesting, that's all, that the player's recipe is rejected immediately by the programmer. Player would gravitate to '1)' and '4)' rather than programmer-popular '2)' and '3)'. Another topic to relate here is proven fallacy after 400 years of emphasized Centaur(BN) and Champion(RN) anyway, discussed much in 2007, to be resurrected in follow-up. // In response to Gifford's: Computers will never write rhymed lines this century where every syllable matches in rhyme like: ''The avatar Horus' all-seeing Eye/ We have a star-chorus rallying cry.'' Granted most would not like the style of writing, but still Computer cannot do it, rhyme every word with meaning. Similarly, we need games Computer cannot play well, or be expected to ever, using hidden information like Kriegspiel if that's what it takes, or Rules changing within score, or something else. Surely the main reason for vanishing interest in Mad Queen is Computer dominance in all aspects.
Upon reflection, I have no conceivable reason to be distrustful of using Joker80 IF I shut-off its limited randomization of move selection which Winboard F activates by default. Could you please give me example lines within the 'winboard.ini' file that would successfully do so? I need to make sure every character is correct.
George Duke: | However, the reality is if one is playing many CVs, precisely | Number One, not any of the other 3, is far and away the most valuable | and reliable tool, effectively building on experience. Time is also a | factor, and unless Player can adjust quickly, without extensive | playtesting, and make ballpark estimates of values, all is lost on | new enterprise. We recommend just this Method One, increasing | facility at it, for serious CV play, and in turn the designer | needs to try to keep the game somewhat out of reach for Computer. Well, I guess that it depends on what your standards are. If you are satisfied with values that are sometimes off by 2 full Pawns (as the case of the Archbishop demonstrates to be possible), I guess method #1 will do fine for you. But, as 2 Pawns is almost a totally winning advantage, my standards are a bit higher than that. If I build an engine for a CV, I don't want it to strive for trades that are immediately losing.
From what I have seen in regard to both variants and programmers, it seems logical to conclude that any game a human mind can play, a program can be written for. The program may be flawed, but the bugs can be worked out.
In my opinion, designers need not worry about computers. If you make a great game, likely someone will get a computer to play it. That is not to say all great games end up having associated programs... but they could.
''Educated guessing based on known 8x8 piece values and assumptions on synergy values of compound pieces'' -- [immediately] Muller rejects it out of hand from his list of four 3.May.2008. ''We can safely dismiss method (1) as unreliable...'' Then he touts the more scientific, roughly: 2) board-averaged piece mobilities 3) best fit from computer-computer games deliberately imbalanced 4) Playtesting. However, the reality is if one is playing many CVs, precisely Number One, not any of the other 3, is far and away the most valuable and reliable tool, effectively building on experience. Time is also a factor, and unless Player can adjust quickly, without extensive playtesting, and make ballpark estimates of values, all is lost on new enterprise. We recommend just this Method One, increasing facility at it, for serious CV play, and in turn the designer needs to try to keep the game somewhat out of reach for Computer.
No engine I know of prunes in the root, in any iteration. They might reduce the depth of poor moves compared to that of the best move by one ply, but they will be searched in every iteration except a very early one (where they were classified as poor) to a larger depth than they were ever searched before. So at any time their score can recover, and if it does, they are re-searched within the same iteration at the un-reduced depth. This is absolutely standard, and also how Joker80 works. Selective search, in engines that do it, is applied only very deep in the tree. Never close to the root.
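The root behavior described here can be sketched as follows (a toy model with made-up scores, not Joker80's actual code; `search` stands in for a real alpha-beta search):

```python
import random

# Invented example scores for four root moves; one is clearly poor.
BASE_SCORES = {"e2e4": 40, "d2d4": 38, "g1f3": 35, "a2a3": -20}

def search(move, depth):
    # Stand-in for a real alpha-beta search: a score that fluctuates
    # a little from depth to depth but never wildly.
    random.seed(hash((move, depth)))
    return BASE_SCORES[move] + random.randint(-5, 5)

def root_search(moves, max_depth):
    """Iterative deepening at the root with no pruning: every move is
    searched in every iteration. Non-best moves get a one-ply depth
    reduction; if a reduced search recovers, the move is immediately
    re-searched at the full, un-reduced depth."""
    best_move = moves[0]
    for depth in range(1, max_depth + 1):
        best_score = search(best_move, depth)
        for move in moves:
            if move == best_move:
                continue
            score = search(move, depth - 1)      # reduced-depth search
            if score > best_score:               # score recovered:
                score = search(move, depth)      # full-depth re-search
                if score > best_score:
                    best_score, best_move = score, move
    return best_move
```

Even the poor move a2a3 is searched in every iteration; its score simply never recovers, so it is never re-searched at full depth and never chosen.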
I have read that most computer chess programmers use the brute force method initially, when the plies can be cut thru quickly, and then switch to advanced pruning techniques to focus the search from then on. This led to my misinterpretation that Joker80 would have more moves under consideration as the best at short time controls than at long time controls. Some moves that score highly-positive after only a few-several plies will score lowly-positive, neutral or negative after more plies. Thus, I do not see how the number of moves under consideration as the best could avoid being reduced slightly as plies are completed. As a practical concern, there is rarely any benefit in accepting the CPU load associated with, for example, checking a low-score positive move returned after 13-ply completion thru 14-ply completion (for example) when other high-score positive moves exist in sufficient number.
Derek: | The moral of the story is that randomization of move selection | reduces the growth in playing strength that normally occurs with | time and plies completed. This is not how it works. For one, you assume that at long TC there would be fewer moves to choose from, and they would be farther apart in score. This is not the case. The average distribution of move scores in the root depends on the position, not on search depth. And in cases where the scores of the best and second-best move are far apart, the random component of the score propagating from the end-leaves to the root is limited to some maximum value, and thus could never cause the second-best move to be preferred over the best move. The mechanism can only have any effect on moves that would score nearly equal (within the range of the maximum addition) in absence of the randomization. For moves that are close enough in score to have an effect on, the random contribution in the end-leaves will be filtered by minimax while trickling down to the root in such a way that it is no longer a homogeneously distributed random contribution to the root score, but on average suppresses scores of moves leading to sub-trees where the opponent had a lot of playable options and we only a few, while on average increasing scores where we have many options and the opponent only a few. And the latter are exactly the moves that, in the long term, will lead you to positions of the highest score.
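Two of the claims above — a bounded bonus can never flip moves that are far apart in score, and among near-equal moves the better one remains favored — are easy to check with a toy model (the scores and bonus range are invented, and this models only the root-level effect, not the minimax filtering):

```python
import random

def pick_with_random_bonus(move_scores, max_bonus, rng):
    """Pick the move with the highest score-plus-random-bonus: a toy
    stand-in for leaf randomization after it has trickled to the root."""
    return max(move_scores,
               key=lambda m: move_scores[m] + rng.uniform(0, max_bonus))

rng = random.Random(1)
# A 30-point score gap cannot be bridged by a bonus of at most 10 points.
far_apart = {"best": 100, "second": 70}
# Near-equal moves are diversified, but the better move stays more likely.
near_equal = {"best": 100, "second": 97}
picks = [pick_with_random_bonus(near_equal, 10, rng) for _ in range(10000)]
```

In this model the far-apart case always selects "best", while in the near-equal case both moves are played, with "best" chosen clearly more often.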
I'm not very familiar with H.G.'s randomization technique, so I really have no idea how well it works. It sounds like he adds small random values to leaf node evaluations, which is of course different than selecting a random 'good' move from the root of the search. Note that it is definitely true that randomness can be helpful for a chess engine, even though it might seem counter-intuitive. For example, basically all strong chess engines (as far as I know) use random (pseudo-random) Zobrist keys for hashing. The random keys may be generated at run-time, or pre-generated, but they are random either way. Using different random keys will cause the engine to give slightly different results without necessarily changing the engine's overall strength. Obviously, if used incorrectly, randomness could severely hurt an engine's strength as well. For example, if an engine just plays random moves. :)
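The Zobrist scheme mentioned above can be shown in a minimal sketch (the piece/square encoding and fixed seed here are illustrative, not any particular engine's layout):

```python
import random

# One random 64-bit key per (piece, square); here "pre-generated"
# by using a fixed seed.
rng = random.Random(2008)
NUM_PIECES, NUM_SQUARES = 12, 64
ZOBRIST = [[rng.getrandbits(64) for _ in range(NUM_SQUARES)]
           for _ in range(NUM_PIECES)]

def toggle(h, piece, square):
    """XOR a piece in or out of the hash; XOR is its own inverse,
    so the same call both adds and removes the piece."""
    return h ^ ZOBRIST[piece][square]

# Incremental update: move piece 0 from square 12 (e2) to square 28 (e4).
h0 = toggle(0, 0, 12)                    # position with piece on e2
h1 = toggle(toggle(h0, 0, 12), 0, 28)    # remove from e2, place on e4
h2 = toggle(toggle(h1, 0, 28), 0, 12)    # take the move back
```

Because the update is incremental, making and unmaking a move restores the original hash exactly, which is what makes the scheme usable for transposition tables.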
Rest assured, I intend to drop this futile topic of conversation soon and leave you alone. The following is my impression of how the limited randomization of move selection that you have described as being at work within Joker80 must be harmful to the quality of moves made (on average) at long time controls. Since you have experience and knowledge as the developer of Joker80, I will defer to you the prerogative to correct errors in my inferred, general understanding of its workings. _______________________________________________________ short time control 1x At an example time control of 10 seconds per move (average), Joker80 cuts thru 8 plies before it runs out of time and must produce a move. At the moment the time expires, it has selected 12 high-scoring moves as candidates out of a much larger number of legal moves available. Generally, all of them score closely together with a few of them even tied for the same score. So, when Joker80 randomly chooses one move out of this select list, it has probably not chosen a move (on average) that is beneath the quality of the best move it could have found (within those severe time constraints) by anything except a minor amount. In other words, the damage to playing strength via randomization of move selection is minimized under minimal time controls. ___________________________ long time control 360x At an example time control of 60 minutes per move (average), Joker80 cuts thru 14 plies (due to its sophisticated advance pruning techniques) before it runs out of time and must produce a move. At the moment the time expires, it has selected only 4 high-scoring moves as candidates out of a much larger number of legal moves available. Generally, all of them score far apart with a probable best move scored significantly higher than the probable second best move. So, when Joker80 randomly chooses one move out of this select list, the chances are 3/4 that it has ignored its probable best move. 
Furthermore, it may not have chosen the probable second best move, either. It just as likely could have chosen the probable third or fourth best move, instead. Ultimately, it has probably chosen a move (on average) that is beneath the quality of the best move it may have successfully found by a moderate-major amount. In other words, the damage to playing strength via randomization of move selection is maximized under maximal time controls. _______________________________________ The moral of the story is that randomization of move selection reduces the growth in playing strength that normally occurs with time and plies completed.
'It would be very educational then to get yourself acquainted with the current state of the art of Go programming ...' Go is a connection game that is not related to Chess or its variants. The only thing Go has in common with Chess is that it is played upon a board using pieces. You did not directly address my comment.
| I just cannot understand how any rational, intelligent man could | believe that introducing chaos (i.e., randomness) is beneficial | (instead of detrimental) to achieving a goal defined in terms of | filtering-out disorder to pinpoint order. It would be very educational then to get yourself acquainted with the current state of the art of Go programming, where Monte-Carlo techniques are the most successful paradigm to date... | When you reduce the power of your algorithm in any way to | filter-out inferior moves, you thereby reduce the average | quality of the moves chosen and consequently, you reduce | the playing strength of your program- esp. at long time controls. Exactly. This is why I _enhance_ the power of my algorithm to filter out inferior moves, as the inferior moves have a smaller probability of drawing a large positive random bonus than the better moves. They thus have a lower probability of being chosen, which enhances the average quality of the moves, and thus playing strength. At any time control. It is a pity this suppression of inferior moves is only probabilistic, and some inferior moves by sheer luck can still penetrate the filter. But I know of no deterministic way to achieve the same thing. So something is better than nothing, and I settle for the inferior moves only getting a lower chance to pass. Even if it is not a zero chance, it is still better than letting them pass unimpeded. | In any event, the addition of the completely-unnecessary module of | code used to create the randomization effect within Joker80 that | you desire irrefutably makes your program larger, more complicated | and slower. Can that be a good thing? Everything you put into a Chess engine makes it larger and slower. Yet, taking almost everything out only leaves you with a weak engine like micro-Max 1.6. The point is that putting code in can also make the engine smarter, improve its strategic understanding, reduce its branching ratio, etc.
So whether it is a good thing or not does not depend on whether it makes the engine larger, more complicated, or slower. It depends on whether the engine still fits in the available memory, and from there produces better moves in the same time. Which larger, more complicated and slower engines often do. As always, testing is the bottom line. Actually the 'module of code' consists of only 6 instructions, as I derive the pseudo-random number from the hashKey. But the point you are missing is this: I have theoretical understanding of how Chess engines work, and am therefore able to extrapolate their behavior with high confidence from what I observe under other conditions (i.e. at fast TC). Just like I don't have to travel to the Moon and back to know its distance from the Earth, because I understand geometry and triangulation. So if including a certain evaluation term gives me more accurate scores (and thus more reliable selection of the best move) from 8-ply search trees, I know that this can only give better moves from 18-ply search trees. As the latter is nothing but millions of 8-ply search trees grafted on the tips of a mathematically exact 10-ply minimax propagation of the score from the 8-ply trees towards the root. Anyway, it is not of any interest to me to throw months of valuable CPU time at answering questions I already know the answer to.
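Deriving the pseudo-random number from the hash key can be sketched like this (the multiplier and bit operations are illustrative guesses at the general technique, not Joker80's actual six instructions):

```python
MASK64 = 0xFFFFFFFFFFFFFFFF

def random_bonus(hash_key, amplitude=8):
    """Derive a small evaluation bonus from the position's hash key:
    no separate RNG state, only a handful of operations, and the same
    position always receives the same bonus within a search."""
    mixed = (hash_key * 0x9E3779B97F4A7C15) & MASK64  # scramble the key
    return (mixed >> 60) % amplitude                  # keep a few top bits

bonus = random_bonus(0x123456789ABCDEF)
```

Because the bonus is a pure function of the hash key, repeated probes of the same position agree, so transposition-table scores stay internally consistent even with the randomization active.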
'Joker80's strength increases with time as expected, in the range from 0.4 sec to 36 sec per move, in a regular and theoretically expected way.' 'The effect you mention is observed NOT to occur and thus cannot explain anything that was observed to occur.' Admittedly, I have no proof ... yet. Of course, this is due to Joker80 never having been playtested at truly long time controls (from my point of view). _______________________________________________________________ 'Now if you want to conjecture that this will all miraculously become very different at longer TC, you are welcome to test it and show us convincing results. I am not going to waste my computer time on such a wild and expensive goose chase.' I respect your bravery in issuing the challenge. Although I would surely find the results of a randomized Joker80 vs. non-randomized Joker80 tournament at 60 minutes per move (on average) interesting, I am not willing either to invest the few (3-4) months of my computer time that I estimate it would require to playtest 16 games under acceptable, reliable conditions. My refusal is due to it not being extremely important or worthwhile to me just to keep the chess variant community from losing one potentially great talent to numerology (or some such). Besides, I have nothing to gain and nothing new to learn by conducting this long, difficult experiment. Only you stand to benefit tangibly from its results. I just cannot understand how any rational, intelligent man could believe that introducing chaos (i.e., randomness) is beneficial (instead of detrimental) to achieving a goal defined in terms of filtering-out disorder to pinpoint order. When you reduce the power of your algorithm in any way to filter out inferior moves, you thereby reduce the average quality of the moves chosen and consequently, you reduce the playing strength of your program- esp. at long time controls.
In other words, you are counteracting a portion of everything desirable that you achieve thru advanced pruning techniques used elsewhere within your program. Since you argue that randomization is no problem at all and I argue that randomization is a moderate-major problem, everything we say to one another is becoming purely argumentative. Only tests (that neither one of us intend to perform) can prove who is correct and settle the issue. ___________________________________________________________________ 'As I explained, it is very easy to switch this feature off. But you should be prepared for significant loss of strength if you do that.' To the contrary, you should be prepared for a significant gain of strength if you do that. Notably, you do not dare. In any event, the addition of the completely-unnecessary module of code used to create the randomization effect within Joker80 that you desire irrefutably makes your program larger, more complicated and slower. Can that be a good thing?
Derek Nalls: | Nonetheless, completing games of CRC (where a long, close, | well-played game can require more than 80 moves per player) | in 0:24 minutes - 36 minutes does NOT qualify as long or even | moderate time controls. In the case of your longest 36-minute games, | with an example total of 160 moves, that allows just 13.5 seconds per | move per player. In fact, that is an extremely short time by any | serious standards. In my experience most games on average take only 60 moves (perhaps because of the large strength difference of the players). As early moves are more important for the game result than late moves (even the best moves late in the game do not help you if your position is already lost), most engines use 2.5% of the remaining time for their next move (on average, depending on how the iterations end compared to the target time). That would be nearly 54 sec/move at 36 min/game in the decisive phase of the game. That is more than you thought, but admittedly still fast. Note, however, that I also played 60-min games in the General Championship (without time odds), and that Joker80 confirms its lead over the competitors it manifested at faster time controls. But I don't see the point: Joker80's strength increases with time as expected, in the range from 0.4 sec to 36 sec per move, in a regular and theoretically expected way. This is over the entire range where I tested the dependence of the scoring percentage of various material imbalances on time control, which extended to only 15 sec/move, and found it to be independent of TC. So your 'explanation' for the latter phenomenon is just nonsense. The effect you mention is observed NOT to occur, and thus cannot explain anything that was observed to occur. Now if you want to conjecture that this will all miraculously become very different at longer TC, you are welcome to test it and show us convincing results. I am not going to waste my computer time on such a wild and expensive goose chase.
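The 2.5%-of-remaining-time rule is easy to check numerically (a generic sketch of this common allocation scheme, not any specific engine's code):

```python
def times_per_move(total_seconds, fraction=0.025, moves=60):
    """Allocate each move a fixed fraction of the remaining clock,
    so the early (more important) moves get the most thinking time."""
    remaining, allocation = float(total_seconds), []
    for _ in range(moves):
        budget = remaining * fraction
        allocation.append(budget)
        remaining -= budget
    return allocation

# 36 min/game: 2.5% of the initial 2160 s gives 54 s for the first move,
# decaying geometrically over a 60-move game.
times = times_per_move(36 * 60)
```

The per-move budget decays geometrically, so the opening and middlegame get far more time per move than the naive total-time-divided-by-moves figure suggests.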
Because from the way I know the engines work, I know that they are 'scalable': their performance at 10 ply results from one ply being put in front of 9-ply search trees. And that extra ply will always help. If they have good 9-ply trees, they will have even better 10-ply trees. But you don't have to take my word for it. You have the engine, and if you don't want to believe that at 1 hour per move you will get the same win probability as at 1 sec/move, or that at 1 hour per move it won't beat 10 min/move, just play the games, and you will see for yourself. It would even be appreciated if you publish the games here or on your website. But, needless to say, one or two games won't convince anyone of anything. | 'since I am not a computer chess programmer, I cannot possibly | know what I am talking about when I dare criticize an important | working of your Joker80 program' Well, you certainly make it appear that way. As, despite the elaborate explanation I gave of why programs derive extra strength from this technique, you still draw a conclusion that in practice was already shown to be 100% wrong earlier. And if you think you will run into the problem you imagine at enormously longer TC, well, very simple: don't use Joker80, but use some other engine. You are on your own there, as I am not specifically interested in extremely long TC. There is always a risk in using equipment outside the range of conditions for which it was designed and tested, and that risk is entirely yours. So better tread carefully, and make sure you rule out the perceived dangers by careful testing. | You must decide upon and define the primary function of your | Joker80 program. I do not see the dilemma you sketch. The purpose is to play ON AVERAGE the best possible move. If you do that, you have the best chance to win the game. If I can achieve that through a non-deterministic algorithm better than through a deterministic one, I go for the non-deterministic method.
That it also diversifies play, and makes me less sensitive to prepared openings from the opponent, is a win/win situation. Not a compromise. As I explained, it is very easy to switch this feature off. But you should be prepared for significant loss of strength if you do that.
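The 2.5%-of-remaining-clock rule of thumb quoted in this discussion is easy to check in a couple of lines (a minimal sketch; the fraction is the one stated in the post, the function name is mine):

```python
# Rule of thumb from the discussion: spend ~2.5% of the remaining
# clock time on the next move.
def seconds_per_move(remaining_seconds, fraction=0.025):
    """Time budget for the next move under the stated heuristic."""
    return remaining_seconds * fraction

# A 36 min/game clock gives nearly a minute per move early on:
print(seconds_per_move(36 * 60))  # 54.0
```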
I am slightly relieved and surprised that Joker80 measurably improves the quality of its moves as a function of time or plies completed over a range of speed chess tournaments. Nonetheless, completing games of CRC (where a long, close, well-played game can require more than 80 moves per player) in 0:24 minutes - 36 minutes does NOT qualify as long or even moderate time controls. In the case of your longest 36-minute games, with an example total of 160 moves, that allows just 13.5 seconds per move per player. In fact, that is an extremely short time by any serious standards. I consider 10 minutes per move a moderate time that produces results of marginal, unreliable quality and 60-90 minutes per move a long time that produces results of acceptable, reliable quality. Ask Reinhard Scharnagl or ET about the longest time per move they have used testing openings with their programs playing 'Unmentionable Chess'- 24 hours per move! It is noteworthy that you are now resorting to playing dirty by using the 'exclusivist argument' that essentially 'since I am not a computer chess programmer, I cannot possibly know what I am talking about when I dare criticize an important working of your Joker80 program'. What you fail to take into account is that I am a playtester with more experience than you at truly long time controls. If you will not listen to what I am trying to tell you, then why will you not listen to Scharnagl? After all, he is also a computer chess programmer with a lot of knowledge in important subject matters (such as mathematics). You really should not be laughing. This is a serious problem. Your sarcastic reaction does nothing to reassure my trust or confidence that you will competently investigate it, confirm it and fix it. Now, please do not misconstrue my remarks. My intent is not to overstate the problem. I realize Joker80 in its present form is not a totally random 'woodpusher'.
It would not be able to win any short time control tournaments if that were the case. In fact, I believe you when you state that you have not experienced any problems with it but ... I think this is strictly because you have not done any truly long time control playtesting with it. You must decide upon and define the best primary function for your Joker80 program:

1. To pinpoint the single, very best move available from any position. [Ideally, repeats could produce an identical move.]

OR

2. To produce a different move from any position upon most repeats. [At best, by randomly choosing amongst a short list of the best available moves.]

These two objectives are mutually exclusive. It is impossible and self-contradictory for a program to somehow accomplish both. Virtually every AI game developer in the world except you chooses #1 as preferable to #2 by a long shot in terms of the move quality produced on average. If you do not even commit your AI program to TRYING to find the single best move available because you think variety is just a whole lot more interesting and fun, then it will be soft competition at truly long time controls facing other quality AI programs that are frequently-sometimes pinpointing the single, best move available and playing it against you.
Derek: 'I hope you can handle constructive advice.' It gives me a big laugh, that's for sure. Of course none of what you say is even remotely true. That is what happens if you jump to conclusions regarding complex matters you are not knowledgeable about, without even taking the trouble to verify your ideas. Of course I extensively TESTED how the playing strength of Joker80 (and all available other engines) varied as a function of time control. This was the purpose of several elaborate time-odds tournaments I conducted, where various versions of most engines participated that had to play their games in 36, 12, 4, 1:30, 0:40 or 0:24 min, where handicapped engines were meeting non-handicapped ones in a full round robin. (I.e. the handicaps were factors 3, 9, 24, 54 or 90, where only the strongest engines were handicapped up to the very maximum, and the weakest only participated in an unhandicapped version.) And of course Joker80 behaves similarly to any Shannon-type engine that is reasonably free of bugs: its playing strength measured in Elo monotonically increases in a logarithmic fashion, approximately following the formula rating = 100*ln(time). So Joker80 at 5 min/move crushes Joker80 at 1 sec per move, as you could have easily found out for yourself. So much for your nonsense about Joker80 failing to improve its move quality with time. For some discussion on one of the tournaments, see: http://www.talkchess.com/forum/viewtopic.php?t=19764&postdays=0&postorder=asc&topic_view=flat&start=34 At that time Fairy-Max still had a hash-table bug that made it hang (and subsequently forfeit on time), striking at a fixed rate per second, so that Fairy-Max forfeited more and more games at longer TC. Since then the bug has been identified and repaired, and now Fairy-Max also performs progressively better at longer TC. So nice try, but next time better save your breath for telling the surgeon how to do his job before he performs open heart surgery on you.
Because he no doubt has much more to learn from you regarding cardiology than I have in the area of building Chess engines... Things are as they are, and can become known by observation and testing. Believing in misconceptions born out of ignorance is not really helpful. Or, more explicitly: if you think you know how to build better Chess engines than other people, by all means, do so. It will be fun to confront your ideas with reality. In the meantime I will continue to build them as I think best (and know is best, through extensive testing), so you should have every chance to surpass them. Lacking that, you could at least _use_ the engines of others to check if your theories of how they behave have any reality value. You don't have to depend on the time-odds tourneys and other tests I conduct. You might not even be aware of them, as the developers of Chess engines hardly ever publish the thousands of games they play to test whether their ideas work in practice.
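The quoted trend rating = 100*ln(time) implies a fixed Elo gain per doubling of the time control, since the rating difference depends only on the ratio of the two times. A quick check (the formula is the one from the post; everything else here is illustrative):

```python
import math

# Trend quoted in the post: rating ~ 100 * ln(time).
# The rating *difference* between two time controls then depends
# only on their ratio:
def rating_gain(t_slow, t_fast):
    """Expected Elo gain when moving from t_fast to t_slow per game."""
    return 100 * (math.log(t_slow) - math.log(t_fast))

print(round(rating_gain(2, 1), 1))   # 69.3 Elo per doubling of time
print(round(rating_gain(90, 1), 1))  # 450.0 for the factor-90 handicap
```

This is why a factor-90 time handicap (36 min vs 0:24 min games) is such a severe test in the time-odds tournaments described above.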
The reason you have never been able to find any correlation between winning probabilities for one army and time controls [contrary to the experiences of people using other AI programs] in asymmetrical playtests using Joker80 is that you have destructively randomized the algorithm within your program to such an extent that it fails to measurably improve the quality of its moves as a function of time or plies completed. A program with serious problems of this nature may do well in speed chess, but at truly long time controls against quality programs that improve as they should with time or plies per move, it cannot consistently win. I have two useful, important pieces of news for you: 1. All of the statistical data you have generated using Joker80 (appr. 20,000+ games) is corrupt. It must all be thrown out and started over from scratch after you repair Joker80. 2. All of your material values for CRC pieces are unreliable since they are based upon and derived from #1 (corrupt statistical data). I hope you can handle constructive advice.
I would have thought that 'twice the same flip in a row' was pretty unambiguous, especially in combination with the remark about two-sided testing. But let's not quibble about the wording. The point was that for two-sided testing, if you suspect a coin to be loaded, but have no idea if it is loaded to produce tails or heads, the two flips tell you exactly nothing. They are either the same or different, and on an unbiased coin each would occur with equal probability. So the 'confidence' of any conclusion as to the fairness of the coin drawn from the two flips would be only 50%. I.e. not better than totally random; you might as well have guessed whether it was fair or not without flipping it at all. That would also have given you a 50% chance of guessing correctly.
Well, when you said ... 'Actually the chance for twice the same flip in a row is 1/2.' ... that was vague and misleading. I thought you meant 'heads' twice OR 'tails' twice equals a chance of 1/2 instead of the sum of 'heads' twice AND 'tails' twice equals a chance of 1/2. Since English is a second language to you, of course I will overlook this minor mis-communication and even apologize for implicitly accusing you of incompetence. However, you should expect that you will draw critical reactions from others when you have previously, falsely, explicitly accused them of incompetence in a subject matter.
'Actually the chance for twice the same flip in a row is 1/2.'

H.G. is correct here.

- The probability of two heads in a row is 1/4.
- The probability of two tails in a row is 1/4.
- The probability of two same flips in a row is the sum of these two outcomes: 1/4 + 1/4 = 1/2.

Another way to think about it: With two coin flips, there are 4 equally likely outcomes: HH, HT, TH, TT. In 2 of the 4 (equally likely) outcomes, the same flip result occurs twice in a row.
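The 1/2 figure is also easy to confirm with a short simulation (purely illustrative):

```python
import random

random.seed(2024)  # fixed seed so the run is reproducible

# Flip two fair coins many times and count how often they match.
trials = 100_000
same = sum(random.choice("HT") == random.choice("HT") for _ in range(trials))
print(round(same / trials, 2))  # ~ 0.50
```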
Indeed, it is a stochastic way to simulate mobility evaluation. In the presence of other terms it should of course not be made so large that it dominates the total evaluation, just like explicit mobility terms should not dominate the evaluation. But its weight should not be set to zero either: properly weighted mobility might add more than 100 Elo to an engine. Joker has no explicit mobility in its evaluation, and relies entirely on this probabilistic mechanism to simulate it. The disadvantage is that, because of the probabilistic nature, it is not 100% guaranteed to always take the best decision. On rare occasions the single acceptable end leaf draws a higher random bonus than one hundred slightly better positions in another branch. OTOH it is extremely cheap to implement, while explicit mobility is very expensive. As a result, I might gain an extra ply in search depth. And then it becomes superior to explicit mobility, as it only counts tactically sound moves, rather than just every move. So it is like safe mobility, verified by a full Quiescence Search. In my assessment, the probabilistic mobility adds more strength to Joker than changing the Rook value by 50cP would add or subtract. This can be easily verified by play-testing. It is possible to switch this evaluation term off. In fact, you have to switch it on, but WinBoard does this by default. To prevent it from being switched on, one should run WinBoard with the command-line option /firstInitString='new'. (The default setting is 'new\nrandom'. If Joker is running as second engine, you will of course have to use /secondInitString='new'.)
Harm wrote: ... 'OTOH, a program that evaluates every position as a completely random number starts to play quite reasonable chess, once the search reaches 8-10 ply. Because it is biased to seek out moves that lead to pre-horizon nodes that have the largest number of legal moves, which usually are the positions where the strongest pieces are still in its possession.' ... This is nothing but a probability-based heuristic simulating a mobility evaluation component. But when a working positional evaluation is present, especially one also covering mobility, that randomizing method is not orthogonal to the calculated, much more appropriate knowledge. Thus you will overlay a much better evaluation with a disturbing noise generator. Nevertheless this approach might have advantages through the opening, by disrupting otherwise-working implementations of pre-investigated killer combinations.
'Do not you realize that forcing Joker80 to do otherwise must reduce its playing strength significantly from its maximum potential?' On the contrary, it makes it stronger. The explanation is that by adding a random value to the evaluation, branches with very many equal end leaves have a much larger probability to have the highest random bonus amongst them than a branch that leads to only a single end-leaf of that same score. The difference can be observed most dramatically when you evaluate all positions as zero. This makes all moves totally equivalent at any search depth. Such a program would always play the first legal move it finds, and would spend the whole game moving its Rook back and forth between a1 and b1, while the opponent is eating all its other pieces. OTOH, a program that evaluates every position as a completely random number starts to play quite reasonable chess, once the search reaches 8-10 ply. Because it is biased to seek out moves that lead to pre-horizon nodes that have the largest number of legal moves, which usually are the positions where the strongest pieces are still in its possession. It is always possible to make the random addition so small that it only decides between moves that would otherwise have exactly equal evaluation. But this is not optimal, as it would then prefer a move (in the root) that could lead (after 10 ply or so) to a position of score 53 (centiPawn), while all other choices later in the PV would lead to -250 or worse, over a move that could lead to 20 different positions (based on later move choices) all evaluating as 52cP. But, as the scores were just approximations based on finite-depth search, two moves later, when it can look ahead further, all the end-leaf scores will change from what they were, because those nodes are now no longer end-leaves. The 53cP might now be 43cP because deeper search revealed it to disappoint by 10cP.
But alas, there is no choice: the alternatives in this branch might have changed a little too, but now all range from -200 to -300. Not much help; we have to settle for the 43cP... Had it taken the root move that keeps the option open to go to any of the 20 positions of 52cP, it would now see that their scores on deeper search would have been spread out between 32cP and 72cP, and it could now go for the 72cP. In other words, the investment of keeping its options open, rather than greedily committing itself to going for an uncertain, only marginally better score, typically pays off. To properly weight the expected pay-back of keeping options that at the current search depth seem inferior, it must have an idea of the typical change of a score from one search depth to the next. And match the size of the random eval addition to that, to make sure that even slightly (but insignificantly) worse end-leaves still contribute to enhancing the probability that the branch will be chosen. Playing a game in the face of an approximate (and thus noisy) evaluation is all about contingency planning. As to the probability theory, you don't seem to be able to see the math because of the formulae...

P(hh) = 0.5*0.5 = 0.25
P(tt) = 0.5*0.5 = 0.25
______________________+
P(two equal) = 0.5
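The 'keep your options open' bias that the random addition creates can be seen in a toy model: a branch offering 20 end-leaves of the same base score beats a branch with a single end-leaf of that score about 20 times out of 21, because the maximum of 20 random bonuses is very likely to exceed a single draw. (A minimal sketch of the mechanism; none of this is Joker's actual code.)

```python
import random

random.seed(7)

def wide_branch_win_rate(trials=10_000, width=20):
    """Fraction of the time the wide branch (best of `width` random
    bonuses on equal-scored leaves) beats the narrow branch (a single
    random bonus on one leaf of the same base score)."""
    wide_wins = 0
    for _ in range(trials):
        narrow = random.random()
        wide = max(random.random() for _ in range(width))
        if wide > narrow:
            wide_wins += 1
    return wide_wins / trials

print(wide_branch_win_rate())  # close to 20/21 ~ 0.952
```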
'... in Joker the source of indeterminism is much less subtle: it is programmed explicitly.' This renders Joker80 totally unsuitable for my playtesting purposes. [I am just relieved that you told me this bizarre fact now before I invested large amounts of computer time and effort.] It is critically important that any AI program attempt (to its greatest capability) to pinpoint the single, very best possible move in the time allowed upon every move in the game even if this means that it would often-sometimes repeat an identical move from an identical position. Do not you realize that forcing Joker80 to do otherwise must reduce its playing strength significantly from its maximum potential?
'Actually the chance for twice the same flip in a row is 1/2.'
______________________________________________________

Really? You obviously need a lesson on probability. Let us start with elementary stuff.

Mathematical Ideas, fifth edition, Miller & Heeren, 1986. It is an old college textbook from a class I took in the mid-90's. [Yes, I passed the class.]
______________________

It says interesting things such as:

'The relative frequency with which an outcome happens represents its probability.'

'In probability, each repetition of an experiment is a trial. The possible results of each trial are outcomes.'
____________________________________________

An example of a probability experiment is 'tossing a coin'. Each 'toss' (trial of the experiment) has only two equally-possible outcomes, 'heads' or 'tails' ... assuming the condition that the coin is fair (i.e., not loaded).

probability = p
heads = h
tails = t
number of tosses = x
addition = +
involution = ^ [This is a substitute upon a single line for superscript representation of an exponent to the upper right of a base.]

probability of heads = p(h)
probability of tails = p(t)

p(h) is a base. p(t) is a base. x is an exponent.

p(h) = 0.5
p(t) = 0.5
_________________

What follows are examples of the chances of getting the same result upon EVERY consecutive toss.

1 time
x = 1
p(h) ^ x = 0.5 ^ 1 = 0.5
p(t) ^ x = 0.5 ^ 1 = 0.5
Note: In this case only ... p(h) + p(t) = 1.0

2 times
x = 2
p(h) ^ x = 0.5 ^ 2 = 0.25
p(t) ^ x = 0.5 ^ 2 = 0.25

3 times
x = 3
p(h) ^ x = 0.5 ^ 3 = 0.125
p(t) ^ x = 0.5 ^ 3 = 0.125

Etc ...
______________________

By a function that is the inverse of successive exponents of base 2, the chance for consecutive tosses to yield the same result rapidly becomes extremely small. When this occurs, there are only two possibilities: 'random good-bad luck' or an unfair advantage-disadvantage exists (i.e., 'the coin is loaded'). The sum of these two possibilities always equals 1.
random luck (good or bad) = l
unfair (advantage or disadvantage) = u
luck (heads) = l(h)
luck (tails) = l(t)
unfair (heads) = u(h)
unfair (tails) = u(t)

p(h) ^ x = l(h)
p(t) ^ x = l(t)
l(h) + u(h) = 1
l(t) + u(t) = 1

Therefore, as the chances of 'random good-bad luck' become extremely low in the example, the chances of an advantage-disadvantage existing for 'one side of the coin' or (if you follow the analogy) 'one side of the gameboard' or 'one player' or 'one set of piece values' become likewise extremely high. Only if it can be proven that an advantage-disadvantage does not exist for one player can it be accepted that the extremely unlikely event occurred by 'random good-bad luck'. It is essential to understand that random good luck or random bad luck cannot be consistently relied upon. From this fact alone, firm conclusions can be responsibly drawn with a strong probability of correctness.
____________________________________________________________

1 time
x = 1
p(h) ^ x = 0.5
u(h) = 0.5
p(t) ^ x = 0.5
u(t) = 0.5

2 times
x = 2
p(h) ^ x = 0.25
u(h) = 0.75
p(t) ^ x = 0.25
u(t) = 0.75

3 times
x = 3
p(h) ^ x = 0.125
u(h) = 0.875
p(t) ^ x = 0.125
u(t) = 0.875

Etc ...
Derek: | Conclusions drawn from playing at normal time controls are | irrelevant compared to extremely-long time controls. First, that would only be true if the conclusions actually depended on the TC. Which is a totally unproven conjecture on your part, and in fact contrary to every observation made at TCs where such observations can be made with any accuracy (because enough games can be played). This whole thing reminds me of my friend, who always claims that stones fall upward. When I then drop a stone to refute him, he just shrugs, and says it proves nothing because the stone is 'not big enough'. Very conveniently for him, the upward falling of stones can only be observed on stones that are too big for anyone to lift... But the main point is of course: if you draw a conclusion that is valid only at a TC that no one is interested in playing, what use would such a conclusion be? | The chance of getting the same flip (heads or tails) twice-in-a-row | is 1/4. Not impressive but a decent beginning. Add a couple or a | few or several consecutive same flips and it departs 'luck' by a | huge margin. Actually the chance for twice the same flip in a row is 1/2. Unless you are biased as to what the outcome of the flip should be (one-sided testing). And indeed, 10 identical flips in a row would be unlikely to occur by luck by a large margin. But that is rather academic, because you won't see 10 identical results in a row between the subtly different models. You will see results like 6-4 or 7-3, which will again be very likely to be a result of luck (as that is exactly what they are the result of, as you would realize after 10,000 games when the result is standing at 4,628-5,372). Calculate the number of games you need to typically get a result for a 53-47 advantage that could not just as easily have been obtained from a 50-50 chance with a little luck. You will be surprised...
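The closing challenge can be worked out with the normal approximation (my arithmetic, not from the post): for a 53-47 edge to sit about two standard deviations away from what a fair 50-50 coin would produce, you need on the order of a thousand games.

```python
import math

def games_needed(edge=0.03, sigmas=2.0):
    """Games needed before an `edge` above 50% exceeds `sigmas`
    standard deviations of a fair coin's score fraction
    (sigma = sqrt(0.5 * 0.5 / n))."""
    return math.ceil(sigmas**2 * 0.25 / edge**2)

print(games_needed())  # 1112
```

So a handful of very slow games, however well played, cannot distinguish a 53% model from a 50% one.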
| I have wondered why the performance of computer chess programs is | unpredictable and varied even under identical controls. Despite | their extraordinary complexity, I think of computer hardware, | operating systems and applications (such as Joker80) as deterministic. In most engines there always is some residual indeterminism, due to timing jitter. There are critical decision points, where the engine decides if it should do one more iteration or not (or search one more move vs aborting the iteration). If it would take such decisions purely on internal data, like node count, it would play 100% reproducibly. But most engines use the system clock (to not forfeit on time if the machine is also running other tasks), and experience the timing jitter caused by other processes running, or rotational delays of the hard disk they had been using. In multi-threaded programs this is even worse, as the scheduling of the threads by the OS is unpredictable. Even the position where exactly the program is loaded in physical memory might have an effect. But in Joker the source of indeterminism is much less subtle: it is programmed explicitly. Joker uses the starting time of the game as the seed of a pseudo-random-number generator, and uses the random numbers generated with the latter as a small addition to the evaluation, in order to lift the degeneracy of exactly identical scores, and provide a bias for choosing the move that leads to the widest choice of equivalent positions later. The non-determinism is a boon, rather than a bust, as it allows you to play several games from an identical position, and still do a meaningful sampling of possible games, and of the decisions that lead to their results. If one position would always lead to the same game, with the same result (as would occur if you were playing a simple end-game with the aid of tablebases), it would not tell you anything about the relative strength of the armies.
It would only tell you that this particular position was won / drawn. But nothing about the millions of other positions with the same material on the board. And the value of the material is by definition an average over all these positions. So with deterministic play, you would be forced to sample the initial positions, rather than using the indeterminism of the engine to create a representative sample of positions before anything is decided. | In fact, to the extent that your remarks are true, they will | support my case if my playtesting is successful that the | unlikelihood of achieving the same outcome (i.e., wins or | losses for one player) is extreme. This sentence is too complicated for me to understand. 'Your case' is that 'the unlikelihood of achieving the same outcome is extreme'? If the unlikelihood is extreme, is that the same as that the likelihood is extreme? Is the 'unlikelihood to be the same' the same as the 'likelihood to be different'? What does 'extreme' mean for a likelihood? Extremely low or extremely high? I wonder if anything is claimed here at all... I think you make a mistake by seeing me as a low-quality advocate. I only advocate the minimum quantity needed to not make the results inconclusive. Unfortunately, that is high, despite my best efforts to make it as low as possible through asymmetric playtesting and playing material imbalances in pairs (e.g. 2 Chancellors against two Archbishops, rather than one vs one). And that minimum quantity puts limits on the maximum quality that I can afford with my limited means. So it would be more accurate to describe me as a minimum-(significant)-quantity, maximum-(affordable)-quality advocate...
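The explicit scheme described above (seed a pseudo-random-number generator with the game's starting time, then add a small dither to the evaluation) is easy to sketch. This is an assumed illustration of the idea, not Joker's actual source; the class name and bonus size are mine:

```python
import random
import time

class DitheredEval:
    """Adds a small random bonus (in centipawns) to each static score,
    lifting the degeneracy of exactly equal evaluations."""
    def __init__(self, seed=None):
        # Seeding with the game's start time makes every game different...
        self.rng = random.Random(time.time() if seed is None else seed)

    def score(self, static_cp, max_bonus_cp=4):
        return static_cp + self.rng.randrange(max_bonus_cp + 1)

# ...while reusing a seed makes a game exactly reproducible:
a = DitheredEval(seed=42)
b = DitheredEval(seed=42)
print([a.score(0) for _ in range(5)] == [b.score(0) for _ in range(5)])  # True
```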
'If the result would be different from playing at a more 'normal' TC, like one or two hours per game, it would only mean that any conclusions you draw on them would be irrelevant for playing Chess at normal TC.' Conclusions drawn from playing at normal time controls are irrelevant compared to extremely-long time controls. It is desirable to see what secrets can be discovered from a rarely viewed vantage of extremely well-played games. Are not you interested at all to analyze move-by-move games played better than almost any pair of human players are capable? You do not seem to understand that I, too, am discontent with the probability of a small number of wins or losses in a row. This is a compensation that reduces the chance that the games were randomly played to the greatest extent attainable and consequently, the winner or loser randomly determined. _____________________________ '... playing 2 games will be like flipping a coin.' Correction- Playing 1 game will be like flipping a coin ... once. Playing 2 games will be like flipping a coin ... twice. The chance of getting the same flip (heads or tails) twice-in-a-row is 1/4. Not impressive but a decent beginning. Add a couple or a few or several consecutive same flips and it departs 'luck' by a huge margin. _______________________________________________________________ 'The result, whatever it is, will not prove anything, as it would be different if you would repeat the test. Experiments that do not give a fixed outcome will tell you nothing, unless you conduct enough of them to get a good impression on the probability for each outcome to occur.' I have wondered why the performance of computer chess programs is unpredictable and varied even under identical controls. Despite their extraordinary complexity, I think of computer hardware, operating systems and applications (such as Joker80) as deterministic. The details of the differences in outcomes do not concern me.
In fact, to the extent that your remarks are true, they will support my case if my playtesting is successful that the unlikelihood of achieving the same outcome (i.e., wins or losses for one player) is extreme. I am pleased to report that I estimate it will be possible, over time, to generate enough experiments using Joker80 to have meaning for a high-quality, low-quantity advocate (such as myself) and even a moderate-quality, moderate-quantity advocate (such as Scharnagl). As for a low-quality, high-quantity advocate (such as you), you will always be disappointed as you are impossible to please.
I have recently been sufficiently convinced via asymmetrical playtesting (still underway) that the 2 rooks : 1 queen advantage in material values is appr. the same in CRC as in FRC. [I used to think it was higher in CRC.] Consequently, I revised my model (again) and my CRC piece values: universal calculation of piece values http://www.symmetryperfect.com/shots/calc.pdf CRC material values of pieces http://www.symmetryperfect.com/shots/values-capa.pdf FRC material values of pieces http://www.symmetryperfect.com/shots/values-chess.pdf This change was implemented by raising the value of the queen in CRC- not by lowering the value of the rook. revised Joker80 values Nalls standard CRC model P85=268=307=518=818=835=950
Derek Nalls: | This might require very deep runs of moves with a completion time | of a few weeks to a few months per pair of games to achieve | conclusive results. It still escapes me what you hope to prove by playing at such an excessively long Time Control. If the result would be different from playing at a more 'normal' TC, like one or two hours per game, (which IMO will not be the case), it would only mean that any conclusions you draw on them would be irrelevant for playing Chess at normal TC. Furthermore, playing 2 games will be like flipping a coin. The result, whatever it is, will not prove anything, as it would be different if you would repeat the test. Experiments that do not give a fixed outcome will tell you nothing, unless you conduct enough of them to get a good impression on the probability for each outcome to occur.
'Because of all this, I suggest evaluating entire configuration of pieces, rather than a single piece.' This is exactly what Chess engines do. But it is a subject that transcends piece values. Material evaluation is supposed to answer the question: 'what combination of pieces would you rather have, without knowing where they stand on the board'. Piece values are an attempt to approximate the material evaluation as a simple sum of the value of the individual pieces, making up the army. It turns out that material evaluation is by far the largest component of the total evaluation of a Chess position. And this material evaluation again can be closely approximated by a sum of piece values. The most well-known exception is the Bishop pair: having two Bishops is worth about half a Pawn more than double the value of a single Bishop. Other non-additive terms are those that make the Bishop and Rook value dependent on the number of Pawns present. To account for such effects some engines (e.g. Rybka) have tabulated the total value of all possible combinations of material (ignoring promotions) in a 'material table'. Such tables can then also account for the material component of the evaluation that gives the deviation from the sum of piece values due to cooperative effects between the various pieces. Useful as this may be, it remains true that piece values are by far the largest contribution to the total evaluation. The only positional terms that can compete with it are passed pawns (a Pawn on 7th rank is worth nearly 2.5 normal Pawns) and King Safety (having a completely exposed King in the middle game, when the opponent still has a Queen or similar super-piece, can be worth nearly a Rook).
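The structure described above (material evaluation as a sum of piece values, plus non-additive corrections such as the Bishop pair) can be sketched in a few lines. The centipawn figures below are illustrative placeholders, not anyone's published model:

```python
# Illustrative centipawn values (placeholders, not a published model).
PIECE_VALUE = {"P": 100, "N": 300, "B": 300, "R": 500, "Q": 950}
BISHOP_PAIR_BONUS = 50  # 'about half a Pawn' beyond twice a lone Bishop

def material(army):
    """Material score of one army, given as a list of piece letters,
    e.g. ['R', 'B', 'B', 'P', 'P']. Mostly a simple sum of piece
    values, with the Bishop pair as the simplest non-additive term."""
    score = sum(PIECE_VALUE[p] for p in army)
    if army.count("B") >= 2:
        score += BISHOP_PAIR_BONUS
    return score

# Two Bishops are worth more than twice one Bishop:
print(material(["B", "B"]) - 2 * material(["B"]))  # 50
```

A 'material table' as described for Rybka would tabulate such totals for every combination of material, so that further cooperative corrections can be folded in.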
Perhaps we need to look back to exactly why we need piece values. Is it to balance different armies, or just because people are curious? Is the objective to turn Chess Variants into a single balanced game, or something else? Maybe we need to think about the reason for the discussion; then you can perhaps find a way to cut the Gordian knot instead of trying to untangle it.
Originally, I planned two 'internal playtests'. [By this self-invented term I mean playtests of the standard model of a person against a special model that I have compelling reasons to think may be superior by a provable margin.] The first planned test involves the standard CRC model of Muller against a special CRC model with a higher, closer-to-conventional rook value. Upon closer examination, I suspected that the discrepancy was possibly too small to be detected even with very long time controls. So, I announced that this test was cancelled. Notwithstanding, I may change my mind and return to this unsolved mystery if Joker80 demonstrates unusually-high aptitude as a playtesting tool. This might require very deep runs of moves with a completion time of a few weeks to a few months per pair of games to achieve conclusive results. The second planned test involves the standard CRC model of Scharnagl against a special CRC model with a higher, unconventional archbishop value. Scharnagl currently assigns the archbishop a material value of appr. 77% that of the chancellor in his standard CRC model. Muller currently assigns the archbishop a material value of greater than 97% that of the chancellor in his standard CRC model. Nalls currently assigns the archbishop a material value of less than 98% that of the chancellor in his standard CRC model. I devised a special CRC model using identical material values for every piece in the standard CRC model by Scharnagl except that it assigns the archbishop a material value of exactly 95% that of the chancellor (18% or 1.65 pawns higher). [Note that this figure is slightly more moderate than those by Muller & Nalls.] A discrepancy this large should be detectable at short-moderate time controls. This test is now underway.
If either of these tests is successful at establishing or implicating a probability that the special models play stronger than the standard models, then revisions to the standard models may occur. At that juncture, we would be ready to begin 'external playtests'. [By this self-invented term I mean playtests of the standard models of different persons against one another.]
I believe that is correct [that is what programs like Fritz and Chess Master seem to do... evaluating the two configurations and giving a score for the deviation] but also I would say, evaluate the pieces within the given position. The values are relative and change with every move.
The lowly pawn about to queen is a fine example. The Knight that attacks 8 spaces compared to one that attacks 4 is another, as is the 'bad' [blockaded] Bishop.
Another concept is that of brain power. For example, the late Bobby Fischer's Knights would be much more powerful than mine... not in potential, but in reality of games played. Pieces have potential, but the amount of creative power behind them is an important factor.
It seems like a normal FIDE pawn, but by simply shifting all the pawns up one row, the value of all of them changes. In other words, their value is dependent upon their proximity to other pawns. In light of this, are pieces worth the same in every configuration of Chess960? This issue is more complicated than it appears. Take Near vs Normal Chess, for example. Which side has an advantage? The Near side moves everything up one row and drops castling, but has a back row to either drop the king back or mobilize the rooks. And, against this, Near can capture the pawns of Normal en passant, but Normal can't do the same to Near. Because of all this, I suggest evaluating the entire configuration of pieces, rather than a single piece.
'Let me provide another challenge for people here regarding pawns. How much is a pawn that moves only one space forward (not initial 2) but starts on the third row instead of second worth in contrast to a normal chess pawn? How much is it worth alone, and then in a line of pawns that start on the third row?' But this is a totally normal FIDE Pawn... It would get a pretty large positional penalty if it was alone (isolated-pawn penalty). In a complete line of pawns on the 3rd rank it would be worth a lot more, as it would not be isolated, and not be backward. All in all it would be fairly similar to having a line of Pawns on second rank, as the bonus for pushing the Pawns forward 1 square is approximately cancelled by not having Pawn control anymore over any of the squares on the 3rd rank.
I believe the value of a piece should relate to its mobility first and foremost. If one were to end up rating a piece, come up with a value of 1 for the most pathetic potential piece in the game, and then adjust accordingly. How about a pawn that starts out on the second space and only moves backwards one as its move and doesn't capture? That pawn has a value of one. How much more is an Asian chess pawn that moves only one space forward, and doesn't promote worth in contrast? To base it on a normal chess pawn is to not provide a full solution for the variant community. Let me provide another challenge for people here regarding pawns. How much is a pawn that moves only one space forward (not initial 2) but starts on the third row instead of second worth in contrast to a normal chess pawn? How much is it worth alone, and then in a line of pawns that start on the third row?
'Do you think these piece values will work smoothly with Joker80 running under Winboard F yet remain true to all three models?' Yes, I think these values will not conflict in any way with any of the hard-wired value approximates that are used for pruning decisions. At least not to the point where it would lead to any observable effect on playing strength. (Prunings based on the piece values occur only close to the leaves, and engines are usually quite insensitive to how exactly you prune there.)
'I cannot speak for Reinhard Scharnagl at all, though.' This is exactly the problem. 'Base value' for Pawns is a very ill-defined concept, as it is the smallest of all piece base values, while the positional terms regarding Pawns are usually the largest of all positional terms. And the whole issue of pawn-structure evaluation in Joker is so complex that I am not even sure if the average of the positional terms (over all pawns and over a typical game) is positive or negative. Pawns get penalties for being doubled, or for having no Pawns next to or behind them on neighboring files. They get points for advancing, but they get penalties for creating squares that can no longer be defended by any Pawn. My guess is that in general the positional terms are slightly positive, even for non-passers not involved in King Safety. A statement like 'a Knight is worth exactly 3 Pawns' is only meaningful after exactly specifying which kind of Pawn. If the Scharnagl model evaluates all non-passers exactly the same (except, perhaps, edge Pawns), then the question still arises how to most closely approximate that in Joker80, which doesn't. And simply setting the Joker80 base value equal to the single value of the Scharnagl model is very unlikely to do it. Good differentiation in Pawn evaluation is likely to impact playing strength much more than the relative value of Pawns and Pieces, as Pawns are traded for other Pawns (or such trades are declined by pushing the Pawn and locking the chains) much more often than they can be traded for Pieces.
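As an illustration of why the Pawn 'base value' is so slippery, here is a toy pawn-structure term implementing the kinds of penalties Muller mentions (doubled, isolated, backward; a bonus for advancing). All the weights and the backwardness test are invented for illustration; they are not Joker80's actual evaluation:

```python
def pawn_structure_bonus(my_pawns):
    """Toy positional term for one side's pawns.

    my_pawns: list of (file, rank) squares, rank counted 1..8 from the
    owner's side (pawns start on rank 2). All weights are illustrative
    guesses, not Joker80's real values.
    """
    by_file = {}
    for f, r in my_pawns:
        by_file.setdefault(f, []).append(r)

    score = 0
    for f, ranks in by_file.items():
        # doubled pawns: every extra pawn on the same file is penalized
        score -= 15 * (len(ranks) - 1)
        neighbors = by_file.get(f - 1, []) + by_file.get(f + 1, [])
        for r in ranks:
            score += 4 * (r - 2)          # small bonus per rank advanced
            if not neighbors:
                score -= 12               # isolated: no pawn on adjacent files
            elif all(n > r for n in neighbors):
                score -= 8                # crude 'backward' test: all
                                          # neighbors already past this pawn
    return score
```

Even in this toy, deleting "the same pawn" costs a different number of points depending on its neighbors, which is Muller's point about the base value being ill-defined.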
Muller: Here is my latest revision to my 'winboard.ini' file. Are these piece values acceptable to you? Do you think these piece values will work smoothly with Joker80 running under Winboard F yet remain true to all three models?
/firstChessProgramNames={
'C:\winboard-F\Joker80\w\M-st\w-M-st 22 P85=300=350=475=875=900=950'
'C:\winboard-F\Joker80\w\M-sp\w-M-sp 22 P85=300=350=560=875=900=950'
'C:\winboard-F\Joker80\w\S-st\w-S-st 22 P85=302=339=551=694=902=950'
'C:\winboard-F\Joker80\w\S-sp\w-S-sp 22 P85=302=339=551=857=902=950'
'C:\winboard-F\Joker80\w\N-st\w-N-st 22 P85=284=326=548=866=884=950'
'C:\winboard-F\Joker80\w\N-sp\w-N-sp 22 P85=284=326=548=866=884=950'
'C:\winboard-F\TJchess\TJChess10x8'
}
/secondChessProgramNames={
'C:\winboard-F\Joker80\b\M-st\b-M-st 22 P85=300=350=475=875=900=950'
'C:\winboard-F\Joker80\b\M-sp\b-M-sp 22 P85=300=350=560=875=900=950'
'C:\winboard-F\Joker80\b\S-st\b-S-st 22 P85=302=339=551=694=902=950'
'C:\winboard-F\Joker80\b\S-sp\b-S-sp 22 P85=302=339=551=857=902=950'
'C:\winboard-F\Joker80\b\N-st\b-N-st 22 P85=284=326=548=866=884=950'
'C:\winboard-F\Joker80\b\N-sp\b-N-sp 22 P85=284=326=548=866=884=950'
'C:\winboard-F\TJchess\TJChess10x8'
}
'If I were you, I would normalize all models to Q=950 but then replace the pawn value everywhere by 85.' Since this is what you (the developer of Joker80) recommend as optimum, this is what I will do. Are you sure that replacing any pawn values different from 85 points after renormalization to queen = 950 points still renders an accurate and complete representation, more or less, of the Scharnagl and Nalls models? At par of queen = 950 points, the pawn value in the Nalls model is not represented as being only 92.19% as high as that in the Muller model, and the pawn value in the Scharnagl model is not represented as being only 98.95% as high as that in the Muller model. Thru it all ... if a perfect representation is not quite possible, I can accept that without reservation. __________________________________ 'I don't think you could say then that you deviate from the model as the models do not really specify which type of Pawn they use as a standard.' Correctly calculating pawn values at the start of the game (much less, throughout the game) requires finesse, as it is indeed a complex issue. In fact, its excessive complexity is the reason my 66-page paper on the material values of pieces is silent on calculating pawn values in FRC & CRC. Instead, someone needs to read an entire book from an outside source about calculating the material values of the pieces in Chess to sufficiently understand it. Personally, I am content with the test situation as long as Joker80 handles all pawns under all three models, initially valued at 85 points, as fairly and equally as is realistically possible. I cannot speak for Reinhard Scharnagl at all, though. ________________________________________________ 'The way you did it now would make the first Bishop to be traded of the value the model prescribes, but would make the second much lighter. If you would subtract half the bonus, then on the average they would be what the model prescribes.' Now I understand better. It makes sense.
[I am glad I asked you.] Yes, I will subtract 20 points (1/2 of the 'bishop pair bonus') from the model-independent material values for the bishop under the Scharnagl & Nalls models.
Is there any special reason you want to keep the Pawn value equal in all trial versions, rather than, say, the total value of the army, or the value of the Queen? Especially in the Scharnagl settings it makes almost every piece rather light compared to the quick guesses used for pruning. Note that there are so many positional modifiers on the value of a pawn (not only determined by its own position, but also by the relation to other friendly and enemy pawns) that I am not sure what the base value really means. Even if I say that it represents the value of a Pawn at g2, the evaluation points lost on deleting a pawn on g2 will depend on whether there are pawns on the e- and i-file, and how far they are advanced, and on the presence of pawns on the f- and h-file (which might become backward or isolated), and of course on whether losing the pawn would create a passer for the opponent. If I were you, I would normalize all models to Q=950, but then replace the pawn value everywhere by 85 (I think the standard value used in Joker is even 75). I don't think you could say then that you deviate from the model, as the models do not really specify which type of Pawn they use as a standard. My value refers to the g2 pawn in an opening setup. Perhaps Reinhard's value refers to an 'average' pawn, in a typical pawn chain occurring in the early middle game, or a Pawn on d4/e4 (which is the most likely to be traded). As to the B-pair: tricky question. The way you did it now would make the first Bishop to be traded of the value the model prescribes, but would make the second much lighter. If you would subtract half the bonus, then on the average they would be what the model prescribes. The value is indeed hard-wired in Joker, but if you really want, I could make it adjustable through an 8th parameter.
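Muller's normalization recipe (scale each model so Q = 950, then overwrite the Pawn with 85) can be written as a small helper. This is my paraphrase of the recipe, not code from either engine; note it deliberately ignores the bishop-pair adjustment discussed separately:

```python
def renormalize(values, q_target=950, pawn_override=85):
    """Scale a piece-value list (P, N, B, R, A, C, Q, in that order)
    so that the queen equals q_target, then force the pawn value to
    pawn_override, as suggested in the thread."""
    scale = q_target / values[-1]
    scaled = [round(v * scale) for v in values]
    scaled[0] = pawn_override
    return scaled
```

Feeding it any of the P100-normalized sets quoted in this thread yields a set with Q = 950 and P = 85 by construction, with all other ratios to the queen preserved (up to rounding).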
Muller: Please have another look at this excerpt from my 'winboard.ini' file. There are standard and special versions of piece values by Muller, Scharnagl & Nalls for the white and black players, renormalized to pawn = 85 points. The special version of the Muller model has a rook value exactly 85 points, or 1.00 pawn, higher than the standard version. The special version of the Scharnagl model has an archbishop value of 736 points (appr. 95% of the chancellor value of 775 points) instead of 597 points (appr. 77%) for the standard version. The special version of the Nalls model is identical to the standard version until some test is needed and planned. Since I assume that the 'bishop pair bonus' is hardwired into Joker80, 40 points has been subtracted from the model-independent material values of the bishop under all three models. Is this correct?
/firstChessProgramNames={
'C:\winboard-F\Joker80\w\M-st\w-M-st 22 P85=300=350=475=875=900=950'
'C:\winboard-F\Joker80\w\M-sp\w-M-sp 22 P85=300=350=560=875=900=950'
'C:\winboard-F\Joker80\w\S-st\w-S-st 22 P85=260=269=474=597=775=816'
'C:\winboard-F\Joker80\w\S-sp\w-S-sp 22 P85=260=269=474=736=775=816'
'C:\winboard-F\Joker80\w\N-st\w-N-st 22 P85=262=279=505=799=815=876'
'C:\winboard-F\Joker80\w\N-sp\w-N-sp 22 P85=262=279=505=799=815=876'
'C:\winboard-F\TJchess\TJChess10x8'
}
/secondChessProgramNames={
'C:\winboard-F\Joker80\b\M-st\b-M-st 22 P85=300=350=475=875=900=950'
'C:\winboard-F\Joker80\b\M-sp\b-M-sp 22 P85=300=350=560=875=900=950'
'C:\winboard-F\Joker80\b\S-st\b-S-st 22 P85=260=269=474=597=775=816'
'C:\winboard-F\Joker80\b\S-sp\b-S-sp 22 P85=260=269=474=736=775=816'
'C:\winboard-F\Joker80\b\N-st\b-N-st 22 P85=262=279=505=799=815=876'
'C:\winboard-F\Joker80\b\N-sp\b-N-sp 22 P85=262=279=505=799=815=876'
'C:\winboard-F\TJchess\TJChess10x8'
}
Well, I share that concern. But note that the low Rook value was not only based on the result of Q-2R asymmetric testing. I also played R-BP and NN-RP, which ended unexpectedly badly for the Rook, and which sets the value of the Rook compared to that of the minor pieces. The value of the Queen, meanwhile, was independently tested against that of the minor pieces by playing Q-BNN. The low difference between R and B does make sense to me now, as the wider board should upgrade the Bishop a lot more than the Rook. The Bishop gets extra forward moves, and forward moves are worth a lot more than lateral moves. I have seen that in testing cylindrical pieces (indicated by *), where the periodic boundary condition w.r.t. the side edges effectively simulates an infinitely wide board. In a context of normal Chess pieces, B* = B+P, while R* = R + 0.25P. OTOH, Q* = Q+2P. So it doesn't surprise me that on wider boards R loses compared to Q and B. I can think of several systematic errors that could lead to unrealistically poor performance of the Rook in asymmetric playtesting from an opening position. One is that Capablanca Chess is a very violent game, where the three super-pieces are often involved in inflicting an early checkmate (or nearly so, where the opponent has to sacrifice so much material to prevent the mate that he is lost anyway). The Rooks initially offer not much defense against that. But your chances for such an early victory would be strongly reduced if you were missing a super-piece. So perhaps two Rooks would do better against Q after A and C are traded. This explanation would do nothing to explain poor Rook performance in R vs B, but perhaps it is B that is strong (it is also strong compared to N). The problem then would be not so much a low R value as a high Q value, due to cooperativity between super-pieces.
So perhaps the observed scores should not be entirely interpreted as high base values for Q, C and A, but might be partly due to super-piece pair bonuses similar to that for the Bishop pair. Which I would then (mistakenly) include in the base value, as the other super-pieces are always present in my test positions. Another possible source of error is that the engine plays a strategy that is not well suited for playing 2R vs Q. Joker80's evaluation does not place a lot of importance on keeping all its pieces defended. In general this might be a winning strategy, giving the engine more freedom in using its pieces in daring attacks. But 2R vs Q might be a case where this backfires, and where you can only manifest the superiority of your Rook force by very careful, meticulous, nearly allergic defense of your troops, slowly but surely pushing them forward. This is not really Joker's style of play. So it would be interesting to do the asymmetric playtesting for Q vs 2R with other engines as well. But TJchess10x8 only became available long after I started my piece-value project, and TSCP-G does not allow setting up positions (although now I know a work-around for that: forcing initial moves with both Archbishops to capture all the pieces to be deleted, and then retreating them before letting the engine play). And Smirf initially could not play automatically at all, and when I finally made a WB adapter for it so that it could, fast games by it were decided more by timing issues than by play quality (many losses on time with scores like +12!). And Fairy-Max is really a bit too simplistic for this, not knowing the concept of a Bishop pair or passed pawns, besides being a slower searcher.
As I moved to renormalize all of the values used in Joker80 (written into the 'winboard.ini' file) with the pawn at a par of 85 points, I looked at my notes again. They reminded me that your use of the 'bishop pair' refinement (with a bonus of 40 points) implies that the material value of the rook is either 1.00 pawn or 1.47 pawns greater than the material value of the bishop in CRC, depending upon whether both bishops or only one bishop, respectively, remain in the game. At that point, I realized that I would be attempting to playtest for a discrepancy that I know from experience is just too small to detect even at very long time controls. So, this planned test has been cancelled. I am not implying that this matter is unimportant, though. I remain concerned for the standard Muller model whenever it allows the exchange of its 2 rooks for 1 queen belonging to its opponent.
It looks OK to me. One caveat: the normalization (e.g. Pawn = 100) is not completely arbitrary, as the engine weighs material against positional terms, and doubling all piece values would effectively scale down the importance of passers and King Safety. In addition, the engine also uses some heavily rounded 'quick' piece values internally, where B=N=3, R=5, A=C=8 and Q=9, to make a rough guess whether certain branches stand any chance to recoup the material given away earlier in the branch. So in certain situations, when it is behind 800 cP, it won't consider capturing a Rook, because it expects that to be worth about 500 cP, and thus to fall 300 cP below the target. Such a large deficit would be beyond the safety margin for pruning the move. But if the piece values were scaled up such that the 800 merely represented being a Bishop behind, this obviously would be an unjustified pruning. The safety margin is large enough to allow some leeway here, but don't overdo it. It would be safest to keep the value of Q close to 950. I am indeed skeptical about the possibility of doing enough games to measure the difference you want to see in the total score percentage. But perhaps some sound conclusions could be drawn by not merely looking at the result, but at the actual games, singling out the Q vs 2R trades. (Or actually any Rook-versus-other-material trade before the end-game. Rooks capturing Pawns to prevent their promotion probably should not count, though.) These could then be used to separately extract the probability of such a trade for the two sets of piece values, and to determine the winning probability for each set of piece values once such a trade has occurred. By filtering the raw data this way, we get rid of the stochastic noise produced by the (majority of) games where the event we want to determine the effect of would not have occurred.
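The 'quick' piece-value pruning Muller describes can be sketched roughly as follows. The quick values (B=N=3, R=5, A=C=8, Q=9) and the 800/500/300 cP example are from his post; the 200 cP safety margin and the function itself are my illustrative guesses, not Joker80's actual code:

```python
# Heavily rounded quick values, in pawns, as described in the post
QUICK = {'P': 1, 'N': 3, 'B': 3, 'R': 5, 'A': 8, 'C': 8, 'Q': 9}
MARGIN = 200  # centipawn safety margin; an illustrative guess

def consider_capture(deficit_cp, captured_piece):
    """Near the leaves, keep a capture only if its optimistic quick
    value plus the safety margin can recoup the current material
    deficit; otherwise the branch is pruned."""
    gain_cp = QUICK[captured_piece] * 100
    return gain_cp + MARGIN >= deficit_cp   # False means: prune
```

With these numbers, being 800 cP behind prunes a Rook capture (500 + 200 < 800), exactly the situation the post warns becomes unsound if the tunable piece values are scaled up too far relative to the fixed quick values.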
Muller: Please confirm that these are legal values for the 'winboard.ini' file.
/firstChessProgramNames={
'C:\winboard-F\Joker80\w\M-st\w-M-st 22 P100=353=459=559=1029=1059=1118'
'C:\winboard-F\Joker80\w\M-sp\w-M-sp 22 P100=353=459=659=1029=1059=1118'
'C:\winboard-F\Joker80\w\S-st\w-S-st 22 P100=306=363=557=702=912=960'
'C:\winboard-F\Joker80\w\S-sp\w-S-sp 22 P100=306=363=557=866=912=960'
'C:\winboard-F\Joker80\w\N-st\w-N-st 22 P100=308=376=594=940=958=1031'
'C:\winboard-F\Joker80\w\N-sp\w-N-sp 22 P100=308=376=594=940=958=1031'
'C:\winboard-F\TJchess\TJChess10x8'
}
/secondChessProgramNames={
'C:\winboard-F\Joker80\b\M-st\b-M-st 22 P100=353=459=559=1029=1059=1118'
'C:\winboard-F\Joker80\b\M-sp\b-M-sp 22 P100=353=459=659=1029=1059=1118'
'C:\winboard-F\Joker80\b\S-st\b-S-st 22 P100=306=363=557=702=912=960'
'C:\winboard-F\Joker80\b\S-sp\b-S-sp 22 P100=306=363=557=866=912=960'
'C:\winboard-F\Joker80\b\N-st\b-N-st 22 P100=308=376=594=940=958=1031'
'C:\winboard-F\Joker80\b\N-sp\b-N-sp 22 P100=308=376=594=940=958=1031'
'C:\winboard-F\TJchess\TJChess10x8'
}
Of course, I would bet anything that there are no 1:1 exchanges supported under the standard Muller CRC model that could cause material losses. If there were, yours would not be one of the three most credible CRC models under close consideration. In fact, even your excellent Joker80 program would play poorly if stuck with faulty CRC piece values. Obviously, the longer the exchange, the rarer its occurrence during gameplay. The predominance of simple 1:1 exchanges over even the least complicated 1:2 or 2:1 exchanges in gameplay is large, although I do not know the stats. In fact, there is a certain 1:2 or 2:1 exchange I am hoping to see that is likely to support my contention that the Muller rook value should be higher: the 1 queen for 2 rooks (or 2 rooks for 1 queen) exchange. Please recall that under the standard Muller model, this is an equal exchange. However, under asymmetrical playtesting of comparable quality to that which I used to confirm the correctness of your higher archbishop value, I played numerous CRC games at various moderate time controls where the player without 1 queen (yet with 2 rooks) defeated the player without 2 rooks (yet with 1 queen). Ultimately, a key mechanism toward conclusive results is that while the standard Muller model is neutral toward a 2 rooks : 1 queen exchange, the special Muller model regards its 1 queen as significantly less valuable than the 2 rooks of its opponent. Consequently, this contrast in valuation could be played into ... and we would see who wins. I am actually pleased that you are a realist who shares my pessimism about this experiment. In any case, low odds do not deter a best effort to succeed. The main difference between us is that you calculate your pessimism by extreme statistical methods whereas I calculate my pessimism by moderate probabilistic methods. I remain hopeful that eventually I will prove to you that the method Scharnagl & I developed is occasionally productive.
Well, to get an impression of what you can expect: in my first versions of Joker80 I still used the Larry Kaufman piece values of 8x8 Chess. So the Bishop was half a Pawn too low, nearly equal to the Knight (as with more than 5 Pawns, Kaufman has a Knight worth more than a lone Bishop, neutralizing a large part of the pair bonus). Now unlike a Rook, a Bishop is very easy to trade for a Knight, as both get into play early. Making the trade usually wrecks the opponent's pawn structure by creating a doubled Pawn, giving enough compensation to make it attractive. So in almost all games Joker was playing with two Knights against two Bishops after 12 moves or so. Fixing that did increase the playing strength by ~100 Elo points. So where the old version would score 50%, the improved version would score 57%. Now a similarly bad value for the Rook would manifest itself far less readily: the Rooks get into play late, there is no nearly equal piece for which a 1:1 trade changes sign, and you would need 1:3 trades (R vs B+2P) or 2:2 trades (R+P for N+N), which are much more difficult to set up. So I would expect that being half a Pawn off on the Rook value would only reduce your score by about 3%, rather than 7% as with the Bishop. After playing 100 games, the score differs by more than 3% from the true win probability more often than not. So you would need at least 400 games to show with minimal confidence that there was a difference. Beware that the results of the games are stochastic quantities. Replay a game at the same time control, and the game Joker80 plays will be different. And often the result will be different. This is true at 1 sec per move, but it is equally true at 1 year per move. The games that will be played are just a sample from the myriads of games Joker80 could play with non-zero probability.
And with fewer than 400 games, the difference between the actually measured score percentage and the probability you want to determine will in most cases be larger than the effect of the piece values, if they are not extremely wrong (e.g. setting Q < B).
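The 100-game and 400-game figures above follow from the binomial standard error of a match score. A quick sanity check (treating each game as an independent win/loss trial; draws, which actually shrink the error somewhat, are ignored for simplicity):

```python
import math

def score_stderr(p, n_games):
    """Standard error of the measured score fraction after n_games,
    for a true per-game score probability p (binomial model)."""
    return math.sqrt(p * (1 - p) / n_games)

# Near p = 0.5, 100 games give a standard error of 5%, so a deviation
# of more than 3% from the true probability happens more often than
# not; 400 games shrink the error to 2.5%, barely enough to resolve
# a 3% effect with minimal confidence.
se_100 = score_stderr(0.5, 100)   # 0.05
se_400 = score_stderr(0.5, 400)   # 0.025
```

This matches Muller's statement that roughly 400 games are the minimum to see a 3% score difference at all.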
Everything is working fine. Thank you! I now have 12 instances of the Joker80 program running in various sub-directories of Winboard F, with the 'winboard.ini' file set to conveniently initiate any desired standard or special material values for the CRC models by Muller, Scharnagl and Nalls. In the first test, I am going to attempt to find a playtesting time where a distinct separation in playing strength occurs between the standard Muller model, wherein the rook is 1 pawn more valuable than the bishop, and a special Muller model, wherein the rook is 2 pawns more valuable than the bishop. If I successfully find a playtesting time that is survivable by humans, then we can hopefully establish a tentative probability as to which CRC model plays decisively better after a few to several games. At par 100 (for the pawn), the bishop is at 459 under both models, with the rook at 559 under the standard Muller model and 659 under the special Muller model. I want to playtest a special Muller model with a rook value 2.00 pawns higher than the bishop because the Nalls model has a rook value 2.19 pawns higher than the bishop and the Scharnagl model has a rook value 1.94 pawns higher than the bishop (for an average of 2.06 pawns). Since I am attempting to test for such a small difference in the material value of only one type of piece (the rook), I have doubts that I will be able to obtain conclusive results. In any case, if I obtain conclusive results, then very long time controls will surely be required to produce them.
One small refinement: if the command-line argument was used to modify the piece values, Joker80 will give its own name to WinBoard as 'Joker80.xp', instead of 'Joker80.np', so that it becomes easier to figure out which engine was winning (e.g. from the PGN file). Note also that at very long time control you might want to enlarge the hash table; the default is 128MB, but if you invoke Joker80 as 'joker80.exe 22 P100=300=....' it will use 256MB (and with 23 instead of 22 it will use 512MB, etc.)
OK, I replaced the joker80.exe on my website by one with adjustable piece values. (If you run it from the command line, it should say version 1.1.14 (h).) I also tried to fix the bug in undo (which I discovered was disabled altogether in the previous version), and although it seemed to work, it might remain a weak spot. (I foresee problems if the game contained a promotion, for instance, as it might not remember the correct promotion piece on replay.) So try to avoid using the undo. I decided to make the piece values adjustable through a command-line option, rather than from a file, to avoid problems if you want to run two different sets of piece values (where you would then have to keep the files separate somehow). The way it works now is that for the engine name (that WinBoard asks for in the startup dialog, or that you can put in the winboard.ini file to appear among the selectable engines there), you should write: joker80.exe P85=300=350=475=875=900=950 The whole thing should be put between double quotes, so that WinBoard knows the P... is an option to the engine, and not to WinBoard. The numerical values are those of P, N, B, R, A, C and Q, respectively, in centiPawn. You can replace them by any value you like. If you don't give the P argument, it uses the default values. If you give a P argument with not enough values, the engine exits. Note that these are base values, for the positionally average piece. For N and B this would be on c3, in the presence (for B) of ~6 own Pawns, half of them on the color of the Bishop. A Bishop pair further gets a 40cP bonus. For the Rook it is the value for one in the absence of (half-)open files. The Pawn value will be heavily modified by positional effects (centralization, support by own Pawns, blocking by enemy Pawns), which on average will be positive. Note that you can play two different versions against each other automatically. The first engine plays white, in two-machines mode.
(You won't be able to recognize them from their name...)
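For clarity, the 'P85=300=350=475=875=900=950' argument format Muller describes above can be parsed in a few lines. This is an illustrative sketch in Python; Joker80 itself is a C program, and this is not its actual code:

```python
# Piece order fixed by the post: P, N, B, R, A, C, Q, in centipawns
PIECES = ('P', 'N', 'B', 'R', 'A', 'C', 'Q')

def parse_piece_values(arg):
    """Parse a Joker80-style piece-value argument such as
    'P85=300=350=475=875=900=950' into a {piece: centipawns} dict.
    Like the engine, refuse an argument with missing values."""
    if not arg.startswith('P'):
        raise ValueError("piece-value argument must start with 'P'")
    values = [int(v) for v in arg[1:].split('=')]
    if len(values) != len(PIECES):
        raise SystemExit('not enough piece values given')
    return dict(zip(PIECES, values))
```

For example, parsing the Muller standard set yields P=85 and Q=950, matching the winboard.ini entries in the thread.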
'Human vs. engine play is virtually untested. Did you at any point of the game use 'undo' (through the WinBoard 'retract move')?' Yes. Many of us error-prone humans use it frequently. ________________________________________________ 'This is indeed something I should fix but the current work-around would be not to use 'undo'.' Makes sense to me. I can avoid using the 'retract move' command altogether. ________________________________________________________ 'I could make a Joker80 version that reads the piece base values from a file 'joker.ini' at startup. Then you could change them to anything you want to test, without the need to re-compile. Would that satisfy your needs?' Yes, better than I ever imagined. Thank you!
Muller: Please investigate this potentially serious bug I may have discovered while testing Joker80 under Winboard F ... Bugs, Bugs, Bugs! http://www.symmetryperfect.com/pass I am having a hard time with software today.
Muller: I would like to conduct two focused playtests using Joker80 at very long time controls (e.g., 30 minutes per move) to investigate these important questions: 1. Is Muller's rook value within the CRC set too low? 2. Is Scharnagl's archbishop value within the CRC set too low? I would need you to compile special versions of Joker80 for me using significantly different values for those CRC pieces, as well as Scharnagl's CRC piece set. To isolate the target variable, these games would be Muller (standard values) vs. Muller (test values) and Scharnagl (standard values) vs. Scharnagl (test values) via symmetrical playtesting. Anyway, we can discuss the details if you are interested or willing. Please let me know.
Since Muller's Joker80 has recently established itself via 'The Battle Of The (Unspeakables)' tournament as the best free CRC program in the world, I checked it out. I must report that setting up Winboard F (also written by Muller) to use it was straightforward, with helpful documentation. Generally, I am finding the features of Joker80 to be versatile and capable for any reasonable use.
To anyone who was interested ... My playtesting efforts using SMIRF have been suspended indefinitely due to a serious checkmate bug which tainted the first game at 30 minutes per move between Scharnagl's and Muller's sets of CRC piece values.
Rich Hutnik: | Anyone think this might be a sound approach? Well, not me! Science is not a democracy. We don't interview people in the street to determine if a neutron is heavier than a proton, or what the 100th decimal of the number pi is. At best, you could use this method to determine the CV rating of the interviewed people. But even if a million people thought that piece A is worth more than piece B, and none the other way around, that would not make it so. The only thing that counts is whether A makes you win more often than B would. If it doesn't, then it is of lower value. No matter what people say, or how many say it.
Here is another approach I would suggest for strength of pieces. How about we pick 100 and people order them from strongest to weakest? Work on a scoring system for position, and then at least get an idea of order of strength. Anyone think this might be a sound approach?
This discussion is too silly for words anyway. Because even if it were true that the winning probability for a given material imbalance would be different at 1 hour per move than it would be at 10 sec/move, it would merely mean that piece values are different for different quality players. And although that is unprecedented, that revelation in itself would not make the piece values at 1 hour per move of any use, as that is a time control that no one wants to play anyway. So the whole endeavor is doomed from the start: by testing at 1 hour per move, either you measure the same piece values as you would at 10 sec/move, and wasted 99.7% of your time, or you find different values, and then you have wrong values, which cannot be used at any time control you would actually want to play...
Reinhard, that is not relevant. It will happen on average just as often for the other side. It is in the nature of Chess. Every game that is won is won through an error that might not have been made with longer thinking, since the initial position is not a won position for either side. But most games are won by one side or the other, and if the players are allowed to think longer, most games are still won by one side or the other. What is so hard to understand about the statement 'the win probability (score fraction, if you allow for draws) obtained from a given quiet, but complex (many pieces) position between equal opponents does not depend on time control' that it prompts people to come up with irrelevancies? Why do you think that saying anything at all that does not mention an observed probability would have any bearing on this statement whatsoever? I don't think the ever more hollow-sounding self-declared superiority of Derek needs much comment. He obviously knows zilch about probability theory and statistics. Shouting that he does won't make it so, and won't fool anyone.
To H.G.M.: why do you have to be so unfriendly? But to give you a strong argument that longer thinking phases could change a game result, have a look at: [site removed], where [a claim is made] that there would be a mate in 9. In fact SMIRF has been in a lost situation there. But watching a chess engine calculate on that position, you could see that an initial heavy disadvantage switches into a secure win. Having engines calculate within short time frames would probably lead to another result. Here increasing thinking time indeed leads to a result switch. [The above has been edited to remove a name and site reference. It is the policy of cv.org to avoid mention of that particular name and site to remove any threat of lawsuits. Sorry to have to do that, but we must protect ourselves. -D. Howe]
'Is this story meant to illustrate that you have no clue as to how to calculate statistical significance?' No. This story is meant to illustrate that you have no clue as to how to calculate probabilistic significance ... and it worked perfectly. ________________________________________________________ There you go again. Missing the point entirely and ranting about probabilities not being proper statistics.
Reinhard Scharnagl: | I am still convinced, that longer thinking times would have an | influence on the quality of the resulting moves. Yes, so what? Why do you think that is a relevant remark? The better moves won't help you at all if the opponent also makes better moves. The result will be the same. And the rare cases where it is not, on average cancel each other. So for the umpteenth time: NO ONE DENIES THAT LONGER THINKING TIME PRODUCES SOMEWHAT BETTER MOVES. THE ISSUE IS THAT IF BOTH SIDES PLAY WITH LONGER TC, THEIR WINNING PROBABILITIES WON'T CHANGE. And don't bother to tell us that you are also convinced that the winning probabilities will change, without showing us proof. Because no one is interested in unfounded opinions, not even if they are yours.
Is this story meant to illustrate that you have no clue as to how to calculate statistical significance? Or perhaps that you don't know what it is at all? The observation of a single tails event rules out the null hypothesis that the lottery was fair (i.e. that the probability for this to happen was 0.000,000,01) with a confidence of 99.999,999%. Be careful, though, that this only describes the case where the winning android was somehow special or singled out in advance. If the other participants in the lottery were 100 million other cheating androids, it would not be remarkable in any way that one of them won. The null hypothesis that the lottery was fair predicted a 100% probability for that. But, unfortunately for you, it doesn't work for lotteries with only 2 tickets. Then you can rule out the null hypothesis that the lottery was fair (and hence the probability 0.5) with a confidence of 50%. And 50% confidence means that in 50% of the cases your conclusion is correct, and in the other 50% of the cases not. In other words, a confidence level of 50% is a completely blind, uninformed random guess.
Since I had to endure one of your long bedtime stories (to be sure), you are going to have to endure one of mine. Yet unlike yours [too incoherent to merit a reply], mine carries an important point: Consider it a test of your common sense- Here is a scenario ...
01. It is the year 2500 AD.
02. Androids exist.
03. Androids cannot tell lies.
04. Androids can cheat, though.
05. Androids are extremely intelligent in technical matters.
06. Your best friend is an android.
07. It tells you that it won the lottery.
08. You verify that it won the lottery.
09. It tells you that it purchased only one lottery ticket.
10. You verify that it purchased only one lottery ticket.
11. The chance of winning the lottery with only one ticket is 1 out of 100 million.
12. It tells you that it cheated to win the lottery by hacking into its computer system immediately after the winning numbers were announced, purchasing one winning ticket and back-dating the time of the purchase.
____________________________________________
You have only two choices as to what to believe happened-
A. The android actually won the lottery by cheating. OR
B. The android actually won the lottery by good luck. The android was mistaken in thinking it successfully cheated.
______________________________________________________
The chance of 'A' being true is 99,999,999 out of 100,000,000. The chance of 'B' being true is 1 out of 100,000,000.
________________________________________________
I would place my bet upon 'A' being true because I do not believe such unlikely coincidences will actually occur. You would place your bet upon 'B' being true because you do not believe such unlikely coincidences have any statistical significance whatsoever.
_________________________________________ I make this assessment of your judgment ability fairly because you think it is a meaningless result if a player with one set of CRC piece values wins against its opponent 10 times in a row, even though the chance of it being 'random good luck' is indisputably only 1 out of 1024. By the way ... 2 to the power 100 equals 1,267,650,600,228,229,401,496,703,205,376. Can you see how ridiculous your demand for 100 games is?
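The coin-flip arithmetic behind these figures is easy to check. A minimal pure-Python sketch (the function name is mine, and draws are ignored, as in the post's own reasoning):

```python
# Chance that one side wins N games in a row when both players are in
# fact equally strong and draws are ignored: 0.5 ** N.

def p_streak_by_luck(n_games: int) -> float:
    """Probability of an n-game winning streak under the equal-strength null."""
    return 0.5 ** n_games

print(p_streak_by_luck(10))   # 0.0009765625, i.e. the 1-in-1024 quoted above
print(2 ** 100)               # 1267650600228229401496703205376
```

The same formula gives the 15/16 figure mentioned elsewhere in the thread for a 4-game sweep: the confidence that the sweep was not luck is 1 - 0.5**4.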
Understanding your example as an argument against Derek Nalls' testing method, I wonder why your chess engines always think for the full given time frame. It would be much more impressive if your engine always decided immediately. ;-)
I am still convinced, that longer thinking times would have an influence on the quality of the resulting moves.
I really am completely lost, so I won't comment until I can see what the debate is about.
'This discussion is pointless.' On this one occasion, I agree with you. However, I cannot just let you get away with some of your most outrageous remarks to date. So, unfortunately, this discussion is not yet over. ____________________________________________ 'First you should have results, then it becomes possible to talk about what they mean. You have no result.' Of course, I have a result! The result is obviously the game itself as a win, loss or draw for the purposes of comparing the playing strengths of two players using different sets of CRC piece values. The result is NOT statistical in nature. Instead, the result is probabilistic in nature. I have thoroughly explained this purpose and method to you. I understand it. Reinhard Scharnagl understands it. You do not understand it. I can accept that. However, instead of admitting that you do not understand it, you claim there is nothing to understand. ______________________________________ 'Two sets of piece values as different as day and night, and the only thing you can come up with is that their comparison is 'inconclusive'.' Yes. Draws make it impossible to determine which of two sets of piece values is stronger or weaker. However, by increasing the time (and plies) per move, smaller differences in playing strength can sometimes be revealed with 'conclusive' results. I will attempt the next pair of Scharnagl vs. Muller and Muller vs. Scharnagl games at 30 minutes per move. Knowing how much you appreciate my efforts on your behalf motivates me. ___________________________________________________ 'Talk about pathetic: even the two games you played are the same.' Only one game was played. The logs you saw were produced by the Scharnagl (standard) version of SMIRF for the white player and the Muller (special) version of SMIRF for the black player. The game is handled in this manner to prevent time from expiring without computation occurring. ___________________________________________________ '... 
does your test setup s*ck!' What, now you hate Embassy Chess too? Take up this issue with Kevin Hill.
Jianying Ji: | Two suggestions for settling debates such as these. First distributed | computing to provide as much data as possible. And Bayesian statistical | methods to provide statistical bounds on results. Agreed: one first needs to generate data. Without data, there isn't even a debate, and everything is just idle talk. What bounds would you expect from a two-game dataset? And what if these two games were actually the same? But the problem is that the proverbial fool can always ask more than anyone can answer. If, by recruiting all PCs in the world, we could generate 100,000 games at an hour per move, an hour per move will of course not be 'good enough'. It will at least have to be a week per move. Or, if that is possible, 100 years per move. And even 100 years per move are of course no good, because the computers will still not be able to search into the end-game, as they will search only 12 ply deeper than with 1 hour per move. So what's the point? Not only is this an end-of-the-rainbow-type endeavor; even if you would get there, and generate the perfect data, where it is 100% sure and proven for each position what the outcome under perfect play is, what then? Because for simple end-games we are already in a position to reach perfect play, through retrograde analysis (tablebases). So why not start there, to show that such data is of any use whatsoever, in this case for generating end-game piece values? If you have the EGTB for KQKAN and KAKBN, how would you extract a piece value for A from it?
Two suggestions for settling debates such as these. First, distributed computing to provide as much data as possible. And second, Bayesian statistical methods to provide statistical bounds on results.
Once upon a time I had a friend in a country far, far away, who had obtained a coin from the bank. I was sure this coin was counterfeit, as it had a far larger probability of producing tails. I even PROVED it to him: I threw the coin twice, and both times tails came up. But do you think the fool believed me? No, he DIDN'T! He had the AUDACITY to claim there was nothing wrong with the coin, because he had tossed it a thousand times, and 523 times heads had come up! While it was clear to everyone that he was cheating: he threw the coin only 10 feet up into the air, on each try. While I brought my coin up to 30,000 feet in an airplane, before I threw it out of the window, BOTH times! And, mind you, both times it landed tails! And it was not just an ordinary plane, like a Boeing 747. No sir, it was a ROCKET plane! And still this foolish friend of mine insisted that his measly 10-foot throws made him more confident that the coin was OK than my IRONCLAD PROOF with the rocket plane. Ridiculous! Anyone knows that you can't test a coin by only tossing it 10 feet. If you do that, it might land on any side, rather than the side it always lands on. He might as well have flipped a coin! No wonder they sent him to this far, far away country: no one would want to live in the same country as such an idiot. He even went as far as to buy an ICE CREAM for that coin, and even ENJOYED eating it! Scandalous! I can tell you, he ain't my friend anymore! Using coins that always land on one side as if they were real money. For more fairy tales and bed-time stories, read Derek's postings on piece values... :-) :-) :-)
This discussion is pointless. In dealing with a stochastic quantity, if your statistics are no good, your observations are no good, and any conclusions based on them utterly meaningless. Nothing of what you say here has any reality value; it is just your own fantasies. First you should have results, then it becomes possible to talk about what they mean. You have no result. Get statistically meaningful test results. If your method can't produce them, or you don't feel it important enough to make your method produce them, don't bother us with your cr*p instead. Two sets of piece values as different as day and night, and the only thing you can come up with is that their comparison is 'inconclusive'. Are you sure that you could conclusively rule out that a Queen is worth 7, or a Rook 8, by your method of 'playtesting'? Talk about pathetic: even the two games you played are the same. Oh man, does your test setup s*ck! If you cannot even decide simple issues like this, what makes you think you have anything meaningful to say about piece values at all?
CRC piece values tournament http://www.symmetryperfect.com/pass/ Just push the 'download now' button. Game #1 Scharnagl vs. Muller 10 minutes per move SMIRF MS-174c Result- inconclusive. Draw after 87 moves by black. Perpetual check declared.
'Of course, that is easily quantified. The entire mathematical field of statistics is designed to precisely quantify such things, through confidence levels and uncertainty intervals.' No, it is not easily quantified. Some things of numerical importance as well as geometric importance that we try to understand or prove in the study of chess variants are NOT covered or addressed by statistics. I wish our field of interest was that simple (relatively speaking) and approachable, but it is far more complicated and interdisciplinary. All you talk about is statistics. Is this because statistics is all you know well? ___________ 'That difference just can't be seen with two games. Play 100. There is no shortcut.' I agree. Not with only 2 games. However ... With only 4 games, IF they were ALL victories or defeats for the player using a given piece values model, I could tell you with confidence that there is at least a 15/16 chance that the given piece values model is stronger or weaker, respectively, than the piece values model used by its opponent. [Otherwise, the results are inconclusive and useless.] Furthermore, based upon the average number of moves per game required for victory or defeat compared to the established average number of moves in a long, close game, I could probably correctly estimate whether one model was a little or a lot stronger or weaker, respectively, than the other model. Thus, I will not play 100 games because there is no pressing, rational need to reduce the 'chance of random good-bad luck' to the ridiculously-low value of 'the inverse of (2 to the power 100)'. Is there anything about the odds associated with 'flipping a coin' that is beyond your ability to understand? This is a fundamental mathematical concept applicable without reservation to symmetrical playtesting. In any case, it is a legitimate 'shortcut' that I can and will use freely. ________________ 'Even perfect play doesn't help. We do have perfect play for all 6-men positions.' 
I meant perfect play throughout an entire game of a CRC variant involving 40 pieces initially. That is why I used the word 'impossible' with reference to state-of-the-art computer technology. _______________________________________________________ 'This is approximately master-level play.' Well, if you are getting master-level play from Joker80 with speed chess games, then I am surely getting a superior level of play from SMIRF with much longer times and deeper plies per move. You see, I used the term 'virtually random moves' appropriately in a comparative context based upon my experience. _____________________________________________ 'Doesn't matter if you play at an hour per move, a week per move, a year per move, 100 years per move. The error will remain >=32%. So if you want to play 100 years per move, fine. But you will still need 100 games.' Of course, it matters a lot. If the program is well-written, then the longer it runs per move, the more plies it completes per move and consequently, the better the moves it makes. Hence, the entire game played will progressively approach the ideal of perfect play ... even though this finite goal is impossible to attain. Incisive, intelligent, resourceful moves must NOT be confused with or dismissed as purely random moves. Although I could humbly limit myself to applying only statistical methods, I am totally justified, in this case, in more aggressively using the 'probabilities associated with N coin flips ALL with the same result' as an incomplete, minimum value before even taking the playing strength of SMIRF at extremely-long time controls into account to estimate a complete, maximum value. ______________________________________________________________ 'The advantage that a player has in terms of winning probability is the same at any TC I ever tried, and can thus equally reliably be determined with games of any duration.' 
You are obviously lacking completely in the prerequisite patience and determination to have EVER consistently used long enough time controls to see any benefit whatsoever in doing so. If you had ever done so, then you would realize (as everyone else who has done so realizes) that the quality of the moves improves and even if the winning probability has not changed much numerically in your experience, the figure you obtain is more reliable. [I cannot prove to you that this 'invisible' benefit exists statistically. Instead, it is an important concept that you need to understand in its own terms. This is essential to what most playtesters do, with the notable exception of you. If you want to understand what I do and why, then you must come to grips with this reality.]
Derek Nalls: | They definitely mean something ... although exactly how much is not | easily known or quantified (measured) mathematically. Of course that is easily quantified. The entire mathematical field of statistics is designed to precisely quantify such things, through confidence levels and uncertainty intervals. The only thing you proved with reasonable confidence (say 95%) is that two Rooks are not 1.66 Pawn weaker than a Queen. So if Q=950, then R > 392. Well, no one claimed anything different. What we want to see is if Q-RR scores 50% (R=475) or 62% (R=525). That difference just can't be seen with two games. Play 100. There is no shortcut. Even perfect play doesn't help. We do have perfect play for all 6-men positions. Can you derive piece values from that, even end-game piece values??? | Statistically, when dealing with speed chess games populated | exclusively with virtually random moves ... YES, I can understand and | agree with you requiring a minimum of 100 games. However, what you | are doing is at the opposite extreme from what I am doing via my | playtesting method. Where do you get this nonsense? This is approximately master-level play. Fact is that results from playing opening-type positions (with 35 pieces or more) are a stochastic quantity at any level of play we are likely to see the next few million years. And even if they weren't, so that you could answer the question 'who wins' through a 35-men tablebase, you would still have to make some average over all positions (weighted by relevance) with a certain material composition to extract piece values. And if you would do that by sampling, the result would again be a stochastic quantity. And if you would do it by exhaustive enumeration, you would have no idea which weights to use. And if you are sampling a stochastic quantity, the error will be AT LEAST as large as the statistical error. Errors from other sources could add to that. 
But if you have two games, you will have at least 32% error in the result percentage. Doesn't matter if you play at an hour per move, a week per move, a year per move, 100 years per move. The error will remain >= 32%. So if you want to play 100 years per move, fine. But you will still need 100 games. | Nonetheless, games played at 100 minutes per move (for example) have | a much greater probability of correctly determining which player has | a definite, significant advantage than games played at 10 seconds per | move (for example). Why do I get the suspicion that you are just making up this nonsense? Can you show me even one example where you have shown that a certain material advantage would be more than 3-sigma different for games at 100 min/move than for games at 1 sec/move? Show us the games, then. Be aware that this would require at least 100 games at each time control. That seems to make it a safe guess that you did not do that for 100 min/move. On the other hand, instead of just making things up, I have actually done such tests, not with 100 games per TC, but with 432, and for the faster ones even with 1728 games per TC. And there was no difference beyond the expected and unavoidable statistical fluctuations corresponding to those numbers of games, between playing 15 sec or 5 minutes. The advantage that a player has in terms of winning probability is the same at any TC I ever tried, and can thus equally reliably be determined with games of any duration. (Provided you have the same number of games). If you think it would be different for extremely long TC, show us statistically sound proof. I might comment on the rest of your long posting later, but have to go now...
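The 32%-for-two-games and 100-games claims follow directly from the square-root law. A small sketch, assuming the per-game standard deviation of 0.45 and the 12% Pawn-odds score that the same poster quotes elsewhere in this thread (function names mine):

```python
import math

SIGMA_PER_GAME = 0.45  # st.dev. of one game's score, as quoted in the thread
PAWN_ODDS = 0.12       # score advantage conferred by one extra Pawn (ditto)

def score_error(n_games: int) -> float:
    """Standard error of the average score over n independent games."""
    return SIGMA_PER_GAME / math.sqrt(n_games)

def error_in_pawns(n_games: int) -> float:
    """The same error expressed as an equivalent material imbalance in Pawns."""
    return score_error(n_games) / PAWN_ODDS

for n in (2, 100):
    print(n, round(score_error(n), 3), round(error_in_pawns(n), 2))
```

Two games give a standard error of about 0.32 (the 32% in the post), i.e. roughly 2.65 Pawns of material equivalent; 100 games bring it down to 0.045.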
'You hardly have the possibility of trading it before there are open files. So it stands to reason that you might as well use the higher value during the entire game.' Well, I understand and accept your reasons for leaving your lower rook value in CRC as is. It is interesting that you thoroughly understand and accept the reasons of others for using a higher rook value in CRC as well. Ultimately, is not the higher rook value in CRC more practical and useful to the game by your own logic? _____________________________ '... if we both play a Q-2R match from the opening, it is a serious problem if we don't get the same result. But you have played only 2 games. Statistically, 2 games mean NOTHING.' I never falsely claimed or implied that only 2 games at 10 minutes per move mean everything or even mean a great deal (to satisfy probability overwhelmingly). However, they mean significantly more than nothing. I cannot accept your opinion, based upon a purely statistical viewpoint, since it is to the exclusion of another applicable mathematical viewpoint. They definitely mean something ... although exactly how much is not easily known or quantified (measured) mathematically. __________________________________________________ 'I don't even look at results before I have at least 100 games, because before they are about as likely to be the reverse from what they will eventually be, as not.' Statistically, when dealing with speed chess games populated exclusively with virtually random moves ... YES, I can understand and agree with you requiring a minimum of 100 games. However, what you are doing is at the opposite extreme from what I am doing via my playtesting method. Surely you would agree that IF I conducted only 2 games with perfect play for both players that those results would mean EVERYTHING. Unfortunately, with state-of-the-art computer hardware and chess variant programs (such as SMIRF), this is currently impossible and will remain impossible for centuries-millennia. 
Nonetheless, games played at 100 minutes per move (for example) have a much greater probability of correctly determining which player has a definite, significant advantage than games played at 10 seconds per move (for example). Even though these 'deep games' are of nowhere near 600-times-better quality than these 'shallow games', as one might naively expect (due to a non-linear correlation), they are far from random events (to which statistical methods would then be fully applicable). Instead, they occupy a middleground between perfect play games and totally random games. [In my studied opinion, the example 'middleground games' are more similar to and closer to perfect play games than totally random games.] To date, much is unknown to combinatorial game theory about the nature of these 'middleground games'. Remember the analogy to coin flips that I gave you? Well, in fact, the playtest games I usually run go far above and beyond such random events in their probable significance per event. If the SMIRF program running at 90 minutes per move chose all of its moves randomly and without any intelligence at all (as a perfect woodpusher), only then would my 'coin flip' analogy be fully applicable. Therefore, when I estimate that it would require 6 games (for example) for me to determine, IF a player with a given set of piece values loses EVERY game, that there is only a 63/64 chance that the result is meaningful (instead of random bad luck), I am being conservative to the extreme. The true figure is almost surely higher than a 63/64 chance. By the way, if you doubt that SMIRF's level of play is intelligent and non-random, then play a CRC variant of your choice against it at 90 minutes per move. After you lose repeatedly, you may not be able to credit yourself with being intelligent either (although you should) ... if you insist upon holding an impractically high standard to define the word. 
______ 'If you find a discrepancy, it is enormously more likely that the result of your 2-game match is off from its true win probability.' For a 2-game match ... I agree. However, this may not be true for a 4-game, 6-game or 8-game match, and surely is not true to the extremes you imagine. Everything is critically dependent upon the specifications of the match. The number of games played (of course), the playing strength or quality of the program used, the speed of the computer and the time or ply depth per move are the most important factors. _________________________________________________________ 'Play 100 games, and the error in the observed score is reasonably certain (68% of the cases) to be below 4.5% ~ 1/3 Pawn, so 16 cP per Rook. Only then you can see with reasonable confidence if your observations differ from mine.' It would require est. 20 years for me to generate 100 games with the quality (and time controls) I am accustomed to and somewhat satisfied with. Unfortunately, it is not that important to me just to get you to pay attention to the results for the benefit of only your piece values model. As a practical concern to you, everyone else who is working to refine quality piece values models in FRC and CRC will likely have surpassed your achievements by then IF you refuse to learn anything from the results of others who use different yet valid and meaningful methods for playtesting and mathematical analysis than you.
To Derek: I am aware that the empirical Rook value I get is suspiciously low. OTOH, it is an OPENING value, and Rooks get their value in the game only late. Furthermore, this is only the BASE VALUE of the Rook; most pieces have a value that depends on the position on the board where it actually is, or where you can quickly get it (in an opening situation, where the opponent is not yet able to interdict your moves, because his pieces are in inactive places as well). But Rooks only increase their value on open files, and initially no open files are to be seen. In a practical game, by the time you get to trade 2 Rooks for a Queen, there usually are open files. So by that time, the value of the Q vs 2R trade will have gone up by two times the open-file bonus. You hardly have the possibility of trading it before there are open files. So it stands to reason that you might as well use the higher value during the entire game. In 8x8 Chess, the Larry Kaufman piece values include the rule that a Rook should be devalued by 1/8 Pawn for each Pawn on the board over five. In the case of 8 Pawns that is a really large penalty of 37.5cP for having no open files. If I add that to my opening value, the late middle-game / end-game value of the Rook gets to 512, which sounds a lot more reasonable. There are two different issues here: 1) The winning chances of a Q vs 2R material imbalance game 2) How to interpret that result as a piece value All I say above has no bearing on (1): if we both play a Q-2R match from the opening, it is a serious problem if we don't get the same result. But you have played only 2 games. Statistically, 2 games mean NOTHING. I don't even look at results before I have at least 100 games, because before that they are about as likely to be the reverse from what they will eventually be, as not. 
The standard deviation of the result of a single Gothic Chess game is ~0.45 (it would be 0.5 point if there were no draws possible, and in Gothic Chess the draw percentage is low). This error goes down as the square root of the number of games. In the case of 2 games this is 45%/sqrt(2) = 32%. The Pawn-odds advantage is only 12%. So this standard error corresponds to 2.66 Pawns. That is 1.33 Pawns per Rook. So with this test you could not possibly see if my value is off by 25, 50 or 75. If you find a discrepancy, it is enormously more likely that the result of your 2-game match is off from the true win probability. Play 100 games, and the error in the observed score is reasonably certain (68% of the cases) to be below 4.5% ~ 1/3 Pawn, so 16 cP per Rook. Only then you can see with reasonable confidence if your observations differ from mine.
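The Kaufman rook adjustment described in this post is a one-line computation. A sketch (function name mine), assuming the 475 cP opening Rook value quoted earlier in the thread and the stated 12.5 cP (1/8 Pawn) penalty per own pawn over five:

```python
# With 8 pawns the no-open-files penalty is 3 * 12.5 = 37.5 cP; adding it
# back to the 475 cP opening value gives the ~512 cP end-game value quoted.

def rook_endgame_value(opening_cp: float, pawns_at_opening: int = 8) -> float:
    """Opening value plus the open-file bonus recovered as pawns leave."""
    penalty = 12.5 * max(0, pawns_at_opening - 5)
    return opening_cp + penalty

print(rook_endgame_value(475))  # 512.5
```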
Before Scharnagl sent me three special versions of SMIRF MS-174c compiled with the CRC material values of Scharnagl, Muller & Nalls, I began playtesting something else that interested me using SMIRF MS-174b-O. I am concerned that the material value of the rook (especially compared to the queen) amongst CRC pieces in the Muller model is too low: rook 55.88 queen 111.76 This means that 2 rooks exactly equal 1 queen in material value. According to the Scharnagl model: rook 55.71 queen 91.20 This means that 2 rooks have a material value (111.42) 22.17% greater than 1 queen. According to the Nalls model: rook 59.43 queen 103.05 This means that 2 rooks have a material value (118.86) 15.34% greater than 1 queen. Essentially, the Scharnagl & Nalls models are in agreement in predicting victories in a CRC game for the player missing 1 queen yet possessing 2 rooks. By contrast, the Muller model predicts draws (or an approximately equal number of victories and defeats) in a CRC game for either player. I put this extraordinary claim to the test by playing 2 games at 10 minutes per move on an appropriately altered Embassy Chess setup, with the missing-1-queen player and the missing-2-rooks player each having a turn at white and black. The missing-2-rooks player lost both games and was always behind. They were not even long games at 40-60 moves. Muller: I think you need to moderately raise the material value of your rook in CRC. It is out of its proper relation with the other material values within the set.
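The percentage comparisons above can be reproduced from the quoted value pairs with a few lines of Python (a verification sketch only; the numbers are exactly those in the post):

```python
# The three CRC (rook, queen) value pairs quoted in the post.
models = {
    "Muller":    (55.88, 111.76),
    "Scharnagl": (55.71,  91.20),
    "Nalls":     (59.43, 103.05),
}

for name, (rook, queen) in models.items():
    surplus_pct = (2 * rook - queen) / queen * 100  # % by which 2R exceed 1Q
    print(f"{name}: two Rooks exceed one Queen by {surplus_pct:.2f}%")
```

This reproduces the 0% (Muller), 22.17% (Scharnagl) and 15.34% (Nalls) figures given above.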
Harm, I think of a simpler formula, because it seems easier to find an approximation than to weight a lot of parameters facing a lot of other unhandled strange effects. Therefore my lower-dimensional approach looks like: f(s := sum of unbalanced big pieces' values, n := number of unbalanced big pieces, v := value of the biggest opponent piece). So I intend to calculate the presumed value reduction e.g. as: (s - v*n)/constant P.S.: maybe it will make sense to limit v from below by s/(2*n) to prevent too big a reduction, e.g. when no big opponent piece would be present at all. P.P.S.: There have been some more thoughts of mine on this question. Let w := sum of the n biggest opponent pieces, limited from below by s/2. Then the formula should be: (s - w)/constant P.P.P.S.: My experiments suggest that the constant is about 2.0 P^4.S.: I have implemented this 'Elephantiasis Reduction' (as I will name it) in a new private SMIRF version and it is working well. My constant is currently 8/5. I found out that it is good to calculate one more piece than would be uncompensated, because that bottom piece pair could be of switched size and thus would reduce the reduction. Non-existing opponent pieces are replaced by a Knight's value within the calculation. I noticed a speeding up of SMIRF when searching for mating combinations (by normal play). I also noticed that SMIRF is making sacrifices that incorporate the vanishing of such penalties.
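The final form of the formula (the P.P.S. version with the 8/5 constant) can be sketched as follows. This is my reading of the post, not SMIRF's actual code: all names are mine, the 300 cP Knight value for missing opponent pieces is an assumption, and the "one more piece" refinement mentioned in the P^4.S. is omitted for brevity:

```python
# Sketch of Scharnagl's 'Elephantiasis Reduction': penalize big pieces
# that face no comparable opposing piece. Reduction = (s - w) / constant,
# with w the sum of the n biggest opponent pieces, floored at s/2.

CONSTANT = 8 / 5  # the value Scharnagl reports using in SMIRF

def elephantiasis_reduction(unbalanced_big: list, opponent_big: list) -> float:
    s = sum(unbalanced_big)           # sum of own unbalanced big pieces
    n = len(unbalanced_big)
    KNIGHT = 300.0                    # assumed stand-in for missing pieces
    biggest = sorted(opponent_big, reverse=True)[:n]
    biggest += [KNIGHT] * (n - len(biggest))
    w = max(sum(biggest), s / 2)      # floor w at s/2 to cap the reduction
    return max(0.0, (s - w) / CONSTANT)

# e.g. a lone 900 cP piece facing only a 500 cP piece:
print(elephantiasis_reduction([900.0], [500.0]))  # (900 - 500) / 1.6 = 250.0
```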
'I never found any effect of the time control on the scores I measure for some material imbalance. Within statistical error, the combinations I tried produced the same score at 40/15', 40/20', 40/30', 40/40', 40/1', 40/2', 40/5'. Going to even longer TC is very expensive, and I did not consider it worth doing just to prove that it was a waste of time...'
_________
The additional time I normally give playtesting games to improve move quality is partially wasted because, with most chess variant programs, I can only control the time per move instead of the number of plies completed. This usually results in the time expiring while the program is working on an incomplete ply. It then prematurely spits out a move representative of an incomplete tour of the moves available within that ply, cut off at a random fraction of that ply. Since there is always more than one move (often a few to several) under evaluation as the best possible move [otherwise, the chosen move would already have been executed], any move on this 'list of top candidates' is equally likely to be executed. Here are two typical scenarios that should cover what usually happens:

A. If the list of top candidates in an 11-ply search consists of 6 moves, where the list of top candidates in a 10-ply search consists of 7 moves, then only 1 discovered-to-be-less-than-the-best move has been successfully excluded from execution. Of course, an 11-ply search completion may typically require an estimated 8-10 times as much time as the search completions for all previous plies (1-ply through 10-ply) added together. OR

B. If the list of top candidates in an 11-ply search consists of 7 moves [moreover, the exact same 7 moves] as the preceding 10-ply search, then there is no benefit at all in expending 8-10 times as much time.
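The time-expiry behavior described above is essentially iterative deepening under a per-move time budget. A toy sketch, assuming a hypothetical `search_ply` callback that returns its best move so far and whether the ply completed (engine internals here are purely illustrative):

```python
import time

# Toy iterative-deepening loop: the engine finishes whole plies while time
# remains, and an expiring budget cuts off the deepest ply mid-search.
def search_with_budget(search_ply, budget_seconds, max_depth=64):
    deadline = time.monotonic() + budget_seconds
    best = None
    for depth in range(1, max_depth + 1):
        move, complete = search_ply(depth, deadline)
        if complete:
            best = move       # a fully completed ply is trustworthy
        elif best is None:
            best = move       # forced to use a partial result
        if not complete or time.monotonic() >= deadline:
            break
    return best
```

The complaint in the comment corresponds to the `elif` branch and the final partial iteration: the deepest, most expensive ply often contributes nothing, because it is abandoned at a random fraction of its move list.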
______________________________________________________________
The reason I endure this brutal waiting game is not for the purely masochistic experience but because the additional time has a tangible chance (although no guarantee) of yielding a better move on every occasion. Throughout the numerous moves within a typical game, it can realistically be expected to yield better moves on dozens of occasions.

We usually playtest for purposes at opposite extremes of the spectrum, yet I regard our efforts as complementary toward building a complete picture of the material values of pieces.

You use 'asymmetrical playtesting' with unequal armies at fast time controls, collecting and analyzing statistics ... to determine a range, with a margin of error, for individual material piece values. I remain amazed (although I believe you) that you actually obtain any meaningful results at all via games that are played so quickly that the AI players do not have 'enough time to think', in games so complex that every computer (and person) needs time to think to play with minimal competence. Can you explain to me, in a way I can understand, how and why you are able to successfully obtain valuable results using this method? The quality of your results was utterly surprising to me. I apologize for totally doubting you when you introduced your results and mentioned how you obtained them.

I use 'symmetrical playtesting' with identical armies at very slow time controls to obtain the best moves realistically possible from an evaluation function, thereby giving me a winner (that is, by some margin, more likely than not deserving) ... to determine which of two sets of material piece values is probably (yet not certainly) better. Nonetheless, as more games are likewise played ... if they present a clear pattern, then the results become more likely to be reliable, decisive and indicative of the true state of affairs.
The chances of flipping a coin once and it landing 'heads' are equal to it landing 'tails'. However, the chances of flipping a coin 7 times and it landing 'heads' all 7 times in a row are 1/128. Now, replace the concepts 'heads' and 'tails' with 'victory' and 'defeat'. I presume you follow my point. The results of only a modest number of well-played games can definitely establish their significance beyond chance, to the satisfaction of reasonable probability for a rational human mind. [Most of us, including me, do not need any better than 95%-99% confidence to become convinced that there is a real correlation at work, even though that falls far short of an absolute 100% mathematical proof.]

In my experience, I have found that using anything less than 10 minutes per move will cause at least one instance within a game where an AI player makes a move that is obvious to me (and correctly assessed as truly being) a poor move. Whenever this occurs, it renders my playtesting results tainted and useless for my purposes. Sometimes this occurs during a game played at 30 minutes per move; however, it rarely occurs during a game played at 90 minutes per move.

For my purposes, it is critically important above all other considerations that the winner of these time-consuming games be correctly determined 'most of the time', since 'all of the time' is impossible to assure. I must do everything within my power to get as far from 50% toward 100% reliability in correctly determining the winner. Hence, I am compelled to play test games at nearly the longest survivable time per move, to minimize the chance that any move played during a game will be an obviously poor move that could have changed the destiny of the game, causing the player that should have won to become the loser instead. In fact, I feel as if I have no choice under the circumstances.
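The coin-flip arithmetic above generalizes easily: under a 50/50 null hypothesis, a short unbeaten streak already pushes past the 95%-99% thresholds mentioned. A minimal sketch:

```python
import math

# probability that a fair 50/50 game produces k identical results in a row
def streak_probability(k):
    return 0.5 ** k

# smallest streak whose chance probability drops below 1 - confidence
def streak_for_confidence(confidence):
    return math.ceil(math.log2(1.0 / (1.0 - confidence)))

print(streak_probability(7))        # 1/128, as stated above
print(streak_for_confidence(0.95))  # 5 wins in a row suffice for 95%
print(streak_for_confidence(0.99))  # 7 wins in a row suffice for 99%
```

So the 7-in-a-row example in the comment is exactly the point at which chance drops below 1%.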
Reinhard, if I understand you correctly, what you basically want to introduce in the evaluation is terms of the type w_ij*N_i*N_j, where N_i is the number of pieces of type i of one side, N_j is the number of pieces of type j of the opponent, and w_ij is a tunable weight. So that, if type i = A and type j = N, a negative w_ij would describe a reduction of the value of each Archbishop by the presence of the enemy Knights, through the interdiction effect.

Such a term would, for instance, provide an incentive for the QA side to trade A in a QA vs ABNN position, as its A is suppressed in value by the presence of the enemy N (and B), while the opponent's A would not be similarly suppressed by our Q. On the contrary, our Q's value would be suppressed by the opponent's A as well, so trading A also benefits him there.

I guess it should be easy enough to measure whether terms of this form have significant values, by playing Q-BNN imbalances in the presence of 0, 1 and 2 Archbishops, and deducing from the score whose Archbishops are worth more (i.e. add more winning probability). And similarly for 0, 1, 2 Chancellors each, or extra Queens. And then the same thing with a Q-RR imbalance, to measure the effect of Rooks on the value of A, C or Q. In fact, every second-order term can be measured this way: not only cross products between own and enemy pieces, but also cooperative effects between own pieces of equal or different type. With 7 piece types for each side (14 in total) there would be 14*13/2 = 91 terms of this type possible.
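A minimal sketch of how such a second-order correction could be tabulated. The piece-type labels and the example weight are placeholders; the real w_ij would have to be measured by playtesting as described:

```python
from itertools import combinations

# 7 piece types per side, 14 in total, as in the comment
PIECE_TYPES = [side + p for side in "wb" for p in "PNBRACQ"]

def second_order_correction(counts, weights):
    """Sum w_ij * N_i * N_j over all unordered pairs of piece types.
    counts: type -> number on the board; weights: (type_i, type_j) -> w_ij."""
    total = 0.0
    for i, j in combinations(PIECE_TYPES, 2):
        total += weights.get((i, j), 0.0) * counts.get(i, 0) * counts.get(j, 0)
    return total

# 14 * 13 / 2 = 91 distinct cross terms, as stated above
print(len(list(combinations(PIECE_TYPES, 2))))   # 91

# e.g. a negative placeholder weight for white Archbishop vs black Knights:
print(second_order_correction({"wA": 1, "bN": 2}, {("wA", "bN"): -2.0}))  # -4.0
```

Note that this single table covers both the own-versus-enemy cross terms and the cooperative own-piece pairs, exactly as the comment suggests.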
| And by that this would create just the problem I have tried to
| demonstrate. The three Chancellors could not possibly be covered,
| thus disabling their potential to risk their own existence by
| entering squares already influenced by the opponent's side.

You make it sound like it is a disadvantage to have a stronger piece, because it cannot go to squares attacked by the weaker piece. To a certain extent this is true, if the difference in capabilities is not very large: then you might be better off ignoring the difference in some cases, as respecting it would actually deteriorate the value of the stronger piece to the point where it was weaker than the weak piece. (For this reason I set the B and N values in my 1980 chess program Usurpator to exactly the same value.) But if the difference between the pieces is large, then the fact that the stronger one can be interdicted by the weaker one is simply an integral part of its piece value.

And IMO this is not the reason the 4A-9N example is so biased. The problem there is that the pieces of one side are all worth more than TWICE those of the other. Rooks against Knights would not have the same problem, as they could still engage in R vs 2N trades, capturing a singly defended Knight in a normal exchange on a single square. But 3-vs-1 trades are almost impossible to enforce, and require very special tactics. It is easy enough to verify by playtesting that playing CCC vs AAA (as substitutes for the normal super-pieces) will simply produce 3 times the score excess of playing a normal setup with a C deleted on one side and an A on the other. The A side will still have only a single A to harass every C. Most squares in enemy territory will be covered by R, B, N or P anyway, in addition to A, so the C could not go there anyway. And it is not true that anything defended by A would be immune to capture by C, as A+anything > C (and even 2A+anything > 2C).
So defending with A will not exempt the opponent from defending as many times as there are attackers, using A as defenders. And if there were one other piece amongst the defenders, the C had no chance anyway. The effect you point out does not occur nearly as easily as you think. And, as you can see, only 5 of my different armies had duplicated super-pieces. All the other armies were just what you would get if you traded the mentioned pieces, thus detecting whether such a trade would enhance or deteriorate your winning chances.
Hard to see. You will wait for White to lose because of insufficient material, and I will await a loss by White because of the lonely-big-pieces disadvantage. The task will then be to find out the true reasons for the result.
I will try to create two arrays where each side thinks it has the advantage.
And thus I am convinced that I have to include this aspect in the detailed evaluation function of SMIRF's successor.
... This can still be done with a reasonably realistic mix of pieces, e.g. replacing Q and C on one side by A, and Q and A on the other side by C, so that you play 3C vs 3A, and then give additional Knight odds to the Chancellors. ...
And by that this would create just the problem I have tried to demonstrate. The three Chancellors could not possibly be covered, thus disabling their potential to risk their own existence by entering squares already influenced by the opponent's side.
Derek Nalls:
| Given enough years (working with only one server), this quantity of
| well-played games may eventually become adequate.

I never found any effect of the time control on the scores I measure for some material imbalance. Within statistical error, the combinations I tried produced the same score at 40/15', 40/20', 40/30', 40/40', 40/1', 40/2', 40/5'. Going to even longer TC is very expensive, and I did not consider it worth doing just to prove that it was a waste of time...

The way I see it, piece values are a quantitative measure of the amount of control that a piece contributes to steering the game tree in the direction of the desired evaluation. He who has more control can systematically force the PV in the direction of a better and better evaluation (for him). This is a strictly local property of the tree. The only advantage of deeper searches is that you average out this control (which fluctuates highly on a ply-by-ply basis) over more plies. But in playing the game, you average over all plies anyway.