Hacker News
New Top Score on AI2 Reasoning Challenge (ARC) Is 53.84% (allenai.org)
30 points by alexwg on Jan 5, 2019 | hide | past | favorite | 6 comments


The challenge appears to be to build an AI system that answers as many of the multiple-choice questions as possible correctly.

Here are some of the questions from the dataset:

• Which of these is inherited by a person from his or her parents? (A) short hair (B) long arms (C) pierced ears (D) scar on the leg

• Which object occupies the greatest amount of space? (A) a galaxy (B) a black hole (C) a neutron star (D) a solar system

• What is a similarity between sound waves and light waves? (A) Both carry energy. (B) Both travel in vacuums. (C) Both are caused by vibrations. (D) Both are traveling at the same speed.

There's an entry called "Guess All" that scored 25%, as you might expect.
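The 25% checks out for four-option multiple choice. A quick sanity check in Python (just a simulation of uniform guessing, not anything taken from the actual leaderboard):

```python
import random

random.seed(0)

def guess_all_score(n_questions, n_choices=4):
    """Simulate a 'Guess All'-style entry: pick one of the
    n_choices options uniformly at random for every question."""
    correct = sum(random.randrange(n_choices) == 0 for _ in range(n_questions))
    return correct / n_questions

print(guess_all_score(1_000_000))  # hovers around 1/4 = 0.25 for large n
```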

They provide a list of 14 million science-related sentences (presumably for training), but there's no requirement to rely solely on them to solve the challenge. The list has been scraped from web search results, so it looks quite noisy.


4-choose-1 makes 53% a lot more impressive than when I assumed it was true/false. I had, like the other poster, assumed it was a sum of random chance, statistical deviation, and publication bias, but 53% suggests it's doing a little better than that.


I oversimplified in my previous post.

You're usually going to get a result that's slightly better than random chance by applying some really-not-all-that-impressive baseline strategy, like looking for bag-of-words overlap between a question and a known factoid for which you have a stored answer. This strategy is easy to fool: if you have a factoid like "Dogs that chase cats are dangerous" and a question is asked like "Are cats that chase dogs dangerous?", it might answer "Yes" because the question matches the stored fact. But the answer will actually be "Yes" in more than 50% of all cases: since dogs tend to chase cats, and cats tend not to chase dogs, the concept of cats chasing dogs is less likely to come up in a question-answering context than the concept of dogs chasing cats. There are obviously numerous other ways this is going to fail, e.g. "Trump believes the world is flat" might be a factoid and the question might be "Is the world flat?".
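A minimal sketch of that kind of bag-of-words baseline (my own toy version, not the thesis's actual baseline):

```python
def bag_of_words(text):
    """Lowercased word set; a real baseline would also handle
    stopwords, stemming and punctuation."""
    return set(text.lower().split())

def overlap_answer(question, factoids, threshold=0.8):
    """Toy entailment baseline: answer 'yes' iff some stored factoid
    shares at least `threshold` of its words with the question.
    Word order is ignored, which is exactly why it can't tell
    'dogs that chase cats' apart from 'cats that chase dogs'."""
    q = bag_of_words(question)
    return any(len(q & bag_of_words(f)) >= threshold * len(bag_of_words(f))
               for f in factoids)

facts = ["Dogs that chase cats are dangerous"]
print(overlap_answer("Are cats that chase dogs dangerous?", facts))
# answers True, even though the roles are reversed
```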

The result I previously oversimplified should more correctly be stated as follows: answers coming from the "artificial intelligence" systems submitted to the conference were analyzed to see whether any answer they gave represented a deviation from the baseline that fixed an error the baseline made, versus introducing an error the baseline wouldn't have made. The submissions were then analyzed to see if there was a pattern whereby more errors were fixed than introduced. It was found that the statistical pattern was basically that random deviation was introduced into the baseline, and that systems which, as a result of such random deviations, scored very badly weren't submitted to the conference, thus generating scores that were consistently more likely to be better than the baseline rather than worse, but not in a meaningful way.
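That selection effect is easy to reproduce in simulation (a sketch with made-up parameters, not the thesis's actual analysis):

```python
import random
import statistics

random.seed(1)

def deviate(baseline_correct, n_flips, n_choices=4):
    """Randomly change the answer on n_flips questions. A changed
    answer on a question the baseline got right is always wrong;
    on a question it got wrong, the new pick is right with
    probability 1/(n_choices - 1)."""
    new = list(baseline_correct)
    for i in random.sample(range(len(new)), n_flips):
        new[i] = (not new[i]) and random.random() < 1 / (n_choices - 1)
    return new

def publication_bias_demo(n_questions=500, n_systems=300, n_flips=80,
                          baseline_acc=0.25):
    baseline = [random.random() < baseline_acc for _ in range(n_questions)]
    b_score = sum(baseline) / n_questions
    scores = [sum(deviate(baseline, n_flips)) / n_questions
              for _ in range(n_systems)]
    # Publication bias: systems that ended up below the baseline
    # are never submitted, so only the lucky ones are visible.
    published = [s for s in scores if s > b_score]
    return b_score, statistics.mean(scores), statistics.mean(published)

b, all_mean, pub_mean = publication_bias_demo()
# all_mean stays near the baseline; pub_mean sits above it,
# without any single system being genuinely better.
```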

To reiterate the source: It can be found on page 43 here:

http://richard.bergmair.eu/pub/thesis.pdf

But that applies to the old RTE conference. So it would be interesting to see if the same holds true here.


This leaderboard tries to prevent that kind of problem:

> The dataset is partitioned into a Challenge Set and an Easy Set, where the former contains only questions answered incorrectly by both a retrieval-based algorithm and a word co-occurrence algorithm. This leaderboard is for the Challenge Set.

Additionally, I don't think they let you try often enough to get a meaningful chance at significantly beating the baseline with just pure randomness.
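For intuition on why the limited number of attempts matters: the probability of beating a 25%-chance baseline by a visible margin on a large test set shrinks very fast. A stdlib-only sketch (the test-set size here is a placeholder, not the real ARC count):

```python
from math import comb

def p_at_least(k, n, p=0.25):
    """P(X >= k) for X ~ Binomial(n, p): the chance of getting at
    least k of n four-way questions right by uniform guessing."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

n = 1000  # placeholder test-set size, just for the shape of the numbers
print(p_at_least(260, n))  # 26%: a lucky run could plausibly reach this
print(p_at_least(300, n))  # 30%: already well under 1 in 1000
```

So without many repeated submissions, pure randomness has essentially no chance of reaching anything like 53%.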


Are there fundamental limits on the size of black holes, solar systems or galaxies? Or is the question intentionally ambiguous?


A bit reminiscent of the "Recognizing Textual Entailment (RTE) Challenge", which was run under the "Text Analysis Conference" umbrella and hosted by NIST until it was discontinued a few years back. An interesting insight from a qualitative analysis of the deviations of submitted answers from the gold-standard answers is that they can be explained surprisingly well by random choice plus publication bias. See here:

http://richard.bergmair.eu/pub/thesis.pdf [page 43]

That's what the number "53.84%" sounds like to me.



