
What is the primary statistical measure used to determine if a subreddit's user base is disproportionately composed of bots?



The primary statistical measure is a Benford's Law analysis of the leading digits in user IDs and post timestamps. Benford's Law predicts that in many naturally occurring sets of numbers, the digit 1 appears as the leading digit about 30.1% of the time, and each larger digit appears less often, down to about 4.6% for the digit 9. The law holds best for data that spans several orders of magnitude, so the comparison is most meaningful on large samples. A significant deviation from Benford's Law in the leading digits of either user IDs or post timestamps within a subreddit suggests artificial activity.

The argument for user IDs is that they are assigned as sequential integers, so a user base made up disproportionately of bot-generated accounts is expected to show a leading-digit distribution that is closer to uniform than to Benford's Law. A Chi-square test, which compares observed counts with expected counts, can be applied to the observed leading-digit distribution of user IDs and the distribution expected under Benford's Law. A high Chi-square statistic and a correspondingly low p-value (typically below 0.05) indicate a statistically significant deviation from Benford's Law, suggesting a disproportionate presence of bots.

Post timestamps can be examined in the same way. Human posting generally follows natural daily and weekly rhythms, whereas bots often post at consistent, machine-generated intervals, which produces non-random patterns in the leading digits of timestamps when bot activity is prevalent.

Because bots can also be programmed to mimic human behavior, more advanced tests complement the Benford analysis. One is the entropy of posting times, a measure of randomness that shows how varied the posting schedule of a particular account is. The lower the entropy of post times, the higher the likelihood that the submissions were not created organically by human beings.
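The two checks described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production bot detector: the function names (`benford_chi_square`, `hour_entropy`), the use of plain integer IDs, and the choice to measure entropy over the hour of day are all hypothetical choices made for the example. Instead of computing a p-value, the sketch compares the statistic with 15.507, the Chi-square critical value for 8 degrees of freedom at the 0.05 level.

```python
import math
from collections import Counter

# Expected Benford probability for each leading digit 1..9.
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

# Chi-square critical value for 8 degrees of freedom at alpha = 0.05.
# A statistic above this corresponds to a p-value below 0.05.
CHI2_CRITICAL_DF8_05 = 15.507


def leading_digit(n: int) -> int:
    """Return the first decimal digit of a non-zero integer."""
    return int(str(abs(n))[0])


def benford_chi_square(values):
    """Chi-square statistic of observed leading digits against Benford's Law."""
    values = [v for v in values if v != 0]  # zero has no leading digit
    n = len(values)
    counts = Counter(leading_digit(v) for v in values)
    return sum(
        (counts.get(d, 0) - n * p) ** 2 / (n * p) for d, p in BENFORD.items()
    )


def hour_entropy(hours):
    """Shannon entropy in bits of the distribution of posting hours (0-23)."""
    n = len(hours)
    counts = Counter(hours)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Illustrative inputs only (assumed, not real subreddit data).
benford_like = [2 ** k for k in range(1, 501)]   # powers of two follow Benford closely
uniform_like = list(range(100, 1000))            # every leading digit equally common

stat_benford = benford_chi_square(benford_like)
stat_uniform = benford_chi_square(uniform_like)

print("Benford-like sample deviates:", stat_benford > CHI2_CRITICAL_DF8_05)
print("Uniform sample deviates:", stat_uniform > CHI2_CRITICAL_DF8_05)

# An account that always posts in the same hour versus one spread over the day.
print("Fixed-hour entropy:", hour_entropy([3] * 50))
print("Spread-out entropy:", hour_entropy(list(range(24))))
```

In practice the inputs would be the numeric IDs and posting hours collected for one subreddit, and a flagged deviation would be a reason to look closer rather than proof of bot activity.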