GitHub confirms AI training uses public data

Summary
– GitHub will start using customer interaction data from Copilot Free, Pro, and Pro+ users to train its AI models beginning April 24.
– Copilot Business, Copilot Enterprise users, and verified students/teachers are exempt from this data use policy.
– Affected users can opt out via their account settings, following an opt-out model aligned with US industry practices.
– The data collected includes accepted model outputs, code snippets, cursor context, documentation, file names, and user feedback.
– The policy allows code from private repositories to be used for training if the user has not disabled the setting, altering the typical understanding of “private.”
Starting next month, GitHub will incorporate user interaction data into the training of its AI models. This data includes accepted or modified model outputs, code snippets shown as inputs, and the surrounding code context from a user’s cursor position. The policy, effective April 24, applies to users of Copilot Free, Pro, and Pro+ tiers. However, Copilot Business and Copilot Enterprise customers, along with verified students and teachers, are exempt due to their existing contract terms.
Affected users can opt out by navigating to their settings and disabling the feature labeled “Allow GitHub to use my data for AI model training.” This follows an opt-out model aligned with U.S. industry norms, rather than the stricter opt-in requirements common in Europe. GitHub’s chief product officer, Mario Rodriguez, encourages participation, arguing that contributing data helps models better understand development workflows and deliver more accurate, secure code suggestions. He points to internal improvements, such as a higher acceptance rate for AI suggestions, after incorporating interaction data from Microsoft employees.
The company justifies the move by noting that peers like Anthropic, JetBrains, and Microsoft itself have similar data-use policies. The specific interaction data collected extends beyond code to include comments, documentation, file names, repository structures, chat interactions with Copilot, and user feedback like thumbs-up or thumbs-down ratings.
This shift raises questions about the definition of private repositories. While traditionally described as accessible only to the owner and explicitly invited collaborators, the new policy means code from these repositories can be used for model training if the user has the setting enabled. GitHub’s FAQ clarifies that snippets from private repos may be collected during active Copilot sessions when the training setting is on.
Community reaction on GitHub’s discussion forum has been largely critical. At the time of writing, emoji reactions showed 59 thumbs-down votes compared to just three “rocket ship” emojis indicating support. Among dozens of comment threads, only GitHub VP Martin Woodward publicly endorsed the change. Some observers note that user indignation may be tempered by the fact that OpenAI’s Codex, the model that originally powered Copilot, was fine-tuned on vast amounts of publicly available code from GitHub itself. This highlights a broader industry pattern in which AI model training has long relied on data gathered without explicit, enthusiastic consent.
(Source: Theregister.com)


