Social Media User-Generated Content: A Large-Scale Dataset on AI Perception and Emotional Reactions

Dataset Overview

This dataset collects authentic user-generated content from social media platforms about generative artificial intelligence technologies, especially ChatGPT. It focuses on several dimensions of user response, including cognition, emotional reactions, and willingness to adopt. The sources are the ChatGPT subreddit on Reddit and the Twitter platform. Together they span 2012 to 2025 and exceed 10 TB, making the dataset a resource for research on the social perception and user acceptance of artificial intelligence.

Reddit posts: 170K+ (original content, 2021-2025)
Reddit comments: 1.9M+ (user feedback, 2021-2025)
Twitter tweets: 14B+ (social media data, 2012-2022)
Data volume: 10TB+ (all data combined)

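Dataset Composition

Reddit Dataset

Source: ChatGPT-related subreddit on Reddit (r/ChatGPT)
Time range: January 2021 to February 2025
Size: about 170,000 posts and about 1.9 million associated user comments
Content: user experiences with ChatGPT, feature exploration, technical discussion, emotional expression, and other varied content
File formats: JSON, CSV
Characteristics: reflects users' genuine perceptions of and emotional reactions to generative AI, with rich and varied content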
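
To make the JSON/CSV formats concrete, here is a minimal sketch of how posts and comments might be read and linked using only the Python standard library. The field names (`post_id`, `comment_id`, `title`, `body`), the choice of JSON Lines for comments, and the inline records are all placeholders invented for illustration. They are not the dataset's actual schema or content, so check the files on OSF for the real layout.

```python
import csv
import io
import json
from collections import defaultdict

# Placeholder inputs standing in for a posts CSV file and a comments
# JSON Lines file. Field names and values are hypothetical.
POSTS_CSV = """post_id,title
p1,Example post title
p2,Another example title
"""

COMMENTS_JSONL = """{"comment_id": "c1", "post_id": "p1", "body": "first example comment"}
{"comment_id": "c2", "post_id": "p1", "body": "second example comment"}
{"comment_id": "c3", "post_id": "p2", "body": "third example comment"}
"""


def read_posts(handle):
    """Return a dict mapping post_id to the post's row (a dict)."""
    return {row["post_id"]: row for row in csv.DictReader(handle)}


def read_comments(handle):
    """Yield one comment dict per non-empty JSON line."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


def group_comments_by_post(posts, comments):
    """Group comment bodies under the id of the post they belong to."""
    grouped = defaultdict(list)
    for comment in comments:
        if comment["post_id"] in posts:  # skip comments with no matching post
            grouped[comment["post_id"]].append(comment["body"])
    return dict(grouped)


posts = read_posts(io.StringIO(POSTS_CSV))
threads = group_comments_by_post(posts, read_comments(io.StringIO(COMMENTS_JSONL)))
for post_id, bodies in threads.items():
    print(post_id, posts[post_id]["title"], len(bodies))
```

With real files, the `io.StringIO(...)` arguments would be replaced by `open(path, newline="", encoding="utf-8")` handles, and the key names adjusted to match the published columns.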

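Twitter Dataset

Source: the Twitter social media platform
Time range: April 2012 to November 2022
Size: more than 14 billion tweets (about 1% of Twitter's global total), over 10 TB in volume
Content: covers public reactions from early AI discussions up to just before the release of ChatGPT
Data format: stored in AWS Athena and queried with SQL
Characteristics: large-scale historical data over a long time span, allowing changes in perceptions of AI technology to be traced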
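
The size figures for the Twitter data imply some rough averages, sketched below. This is only back-of-the-envelope arithmetic on the stated numbers. "More than 14 billion" and "about 1%" are treated as exact for illustration, so the results are order-of-magnitude estimates rather than measured values.

```python
# Figures as stated in the dataset description, treated as exact for illustration.
SAMPLE_TWEETS = 14_000_000_000   # "more than 14 billion" is a lower bound
SAMPLE_SHARE_PERCENT = 1         # "about 1%" of Twitter's global total

# Months covered, counting both April 2012 and November 2022.
months = (2022 - 2012) * 12 + (11 - 4) + 1

# Integer arithmetic avoids floating-point rounding on these large counts.
implied_total = SAMPLE_TWEETS * 100 // SAMPLE_SHARE_PERCENT
avg_per_month = SAMPLE_TWEETS // months

print(f"months covered: {months}")
print(f"implied global total: {implied_total:,}")
print(f"average sampled tweets per month: {avg_per_month:,}")
```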

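Data Value: Together, these data record ordinary users' reactions, emotional shifts, and degree of acceptance across the whole period from early AI discussions to the rise of large language models such as ChatGPT. The data offer temporal continuity, multiple sources, and broad scale, giving a distinctive view of the social acceptance of AI technology.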

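Dataset Access & Usage

Data Access Methods

Reddit data: All Reddit data have been uploaded directly to the OSF platform, and researchers can download them free of charge through the link below.

Twitter data: Because the Twitter data are very large (over 10 TB), they cannot be uploaded to OSF. They are currently hosted on AWS Athena and made available for SQL queries. Access must be requested from the dataset maintainer.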
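
Once access has been granted, Athena queries can be issued from Python. The following is a minimal sketch built around boto3's Athena client methods `start_query_execution` and `get_query_execution`. The database name `twitter_db`, the table name `tweets`, the columns `text` and `created_at` (assumed here to be a timestamp), the region, and the S3 output location are all hypothetical, since the actual schema and account details come from the maintainer.

```python
import time


def build_keyword_query(table, keyword, year, limit=100):
    """Build a SQL string selecting tweets from one year that mention a keyword.

    Table and column names are hypothetical. The keyword is escaped by
    doubling single quotes, which is enough for a sketch but not a
    substitute for proper parameter handling.
    """
    safe_keyword = keyword.replace("'", "''")
    return (
        f"SELECT text, created_at FROM {table} "
        f"WHERE year(created_at) = {int(year)} "
        f"AND lower(text) LIKE '%{safe_keyword.lower()}%' "
        f"LIMIT {int(limit)}"
    )


def run_query(client, sql, database, output_location, poll_seconds=2.0):
    """Submit a query to Athena and wait until it reaches a final state.

    `client` is expected to behave like boto3.client("athena").
    Returns the query execution id on success, raises RuntimeError otherwise.
    """
    started = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]
    while True:
        status = client.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state == "SUCCEEDED":
            return query_id
        if state in ("FAILED", "CANCELLED"):
            raise RuntimeError(f"Athena query {query_id} ended in state {state}")
        time.sleep(poll_seconds)  # still QUEUED or RUNNING


if __name__ == "__main__":
    print(build_keyword_query("tweets", "artificial intelligence", 2019, limit=10))
    # With credentials and access in place, something like:
    #   import boto3
    #   client = boto3.client("athena", region_name="us-east-1")
    #   run_query(client, sql, "twitter_db", "s3://your-bucket/athena-results/")
```

Athena bills by the amount of data scanned, so on a table of this size it is worth keeping queries narrow, for example by filtering on partition columns if the maintainer's schema provides them.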

⚠️ OSF.io Access Note

Please note: Accessing data on the OSF.io platform requires logging in with an ORCID account. This is a requirement of the OSF platform itself, not a setting chosen by the data provider. If you do not have an ORCID account, please register one at orcid.org before opening the data link.

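Access Reddit Dataset

Twitter Data Application

Because the Twitter data are so large, researchers should request access by contacting the address for their region:

Scholars in mainland China: martin@hust.edu.cn
Scholars in Hong Kong, Macao, and Taiwan: martin.mar@my.cityu.edu.hk
Scholars in other countries and regions: yongchao@upenn.edu (please contact during mainland China working hours where possible)

Application notes: Please briefly describe your intended research use, your institution, your research plan, and other basic information.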

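Data Usage Agreement

This dataset is released under the Creative Commons Attribution 4.0 International license (CC BY 4.0). Researchers using the dataset should follow these principles:

Attribution: the original data source must be properly cited whenever the data are used
Open sharing: the data may be freely shared, copied, remixed, transformed, and built upon
Academic ethics: use of the data should follow the ethical principles of academic research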

Dataset Citation & Acknowledgment

Citation Format

When using this dataset in academic papers, reports, or other research outputs, please use the following standard citation:

Ma, Y. (2025). Large-scale Social Media User-Generated Content Dataset on Perception and Emotional Reactions toward Generative Artificial Intelligence. Open Science Framework (OSF). https://osf.io/nbk36

Dataset Maintenance

The dataset is maintained by Dr. Yongchao Ma of Huazhong University of Science and Technology. Its collection drew on long-term collaborations with experts in academia and industry, technical support from high-performance computing (HPC) platforms, and experience gained from deploying large models in practice.

Keywords

Generative AI, ChatGPT, Social Media Analysis, Social Media Data, Sentiment Analysis, User-Generated Content, Large-scale Text Data, Large-scale Dataset, Reddit, Twitter, Text Mining, Consumer Perceptions, Emotional Reactions