AWS Data Stores for Machine Learning
Redshift
-
Data Warehousing, SQL based analytics (OLAP - Online analytical processing)
-
Whenever you need to run massively parallel SQL queries for analytics, Redshift is the way to go
-
Load data from S3 to Redshift or use Redshift Spectrum to query data directly in S3 (no loading)
-
Redshift has to be provisioned in advance: you set up a full data warehouse, then run your SQL analytics on it.
-
Used mainly for Analytics
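The S3-to-Redshift load mentioned above is usually done with a SQL COPY statement. The following is a minimal sketch that only builds the statement text; the table name, bucket, and IAM role ARN are hypothetical placeholders, and the CSV options are an assumption about the file format.

```python
def build_copy_statement(table: str, s3_uri: str, iam_role_arn: str) -> str:
    """Return a Redshift COPY statement that bulk-loads CSV files from S3."""
    if not s3_uri.startswith("s3://"):
        raise ValueError("s3_uri must start with s3://")
    return (
        f"COPY {table}\n"
        f"FROM '{s3_uri}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS CSV\n"      # assumes the files are CSV
        "IGNOREHEADER 1;"      # assumes each file has one header row
    )


# Hypothetical table, bucket, and role; replace with your own.
statement = build_copy_statement(
    table="training_events",
    s3_uri="s3://example-ml-bucket/events/",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleRedshiftCopyRole",
)
print(statement)
```

The resulting text would be run against the cluster with any SQL client. Redshift Spectrum avoids this step by querying the files in place.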
RDS, Aurora
-
Relational Store, SQL (OLTP - Online Transaction Processing)
-
Must provision servers in advance
-
The difference is that Redshift is column-based (data is organized by column), which suits analytical queries (OLAP), whereas RDS and Aurora are row-based (data is organized by row), which suits transactional workloads (OLTP).
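To make the row versus column distinction concrete, here is a small in-memory sketch. It is only an illustration of the two layouts, not how Redshift or RDS store data internally; the order records and function names are made up for the example.

```python
# Three hypothetical orders, used only for illustration.
rows = [
    {"order_id": 1, "customer": "ana", "amount": 20.0},
    {"order_id": 2, "customer": "bo", "amount": 35.5},
    {"order_id": 3, "customer": "ana", "amount": 12.25},
]

# Column-oriented layout: one list per column, built from the same records.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def fetch_order(order_id: int) -> dict:
    """OLTP-style access: read one whole record, which the row layout keeps together."""
    return next(row for row in rows if row["order_id"] == order_id)


def total_amount() -> float:
    """OLAP-style access: aggregate a single column without touching the others."""
    return sum(columns["amount"])


print(fetch_order(2))
print(total_amount())
```

Fetching or updating a single order touches one row, while summing amounts only needs one column, which is why analytics engines tend to favor the columnar layout.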
DynamoDB
-
NoSQL data store
-
Serverless, so you don't need to provision server instances in advance
- You only specify how much read/write capacity you want
-
Useful to store the output of a machine learning model that your application serves
- Will not be doing ML here, but ML output may be stored here
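As a sketch of what storing ML output in DynamoDB might look like, the snippet below encodes one prediction in DynamoDB's low-level attribute-value format. The attribute names, the `predictions` table, and the helper function are hypothetical; the actual write is left as a comment because it needs boto3 and AWS credentials.

```python
def to_dynamodb_item(user_id: str, model_version: str, score: float) -> dict:
    """Encode one prediction in DynamoDB's low-level attribute-value format."""
    return {
        "user_id": {"S": user_id},
        "model_version": {"S": model_version},
        # DynamoDB transmits numbers as strings under the "N" type.
        "score": {"N": str(score)},
    }


item = to_dynamodb_item("user-42", "v1", 0.93)
print(item)

# With boto3 installed and credentials configured, the item could be written with:
#   import boto3
#   boto3.client("dynamodb").put_item(TableName="predictions", Item=item)
```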
S3
-
Object storage
-
Serverless, virtually unlimited storage
-
Integration with most AWS Services
OpenSearch (previously Elasticsearch)
-
Indexing of data
-
Search amongst data points
-
Clickstream Analytics
-
Note, there is no ML capability directly embedded in OpenSearch
ElastiCache
-
Caching mechanism
-
Not really used for Machine Learning
-
Caches data so that it can be accessed quickly and easily.
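A common way to use a cache like this is the cache-aside pattern: check the cache first and fall back to the slower store only on a miss. The sketch below uses a small in-memory class as a stand-in for an ElastiCache (Redis or Memcached) client; `TTLCache`, `slow_lookup`, and `get_with_cache` are hypothetical names for illustration.

```python
import time


class TTLCache:
    """Tiny in-memory cache with per-entry expiry, standing in for a real cache client."""

    def __init__(self, ttl_seconds: float):
        self.ttl_seconds = ttl_seconds
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            # Entry has expired: drop it and report a miss.
            del self._store[key]
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, time.monotonic() + self.ttl_seconds)


calls = {"count": 0}


def slow_lookup(key: str) -> str:
    """Stand-in for a slow database query; counts how often it is invoked."""
    calls["count"] += 1
    return key.upper()


def get_with_cache(cache: TTLCache, key: str) -> str:
    """Cache-aside read: try the cache, then fall back to the slow source."""
    value = cache.get(key)
    if value is None:
        value = slow_lookup(key)
        cache.set(key, value)
    return value


cache = TTLCache(ttl_seconds=60)
get_with_cache(cache, "abc")  # miss: calls slow_lookup
get_with_cache(cache, "abc")  # hit: served from the cache
print(calls["count"])
```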