Question about a very strange database state

botang · posted on 2016-12-13 21:01:26

botang posted on 2016-12-13 21:00
No. A little further down from it you will see "Completed checkpoint up to RBA".
As covered in class: all fast_sta ...

As for the problem above, switching to 11.2.0.4 + Red Hat will not improve it.

lujiaguai · posted on 2016-12-13 22:25:38

This post was last edited by lujiaguai on 2016-12-14 09:28

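Mr. Tang, I checked the operating system and found that limits.conf was never set up. I am not sure whether I am misreading it; during installation there is a prompt to run a script for these settings, so how could they have been skipped?
Is this what caused the process problem?
Below are the limits.conf changes we were required to make in class: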
echo "oracle soft nproc 2047" >>/etc/security/limits.conf
echo "oracle hard nproc 16384" >>/etc/security/limits.conf
echo "oracle soft nofile 1024" >>/etc/security/limits.conf
echo "oracle hard nofile 65536" >>/etc/security/limits.conf
echo 'if [ $USER = "oracle" ]; then' >>  /etc/profile
echo ' if [ $SHELL = "/bin/ksh" ]; then' >> /etc/profile
echo '  ulimit -p 16384' >> /etc/profile
echo '  ulimit -n 65536' >> /etc/profile
echo ' else' >> /etc/profile
echo '  ulimit -u 16384 -n 65536' >> /etc/profile
echo ' fi' >> /etc/profile
echo 'fi' >> /etc/profile
What do these last few /etc/profile lines mean? Do they have to be executed as well?

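Below is the full output of cat /etc/security/limits.conf on the machine; there are no active entries in it. Am I misreading it, or is it really empty?
Also, /etc/profile contains none of the settings above.
I checked the parameters in /etc/sysctl.conf and they are correct. Today I wanted to change sem from 32000 to 320000 and checked the other files while I was at it, which is how I found that limits.conf and profile are missing their content.
After adding the content to these two files, does it take effect immediately, or is another command needed to make it take effect?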
[oracle@OADB1 ~]$ cat /etc/security/limits.conf
# /etc/security/limits.conf
#
#Each line describes a limit for a user in the form:
#
#<domain>       <type>  <item>  <value>
#
#Where:
#<domain> can be:
#       - an user name
#       - a group name, with @group syntax
#       - the wildcard *, for default entry
#       - the wildcard %, can be also used with %group syntax,
#                for maxlogin limit
#
#<type> can have the two values:
#       - "soft" for enforcing the soft limits
#       - "hard" for enforcing hard limits
#
#<item> can be one of the following:
#       - core - limits the core file size (KB)
#       - data - max data size (KB)
#       - fsize - maximum filesize (KB)
#       - memlock - max locked-in-memory address space (KB)
#       - nofile - max number of open files
#       - rss - max resident set size (KB)
#       - stack - max stack size (KB)
#       - cpu - max CPU time (MIN)
#       - nproc - max number of processes
#       - as - address space limit (KB)
#       - maxlogins - max number of logins for this user
#       - maxsyslogins - max number of logins on the system
#       - priority - the priority to run user process with
#       - locks - max number of file locks the user can hold
#       - sigpending - max number of pending signals
#       - msgqueue - max memory used by POSIX message queues (bytes)
#       - nice - max nice priority allowed to raise to values: [-20, 19]
#       - rtprio - max realtime priority
#
#<domain>    <type>  <item>       <value>
#

#*             soft core          0
#*             hard rss          10000
#@student       hard nproc          20
#@faculty       soft nproc          20
#@faculty       hard nproc          50
#ftp          hard nproc          0
#@student       -    maxlogins    4

# End of file
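With a problem like this, I even worry that the OS dependency packages on this machine are incomplete too, and that whoever installed it simply ignored some of them and moved on to the next step.
Is there a way to check these dependency packages, that is, to rerun the check performed when the database software was installed? Or can I only compare them one by one against the official documentation?
I feel that replacing the operating system and the database with fresh installs would solve the problem; as long as we follow the steps from Mr. Tang's class, these inexplicable problems should not appear.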

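I also checked another database that switches logs even more often, about 3 switches per minute. Its MTTR value is 0, it also has incremental checkpoints, and it also shows a large number of "checkpoint not complete" messages.
But limits.conf and profile on that server are both correct, and it has no "process died" problem.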

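"No. A little further down from it you will see 'Completed checkpoint up to RBA'.
As covered in class: every fast_start_mttr_target incremental checkpoint is recorded late. When the checkpoint for a given SCN starts, there are still dirty blocks in memory older than that SCN that have not been written. If there are many of them, you will of course see 'Checkpoint not complete' often, and it is the next checkpoint event, or the one after, that reports 'Completed checkpoint up to RBA'.
Check that fast_start_mttr_target is not 0."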

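About this: in real databases with MTTR=0 there are incremental checkpoints too. Compared with having an MTTR value set, does that make any difference for "checkpoint not complete"?
Judging from the other database mentioned above, the one that switches logs more often: its MTTR is 0 and its log is also full of continuous "checkpoint not complete" messages, yet it seems to run fine with no problems.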
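
To compare two databases on this point it helps to put numbers on it. A minimal sketch, assuming the 11g text alert log, where a log switch is recorded as "Thread N advanced to log sequence ..."; the function names are made up for illustration and the log path is whatever your own instance uses:

```shell
# Count "Checkpoint not complete" messages in an alert log.
# $1: path to the alert log (an assumption; pass your own file).
count_checkpoint_waits() {
    # grep -c prints the number of matching lines; -i ignores case.
    grep -ci 'checkpoint not complete' "$1"
}

# Count log switches recorded in the same alert log.
count_log_switches() {
    grep -c 'advanced to log sequence' "$1"
}
```

Dividing the first count by the second gives a rough idea of what share of log switches had to wait for a checkpoint.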

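About the db_writer_processes parameter: as I remember from class, Oracle sets it itself, the default is CPU count / 8, and with 8 cores or fewer the default is 1.
In class we did change this value for a demonstration, though. I remember raising it to at least 4 or 8, I can't recall exactly.
With fewer than 8 CPU cores, can this value really be changed to 2 in production?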
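
The default rule as described here (CPU count / 8, never below 1) works out like this. It is only a sketch of the arithmetic with a hypothetical function name, not a query against the instance:

```shell
# Default number of DB writer processes under the rule quoted above:
# integer division of the CPU count by 8, with a floor of 1.
default_dbwr() {
    dd_n=$(( $1 / 8 ))
    if [ "$dd_n" -lt 1 ]; then
        dd_n=1
    fi
    echo "$dd_n"
}
```

For example, `default_dbwr 16` gives 2, while any machine with 8 cores or fewer gives 1. The value the instance actually uses is the one shown by `show parameter db_writer_processes` in SQL*Plus.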

In limits.conf, the documentation lists the stack segment, but the class script does not; it only has nofile and nproc. Does stack need to be added here?

Resource Shell Limit                            Resource  Soft Limit         Hard Limit
Open file descriptors                           nofile    at least 1024      at least 65536
Number of processes available to a single user  nproc     at least 2047      at least 16384
Size of the stack segment of the process        stack     at least 10240 KB  at least 10240 KB, and at most 32768 KB
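
The thresholds in this table can be turned into a quick check on a single value. This is a rough sketch; `check_limit` is a hypothetical helper, and the numbers are copied from the table above rather than read from any Oracle tool:

```shell
# Compare one limits.conf value against the documented minimum/maximum.
# Usage: check_limit ITEM TYPE VALUE   (e.g. check_limit nproc hard 16384)
check_limit() {
    cl_item=$1; cl_type=$2; cl_value=$3
    cl_min=; cl_max=
    case "$cl_item/$cl_type" in
        nofile/soft) cl_min=1024 ;;
        nofile/hard) cl_min=65536 ;;
        nproc/soft)  cl_min=2047 ;;
        nproc/hard)  cl_min=16384 ;;
        stack/soft)  cl_min=10240 ;;
        stack/hard)  cl_min=10240; cl_max=32768 ;;
    esac
    if [ -z "$cl_min" ]; then
        echo "unknown"                      # item/type not in the table
    elif [ "$cl_value" -lt "$cl_min" ]; then
        echo "too low"
    elif [ -n "$cl_max" ] && [ "$cl_value" -gt "$cl_max" ]; then
        echo "too high"                     # only stack/hard has an upper bound
    else
        echo "ok"
    fi
}
```

The class script's values (2047, 16384, 1024, 65536) all come back as "ok"; the stack rows are the only ones it leaves unset.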

botang · posted on 2016-12-14 09:43:33

This post was last edited by botang on 2016-12-14 09:47

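lujiaguai posted on 2016-12-13 22:25
Mr. Tang, I checked the operating system and found that limits.conf was never set up. I am not sure whether I am misreading it; during installation there is a prompt for these settings ...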

echo 'if [ $USER = "oracle" ]; then' >>  /etc/profile
echo ' if [ $SHELL = "/bin/ksh" ]; then' >> /etc/profile
echo '  ulimit -p 16384' >> /etc/profile
echo '  ulimit -n 65536' >> /etc/profile
echo ' else' >> /etc/profile
echo '  ulimit -u 16384 -n 65536' >> /etc/profile
echo ' fi' >> /etc/profile
echo 'fi' >> /etc/profile
What do these last few /etc/profile lines mean? Do they have to be executed as well?

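You don't execute them; they are written into /etc/profile, and they come from the official documentation. The oracle user's login shell runs them automatically. What they do is lift the operating system's process limit for that user.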

For the operating system, compare the packages against the output of rpm -qa. The pdksh package is closely tied to processes.
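
The comparison can be scripted instead of done by eye. A minimal sketch, assuming the installed packages have been dumped to a file with one name.arch per line; `missing_packages` is a hypothetical helper, and it checks names only, not versions:

```shell
# Print every required package that is absent from the installed list.
# $1: file with one installed package per line, e.g. "glibc-devel.i686"
# remaining arguments: required packages in the same name.arch form
missing_packages() {
    mp_list=$1
    shift
    for mp_pkg in "$@"; do
        # -x matches the whole line, -F treats names like libstdc++ literally
        grep -qxF "$mp_pkg" "$mp_list" || echo "$mp_pkg"
    done
}
```

On the server the list could be produced with `rpm -qa --qf '%{NAME}.%{ARCH}\n' > /tmp/installed.txt` and then checked with, for example, `missing_packages /tmp/installed.txt ksh.x86_64 glibc-devel.i686`. The file path is only an example.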

Does this database run a lot of batch jobs or import/export operations during peak hours?

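limits.conf is read by Linux PAM, so changes take effect as soon as they are written, without a reboot. They apply to new login sessions; processes that are already running keep their old limits.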
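
To see whether a limits.conf file has any real entries, as opposed to only the commented-out examples, something like the following can help. It is a sketch with a hypothetical function name; it reads only the file it is given:

```shell
# Print the active entries of a limits.conf-style file, skipping comments
# and blank lines. With a second argument, show only that domain (user).
# Usage: active_limits FILE [DOMAIN]
active_limits() {
    awk -v dom="${2:-}" '
        /^[ \t]*#/ { next }    # comment line
        NF < 4     { next }    # blank or incomplete line
        dom == "" || $1 == dom { print $1, $2, $3, $4 }
    ' "$1"
}
```

`active_limits /etc/security/limits.conf oracle` printing nothing means the file sets no limits for oracle. Files under /etc/security/limits.d/ are read as well and are worth checking the same way, and `su - oracle -c 'ulimit -a'` shows what a fresh login session actually gets.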

Since this database has reported Private Strand Flush, you can try adding one DBWR without worrying about the CPU count for now. But don't add too many DBWRs.

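Change MTTR to 3600; a production database should not have it at 0. As long as there are not too many "checkpoint not complete" messages, that counts as normal.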
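
Setting the value is a single ALTER SYSTEM statement. The sketch below only prints that statement after a sanity check on the number; `mttr_sql` is a hypothetical helper, and SCOPE=BOTH assumes the instance runs from an spfile:

```shell
# Print an ALTER SYSTEM statement for FAST_START_MTTR_TARGET (in seconds).
# Rejects anything outside 0..3600, the documented range of the parameter.
mttr_sql() {
    case "$1" in
        ''|*[!0-9]*) echo "not a number: $1" >&2; return 1 ;;
    esac
    if [ "$1" -gt 3600 ]; then
        echo "out of range: $1" >&2
        return 1
    fi
    echo "ALTER SYSTEM SET fast_start_mttr_target=$1 SCOPE=BOTH;"
}
```

Once the output looks right it can be piped into SQL*Plus, for example `mttr_sql 3600 | sqlplus -s "/ as sysdba"`.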

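lujiaguai · posted on 2016-12-14 15:21:16

This post was last edited by lujiaguai on 2016-12-14 15:30

I went through the official documentation and checked it against the installed packages. These are missing: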
•       compat-libcap1-1.10-1 (x86_64)
•       glibc-devel-2.12-1.7.el6.i686
•       libstdc++-4.4.4-13.el6.i686
•       libstdc++-devel-4.4.4-13.el6.i686
•       ksh
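ksh, which Mr. Tang mentioned, is on the list. Is it the same thing as pdksh? The documentation does not mention pdksh.
The documentation reads as follows: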
Note:
Starting with Oracle Database 11g Release 2 (11.2.0.2), all the 32-bit packages, except for gcc-32bit-4.3, listed in the following table are no longer required for installing a database on Linux x86-64. Only the 64-bit packages are required.
However, for any Oracle Database 11g release before 11.2.0.2, both the 32-bit and 64-bit packages listed in the following table are required.
•       The following or later version of packages for Oracle Linux 6, and Red Hat Enterprise Linux 6 must be installed:
•       binutils-2.20.51.0.2-5.11.el6 (x86_64)
•       compat-libcap1-1.10-1 (x86_64)
•       compat-libstdc++-33-3.2.3-69.el6 (x86_64)
•       compat-libstdc++-33-3.2.3-69.el6.i686
•       gcc-4.4.4-13.el6 (x86_64)
•       gcc-c++-4.4.4-13.el6 (x86_64)
•       glibc-2.12-1.7.el6 (i686)
•       glibc-2.12-1.7.el6 (x86_64)
•       glibc-devel-2.12-1.7.el6 (x86_64)
•       glibc-devel-2.12-1.7.el6.i686
•       ksh
•       libgcc-4.4.4-13.el6 (i686)
•       libgcc-4.4.4-13.el6 (x86_64)
•       libstdc++-4.4.4-13.el6 (x86_64)
•       libstdc++-4.4.4-13.el6.i686
•       libstdc++-devel-4.4.4-13.el6 (x86_64)
•       libstdc++-devel-4.4.4-13.el6.i686
•       libaio-0.3.107-10.el6 (x86_64)
•       libaio-0.3.107-10.el6.i686
•       libaio-devel-0.3.107-10.el6 (x86_64)
•       libaio-devel-0.3.107-10.el6.i686
•       make-3.81-19.el6
•       sysstat-9.0.4-11.el6 (x86_64)

Also, should the oracle stack *** entries be written into limits.conf? The class script does not have them, but the documentation does.

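I will try adding a DBWR when I get the chance. This database has other problems as well: there are more than 500 invalid objects, all under the users the applications actually use, not under system accounts such as sys.
Do these invalid objects need to be dealt with?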

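There are no import or export operations during working hours. The database administrator uses the crudest method there is: a sh script runs expdp at night.
Then, when the test environment needs it, this dump file is imported there to keep the data in sync.
This is the simplest, most brute-force approach, with no OGG and no DG. Mr. Hong said that when production and test environments run side by side, the best practice is DG.
It is a pity I never had the chance to take a full course on DG with Mr. Tang.
In the exam the DG section can be done directly in GC. If it fails, I will not even know where it died, let alone be able to check and fix it. It is do or die, and I feel very unsure of myself.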

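As for where this database stands: the vendor's people took the AWR report away yesterday, and this morning they were still saying there is not enough memory. From start to finish, "not enough memory" has been their only line.
The project team no longer accepts that explanation, and it is now more or less settled that the operating system will be rebuilt to the standard and the database moved to a newer version.
The data files are only a little over 5 GB, and the dump comes out at about 3 GB. With a database this small, importing into the new database with Data Pump should not be a problem, right? Mr. Tang, please give me some reassurance.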

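Also, the AUD$ table has grown to nearly 4 GB, all of it in the SYSTEM tablespace, so the SYSTEM tablespace is now bigger than the tablespace holding the actual data.
In class we were told this is wrong and that aud$ should be moved out. So on the new database, can I just move the aud$ table to the USERS tablespace? Or is it more sensible to create a dedicated audit tablespace?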

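As for that other database that switches logs 3 times a minute, with "checkpoint not complete" flooding the log: on the surface it runs without problems, but I do not dare speak up about it again for now. That can wait until I have passed the OCM.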

botang · posted on 2016-12-15 09:59:05

lujiaguai posted on 2016-12-14 15:21
I went through the official documentation and checked it against the installed packages. These are missing:
• compat-libcap1-1.10-1 (x86_64)
...

1. ksh is pdksh.
2. Even the classroom environment has limits.conf set up.
From the kickstart file:

# Oracle4
echo "oracle soft nproc 2047" >>/etc/security/limits.conf
echo "oracle hard nproc 16384" >>/etc/security/limits.conf
echo "oracle soft nofile 1024" >>/etc/security/limits.conf
echo "oracle hard nofile 65536" >>/etc/security/limits.conf
# Oracle5
echo 'if [ $USER = "oracle" ]; then' >> /etc/profile
echo ' if [ $SHELL = "/bin/ksh" ]; then' >> /etc/profile
echo ' ulimit -p 16384' >> /etc/profile


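3. The invalid objects need to be compiled, and optimizer statistics should be gathered too.
4. Practice DG by hand several times before you dare take the exam. Just in case: if ADG fails, the exam is probably failed too.
5. aud$ should be moved out, wherever it goes. Turn off the default audits that are not needed.
6. "checkpoint not complete" shows up from time to time in my own production environments. It counts as normal.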
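
For points 3 and 5, a rough sketch of the SQL involved, wrapped in shell helpers that only print the statements. The helper names, the schema name and the tablespace name are all placeholders; the sketch assumes an 11gR2 database where the DBMS_AUDIT_MGMT package is available and the target tablespace already exists:

```shell
# Hypothetical helpers: they print SQL*Plus text and execute nothing.

# Point 3: recompile invalid objects in one schema, then gather its stats.
# $1: schema name, in upper case as stored in the data dictionary.
schema_maintenance_sql() {
    case "$1" in
        ''|*[!A-Za-z0-9_]*) echo "bad schema name: $1" >&2; return 1 ;;
    esac
    echo "EXEC DBMS_UTILITY.COMPILE_SCHEMA(schema => '$1', compile_all => FALSE);"
    echo "EXEC DBMS_STATS.GATHER_SCHEMA_STATS(ownname => '$1');"
}

# Point 5: move the standard audit trail (AUD$) to another tablespace.
# $1: name of an existing tablespace.
move_audit_trail_sql() {
    case "$1" in
        ''|*[!A-Za-z0-9_]*) echo "bad tablespace name: $1" >&2; return 1 ;;
    esac
    cat <<EOF
BEGIN
  DBMS_AUDIT_MGMT.SET_AUDIT_TRAIL_LOCATION(
    audit_trail_type           => DBMS_AUDIT_MGMT.AUDIT_TRAIL_AUD_STD,
    audit_trail_location_value => '$1');
END;
/
EOF
}
```

Review the printed statements first, then feed them to SQL*Plus as a DBA user, for example `move_audit_trail_sql AUDIT_TBS | sqlplus -s "/ as sysdba"`. For a database-wide recompile, the utlrp.sql script shipped under ?/rdbms/admin is the usual alternative.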